Text Processing in Rust

Create handy command-line utilities in Rust. By Mihalis Tsoukalos

This article is about text processing in Rust, but it also contains a quick introduction to pattern matching, which can be very handy when working with text.

Strings are a big subject in Rust, as shown by the fact that Rust has two data types for representing strings as well as macros for formatting them. All of this also shows how capable Rust is at string and text processing.

Apart from covering some theoretical topics, this article shows how to develop some handy yet easy-to-implement command-line utilities that let you work with plain-text files. If you have the time, it'd be great to experiment with the Rust code presented here, and maybe develop your own utilities.

Rust and Text

Rust supports two data types for working with strings: String and str. The String type is for working with mutable strings that belong to you, and it has both a length and a capacity. The str type, on the other hand, is for working with immutable strings that you want to pass around. You will most likely see str used as &str. Put simply, an &str value is a reference to some UTF-8 data that is stored elsewhere. It is usually called a "string slice" or, even simpler, a "slice". Due to its nature, you can't add data to or remove data from an existing &str variable. Moreover, if you try to call the capacity() function on an &str variable, you'll get an error message similar to the following:


error[E0599]: no method named `capacity` found for type `&str` in the current scope

Generally speaking, use &str when you want to pass a string as a function parameter or when you want a read-only view of a string, and use String when you want a mutable string that you own.

The good thing is that a function that accepts &str parameters can also accept references to String values (&String). (You'll see such an example in the basicOp.rs program presented later in this article.) Additionally, Rust supports the char type, which is for representing single Unicode characters, as well as string literals, which are strings that begin and end with double quotes.

Finally, Rust supports what is called a byte string. You can define a new byte string as follows:


let a_byte_string = b"Linux Journal";
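
To make the idea of a byte string a little more concrete, here is a minimal sketch. The is_all_ascii() helper is a hypothetical name introduced only for this illustration and is not part of the article's programs:

```rust
// Hypothetical helper: reports whether every byte is an ASCII value.
fn is_all_ascii(bytes: &[u8]) -> bool {
    bytes.iter().all(|b| b.is_ascii())
}

fn main() {
    // A byte string literal is a reference to a fixed-size array of u8 values.
    let a_byte_string: &[u8; 13] = b"Linux Journal";
    // The same bytes can be obtained from a regular string slice.
    let from_str: &[u8] = "Linux Journal".as_bytes();

    println!("Length in bytes: {}", a_byte_string.len());
    println!("Same bytes: {}", &a_byte_string[..] == from_str);
    println!("All ASCII: {}", is_all_ascii(a_byte_string));
}
```

A byte string holds raw u8 values and is not required to be valid UTF-8, which is the main difference from a str.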

unwrap()

You will see the unwrap() method in a great deal of Rust code, so let's take a look at it here. Rust does not have support for null, nil or Null; it uses the Option type for representing a value that may or may not exist and the Result type for representing an operation that may fail. If you're sure that some Option or Result variable that you want to use has a value, you can use unwrap() and get that value from the variable.

However, if that value doesn't exist, your program will panic. Take a look at the following Rust program, which is saved as unwrap.rs:


use std::net::IpAddr;

fn main() {
    let my_ip = "127.0.0.1";
    let parsed_ip: IpAddr = my_ip.parse().unwrap();
    println!("{}", parsed_ip);

    let invalid_ip = "727.0.0.1";
    let try_parsed_ip: IpAddr = invalid_ip.parse().unwrap();
    println!("{}", try_parsed_ip);
}

Two main things are happening here. First, as my_ip is a valid IPv4 address, parse().unwrap() will be successful, and parsed_ip will have a valid value after the call to unwrap().

However, as invalid_ip is not a valid IPv4 address, the second attempt to call parse().unwrap() will fail, the program will panic and the second println!() macro will not be executed. Executing unwrap.rs verifies all of this:


$ ./unwrap
127.0.0.1
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: AddrParseError(())', libcore/result.rs:945:5
note: Run with `RUST_BACKTRACE=1` for a backtrace.

This means you should be extra careful when using unwrap() in your Rust programs. Unfortunately, going into more depth on unwrap() and how to avoid panic situations is beyond the scope of this article.
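
Without going into any depth, here is a minimal sketch of one way to handle the same parse() call without risking a panic, using the match keyword on the returned Result. The describe_ip() function is a hypothetical helper and is not part of unwrap.rs:

```rust
use std::net::IpAddr;

// Hypothetical helper: describes the outcome of parsing instead of panicking.
fn describe_ip(input: &str) -> String {
    match input.parse::<IpAddr>() {
        // Parsing succeeded, so ip holds a valid address.
        Ok(ip) => format!("{} is a valid IP address", ip),
        // Parsing failed, so e holds the reason.
        Err(e) => format!("{} is not valid: {}", input, e),
    }
}

fn main() {
    println!("{}", describe_ip("127.0.0.1"));
    println!("{}", describe_ip("727.0.0.1"));
}
```

With this approach, both println!() calls are executed, because the invalid address produces an Err value that is handled instead of a panic.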

The println! and format! Macros

Rust supports macros, including println! and format! that are related to strings.

A Rust macro lets you write code that writes other code, which is also known as metaprogramming. Although macros look a lot like Rust functions, they have a fundamental difference from Rust functions: macros can have a variable number of parameters, whereas the signature of a Rust function must declare its parameters and define the exact type of each one of those function parameters.
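
As a small illustration of the variable number of parameters, the following sketch defines a macro that adds up one or more expressions. The sum! macro is a hypothetical example and not something Rust provides:

```rust
// Hypothetical macro: adds up one or more comma-separated expressions.
macro_rules! sum {
    // A single expression is its own sum.
    ($first:expr) => { $first };
    // Otherwise, add the first expression to the sum of the rest.
    ($first:expr, $($rest:expr),+) => { $first + sum!($($rest),+) };
}

fn main() {
    println!("{}", sum!(1));
    println!("{}", sum!(1, 2, 3));
    println!("{}", sum!(1.5, 2.5));
}
```

A regular function could not be called with one, two or three arguments like this without changing its signature.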

As you might already know, the println! macro is used for printing output to the UNIX standard output, whereas the format! macro, which works in the same way as println!, returns a new String instead of writing any text to standard output.

The Rust code of macros.rs will try to clarify things:


macro_rules! hello_world{
    () => {
        println!("Hello World!")
    };
}

fn double(a: i32) -> i32 {
    return a + a
}

fn main() {
    // Using the format!() macro
    let my_name = "Mihalis";
    let salute = format!("Hello {}!", my_name);
    println!("{}", salute);

    // Using hello_world
    hello_world!();

    // Using the assert_eq! macro
    assert_eq!(double(12), 24);
    assert_eq!(double(12), 26);
}

What knowledge do you get from macros.rs? First, that macro definitions begin with macro_rules! and can contain other macros in their implementation. Note that this is a very naïve macro that does nothing really useful. Second, you can see that format! can be very handy when you want to create your own strings using your own format. Third, the hello_world macro created earlier should be called as hello_world!(). And finally, this shows that the assert_eq!() macro can help you test the correctness of your code.

Compiling and running macros.rs produces the following output:


$ ./macros
Hello Mihalis!
Hello World!
thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `24`,
 right: `26`', macros.rs:22:5
note: Run with `RUST_BACKTRACE=1` for a backtrace.

Additionally, you can see an advantage of the assert_eq! macro here: when an assert_eq! macro fails, it also prints the filename and the line number of the assertion, which an ordinary function cannot easily do.

Working with Strings

Now let's look at how to perform basic text operations in Rust. The Rust code for this example is saved in basicOp.rs and is the following:


fn accept(s: &str) {
    println!("{}", s);
}

fn main() {
    // Define a str
    let l_j: &str= "Linux Journal";
    // Or
    let magazine: &'static str = "magazine";
    // Use format! to create a String
    let my_str = format!("Hello {} {}!", l_j, magazine);
    println!("my_str L:{} C:{}", my_str.len(), my_str.capacity());

    // String character by character
    for c in my_str.chars() {
        print!("{} ", c);
    }
    println!();

    for (i, c) in my_str.chars().enumerate() {
        print!("{}:{} ", c, i);
    }
    println!();

    // Convert string to number
    let n: &str = "10";
    match n.parse::<i32>() {
        Ok(n) => println!("{} is a number!", n),
        Err(e) => println!("{} is NOT a number: {}", n, e),
    }

    let n1: &str = "10.2";
    match n1.parse::<i32>() {
        Ok(n1) => println!("{} is a number!", n1),
        Err(e) => println!("{}: {}", n1, e),
    }

    // accept() works with both str and String
    let my_str = "This is str!";
    let mut my_string = String::from("This is string!");
    accept(&my_str);
    accept(&my_string);

    // my_string has capacity
    println!("my_string L:{} C:{}", my_string.len(), my_string.capacity());
    my_string.push_str("OK?");
    println!("my_string L:{} C:{}", my_string.len(), my_string.capacity());

    // Convert String to str
    let s_str: &str = &my_string[..];
    // Convert str to String
    let s_string: String = s_str.to_owned();
    println!("s_string: L:{} C:{}", s_string.len(), s_string.capacity());
}

So, first you can see two ways of defining str variables and one way of creating a String variable using the format! macro. Then, you can see two techniques for iterating over a string character by character. The second technique, which uses enumerate(), also gives you the position of each character in the string; note that this is a character count and not a byte offset.

After that, this example shows how to convert a string into an integer, if that is possible, with the help of parse::<i32>(). Next, you can see that the accept() function accepts both an &str and a reference to a String, even though its definition mentions an &str parameter.

Following that, this shows the length and the capacity of a String variable, which are two different things. The length of a String is the number of bytes it currently holds, whereas the capacity of a String is the room that is currently allocated for that String.

Finally, you can see how to convert a String to str and vice versa. Other ways of getting a String from an &str variable include .to_string(), String::from(), format!() and .into(), as well as appending the &str to an existing String with push_str().
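
The following minimal sketch puts some of these conversions side by side and shows that len() counts bytes. The byte_and_char_len() function is a hypothetical helper used only for this illustration:

```rust
// Hypothetical helper: returns the byte length and the character count of s.
fn byte_and_char_len(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    let slice: &str = "Linux Journal";

    // Four equivalent ways of getting an owned String from a string slice.
    let a: String = slice.to_string();
    let b: String = String::from(slice);
    let c: String = slice.to_owned();
    let d: String = slice.into();
    println!("All equal: {}", a == b && b == c && c == d);

    // The capacity is at least what was requested; the length is what is in use.
    let mut s = String::with_capacity(32);
    s.push_str(slice);
    println!("L:{} C:{}", s.len(), s.capacity());

    // A non-ASCII character occupies more than one byte in UTF-8.
    let (bytes, chars) = byte_and_char_len("h\u{e9}llo");
    println!("bytes:{} chars:{}", bytes, chars);
}
```

The exact capacity that Rust allocates is an implementation detail, so the numbers you see for it might differ between Rust versions.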

Executing basicOp.rs generates the following output:


$ ./basicOp
my_str L:29 C:32
H e l l o   L i n u x   J o u r n a l   m a g a z i n e !
H:0 e:1 l:2 l:3 o:4  :5 L:6 i:7 n:8 u:9 x:10  :11 J:12 o:13 u:14 r:15 n:16 a:17 l:18  :19 m:20 a:21 g:22 a:23 z:24 i:25 n:26 e:27 !:28
10 is a number!
10.2: invalid digit found in string
This is str!
This is string!
my_string L:15 C:15
my_string L:18 C:30
s_string: L:18 C:18

Finding Palindrome Strings

Now, let's look at a small utility that checks whether a string is a palindrome. The string is given as a command-line argument to the program. The logic of palindrome.rs is found in the implementation of the check_palindrome() function, which is implemented as follows:


pub fn check_palindrome(input: &str) -> bool {
    if input.len() == 0 {
        return true;
    }
    let mut last = input.len() - 1;
    let mut first = 0;

    let my_vec = input.as_bytes().to_owned();

    while first < last {
        if my_vec[first] != my_vec[last] {
            return false;
        }

        first += 1;
        last -= 1;
    }
    return true;
}

The key point here is that you copy the bytes of the string into a vector using a call to as_bytes().to_owned() in order to be able to access them by index, as in an array. After that, you keep processing the input from both its left and its right side, one byte from each side, for as long as both bytes are the same or until you pass the middle of the string. If you reach the middle without finding a mismatch, you are dealing with a palindrome, so the function returns "true"; otherwise, the function returns "false". Because the comparison works on bytes and not on Unicode characters, this implementation is reliable for ASCII input only.
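
If you need to handle text that contains multi-byte characters, one possible variation is to compare characters instead of bytes. The following is only a sketch of that idea; check_palindrome_chars() is a hypothetical name and is not part of palindrome.rs:

```rust
// Hypothetical variant of check_palindrome() that compares characters, not bytes.
pub fn check_palindrome_chars(input: &str) -> bool {
    let mut chars = input.chars();
    // Take one character from each end until the two ends meet.
    while let (Some(first), Some(last)) = (chars.next(), chars.next_back()) {
        if first != last {
            return false;
        }
    }
    true
}

fn main() {
    for word in ["abccba", "abcba", "acba", "\u{3b1}\u{3b2}\u{3b1}"] {
        if check_palindrome_chars(word) {
            println!("{} is a palindrome!", word);
        } else {
            println!("{} is not a palindrome!", word);
        }
    }
}
```

This works because the iterator returned by chars() can be consumed from both ends, so no vector needs to be created.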

Executing palindrome.rs with various types of input generates the following kind of output:


$ ./palindrome 1
1 is a palindrome!
$ ./palindrome
Usage: ./palindrome string
$ ./palindrome abccba
abccba is a palindrome!
$ ./palindrome abcba
abcba is a palindrome!
$ ./palindrome acba
acba is not a palindrome!

Pattern Matching

Pattern matching can be very handy, but you should use it with caution, because a carelessly written pattern can create nasty bugs in your software. Pattern matching in Rust happens with the help of the match keyword. A match expression must cover all the possible values of the variable being matched, so having a default branch at the end of the block is a very common practice. The default branch is defined with the help of the underscore character, which means "catch all". When the other branches already cover every possible value, such as when you examine a condition that can be only true or false, a default branch is not needed. A pattern-matching block can look like the following:


let salute = match a_name {
    "John" => "Hello John!",
    "Jim" => "Hello Boss!",
    "Jill" => "Hello Jill!",
    _ => "Hello stranger!",
};

What does that block do? It matches one of the three distinct cases, if there is a match; otherwise, it falls through to the catch-all case, which comes last. If you want to perform more complex tasks that require the use of regular expressions, the regex crate might be more appropriate.
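
To try the match block out, you can wrap it in a function. The following sketch does that and also shows a match on a bool value that needs no default branch. Both salute() and describe() are hypothetical function names chosen for this example:

```rust
// Hypothetical wrapper around the match block shown above.
fn salute(a_name: &str) -> &'static str {
    match a_name {
        "John" => "Hello John!",
        "Jim" => "Hello Boss!",
        "Jill" => "Hello Jill!",
        _ => "Hello stranger!",
    }
}

// A bool has only two possible values, so no default branch is needed.
fn describe(flag: bool) -> &'static str {
    match flag {
        true => "yes",
        false => "no",
    }
}

fn main() {
    println!("{}", salute("Jim"));
    println!("{}", salute("Mihalis"));
    println!("{}", describe(1 + 1 == 2));
}
```

If you remove the underscore branch from salute(), the Rust compiler refuses to compile the program, because a &str can have many more values than the three that are listed.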

A Version of wc in Rust

Now let's look at the implementation of a simplified version of the wc(1) command-line utility. The Rust version of the utility will be saved as wc.rs, will not support any command-line flags, will consider every command-line argument as a file, and it can process multiple text files. The Rust version of wc.rs is the following:


use std::env;
use std::io::{BufReader, BufRead};
use std::fs::File;

fn main() {
    let mut lines = 0;
    let mut words = 0;
    let mut chars = 0;

    let args: Vec<_> = env::args().collect();
    if args.len() == 1 {
        println!("Usage: {} text_file(s)", args[0]);
        return;
    }

    let n_args = args.len();
    for x in 1..n_args {
        let mut total_lines = 0;
        let mut total_words = 0;
        let mut total_chars = 0;

        let input_path = ::std::env::args().nth(x).unwrap();
        let file = BufReader::new(File::open(&input_path).unwrap());
        for line in file.lines() {
            let my_line = line.unwrap();
            total_lines = total_lines + 1;
            total_words += my_line.split_whitespace().count();
            total_chars = total_chars + my_line.len() + 1;
        }

        println!("\t{}\t{}\t{}\t{}", total_lines, total_words, total_chars, input_path);
        lines += total_lines;
        words += total_words;
        chars += total_chars;
    }

    if n_args-1 != 1 {
        println!("\t{}\t{}\t{}\ttotal", lines, words, chars);
    }
}

First, you should know that wc.rs is using buffered input for processing its text files. Apart from that, the logic of the program is found in the inner for loop that reads each input file line by line. For each line it reads, it counts the characters and words. Counting the characters of a line is as simple as calling the len() function, which returns the length of the line in bytes; one is added to that for the newline character that lines() removes. Counting the words of a line requires splitting the line using split_whitespace() and counting the number of elements in the generated iterator.

The other thing you should think about is that the total_lines, total_words and total_chars counters must start from zero for each file, which is why they are defined inside the outer for loop. The lines, words and chars variables hold the total number of lines, words and characters read from all processed text files.
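
The counting logic can also be tried on text that is already in memory. The following is a minimal sketch under that assumption; count_text() is a hypothetical function that does not exist in wc.rs and that works on a &str instead of a file:

```rust
// Hypothetical helper: returns (lines, words, bytes) for a block of text.
fn count_text(text: &str) -> (usize, usize, usize) {
    let mut lines = 0;
    let mut words = 0;
    let mut bytes = 0;
    for line in text.lines() {
        lines += 1;
        words += line.split_whitespace().count();
        // len() is the byte length; add 1 for the newline that lines() strips.
        bytes += line.len() + 1;
    }
    (lines, words, bytes)
}

fn main() {
    let (lines, words, bytes) = count_text("Linux Journal\nText Processing in Rust\n");
    println!("\t{}\t{}\t{}", lines, words, bytes);
}
```

Like wc.rs, this sketch assumes that every line ends with a single newline character, so the byte count will be off for input that uses other line endings or that lacks a final newline.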

Executing wc.rs generates the following kind of output:


$ rustc wc.rs
$ ./wc
Usage: ./wc text_file(s)
$ ./wc wc.rs
    40      124     1114    wc.rs
$ ./wc wc.rs palindrome.rs
    40      124     1114    wc.rs
    39      104     854     palindrome.rs
    79      228     1968    total
$ wc wc.rs palindrome.rs
      40     124    1114 wc.rs
      39     104     854 palindrome.rs
      79     228    1968 total

The last command executed wc(1) in order to verify the correctness of the output of wc.rs.

As an exercise, you might try creating a separate function for counting the lines, words and characters of a text file.

Matching Lines That Contain a Given String

In this section, you'll see how to show the lines of a text file that match a given string—both the filename and the string will be given as command-line arguments to the utility, which is named match.rs. Here's the Rust code for match.rs:


use std::env;
use std::io::{BufReader,BufRead};
use std::fs::File;

fn main() {
    let mut total_lines = 0;
    let mut matched_lines = 0;
    let args: Vec<_> = env::args().collect();

    if args.len() != 3 {
        println!("{} filename string", args[0]);
        return;
    }

    let input_path = ::std::env::args().nth(1).unwrap();
    let string_to_match = ::std::env::args().nth(2).unwrap();
    let file = BufReader::new(File::open(&input_path).unwrap());
    for line in file.lines() {
        total_lines += 1;
        let my_line = line.unwrap();
        if my_line.contains(&string_to_match) {
            println!("{}", my_line);
            matched_lines += 1;
        }
    }

    println!("Lines processed: {}", total_lines);
    println!("Lines matched: {}", matched_lines);
}

All the dirty work is done by the contains() function that checks whether the line that is currently being processed contains the desired string. Apart from that, the rest of the Rust code is pretty trivial.
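
To experiment with contains() without reading a file, you can apply the same test to the lines of a string. This is only a sketch; matching_lines() is a hypothetical helper that is not part of match.rs:

```rust
// Hypothetical helper: returns the lines of text that contain needle.
fn matching_lines<'a>(text: &'a str, needle: &str) -> Vec<&'a str> {
    text.lines().filter(|line| line.contains(needle)).collect()
}

fn main() {
    let text = "fn t2s(input: &str) {\nfn s2t(input: &str) {\n    t2s(&input_path);";
    let matched = matching_lines(text, "t2s");
    for line in &matched {
        println!("{}", line);
    }
    println!("Lines processed: {}", text.lines().count());
    println!("Lines matched: {}", matched.len());
}
```

Note that contains() performs a plain, case-sensitive substring search, so it does not understand wildcards or regular expressions.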

Building and executing match.rs generates output like this:


$ ./match tabSpace.rs t2s
fn t2s(input: &str, n: i32) {
        t2s(&input_path, n_space);
Lines processed: 56
Lines matched: 2
$ ./match tabSpace.rs doesNotExist
Lines processed: 56
Lines matched: 0

Converting between Tabs and Spaces

Next, let's develop a command-line utility that can convert tabs to spaces in a text file and vice versa. Each tab is replaced with four space characters and vice versa.

This utility requires at least two command-line parameters: the first one should indicate whether you want to replace tabs with spaces or the other way around. After that, you should give the path of at least one text file. The utility will process as many text files as you want, just like the wc.rs utility presented earlier in this article.

You can find tabSpace.rs's logic in the following two Rust functions:


fn t2s(input: &str) {
    let file = BufReader::new(File::open(input).unwrap());
    for line in file.lines() {
        let my_line = line.unwrap();
        let new_line = my_line.replace("\t", "    ");
        println!("{}", new_line);
    }
}

fn s2t(input: &str) {
    let file = BufReader::new(File::open(input).unwrap());
    for line in file.lines() {
        let my_line = line.unwrap();
        let new_line = my_line.replace("    ", "\t");
        println!("{}", new_line);
    }
}

All the work is done by replace(), which replaces every occurrence of the first pattern with the second one. The return value of the replace() function is the altered version of the input string, which is what's printed on your screen.

Executing tabSpace.rs creates output like the following:


$ ./tabSpace -t basicOp.rs > spaces.rs
Processing basicOp.rs
$ mv spaces.rs basicOp.rs
$ ./tabSpace -s basicOp.rs > tabs.rs
Processing basicOp.rs
$ ./tabSpace -t tabs.rs > spaces.rs
Processing tabs.rs
$ diff spaces.rs basicOp.rs

The previous commands verify the correctness of tabSpace.rs. First, any tabs in basicOp.rs are converted into spaces and saved as spaces.rs, which afterward becomes the new basicOp.rs. Then, the spaces of basicOp.rs are converted into tabs and saved in tabs.rs. Finally, the tabs.rs file is processed, and all of its tabs are converted into spaces (spaces.rs). The last version of spaces.rs should be exactly the same as basicOp.rs, so diff should print nothing.

It would be a very interesting exercise to add support for tabs of variable size in tabSpace.rs. Put simply, the number of spaces of a tab should be a variable that will be given as a command-line parameter to the utility.
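
As a possible starting point for that exercise, the following sketch shows the two replacements working on a single line with the tab width passed as a parameter. The tabs_to_spaces() and spaces_to_tabs() functions are hypothetical names, and reading the width from the command line is left to you:

```rust
// Hypothetical variant of t2s() for one line: each tab becomes width spaces.
fn tabs_to_spaces(line: &str, width: usize) -> String {
    line.replace('\t', " ".repeat(width).as_str())
}

// Hypothetical variant of s2t() for one line: each run of width spaces becomes a tab.
// The width is assumed to be greater than zero.
fn spaces_to_tabs(line: &str, width: usize) -> String {
    line.replace(" ".repeat(width).as_str(), "\t")
}

fn main() {
    let original = "\tlet x = 1;";
    let spaced = tabs_to_spaces(original, 8);
    let tabbed = spaces_to_tabs(&spaced, 8);
    println!("{:?}", spaced);
    println!("{:?}", tabbed);
    println!("Round trip OK: {}", tabbed == original);
}
```

The {:?} format is used here so that tabs and spaces are visible in the output as escaped characters.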

Conclusion

So, is Rust good at text processing and working with text in general? Yes it is! Additionally, it should be clear that text processing is closely related to file I/O and (sometimes) to pattern matching and regular expressions.

The only rational way to learn more about text processing in Rust is to experiment on your own, so don't waste any more time, and give it a whirl.

About the Author

Mihalis Tsoukalos is a UNIX administrator, a programmer, a DBA and a mathematician who enjoys technical writing. He is the author of Go Systems Programming and Mastering Go. You can reach him at http://www.mtsoukalos.eu and @mactsouk.