Learning Rust Series, part 13: Data Structure Selection

13 June 2021

Case study: data structure selection

At this point you have learned about the core data structures of Rust. You have also seen some of the algorithms that use them.

This post presents a case study with exercises. I want to let you think about choosing data structures and practice using them.

Word frequency analysis

As usual, you should at least attempt the exercises before you read my solutions.

Exercise 13.1. Write a program that reads a file and breaks each line into words. Then strips whitespace and punctuation from the words, and converts them into lowercase.

Hint: split_whitespace might be a handy method, as would collect()…

Exercise 13.2. Go to Project Gutenberg. Once there, download your favorite out-of-copyright book in plain text format.

Change your program from the previous exercise to read the book you downloaded. Make it skip over the header information at the beginning of the file, and process the rest of the words as before.

Then change the program to count the total number of words in the book, and the number of times each word occurs.

Print the number of different words used in the book. Compare different books by different authors, written in different eras. Which author uses the most extensive vocabulary?

Exercise 13.3. Change the program from the previous exercise again. Make it print the 20 most used words in the book.

Exercise 13.4. Change the previous program again. This time to read a word list (see post 9 in this series) and then print all the words in the book that are not in the word list. How many of them are typos? How many of them are common words that should be in the word list, and how many of them are obscure?

Random numbers

Given the same inputs, most computer programs generate the same outputs every time. We call such programs deterministic . Determinism is usually a good thing. We expect the same calculation to yield the same result. For some applications, though, we want the computer to be unpredictable. Games are an obvious example, but there are more. Making a program nondeterministic turns out to be difficult. There are ways to make it at least seem nondeterministic. One of them is to use algorithms that generate pseudorandom numbers. Since we use deterministic computations to generate them, they are not random. But by looking at the numbers it is all but impossible to distinguish them from random.

The rand module provides functions that generate pseudorandom numbers. I will call them “random” from here on.

The function rand::thread_rng() returns a random number generator you can call:

use rand::Rng;

fn main() {
    let mut generator = rand::thread_rng();
    let random_number: u8 = generator.gen();
    let another_random_number: u16 = generator.gen();
}

The method gen() returns a float between 0.0 and 1.0 (including 0.0 but not 1.0). Each time you call gen(), you get the next number in a long series.

The function gen_range(a, b) takes two parameters a and b and returns an integer between a and b (including abut not b).

The rand::Rng module had functions to generate random values from continuous distributions. Including Normal, Cauchy, Binomial, LogNormal, exponential, gamma, and a lot more. But they are now almost all deprecated. Instead, use the crate rand_distr or statrs to get access to them.

The documentation: rand_distr And here statrs.

Exercise 13.5. Write a function named choose_from_histogram. It takes a histogram as defined in post 11 and returns a random value from the histogram. Chosen with probability in proportion to frequency. For example, for this histogram: ['a':2, 'b':1] your function should return a with probability 2/3 and b with probability 1/3.

Word histogram

You should attempt the previous exercises before you go on.

Here is a program that reads a file and builds a histogram of the words in the file:

use std::collections::HashMap;
use std::fs::file;
use std::io::prelude::*;

fn main() {
    // read file
    let filename = "emma.txt";
    let punctuation = "!\"#$%&\'”“–()*+,-./:;<=>?@[\\]^_`{|}~";
    match file::open(filename) {
        Err(error) => {
            println!("Couldn't open file {}: {}", filename, error);
        }
        Ok(mut file) => {
            let mut content = String::new();
            match file.read_to_string(&mut content) {
                Ok(_) => {
                    let washed = strip_characters(&content, punctuation);
                    let histo = histogram(washed);
                    let mut hist_vector: Vec<(String, u32)> =
                        histo.iter().map(|(k, v)| (k.clone(), *v)).collect();
                    let total_words: u32 = hist_vector.iter().map(|(_, v)| v).sum();
                    println!("Total number of words: {}", total_words);
                    println!("Number of different words: {}", histo.len());
                    for i in 0..10 {
                        println!("{:?}", hist_vector[i]);
                    }
                }
                Err(error) => println!("Couldn't read file, error: {}.", error),
            }
        }
    };
}

fn histogram(s: String) -> HashMap<String, u32> {
    let mut d = HashMap::new();
    let words = s
        .split_whitespace()
        .map(|x| x.to_owned())
        .collect::<Vec<String>>();
    for word in words {
        *d.entry(word.to_ascii_lowercase()).or_insert(1) += 1;
    }
    return d;
}

fn strip_characters(original: &str, to_strip: &str) -> String {
    original
        .chars()
        .map(|c| if !to_strip.contains(c) { c } else { ' ' })
        .collect()
}

This program reads emma.txt, which contains the text of Emma by Jane Austen. It reads the whole file into a String. That String is then passed to strip_characters to wash it from punctuation.

The function strip_characters splits every word into characters. Ith the checks each character against a string of characters in to_strip. If it finds the character in to_strip, it replaces it with a space, otherwise it keeps it as it is. All the characters are then collected back into a String and returned.

The compiler understands that we want a String, since we declared that to be the type of the return value. So we don’t need to declare the type that we want collect() to return. (It could be a good idea anyway, to clarify for human readers, your choice!)

Back in main, the washed string is then sent to the function histogram. From there it gets returned as a HashMap with every word as a key, and the number of times it occurs as the value.

The function histogram uses the string method split_whitespace. It breaks the string into a list of words. It works through the list of words and converts every word to lowercase. It then updates the histogram either by creating a new item or incrementing an existing one.

To count the total number of words in the file, we first create a vector from the HashMap. We use the sum() method to add up all the individual frequencies into a total sum. There are a couple of quirks in that conversion, so let’s look at it in detail:

let mut hist_vector: Vec<(String, u32)> = histo.iter()
.map(|(k, v)| (k.clone(), *v))
.collect();
let total_words: u32 = hist_vector.iter().map(|(_, v)| v).sum();

The first weirdness is in how we treat the key. We don’t get the actual values from iter(), we get references to them. But that’s not enough for us, we want the new vector to own its content. Since the key is a String and doesn’t have the trait Copy, we have to clone it to get our own copy.

The value is a u32, and that does have Copy, so we can get away with only dereferencing it with a *.

Try dereferencing the key instead of cloning it, and inspect the error message. Try to figure what happens.

The number of different words is the number of items in the vector or the histogram. We can use either one, histo.len() is as good a choice as hist_vector.len(), but with fewer characters it’s easier to type…

And the results:

Total number of words: 165070
Number of different words: 7384

Don’t worry if you get numbers that are different. The texts on the Gutenberg project get updated from time to time, as people find and correct errors.

Most common words

To find the most common words, we can make use of the vector hist_vector we made, and sort it.

Then we print out the top ten:

hist_vector.sort_by(|a, b| b.1.cmp(&a.1));
for i in 0..10 {
   println!("{:?}", hist_vector[i]);
}

These are the results:

("the", 5376)
("to", 5322)
("and", 4964)
("of", 4408)
("i", 3192)
("a", 3188)
("it", 2541)
("her", 2489)
("was", 2401)
("she", 2364)

Again, don’t worry about the exact numbers, you should be in the same ballpark though. Also, the last few words may come in a different order for you. Try printing out the top fifteen if you’re not getting the same words in the top ten. The missing words could be among them.

The code works as it is, and it would be tempting to leave it like this. That would be a mistake. It is hideous, with four levels of indentation, and a main method that pretty much does all the work. Let’s see if we can make it a little less crowded, by splitting it up.

First of all, why not extract the part of the code that reads and processes the file. Lets put the part that puts it into a String. with the punctuation removed to a separate function. It needs the filename, and the punctuation string. Should we send them as parameters, or store them in the function?

Well, the obvious choice for the filename is to send it as a parameter. That way we could reuse the function for other files. We’ll do the same with punctuation, and if we don’t need any cleansing, we send along an empty string.

So, the function header so far is fn process_file(filename: &str, punctuation: &str) -> ? {, now, what should the return type be, I put in a ?, but what should it be? It would be easy to think that we should return a String, because that is what we want it to return.

But this function will do I/O-operations, and they can fail. If they do, we want to return the error and allow for error handling in the code that called our function.

Rust has a type for this occasion, and we’ve used it many times already, but then only as consumers. We’ve never actually created one ourselves. So what can go wrong, and what is the error type that will have to create?

We’ll try to open a file for reading, and then we’ll try to actually read it. What error types do these actions generate? A quick look in the documentation for file::open tells us that its error type is io::Error. The same goes for file.read_to_string().

So we need to have a return type of Result, that can either contain a String, or an io::Error. That means our return type will be: Result<String, io::Error>

We copy the code from main and change it to add return statements where appropriate. Other than that there’s nothing new here, apart from one thing. When we return the Ok value with the enclosed String, we don’t need to write “return” and add a semicolon. It’s the last line of the “happy path” (the path with no errors) through the function, so we only write the return value.

fn process_file(filename: &str, punctuation: &str) -> Result<String, io::Error> {
    match file::open(filename) {
        Err(error) => {
            println!("Couldn't open file {}: {}", filename, error);
            return Err(error);
        }
        Ok(mut file) => {
            let mut content = String::new();
            match file.read_to_string(&mut content) {
                Ok(_) => {
                    let washed = strip_characters(&content, punctuation);
                    Ok(washed)
                }
                Err(error) => {
                    println!("Couldn't read file, error: {}.", error);
                    return Err(error);
                }
            }
        }
    }
}

Now main is more readable and less busy:

fn main() {
    let input_filename = "emma.txt";
    let punctuation = "!\"#$%&\'”“–()*+,-./:;<=>?@[\\]^_`{|}~";
    let content = process_file(input_filename, punctuation);
    match content {
        Err(error) => {
            println!("Couldn't process file {}: {}", input_filename, error);
        }
        Ok(content) => {
            let histo = histogram(content);
            let mut hist_vector: Vec<(String, u32)> =
                histo
                .iter()
                .map(|(k, v)| (k.clone(), *v))
                .collect();
            hist_vector.sort_by(|a, b| b.1.cmp(&a.1));
            let total_words: u32 = hist_vector.iter().map(|(_, v)| v).sum();
            println!("Total number of words: {}", total_words);
            println!("Number of different words: {}", histo.len());
            for i in 0..10 {
                println!("{:?}", hist_vector[i]);
            }
        }
    }
}

We also changed the name of filename to input_filename, in case we want to add a way to also output the results to a file.

Optional parameters

If you’ve programmed in other languages, you might’ve used function parameters with defaults. You might also seen functions that take optional parameters.

This is not possible in Rust. There is no support for optional parameters. And no direct way to specify default values for excluded parameters.

You could construct a function that takes parameters of the type Option. Then you could use a match to assign them a concrete value based on wether the caller provides a Some(value) or a None.

That would create a weird interface to your function. Your best bet is to avoid making the caller jump through hoops like that.

HashMap subtraction

Finding the words from the book that are not in the word list from CROSSWD.TXT is a problem you might recognize. It’s a set subtraction. That is, we want to find all the words from one set (the words in the book) that are not in the other (the words in the list).

fn subtract(hm1: HashMap<String, u32>, hm2: HashMap<String, u32>) -> HashSet<String> {
    let hs1: HashSet<String> = hm1.keys().cloned().collect();
    let hs2: HashSet<String> = hm2.keys().cloned().collect();
    hs1.difference(&hs2).cloned().collect()
}

subtract(hm1,hm2) takes hashmaps hm1 and hm2 and returns a new HashSet. It contains all the keys from hm1 that are not in hm2. Since we don’t care about the values, we can use a HashSet, that only contains the keys. The method difference as defined for HashSet returns an iterator with the items that are in hs1 but not in hs2.

We clone all the items in the iterator, and collect them in a new HashSet. The compiler recognizes that we want a HashSet from the collect() method. Since this is the last line and it has no semicolon, it has to be a return value. And the return value is set to be a HashSet.

If we had written hs2.difference(&hs1) we would get all the items in hs2 that are not in hs1. Another difference is the symmetric_difference. It returns an iterator over the words that are only present in one of the sets, but not both.

There is also a method called intersection. It returns an iterator with all the items that are present in both sets. And a union that combines all the items of each set into one iterator.

To find the words in the book that are not in CROSSWD.TXT, we can use process_file and histogram. We build histograms for both files, and then subtract them:

fn check_two_files(file1: String, file2: String) -> Result<HashSet<String>, io::Error> {
    let punctuation = "!\"#$%&\'”“–()*+,-./:;<=>?@[\\]^_`{|}~";
    let mut hm1 = HashMap::new();
    let mut hm2 = HashMap::new();
    let content1 = process_file(&file1, punctuation);
    match content1 {
        Err(error) => {
            println!("Couldn't process {}: {}", file1, error);
            return Err(error);
        }
        Ok(content1) => {
            hm1 = histogram(content1);
        }
    };
    let content2 = process_file(&file2, punctuation);
    match content2 {
        Err(error) => {
            println!("Couldn't process {}: {}", file2, error);
            return Err(error);
        }
        Ok(content2) => {
            hm2 = histogram(content2);
        }
    };
    return Ok(subtract(hm1, hm2));
}

It’s tempting to skip all that matching and use unwrap(). And to skip returning a Result, and return the HashSet. But that could get us into trouble later, if we decide to reuse the code. Now we know that we check everything and everything works as it should. And errors propagate as they should, so the caller can handle them as he or she wants.

The real work is in these lines:

let mut hm1 = HashMap::new();
    let mut hm2 = HashMap::new();
    let content1 = process_file(&file1, punctuation);
    let content2 = process_file(&file2, punctuation);
    ...
    hm1 = histogram(content1);
    ...
    hm2 = histogram(content2);
    ...
    subtract(hm1,hm2)

We create two mutable HashMaps, and we process the two files which gives us two Strings. We turn each of them into a separate HashMap via histogram. Then we use subtract to get the words that are in the first and not in the second.

We use it like this:

let hs = check_two_files("emma.txt".to_string(), "CROSSWD.TXT".to_string());
    match hs {
        Err(error) => println!("Something went wrong: {}", error),
        Ok(hs) => {
            println!("{:?}", hs);
        }
    };

Here are some of the words that we found:\

"secresy", "unexamined", "mitchell", "campbells", "endeavours", "elizabeth", "splendour"

At least some of them, like “secresy”, “unexamined” and “endeavours” should have been in CROSSWD.TXT. but they aren’t.

Random words

To choose a random word from the histogram, the simplest algorithm is to build a vector. We fill it with many copies of each word according to the observed frequency. And then choose one from the vector at random.

This works, but it is not very efficient; the vector will be as big as the original book. An alternative is:

Use histogram to build a HashMap, and then use the keys method on that to build a vector of the words in the book.

Get a random number between 1 and the number of items in the vector, and get that word.

Exercise 13.7. Write a program that uses this algorithm to choose a random word from the book.

Debugging

When you are debugging a program, and especially if you are working on a hard bug, there are five things to try:

Reading: Examine your code, read it back to yourself, and check that it says what you meant to say.

Running: Experiment by making changes and running different versions. Often if you display the right thing at the right place in the program, the problem becomes obvious. But sometimes you have to build scaffolding.

Ruminating: Take some time to think! What kind of error is it: syntax, runtime, or semantic? What information can you get from the error messages, or from the output of the program? What kind of error could cause the problem you’re seeing? What did you change last, before the problem appeared?

Rubberducking: Explain the problem to someone else. You often find the answer before you finish asking the question. Often you don’t even need the other person. You could talk to a rubber duck. And that’s the origin of the wellknown strategy called rubber duck debugging. Some of us even keep a rubber duck on the desktop for this very purpose.

Retreating: At some point, the best thing to do is back off. Undo recent changes, until you get back to a program that works and that you understand. Then you can start rebuilding.

Beginners often get stuck on one of these activities and forget the others. Each activity comes with its own failure mode.

For example, reading your code might help if the problem is a typographical error. But not if the problem is a conceptual misunderstanding. If you don’t understand what your program does, you can read it 100 times and never see the error. Because the error is in your head.

Running experiments can help, especially if you run small, simple tests. Take care to thinking about and reading your code. If you don’t, you can fall into a pattern I call “random walk programming”. It is the process of making random changes until the program does the right thing. Needless to say, random walk programming can take a long time.

You have to take time to think. Debugging is like an experimental science. You should have at least one hypothesis about what the problem is. If there are two or more possibilities, try to think of a test that would get rid of one of them.

But even the best debugging techniques will fail if there are too many errors. Or if the code you are trying to fix is too big and complicated. Sometimes the best option is to retreat. Simplify the program until you get to something that works and that you understand. Beginning programmers are often reluctant to retreat. They can’t stand to delete a line of code (even if it’s wrong). If it makes you feel better, copy your program into another file before you start stripping it down. Then you can copy the pieces back one at a time. Finding a hard bug requires reading, running, ruminating, and sometimes retreating. If you get stuck on one of these activities, try the others.