May 16, 2021

Case study: word play

This post presents a case study. It involves solving word puzzles by searching for words that have certain properties. For example, we’ll find the longest palindromes in English. We’ll also search for words whose letters appear in alphabetical order. And I’ll present another program development plan: reduction to a previously solved problem.

Reading word lists

For the examples in this post we need a list of English words. There are lots of word lists available on the Web. The one most suitable for our purpose is one of the word lists in the Moby lexicon project. It is a list of 113,809 words that are considered valid in crossword puzzles and other word games. In the Moby collection, the filename is CROSSWD.TXT. This file is in plain text, so you can open it with a text editor, but you can also read it from Rust. You need to use the crate std::io where you’ll find the methods we need to open and read files.

The documentation, that you should have as a bookmark in your browsers, is here: Module std::io

Now, reading a file is an operation that may or may not succeed. The file may not be where it’s supposed to be, or someone has changed the file name. It might even have some strange contents (i.e. not text). So we need to take that into consideration. Rust has a special type, called Result, that makes this a lot easier. And that is what all those file reading methods return.

We mentioned the Result type in the previous post, and how to “decode” it with the help of ? or expect("error message"). We also used a temporary variable that we could test in a match-statement. We also mentioned the method .unwrap() that you should avoid here. There is a real possibility that file reads can fail, and in those cases you should always avoid unwrap().

Use ? or expect if you must, but strive towards always handling potential errors as close to the source as you can. This is particularly true in main. It’s ok to use ? in functions you call, as long as you deal with potential errors in your own code somewhere.

Rust is fast, and we want things to go as fast as possible. So we don’t want several trips to the hard drive to fetch a line at a time. Instead we read the whole file in one fell swoop, and put it into something called a buffer. Then we can get a line at a time at our leisure from that, instead of going back to the file system for more.

For that we use a BufReader. Unfortunately, File doesn’t implement that. We need to import it separately with use std::io::BufReader;

Documentation for BufReader

The buffer is a structure in memory where the contents of the file gets stored until we need it.

We also need the module std::io::prelude, and we’ll import everything in it, with std::io::prelude::*. It adds a global import of useful things for I/O.

Documentation for the prelude module.

Oh, and of course we need access to the filesystem as well, and we get that from the module std::fs, and we’ll be using std::fs::File.

Documentation for std::fs::File

So, let’s read the file then, and print out a few lines on stdout.

use std::fs::File;
use std::io::prelude::*;
use std::io::BufReader;

fn main() {
    let file = File::open("CROSSWD.TXT");
    match file {
        Ok(file) => print_ten(file),
        Err(error) => panic!("Couldn't read file, error: {}", error),
        }
}

fn print_ten(file: File) {
    let buf_reader = BufReader::new(file);
    let mut my_iterator = buf_reader.lines().map(|line| line.unwrap_or_default());
    for _i in 1..11 {
            println!("{}", my_iterator.next().unwrap_or_default());
    }
}

You might notice a couple of new things here that we haven’t touched upon earlier.

First, we need to consider that the file might not be where we need it to be, and it might not contain what we hope it will. We handle the first problem with the match pattern where we check what came back in the Result we get from File::open(). If it’s an error, there’s not much we can do, so we just panic and put out an error message.

If there’s a file in there, we’ll try to print some lines. Here I use a programming method I call programming through wishful thinking. I wish I had a function called print_ten() that could print ten lines from a file. I call that function with the file if I get one from the call to File::open().

Now I need to figure out what the return type is from that call to panic!, because my print_ten() needs to have the same return type. If it doesn’t, the compiler will complain that I have differing return types in my match. It turns out that the panic! macro returns (), so my print_ten() should have that return type.

Now I know the name, the argument and the return type of my function, let’s write it.

If the file has content, we’ll read it into a BufReader and then try to get some lines from it.

If you check the documentation for the BufReader, you wont find a method for that. What you will find is a Trait with the name BufRead, and the documentation for the trait std::io::BufRead holds the clues we need.

There we find the method lines().

What we are getting back from buf_reader.lines() is an iterator over the lines of the file we’re reading. So we take that iterator and bind it to my_iterator. But first we want to unpack all the lines inside it. Each line is wrapped inside a Result, and we want an iterator that gives us the actual lines.

The map() method takes each item in the iterator and applies a function on it, the function inside the parentheses. That function looks a little weird though, so what is it doing?

To start with, we need to give a variable name to the item that we’re applying the function to, and in this case we chose the name line. We are free to choose any name here, but it makes sense to call it line, since we are trying to extract a line from the file. That’s what the |line| means: take the item we get and bind it to the variable name “line”.

Then we need to decide how we want it. Do we want to own it inside the function, or borrow it? Since we want to mutate it we should own it. So we move it into the function, if we wanted to borrow it we would write |&line| instead. Now we can mutate the line as we please.

Then we take the item, which is of type Result<String>, which means it is a String contained inside a Result, and unwrap it to extract the String. We do the unwrapping with unwrap_or_default() which is a variant on unwrap().

It will return the contained String if we get an Ok(<String>). And instead of a panic!, we get an empty string if the Result is of type Err, because the default for a String is an empty string.

Phew, all that in so few lines of code. And we’re still not done, theres more new stuff, on this line:

my_iterator.next().unwrap_or_default()

Iterators have a special method called next(). It returns the next item in the Iterator, except it wraps it in a Result. Very handy in a for loop like the one we use here.

Then we take the item, which you now know to be a Result<String>. Which is to say either Ok(<String>) or an Err(<Error>). So we use the now familiar unwrap_or_default() to unpack it so we can print it out.

Piece of cake, right?

So, now you know how to extract text lines from a text file. Time to test yourself!

Exercises

There are solutions to these exercises in the next section. You should at least attempt each one before you read the solutions.

We’re cheating, of course, and pretend that all text is in ASCII, so you can use the method chars() to create an iterator of the characters in every word. Do that, and check the documentation for that method as well as the Iterator trait:

String.chars() documentation

Documentation for the Iterator trait

Exercise 9.1. Write a program that reads and prints only the words with more than 20 characters in CROSSWD.TXT (not counting whitespace).

Exercise 9.2. In 1939 Ernest Vincent Wright published a 50,000 word novel called Gadsby that does not contain the letter “e”. Since “e” is the most common letter in English, that’s not easy to do.

Write a function called has_no_e that returns true if the given word doesn’t have the letter “e” in it.

Write a program that reads CROSSWD.TXT and prints only the words that have no “e”. Compute the percentage of words in the list that have no “e”.

Exercise 9.3. Write a function named avoids that takes a word and a string of forbidden letters. It should return true if the word doesn’t use any of the forbidden letters.

Write a program that prompts the user to enter a string of forbidden letters. Then prints the number of words in CROSSWD.TXT that don’t contain any of them. Can you find a combination of 5 forbidden letters that excludes the smallest number of words?

Exercise 9.4. Write a function named uses_only that takes a word and a string of letters. It should return true if the word contains only letters in the list. Can you make a sentence using only the letters acefhlo? Other than “Hoe alfalfa”?

Exercise 9.5. Write a function named uses_all that takes a word and a string of required letters. It should return true if the word uses all the required letters at least once. How many words are there that use all the vowels aeiou? How about aeiouy?

Exercise 9.6. Write a function called is_abecedarian that takes a word as a parameter. It should return true if the letters in the word appear in alphabetical order (double letters are ok). How many abecedarian words are there?

Search

All the exercises in the previous section have something in common. You can solve each of them with a simple search pattern. The simplest example is:

fn has_no_e(word: &str) -> bool {
    for letter in word.chars() {
        if letter == 'e' {
            return false;
        }
    }
    true
}

The for loop traverses the characters in word.chars(). Why not just word? Because word is a &str, and that is not an iterator. But we can get an iterator that contains all the characters in the word, via the method chars().

If we ﬁnd the letter “e”, we can immediately return false; otherwise we have to go to the next letter. If we exit the loop in the normal way, that means we didn’t ﬁnd an “e”, so we return true.

The function avoids is a more general version of has_no_e but it has the same structure:

fn avoids(word: &str, forbidden: &str) -> bool {
    for letter in word.chars() {
        if forbidden.contains(letter) {
            return false;
        }
    }
    true
}

We can return false as soon as we find a forbidden letter; if we get to the end of the loop, we return true.

The function uses_only is similar except that we reverse the sense of the condition:

fn uses_only(word: &str, available: &str) -> bool {
    for letter in word.chars() {
        if !available.contains(letter) {
            return false;
        }
    }
    true
}

Instead of a list of forbidden letters, we have a list of available letters. If we find a letter in word that is not in available, we can return false.

The function uses_all is similar except that we reverse the role of the word and the string of letters:

fn uses_all(word: &str, required: &str) -> bool {
    for letter in required.chars() {
        if !word.contains(letter) {
            return false;
        }
    }
    true
}

Instead of traversing the letters in word, the loop traverses the required letters. If any of the required letters do not appear in the word, we can return false.

If you have been thinking like a computer scientist, you have recognized that uses_all was an instance of a previously solved problem. So you would have written:

fn uses_all(word: &str, required: &str) -> bool {
    uses_only(required, word)
}

This is an example of a program development plan called reduction to a previously solved problem. It means that you recognize the problem you are working on as an instance of a solved problem and apply an existing solution.

9.4 Looping with indices

I wrote the functions in the previous section with for loops because I only needed the characters in the strings. I didn’t have to do anything with the indices.

For is_abecedarian we have to compare adjacent letters, which is a little tricky:

fn is_abecedarian(word: &str) -> bool {
    if word.len() > 1 {
        let mut iterator = word.chars();
        let mut initial: char = iterator.next().unwrap();
        for c in iterator {
            if c < initial {
                return false;
            }
            initial = c;
        }
    }
    true
}

We start by deciding that an empty string and a string with only one letter is abecedarian. This means that for such short strings we return true.

Then we insert a test before that where we check if the length of the word is greater than one. And set things up to iterate through the characters of the word.

We’ll be using the method chars() to create an iterator to loop over, and each item in that iterator is of the type char. We need to dig out the first character from word as a char to be able to compare the first character to the second.

We use a little trick on the iterator. When we use the method next() we get the next char, and the first time we get the first character. The trick here is that this consumes that character from the iterator. When we then start looping over our iterator, the first character is no longer in it. This is good, because if it was, we’d be making one comparison more than we need.

Then we start the loop, and compare character pairs.
If the next character is less than the current one, it means it comes before theat letter in the alphabet. Then we have discovered a break in the abecedarian trend, and we return false.

After the comparison, we move the latest character, c, into initial and go back to the beginning of the loop.

If we get to the end of the loop without finding a fault, then the word passes the test.

Debugging

Testing programs is hard. The functions in this post are easy to test because you can check the results by hand. Even so, it is somewhere between difficult and impossible to choose a set of words that test for all possible errors.

Taking has_no_e as an example, there are two obvious cases to check. Words that have an ‘e’ should return false, and words that don’t should return true. You should have no trouble coming up with one of each.

Within each case, there are some less obvious subcases. Among the words that have an “e”, you should test words with an “e” at the beginning, the end, and somewhere in the middle. You should test long words, short words, and very short words, like the empty string. The empty string is an example of a special case, which is one of the non-obvious cases where errors often lurk.

In addition to the test cases you generate, you can also test your program with a word list like CROSSWD.TXT. By scanning the output, you might be able to catch errors. But be careful: you might catch one kind of error (words that should not be included, but are) and not another (words that should be included, but aren’t).

In general, testing can help you find bugs, but it is not easy to generate a good set of test cases. And even if you do, you can’t be sure your program is correct. According to a legendary computer scientist:

Program testing can be used to show the presence of bugs, but never to show their absence! — Edsger W. Dijkstra

Learning Rust Series, part 9: Word play (Case Study)

Case study: word play

Reading word lists

Exercises

Search

Debugging