30 May 2021

HashMaps

This post presents another built-in type called a HashMap. They are the building blocks of many efficient and elegant algorithms.

A HashMap is a mapping of keys to values

A HashMap is like a vector, but more general. In a vector, the indices have to be integers; in a HashMap they can be (almost) any type.

A HashMap contains a collection of indices, which we call keys, and a collection of values. Each key is associated with a single value. The association of a key and a value is called a key-value pair or sometimes an item.

In mathematical language, a HashMap represents a mapping from keys to values. You can also say that each key “maps to” a value. As an example, we’ll build a HashMap that maps from English to Spanish words, so the keys and the values are all strings.

To use a HashMap, you first have to declare that you intend to use it, so at the top of your file you need the line:

use std::collections::HashMap;

The function HashMap::new() creates a new HashMap with no items.

To add items to the HashMap, you can use the method insert(), that takes a key and a value. The key can be almost any type, as long as it implements traits Eq and Hash. If you define your own types, you can almost always do that with a single line. #[derive(PartialEq, Eq, Hash)] above your definition of the custom type.

Let’s create a small HashMap that functions as a dictionary, or at least a lookup table. We’ll use it to translate words from english to spanish, and insert the first key-value pair, one -> uno:

use std::collections::HashMap;

fn main(){
    let mut eng2span = HashMap::new();
    eng2span.insert("one","uno");
    println!("{:?}", eng2span);
}

This creates an item that maps from the key "one" to the value "uno". When we print the HashMap, we see a key-value pair with a colon between the key and value:

{"one": "uno"}

You can insert several items at once, using the extend() method:

fn main() {
    let mut eng2span = HashMap::new();
    eng2span.insert("one", "uno");
    let my_vec = vec![("two", "dos"), ("three", "tres"), ("four", "cuatro")];
    let temp: HashMap<_, _> = my_vec.into_iter().collect();
    eng2span.extend(temp);
    println!("{:?}", eng2span);
}

You can also create a HashMap from a literal with several tuples (we’ll handle tuple in another post):

let test: HashMap<String, i32> = [("a".to_string(), 1), ("b".to_string(), 2)]
        .iter()
        .cloned()
        .collect();

A caveat: if you try to inline the conversion of the vector to a hashmap when you insert into eng2span, you’ll run into trouble. The compiler will complain that it can’t make out the types:

eng2span.extend(my_vec.into_iter().collect());
  |              ^^^^^^ cannot infer type for type parameter `A` declared on the trait `Extend`

The easiest way to handle that is to use a temporary variable. Then make the type declaration explicit, like we did above:

let temp: HashMap<_, _> = my_vec.into_iter().collect();

The printout might surprise you though:

{"four": "cuatro", "one": "uno", "three": "tres", "two": "dos"}

The order of the key-value pairs isn’t always the same. If you type the same example on your computer, you might get a different result. In general, the order of items in a HashMap is unpredictable. In fact, if you run the same program several times, you’ll get different results several times.

But that’s not a problem because the elements of a HashMap are never indexed with integer indices. Instead, you use the keys to look up the corresponding values:

The key "one" always maps to the value "uno" so the order of the items doesn’t matter. If the key isn’t in the HashMap, you get a None back, since the method get() returns an Option:

    let my_word = match eng2span.get("one") {
        Some(word) => word,
        None => &"",
    };
    println!("{}", my_word);

This will print out uno.

The len() function works on HashMaps; it returns the number of key-value pairs:

    println!("{}", eng2span.len());

This prints out 4.

There is a contains_key operator that works on HashMaps. It tells you whether something appears as a key in the HashMap (appearing as a value is not good enough).

To see whether something appears as a value in a HashMap, you can use the method values(). It returns a Values object, that is an Iterator over all values. Then you can clone them and collect them into a vector and use the contains() method on that:

fn main() {
    let mut eng2span = HashMap::new();
    eng2span.insert("one", "uno");
    let my_vec = vec![("two", "dos"), ("three", "tres"), ("four", "cuatro")];
    let temp: HashMap<_, _> = my_vec.into_iter().collect();
    eng2span.extend(temp);
    // look for 'cuatro'
    let contains_cuatro = eng2span
        .values()
        .cloned()
        .collect::<Vec<&str>>()
        .contains(&"cuatro");
    if contains_cuatro {
        println!("Found cuatro");
    } else {
        println!("No cuatro here…");
    }
}

Yeah, it could be easier, but this works.

HashMap as a collection of counters

Suppose you get a string and you want to count how many times each letter appears. There are several ways you could do it:

You could create 26 variables, one for each letter of the alphabet. Then you could traverse the string. And for each character, increment the corresponding counter, using a chained conditional.
You could create a vector with 26 elements. Then you could convert each character to a number (using the built-in method as_bytes()). Then use the number as an index into the vector, and increment the appropriate counter.
You could create a HashMap with characters as keys and counters as the values. The first time you see a character, you would add an item to the HashMap. After that you would increment the value of the existing item.

Each of these options performs the same computation. But each of them implements it in a different way.

An implementation is a way of performing a computation. Some implementations are better than others. An advantage of the HashMap implementation is that we don’t have to know which letters appear in the string. And we only have to make room for the letters that do appear.

Here’s an example of what that might look like:

fn histogram(s: String) -> HashMap<char, u32> {
    let mut d = HashMap::new();
    for c in s.chars() {
        *d.entry(c).or_insert(1) += 1;
    }
    return d;
}

Here we see something we haven’t seen before, the * in front of d.entry(c).or_insert(1) += 1 inside the for loop. This is because d.entry(c).or_insert(1) has the return type &mut integer and we can’t use the += on that type. But if we dereference that we get something of the type mut integer, and on that we can use +=.

The or_insert(1) is a very convenient method often used. It saves us from the trouble of setting up an if statement where we check if the entry exists or not. If it doesn’t exist, it gets inserted with the value 1 (the value in the parentheses). If it does exist we increment the value it holds.

Looping and HashMaps

If you use a HashMap in a for statement, it traverses the keys of the HashMap. For example, this code prints each key and the corresponding value:

for (k, v) in eng2span.iter() {
        println!("{:?}: {:?}", k, v);
    }

Here’s what the output looks like:

"two": "dos"
"one": "uno"
"three": "tres"
"four": "cuatro"

Again, the keys are in no particular order. To traverse the keys in sorted order, you can collect everything into a vector. Then use the built-in function sort() on it:

let mut sorted = eng2span.iter().collect::<Vec<_>>();
    sorted.sort();
    println!("{:?}", sorted);

Reverse lookup

Given a HashMap hm and a key k, it is easy to find the corresponding value v = hm.get(&k). We call this operation a lookup.

But what if you have v and you want to find k? You have two problems: first, there might be more than one key that maps to the value v. Depending on the application, you might be able to pick one, or you might have to make a vector that contains them all. Second, there is no simple syntax to do a reverse lookup; you have to search.

Basically, you gather the values, with hm.values(), and the keys separately with hm.keys. For each value you test each key in turn. If it produces the value you have you add it to a vector. You proceed until you have found all that will produce the value you already have. The vector will then contain all the relevant keys.

A reverse lookup is much slower than a forward lookup. If you have to do it often, or if the HashMap gets big, the performance of your program will suffer.

HashMaps and vectors

Vectors can appear as values in a HashMap. For example, if you get a HashMap that maps from letters to frequencies, you might want to invert it. That is, create a HashMap that maps from frequencies to letters. There might be several letters with the same frequency. So each value in the inverted HashMap should be a vector of letters. Here is a function that inverts a HashMap, but doesn’t take into account that the same value can occur for many keys:

fn invert<'a>(map: HashMap<&'a str, &'a str>) -> HashMap<&'a str, &'a str> {
    let mut invert = HashMap::new();
    for (key, value) in map.into_iter() {
        invert.insert(value, key);
    }
    return invert;
}

That 'amarker after the function name, and on the types for keys and values is a lifetime description. It tells the compiler that these things should all have the same lifetime. It’s necessary here, the compiler will complain (and suggest you add them) if you leave them out. We discussed lifetimes in a previous post.

Vectors can be values in a HashMap, as this example shows. They can also be keys, if the values they contain implements a trait called Hash. All the primitive types do that, and it is easy to add the required traits to a custom type if you want. You do that by adding #[derive(PartialEq, Eq, Hash)] on the line above your definition of your custom type.

I mentioned earlier that a HashMap is implemented using a hashtable. That means that the keys have to be hashable .

A hash is a function that takes a value (of any kind) and returns an integer. HashMaps use these integers, called hash values, to store and look up key-value pairs.

Memos

You might have played with the fibonacci function from the post on functions and their return values. If so, you might have noticed that the bigger the argument you provide, the longer the function takes to run. Furthermore, the run time increases fast.

A better solution than the one we had back then, is to keep track of what values we already have. We can do that by storing them in a HashMap. We call a value we already have computed and store for later use a memo. Here is a “memoized” version of fibonacci:

fn fib_memo (cache: &mut HashMap<u32, u32>, arg: u32) -> u32 {
    match cache.get(&arg).map(|entry| entry.clone()) {
        Some(result) => result,
        None => {
            let result = match arg {
                0 => 0,
                1 => 1,
                n => fib_memo(cache, n - 1) + fib_memo(cache, n - 2),
            };
            cache.insert(arg, result.clone());
            result
        }
    }
}

fn main () {
    assert_eq!(fib_memo(&mut HashMap::new(), 40), 102334155);
}

The variable cache is a HashMap that keeps track of the Fibonacci numbers we already know. It starts with two items: 0 maps to 0 and 1 maps to 1.

Whenever we call fib_memo, it checks cache. If the result is already there, it can return immediately. Otherwise it has to compute the new value, add it to the HashMap, and return it.

If you run this version of fibonacci and compare it with the original, you should find that it is much faster.

Global variables

Rust doesn’t have the concept of global variables as some other languages do. There is a way to create static variables. But the creators of Rust discourage their use. Global variables are generally considered unsafe.

Debugging

As you work with bigger datasets it can become unwieldy to debug by printing and checking the output by hand. Here are some suggestions for debugging large datasets:

Scale down the input: If possible, reduce the size of the dataset. For example if the program reads a text file, start with the first 10 lines. Or with the smallest example you can find. You can either edit the files themselves, or (better) change the program so it reads only the first few lines.
If there is an error, you can reduce to the smallest value that manifests the error. Then increase it as you ﬁnd and correct errors.
Check summaries and types: Instead of printing and checking the entire dataset, consider printing summaries. For example, the number of items in a HashMap or the total of a vector of numbers.
Write self-checks: Sometimes you can write code to check for errors. For example, if you are computing the average of a vector of numbers. You can check that the result is not greater than the largest element in the vector or less than the smallest. We call this a “sanity check” because it detects results that are “insane”.
Another kind of check compares the results of two different computations to see if they are consistent. This is called a “consistency check”.
Format the output: Formatting debugging output can make it easier to spot an error.

Again, time you spend building scaffolding can reduce the time you spend debugging.