Strings

Strings are not like integers, floats, and booleans. A string is a sequence, which means it is an ordered collection of other values. In this post you’ll see how to access the characters (graphemes) that make up a string. You’ll also learn about some of the methods strings provide.

To complicate things, there is more than one type of string in Rust. You have the primitive type str that’s built in, and then you have the type String in the std library.

The str type is usually called a string slice, and you will mostly meet it in its borrowed form, &str. Rust reserves the word “string” for the String type. But a lot of the time we use the word “string” for both. Unfortunate, but there you have it.

To make matters even worse, there are other string types in std, such as OsString, OsStr, CString and CStr. We’ll leave them well enough alone in this post, and focus on String and str.

We need to figure out how to create them first, and why the ways are different for String and str.

Take a look at this code:

// Define an empty String
let s = String::new();

// Define an empty String that is mutable
let mut s_mutable = String::new();

// create a str from a string literal
let data = "initial content";

// (this is actually the same as the line below)
let other_data: &'static str = "initial content";

// where 'static is a special lifetime that means
// the value is valid for as long as the program
// runs; the literal gets baked into the binary at
// compile time

// create a String from a str
let data_string = data.to_string();

// create a String from a string literal
let another_string = String::from("initial content");

// Define a borrowed slice (&str)
let to: &str = "world";
// Or
let from: &'static str = "Rust";

// create a String from a borrowed slice (&str)
let borrowed: &str = "I'm from a &str";
let owned: String = borrowed.to_owned();

// Use format! to create a new String from strs,
// you can also combine Strings, and mix both
// strs and Strings to form the new String
let my_str = format!("Hello {} from {}! {}", to, from, owned);

That’s a lot of ways to create a String or str. That is because they come from all sorts of places, and we can’t always use them the way we want.

If you want to mutate a string, it needs to be a String. Slices are, for all practical purposes, immutable. So, to change a string you get as a &str, you need to create a String from it first, using .to_string(), .to_owned() or format!(). The format!() macro works pretty much like println!(). The difference is that format!() returns a String for you to bind to a variable, while println!() writes the result to stdout.
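As a minimal sketch of that workflow, the helper below takes a borrowed &str, makes an owned String from it, and only then changes it. The name add_greeting is made up for this example and is not part of std:

```rust
// Hypothetical helper: turns a borrowed slice into an owned,
// mutable String before changing it.
fn add_greeting(name: &str) -> String {
    // to_owned() copies the text into a new String that we own
    let mut owned = name.to_owned();
    // insert_str() and push() need a mutable String
    owned.insert_str(0, "Hello, ");
    owned.push('!');
    owned
}

fn main() {
    let greeting = add_greeting("Rust");
    println!("{}", greeting);
}
```

The slice passed in is never touched; all the changes happen on the copy.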

A string is a sequence

A String is a sequence of characters, stored as UTF-8 encoded bytes. In many other languages you can access the characters one at a time with the bracket operator. In Rust, that would look like so:

fn main(){
    let fruit = "paraguayo";
    let letter = fruit[2];
}

Trying to compile this gives this result:

error[E0277]: the type `str` cannot be indexed by `{integer}`
  --> src/main.rs:13:18
   |
13 |     let letter = fruit[2];
   |                  ^^^^^^^^ string indices are ranges of `usize`
   |
   = help: the trait `std::slice::SliceIndex<str>` is not implemented for `{integer}`
   = note: you can use `.chars().nth()` or `.bytes().nth()`
           see post in The Book <https://doc.rust-lang.org/book/ch08-02-strings.html#indexing-into-strings>
   = note: required because of the requirements on the impl of `std::ops::Index<{integer}>` for `str`

In Rust you can’t. And the reason for that is rather simple: since Rust uses UTF-8 as the standard format for strings, it is too complicated to figure out what such a notation should return.

Consider the word “namaste”, which we write as “नमस्ते” in the Devanagari script. What we see as a “character” in the Devanagari version of the word isn’t. In fact, if we look at the raw bytes in the String, it would look like this: [224, 164, 168, 224, 164, 174, 224, 164, 184, 224, 165, 141, 224, 164, 164, 224, 165, 135].

That’s 18 bytes in total, but it kinda looks like four characters, doesn’t it? Well, it isn’t.

If we look at the Unicode scalar values we instead get this: [‘न’, ‘म’, ‘स’, ‘्’, ‘त’, ‘े’].

So, what should this mean:

let my_string = String::from("नमस्ते");
let x = my_string[2];
println!("{}", x);

What should we print? 168, स, or स्ते?

We don’t even know what the return type should be, a byte value, a character, a grapheme cluster or a string slice! And don’t get me started on emojis…
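To see the competing answers side by side, here is a small sketch. The function name describe is my own invention; bytes(), chars() and nth() are standard methods:

```rust
// Hypothetical helper: returns (number of bytes, number of Unicode scalar values)
fn describe(text: &str) -> (usize, usize) {
    (text.bytes().count(), text.chars().count())
}

fn main() {
    let word = "नमस्ते";
    let (bytes, scalars) = describe(word);
    println!("{} bytes, {} scalar values", bytes, scalars);
    // Look at "item number 2" from two points of view
    println!("as a byte: {:?}", word.bytes().nth(2));
    println!("as a char: {:?}", word.chars().nth(2));
}
```

Both nth() calls return an Option, since the string may be shorter than the position you ask for.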

We’ll get to the grapheme business (and emojis) later in the post. For now we’ll ignore it, and pretend all strings only contain the letters in the ASCII standard. ASCII is a subset of UTF-8 in which every character takes up exactly one byte.

len() and capacity()

len() is a method that returns the length of a string in bytes. As long as we stick to ASCII, that is the same as the number of characters in the string:

fn main(){
    let my_string = String::from("This is a string!");
    println!("My strings length is {}.", my_string.len());
}

This will print out the result My strings length is 17.. Seems reasonable, doesn’t it?

What about that capacity bit then, what is that?

Let’s try it out:

fn main(){
    let my_string = String::from("This is a string!");
    println!("My strings length is {}, and its capacity is {}.", my_string.len(), my_string.capacity());
}

Well, it prints out My strings length is 17, and its capacity is 17.. It doesn’t seem there’s any difference between them, is there?

But if we try this instead:

fn main() {
    let mut my_string = String::from("This is string!");
    println!("my_string L:{} C:{}", my_string.len(), my_string.capacity());
    my_string.push_str("OK?");
    println!("my_string L:{} C:{}", my_string.len(), my_string.capacity());
}

Now the output is

my_string L:15 C:15
my_string L:18 C:30

What happened? We added a string of three characters, and the length increased by three characters. But the capacity increased by fifteen!

Well, the capacity is always larger than or equal to the length, and it tells you how many bytes the String has room for before it needs to grow. If you add characters to the String, and the result is longer than its current capacity, Rust has to reallocate the buffer, which may move it in memory. When it reallocates, it often asks for more room than it strictly needs, so the capacity ends up larger than the current length.

You can avoid this, by setting the capacity from the start. Let’s say we know that we will add the string OK? later, so we need a capacity of 18. Then we can do this:

fn main() {
    let mut my_string = String::with_capacity(18);
    my_string.push_str("This is string!");
    println!("my_string L:{} C:{}", my_string.len(), my_string.capacity());
    my_string.push_str("OK?");
    println!("my_string L:{} C:{}", my_string.len(), my_string.capacity());
}

Now we get this output:

my_string L:15 C:18
my_string L:18 C:18

No surprises there, we see that the capacity is 18 on the first occasion, but that’s because we asked for that. The length is still 15, because that is the number of characters in the string.

The interesting bit is the last line. Now the capacity and the length are the same. That means that Rust didn’t need to reallocate my_string, so the code actually ran a tiny bit faster.

How much the capacity grows isn’t generally predictable. And you can get different results on different systems.

You should almost never care about capacity though. It is there for people with very specific constraints in memory and performance. This way they can get the exact results they need.
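If you do end up in that situation, the sketch below shows two more standard tools next to with_capacity(): reserve() to ask for room ahead of time, and shrink_to_fit() to hand back unused room. The helper name build_line is hypothetical. Since the exact capacity numbers are up to the allocator, the code only relies on the capacity being at least as large as the length:

```rust
// Hypothetical helper: joins words with spaces, reserving room up front.
fn build_line(words: &[&str]) -> String {
    // bytes needed: all the words plus one space between each pair
    let needed: usize = words.iter().map(|w| w.len()).sum::<usize>()
        + words.len().saturating_sub(1);
    let mut line = String::new();
    // reserve() guarantees room for at least `needed` more bytes
    line.reserve(needed);
    for (i, word) in words.iter().enumerate() {
        if i > 0 {
            line.push(' ');
        }
        line.push_str(word);
    }
    line
}

fn main() {
    let mut line = build_line(&["This", "is", "string!"]);
    println!("L:{} C:{}", line.len(), line.capacity());
    // give back any room we ended up not using
    line.shrink_to_fit();
    println!("L:{} C:{}", line.len(), line.capacity());
}
```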

Traversal with a loop

A lot of computations involve processing a string one character at a time. Often they start with the first character, select each in turn, do something to it, and continue until the end. We call this pattern of processing a traversal. One way to write a traversal is with a for loop. The problem we face is that we can’t index into a String, so we use trickery. We’ll use String’s method chars(). It returns a value of type Chars, a special type that has the trait Iterator and hands out the characters of the string one at a time, each as a char.

We haven’t talked about traits earlier, but we will later. For now we’ll say it’s a way to define behaviours for a type. You can add traits to types you create, to make them compatible with other types and their methods.

Iterator is a very common trait for most collection types, alas not found in String. But chars() will give us one, and we can iterate over it like this:

fn main(){
    let my_str = String::from("Hello Rustaceans!");
    // iterate over the Chars type with trait Iterator
    // that we get from the chars() method:
    for c in my_str.chars() {
        print!("{} ", c);
    }
    println!();
}

This loop traverses the string and displays each letter followed by a space. All on the same line (notice that we use print! in the loop). Then we print an empty line to advance the output to a new line (that’s the println!() after the loop).

We can also use another method called enumerate(), chained on after chars(). The method enumerate() is a method that comes with the trait Iterator. It gives us the index for each item inside the iterator:

fn main(){
    let my_str = String::from("Hello Rustaceans!");
    // iterate over the Chars type with trait Iterator
    // that we get from the chars() method:
    for c in my_str.chars() {
        print!("{} ", c);
    }
    println!();
    for (i, c) in my_str.chars().enumerate() {
        print!("{}:{} ", c, i);
    }
    println!();
}

This gives the printout:

H e l l o   R u s t a c e a n s !
H:0 e:1 l:2 l:3 o:4  :5 R:6 u:7 s:8 t:9 a:10 c:11 e:12 a:13 n:14 s:15 !:16
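Note that enumerate() counts items, not bytes. If you need the byte position where each character starts (the kind of index that slicing works with), the standard method char_indices() provides it. A small sketch, where the helper name byte_positions is made up for this example:

```rust
// Hypothetical helper: pairs every char with the byte offset where it starts.
fn byte_positions(text: &str) -> Vec<(usize, char)> {
    text.char_indices().collect()
}

fn main() {
    // 'é' takes two bytes in UTF-8, so the offsets skip a number after it
    for (offset, c) in byte_positions("café!") {
        print!("{}:{} ", c, offset);
    }
    println!();
}
```

For pure ASCII text, enumerate() and char_indices() give the same numbers.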

Exercise: Write a function that takes a string as an argument and displays the letters backward, one per line.

String slices

We call a segment of a string a slice.

You can create a slice in a way that sort of looks like indexing works, but it’s a bit deceptive. We can create slices from a String by asking for a range of byte indexes, like so:

let the_string = String::from("The quick brown fox");
let a_short_piece: &str = &the_string[4..9];

This creates a new &str with the content “quick” (Rust starts all indexes at 0, and the end index is never included).

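We can even create a &str that spans the whole String by leaving out the index numbers completely. Despite the variable name below, nothing is copied; the slice only borrows the contents of the String: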

let the_string = String::from("The quick brown fox");
let a_full_copy: &str = &the_string[..];

Now, what would you expect these variations to yield:

let the_string = String::from("The quick brown fox");
let a_piece: &str = &the_string[4..];
let another_piece: &str = &the_string[..5];

Test them out in the playground, or on your own computer!

Extra test for the adventurous: Try a string of emojis instead of text, and see what happens with a couple of different ranges. You should see some interesting panics at runtime when a range doesn’t land on a character boundary. Don’t worry, the last section will show you how to deal with segmentation of Unicode strings.
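If you would rather get a None than a crash, the standard get() method takes the same kind of range and checks the boundaries for you. A minimal sketch, where safe_slice is a hypothetical helper name:

```rust
// Hypothetical helper: like &text[start..end], but returns None
// instead of panicking when the range is out of bounds or cuts
// a character in half.
fn safe_slice(text: &str, start: usize, end: usize) -> Option<&str> {
    text.get(start..end)
}

fn main() {
    let faces = "😀😁😂"; // each of these emojis takes 4 bytes in UTF-8
    println!("{:?}", safe_slice(faces, 0, 4));
    println!("{:?}", safe_slice(faces, 0, 2));
    // is_char_boundary() tells us whether an index is safe to slice at
    println!("{}", faces.is_char_boundary(2));
}
```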

Strings are immutable except when they are not

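String slices of type &str are immutable because they only borrow their text; they don’t own it. The text itself lives somewhere else, either baked into the binary (for string literals) or in the heap buffer of a String. The str type on its own has no size known at compile time, which is why we almost always handle it behind a reference. You can declare a variable of type &str as mutable though, since the &str is a “fat pointer” (an address plus a length), and Rust knows how big those are. That only lets you point the variable at another slice, not change the text.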

If you want to change the contents of a string, you should create a copy in the form of a String and claim ownership of it. Then you can do pretty much anything you want with it.
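The difference between a mutable variable and mutable text can be sketched like this. The function skip_words is a hypothetical example; it moves a mutable &str variable forward through the text without ever changing the text itself:

```rust
// Hypothetical helper: returns what is left of `text` after skipping
// the first `n` space-separated words.
fn skip_words(text: &str, n: usize) -> &str {
    // `rest` is a mutable variable, but the text it points at is never changed
    let mut rest = text;
    for _ in 0..n {
        rest = rest.trim_start();
        match rest.find(' ') {
            // point `rest` at a later part of the same text
            Some(i) => rest = &rest[i..],
            // fewer than n words: nothing is left
            None => return "",
        }
    }
    rest.trim_start()
}

fn main() {
    let the_string = String::from("The quick brown fox");
    println!("{}", skip_words(&the_string, 2));
    // the original String is exactly as it was
    println!("{}", the_string);
}
```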

Searching

There are several methods to search for contents inside a string. And also for replacing or removing parts. We’ll touch on a couple here.

let my_string = String::new();
my_string.contains(pattern) // returns bool
my_string.ends_with(pattern) // returns bool
my_string.find(pattern) // returns Option<usize>, byte index where the first match starts
my_string.rfind(pattern) // same as find, but gives the byte index where the last match starts
my_string.get(range) // returns Option with slice or None
my_string.get_mut(range) // returns Option with mutable slice or None
my_string.is_empty() // returns bool
my_string.matches(pattern) // returns "Matches", iterator over the places that match pattern
my_string.replace("old", "new") // returns new String where matching "old" is replaced with "new"
my_string.replacen("old", "new", n: usize) // as replace, but only the n first matches are replaced
my_string.split(pattern) // returns an iterator over all the substrings that get separated by the pattern
my_string.starts_with(pattern) // returns bool
my_string.trim() // returns slice where leading and trailing whitespace is removed

To figure out how to search for substrings inside strings, we first need to sort out the return values. Boolean return values are pretty self-explanatory. They are true or false depending on whether the substring is in the string or not, provided you are asking for its existence or non-existence:

my_string.contains(pattern) // returns bool
my_string.is_empty() // returns bool
my_string.ends_with(pattern) // returns bool
my_string.starts_with(pattern) // returns bool

The pattern in these instances is the substring that we’re searching for.

Another set of rather self-explaining methods are these:

my_string.replace("old", "new") // returns new String where matching "old" is replaced with "new"
my_string.replacen("old", "new", n: usize) // as replace, but only the n first matches are replaced

Given a string my_string we can search for a substring and replace it with another. In this case the one we’re looking for is “old” and the one we want to put in its place is “new”. The second example is interesting, in that we can choose to only switch out the first few instances. We tell Rust how many with the third parameter n in the method replacen().
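Here is a small runnable sketch of the two methods. The wrapper function modernize and its limit parameter are made up for this example; replace() and replacen() are the real methods:

```rust
// Hypothetical helper: replaces "old" with "new", either everywhere
// (limit = None) or only for the first n matches (limit = Some(n)).
fn modernize(text: &str, limit: Option<usize>) -> String {
    match limit {
        Some(n) => text.replacen("old", "new", n),
        None => text.replace("old", "new"),
    }
}

fn main() {
    let my_string = String::from("old hat, old shoes, old coat");
    println!("{}", modernize(&my_string, None));
    println!("{}", modernize(&my_string, Some(2)));
    // both methods return a new String, the original is untouched
    println!("{}", my_string);
}
```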

The next group of methods returns iterators of two different types. But since they both have the trait Iterator we can treat them in the same way.

my_string.matches(pattern)
// returns a "Matches", an iterator
// over the places that match the "pattern"
my_string.split(pattern)
// returns an iterator over all the
// substrings that get separated by the pattern

We can use the first method, .matches(pattern), to get an iterator. The iterator yields every instance of the pattern in the string:

let my_string = "abcXXXabcYYYabc";
let v: Vec<&str> = my_string.matches("abc").collect();

The collect() method on the end is a convenience method. It is there to gather the contents of an iterator into a collection. We can inspect and manipulate that collection further later. The collect() method can create a lot of different collection types. So we need to specify what type of collection we want. And in this case we want a vector with string slices, Vec<&str>.
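To illustrate how the requested type steers collect(), here is a sketch with two hypothetical helpers. Both run the same split(' ') call, and only the return type differs:

```rust
use std::collections::HashSet;

// Hypothetical helper: glues the pieces back together into one String
fn remove_spaces(text: &str) -> String {
    text.split(' ').collect()
}

// Hypothetical helper: gathers the distinct pieces into a set
fn unique_words(text: &str) -> HashSet<&str> {
    text.split(' ').collect()
}

fn main() {
    println!("{}", remove_spaces("Mary had a little lamb"));
    println!("{}", unique_words("a rose is a rose").len());
}
```

Here the return type of each function tells collect() what to build, so no extra type annotation is needed inside the body.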

We can use the same method to deal with the results from the split() method:

let v: Vec<&str> = "Mary had a little lamb"
    .split(' ')
    .collect();

Here we inlined the whole thing, and what we get out of it is a vector that looks like this: ["Mary", "had", "a", "little", "lamb"].

Splitting on whitespace is so common that it has two methods of its own, split_whitespace() and split_ascii_whitespace(). Unlike split(' '), they treat any run of whitespace as a single separator.
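The two approaches give different results as soon as the text contains double spaces, tabs or newlines. A short sketch, where the helper name words is hypothetical:

```rust
// Hypothetical helper: words separated by any amount of whitespace
fn words(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}

fn main() {
    let text = "Mary  had a\tlittle lamb\n";
    // split(' ') only knows about single spaces, so we get an empty
    // piece between the two spaces, and the tab and newline stay put
    let by_space: Vec<&str> = text.split(' ').collect();
    println!("{:?}", by_space);
    println!("{:?}", words(text));
}
```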

So, we have four methods with the strange return type of Option. This is a special type, to deal with the fact that Rust doesn’t have a null value. Every method must return something, and it is up to the caller to do something with that value, whether it is the value they hoped for or not.

my_string.find(pattern) // returns Option<usize>, byte index where the first match starts
my_string.rfind(pattern) // same as find, but gives the byte index where the last match starts
my_string.get(range) // returns Option with slice or None
my_string.get_mut(range) // returns Option with mutable slice or None

The type Option will contain one of two possible values: A Some containing the actual return value. Or, if it found no value, the Option will contain the value None, and we have to check for and deal with both.

In tutorials you will often find that writers use the method unwrap(). It extracts the value from Some and ignores the None case. Don’t do that. If you do, and you get a None back, Rust will panic and your program terminates. There is another method you can use instead, if you don’t want to do a proper pattern matching on the returned Option. That is unwrap_or(default), where default is the value you want to return if you get a None back.
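A sketch of unwrap_or() in action: the hypothetical helper strip_comment looks for a '#' and falls back to the length of the line when there is none, so the slice below it is always valid:

```rust
// Hypothetical helper: everything before the first '#',
// or the whole line if there is no '#'.
fn strip_comment(line: &str) -> &str {
    // if find() returns None, use the end of the line as the default
    let end = line.find('#').unwrap_or(line.len());
    &line[..end]
}

fn main() {
    println!("'{}'", strip_comment("let x = 1; # a note"));
    println!("'{}'", strip_comment("no comment here"));
}
```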

In general, it is a lot better to use pattern matching:

let my_index = my_string.find("brown");
match my_index {
    Some(n) => println!("found 'brown', index {}.", n),
    None => println!("Didn't find 'brown'"),
}

A shortcut that passes error handling up the chain in your application is the ? operator:

let my_index = my_string.find("brown")?;

If the result is a Some, it will unpack the value, like unwrap(). But it will not panic on a None. Instead it will finish the function it is in right away, and return a None from it. For this to compile, the function itself must have a return type of Option (or Result, when you use ? on a Result). The caller then has to handle the None, and there will be no panic.
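As a minimal sketch, assuming a function that itself returns an Option, the ? operator could be used like this. The name text_after is hypothetical:

```rust
// Hypothetical helper: the text that follows the first occurrence of `marker`
fn text_after<'a>(haystack: &'a str, marker: &str) -> Option<&'a str> {
    // On None, `?` returns None from text_after() right here
    let start = haystack.find(marker)?;
    Some(&haystack[start + marker.len()..])
}

fn main() {
    let my_string = "The quick brown fox";
    // the caller decides what to do with a None
    match text_after(my_string, "brown") {
        Some(rest) => println!("after 'brown': '{}'", rest),
        None => println!("Didn't find 'brown'"),
    }
}
```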

Another way to extract the value from an Option (or Result), is to use the method expect("Error message").

let my_index = my_string.find("brown").expect("No brown found");

This will unwrap the value from Some. If you get a None it will panic with the error message in the string you provide in the parentheses.

The use of the unwrap() method is generally discouraged, but it does have its uses. When you call a function that returns an Option or Result, and you are absolutely, positively certain it will not fail or return None, go ahead and use it. We’ll see examples of that in later posts.

But if there’s even just a theoretical chance that you’ll get an Err or a None back, do a manual unpacking with the match pattern. Use the one I showed above, and decide on how to handle the Err or None if it arrives.

Looping and counting

Here’s an easy way to count the number of times the letter a appears in a string:

let word = "banana";
let count_of_as = word.matches("a").count();

Where did count() come from? It isn’t a method on either str or String, so what gives?

We said earlier that the matches() method returns an iterator. The Iterator trait has a method called count(), so we can chain it on after the .matches("a").

Iterators are common return types, and you will use them a lot. Don’t bother to memorize all the methods on them. Instead, get used to looking them up in the documentation: Trait Iterator in the Rust documentation
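The same kind of chaining works on the iterator we get from chars(). As a sketch, here is a hypothetical helper count_letter that combines the Iterator methods filter() and count():

```rust
// Hypothetical helper: counts how often `letter` occurs in `word`
fn count_letter(word: &str, letter: char) -> usize {
    // filter() keeps only the chars that are equal to `letter`,
    // count() then counts what is left
    word.chars().filter(|c| *c == letter).count()
}

fn main() {
    println!("{}", count_letter("banana", 'a'));
}
```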

Other String methods

Strings provide methods that perform a variety of useful operations. For example, the method to_uppercase() takes a string and returns a new string with all uppercase letters. I’m pretty sure you can guess what to_lowercase() does…

Instead of trying to memorize them all, bookmark these pages in the documentation:

The Primitive Type str

The String Struct

String comparison

my_string.eq_ignore_ascii_case(other_string)
my_string == other_string // uses PartialEq trait
my_string.eq(other_string) // returns bool
my_string.ne(other_string) // returns bool
my_string < other_string // bool (through PartialOrd trait)
my_string.lt(other_string) // same as above
my_string <= other_string // bool
my_string.le(other_string)// same as above
my_string > other_string // bool
my_string.gt(other_string)// same as above
my_string >= other_string // bool
my_string.ge(other_string)// same as above

These relational operators work on strings. The only method that is actually defined on str and String themselves is the first one, my_string.eq_ignore_ascii_case(other_string). The others come from one of two traits, PartialEq or PartialOrd, that both string types implement.

Rust does not handle uppercase and lowercase letters the same way people do. Strings are compared byte by byte, so all the uppercase ASCII letters come before all the lowercase letters.

A common way to address this problem is to first convert the strings to a standard format, such as all lowercase, and then perform the comparison.
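A sketch of that approach, using a hypothetical helper cmp_ignore_case that lowercases both sides before comparing them:

```rust
use std::cmp::Ordering;

// Hypothetical helper: compares two strings after lowercasing both
fn cmp_ignore_case(a: &str, b: &str) -> Ordering {
    a.to_lowercase().cmp(&b.to_lowercase())
}

fn main() {
    // 'Z' comes before 'a' when comparing the raw bytes
    println!("{}", "Zebra" < "apple");
    println!("{:?}", cmp_ignore_case("Zebra", "apple"));

    // the helper also works as a sort order
    let mut animals = vec!["zebra", "Yak", "apple"];
    animals.sort_by(|a, b| cmp_ignore_case(a, b));
    println!("{:?}", animals);
}
```

Keep in mind that to_lowercase() creates a new String on every call, so this is a convenience rather than the fastest option.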

Handling graphemes

What if we want to handle text correctly, without having to worry about the size of UTF-8 “characters”? And deal with graphemes directly instead? Can we do that?

We certainly can! There is a crate by the name unicode_segmentation you can pull in. It handles graphemes. It lets you collect them into a vector to make it easy to deal with individual graphemes. It even works with emojis…

Make sure you update Cargo.toml with the correct dependencies:

[dependencies]
unicode-segmentation = "1.7.1"

Don’t use the numbers in the example above. Go to the crates page on crates.io instead and get the most current version!

Then you can test this program. See if it correctly gathers and stores the text you enter in the terminal. Try some emojis while you’re at it:

use std::io::stdin;
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let mut input = String::new();
    stdin().read_line(&mut input).expect("couldn't read from stdin");
    // trim() removes the newline that read_line() leaves at the end
    let my_unicode: &str = input.trim();
    // `true` asks for extended grapheme clusters
    let my_vector = my_unicode.graphemes(true).collect::<Vec<&str>>();
    println!("{:?}", my_vector);
}

The program gathers all the “characters” you enter in the terminal, until you hit return. It stores them all in a vector and then prints the vector back to you so you can inspect it and see that it worked.

About the author

For the last three decades, I've worked with a variety of technologies. I'm currently focused on fullstack development. In my day-to-day job, I'm a senior teacher and course developer at a higher vocational school in Malmoe, Sweden. I'm also an occasional tech speaker and a mentor. Do you want to know more? Visit my website!