Name identification

Now that you are able to retrieve a web page, you will try to identify names written in it. A name looks like "John Doe": an uppercase letter followed by lowercase letters followed by a space followed by an uppercase letter followed by lowercase letters.

Rather than looking for such a pattern yourself, you might want to use the "regex" crate which implements regular expressions.

Using the "regex" crate

Add the "regex" crate to your Cargo.toml:

$ cargo add regex

From now on, you can use the entities defined in this crate, in particular the regex::Regex type. To be able to refer to it as Regex instead of regex::Regex, you might want to put a use regex::Regex; near the beginning of your program.

Of course we want our regular expression to match French names containing accented characters. You can use the following regular expression patterns to identify uppercase and lowercase letters:

  • \p{Uppercase} will match any Unicode uppercase letter (for example it would match the Greek delta letter "Δ")
  • \p{Lowercase} will match any Unicode lowercase letter, such as the Greek delta letter "δ"
  • + after a pattern means "one or more"

You can then deduce that a name component can be represented as \p{Uppercase}\p{Lowercase}+. This would match "John", "Doe", "François" or "Strauß". The whole regular expression can then be \p{Uppercase}\p{Lowercase}+ \p{Uppercase}\p{Lowercase}+ (two components separated by a space).
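To make the meaning of this pattern concrete, here is a minimal sketch that checks a single name component using only the standard library. The is_name_component() helper is hypothetical (it is not part of the "regex" crate), and it assumes that char::is_uppercase() and char::is_lowercase() correspond to the Unicode properties behind \p{Uppercase} and \p{Lowercase}:

```rust
// Hypothetical helper (not part of the regex crate): checks whether `word`
// has the shape "one uppercase letter followed by one or more lowercase
// letters", which is what the name component pattern describes.
fn is_name_component(word: &str) -> bool {
    let mut chars = word.chars();
    // The first character must be an uppercase letter.
    match chars.next() {
        Some(c) if c.is_uppercase() => {}
        _ => return false,
    }
    // It must be followed by lowercase letters only, and at least one ("+").
    let mut lowercase_count = 0;
    for c in chars {
        if !c.is_lowercase() {
            return false;
        }
        lowercase_count += 1;
    }
    lowercase_count >= 1
}

fn main() {
    for word in ["John", "François", "Strauß", "john", "J", "NASA"] {
        println!("{word}: {}", is_name_component(word));
    }
}
```

Unlike a regular expression, this helper only validates a word that has already been isolated; it does not search for names inside a longer text.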

Creating the Regex

A regular expression is created using Regex::new(), which returns a Result. Since we know that our expression is valid, we can call unwrap() on the result:

#![allow(unused)]
fn main() {
    let re = Regex::new("\p{Uppercase}\p{Lowercase}+ \p{Uppercase}\p{Lowercase}+").unwrap();
}

However, if you do this, you will get a compilation error: \p in the string is interpreted as an escaped p. While \n represents a "line feed", \p represents… nothing known. This is an invalid escape sequence in a string.

Fortunately, Rust has raw strings, in which no escape character is recognized. The syntax of a raw string is r"…".
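As a small illustration, the sketch below compares a raw string with the equivalent regular string, in which every backslash has to be doubled. The name_pattern() helper is hypothetical and only exists to give the pattern a name:

```rust
// Hypothetical helper returning the name pattern as a raw string.
fn name_pattern() -> &'static str {
    r"\p{Uppercase}\p{Lowercase}+ \p{Uppercase}\p{Lowercase}+"
}

fn main() {
    // In a regular string, every backslash has to be escaped as \\.
    let escaped = "\\p{Uppercase}\\p{Lowercase}+ \\p{Uppercase}\\p{Lowercase}+";
    // Both literals denote exactly the same sequence of characters.
    assert_eq!(name_pattern(), escaped);

    // In a raw string, \n is a backslash followed by the letter n (2 bytes),
    // while in a regular string it is a single line feed character (1 byte).
    assert_eq!(r"\n".len(), 2);
    assert_eq!("\n".len(), 1);

    println!("{}", name_pattern());
}
```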

You can also put double quotes in a raw string if you need to, by adding a pound sign on each side of the double quote delimiters:

#![allow(unused)]
fn main() {
   let s = r#"This is a string with some "quotes" in it"#;
}

But what if you need to put a double quote followed by a pound sign ("#) in the raw string? This is easy: you can use more pound signs in the start and end markers, provided that their numbers match:

#![allow(unused)]
fn main() {
   let s = r###"You need a quote + 3 pound signs to end the string"###;
   let t = r###"You can put "## inside without any problem!"###;
}

Writing the extract_names() function

Using the .find_iter() method on a regex::Regex object, you can iterate over the matches found in the input string. The following example displays all sequences of at least two uppercase characters found in the file "text.txt":

fn main() {
    // Read the content of the "text.txt" file into variable s
    let s: String = std::fs::read_to_string("text.txt").unwrap();
    // Match at least two consecutive uppercase characters
    let re = regex::Regex::new(r#"\p{Uppercase}{2,}"#).unwrap();
    println!("All uppercase sequences found:");
    for m in re.find_iter(&s) {
        // Inside the loop m is a regex::Match, which has a .as_str() method
        println!("  - {}", m.as_str());
    }
}

Exercise 2.a: write a function with signature fn extract_names(s: &str) -> Vec<String> which returns all plausible names in a string.

In this function, you will (these are mere suggestions, you are free to do it any other way you see fit):

  • create a re variable containing a Regex with the pattern seen above
  • create an empty vector (using vec![]) and store it into a mutable variable (this is the vector you will return at the end of the function)
  • iterate over re.find_iter() to find all the matches
  • on every match m you can call .as_str() to get a &str representing the text of the match (for example "John Doe"), which you can transform into a String using .to_owned() or String::from() (as usual)
  • push the String in the Vec<String> that you plan to return (.push() is the method you want to use)
  • return the vector

Try your function with the following main() function:

fn main() {
    let names = extract_names("Yesterday, John Doe met François Grüß in a tavern");
    println!("{names:?}");
}

Since String implements Debug, Vec<String> also implements Debug and can be displayed using the {:?} placeholder which is handy for debugging.
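As a quick illustration of the {:?} placeholder on a Vec<String>, here is a minimal sketch. The debug_names() helper and the sample names are hypothetical and only serve the example:

```rust
// Hypothetical helper: formats a list of strings with the {:?} placeholder.
fn debug_names(names: &[String]) -> String {
    format!("{names:?}")
}

fn main() {
    // Sample data, made up for this example.
    let names = vec![String::from("John Doe"), String::from("Jane Roe")];
    // Each string is displayed between double quotes, inside square brackets.
    println!("{}", debug_names(&names));
    // {:#?} is the "pretty" variant, with one element per line.
    println!("{names:#?}");
}
```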

Deduplicating the names

Unfortunately, we may end up with duplicates. If a text contains the same name several times, it will be present several times in the output.

A vector is not really the best structure to represent a set of objects, like a set of names. We do not care about the order, only about the presence of a name.

Rust has a std::collections::HashSet<T> type in its standard library (std). Provided that you added a use std::collections::HashSet; near the beginning of your program, you can:

  • create a new HashSet<T> with HashSet::new()
  • insert an element in a set h with h.insert(element); nothing happens if the element is already present in the set
  • iterate over the elements of a set h like you did with a vector, by borrowing it: for element in &h { … } (if h is a HashSet<T>, element will be a &T in the loop)
  • display a HashSet<T> using {:?} as long as its element type T implements Debug
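The operations listed above can be sketched as follows, on plain words rather than names. The unique_words() helper is hypothetical and only serves as an illustration:

```rust
use std::collections::HashSet;

// Hypothetical helper: collects the words of a text into a set,
// so that each distinct word is kept only once.
fn unique_words(text: &str) -> HashSet<String> {
    let mut set = HashSet::new();
    for word in text.split_whitespace() {
        // insert() leaves the set unchanged (and returns false)
        // when the element is already present.
        set.insert(word.to_owned());
    }
    set
}

fn main() {
    let words = unique_words("to be or not to be");
    // Borrowing the set lets us iterate over it: each element is a &String.
    for word in &words {
        println!("{word}");
    }
    // The iteration and display order of a HashSet is unspecified.
    println!("{words:?}");
}
```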

Exercise 2.b: make your extract_names() function return a HashSet<String> (instead of a Vec<String>).

Using the following main() function, you will see that there can be no duplicates:

fn main() {
    let names = extract_names("John Doe, François Grüß, John Doe");
    println!("{names:?}");
}

Returning the names in a page

Exercise 2.c: add a function with signature fn names(url: &str) -> Result<HashSet<String>, Error> which returns all plausible names in a web page.

Check your function by displaying the names present on "https://www.liberation.fr".


Extra: if you have more time

If you have more time, you can improve the name detector so that it accepts multipart names, such as "Jean-Pierre Elkabbach".