Name identification
Now that you are able to retrieve a web page, you will try to identify names written in it. A name looks like "John Doe": an uppercase letter followed by lowercase letters followed by a space followed by an uppercase letter followed by lowercase letters.
Rather than looking for such a pattern yourself, you might want to use the "regex" crate which implements regular expressions.
Using the "regex" crate
Add the "regex" crate to your Cargo.toml:
$ cargo add regex
From now on, you are able to use the entities defined in this crate, in particular the regex::Regex type. To be able to refer to it as Regex instead of regex::Regex, you might want to put a use regex::Regex; near the beginning of your program.
Of course we want our regular expression to match French names containing accented characters. You can use the following regular expression patterns to identify uppercase and lowercase letters:
- \p{Uppercase} will match any Unicode uppercase letter (for example it would match the Greek delta letter "Δ")
- \p{Lowercase} will match any Unicode lowercase letter, such as the Greek delta letter "δ"
- + after a pattern means "one or more"
You can then deduce that a name component can be represented as \p{Uppercase}\p{Lowercase}+. This would match "John", "Doe", "François" or "Strauß". The whole regular expression can then be \p{Uppercase}\p{Lowercase}+ \p{Uppercase}\p{Lowercase}+ (two components separated by a space).
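To make the meaning of the component pattern concrete, here is a minimal hand-written sketch of the same check using only the standard library. The helper name is_name_component is hypothetical (it is not part of the "regex" crate), and the sketch assumes that char::is_uppercase and char::is_lowercase test the same Unicode properties as \p{Uppercase} and \p{Lowercase}. Unlike a regular expression search, it tests a whole word rather than looking for a match inside a longer text.

```rust
/// Hypothetical helper: returns true if `word` is exactly one uppercase
/// letter followed by one or more lowercase letters.
fn is_name_component(word: &str) -> bool {
    let mut chars = word.chars();
    // First character: a single uppercase letter, like \p{Uppercase}
    match chars.next() {
        Some(c) if c.is_uppercase() => {}
        _ => return false,
    }
    // Remaining characters: one or more lowercase letters, like \p{Lowercase}+
    let rest = chars.as_str();
    !rest.is_empty() && rest.chars().all(char::is_lowercase)
}

fn main() {
    for word in ["John", "François", "Strauß", "JOHN", "john", "J"] {
        println!("{word}: {}", is_name_component(word));
    }
}
```

Writing this by hand for a full "first name, space, last name" search quickly becomes tedious, which is why the regular expression is the more convenient tool here.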
Creating the Regex
A regular expression is created using Regex::new(), which returns a Result. Since we know that our expression is valid, we can call unwrap() on the result:
let re = Regex::new("\p{Uppercase}\p{Lowercase}+ \p{Uppercase}\p{Lowercase}+").unwrap();
However, if you do this, you will likely get an error: \p in the string is interpreted as an escaped p. While \n represents a "line feed", \p represents… nothing known. This is an invalid escape sequence in a string.
Fortunately, Rust has raw strings, in which no escape character is recognized. The syntax of a raw string is r"…".
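As a small illustration, the sketch below writes the same name component pattern twice: once as a regular string, where each backslash has to be doubled, and once as a raw string. The function names escaped_pattern and raw_pattern are made up for this example.

```rust
/// The pattern as a regular string: every backslash must be written as \\.
fn escaped_pattern() -> &'static str {
    "\\p{Uppercase}\\p{Lowercase}+"
}

/// The same pattern as a raw string: backslashes are kept as they are.
fn raw_pattern() -> &'static str {
    r"\p{Uppercase}\p{Lowercase}+"
}

fn main() {
    // Both literals denote exactly the same sequence of characters.
    assert_eq!(escaped_pattern(), raw_pattern());
    println!("{}", raw_pattern());
}
```

Both forms are accepted by the compiler; the raw string is simply easier to read when a pattern contains many backslashes.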
You can also put double quotes in a raw string if you need to, by adding a pound sign to the double quote delimiters:
let s = r#"This is a string with some "quotes" in it"#;
But what if you need to put a double quote followed by a pound sign ("#) in the raw string? This is easy: you can change the start and end markers and increase the number of pound signs, provided they match:
let s = r###"You need a quote + 3 pound signs to end the string"###;
let t = r###"You can put "## inside without any problem!"###;
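To check what such literals actually contain, the following sketch compares raw strings with their escaped equivalents. The function names with_quotes and with_quote_and_pound are invented for this illustration.

```rust
/// A raw string delimited by r#"…"# may contain plain double quotes.
fn with_quotes() -> &'static str {
    r#"some "quotes" here"#
}

/// With r##"…"##, even a quote followed by one pound sign can appear inside.
fn with_quote_and_pound() -> &'static str {
    r##"a quote then a pound: "# done"##
}

fn main() {
    // The same contents written as regular strings need \" escapes.
    assert_eq!(with_quotes(), "some \"quotes\" here");
    assert_eq!(with_quote_and_pound(), "a quote then a pound: \"# done");
    println!("{}", with_quotes());
    println!("{}", with_quote_and_pound());
}
```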
Writing the extract_names() function
Using the .find_iter() method on a regex::Regex object, you can iterate over the matches found in the input string, as shown in the following example, which displays all sequences of uppercase characters (with at least two of them) found in the file "text.txt":
fn main() {
    // Read the content of the "text.txt" file into variable s
    let s: String = std::fs::read_to_string("text.txt").unwrap();
    // Match at least two consecutive uppercase characters
    let re = regex::Regex::new(r#"\p{Uppercase}{2,}"#).unwrap();
    println!("All uppercase sequences found:");
    for m in re.find_iter(&s) {
        // Inside the loop m is a regex::Match, which has a .as_str() method
        println!(" - {}", m.as_str());
    }
}
Exercise 2.a: write a function with signature fn extract_names(s: &str) -> Vec<String> which returns all plausible names in a string.
In this function, you will (these are mere suggestions, you are free to do it any other way you see fit):
- create a re variable containing a Regex with the pattern seen above
- create an empty vector (using vec![]) and store it into a mutable variable (this is the vector you will return at the end of the function)
- iterate over re.find_iter() to find all the matches
- on every match m you can call .as_str() to get a &str representing the text of the match (for example "John Doe"), that you can transform into a String using .to_owned() or String::from() (as usual)
- push the String in the Vec<String> that you plan to return (.push() is the method you want to use)
- return the vector
Try your function with the following main() function:
fn main() {
    let names = extract_names("Yesterday, John Doe met François Grüß in a tavern");
    println!("{names:?}");
}
Since String implements Debug, Vec<String> also implements Debug and can be displayed using the {:?} placeholder, which is handy for debugging.
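As a small, independent illustration of the {:?} placeholder, the sketch below formats a hand-built list of strings rather than the result of extract_names(). The helper debug_repr is hypothetical and exists only so that the formatted text can be inspected.

```rust
/// Hypothetical helper: returns the Debug representation of a list of strings.
fn debug_repr(names: &[String]) -> String {
    format!("{names:?}")
}

fn main() {
    // A vector built by hand, standing in for the output of a name extractor.
    let names: Vec<String> = vec![String::from("John Doe"), String::from("Jane Roe")];
    // Each element is shown with its own Debug output, i.e. between quotes.
    println!("{}", debug_repr(&names));
    // Printing the vector directly gives the same text.
    println!("{names:?}");
}
```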
Deduplicating the names
Unfortunately, we may end up with duplicates. If a text contains the same name several times, it will be present several times in the output.
A vector is not really the best structure to represent a set of objects, like a set of names. We do not care about the order, only about the presence of a name.
Rust has a std::collections::HashSet<T> type in its standard library (std). Provided that you added a use std::collections::HashSet; near the beginning of your program, you can:
- create a new HashSet<T> with HashSet::new()
- insert an element in a set h with h.insert(element); nothing happens if the element is already present in the set
- iterate over the elements of a set h like you did with a vector: for element in &h { … } (if h is a HashSet<T>, element will be a &T in the loop)
- display a HashSet<T> using {:?} as long as its element type T implements Debug
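The operations listed above can be sketched as follows on a plain list of strings. The function collect_unique is a hypothetical helper written for this illustration and does not involve any regular expression.

```rust
use std::collections::HashSet;

/// Hypothetical helper: stores every item of a slice in a set,
/// so that repeated items are kept only once.
fn collect_unique(items: &[&str]) -> HashSet<String> {
    let mut h = HashSet::new();
    for item in items {
        // Inserting an element that is already present leaves the set unchanged.
        h.insert((*item).to_owned());
    }
    h
}

fn main() {
    let h = collect_unique(&["John Doe", "François Grüß", "John Doe"]);
    println!("{} unique entries: {h:?}", h.len());
    // Iterating over &h yields references; the order is unspecified.
    for element in &h {
        println!(" - {element}");
    }
}
```

Since a HashSet does not keep its elements in any particular order, the entries may be printed in a different order from one run to the next.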
Exercise 2.b: make your extract_names() function return a HashSet<String> (instead of a Vec<String>).
Using the following main() function, you will see that there can be no duplicates:
fn main() {
    let names = extract_names("John Doe, François Grüß, John Doe");
    println!("{names:?}");
}
Returning the names in a page
Exercise 2.c: add a function with signature fn names(url: &str) -> Result<HashSet<String>, Error> which returns all plausible names in a web page.
Check your function by displaying the names present on "https://www.liberation.fr".
Extra: if you have more time
If you have more time, you can fix the name detector so that it accepts multipart names, such as "Jean-Pierre Elkabbach".