Huffman decoder
You will now implement a Huffman decoder able to decode a bit stream into a set of bytes. Your implementation will be implemented as a tree as this is a direct mapping of the Huffman decoder concepts explained previously.
Huffman tree structure
Your Huffman tree structure will reside in the decoders/huffman
module.
✅ Create a
decoders
module.
✅ Create a
huffman
module in thedecoders
module.
✅ Create a
HuffmanDecoder
enumerated type in thedecoders/huffman
module.
The HuffmanDecoder
represents a node in the tree. It is either Absent
, a Symbol
(with a u8
payload) or a subtree represented by a Tree
variant and two HuffmanDecoder
values representing the left (0) and right (1) alternatives.
Note that you cannot put an object of type HuffmanDecoder
inside a HuffmanDecoder
itself, as the container object would have to have a greater size than the contained one, which is not possible as they have the same type. Also, you cannot use references either to represent the left and right alternatives: references do not represent ownership. Who would be the owner of the left and right alternatives?
You will have to use a type which acts like a reference (in that you don't have to store the whole object) and owns a subtree.
Inserting a symbol in a tree
✅ Write a
HuffmanDecoder::insert()
method which inserts a symbol in the tree at the leftmost possible free position for a given width, and returnstrue
if it has succeded.
impl HuffmanDecoder {
pub fn insert(&mut self, symbol: u8, width: u8) -> bool {
todo!()
}
}
Its behaviour will be:
- If
width
is 0, the symbol must be inserted right at the top if the top is currentlyAbsent
. Otherwise this is a failure, andfalse
must be returned. - Otherwise, the symbol must be inserted with a decreased width on the left branch, and if it fails on the right branch. Branches must be created (as absent) if the node doesn't contain them yet. If the top of the tree is a
Symbol
, justpanic!()
as this can should never happen. It would mean that it is impossible to insert the node without resolving to a symbol before the prefix length is reached.
For example, the tree shown in the Huffman code example can be built by running the following function:
fn build_example_tree() -> HuffmanDecoder {
let mut tree = HuffmanDecoder::Absent;
assert!(tree.insert(b'A', 2));
assert!(tree.insert(b'C', 2));
assert!(tree.insert(b'B', 1));
tree
}
Printing a tree (optional)
You may want to implement
Debug
on yourHuffmanDecoder
type to list its content. While you may deriveDebug
automatically, you may want a better representation, showing the prefixes.To do so, you can build an iterator which iterates over the nodes of your tree and their respective prefix. This iterator will keep a list of nodes to visit and their prefix internally, and yield pairs of
(String, u8)
which represent a prefix and the associated symbol.✅ Build a
HuffmanDecoderIterator<'a>
structure which contains a list of nodes yet to visit, for example as aVec<(&'a HuffmanDecoder, String)>
.✅ Implement
Iterator
for this structure: take the first node to visit by popping it from the vector. If it is a symbol, return it along with its prefix, if it is a subtree, note that you have to visit both branches (choose the order right) and try again with the node in your list, if there is nothing just go to the next node in your list. ✅ ImplementDebug
forHuffmanDecoder
, taking advantage of thestd::fmt::Formatter::debug_struct()
method to structure your output.If you do so, executing the following code:
fn main() { println!("{:?}", build_example_tree()); }
should print something like:
HuffmanDecoder { 00: 65, 01: 67, 1: 66 }
(because 65, 67, 66 are the ASCII codes for
A
,C
andB
respectively)
Building a tree from prefix widths
In a Zstandard file, a Huffman table will be described by succession of prefix widths (number of bits) for each symbol. After some decoding process that will be shown later, you will end up with a list of up to \(2^8\) integers, each representing the number of bits needed to represent each symbol. Missing ones mean, or prefix width set to 0, mean that this particular symbol will not be present in the output stream and thus will not appear in the Huffman table.
✅ Build a
HuffmanDecoder::from_number_of_bits()
constructor which takes a vector containing the number of bits for each symbol and returns a newly builtHuffmanDecoder
.
What you must do is:
- Build a list of every symbol and its width.
- Filter out zero width entries as those symbols will never appear in the output. They are present only because some later symbols in the list have a non-zero width.
- In a newly created tree, insert every symbol at the right width, starting from the largest width and smallest symbol values within the same width. Make sure that you read the note on sort stability in Rust documentation.
Once you have done so, you should check that the following code builds the same tree as before:
fn main() {
// 0 repeated 65 times, 2, 1, 2
let widths: Vec<_> = std::iter::repeat(0).take(65).chain([2, 1, 2]).collect();
println!("{:?}", HuffmanDecoder::from_number_of_bits(widths));
}
Representation through weights
Inside a Zstandard file, prefix widths are represented as weights:
- A weight of 0 represents a symbol absent from the output.
- If the longest prefix width is \(L\), a weight of \(w\) represents a prefix width of \(L+1-w\).
Moreover, three small additional optimizations are done to the list of weights present in a Zstandard file:
- All 0 weights at the end of the list are absent. For example, a list of
[2, 1, 0, 1, 0, 0…]
will not contain the zeroes after the second 1. - The latest non-0 weight in the list is also absent. It can be computed by looking at the existing weights. For example, a list of
[2, 1, 0, 1]
will be represented as[2, 1, 0]
. - The longest prefix width is not indicated in the file. It must be deduced from the list of weights.
✅ Build a
HuffmanDecoder::from_weights
constructor which takes a vector of weights, represented asu8
, and returns a result with the newly builtHuffmanDecoder
or an error. You must create a new localError
enum as you have done in other modules, as well as theResult
type alias.
To compute the missing weight, which must be added to the list of weights received by the method, you can use the following properties:
- If you sum \( 2^{w-1} \) for all weights \(w\) different from 0, you will get a power of 2.
- No \( 2^{w-1} \) expression can be strictly greater than the sum of all others.
This should let you determine the missing last weight by ensuring that when added, the sum of \( 2^{w-1} \) rounds to the smallest strictly greater power of 2.
Once you have determined the missing weight, you can also deduce the number of bits needed to represent the weights. This can be easily done by taking the base 2 logarithm of the power of 2 you have reached.
You can now compute the prefix widths for all weights, and call your previously written HuffmanDecoder::from_number_of_bits()
to build the Huffman decoder.
The following test code should build the same decoder as before:
fn main() {
// 0 repeated 65 times, 1, 2
let weights: Vec<_> = std::iter::repeat(0).take(65).chain([1, 2]).collect();
println!("{:?}", HuffmanDecoder::from_weights(weights));
}
💡 You may have noticed that in our example above the vector of weights above contains a lot of 0 at the beginning. In fact, in a real weights array, many weights will be similar to each other, as they will all take a value between 0 and 8, inclusive. Does this ring a bell? Indeed, as we will see later, the weights are often themselves encoded using a FSE encoder which thrives in the presence of a small number of symbols.
In the next section, we will learn how to feed bits to our Huffman decoder and make it decode them.