Project summary for academic year 2023-2024
In this project, you will be tasked with writing a compression/decompression program for the Zstandard format.
The project will be written in Rust. It is expected that you will be using a safe programming style, and that you will be able to use what has been shown in class in order to strengthen your program.
What you will learn
During the course of this project, you will learn:
- How to parse binary data according to a specification.
- How to write a decoder based on Huffman tables.
- How to write a decoder based on arithmetic coding (FSE).
- How to design and use safe data structures.
- How to write unit and functional tests.
- How to increase confidence in your code by using fuzzing techniques.
The Zstandard file format
What does data compression mean?
Data compression is a technique which takes advantage of the redundancy present in the data to represent it in fewer bytes than the original data. Compressed data takes less space in storage and less time to transmit over a communication channel.
Some compression techniques deliberately discard less significant data to obtain a better compression ratio, such as JPEG, which compresses images but may discard some color nuances or other details. Others are said to be lossless, and can reconstruct the original uncompressed data from the compressed one.
You probably already know some of these common lossless compression formats: zip, gzip, rar, gif, png, 7z or zstd. The one we are interested in is Zstandard, commonly abbreviated as zstd.
Zstandard is only a compression format. It does not take care of organizing files into archives as some utilities such as zip or rar do. It most resembles gzip or 7z in that it only cares about one input and one output.
Zstandard in a nutshell: frames, blocks, literals and sequences
The Zstandard format (zstd) is described in RFC 8878. As with any standardized compression format, only what goes into the compressed file and how it must be decompressed belongs to the standard. A compressor is free to do whatever it needs as long as its output conforms to the standardized format.
A compressed zstd file is made of one or more frames, written after one another. Each frame can be decoded independently of the others. For example, you could take two compressed zstd files and concatenate them: it would make a valid zstd file, whose output would be the concatenation of the two original files.
Let's check this:
# Create two files f1.txt and f2.txt
$ echo "This is the content of the first file" > f1.txt
$ echo "This is the content of the second file" > f2.txt
# Compress each file with zstd
$ zstd f1.txt
f1.txt :134.21% ( 38 B => 51 B, f1.txt.zst)
$ zstd f2.txt
f2.txt :133.33% ( 39 B => 52 B, f2.txt.zst)
$ ls -l
-rw-r--r-- 1 sam users 38 Aug 11 12:36 f1.txt
-rw-r--r-- 1 sam users 51 Aug 11 12:36 f1.txt.zst
-rw-r--r-- 1 sam users 39 Aug 11 12:36 f2.txt
-rw-r--r-- 1 sam users 52 Aug 11 12:36 f2.txt.zst
# Concatenate f1.txt.zst and f2.txt.zst into f3.txt.fzst
$ cat f1.txt.zst f2.txt.zst > f3.txt.zst
$ ls -l f3.txt.zst
-rw-r--r-- 1 sam users 103 Aug 11 12:37 f3.txt.zst
# Uncompress f3.txt.zst and display the result
$ unzstd f3.txt.zst # or, equivalently: zstd -d f3.txt.zst
f3.txt.zst : 77 bytes
$ ls -l f3.txt
-rw-r--r-- 1 sam users 77 Aug 11 12:37 f3.txt
$ cat f3.txt
This is the content of the first file
This is the content of the second file
You may have noted that the compression has not been very efficient: f1.txt was expanded (instead of shrinking) from 38 bytes to 51 bytes, and f2.txt from 39 bytes to 52 bytes. Also, f3.txt.zst took 103 bytes on disk, while the cleartext (decompressed) version is only 77 bytes long. This can be easily explained:
- The contents of f1.txt and f2.txt did not contain obvious repetitions that could have been compressed. As a consequence, the compressed files include the uncompressed content and add some overhead pertaining to the zstd format.
- The content of f3.txt.zst has been made by concatenating two inefficiently compressed files.
But what if we compress f3.txt
on its own? After all, it contains some repetitions ("This is the content of the " and " file\n"):
# Compress f3.txt. Add -f to overwrite the existing f3.txt.zst without confirmation
$ zstd -f f3.txt
f3.txt : 88.31% ( 77 B => 68 B, f3.txt.zst)
$ ls -l f3.txt.zst
-rw-r--r-- 1 sam users 68 Aug 11 12:37 f3.txt.zst
Interesting! This time, the file has really been compressed from 77 bytes down to 68 bytes.
If we look inside the compressed file f3.txt.zst, we can get some idea of what zstd did to compress the file:
$ strings f3.txt.zst
This is the content offirst file
second file
We can get an idea of what zstd will be doing to decompress this file:
- Extract "This is the content of" from the compressed file
- Copy " the " from the already decoded output
- Extract "first file\n" from the compressed file
- Copy "This is the content of the " from the already decoded output
- Extract "second file\n" from the compressed file
How does the zstd
program know what to extract and what to copy? A zstd
frame is made of blocks. Each block contains literals, that is text that will be extracted into the decompressed file. Those literals are followed by a list of sequences, which are basic instructions of the form "extract the first M unused bytes from the literals into the decompressed file, then copy N bytes from the already decoded output starting P bytes before the current point.". If after executing all the sequences of the block some literals have not been extracted into the decompressed file, they are extracted as a final step.
In our case, here is the state of the literals and of the output after every sequence. The following notations will be used:
- The next byte to use in the literals section is shown with a caret (^).
- Spaces are represented by an underscore (_) to make them more visible in the output.
- Minus signs (-) show the part from the already decoded output which is copied again, while plus signs (+) show its destination.
- "\n" (line feed character) is shown with a dot (.).
LITERALS OUTPUT
| This_is_the_content_offirst_ | |
| ^ | |
| file.second_file. | |
| | |
Executing the sequence (22, 5, 15)
(extract 22 bytes from literal, then copy 5 bytes from 15 positions behind):
LITERALS OUTPUT
| This_is_the_content_offirst_ | This_is_the_content_of_the_ |
| ^ | ----- +++++ |
| file.second_file. | |
| | |
Executing the sequence (11, 27, 27)
:
LITERALS OUTPUT
| This_is_the_content_offirst_ | This_is_the_content_of_the_first_file.T |
| | --------------------------- + |
| file.second_file. | his_is_the_content_of_the_ |
| ^ | ++++++++++++++++++++++++++ |
We have reached the end of the sequence, the rest of the literals are extracted into the output:
LITERALS OUTPUT
| This_is_the_content_offirst_ | This_is_the_content_of_the_first_file.T |
| | |
| file.second_file. | his_is_the_content_of_the_second_file. |
| ^ | |
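To make the sequence mechanics concrete, here is a minimal sketch (not part of the required project API; all names are illustrative) of how a single sequence could be executed against the literals and the already decoded output:
// Execute one sequence: take `literals_length` bytes from the literals, then
// copy `match_length` bytes located `offset` bytes before the end of the output.
fn execute_sequence(
    literals: &mut &[u8],
    output: &mut Vec<u8>,
    literals_length: usize,
    match_length: usize,
    offset: usize,
) {
    // Extract the first unused literal bytes into the output.
    let (taken, rest) = literals.split_at(literals_length);
    output.extend_from_slice(taken);
    *literals = rest;
    // Copy byte by byte: the source and destination ranges may overlap,
    // which is how runs longer than the offset are produced.
    let start = output.len() - offset;
    for i in 0..match_length {
        let byte = output[start + i];
        output.push(byte);
    }
}

fn main() {
    // Literals taken from the example above.
    let mut literals: &[u8] = b"This is the content offirst file\nsecond file\n";
    let mut output = Vec::new();
    // First sequence of the walkthrough: extract 22 literal bytes,
    // then copy 5 bytes from 15 positions behind.
    execute_sequence(&mut literals, &mut output, 22, 5, 15);
    assert_eq!(output, b"This is the content of the ".to_vec());
}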
👍 If you understand how frames, blocks, literals and sequences work together, you already know 50% of what you need to understand to complete this project.
Of course, for the compression algorithm to be effective, some work will be needed to generate or parse a compressed zstd file:
- Literals can be present as-is (as in this example), or encoded more tightly using Huffman coding.
- The Huffman trees are described using weight tables, a concise representation.
- The Huffman weight tables may be themselves encoded more tightly using finite state entropy (FSE) coding.
- Sequence parameters may be encoded tightly using FSE followed by an additional expansion step.
- FSE table coefficients use a concise bit-level representation.
- Data decoded through FSE or Huffman coding is read backwards, one bit after another.
- An optional checksum may be used to verify the integrity of the decoded data.
Repository setup
We will create a Git repository for you so that you can develop your application in a private space. Access will be available only to you (and your teammate if any) and to the lecturers.
Repository creation
Please ensure that you already have an account on https://gitlab.telecom-paris.fr/, or create one by signing in with the "SSO & Fédération d'identité Edugain" button, using your academic credentials.
✅ Send a mail to the class mailing-list to request the creation of a new repository. Make sure you include the GitLab usernames of every team member in your mail.
Note: you can find your GitLab username by clicking on the top-right area in GitLab; it starts with a "@".
Repository layout
Your project is expected to be located at the root of your repository.
โ Once you have cloned the repository we have created for you, you can initialize the project structure by using the following commands at the repository root:
$ ls -a
. .. .git README.md
$ cargo init . --name net7212
Created binary (application) package
You can then check that everything works as expected:
$ cargo run
Compiling net7212 v0.1.0 (/home/johndoe/courses/net7212/project/)
Finished dev [unoptimized + debuginfo] target(s) in 0.32s
Running `target/debug/net7212`
Hello, world!
✅ It is time to add the Rust source code to git:
$ ls -a
. .. .git .gitignore Cargo.toml Cargo.lock README.md src target
$ git add .
$ git status
On branch main
Changes to be committed:
(use "git restore --staged <file>..." to unstage)
new file: .gitignore
new file: Cargo.lock
new file: Cargo.toml
new file: src/main.rs
$ git commit -m "Initialize cargo project net7212"
$ git push
If several persons work on the same project, the initialization above has to be done only once. Other participants may now clone the project or update their copy.
Note: the location (root of the git repository) and the name (net7212) of the cargo project are mandatory so that our automated project tools can work properly. Also, while it is fine to use branches to develop specific features, the work to be graded is expected to be found in the main
branch.
Getting help
You might need to get help during the project. Do not worry, this is expected.
If anything here is unclear, please send a mail to the list (see below) to ask for clarifications. We might indeed want to update the site to make it clearer.
Useful resources
You might want to use the following resources first:
- The Zstandard RFC is the authoritative reference for Zstandard. Every question you might have about the file format should find an answer there.
- Nigel Tao has authored a seven part blog series explaining in details various parts of the standard.
Mailing-list
📧 A mailing-list has been created for all course and project related exchanges. Please use it. If you do not remember its address, please look in your received mail folder and look for the name of the class (NET7212).1
You can ask a question about anything related to the project or course on the mailing-list. When sending a mail, make sure that:
- You use a sensible Subject: field (e.g., "How can I convert a String to lowercase?" rather than "Help!!!").
- You include enough information so that people can actually help you ("it does not work" will not help others helping you) 😮.
- If you want us to look at your code, ensure it is committed and pushed to GitLab and state clearly which branch or commit we should look at.
Also, do not hesitate to send a mail on the mailing-list if you think you have the answer to a question asked by others, or if you think our answer is not the best one.
✍️ When sending or answering a mail, always keep the mailing-list as a recipient: everyone in the class deserves to see the question and the answer.
❌ Do not send an email about the project or the course content directly to the lecturers in private email: we insist on sharing the knowledge between all participants.
During the lab sessions
The lab sessions are supposed to be usedโฆ for working on the lab topic, not the project. So unless you are done with the lab assignment, please refrain from requesting help about the project. However, if you want to discuss something related to the project with the lecturer, feel free to take some time off the lab session to talk about a particular point.
We don't want to give the address on this publicly accessible web site to prevent being spammed.
Implementing a decompressor
Your goal, in this part, is to implement a zstd file decompressor which ultimately accepts any zstd compressed file.
The decompressor will be built incrementally, and tests will be added at each step. More and more zstd compressed files will be decodable as you progress.
In this first part, you will build a decompressor which is able to parse and decode a subset of Zstandard files, thus achieving a first level of compression capabilities. In the second part, you will later implement the full standard.
General principles
You are free to organize your project as your wish, but some general principles will be respected (unless you convince us by discussing it on the mailing-list).
💡 Do not hesitate to come back to this page frequently to ensure you respect those design principles.
⚠️ You are not supposed to implement everything described in the current page at the beginning of the project. Various parts will be described in more detail in subsequent sections.
Library vs. program
While we ultimately want to produce an executable program that will take arguments from the command line, we also want all capabilities we offer to be available as a library for at least two reasons:
- We want other programs to be able to include our decoder as a dependency.
- We want to be able to test various functions using unit tests.
Your project will contain, in the same crate:1
- a library
- an executable
The executable should be reduced to handling the command line arguments and calling the right library functions. Its job should be kept to a minimum.
Error handling
Every module whose functions return a Result
should have its own Error
type, and possibly its own Result
shortcut. Using the thiserror
crate in the library allows easy embedding and conversion of one error type to another:
// File "mymod.rs"
#[derive(Debug, thiserror::Error)]
pub enum Error {
// Encapsulate an I/O error without adding any more context
// and add a `impl From<std::io::Error> for Error` implementation.
#[error(transparent)]
Io(#[from] std::io::Error)
// Custom error originating from this module
#[error("bad magic found {found:#06x} instead of expected {expected:#06x}")]
BadMagic { expected: u32, found: u32 }
// Encapsulate an error from another module while adding context
#[error("corrupted Huffman table weights")]
CorruptedWeights(#[source] fse::Error)
}
// Define a local `Result` type alias which can be used with one argument,
// the default for the second type being defined as the local `Error` type.
pub type Result<T, E = Error> = std::result::Result<T, E>;
// Example function returning a `Result` without repeating the `Error` type.
// The return type can be written as:
// - `Result<Vec<u8>>` in this module, or `Result<Vec<u8>, Error>`
// - `mymod::Result<Vec<u8>>` or `Result<Vec<u8>, mymod::Error>` from outside
fn read_file(filename: &str) -> Result<Vec<u8>> {
Ok(std::file::read(filename)?)
}
The executable can implement generic error handling by using, for example, anyhow, or the combination of eyre and color-eyre for a more pleasant and colorful error output2.
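For illustration, here is a minimal sketch of what such a main function could look like, assuming the anyhow crate has been added as a dependency of the executable (the file name is hypothetical):
fn main() -> anyhow::Result<()> {
    // `?` converts the `std::io::Error` into an `anyhow::Error` automatically.
    let data = std::fs::read("input.zst")?;
    println!("read {} bytes", data.len());
    Ok(())
}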
Consuming the input
We will need to use several parsers to handle the compressed content:
- A main forward byte-oriented one, which delivers the bytes in the input in order.
- A forward bit-oriented sub-parser to decode the coefficients of a FSE table.
- A backward bit-oriented sub-parser to decode FSE-encoded or Huffman-encoded data.
The main parser should have the following properties:
- It can preload data on-demand, for example if all input bytes have been consumed, or if the decoder needs to access bytes located later in the input.
- It can lend a reference to the data to a sub-parser without copying it.
- It may discard data that has been either consumed directly or through a sub-parser.
๐ก At the beginning of the project, having the data to decode in memory as a slice of bytes might be sufficient to satisfy the requirements of the main parser.
Producing the output
Producing the output is simpler than parsing the input, as all transactions are byte-oriented and going forward. Our output producer, dedicated to a frame, should have the following properties:
- It should emit data as soon as possible but not sooner: decoding a frame should only output its uncompressed data when previous frames have finished outputting theirs.
- It should discard data as soon as possible, but not sooner: later blocks in the same frame may need to copy already-produced data.
Decoding a frame
Within a frame, later blocks might refer to decoding tables or data structures used in previous ones. For example, if two successive blocks use the same Huffman weights to encode the literals in both blocks, the second block will only contain a "use the previous Huffman decoder" marker.
You will need to maintain a shared state between successive blocks of the same frame. This state must contain:3
- The three repeat offsets.
- The latest decoders used to decode the constituents of a sequence (literals lengths, offsets, and match lengths).
- The latest Huffman decoder used to decode literals.
- Some high-level information about the frame such as the window size.
- The current checksum of the data being decoded in this frame.
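As an illustration only, such a shared state could be sketched as follows; every field name and type below is an assumption, not a requirement:
// A hedged sketch of a per-frame decoding context.
pub struct DecodingContext {
    pub repeat_offsets: [usize; 3],
    pub window_size: u64,
    // The decoders kept from previous blocks would also live here, e.g.:
    // pub literals_decoder: Option<HuffmanDecoder>,
    // pub sequence_decoders: Option<...>,
    // as well as the running checksum of the data decoded so far.
}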
Do not worry if you do not yet understand the Rust-specific parts of the project specification at first (e.g., crates, Result, thiserror, etc.), you will learn about all of them later during the course.
If your program depends on the color-eyre crate, you may omit the eyre dependency if you wish, as color-eyre depends on eyre and reexports it. You can use color_eyre::eyre; in your code then use eyre::Result.
Do not worry if some concepts seem obscure at the beginning of the project, you will discover and understand them later while reading the zstd standard.
Parsing input data
It is now time to try and extract frames. In order to later be able to handle frames in parallel as much as possible, it is necessary to parse the input in two steps:
- Do the minimum work needed to extract a frame from the input (parsing phase).
- Later, as frames have been separated, do the rest of the work to decode the frame compressed content (decoding phase).
As zstd compressed input does not contain a table of contents indicating where each frame starts, parsing a frame implies having already parsed all frames coming beforehand.
Forward byte parser
The first parser you need is a forward byte parser, which can deliver bytes from the input in order. The parser must also remember not to deliver bytes it has delivered already. To do this, a slice of bytes is enough as it references a piece of memory.
✅ Create a parsing module (in file parsing.rs) and create a ForwardByteParser type in it.
⚠️ Do not forget to reference this module in lib.rs as a module with public visibility.
Since a slice does not own its content, it must be accompanied by a lifetime parameter, for example:
pub struct ForwardByteParser<'a>(&'a [u8]);
Initializing a forward byte parser from an existing slice is straightforward:
impl<'a> ForwardByteParser<'a> {
pub fn new(data: &'a [u8]) -> Self {
Self(data)
}
}
Consuming a byte from the input
Consuming a byte from the input implies returning the first byte, if it exists (the input may be empty) and storing a slice with the first byte removed:
impl<'a> ForwardByteParser<'a> {
pub fn u8(&mut self) -> Option<u8> {
let (first, rest) = self.0.split_first()?;
self.0 = rest;
Some(*first)
}
}
While returning an Option<u8>
seems handy, it will not be very useful: when you need a byte from the input, not obtaining it should be an error that can be propagated further.
✅ Create an Error type and a Result alias in the parsing module as described in the general principles. Modify your u8() method so that it returns an error if the input is empty.
This error alternative (for example NotEnoughBytes { requested: usize, available: usize }) must be generic and will be reused in other methods. When displayed, the error should print: not enough bytes: 1 requested out of 0 available.
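As a hedged illustration, the error type could look like the following sketch, assuming thiserror is used as suggested in the general principles (the display string matches the expected message above):
#[derive(Debug, thiserror::Error)]
pub enum Error {
    #[error("not enough bytes: {requested} requested out of {available} available")]
    NotEnoughBytes { requested: usize, available: usize },
}

// Local alias with the module's Error as the default error type.
pub type Result<T, E = Error> = std::result::Result<T, E>;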
Writing a unit test
Does the parser work as expected? A unit test should be written along with the code.
✅ Create a
tests/parsing.rs
file in your repository and include the following tests.
use net7212::parsing::{self, ForwardByteParser};
#[test]
fn forward_byte_parser_u8() {
// Check that bytes are delivered in order
let mut parser = ForwardByteParser::new(&[0x12, 0x23, 0x34]);
assert_eq!(0x12, parser.u8().unwrap());
assert_eq!(0x23, parser.u8().unwrap());
assert_eq!(0x34, parser.u8().unwrap());
assert!(matches!(
parser.u8(),
Err(parsing::Error::NotEnoughBytes {
requested: 1,
available: 0,
})
));
}
Running the tests with cargo test
should yield a success.
💡 Notice that we use matches!() instead of assert_eq!() to compare the error, as the parsing::Error type does not implement PartialEq (and does not need to).
Adding more methods
Parsing a frame will require reading a 4-byte unsigned integer (u32) in little-endian format, or extracting an arbitrary number of bytes. You can add utility functions now to your parser, such as (more will be needed later):
impl<'a> ForwardByteParser<'a> {
/// Return the number of bytes still unparsed
pub fn len(&self) -> usize { todo!() }
/// Check if the input is exhausted
pub fn is_empty(&self) -> bool { todo!() }
/// Extract `len` bytes as a slice
pub fn slice(&mut self, len: usize) -> Result<&'a [u8]> { todo!() }
/// Consume and return a u32 in little-endian format
pub fn le_u32(&mut self) -> Result<u32> { todo!() }
}
Tests must also be added for those methods to ensure they act as expected. Those tests will also act as non-regression tests, allowing us to later modify the body of those methods without fear.
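For instance, le_u32() could be built on top of slice(); this is only a sketch, assuming slice() reports a NotEnoughBytes error when fewer than 4 bytes remain:
impl<'a> ForwardByteParser<'a> {
    pub fn le_u32(&mut self) -> Result<u32> {
        // Delegate the length check to slice(): it fails if fewer than 4 bytes remain.
        let bytes = self.slice(4)?;
        // The slice is exactly 4 bytes long, so the conversion cannot fail.
        Ok(u32::from_le_bytes(bytes.try_into().unwrap()))
    }
}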
Parsing some frames
Now that you know how to read bytes, integers, and slices from the input, you can think about parsing frames.
There exist two kinds of frames in the Zstandard RFC:
- Zstandard frames: they contain compressed data that must be decompressed.
- Skippable frames: they contain extra sections not related to compressed data and serve no purpose while decompressing. However they must be parsed in order to be skipped and removed from the input.
Adding a frame type
✅ Create a frame module, and a Frame enumerated type in it with two variants, ZstandardFrame and SkippableFrame.
The content of those variants will be filled later on.
Parsing a frame
Each frame starts with a magic number, which can be, in little-endian format:
- 0xFD2FB528 for a Zstandard frame
- 0x184D2A5? for a skippable frame, where ? stands for any hexadecimal digit from 0 to F.
Any other value is an error.
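As an illustration of the dispatch (the constants and helper function below are purely illustrative, not part of the required API), a range pattern is a convenient way to recognize the whole skippable magic family:
const ZSTANDARD_MAGIC: u32 = 0xFD2FB528;
const SKIPPABLE_MAGIC_LOW: u32 = 0x184D2A50;
const SKIPPABLE_MAGIC_HIGH: u32 = 0x184D2A5F;

fn classify_magic(magic: u32) -> &'static str {
    match magic {
        ZSTANDARD_MAGIC => "Zstandard frame",
        SKIPPABLE_MAGIC_LOW..=SKIPPABLE_MAGIC_HIGH => "skippable frame",
        _ => "unrecognized",
    }
}

fn main() {
    assert_eq!(classify_magic(0x184D2A53), "skippable frame");
    assert_eq!(classify_magic(0x40302010), "unrecognized");
}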
Skippable frames
A skippable frame is easy to extract from the input: right after the magic number, a 32 bit little-endian value gives the length of the frame data, then the data follows.
✅ Write a new struct type SkippableFrame to represent a skippable frame with two fields: magic, which is the found magic number, and data, which represents the unparsed data corresponding to the skippable frame as a byte slice. Make the Frame::SkippableFrame variant take a SkippableFrame parameter.
As a SkippableFrame
will reference data using a byte slice, it will necessarily involve a lifetime generic parameter. Since the Frame
enumeration has a SkippableFrame
in its Frame::SkippableFrame
variant, it will itself need this lifetime generic parameter.
✅ Write a Frame::parse() constructor that returns a frame: a valid Frame::SkippableFrame within Ok if a skippable frame is found, todo!() (for the time being) if a Zstandard frame is found, or an error Error::UnrecognizedMagic(magic) otherwise.
This requires creating a local Error type, and a Result type alias. Notice that the Error type will need a variant to encapsulate a parsing::Error, as the parsing method will use ForwardByteParser methods to extract data.
use crate::parsing::ForwardByteParser;
#[derive(Debug, thiserror::Error)]
pub enum Error {
// โฆ
}
type Result<T, E = Error> = std::result::Result<T, E>;
impl<'a> Frame<'a> {
pub fn parse(input: &mut ForwardByteParser<'a>) -> Result<Self> {
todo!()
}
}
✅ Write tests to check that skippable frames are decoded correctly.
Here is an example for starting tests/frame.rs. You must complete this with tests ensuring that a proper error is returned if the input is truncated, similar to the one checking the error if the magic is not recognized.
use net7212::frame::{self, Frame};
use net7212::parsing::ForwardByteParser;
#[test]
fn parse_skippable_frame() {
let mut parser = ForwardByteParser::new(&[
// Skippable frame with magic 0x184d2a53, length 3, content 0x10 0x20 0x30
// and an extra byte at the end.
0x53, 0x2a, 0x4d, 0x18, 0x03, 0x00, 0x00, 0x00, 0x10, 0x20, 0x30, 0x40,
//^--- magic (LE) ----^ ^------ 3 (LE) -------^ ^--- content ---^ ^-- extra
]);
let Frame::SkippableFrame(skippable) = Frame::parse(&mut parser).unwrap() else {
panic!("unexpected frame type")
};
assert_eq!(0x184d2a53, skippable.magic);
assert_eq!(&[0x10, 0x20, 0x30], skippable.data);
assert_eq!(1, parser.len());
}
#[test]
fn error_on_unknown_frame() {
let mut parser = ForwardByteParser::new(&[0x10, 0x20, 0x30, 0x40]);
assert!(matches!(
Frame::parse(&mut parser),
Err(frame::Error::UnrecognizedMagic(0x40302010))
));
}
The main program
You are now able to parse skippable frames. Let's write a program which displays, for each skippable frame, its magic number and content. Also, the program will check that the file is a well-formed zstd file.
Command-line arguments
We want our program to act like this:
$ cargo run -- --help
Usage: net7212 [OPTIONS] <FILE_NAME>
Arguments:
<FILE_NAME> File name to decompress
Options:
-i, --info Dump information about frames instead of outputing the result
-h, --help Print help
-V, --version Print version
$ cargo run -- --info skippables.zst
SkippableFrame(
SkippableFrame {
magic: 0x184d2a53,
data: [
0x10,
0x20,
0x30,
],
},
)
SkippableFrame(
SkippableFrame {
magic: 0x184d2a51,
data: [
0x42,
],
},
)
(you can download skippables.zst)
Create the main program
✅ Create the src/main.rs file containing a main function.
✅ Use clap to implement command-line arguments handling.
You can read a file using std::fs::read(). Do not print anything yet, we will do it step by step.
Decoding a frame
As previously explained, a decoding phase must happen after the parsing of a frame when --info
is not used as an argument to the main program.
✅ Implement a decode(self) method on a Frame returning the decoded data in a Vec<u8>.
Note that self means that the Frame object no longer exists after decode() has been called. This is expected, as some decoding operations will consume frame data.
So, at this time:
- You can decode a SkippableFrame as it decodes to nothing (it is skipped after all).
- You may use todo!() while decoding a ZstandardFrame as not even the parsing is implemented.
Iterating over the frames
Since all frames must be parsed in sequence, one can use an iterator to transform a forward byte parser into a succession of Frame
objects.
✅ Add a FrameIterator in the frame module, which encapsulates a ForwardByteParser.
✅ Implement the Iterator trait for FrameIterator, the Item being Result<Frame>.
The iterator's next() method will:
- Return None if the parser has no content left, to indicate the end of the iterator.
- Return Some(Frame::parse(&mut self.parser)) otherwise.
Note that this will automatically signal an error if extra characters are present at the end of an otherwise valid file, as the next iteration will attempt to parse another frame which will be invalid.
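A possible shape for this iterator, as a sketch relying on the types defined in this chapter (the field name is an assumption):
pub struct FrameIterator<'a> {
    parser: ForwardByteParser<'a>,
}

impl<'a> Iterator for FrameIterator<'a> {
    type Item = Result<Frame<'a>>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.parser.is_empty() {
            None // no content left: end of the iteration
        } else {
            Some(Frame::parse(&mut self.parser))
        }
    }
}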
Writing the main program body
✅ Complete the main program so that it can either print information or decode frames from a zstd file in sequence.
Of course you must stop if the frame iterator returns an error.
Do not forget that:
- You can derive the Debug trait automatically on many types.
- You can display something implementing the Debug trait with the {:?} construct.
- You can use an alternate representation with the {:#?} construct (for example indented structures).
- You can use an alternate representation and print numbers in hexadecimal with {:#x?}.
Also, when --info
is not used, the decoded data (even though there is nothing to display yet) must be sent to the standard output. You can use the
write_all()
method to write on std::io::stdout()
after locking it.
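A minimal sketch of this output step, assuming the decoded bytes are already available in a vector:
use std::io::Write;

fn main() -> std::io::Result<()> {
    let decoded = b"decoded frame data would go here\n".to_vec();
    // Lock stdout once, then write all the decoded bytes to it.
    std::io::stdout().lock().write_all(&decoded)?;
    Ok(())
}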
Note that if you use the color_eyre
crate, you can make your main()
function return a color_eyre::eyre::Result<()>
and let color_eyre
display a pretty error if there is one. You can also use the anyhow
crate as an alternative.
Adding tracing information
Since your program will use the standard output when decoding a file, you must not use it yourself to print debugging information. You should use the log
crate to add traces if you need them, or the tracing
/tracing_subscriber
couple.
✅ Check that you can decode a file made only of skippable frames, as well as an empty file (which is a valid zstd file with zero frames). Check that you signal an error if you try to decode an invalid file.
🔹 Where do we stand?
At this point, if you have done everything right:
- You have written a main program which can decode a file containing only skippable frames and output the decoded version (i.e., nothing) on the standard output.
- Your main program can also print information about the frames from a file it knows how to decode using --info.
- All your functions and methods have unit tests so that you can fearlessly refactor your code and extend it later.
If not, you have probably missed some things in the previous steps, please go back and fix them.
Handling Zstandard frames
Zstandard frames consist of three parts arranged in the following sequence:
- A header that provides information about the frame parameters.
- One or more chunks of data, referred to as blocks.
- An optional checksum.
To begin decoding and testing the code, it is necessary to parse and decode a few basic blocks. Let's begin with this.
Handling simple blocks
Blocks are the basic components forming a frame. There exist three types of blocks:
- A raw block contains some data, without any compression, that must be copied while decoding.
- A RLE block is a block which contains a byte b and a length l. Decoding it duplicates the b byte l times.
- A compressed block contains much more complex information and is the basis of Zstandard compression capabilities.
For a start, you will implement the decoding of the first two block types.
Structure
✅ In a block module, add an enumerated Block type with two variants (for now): RawBlock, which encapsulates a slice, and RLEBlock, which contains a byte and a repeat field.
✅ Implement Block::parse(), which accepts a &mut ForwardByteParser and returns a result containing a (Block, bool) pair. The bool is true if the block is the last block of the frame.
Do not forget to add a dedicated Error
type in the block
module, as well as a Result
type alias. If the block type is invalid (3 is reserved), an Error::ReservedBlockType
should be returned. Also, you can call unimplemented!()
if you encounter a block of type 2 (compressed block) as we do not implement it yet.
✅ Implement the Block::decode(self) method which decodes a block (and destroys the Block object by taking self).
Right now, decode()
does not need any other parameter. This will no longer be the case when we implement compressed blocks, as decoding a compressed block requires accessing a decoding context containing information about earlier operations. Also, decode()
needs to return a Result
: even if decoding cannot fail right now, it will later when more complex blocks are implemented.
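As a hedged sketch, decode() for the two variants implemented so far could look like this, assuming the repeat field is stored as a usize:
impl<'a> Block<'a> {
    pub fn decode(self) -> Result<Vec<u8>> {
        match self {
            // A raw block decodes to a copy of its content.
            Block::RawBlock(data) => Ok(data.to_vec()),
            // An RLE block decodes to `repeat` copies of `byte`.
            Block::RLEBlock { byte, repeat } => Ok(vec![byte; repeat]),
        }
    }
}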
✅ Add tests.
Here are two simple tests, you must add others, including for checking that errors are correctly detected:
use net7212::block::Block;
use net7212::parsing::ForwardByteParser;
#[test]
fn decode_raw_block_last() {
let mut parser = ForwardByteParser::new(&[
// Raw block, last block, len 4, content 0x10, 0x20, 0x30, 0x40,
// and an extra 0x50 at the end.
0x21, 0x0, 0x0, 0x10, 0x20, 0x30, 0x40, 0x50
]);
let (block, last) = Block::parse(&mut parser).unwrap();
assert!(last);
assert!(matches!(block, Block::RawBlock(&[0x10, 0x20, 0x30, 0x40])));
assert_eq!(1, parser.len());
let decoded = block.decode().unwrap();
assert_eq!(vec![0x10, 0x20, 0x30, 0x40], decoded);
}
#[test]
fn decode_rle_block_not_last() {
let mut parser = ForwardByteParser::new(&[
// RLE block, not last, byte 0x42 and repeat 0x30004,
// and an extra 0x50 at the end.
0x22, 0x0, 0x18, 0x42, 0x50
]);
let (block, last) = Block::parse(&mut parser).unwrap();
assert!(!last);
assert!(matches!(
block,
Block::RLEBlock {
byte: 0x42,
repeat: 196612
}
));
assert_eq!(1, parser.len());
let decoded = block.decode().unwrap();
assert_eq!(196612, decoded.len());
assert!(decoded.into_iter().all(|b| b == 0x42));
}
Handling the frame header
The frame header is made of a frame header descriptor and some other fields, which may or may not be present. The frame header descriptor is a set of flags which indicates which optional fields are present or not.
✅ Implement a
frame::FrameHeader
structure type containing the fields of a frame header.
✅ Implement
FrameHeader::parse()
in order to parse the frame header, and write tests to check the parsing behaviour.
Handling the frame itself
You are reaching the end of the frame parsing! It is time to put the various bits together.
Building the type
✅ Make the
frame::ZstandardFrame
structure contain three fields: the frame header, a vector of blocks, and an optional checksum.
Parsing
✅ Implement
ZstandardFrame::parse()
and call it fromFrame::parse()
in the presence of a Zstandard frame.
This method will need to:
- parse the header;
- parse the blocks (there is at least one), and continue doing so as long as the parsed block is not the last one;
- parse the checksum if it is present according to the frame header descriptor.
Decoding
✅ Implement
ZstandardFrame::decode()
and call it fromFrame::decode()
in the presence of a Zstandard frame.
Decoding a frame means decoding each block in turn and concatenating the result.
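A possible sketch of this method, assuming a blocks field holding the parsed blocks and a frame Error variant able to encapsulate block decoding errors (as described in the error-handling section):
impl<'a> ZstandardFrame<'a> {
    pub fn decode(self) -> Result<Vec<u8>> {
        let mut output = Vec::new();
        for block in self.blocks {
            // Each block is decoded in turn; its bytes are appended to the
            // frame output. The `?` relies on a #[from] conversion from the
            // block error type to the frame error type.
            output.extend(block.decode()?);
        }
        Ok(output)
    }
}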
Running the main program
Our main program should not need any change to print information or decode a zstd file containing only the supported block types.
✅ Check that you can decode the welcome.zst file which is made of two frames, the second one containing four blocks.
💡 You can examine the content of the compressed file by using hexdump or a similar program:
$ hexdump -C welcome.zst
00000000 57 2a 4d 18 30 00 00 00 3e 3e 3e 20 59 4f 55 20 |W*M.0...>>> YOU |
00000010 57 49 4c 4c 20 4e 4f 54 20 53 45 45 20 54 48 49 |WILL NOT SEE THI|
00000020 53 20 50 41 52 54 20 49 4e 20 54 48 45 20 4f 55 |S PART IN THE OU|
00000030 54 50 55 54 20 3c 3c 3c 28 b5 2f fd 24 7e 4a 01 |TPUT <<<(./.$~J.|
00000040 00 23 58 01 00 0a 23 20 57 65 6c 63 6f 6d 65 20 |.#X...# Welcome |
00000050 74 6f 20 54 65 6c 65 63 6f 6d 20 50 61 72 69 73 |to Telecom Paris|
00000060 20 7a 73 74 64 20 65 78 61 6d 70 6c 65 20 23 0a | zstd example #.|
00000070 4a 01 00 23 09 00 00 0a 9e 2e 5d 9f |J..#......].|
0000007c
🔹 Where do we stand?
At this point, if you have done everything right:
- You have written a main program which can decode a file containing skippable frames, or Zstandard frames with raw or RLE blocks.
- Your program can output the decoded version on the standard output.
- Your main program can also print information about the frames from a file it knows how to decode using --info.
- All your functions and methods have unit tests so that you can fearlessly refactor your code and extend it later.
If not, you have probably missed some things in the previous steps, please go back and fix them.
The decompressor: part 2
Your decompressor is already able to handle a subset of the Zstandard file format. However, it does not yet take advantage of the most efficient compression techniques used by the standard.
Zstandard utilizes two efficient compressors: Huffman codes and FSE.
- Huffman codes are used to encode byte-oriented sequences with \(2^8\) possible output symbols. These codes are employed in the literals section, which consists of portions of data to be copied to the output stream during decompression.
- FSE codes are used to encode data with a smaller number of output symbols. They are used for instructions that guide the overall decoding process, known as sequences, as well as coefficients used to initialize the Huffman decoder.
💡 Ensure you fully understand the Huffman and FSE examples, and that you obtain the same result while decoding the bit strings by hand.
Huffman code
The Huffman code is a prefix code. Bits are read one-by-one. Either the current sequence of bits designates a symbol to add to the output (for example a byte), or it is incomplete and more bits need to be read. Most frequent symbols get assigned shorter bit sequences, while less frequent symbols get assigned longer bit sequences. Once a symbol has been decoded, the decoder restarts from scratch and handles the next input bits.
Of course, no bit sequence for a symbol may correspond to the prefix of another bit sequence.
Practical Huffman code example
For example, if we only had 3 symbols A, B and C in our output, and B is more frequent than A and C, Huffman coding could use a dictionary such as:
Input bits | Output symbol |
---|---|
00 | A |
01 | C |
1 | B |
The input bit sequence 10010111 would get parsed as 1_00_1_01_1_1 and decoded as BABCBB. 8 input bits decode as 6 output symbols, taking approximately 1.33 bits per symbol.
Note that while B is represented as a single 1, all other bit sequences start with 0. No ambiguity can arise while adding bits as needed. The table can be represented as a prefix tree:
In Zstandard, longer prefixes start with zeroes and shorter prefixes with ones. Also, for the same prefix width, symbols are ordered, which is why A uses 00 because it comes before C at 01.
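As a quick self-contained check of the prefix property (this is not the project's decoder, just a walk over the example dictionary):
fn main() {
    // The example dictionary: prefix -> symbol.
    let table = [("00", b'A'), ("01", b'C'), ("1", b'B')];
    let bits = "10010111";
    let mut output = String::new();
    let mut current = String::new();
    for bit in bits.chars() {
        current.push(bit);
        // As soon as the accumulated bits match a prefix, emit the symbol
        // and restart from scratch.
        if let Some(entry) = table.iter().find(|entry| entry.0 == current) {
            output.push(entry.1 as char);
            current.clear();
        }
    }
    assert_eq!(output, "BABCBB");
}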
FSE code
While the Huffman decoder starts from scratch once a symbol has been emitted, a FSE (finite-state entropy) decoder keeps an internal state. Upon entering a state, the decoder emits an output symbol, then reads a number of input bits to switch to another state, and so on. States are consecutively numbered starting from 0. Each state contains three pieces of information:
- the output symbol to emit
- the number of bits to read to determine the next state
- the base line (a number) to add to the number read from the new bits to get the new state number
A FSE code has an accuracy log (AL) which represents the base 2 logarithm of the number of states. For example, a FSE decoder with 32 states will have an AL of 5, because \(2^5\) is 32. The AL is also the number of bits to read to determine the initial state.
Practical FSE example
Let us assume that we have a FSE table with 4 states. The accuracy log (AL) is 2.
State | Output symbol | Base line (BL) | Number of bits to read |
---|---|---|---|
0 | A | 1 | 0 |
1 | D | 2 | 1 |
2 | B | 0 | 1 |
3 | A | 2 | 1 |
Let us decode the 1110000101 bit sequence (11_1_0_0_0_0_1_0_1 for clarity of this example). AL bits are read from the string, giving 11. The initial state is 3 (0b11):
- Entering state 3 outputs symbol A. 1 bit (1) is added to the BL 2: the new state is 3 (0b1 + 2).
- Entering state 3 outputs symbol A. 1 bit (0) is added to the BL 2: the new state is 2 (0b0 + 2).
- Entering state 2 outputs symbol B. 1 bit (0) is added to the BL 0: the new state is 0 (0b0 + 0).
- Entering state 0 outputs symbol A. 0 bits are read, the new state is the BL 1: the new state is 1.
- Entering state 1 outputs symbol D. 1 bit (0) is added to the BL 2: the new state is 2 (0b0 + 2).
- Entering state 2 outputs symbol B. 1 bit (0) is added to the BL 0: the new state is 0 (0b0 + 0).
- Entering state 0 outputs symbol A. 0 bits are read, the new state is the BL 1: the new state is 1.
- Entering state 1 outputs symbol D. 1 bit (1) is added to the BL 2: the new state is 3 (0b1 + 2).
- Entering state 3 outputs symbol A. 1 bit (0) is added to the BL 2: the new state is 2 (0b0 + 2).
- Entering state 2 outputs symbol B. 1 bit (1) is added to the BL 0: the new state is 1 (0b1 + 0).
- Entering state 1 outputs symbol D. The 1 bit to read is not available: we stop here.
The decoded symbol sequence is AABADBADABD. One may note that:
- One state may encode more than one output symbol, since the number of bits to read to get to the following state may be 0.
- Contrary to a Huffman code, some symbols need less than one bit in the compressed representation. Indeed, here we have a 10-bit input string decoding into an 11-symbol output, that is approximately 0.9 bits per symbol.
- Several states may encode the same symbol (states 0 and 3 both emit A).
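The following self-contained sketch replays this walkthrough; it reads the bits from a plain vector for clarity, whereas the real decoder will use a backward bit parser:
fn main() {
    // (output symbol, baseline, number of bits to read), indexed by state.
    let table: [(u8, u32, u32); 4] = [(b'A', 1, 0), (b'D', 2, 1), (b'B', 0, 1), (b'A', 2, 1)];
    // The 1110000101 bit string, most significant bit first.
    let mut bits = [1u32, 1, 1, 0, 0, 0, 0, 1, 0, 1].into_iter();
    // Read AL = 2 bits to get the initial state.
    let mut state: u32 = (0..2).fold(0, |acc, _| (acc << 1) | bits.next().unwrap());
    let mut output = String::new();
    loop {
        let (symbol, baseline, nb_bits) = table[state as usize];
        output.push(symbol as char);
        // Read `nb_bits` bits (if available) and add them to the baseline.
        let mut value = 0;
        for _ in 0..nb_bits {
            match bits.next() {
                Some(bit) => value = (value << 1) | bit,
                None => {
                    // Not enough bits left: the decoding stops here.
                    assert_eq!(output, "AABADBADABD");
                    return;
                }
            }
        }
        state = baseline + value;
    }
}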
Huffman decoder
You will now implement a Huffman decoder able to decode a bit stream into a set of bytes. Your decoder will be implemented as a tree, as this is a direct mapping of the Huffman decoder concepts explained previously.
Huffman tree structure
Your Huffman tree structure will reside in the decoders/huffman
module.
✅ Create a decoders module.
✅ Create a huffman module in the decoders module.
✅ Create a HuffmanDecoder enumerated type in the decoders/huffman module.
The HuffmanDecoder represents a node in the tree. It is either Absent, a Symbol (with a u8 payload), or a subtree represented by a Tree variant and two HuffmanDecoder values representing the left (0) and right (1) alternatives.
inside a HuffmanDecoder
itself, as the container object would have to have a greater size than the contained one, which is not possible as they have the same type. Also, you cannot use references either to represent the left and right alternatives: references do not represent ownership. Who would be the owner of the left and right alternatives?
You will have to use a type which acts like a reference (in that you don't have to store the whole object) and owns a subtree.
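A possible definition, assuming Box is used as the owning, reference-like type hinted at above:
pub enum HuffmanDecoder {
    // No symbol or subtree has been inserted here yet.
    Absent,
    // A leaf carrying the decoded byte.
    Symbol(u8),
    // An internal node owning its left (0) and right (1) children.
    Tree(Box<HuffmanDecoder>, Box<HuffmanDecoder>),
}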
Inserting a symbol in a tree
✅ Write a HuffmanDecoder::insert() method which inserts a symbol in the tree at the leftmost possible free position for a given width, and returns true if it has succeeded.
impl HuffmanDecoder {
pub fn insert(&mut self, symbol: u8, width: u8) -> bool {
todo!()
}
}
Its behaviour will be:
- If width is 0, the symbol must be inserted right at the top if the top is currently Absent. Otherwise this is a failure, and false must be returned.
- Otherwise, the symbol must be inserted with a decreased width on the left branch and, if that fails, on the right branch. Branches must be created (as absent) if the node doesn't contain them yet. If the top of the tree is a Symbol, just panic!() as this should never happen: it would mean that it is impossible to insert the node without resolving to a symbol before the prefix length is reached.
For example, the tree shown in the Huffman code example can be built by running the following function:
fn build_example_tree() -> HuffmanDecoder {
let mut tree = HuffmanDecoder::Absent;
assert!(tree.insert(b'A', 2));
assert!(tree.insert(b'C', 2));
assert!(tree.insert(b'B', 1));
tree
}
Printing a tree (optional)
You may want to implement Debug on your HuffmanDecoder type to list its content. While you may derive Debug automatically, you may want a better representation, showing the prefixes.
To do so, you can build an iterator which iterates over the nodes of your tree and their respective prefix. This iterator will keep a list of nodes to visit and their prefix internally, and yield pairs of (String, u8) which represent a prefix and the associated symbol.
✅ Build a HuffmanDecoderIterator<'a> structure which contains a list of nodes yet to visit, for example as a Vec<(&'a HuffmanDecoder, String)>.
✅ Implement Iterator for this structure: take the first node to visit by popping it from the vector. If it is a symbol, return it along with its prefix; if it is a subtree, note that you have to visit both branches (choose the right order) and try again with the next node in your list; if there is nothing, just go to the next node in your list.
✅ Implement Debug for HuffmanDecoder, taking advantage of the std::fmt::Formatter::debug_struct() method to structure your output.
If you do so, executing the following code:
fn main() { println!("{:?}", build_example_tree()); }
should print something like:
HuffmanDecoder { 00: 65, 01: 67, 1: 66 }
(because 65, 67 and 66 are the ASCII codes for A, C and B respectively)
Building a tree from prefix widths
In a Zstandard file, a Huffman table will be described by a succession of prefix widths (numbers of bits) for each symbol. After some decoding process that will be shown later, you will end up with a list of up to \(2^8\) integers, each representing the number of bits needed to represent each symbol. Missing entries, or entries with a prefix width set to 0, mean that this particular symbol will not be present in the output stream and thus will not appear in the Huffman table.
✅ Build a HuffmanDecoder::from_number_of_bits() constructor which takes a vector containing the number of bits for each symbol and returns a newly built HuffmanDecoder.
What you must do is:
- Build a list of every symbol and its width.
- Filter out zero width entries as those symbols will never appear in the output. They are present only because some later symbols in the list have a non-zero width.
- In a newly created tree, insert every symbol at the right width, starting from the largest width and smallest symbol values within the same width. Make sure that you read the note on sort stability in Rust documentation.
Once you have done so, you should check that the following code builds the same tree as before:
fn main() {
// 0 repeated 65 times, 2, 1, 2
let widths: Vec<_> = std::iter::repeat(0).take(65).chain([2, 1, 2]).collect();
println!("{:?}", HuffmanDecoder::from_number_of_bits(widths));
}
Representation through weights
Inside a Zstandard file, prefix widths are represented as weights:
- A weight of 0 represents a symbol absent from the output.
- If the longest prefix width is \(L\), a weight of \(w\) represents a prefix width of \(L+1-w\).
Moreover, three small additional optimizations are done to the list of weights present in a Zstandard file:
- All 0 weights at the end of the list are absent. For example, a list of [2, 1, 0, 1, 0, 0, …] will not contain the zeroes after the second 1.
- The last non-zero weight in the list is also absent. It can be computed by looking at the existing weights. For example, a list of [2, 1, 0, 1] will be represented as [2, 1, 0].
- The longest prefix width is not indicated in the file. It must be deduced from the list of weights.
✅ Build a HuffmanDecoder::from_weights constructor which takes a vector of weights, represented as u8, and returns a result with the newly built HuffmanDecoder or an error. You must create a new local Error enum as you have done in other modules, as well as the Result type alias.
To compute the missing weight, which must be added to the list of weights received by the method, you can use the following properties:
- If you sum \( 2^{w-1} \) for all weights \(w\) different from 0, you will get a power of 2.
- No \( 2^{w-1} \) expression can be strictly greater than the sum of all others.
This should let you determine the missing last weight by ensuring that when added, the sum of \( 2^{w-1} \) rounds to the smallest strictly greater power of 2.
Once you have determined the missing weight, you can also deduce the number of bits needed to represent the weights. This can be easily done by taking the base 2 logarithm of the power of 2 you have reached.
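As an illustration of this computation (a sketch only; the function name and the u32 arithmetic are assumptions, not part of the required API):
// Compute the missing last weight from the weights read from the file.
fn missing_weight(weights: &[u8]) -> u8 {
    // Sum 2^(w-1) for every non-zero weight present in the file.
    let sum: u32 = weights
        .iter()
        .filter(|&&w| w != 0)
        .map(|&w| 1u32 << (w - 1))
        .sum();
    // The completed sum must reach the smallest power of two strictly greater
    // than the current sum.
    let target = if sum.is_power_of_two() { sum * 2 } else { sum.next_power_of_two() };
    let missing = target - sum; // equals 2^(w-1) for the missing weight w
    missing.trailing_zeros() as u8 + 1
}

fn main() {
    // Weights read from the file in the example: 0 repeated 65 times, then 1 and 2.
    let weights: Vec<u8> = std::iter::repeat(0).take(65).chain([1, 2]).collect();
    assert_eq!(missing_weight(&weights), 1);
    // The longest prefix width is then log2(4) = 2 bits, as in the previous example.
}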
You can now compute the prefix widths for all weights, and call your previously written HuffmanDecoder::from_number_of_bits()
to build the Huffman decoder.
The following test code should build the same decoder as before:
fn main() {
// 0 repeated 65 times, 1, 2
let weights: Vec<_> = std::iter::repeat(0).take(65).chain([1, 2]).collect();
println!("{:?}", HuffmanDecoder::from_weights(weights));
}
💡 You may have noticed that in our example above the vector of weights contains a lot of 0 at the beginning. In fact, in a real weights array, many weights will be similar to each other, as they will all take a value between 0 and 8, inclusive. Does this ring a bell? Indeed, as we will see later, the weights are often themselves encoded using a FSE encoder which thrives in the presence of a small number of symbols.
In the next section, we will learn how to feed bits to our Huffman decoder and make it decode them.
Backward bit parser
The Huffman decoder you have just built needs to be fed some bits. In the same way that you have built a forward byte parser, you will now have to build a backward bit parser.
Why backward? Because of the way Zstandard encoders work, starting from the end of the stream to compress and writing the various bits to the compressed stream.
However, writing a stream of bits does not necessarily mean that the encoder will fill an integral number of bytes. To signal the end of the encoded bitstream, the encoder writes an extra 1
then completes the byte with 0
bits if needed.
Parsing a bitstream backwards
What does it mean to parse a bitstream backwards?
- You must start from the last byte and progress in reverse order.
- In every byte, you start reading bits from the most significant bit to the least significant bit.
- When you start producing the bitstream, you must skip all initial zeroes and the first one you encounter.
For example, let us start from the [0x05, 0x74]
byte sequence found in a Zstandard file. In binary, this is written as [0b00000101, 0b01110100]
.
The bit string will be consumed starting from the byte marked α, then the byte marked β, following the arrow direction inside each byte:
00000101 01110100
|---β--> |---α-->
^
The caret (^) represents the next bit to read. First, the initial zeroes must be skipped until the first one is found and discarded. This happens necessarily within the initial byte. After discarding the header, the caret has advanced two positions (one initial zero and the initial one):
00000101 --110100
|---β--> |---α-->
^
Now, let us read a 3 bits value. We get 110
and the caret advances 3 positions.
00000101 -----100
|---β--> |---α-->
^
Reading 7 bits will give us 0b1000000
(3 bits from the current byte, 4 bits from the byte just before).
----0101 --------
|---β--> |---α-->
^
Reading 4 bits yields 0b0101
: we have reached the end of the stream.
-------- --------
|---β--> |---α-->
As another example, here is how the two bitstreams from the examples would be encoded in the Zstandard file:
- Huffman code: 10010111 would be [0b10010111, 0b00000001], or [0x97, 0x01]. Seven 0 from the last byte will be skipped before the initial 1 is found and skipped.
- FSE code: 1110000101 would be [10000101, 00000111], or [0x85, 0x07]. Five 0 and the initial 1 will be skipped.
✅ Implement a BackwardBitParser type in the parsing module.
The following methods would be useful for such a parser:
impl<'a> BackwardBitParser<'a> {
/// Create a new backward bit parser. The header is skipped automatically or
/// an error is returned if the initial 1 cannot be found in the first 8 bits.
pub fn new(data: &'a [u8]) -> Result<Self> { todo!() }
/// True if there are no bits left to read
pub fn is_empty(&self) -> bool { todo!() }
/// Get the given number of bits, or return an error.
pub fn take(&mut self, len: usize) -> Result<u64> { todo!() }
}
Using the backward bit parser
✅ Add a method HuffmanDecoder::decode() which takes a mutable reference to a BackwardBitParser and returns the decoded symbol or an error.
You will probably have to add a new variant to HuffmanDecoder::Error
in order to propagate a parsing error signaled by the BackwardBitParser
.
Testing our parser and decoder together
Why not test our backward bit parser and our Huffman decoder together using the example given here?
#[test]
fn huffman_project_example() {
// 0 repeated 65 times, 1, 2
let weights: Vec<_> = std::iter::repeat(0).take(65).chain([1, 2]).collect();
let decoder = HuffmanDecoder::from_weights(weights).unwrap();
let mut parser = BackwardBitParser::new(&[0x97, 0x01]).unwrap();
let mut result = String::new();
while !parser.is_empty() {
let decoded = decoder.decode(&mut parser).unwrap();
result.push(decoded as char); // We know they are valid A, B, or C char
}
assert_eq!(result, "BABCBB");
}
Forward bit parser
FSE decoder distribution table
The FSE decoder is used at two places in the Zstandard decoding process:
- to decode the Huffman decoder tables: two similar FSE decoders produce output symbols alternatively using the same input bitstream;
- to decode the sequences of instructions used to rebuild the output in a compressed block: three FSE decoders will be used simultaneously.
The principles of the FSE decoding process have been explained here.
The FSE decoder states are built from1:
- An accuracy log \(AL\)
- A distribution table \(\mathcal{D}\) which describes how symbols are distributed amongst the states. Every entry in the table is a number in the \([-1, R]\) range.
Since all the states will be filled, and since \(-1\) is just a special value which means that a symbol occupies only one state, we have \[R = \sum_{d \in \mathcal{D}}|d| = 2^{AL}\].
The various arguments will be read from a bitstream coming from the file being decoded:
- \(AL\) is read first and computed from 4 bits, which also gives us the states table size \(R\)
- The distribution table is read, one symbol after the other. Since the sum of absolute values in the distribution table cannot exceed \(R\), you always know the maximum value that can be read. This lets you determine the number of bits you need to read from the stream in order to decode this value.
- You will know when you have a full table, as the sum of absolute values will be equal to \(R\). It means that you know when to stop reading the bitstream.
For reasons which will become apparent later, the data pertaining to the FSE table description are parsed with a forward bit parser. This is a parser similar to the backward bit parser you have written already, but bits are consumed forward.
Parsing a bitstream forwards
What does it mean to parse a bitstream forwards?
- You must start from the first byte and progress in regular order.
- In every byte, you start reading bits from the least significant bit to the most significant bit.
- If you read several bits at a time to get an integer value, the first bit you get is the least significant one.
For example, let us start from the [0x05, 0x74]
byte sequence found in a Zstandard file. In binary, this is written as [0b00000101, 0b01110100]
.
The bit string will be consumed starting from the byte marked α, then the byte marked β, following the arrow direction inside each byte:
00000101 01110100
<---α--| <---β--|
^
Now, let us read a 2 bits value. We get 0b01
(remember that the first bit read is the least significant one in the result) and the caret advances 2 positions:
000001-- 01110100
<---α--| <---β--|
^
Similarly, reading 9 bits yields 0b100000001
(or 0b100_000001
to make it clearer which part comes from each input byte), and the caret advances 9 positions:
-------- 01110---
<---α--| <---β--|
^
Reading four more bits yields 0b1110
:
-------- 0-------
<---α--| <---β--|
^
If we don't want to read any further, the remaining bits are discarded. Note that contrary to the backwards bit parser case, we did not have to skip an initial header in order to start decoding data.
✅ Implement a ForwardBitParser type in the parsing module.
The following methods would be useful for such a parser:
impl<'a> ForwardBitParser<'a> {
/// Create a new forward bit parser. Contrary to the backward bit parser,
/// there is no header to skip.
pub fn new(data: &'a [u8]) -> Result<Self> { todo!() }
/// True if there are no bits left to read
pub fn is_empty(&self) -> bool { todo!() }
/// Get the given number of bits, or return an error.
pub fn take(&mut self, len: usize) -> Result<u64> { todo!() }
}
Using the forward bit parser
You will now read a FSE decoder table
✅ In a decoders::fse module, add a parse_fse_table() function taking a mutable reference to a ForwardBitParser and returning either a (accuracy_log, distribution) table or an error if decoding the table failed.
The signature would be:
pub fn parse_fse_table(parser: &mut ForwardBitParser) -> Result<(u8, Vec<i16>)> {
todo!()
}
The process to decode a FSE table is described with many details in the Zstandard RFC:
- Read a 4 bits value and add 5. This gives you the accuracy log \(AL\), as well as the table size \(R = 2^{AL}\). You may reject large accuracy log (greater than 9 for example) as you will not encounter them in the real life.
- As long as the sum of the absolute values of the distribution table entries is strictly lower than \(R\), you must:
- Compute the maximum value \(m = R-\sum_{d\in\mathcal{D}}|d|\) that the next entry may reach.
- You need to read a value between 0 and \(m+1\) from the stream; you will later subtract 1 from this value to get a number between -1 and \(m\). Compute the number of bits \(b\) you need to read such a value (see the small helper sketch after this list). ⚠️ How many values are between 0 and \(m+1\), inclusive?
- You may need to read either \(b-1\) bits or \(b\) bits to get the value \(v\) from the bitstream. You can do this in several ways:
  - Add a mechanism to your forward bit parser to reinject a bit that you haven't needed (hard).
  - Add a peek() method which returns the next bit in the bitstream without consuming it (easy).
  - Determine from the value read on \(b-1\) bits whether you need an extra bit or not (easiest).
- \(p = v-1\) is the new table coefficient; add it to the table. If the total exceeds \(R\), return an error.
- If \(p = 0\), you must read 2 extra bits from the bitstream to get the number of extra zeroes to insert in the table. If you get 3 (0b11), you must repeat the process for more extra zeroes.
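Once you have answered the warning above, \(b\) is simply the number of bits needed to represent the largest possible value, \(m+1\). A possible helper (the name is illustrative):
/// Number of bits needed to read any value in the range 0..=max
/// (for example, representing 0..=17 requires 5 bits).
fn bits_to_represent(max: u64) -> u32 {
    u64::BITS - max.leading_zeros()
}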
Does your implementation work?
Add some tests for your implementation, such as:
#[test]
fn parse_fse_table() {
let mut parser = ForwardBitParser::new(&[0x30, 0x6f, 0x9b, 0x03]);
let (accuracy_log, table) = parse_fse_table(&mut parser).unwrap();
assert_eq!(5, accuracy_log);
assert_eq!(&[18, 6, 2, 2, 2, 1, 1][..], &table);
assert_eq!(parser.len(), 6);
}
This test shows that:
- the table which is described has \(32 = 2^5\) states;
- those 32 states can output symbols 0 to 6;
- the approximate respective probabilities of the symbols in the output stream will be \(\frac{18}{32}\) for symbol 0, \(\frac{6}{32}\) for symbol 1, … (meaning that 18 states will output symbol 0, 6 states will output symbol 1, …);
- 6 bits were unused in the provided bitstream.
Note that the way adopted by the Zstandard RFC is not the only way to represent a FSE table. In particular, the representation chosen here, through a distribution of probabilities, assumes that any symbol (present in the uncompressed stream) can follow any symbol, while more specialized tables could exploit the full power of FSE decoding, as shown in the introductory FSE example.
FSE decoder
Now that the FSE distribution table is available, the FSE table itself must be built. As you saw in the example, those 7 coefficients will be used to build a 32-state table.
✅ In the decoders::fse module, add a FseTable type containing the states (from a State type) of a FSE decoder.
The symbols decoded by the FSE decoder are not necessarily bytes: it is suggested that you use u16 as the symbol type to ensure you will be able to decode all sequences. Tables may contain up to \(2^{20}\) states, but you are allowed to set a limit. No common Zstandard file uses more than \(2^{14}\) states, so using a u16 for the baseline looks reasonable.
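A possible layout for those types could be the following sketch (field names are illustrative, not imposed):
/// One entry of the FSE decoding table.
pub struct State {
    pub symbol: u16,   // symbol emitted by this state
    pub baseline: u16, // baseline added to the bits read to get the next state
    pub num_bits: u8,  // number of bits to read after emitting the symbol
}

pub struct FseTable {
    pub accuracy_log: u8,
    pub states: Vec<State>, // 2^accuracy_log entries
}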
Building the FSE decoder tables
A FSE decoder table is built from a distribution of symbols, i.e. the number of states that will decode to each symbol. The states corresponding to a symbol will be roughly evenly spaced in the table, making it possible to branch to a new state emitting any other symbol after reading only a small number of bits.
For every symbol, the distribution list contains:
- \(-1\) if the symbol is not frequent at all. In this case, it will be placed at the end of the states table, and will read all \(AL\) (accuracy log) bits to get the following state, making every state reachable.
- \(0\) if the symbol does not appear in the output. No state will be built for this symbol.
- \(n > 0\): the symbol will be emitted by \(n\) states spread through the table. Enough bits will be read by every state so that by carefully choosing the selected state for the current symbol it is possible to reach any other state in the table.
✅ Build a FseTable::from_distribution() method building the FSE states table given the accuracy log and the distribution list.
One possible signature is:
impl FseTable {
pub fn from_distribution(accuracy_log: u8, distribution: &[i16]) -> Self {
todo!()
}
}
The steps to build the table are described below, and are also described in the RFC.
Symbols with "less than 1" probability
Some symbols have a "less than 1" probability, represented by the -1 distribution value. They will be emitted by one state only. Since you must be able to jump to any state after emitting such a symbol, its baseline will be 0 and the number of bits to read will be the accuracy log.
✅ After building an empty table, fill it starting from the end with the states emitting the symbols with a -1 distribution value (the lowest symbol with a -1 distribution value will be stored last in the table).
The rest of the table must now be filled with states emitting symbols with a positive distribution value, in order.
Spreading the entries for a given symbol with a positive distribution value
The states corresponding to a symbol will be spread throughout the remaining table entries. The idea behind spreading them, rather than clustering them next to each other, is that for a given symbol you will need fewer bits to reach any other state.
Aside: why are the entries for a symbol spread in a FSE table?
Let us assume that your output uses four symbols A, B, C, and D, and that A has density \(\frac 12\), B has density \(\frac 14\), and C and D both have density \(\frac 18\). The chosen accuracy log is 3, so there are 8 states. The table may then be built as follows:
State | Output symbol | Base line (BL) | Number of bits to read |
---|---|---|---|
0 | A | 0 | 1 |
1 | B | 0 | 2 |
2 | A | 2 | 1 |
3 | C | 0 | 3 |
4 | A | 4 | 1 |
5 | B | 4 | 2 |
6 | A | 6 | 1 |
7 | D | 0 | 3 |
If we want to emit a given symbol and then jump to a given state, there always exists a state with this property. An encoder would start from the end of the stream of symbols to encode, and go backward to find an appropriate state. For example, let us encode "AACBAABD" and build the stream (which will be parsed by a backward bit parser when decoding) in the process. Underscores (_) will be used to separate bytes once they have been filled entirely.
- Only state 7 emits D. We want the previous state to jump there and stop.
- State 5 emits B and jumps to 7 when reading 11. Stream: 11.
- State 4 emits A and jumps to 5 when reading 1. Stream: 111.
- State 4 emits A and jumps to 4 when reading 0. Stream: 0111.
- State 5 emits B and jumps to 4 when reading 00. Stream: 000111.
- State 3 emits C and jumps to 5 when reading 101. Stream: 01000111_1.
- State 2 emits A and jumps to 3 when reading 1. Stream: 01000111_11.
- State 2 emits A and jumps to 2 when reading 0. Stream: 01000111_011.
- The initial state is encoded by writing a number of bits equal to the accuracy log (3): 010 will designate state 2. Stream is: 01000111_010011.
- A 1 is added to mark the beginning of the stream, then zeroes are added as necessary to complete the last byte (here only one is needed). Stream is 01000111_01010011, or [0b01000111, 0b01010011], or [0x47, 0x53].
The advantage of spreading appears clearly in this example: by spreading the most frequent symbol A across the states, every time an A appears in the output only one bit is needed to encode the jump to the state emitting the following symbol.
The Zstandard RFC specifies how the symbols will be spread:
- Start with filling state 0
- If state \(s\) was the last state filled and \(R\) is the table size, compute \[s' = (s + \frac R2 + \frac R8 + 3) \land (R-1)\].
- If state \(s'\) is already filled by a symbol with a "less than 1" probability, apply this operation again, otherwise you have your new state.
Given that \(R\) is a power of 2, \(x \land (R-1)\) is equivalent to x % R in Rust. Also, the accuracy log is defined as being at least 5, which means that \(R \geq 32\). As a consequence, repeatedly applying this operation to go from one state \(s\) to the next one \(s'\) visits every state exactly once before coming back to state 0.
💡 Using functions such as std::iter::successors() and combinators such as .filter(), you can easily build an iterator whose .next() result is the next state to use.
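For example, a sketch of such an iterator, assuming a reserved vector that marks the slots already taken by the "less than 1" probability symbols at the end of the table (the names are illustrative):
/// Infinite iterator over the spreading positions defined by the RFC formula,
/// skipping slots reserved for "less than 1" probability symbols.
/// `reserved` must contain 2^accuracy_log entries.
fn spread_positions(accuracy_log: u8, reserved: Vec<bool>) -> impl Iterator<Item = usize> {
    let r = 1usize << accuracy_log;
    std::iter::successors(Some(0usize), move |&s| {
        Some((s + r / 2 + r / 8 + 3) & (r - 1))
    })
    .filter(move |&s| !reserved[s])
}
Taking the first \(d\) items of this iterator (and sorting them) gives the states assigned to a symbol with distribution value \(d\).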
✅ Determine the list of states that will be used for the current symbol (there will be as many states as the distribution value). Sort the list in natural order.
If the table size is \(R\) and the distribution value for the current symbol is \(d\), starting with a baseline of \(\frac {kR}d\) for \(k \in [0, d[\) and reading \(\log_2{\frac Rd}\) bits to compute an offset would allow you to reach any other state. However, that works only when \(d\) is itself a power of 2.
If \(d\) is not a power of 2, some of the states will need to read one more bit in order to ensure the coverage of the whole state table starting from the current symbol.
✅ Given the distribution value \(d\):
- Compute \(p\) as the smallest power of two greater than or equal to \(d\). Note that \(p\) may be equal to \(d\).
- Compute the number of bits \(b = \log_2\frac Rp\) you would need to read to be able to reach all states if you had \(p\) states available for your symbol instead of \(d\).
- Compute \(e = p - d\), the number of states which will need to read one extra bit, given that in reality you only have \(d\) states available for your symbol. \(e\) may be 0 if \(p=d\).
Once you have done that, you will be able to:
- reach \(2^{b+1}\) states from each of \(e\) of your states
- reach \(2^b\) states from each of \(d-e\) of your states
which allows you to reach a total of \(e \times 2^{b+1} + (d-e) \times 2^b = R\) states. The whole table is covered provided the baselines are chosen correctly.
The Zstandard RFC mandates that:
- the first \(e\) states for the current symbol use a "Number of bits to read" of \(b+1\);
- the remaining \(d-e\) states for the current symbol use a "Number of bits to read" of \(b\);
- the base line starts at 0 for the first state with \(b\) bits;
- after setting the base line of a state, it is incremented by 2 to the power of the number of bits of this state, then the next state is processed (wrapping around the list if needed) until all states for the current symbol have been completed.
A very detailed example is given in the RFC itself. In the table describing the 5 states allocated to a given symbol, you can see that (a small code sketch of this assignment follows the example):
- The first 3 of those states require 5 bits (4 + 1 extra) and the other 2 require 4 bits.
- The baseline:
- starts at 0 in the first state requiring 4 bits, then is incremented by \(2^4=16\)
- is set to 16 in the next state, then incremented by \(2^4=16\), and \(16+16=32\)
- is set to 32 in the next state (wrapped around), then incremented by \(2^5=32\), and \(32+32=64\)
- is set to 64 in the next state, then incremented by \(2^5=32\), and \(64+32=96\)
- is set to 96 in the next state, then incremented by \(2^5=32\), and \(96+32=128\)
- All states have their baseline filled, and the baseline has reached 128 which is the table size anyway.
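A sketch of this baseline assignment, operating on the (number of bits, base line) pairs of the \(d\) states already selected for a symbol (the names and the pair representation are illustrative):
/// Assign the number of bits and base line of the `d` states of one symbol.
/// `states` holds (num_bits, baseline) pairs, `r` is the table size.
fn assign_baselines(states: &mut [(u8, u16)], r: u16) {
    let d = states.len() as u16;
    let p = d.next_power_of_two();
    let b = (r / p).trailing_zeros() as u8; // log2(R / p)
    let e = (p - d) as usize; // number of states reading one extra bit
    for (i, state) in states.iter_mut().enumerate() {
        state.0 = if i < e { b + 1 } else { b };
    }
    // Base lines start at 0 on the first state reading b bits (index e),
    // then wrap around until all states have been processed.
    let mut baseline: u32 = 0;
    for k in 0..states.len() {
        let i = (e + k) % states.len();
        states[i].1 = baseline as u16;
        baseline += 1u32 << states[i].0;
    }
}
On the RFC example (5 states in a table of size 128), this gives \(p = 8\), \(b = 4\) and \(e = 3\): the first three states read 5 bits, the last two read 4 bits, and the base lines are assigned exactly as in the list above (0 and 16 for the 4-bit states, then 32, 64 and 96 for the 5-bit ones).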
✅ Follow a similar process and fill the states assigned to your symbol.
✅ Do this for all symbols with a distribution value greater than 0.
You can then return the table: the process is complete.
Building a FSE table from a bitstream
✅ Add a FseTable::parse() constructor which takes a mutable reference to a ForwardBitParser and returns either the newly built table or an error.
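This constructor can simply chain the two functions you already have; a minimal sketch:
impl FseTable {
    /// Parse the distribution from the stream, then build the decoding table.
    pub fn parse(parser: &mut ForwardBitParser) -> Result<Self> {
        let (accuracy_log, distribution) = parse_fse_table(parser)?;
        Ok(Self::from_distribution(accuracy_log, &distribution))
    }
}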
✅ Add tests.
You may, for example, check that the table built from [0x30, 0x6f, 0x9b, 0x03] matches the table shown in Nigel Tao's example.
The FSE decoder
✅ In the decoders::fse module, add a FseDecoder struct which encapsulates everything a FSE decoder needs.
Your decoder will need to hold the following information:
- the FSE table used to decode
- the current baseline
- the next symbol size
- the symbol to emit when requested, as an Option<u16> (you may encounter more than 256 symbols while decoding sequences, but never more than 65535)
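A possible layout, mirroring the list above (field names are illustrative):
pub struct FseDecoder {
    table: FseTable,     // the FSE table used to decode
    baseline: u16,       // baseline of the current state
    num_bits: u8,        // number of bits to read to compute the next state
    symbol: Option<u16>, // symbol to emit when requested
}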
✅ In the decoders module, add the BitDecoder trait with the following signature:
/// A (possibly) stateful bit-level decoder
pub trait BitDecoder<T = u8, E = Error> {
/// Initialize the state.
///
/// # Panics
///
/// This method may panic if the decoder is already initialized.
fn initialize(&mut self, bitstream: &mut BackwardBitParser) -> Result<(), E>;
/// Return the next expected input size in bits
///
/// # Panics
///
/// This method may panic if no bits are expected right now
fn expected_bits(&self) -> usize;
/// Retrieve a decoded symbol
///
/// # Panics
///
/// This method may panic if the state has not been updated
/// since the last state retrieval.
fn symbol(&mut self) -> T;
/// Update the state from a bitstream by reading the right
/// number of bits, silently completing with zeroes if needed.
/// Return `true` if zeroes have been added.
///
/// # Panics
///
/// This method may panic if the symbol has not been retrieved since
/// the last update.
fn update_bits(&mut self, bitstream: &mut BackwardBitParser) -> Result<bool, E>;
/// Reset the table at its state before `initialize` is called. It allows
/// reusing the same decoder.
fn reset(&mut self);
}
This trait will be implemented by several bit decoders, starting with FseDecoder. You might note that some methods may panic: this is expected, as a panic there would represent a logic error (for example initializing a decoder twice without resetting it), and not something that can be caused by the input of our program. While we should not panic when the user feeds our program bad data, we are allowed to do so when our program has an internal bug.
The expected use of a type implementing this trait is:
1. Call initialize().
2. Extract the decoded symbol from the decoder using symbol() and add it to the output stream.
3. If you have previously executed step 4 and it returned Ok(true), meaning that the input stream had to be completed with zeroes, stop there: you have reached the end of the decoding process.
4. Update the decoder state with update_bits() (this will attempt to consume expected_bits() bits).
5. Go back to step 2.
✅ Implement the BitDecoder trait for FseDecoder.
✅ Write some tests, using some hand-crafted pathological cases and Nigel Tao's examples.
🔹 Where do we stand?
Why did we implement all this? In order to decode Zstandard compressed data, we will need to manipulate several entities that belong to the current block being decoded. Please refer to the Zstandard in a nutshell explanation if you do not remember all the terms used here.
We need to first decode the literals. Those literals, which serve as the initial source of data copied to the output, are often compressed using a Huffman table whose weights are encoded using FSE.
Once we have decoded the literals section, sequences will be executed, one after the other, indicating what to do, such as "extract the first M unused bytes from the literals into the decompressed file, then copy N bytes from the already decoded output starting P bytes before the current point". Those M, N and P values are encoded using FSE, and their decoding is followed by some post-processing to obtain the correct values.
We have all the building blocks needed to achieve the Zstandard decompression process.
The decoding context
Within a Zstandard frame, blocks may rely on information provided by preceding blocks. For example, a block may reuse the same Huffman table as the previous block to decode literals. In this case, instead of describing the table again using FSE and recreating the decoder, an indication will be made to reuse the same decoder.
You will create a decoding_context module to hold this state that can be shared between blocks, through a DecodingContext type. For now, the only things you will store in this DecodingContext are:
- The frame window size, which defines how far back a sequence may look in order to output again some previously output data.
- The data that has been output already, as a Vec<u8>, since sequences will be able to refer to this data.
- The HuffmanDecoder used in previous blocks, if any.
✅ Create a DecodingContext structure inside the decoding_context module with three fields so far, as shown below:
pub struct DecodingContext {
pub huffman: Option<HuffmanDecoder>,
pub decoded: Vec<u8>,
pub window_size: u64,
}
✅ Write a DecodingContext::new() constructor which receives a window size and checks it. If it is too large, an error (to be defined in the current module as usual) will be returned. Otherwise, a new DecodingContext with no Huffman table and an empty decoded output will be returned.
A limit of 64 MiB is enough and exceeds what most encoders require.
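A minimal sketch of this constructor, assuming a module-level error variant for an oversized window (the variant name is illustrative):
/// Maximum window size accepted by this decoder (64 MiB).
pub const MAX_WINDOW_SIZE: u64 = 64 * 1024 * 1024;

impl DecodingContext {
    pub fn new(window_size: u64) -> Result<Self> {
        if window_size > MAX_WINDOW_SIZE {
            return Err(Error::WindowSizeTooLarge { window_size });
        }
        Ok(Self {
            huffman: None,
            decoded: Vec::new(),
            window_size,
        })
    }
}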
✅ Modify Block::decode() so that it takes a DecodingContext as a mutable reference. Instead of returning the decoded data, Block::decode() will append it to the current context's decoded field, and return Ok(()) if everything went well.
✅ Modify Frame::decode() so that it creates a new DecodingContext with the current frame's window size, and passes this context to Block::decode(). It will then return the content of the context's decoded field if everything went well.
Don't forget to modify the tests as needed. Your program should still work as usual and pass the tests, even though it does nothing more than before, except check the window size.
Literals
Compressed blocks are made of two parts:
- the literals section
- the sequences to act on those literals
The literals section
The literals section can be of three different types: a raw (as-is) representation, a RLE encoding (a byte repeated many times), or a compressed representation.
✅ In a literals module, create a LiteralsSection enumeration with three variants: RawLiteralsBlock, RLELiteralsBlock and CompressedLiteralsBlock.
Each variant will be a struct-like one with some fields:
RawLiteralsBlock
A raw literals block encapsulates a reference to a slice of bytes. It means that the LiteralsSection type itself will have to take a lifetime as a generic parameter.
RLELiteralsBlock
A run-length-encoded literals block contains a byte and a number of repetitions of this byte.
CompressedLiteralsBlock
A compressed literals block contains more information:
- huffman: an optional Huffman decoder. When the current block is decoded, if this field is set, the decoder present in the context will be updated to this one and used. Otherwise, the decoder already present in the context will be reused for the current block.
- regenerated_size: the size of the regenerated data once decoded.
- jump_table: the literals section is made of either one stream of compressed data, or four interleaved streams; in the latter case, the sizes of the 1st to 3rd streams can be found in this field, while the size of the 4th stream can be deduced by subtracting those sizes from the total compressed size.
- data: the compressed data as a reference to a slice of bytes.
Note that we do not create a separate TreelessLiteralsBlock variant to represent the tree-less literals blocks described in the standard: we use a CompressedLiteralsBlock with a huffman field set to None.
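Putting it together, the enumeration could look like the following sketch (the exact field types, in particular for the jump table, are illustrative assumptions):
pub enum LiteralsSection<'a> {
    RawLiteralsBlock {
        data: &'a [u8],
    },
    RLELiteralsBlock {
        byte: u8,
        repeat: usize,
    },
    CompressedLiteralsBlock {
        huffman: Option<HuffmanDecoder>,
        regenerated_size: usize,
        jump_table: Vec<usize>, // sizes of the first three streams, empty for a single stream
        data: &'a [u8],
    },
}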
Parsing the literals section
✅ Implement LiteralsSection::parse() with the following signature:
impl<'a> LiteralsSection<'a> {
pub fn parse(input: &mut ForwardByteParser<'a>) -> Result<Self> {
todo!()
}
}
Parsing the section involves several steps:
- The literals section header is one byte.
- From this byte, several pieces of information can be extracted:
  - The block type
  - The format in which the regenerated size is stored
  - The regenerated size itself (most formats will require reading more bytes from the stream)
In the case of a compressed literals block, some more information must be extracted:
- The compressed data size. Each size (regenerated and compressed) can be encoded either on 10, 14, or 18 bits.
- A Huffman table, unless the block is a tree-less one which reuses the Huffman decoder from the previous block. The Huffman coefficients are compressed using FSE.
- The jump table, as explained above.
⚠️ To be able to parse a compressed literals block, you need to parse the Huffman table coefficients from FSE, as explained below. Use a todo!() or unimplemented!() placeholder until you have implemented Huffman table parsing.
Decoding the Huffman table
The description of a Huffman table in Zstandard compressed data can take two forms. First, a byte \(N\) is read from the stream.
Direct encoded Huffman weights
If \(N \geq 128\), it means that \(N-127\) coefficients follow, each one of them using 4 bits. If the number of coefficients is odd, 4 bits are lost at the end. The rest of the stream is the Huffman-encoded data.
Compressed Huffman weights
If \(N \lt 128\), coefficients are described using an FSE encoding. \(N\) describes the number of bytes used to describe the FSE table and the Huffman table. They will be organized as follows in memory:
FSE coefficients Huffman table coefficients
---------------> <-------------------------
F bytes N-F bytes
<-------------------------------------------->
N bytes in total
The FSE table will be decoded and will consume \(F\) bytes in the process (the FSE table knows when it has reached the end of its coefficients). Then the Huffman table will be decoded from two alternating FSE decoders (see below) using the same FSE table, using the remaining \(N-F\) bytes in a backward bit parser.
The rest of the original stream, after those \(N\) bytes, is the Huffman-encoded data.
Alternating FSE decoders
To decode the Huffman table coefficients, two FSE decoders are used in succession. They are built from the same FSE table, which has just been decoded, but will hold separate states. Both decoders will consume bits from the same stream.
The first decoder will get initialized from the stream, then the second one. The first decoder will emit its initial state and consume some bits to get the next state, then the second decoder will do so as well. Back to the first decoder, which will emit a state and consume some bits, then the second one, and so on.
When a decoder tries to consume more bits than is available, it will complete them with zeroes to get to its last state, and indicate to the caller that the stream has been exhausted.
✅ Implement an AlternatingDecoder structure in a decoders::alternating module which stores two FseDecoder objects as well as a boolean indicating which decoder has been updated last.
✅ Implement AlternatingDecoder::new(), which takes a FseTable. You might want to make the FseTable clonable, or find another way to share it.
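A possible sketch, assuming FseTable is clonable and FseDecoder has a constructor taking a table (both assumptions, with illustrative names):
/// Two FSE decoders built from the same table, used alternately.
pub struct AlternatingDecoder {
    first: FseDecoder,
    second: FseDecoder,
    update_second_next: bool, // which decoder must be updated next
}

impl AlternatingDecoder {
    pub fn new(table: FseTable) -> Self {
        Self {
            first: FseDecoder::new(table.clone()),
            second: FseDecoder::new(table),
            update_second_next: false,
        }
    }
}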
✅ Implement the BitDecoder trait for AlternatingDecoder.
✅ Write some tests for AlternatingDecoder.
Decode and build a Huffman table
✅ Write a HuffmanDecoder::parse() method with the signature shown below. You might want to also build the parse_direct() and parse_fse() functions, but feel free to organize your code as you wish.
impl HuffmanDecoder {
/// Build a Huffman table from the given stream. Only the bytes needed to
/// build the table are consumed from the stream.
pub fn parse(input: &mut ForwardByteParser) -> Result<Self> {
let header = input.u8()?;
let weights = if header < 128 {
Self::parse_fse(input, header)?
} else {
Self::parse_direct(input, header as usize - 127)?
};
Self::from_weights(weights)
}
/// Parse the Huffman table weights directly from the stream, 4
/// bits per weight. If there is an odd number of weights, the
/// last four bits are lost. `number_of_weights/2` bytes (rounded
/// up) will be consumed from the `input` stream.
fn parse_direct(input: &mut ForwardByteParser, number_of_weights: usize)
-> Result<Vec<u8>>
{
todo!()
}
/// Decode a FSE table and use an alternating FSE decoder to parse
/// the Huffman table weights. `compressed_size` bytes will be
/// consumed from the `input` stream.
fn parse_fse(input: &mut ForwardByteParser, compressed_size: u8)
-> Result<Vec<u8>>
{
todo!()
}
}
Back to parsing the literals section
You should now be able to parse the literals section and remove the todo!() markers you have set earlier in your code.
✅ Complete the LiteralsSection::parse() code and test it.
Decoding the literals section
✅ Implement LiteralsSection::decode().
impl<'a> LiteralsSection<'a> {
/// Decompress the literals section. Update the Huffman decoder in
/// `context` if appropriate (compressed literals block with a
/// Huffman table inside).
pub fn decode(self, context: &mut DecodingContext) -> Result<Vec<u8>> {
todo!()
}
}
Remember that you might have four streams to decode with the same Huffman decoder if the jump_table is not empty. Don't bother doing it in parallel right now: you can decode the streams successively.
✅ Add a CompressedBlock variant to Block, containing for now just a literals_section field.
✅ When the block is a compressed block, add a call to LiteralsSection::parse() to Block::parse(). Discard the rest of the block for now: we haven't implemented sequences parsing yet, this will be the next step.
✅ When the block is a compressed block, add a call to LiteralsSection::decode() to Block::decode(), and display the decoded literals when running your program on compressed text files (most of them will use compressed literals blocks). Even though you will not be able to rebuild the decompressed data yet, you should be able to make sense of the decoded literals.
Sequences
Sequences are the last pieces needed to decode Zstandard compressed frames. They contain instructions on how to combine data coming from the literals section of the current block.
Each sequence is made of three parts:
- the length in bytes of the data to copy from the literals section to the output;
- how far behind in the already output data you can find the data to emit again; and
- how many bytes from this already output data to emit again.
Decoding example
Let's assume that your literals section decodes as "abcdefgh", and that your sequences section contains two sequences which, when fully decoded, contain: (3, 2, 3) and (2, 8, 1).
The first sequence (once decoded) is (3, 2, 3):
- You copy 3 bytes of data from the literals section to the output: "abc".
- The data to copy starts two bytes behind the current end of output (one byte behind would be at "c", two bytes behind is at "b").
- You repeatedly copy 3 bytes starting from this position:
Current output: abc      Bytes repeated: 0
Current output: abcb     Bytes repeated: 1
Current output: abcbc    Bytes repeated: 2
Current output: abcbcb   Bytes repeated: 3
Now, if the second sequence is (2, 8, 1):
- You copy 2 bytes of data from the literals section to the output: "de"
- You copy 1 byte of data, starting 8 bytes behind the current end of output:
Current output: abcbcbde    Bytes repeated: 0
Current output: abcbcbdea   Bytes repeated: 1
If there are no more sequences, you finish by copying the unused literals data ("fgh") to the output, giving the final output "abcbcbdeafgh".
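In code, executing one decoded sequence boils down to the sketch below (error handling omitted, names illustrative). The match must be copied byte by byte because, as in the example above, the source range may overlap the bytes being written:
/// Execute one (literals length, offset, match length) sequence against the
/// already decoded output.
fn execute_one(
    decoded: &mut Vec<u8>,
    literals: &[u8],
    literals_length: usize,
    offset: usize,
    match_length: usize,
) {
    // Copy the requested literals first.
    decoded.extend_from_slice(&literals[..literals_length]);
    // Then repeat `match_length` bytes starting `offset` bytes behind the end.
    let start = decoded.len() - offset;
    for i in 0..match_length {
        let byte = decoded[start + i];
        decoded.push(byte);
    }
}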
Sequence decoding
Sequence decoding is described in section 3.1.1.3.2 of the RFC. You will first find a header giving the number of sequences, using a variable-size encoding. Then you will encounter a byte describing the three decoders:
- 2 bits will describe the way literal lengths (the number of bytes to copy from the literals section) are encoded
- 2 bits will describe the way offsets (how much to look back for data to repeat) are encoded
- 2 bits will describe the way the match lengths (how much to repeat) are encoded
There are four ways of encoding each table:
- Predefined mode: a predefined FSE table, described in the RFC, must be used. No extra data is to be read from the stream for this table. Each table kind (literal lengths, offsets, and match lengths) get its own specific predefined table.
- RLE (run length encoding) mode: the next byte in the stream is the value that must be repeatedly produced by this decoder.
- FSE compressed mode: the FSE table to use must be read from the stream, using a forward bit parser.
- Repeat mode: the last decoder used for this table kind must be used again. No extra data is needed from the stream.
Once the three decoders have been set, the data stream containing the sequences themselves must be decoded using a backwards bit parser on the rest of the block.
Decoding process
All three decoders are initialized by reading the number of bits they need for their respective initialization (the accuracy log for a FSE decoder, nothing for a RLE decoder) in this order:
- the literals lengths decoder
- the offsets decoder
- the match lengths decoder
Each decoder emits a symbol, in the order described below. Those symbols are not the values of the literals length, offset, and match length: they go through lookup tables, described in the RFC, that may require reading some more bits from the stream to determine the actual value. The order in which symbols are decoded is:
- the offset
- the match length
- the literals length
Then the decoders are updated from the bit stream in this order:
- the literals lengths decoder
- the match lengths decoder
- the offsets decoder
The final offset decoding process
The offset value must undergo one final transformation before it can be used, as described in the RFC: three repeat offset variables (Offset1, Offset2, and Offset3) must be kept in the decoding context and used throughout the whole frame decoding. They are initialized to 1, 4, and 8 at the beginning of the frame.
The following table shows how the three repeat offset values are updated, depending on the decoded Offset value and the literals length LL value. The real offset to use is the value of Offset1 after this transformation:
Decoded Offset value | LL | New Offset1 | New Offset2 | New Offset3 |
---|---|---|---|---|
3 | 0 | Offset1 - 1 | Offset1 | Offset2 |
3 | != 0 | Offset3 | Offset1 | Offset2 |
2 | 0 | Offset3 | Offset1 | Offset2 |
2 | != 0 | Offset2 | Offset1 | Offset3 |
1 | 0 | Offset2 | Offset1 | Offset3 |
1 | != 0 | Offset1 | Offset2 | Offset3 |
> 3 | any | Offset - 3 | Offset1 | Offset2 |
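The table above translates almost literally into code. A sketch, assuming the decoding context holds offset1, offset2 and offset3 fields (illustrative names, error handling omitted):
impl DecodingContext {
    /// Decode an offset and maintain the three repeat offsets.
    pub fn decode_offset(&mut self, offset: usize, literals_length: usize) -> Result<usize> {
        let (o1, o2, o3) = (self.offset1, self.offset2, self.offset3);
        let (new1, new2, new3) = match (offset, literals_length) {
            (1, 0) => (o2, o1, o3),
            (2, 0) => (o3, o1, o2),
            (3, 0) => (o1 - 1, o1, o2),
            (1, _) => (o1, o2, o3),
            (2, _) => (o2, o1, o3),
            (3, _) => (o3, o1, o2),
            (o, _) => (o - 3, o1, o2), // decoded offset value greater than 3
        };
        self.offset1 = new1;
        self.offset2 = new2;
        self.offset3 = new3;
        Ok(new1)
    }
}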
Suggestions on code organization
For sequence execution
✅ Execute decoded sequences with a decoded literals section from the decoding context.
In addition to the three repeat offset fields and the three literals lengths, offsets, and match lengths decoders, the following methods can be added to DecodingContext:
impl DecodingContext {
/// Decode an offset and properly maintain the three repeat offsets
pub fn decode_offset(&mut self, offset: usize, literals_length: usize)
-> Result<usize>
{
todo!()
}
/// Execute the sequences while updating the offsets
pub fn execute_sequences(
&mut self,
sequences: Vec<(usize, usize, usize)>,
literals: &[u8],
) -> Result<()> {
todo!() // Using the `Self::decode_offset()` method
}
}
For low-level sequence decoding
✅ Create a decoders::sequence module and a decoders::sequence::SequenceDecoder type. This type will contain references to the three bit decoders to use when decoding sequences.
A suggested implementation could be (in addition to the classical module-specific Error type):
pub struct SequenceDecoder<'d> {
literals_lengths_decoder: &'d mut dyn BitDecoder<u16>,
offsets_decoder: &'d mut dyn BitDecoder<u16>,
match_lengths_decoder: &'d mut dyn BitDecoder<u16>,
}
impl BitDecoder<(u16, u16, u16), Error> for SequenceDecoder<'_> {
โฆ
}
A SequenceDecoder produces triples of (literals length, offset, match length). The offset at this stage has not yet been transformed using the repeat offsets described above: this is the job of the DecodingContext::execute_sequences() method.
✅ Also, since sequences can use a RLE decoder instead of a FSE one, implement a RLEDecoder type in a decoders::rle module. This decoder will implement BitDecoder, never require any bit from the bit stream, and always produce the same value.
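This decoder is almost trivial; a possible sketch using the BitDecoder trait defined earlier and the module's usual Error type:
/// A decoder which never consumes any bit and always produces the same symbol.
pub struct RLEDecoder {
    symbol: u16,
}

impl RLEDecoder {
    pub fn new(symbol: u8) -> Self {
        Self { symbol: symbol.into() }
    }
}

impl BitDecoder<u16, Error> for RLEDecoder {
    fn initialize(&mut self, _bitstream: &mut BackwardBitParser) -> Result<(), Error> {
        Ok(())
    }

    fn expected_bits(&self) -> usize {
        0
    }

    fn symbol(&mut self) -> u16 {
        self.symbol
    }

    fn update_bits(&mut self, _bitstream: &mut BackwardBitParser) -> Result<bool, Error> {
        Ok(false)
    }

    fn reset(&mut self) {}
}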
For high-level sequence decoding
✅ Create a sequences module to parse and decode sequences, starting from the following partial skeleton, and implement the methods (and other methods and types as needed):
pub struct Sequences<'a> {
pub number_of_sequences: usize,
pub literal_lengths_mode: SymbolCompressionMode,
pub offsets_mode: SymbolCompressionMode,
pub match_lengths_mode: SymbolCompressionMode,
pub bitstream: &'a [u8],
}
impl<'a> Sequences<'a> {
/// Parse the sequences data from the stream
pub fn parse(mut input: ForwardByteParser<'a>) -> Result<Self> {
todo!()
}
/// Return vector of (literals length, offset value, match length) and update the
/// decoding context with the tables if appropriate.
pub fn decode(self, context: &mut DecodingContext)
-> Result<Vec<(usize, usize, usize)>>
{
todo!()
}
}
pub enum SymbolCompressionMode {
PredefinedMode,
RLEMode(u8),
FseCompressedMode(FseTable),
RepeatMode,
}
At the block level
✅ Add a sequences field to the CompressedBlock variant of Block. Parse and decode the sequences when parsing and decoding a compressed block.
✅ Call the execute_sequences() method of the DecodingContext when decoding a Block to apply the decoded sequences to the decoded literals.
🎉 Congratulations, you are now able to decode any Zstandard compressed file!
Checksum
Even though this is not strictly necessary in order to decode compressed Zstandard files, it is advised to implement checksum verification. Remember, you have already read the checksum value (when building the Frame type) if it is present.
You can compute the checksum as you decode data, by adding an appropriate hasher to the decoding context and updating it every time you decode a byte. The XxHash64 type of the twox_hash crate implements the right algorithm. You can truncate the final result to its lower 32 bits and compare it to the value found in the compressed Zstandard file: if they match, the data is likely not corrupt.
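A sketch of the final comparison, assuming twox_hash has been added as a dependency and the decoded bytes are fed in one go for simplicity (XxHash64 implements std::hash::Hasher):
use std::hash::Hasher;
use twox_hash::XxHash64;

/// Compute the 32-bit content checksum of some decoded data: zstd uses
/// XXH64 with seed 0 and stores only the lower 32 bits in the frame.
fn content_checksum(decoded: &[u8]) -> u32 {
    let mut hasher = XxHash64::with_seed(0);
    hasher.write(decoded);
    hasher.finish() as u32
}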
✅ Implement the checksum verification.
🔹 Where do we stand?
You have implemented a full Zstandard decoder, even capable of verifying the integrity of uncompressed data.
Of course, you might want to optimize your code, for example:
- Each frame is independently encoded and can be decoded in parallel.
- Literals and sequences decoding can progress simultaneously while producing the decoded data, which reduces the amount of intermediate data to copy around.
- Outputting data as decoding progresses would allow emitting data sooner.
But more importantly, your decoder must be made robust. This is what the next section is for.
Fuzzing the decompressor
While panicking is acceptable in case of a programming error in your code, it should never be caused by incorrect user input, which is what fuzzing will test.
✅ Use cargo fuzz to feed arbitrary data to your decoder.
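A fuzz target is only a few lines; the sketch below assumes your crate is named zstd_decompressor and exposes a top-level decode() entry point (both names are illustrative):
// fuzz/fuzz_targets/decode.rs
#![no_main]
use libfuzzer_sys::fuzz_target;

fuzz_target!(|data: &[u8]| {
    // Decoding may fail with an error, but it must never panic or crash.
    let _ = zstd_decompressor::decode(data);
});
Run it with cargo fuzz run decode and let it explore your parser for a while.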
Your decoder should never crash. Of course, it may return an error, but the error should be meaningful, and decoding must never panic or corrupt memory.
🎉 Congratulations, you now have a robust Zstandard decoder!