Project summary for academic year 2023-2024

In this project, you will be tasked by writing a compression/decompression program for the Zstandard format.

The project will be written in Rust. It is expected that you will be using a safe programming style, and that you will be able to use what has been shown in class in order to strengthen your program.

What you will learn

During the course of this project, you will learn:

  • How to parse binary data according to a specification.
  • How to write a decoder based on Huffman tables.
  • How to write a decoder based on arithmetic coding (FSE).
  • How to design and use safe data structures.
  • How to write unit and functional tests.
  • How to increase confidence in your code by using fuzzing techniques.

The Zstandard file format

What does data compression mean?

Data compression is a technique which takes advantage of the redundancy present in the data to represent it in less bytes than the original data. Compressed data takes less space in storage and takes less time to transmit on a communication channel.

Some compression techniques voluntarily lose less significant data to obtain better compression ratio, such as JPEG which compress images but may discard some colors nuances or other details. Some others are said to be lossless, and can reconstruct the original uncompressed data from the compressed one.

You probably already know some of those lossless common compression formats: zip, gzip, rar, gif, png, 7z or zstd. The one we are interested in is Zstandard, commonly abbreviated as zstd.

Zstandard is only a compression format. It does not take care of organizing files into archives as some utilities such as zip or rar do. It most ressembles gzip or 7z in that it only cares about one input and one output.

Zstandard in a nutshell: frames, blocks, literals and sequences

The Zstandard format (zstd) is described in RFC 8878. As for any standardized compressed format, only what goes in the compressed file and how it must be decompressed belong to the standard. A compressor is free to do whatever it needs as long as its output conforms to the standardized format.

A compressed zstd file is made of one or more frames, written after one another. Each frame can be decoded independently of the others. For example, you could take two compressed zstd files and concatenate them: it would make a valid zstd file, whose output would be the concatenation of the two original files.

Let's check this:

# Create two files f1.txt and f2.txt
$ echo "This is the content of the first file" > f1.txt
$ echo "This is the content of the second file" > f2.txt

# Compress each file with zstd
$ zstd f1.txt
f1.txt               :134.21%   (    38 B =>     51 B, f1.txt.zst)
$ zstd f2.txt
f2.txt               :133.33%   (    39 B =>     52 B, f2.txt.zst)
$ ls -l
-rw-r--r-- 1 sam users 38 Aug 11 12:36 f1.txt
-rw-r--r-- 1 sam users 51 Aug 11 12:36 f1.txt.zst
-rw-r--r-- 1 sam users 39 Aug 11 12:36 f2.txt
-rw-r--r-- 1 sam users 52 Aug 11 12:36 f2.txt.zst

# Concatenate f1.txt.zst and f2.txt.zst into f3.txt.fzst
$ cat f1.txt.zst f2.txt.zst > f3.txt.zst
$ ls -l f3.txt.zst
-rw-r--r-- 1 sam users 103 Aug 11 12:37 f3.txt.zst

# Uncompress f3.txt.zst and display the result
$ unzstd f3.txt.zst    # or, equivalently: zstd -d f3.txt.zst
f3.txt.zst          : 77 bytes
$ ls -l f3.txt
-rw-r--r-- 1 sam users 77 Aug 11 12:37 f3.txt
$ cat f3.txt
This is the content of the first file
This is the content of the second file

You may have noted that the compression has not been very efficient: f1.txt was expanded (instead of shrinking) from 38 bytes to 51 bytes, and f2.txt from 39 bytes to 52 bytes. Also, f3.txt.zst took 103 bytes on disk, while the cleartext (decompressed) version is only 77 bytes long. This can be easily explained:

  • The contents of f1.txt and f2.txt did not contain obvious repetitions that could have been compressed. As a consequence, the compressed files include the uncompressed content and add some overhead pertaining to the zstd format.
  • The content of f3.txt.zst has been made by concatenating two inefficiently compressed files.

But what if we compress f3.txt on its own? After all, it contains some repetitions ("This is the content of the " and " file\n"):

# Compress f3.txt. Add -f to overwrite the existing f3.txt.zst without confirmation
$ zstd -f f3.txt
f3.txt               : 88.31%   (    77 B =>     68 B, f3.txt.zst)
$ ls -l f3.txt.zst
-rw-r--r-- 1 sam users 68 Aug 11 12:37 f3.txt

Interesting! This time, the file has really been compressed from 77 bytes down to 68 bytes.

If we look inside the compressed file f3.txt.zst, we can get some idea of what zstd did to compress the file:

$ strings f3.txt.zst
This is the content offirst file
second file

We can get an idea of what zstd will be doing to decompress this file:

  • Extract "This is the content of" from the compressed file
  • Copy " the " from the already decoded output
  • Extract "first file\n" from the compressed file
  • Copy "This is the content of the " from the already decoded output
  • Extract "second file\n" from the compressed file

How does the zstd program know what to extract and what to copy? A zstd frame is made of blocks. Each block contains literals, that is text that will be extracted into the decompressed file. Those literals are followed by a list of sequences, which are basic instructions of the form "extract the first M unused bytes from the literals into the decompressed file, then copy N bytes from the already decoded output starting P bytes before the current point.". If after executing all the sequences of the block some literals have not been extracted into the decompressed file, they are extracted as a final step.

In our case, here is the state of the literals and of the output after every sequence. The following notations will be used:

  • The next byte to use in the literals section is shown with a caret (^).
  • Spaces are represented by an underscore (_) to make them more visible in the output.
  • Minus signs (-) show the part from the already decoded output which is copied again, while plus signs (+) show its destination.
  • "\n" (line feed character) is shown with a dot (.).
                 LITERALS                             OUTPUT
     | This_is_the_content_offirst_ |                                         |
     | ^                            |                                         |
     | file.second_file.            |                                         |
     |                              |                                         |

Executing the sequence (22, 5, 15) (extract 22 bytes from literal, then copy 5 bytes from 15 positions behind):

                 LITERALS                             OUTPUT
     | This_is_the_content_offirst_ | This_is_the_content_of_the_             |
     |                       ^      |        -----          +++++             |
     | file.second_file.            |                                         |
     |                              |                                         |

Executing the sequence (11, 27, 27):

                 LITERALS                             OUTPUT
     | This_is_the_content_offirst_ | This_is_the_content_of_the_first_file.T |
     |                              | ---------------------------           + |
     | file.second_file.            | his_is_the_content_of_the_              |
     |      ^                       | ++++++++++++++++++++++++++              |

We have reached the end of the sequence, the rest of the literals are extracted into the output:

                 LITERALS                             OUTPUT
     | This_is_the_content_offirst_ | This_is_the_content_of_the_first_file.T |
     |                              |                                         |
     | file.second_file.            | his_is_the_content_of_the_second_file.  |
     |                  ^           |                                         |

๐Ÿ‘๐Ÿพ If you undestand how frames, blocks, literals and sequences work together, you already know 50% of what you need to understand for completing this project.

Of course, for the compression algorithm to be effective, some work will be needed to generate or parse a compressed zstd file:

  • Literals can be present as-is (as in this example), or encoded more tightly using Huffman coding.
  • The Huffman trees are described using weight tables, a concise representation.
  • The Huffman weight tables may be themselves encoded more tightly using finite state entropy (FSE) coding.
  • Sequences parameters may be encoded tightly using FSE followed by an additional expansion step.
  • FSE tables coefficients use a concise bit-level representation.
  • Data decoded through FSE or Huffman coding is read backwards, a bit after another.
  • An optional checksum may be used to verify the integrity of the decoded data.

Repository setup

We will create a Git repository for you so that you can develop your application in a private space. Access will be available only to you (and your teammate if any) and to the lecturers.

Repository creation

Please ensure that you already have an account on https://gitlab.telecom-paris.fr/, or create one by signing in with the "SSO & Fรฉdรฉration d'identitรฉ Edugain" button, using your academic credentials.

โœ… Send a mail to the class mailing-list to request the creation of a new repository. Make sure you include the GitLab usernames of every team member in your mail.

Note: you can find your GitLab username by clicking on the top-right area in GitLab โ€” it starts with a "@".

Repository layout

Your project is expected to be located at the root of your repository.

โœ… Once you have cloned the repository we have created for you, you can initialize the project structure by using the following commands at the repository root:

$ ls -a
. .. .git README.md
$ cargo init . --name net7212
     Created binary (application) package

You can then check that everything works as expected:

$ cargo run
   Compiling net7212 v0.1.0 (/home/johndoe/courses/net7212/project/)
    Finished dev [unoptimized + debuginfo] target(s) in 0.32s
     Running `target/debug/net7212`
Hello, world!

โœ… It is time to add the Rust source code to git:

$ ls -a
. .. .git .gitignore Cargo.toml Cargo.lock README.md src target
$ git add .
$ git status
On branch main
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
    new file:   .gitignore
    new file:   Cargo.lock
    new file:   Cargo.toml
    new file:   src/main.rs
$ git commit -m "Initialize cargo project net7212"
$ git push

If several persons work on the same project, the initialization above has to be done only once. Other participants may now clone the project or update their copy.

Note: the location (root of the git repository) and the name (net7212) of the cargo project are mandatory so that our automated project tools can work properly. Also, while it is fine to use branches to develop specific features, the work to be graded is expected to be found in the main branch.

Getting help

You might need to get help during the project. Do not worry, this is expected.

๐Ÿ˜• If anything here is unclear, please send a mail to the list (see below) to ask for clarifications. We might indeed want to update the site to make it clearer.

Useful resources

You might want to use the following resources first:

  • The Zstandard RFC is the authoritative reference for Zstandard. Every question you might have about the file format should find an answer there.
  • Nigel Tao has authored a seven part blog series explaining in details various parts of the standard.

Mailing-list

๐Ÿ“ง A mailing-list has been created for all course and project related exchanges. Please use it. If you do not remember its address, please look in your received mail folder and look for the name of the class (NET7212).1

You can ask a question about anything related to the project or course on the mailing-list. When sending a mail, make sure that:

  1. You use a sensible Subject: field (e.g., "How can I convert a String to lowercase?" rather than "Help!!!").
  2. You include enough information so that people can actually help you ("it does not work" will not help others helping you) ๐Ÿ”ฎ.
  3. If you want us to look at your code, ensure it is committed and pushed to GitLab and state clearly which branch or commit we should look at.

Also, do not hesitate to send a mail on the mailing-list if you think you have the answer to a question asked by others, or if you think our answer is not the best one.

โœ‹๐Ÿป When sending or answering a mail, always keep the mailing-list as a recipient: everyone in the class deserves to see the question and the answer.

โ›” Do not send an email about the project or the course content directly to the lecturers in private emailโ€”we insist in sharing the knowledge between all participants.

During the lab sessions

The lab sessions are supposed to be usedโ€ฆ for working on the lab topic, not the project. So unless you are done with the lab assignment, please refrain from requesting help about the project. However, if you want to discuss something related to the project with the lecturer, feel free to take some time off the lab session to talk about a particular point.

1

We don't want to give the address on this publicly accessible web site to prevent being spammed.

Implementing a decompressor

Your goal, in this part, is to implement a zstd file decompressor which ultimately accepts any zstd compressed file.

The decompressor will be built incrementally, and tests will be added at each step. More and more zstd compressed files will be decodable as you progress.

In this first part, you will build a decompressor which is able to parse and decode a subset of Zstandard files, thus achieving a first level of compression capabilities. In the second part, you will later implement the full standard.

General principles

You are free to organize your project as your wish, but some general principles will be respected (unless you convince us by discussing it on the mailing-list).

๐Ÿ’ก Do not hesitate to come back to this page frequently to ensure you respect those design principles.

โš ๏ธ You are not supposed to implement everything described in the current page at the beginning of the project. Various parts will be described in more details in subsequent sections.

Library vs. program

While we ultimately want to produce an executable program that will take arguments from the command line, we also want all capabilities we offer to be available as a library for at least two reasons:

  1. We want other programs to be able to include our decoder as a dependency.
  2. We want to be able to test various functions using unit tests.

Your project will contain, in the same crate:1

  • a library
  • an executable

The executable should be reduced to handling the command line arguments and calling the right library functions. Its job should be kept to a minimum.

Error handling

Every module whose functions return a Result should have its own Error type, and possibly its own Result shortcut. Using the thiserror crate in the library allows easy embedding and conversion of one error type to another:

// File "mymod.rs"

#[derive(Debug, thiserror::Error)]
pub enum Error {
    // Encapsulate an I/O error without adding any more context
    // and add a `impl From<std::io::Error> for Error` implementation.
    #[error(transparent)]
    Io(#[from] std::io::Error)
    
    // Custom error originating from this module
    #[error("bad magic found {found:#06x} instead of expected {expected:#06x}")]
    BadMagic { expected: u32, found: u32 }
    
    // Encapsulate an error from another module while adding context
    #[error("corrupted Huffman table weights")]
    CorruptedWeights(#[source] fse::Error)
}

// Define a local `Result` type alias which can be used with one argument,
// the default for the second type being defined as the local `Error` type.
pub type Result<T, E = Error> = std::result::Result<T, E>;

// Example function returning a `Result` without repeating the `Error` type.
// The return type can be written as:
// - `Result<Vec<u8>>` in this module, or `Result<Vec<u8>, Error>`
// - `mymod::Result<Vec<u8>>` or `Result<Vec<u8>, mymod::Error>` from outside
fn read_file(filename: &str) -> Result<Vec<u8>> {
    Ok(std::file::read(filename)?)
}

The executable can implement generic error handling by using, for example, anyhow, or the combination of eyre and color-eyre for a more pleasant and colorful error output2.

Consuming the input

We will need to use several parser to handle the compressed content:

  • A main forward byte-oriented one, which delivers the bytes in the input in order.
  • A forward bit-oriented sub-parser to decode the coefficients of a FSE table.
  • A backward bit-oriented sub-parser to decode FSE-encoded or Huffman-encoded data.

The main parser should have the following properties:

  • It can preload data on-demand, for example if all input bytes have been consumed, or if the decoder needs to access bytes located later in the input.
  • It can lend a reference to the data to a sub-parser without copying it.
  • It may discard data that has been either consumed directly or through a sub-parser.

๐Ÿ’ก At the beginning of the project, having the data to decode in memory as a slice of bytes might be sufficient to satisfy the requirements of the main parser.

Producing the output

Producing the output is simpler than parsing the input, as all transactions are byte-oriented and going forward. Our output producer, dedicated to a frame, should have the following properties:

  • It should emit data as soon as possible but not sooner: decoding a frame should only output its uncompressed data when previous frames have finished outputing theirs.
  • It should discard data as soon as possible, but not sooner: later blocks in the same frame may need to copy already-produced data.

Decoding a frame

Within a frame, later blocks might refer to decoding tables or data structures used in previous ones. For example, if two successive blocks use the same Huffman weights to encode the literals in both blocks, the second block will only contain a "use the previous Huffman decoder" marker.

You will need to maintain a shared state between successive blocks of the same frame. This state must contain:3

  • The three repeat offsets.
  • The latest decoders used to decode the constituents of a sequence (literals lengths, offsets, and match lenghts).
  • The latest Huffman decoder used to decode literals.
  • Some high-level information about the frame such as the window size.
  • The current checksum of the data being decoded in this frame.
1

Do not worry if you do not yet understand the Rust-specific parts of the project specification at first (e.g., crates, Result, thiserror, etc.), you will learn about all of them later during the course.

2

If your program depends on the color-eyre crate, you may omit the eyre dependency if you wish, as color-eyre depends on eyre and reexports it. You can use color_eyre::eyre; in your code then use eyre::Result.

3

Do not worry if some concepts seem obscure at the beginning of the project, you will discover and understand them later while reading the zstd standard.

Parsing input data

It is now time to try and extract frames. In order to later be able to handle frames in parallel as much as possible, it is necessary to parse the input in two steps:

  1. Do the minimum work needed to extract a frame from the input (parsing phase).
  2. Later, as frames have been separated, do the rest of the work to decode the frame compressed content (decoding phase).

As zstd compressed input does not contain a table of contents indicating where each frame starts, parsing a frame implies having already parsed all frames coming beforehand.

Forward byte parser

The first parser you need is a forward byte parser, which can deliver bytes from the input in order. The parser must also remember not to deliver bytes it has delivered already. To do this, a slice of bytes is enough as it references a piece of memory.

โœ… Create a parsing module (in file parsing.rs) and create a ForwardByteParser type in it.

โš ๏ธ Do not forget to reference this module in lib.rs as a module with public visibility.

Since a slice does not own its content, it must be accompanied by a lifetime parameter, for example:

pub struct ForwardByteParser<'a>(&'a [u8]);

Initializing a forward byte parser from an existing slice is straightforward:

impl<'a> ForwardByteParser<'a> {
    pub fn new(data: &'a [u8]) -> Self {
        Self(data)
    }
}

Consuming a byte from the input

Consuming a byte from the input implies returning the first byte, if it exists (the input may be empty) and storing a slice with the first byte removed:

impl<'a> ForwardByteParser<'a> {
    pub fn u8(&mut self) -> Option<u8> {
        let (first, rest) = self.0.split_first()?;
        self.0 = rest;
        Some(*first)
    }
}

While returning an Option<u8> seems handy, it will not be very useful: when you need a byte from the input, not obtaining it should be an error that can be propagated further.

โœ… Create an Error type and a Result alias in the parsing module as described in the general principles. Modify your u8() method so that it returns an error if the input is empty.

This error alternative (for example NotEnoughBytes { requested: usize, available: usize }) must be generic and will be reused in other methods. When displayed, the error should print: not enough bytes: 1 requested out of 0 available.

Writing a unit test

Does the parser work as expected? A unit test should be written along with the code.

โœ… Create a tests/parsing.rs file in your repository and include the following tests.

use net7212::parsing::{self, ForwardByteParser};

#[test]
fn forward_byte_parser_u8() {
    // Check that bytes are delivered in order
    let mut parser = ForwardByteParser::new(&[0x12, 0x23, 0x34]);
    assert_eq!(0x12, parser.u8().unwrap());
    assert_eq!(0x23, parser.u8().unwrap());
    assert_eq!(0x34, parser.u8().unwrap());
    assert!(matches!(
        parser.u8(),
        Err(parsing::Error::NotEnoughBytes {
            requested: 1,
            available: 0,
        })
    ));
}

Running the tests with cargo test should yield a success.

๐Ÿ’ก Notice that we use matches!() instead of assert_eq!() to compare the error, as the parsing::Error type does not implement PartialEq (and does not need to).

Adding more methods

Parsing a frame will require reading a 4-byte unsigned integer (u32) in little-endian format, or extracting an arbitrary number of bytes. You can add utility functions now to your parser, such as (more will be needed later):

impl<'a> ForwardByteParser<'a> {
    /// Return the number of bytes still unparsed
    pub fn len(&self) -> usize { todo!() }

    /// Check if the input is exhausted
    pub fn is_empty(&self) -> bool { todo!() }

    /// Extract `len` bytes as a slice
    pub fn slice(&mut self, len: usize) -> Result<&'a [u8]> { todo!() }

    /// Consume and return a u32 in little-endian format
    pub fn le_u32(&mut self) -> Result<u32> { todo!() }
}

Tests must also be added for those methods to ensure they act as expected. Those tests will also act as non-regression tests, allowing us to later modify the body of those methods without fear.

Parsing some frames

Now that you know how to read bytes, integers, and slices from the input, you can think about parsing frames.

There exist two kinds of frames in the Zstandard RFC:

  • Zstandard frames: they contain compressed data that must be decompressed.
  • Skippable frames: they contain extra sections not related to compressed data and serve no purpose while decompressing. However they must be parsed in order to be skipped and removed from the input.

Adding a frame type

โœ… Create a frame module, and a Frame enumerated type in it with two variants, ZstandardFrame and SkippableFrame.

The content of those variants will be filled later on.

Parsing a frame

Each frame starts with a magic number, which can be, in little-endian format:

  • 0xFD2FB528 for a Zstandard frame
  • 0x184D2A5? for a Skippable frame, where ? stands for any hexadecimal digit from 0 to F.

Any other value is an error.

Skippable frames

A skippable frame is easy to extract from the input: right after the magic number, a 32 bit little-endian value gives the length of the frame data, then the data follows.

โœ… Write a new struct type SkippableFrame to represent a skippable frame with two fields: magic which is the found magic number, and data which represents the unparsed data corresponding to the skippable frame as a byte slice. Make the Frame::SkippableFrame variant take a SkippableFrame parameter.

As a SkippableFrame will reference data using a byte slice, it will necessarily involve a lifetime generic parameter. Since the Frame enumeration has a SkippableFrame in its Frame::SkippableFrame variant, it will itself need this lifetime generic parameter.

โœ… Write a Frame::parse() constructor that returns a frame: a valid Frame::SkippableFrame within Ok if a skippable frame is found, todo!() (for the time being) if a Zstandard frame is found, or an error Error::UnrecognizedMagic(magic) otherwise.

This requires creating a local Error type, and a Result type alias. Notice that the Error type will need a variant to encapsulate a parsing::Error, as the parsing method will use ForwardByteParser methods to extract data.

use crate::parsing::ForwardByteParser;

#[derive(Debug, thiserror::Error)]
pub enum Error {
    // โ€ฆ
}

type Result<T, E = Error> = std::result::Result<T, Error>;

impl<'a> Frame<'a> {
    pub fn parse(input: &mut ForwardByteParser<'a>) -> Result<Self> {
        todo!()
    }
}

โœ… Write tests to check that skippable frames are decoded correctly.

Here is an example for starting tests/frame.rs. You must complete this with tests ensuring that a proper error is returned if the input is truncated, similar to the one checking the error if the magic is not recognized.

use net7212::frame::{self, Frame};
use net7212::parsing::ForwardByteParser;

#[test]
fn parse_skippable_frame() {
    let mut parser = ForwardByteParser::new(&[
        // Skippable frame with magic 0x184d2a53, length 3, content 0x10 0x20 0x30
        // and an extra byte at the end.
        0x53, 0x2a, 0x4d, 0x18, 0x03, 0x00, 0x00, 0x00, 0x10, 0x20, 0x30, 0x40,
		//^--- magic (LE) ----^ ^------ 3 (LE) -------^ ^--- content ---^ ^-- extra
    ]);
    let Frame::SkippableFrame(skippable) = Frame::parse(&mut parser).unwrap() else {
        panic!("unexpected frame type")
    };
    assert_eq!(0x184d2a53, skippable.magic);
    assert_eq!(&[0x10, 0x20, 0x30], skippable.data);
    assert_eq!(1, parser.len());
}

#[test]
fn error_on_unknown_frame() {
    let mut parser = ForwardByteParser::new(&[0x10, 0x20, 0x30, 0x40]);
    assert!(matches!(
        Frame::parse(&mut parser),
        Err(frame::Error::UnrecognizedMagic(0x40302010))
    ));
}

The main program

You are now able to parse skippable frames. Let's write a program which displays, for each skippable frame, its magic number and content. Also, the program will check that the file is a well-formed zstd file.

Command-line arguments

We want our program to act like this:

$ cargo run -- --help
Usage: net7212 [OPTIONS] <FILE_NAME>

Arguments:
  <FILE_NAME>  File name to decompress

Options:
  -i, --info     Dump information about frames instead of outputing the result
  -h, --help     Print help
  -V, --version  Print version
  
$ cargo run -- --info skippables.zst
SkippableFrame(
    SkippableFrame {
        magic: 0x184d2a53,
        data: [
            0x10,
            0x20,
            0x30,
        ],
    },
)
SkippableFrame(
    SkippableFrame {
        magic: 0x184d2a51,
        data: [
            0x42,
        ],
    },
)

(you can download skippables.zst)

Create the main program

โœ… Create the src/main.rs file containing a main function.

โœ… Use clap to implement command-line arguments handling.

You can read a file using std::fs::read(). Do not print anything yet, we will do it step by step.

Decoding a frame

As previously explained, a decoding phase must happen after the parsing of a frame when --info is not used as an argument to the main program.

โœ… Implement a decode(self) method on a Frame returning the decoded data in a Vec<u8>.

Note that self means that the Frame object no longer exist after decode() has been called. This is expected, as some decoding operations will consume frame data.

So, at this time:

  • You can decode a SkippableFrame as it decodes to nothing (it is skipped after all).
  • You may use todo!() while decoding a ZstandardFrame as not even the parsing is implemented.

Iterating over the frames

Since all frames must be parsed in sequence, one can use an iterator to transform a forward byte parser into a succession of Frame objects.

โœ… Add a FrameIterator in the frame module, which encapsulates a ForwardByteParser.

โœ… Implement the Iterator trait for FrameIterator, the Item being Result<Frame>.

The iterator's next() method will:

  • Return None if the parser has no content left to indicate the end of the iterator.
  • Return Some(Frame::parse(&mut self.parser)) otherwise.

Note that this will automatically signal an error if extra characters are present at the end of an otherwise valid file, as the next iteration will attempt to parse another frame which will be invalid.

Writing the main program body

โœ… Complete the main program so that it can either print information or decode frames from a zstd files in sequence.

Of course you must stop if the frame iterator returns an error.

Do not forget that:

  • You can derive the Debug trait automatically on many types.
  • You can display something implementing the Debug trait with the {:?} construct.
  • You can use an alternate representation with the {:#?} construct (for example indented structures).
  • You can use an alternate representation and print numbers in hexadecimal with {:#x?}.

Also, when --info is not used, the decoded data (even though there is nothing to display yet) must be sent to the standard output. You can use the write_all() method to write on std::io::stdout() after locking it.

Note that if you use the color_eyre crate, you can make your main() function return a color_eyre::eyre::Result<()> and let color_eyre display a pretty error if there is one. You can also use the anyhow crate as an alternative.

Adding tracing information

Since your program will use the standard output when decoding a file, you must not use it yourself to print debugging information. You should use the log crate to add traces if you need them, or the tracing/tracing_subscriber couple.

โœ… Check that you can decode a file made only of skippable frames, as well as an empty file (which is a valid zstd file with zero frames). Check that you signal an error if you try to decode an invalid file.

๐Ÿน Where do we stand?

At this point, if you have done everything right:

  • You have written a main program which can decode a file containing only skippable frames and output the decoded version (i.e., nothing) on the standard output.
  • Your main program can also print information about the frames from a file it knows how to decode using --info.
  • All your functions and methods have unit tests so that you can fearlessly refactor your code and extend it later.

If not, you have probably missed some things in the previous steps, please go back and fix them.

Handling Zstandard frames

Zstandard frames consist of three parts arranged in the following sequence:

  1. A header that provides information about the frame parameters.
  2. One or more blocks, referred to as blocks.
  3. An optional checksum.

To begin decoding and testing the code, it is necessary to parse and decode a few basic blocks. Let's begin with this.

Handling simple blocks

Blocks are the basic components forming a frame. There exist three types of blocks:

  • A raw block contains some data, without any compression, that must be copied while decoding.
  • A RLE block is a block which contains a byte b and a length l. Decoding it duplicates the b byte l times.
  • A compressed block contains much more complex information and is the basis of Zstandard compression capabilities.

For a start, you will implement the decoding of the first two block types.

Structure

โœ… In a block module, add an enumerated Block type with two variants (for now): RawBlock which encapsulates a slice, and RLEBlock which contains a byte and a repeat field.

โœ… Implement Block::parse() which accepts a &mut ForwardByteParser and returns a result containing a (Block, bool) pair. The bool is true if the block is the last block of the frame.

Do not forget to add a dedicated Error type in the block module, as well as a Result type alias. If the block type is invalid (3 is reserved), a Error::ReservedBlockType should be returned. Also, you can call unimplemented!() if you encounter a block of type 2 (compressed block) as we do not implement it yet.

โœ… Implement the Block::decode(self) method which decodes a block (and destroys the Block object by taking self).

Right now, decode() does not need any other parameter. This will no longer be the case when we implement compressed blocks, as decoding a compressed block requires accessing a decoding context containing information about earlier operations. Also, decode() needs to return a Result: even if decoding cannot fail right now, it will later when more complex blocks are implemented.

โœ… Add tests.

Here are two simple tests, you must add others, including for checking that errors are correctly detected:

use net7212::block::Block;
use net7212::parsing::ForwardByteParser;

#[test]
fn decode_raw_block_last() {
    let mut parser = ForwardByteParser::new(&[
        // Raw block, last block, len 4, content 0x10, 0x20, 0x30, 0x40,
        // and an extra 0x50 at the end.
        0x21, 0x0, 0x0, 0x10, 0x20, 0x30, 0x40, 0x50
    ]);
    let (block, last) = Block::parse(&mut parser).unwrap();
    assert!(last);
    assert!(matches!(block, Block::RawBlock(&[0x10, 0x20, 0x30, 0x40])));
    assert_eq!(1, parser.len());
    let decoded = block.decode().unwrap();
    assert_eq!(vec![0x10, 0x20, 0x30, 0x40], decoded);
}

#[test]
fn decode_rle_block_not_last() {
    let mut parser = ForwardByteParser::new(&[
        // RLE block, not last, byte 0x42 and repeat 0x30004,
        // and an extra 0x50 at the end.
        0x22, 0x0, 0x18, 0x42, 0x50
    ]);
    let (block, last) = Block::parse(&mut parser).unwrap();
    assert!(!last);
    assert!(matches!(
        block,
        Block::RLEBlock {
            byte: 0x42,
            repeat: 196612
        }
    ));
    assert_eq!(1, parser.len());
    let decoded = block.decode().unwrap();
    assert_eq!(196612, decoded.len());
    assert!(decoded.into_iter().all(|b| b == 0x42));
}

Handling the frame header

The frame header is made of a frame header descriptor and some other fields, which may or may not be present. The frame header descriptor is a set of flags which indicates which optional fields are present or not.

โœ… Implement a frame::FrameHeader structure type containing the fields of a frame header.

โœ… Implement FrameHeader::parse() in order to parse the frame header, and write tests to check the parsing behaviour.

Handling the frame itself

You are reaching the end of the frame parsing! It is time to put the various bits together.

Building the type

โœ… Make the frame::ZstandardFrame structure contain three fields: the frame header, a vector of blocks, and an optional checksum.

Parsing

โœ… Implement ZstandardFrame::parse() and call it from Frame::parse() in the presence of a Zstandard frame.

This method will need to:

  • parse the header;
  • parse the blocks (there is at least one), and continue doing so as long as the parsed block is not the last one;
  • parse the checksum if it is present according to the frame descriptor header.

Decoding

โœ… Implement ZstandardFrame::decode() and call it from Frame::decode() in the presence of a Zstandard frame.

Decoding a frame means decoding each block in turn and concatenating the result.

Running the main program

Our main program should not need any change to print information or decode a zstd file containing only the supported block types.

โœ… Check that you can decode the welcome.zst file which is made of two frames, the second one containing four blocks.

๐Ÿ’ก You can examine the content of the compressed file by using hexdump or a similar program:

$ hexdump -C welcome.zst
00000000  57 2a 4d 18 30 00 00 00  3e 3e 3e 20 59 4f 55 20  |W*M.0...>>> YOU |
00000010  57 49 4c 4c 20 4e 4f 54  20 53 45 45 20 54 48 49  |WILL NOT SEE THI|
00000020  53 20 50 41 52 54 20 49  4e 20 54 48 45 20 4f 55  |S PART IN THE OU|
00000030  54 50 55 54 20 3c 3c 3c  28 b5 2f fd 24 7e 4a 01  |TPUT <<<(./.$~J.|
00000040  00 23 58 01 00 0a 23 20  57 65 6c 63 6f 6d 65 20  |.#X...# Welcome |
00000050  74 6f 20 54 65 6c 65 63  6f 6d 20 50 61 72 69 73  |to Telecom Paris|
00000060  20 7a 73 74 64 20 65 78  61 6d 70 6c 65 20 23 0a  | zstd example #.|
00000070  4a 01 00 23 09 00 00 0a  9e 2e 5d 9f              |J..#......].|
0000007c

๐Ÿน Where do we stand?

At this point, if you have done everything right:

  • You have written a main program which can decode a file containing skippable frames, or Zstandard frames with raw or RLE blocks.
  • Your program can output the decoded version on the standard output.
  • Your main program can also print information about the frames from a file it knows how to decode using --info.
  • All your functions and methods have unit tests so that you can fearlessly refactor your code and extend it later.

If not, you have probably missed some things in the previous steps, please go back and fix them.

The decompressor: part 2

Your decompressor is already able to handle a subset of the Zstandard file format. However, it does not yet take advantage of the most efficient compression techniques used by the standard.

Zstandard utilizes two efficient compressors: Huffman codes and FSE.

  • Huffman codes are used to encode byte-oriented sequences with \(2^8\) possible output symbols. These codes are employed in the literals section, which consists of portions of data to be copied to the output stream during decompression.
  • FSE codes are used to encode data with a smaller number of output symbols. They are used for instructions that guide the overall decoding process, known as sequences, as well as coefficients used to initialize the Huffman decoder.

๐Ÿ’ก Ensure you understand fully the Huffman and FSE examples, and that you obtain the same result while decoding the bit strings by hand.

Huffman code

The Huffman code is a prefix code. Bits are read one-by-one. Either the current sequence of bits designates a symbol to add to the output (for example a byte), or it is incomplete and more bits need to be read. Most frequent symbols get assigned shorter bit sequences, while less frequent symbols get assigned longer bit sequences. Once a symbol has been decoded, the decoder restarts from scratch and handles the next input bits.

Of course, no bit sequence for a symbol may correspond to the prefix of another bit sequence.

Practical Huffman code example

For example, if we only had 3 symbols A, B and C in our ouput, and B is more frequent than A and C, Huffman coding could use a dictionary such as:

Input bitsOutput symbol
00A
01C
1B

The input bit sequence 10010111 would get parsed as 1_00_1_01_1_1 and decoded as BABCBB. 8 input bits decode as 6 output symbols, taking approximately 1.33 bit by symbol.

Note that while B is represented as a single 1, all other bit sequences start with 0. No ambiguity can arise while adding bits as needed. The table can be represented as a prefix tree:

Huffman tree representation

In Zstandard, longer prefixes start with zeroes and shorter prefixes with ones. Also, for the same prefix width, symbols are ordered, which is why A uses 00 because it comes before C at 01.

FSE code

While the Huffman decoders starts from scratch once a symbol has been emitted, a FSE (finite-state entropy) decoder keeps an internal state. Upon entering a state, the decoder emits an output symbol, then reads a number of input bits to switch to another state, and so on. States are consecutively numbered starting from 0. Each state contains three pieces of information:

  • the output symbol to emit
  • the number of bits to read to determine the next state
  • the base line (a number) to add to the number read from the new bits to get the new state number

A FSE code has an accuracy log (AL) which represents the base 2 logarithm of the number of states. For example, a FSE decoder with 32 states will have an AL of 5, because \(2^5\) is 32. The AL is also the number of bits to read to determine the initial state.

Practical FSE example

Let us assume that we have a FSE table with 4 states. The accuracy log (AL) is 2.

StateOutput symbolBase line (BL)Number of bits to read
0A10
1D21
2B01
3A21

Let us decode the 1110000101 bit sequence (11_1_0_0_0_0_1_0_1 for clarity of this example). AL bits are read from the string, giving 11. The initial state is 3 (0b11):

  • Entering state 3 outputs symbol A. 1 bit (1) is added to the BL 2: new state is (0b1 + 2)
  • Entering state 3 outputs symbol A. 1 bit (0) is added to the BL 2: new state is 2 (0b0 + 2).
  • Entering state 2 outputs symbol B. 1 bit (0) is added to the BL 0: new state is 0 (0b0 + 0).
  • Entering state 0 outputs symbol A. 0 bits are read, the new state is the BL 1: new state is 1.
  • Entering state 1 outputs symbol D. 1 bit (0) is added to the BL 2: new state is 2 (0b0 + 2).
  • Entering state 2 outputs symbol B. 1 bit (0) is added to the BL 0: new state is 0 (0b0 + 0).
  • Entering state 0 outputs symbol A. 0 bits are read, the new state is the BL: new state is 1.
  • Entering state 1 outputs symbol D. 1 bit (1) is added to the BL 2: new state is 3 (0b1 + 2).
  • Entering state 3 outputs symbol A. 1 bit (0) is added to the BL 2: new state is 2 (0b0 + 2).
  • Entering state 2 outputs symbol B. 1 bit (1) is added to the BL 0: new state is 1 (0b1 + 0).
  • Entering state 1 outputs symbol D. No 1 bit is available, stopping

The decoded symbol sequence is AABADBADABD. One may note that:

  • One state may encode more than one output symbols, since the number of bits to read to get to the following state may be 0.
  • Contrary to a Huffman code, some symbols need less than one bit in the compressed representation. Indeed, here we have a 10 bits input string decoding into a 11 symbols output, that is approximately 0.9 bit per symbol.
  • Several states may encode the same symbol (states 0 and 3 both emit A).

Huffman decoder

You will now implement a Huffman decoder able to decode a bit stream into a set of bytes. Your implementation will be implemented as a tree as this is a direct mapping of the Huffman decoder concepts explained previously.

Huffman tree structure

Your Huffman tree structure will reside in the decoders/huffman module.

โœ… Create a decoders module.

โœ… Create a huffman module in the decoders module.

โœ… Create a HuffmanDecoder enumerated type in the decoders/huffman module.

The HuffmanDecoder represents a node in the tree. It is either Absent, a Symbol (with a u8 payload) or a subtree represented by a Tree variant and two HuffmanDecoder values representing the left (0) and right (1) alternatives.

Note that you cannot put an object of type HuffmanDecoder inside a HuffmanDecoder itself, as the container object would have to have a greater size than the contained one, which is not possible as they have the same type. Also, you cannot use references either to represent the left and right alternatives: references do not represent ownership. Who would be the owner of the left and right alternatives?

You will have to use a type which acts like a reference (in that you don't have to store the whole object) and owns a subtree.

Inserting a symbol in a tree

โœ… Write a HuffmanDecoder::insert() method which inserts a symbol in the tree at the leftmost possible free position for a given width, and returns true if it has succeded.

impl HuffmanDecoder {
    pub fn insert(&mut self, symbol: u8, width: u8) -> bool {
        todo!()
    }
}

Its behaviour will be:

  • If width is 0, the symbol must be inserted right at the top if the top is currently Absent. Otherwise this is a failure, and false must be returned.
  • Otherwise, the symbol must be inserted with a decreased width on the left branch, and if it fails on the right branch. Branches must be created (as absent) if the node doesn't contain them yet. If the top of the tree is a Symbol, just panic!() as this can should never happen. It would mean that it is impossible to insert the node without resolving to a symbol before the prefix length is reached.

For example, the tree shown in the Huffman code example can be built by running the following function:

fn build_example_tree() -> HuffmanDecoder {
    let mut tree = HuffmanDecoder::Absent;
    assert!(tree.insert(b'A', 2));
    assert!(tree.insert(b'C', 2));
    assert!(tree.insert(b'B', 1));
    tree
}

Printing a tree (optional)

You may want to implement Debug on your HuffmanDecoder type to list its content. While you may derive Debug automatically, you may want a better representation, showing the prefixes.

To do so, you can build an iterator which iterates over the nodes of your tree and their respective prefix. This iterator will keep a list of nodes to visit and their prefix internally, and yield pairs of (String, u8) which represent a prefix and the associated symbol.

โœ… Build a HuffmanDecoderIterator<'a> structure which contains a list of nodes yet to visit, for example as a Vec<(&'a HuffmanDecoder, String)>.

โœ… Implement Iterator for this structure: take the first node to visit by popping it from the vector. If it is a symbol, return it along with its prefix, if it is a subtree, note that you have to visit both branches (choose the order right) and try again with the node in your list, if there is nothing just go to the next node in your list. โœ… Implement Debug for HuffmanDecoder, taking advantage of the std::fmt::Formatter::debug_struct() method to structure your output.

If you do so, executing the following code:

fn main() {
   println!("{:?}", build_example_tree());
}

should print something like:

HuffmanDecoder { 00: 65, 01: 67, 1: 66 }

(because 65, 67, 66 are the ASCII codes for A, C and B respectively)

Building a tree from prefix widths

In a Zstandard file, a Huffman table will be described by succession of prefix widths (number of bits) for each symbol. After some decoding process that will be shown later, you will end up with a list of up to \(2^8\) integers, each representing the number of bits needed to represent each symbol. Missing ones mean, or prefix width set to 0, mean that this particular symbol will not be present in the output stream and thus will not appear in the Huffman table.

โœ… Build a HuffmanDecoder::from_number_of_bits() constructor which takes a vector containing the number of bits for each symbol and returns a newly built HuffmanDecoder.

What you must do is:

  • Build a list of every symbol and its width.
  • Filter out zero width entries as those symbols will never appear in the output. They are present only because some later symbols in the list have a non-zero width.
  • In a newly created tree, insert every symbol at the right width, starting from the largest width and smallest symbol values within the same width. Make sure that you read the note on sort stability in Rust documentation.

Once you have done so, you should check that the following code builds the same tree as before:

fn main() {
    // 0 repeated 65 times, 2, 1, 2
    let widths: Vec<_> = std::iter::repeat(0).take(65).chain([2, 1, 2]).collect();
    println!("{:?}", HuffmanDecoder::from_number_of_bits(widths));
}

Representation through weights

Inside a Zstandard file, prefix widths are represented as weights:

  • A weight of 0 represents a symbol absent from the output.
  • If the longest prefix width is \(L\), a weight of \(w\) represents a prefix width of \(L+1-w\).

Moreover, three small additional optimizations are done to the list of weights present in a Zstandard file:

  • All 0 weights at the end of the list are absent. For example, a list of [2, 1, 0, 1, 0, 0โ€ฆ] will not contain the zeroes after the second 1.
  • The latest non-0 weight in the list is also absent. It can be computed by looking at the existing weights. For example, a list of [2, 1, 0, 1] will be represented as [2, 1, 0].
  • The longest prefix width is not indicated in the file. It must be deduced from the list of weights.

โœ… Build a HuffmanDecoder::from_weights constructor which takes a vector of weights, represented as u8, and returns a result with the newly built HuffmanDecoder or an error. You must create a new local Error enum as you have done in other modules, as well as the Result type alias.

To compute the missing weight, which must be added to the list of weights received by the method, you can use the following properties:

  • If you sum \( 2^{w-1} \) for all weights \(w\) different from 0, you will get a power of 2.
  • No \( 2^{w-1} \) expression can be strictly greater than the sum of all others.

This should let you determine the missing last weight by ensuring that when added, the sum of \( 2^{w-1} \) rounds to the smallest strictly greater power of 2.

Once you have determined the missing weight, you can also deduce the number of bits needed to represent the weights. This can be easily done by taking the base 2 logarithm of the power of 2 you have reached.

You can now compute the prefix widths for all weights, and call your previously written HuffmanDecoder::from_number_of_bits() to build the Huffman decoder.

The following test code should build the same decoder as before:

fn main() {
    // 0 repeated 65 times, 1, 2
    let weights: Vec<_> = std::iter::repeat(0).take(65).chain([1, 2]).collect();
    println!("{:?}", HuffmanDecoder::from_weights(weights));
}

๐Ÿ’ก You may have noticed that in our example above the vector of weights above contains a lot of 0 at the beginning. In fact, in a real weights array, many weights will be similar to each other, as they will all take a value between 0 and 8, inclusive. Does this ring a bell? Indeed, as we will see later, the weights are often themselves encoded using a FSE encoder which thrives in the presence of a small number of symbols.

In the next section, we will learn how to feed bits to our Huffman decoder and make it decode them.

Backward bit parser

The Huffman decoder you have just built needs to be fed some bits. In the same way that you have built a forward byte parser, you will now have to build a backward bit parser.

Why backward? Because of the way Zstandard encoders work, starting from the end of the stream to compress and writing the various bits to the compressed stream.

However, writing a stream of bits does not necessarily mean that the encoder will fill an integral number of bytes. To signal the end of the encoded bitstream, the encoder writes an extra 1 then completes the byte with 0 bits if needed.

Parsing a bitstream backwards

What does it mean to parse a bitstream backwards?

  • You must start from the last byte and progress in reverse order.
  • In every byte, you start reading bits from the most significant bit to the least significant bit.
  • When you start producing the bitstream, you must skip all initial zeroes and the first one you encounter.

For example, let us start from the [0x05, 0x74] byte sequence found in a Zstandard file. In binary, this is written as [0b00000101, 0b01110100].

The bit string will be consumed starting from the byte marked ฮฑ then the byte marked ฮฒ, following the arrow direction inside each byte:

00000101 01110100
|---ฮฒ--> |---ฮฑ-->
         ^

The caret (^) represents the next bit to read. First, the initial zeroes must be zapped until the first one is found and discarded. This happens necessarily within the initial byte. After discarding the header, the caret as advanced two positions (one initial zero and the initial one):

00000101 --110100
|---ฮฒ--> |---ฮฑ-->
           ^

Now, let us read a 3 bits value. We get 110 and the caret advances 3 positions.

00000101 -----100
|---ฮฒ--> |---ฮฑ-->
              ^

Reading 7 bits will give us 0b1000000 (3 bits from the current byte, 4 bits from the byte just before).

----0101 --------
|---ฮฒ--> |---ฮฑ-->
    ^

Reading 4 bits yields 0b0101: we have reached the end of the stream.

-------- --------
|---ฮฒ--> |---ฮฑ-->

As another example, here is how the two bitstreams from the examples would be encoded in the Zstandard file:

  • Huffman code: 10010111 would be [0b10010111, 0b00000001], or [0x97, 0x01]. Seven 0 from the last byte will be skipped before the initial 1 is found and skipped.
  • FSE code: 1110000101 would be [10000101, 00000111], or [0x85, 0x07]. Five 0 and the initial 1 will be skipped.

โœ… Implement a BackwardBitParser type in the parsing module.

The following methods would be useful for such a parser:

impl<'a> BackwardBitParser<'a> {
    /// Create a new backward bit parser. The header is skipped automatically or
    /// an error is returned if the initial 1 cannot be found in the first 8 bits.
    pub fn new(data: &'a [u8]) -> Result<Self> { todo!() }

    /// True if there are no bits left to read
    pub fn is_empty(&self) -> bool { todo!() }

    /// Get the given number of bits, or return an error.
    pub fn take(&mut self, len: usize) -> Result<u64> { todo!() }
}

Using the backward bit parser

โœ… Add a method HuffmanDecoder::decode() which takes a mutable reference to a BackwardBitParser and returns the decoded symbol or an error.

You will probably have to add a new variant to HuffmanDecoder::Error in order to propagate a parsing error signaled by the BackwardBitParser.

Testing our parser and decoder together

Why not test our backward bit parser and our Huffman decoder together using the example given here?

#[test]
fn huffman_project_example() {
    // 0 repeated 65 times, 1, 2
    let weights: Vec<_> = std::iter::repeat(0).take(65).chain([1, 2]).collect();
    let decoder = HuffmanDecoder::from_weights(weights).unwrap();
    let mut parser = BackwardBitParser::new(&[0x97, 0x01]).unwrap();
    let mut result = String::new();
    while !parser.is_empty() {
        let decoded = decoder.decode(&mut parser).unwrap();
        result.push(decoded as char);  // We know they are valid A, B, or C char
    }
    assert_eq!(result, "BABCBB");
}

Forward bit parser

FSE decoder distribution table

The FSE decoder is used at two places in the Zstandard decoding process:

  • to decode the Huffman decoder tables: two similar FSE decoders produce output symbols alternatively using the same input bitstream;
  • to decode the sequences of instructions used to rebuild the output in a compressed block: three FSE decoders will be used simultaneously.

The principles of the FSE decoding process have been explained here.

The FSE decoder states are built from1:

  • An accuracy log \(AL\)
  • A distribution table \(\mathcal{D}\) which describes how symbols are distributed amongst the states. Every entry in the table is a number in the \([-1, R]\) range.

Since all the states will be filled, and since \(-1\) is just a special value which means that a symbol occupies only one state, we have \[R = \sum_{d \in \mathcal{D}}|d| = 2^{AL}\].

The various arguments will be read from a bitstream coming from the file being decoded:

  • \(AL\) is read first and computed from 4 bits, which also gives us the states table size \(R\)
  • The distribution table is read, one symbol after the other. Since the sum of absolute values in the distribution table cannot exceed \(R\), you always know the maximum value that can be read. This lets you determine the number of bits you need to read from the stream in order to decode this value.
  • You will know when you have a full table, as the sum of absolute values will be equal to \(R\). It means that you know when to stop reading the bitstream.

For reasons which will become apparent later, the data pertaining to the FSE table description are parsed with a forward bit parser. This is a parser similar to the backward bit parser you have written already, but bits are consumed forward.

Parsing a bitstream forwards

What does it mean to parse a bitstream forwards?

  • You must start from the first byte and progress in regular order.
  • In every byte, you start reading bits from the least significant bit to the most significant bit.
  • If you read several bits at a time to get an integer value, the first bit you get is the least significant one.

For example, let us start from the [0x05, 0x74] byte sequence found in a Zstandard file. In binary, this is written as [0b00000101, 0b01110100].

The bit string will be consumed starting from the byte marked ฮฑ then the byte marked ฮฒ, following the arrow direction inside each byte:

00000101 01110100
<---ฮฑ--| <---ฮฒ--|
       ^

Now, let us read a 2 bits value. We get 0b01 (remember that the first bit read is the least significant one in the result) and the caret advances 2 positions:

000001-- 01110100
<---ฮฑ--| <---ฮฒ--|
     ^

Similarly, reading 9 bits yields 0b100000001 (or 0b100_000001 to make it clearer which part comes from each input byte), and the caret advances 9 positions:

-------- 01110---
<---ฮฑ--| <---ฮฒ--|
             ^

Reading four more bits yields 0b1110:

-------- 0-------
<---ฮฑ--| <---ฮฒ--|
         ^

If we don't want to read any further, the remaining bits are discarded. Note that contrary to the backwards bit parser case, we did not have to skip an initial header in order to start decoding data.

โœ… Implement a ForwardBitParser type in the parsing module.

The following methods would be useful for such a parser:

impl<'a> ForwardBitParser<'a> {
    /// Create a new backward bit parser. The header is skipped automatically or
    /// an error is returned if the initial 1 cannot be found in the first 8 bits.
    pub fn new(data: &'a [u8]) -> Result<Self> { todo!() }

    /// True if there are no bits left to read
    pub fn is_empty(&self) -> bool { todo!() }

    /// Get the given number of bits, or return an error.
    pub fn take(&mut self, len: usize) -> Result<u64> { todo!() }
}

Using the forward bit parser

You will now read a FSE decoder table

โœ… In a decoders::fse module, add a parse_fse_table() method taking a mutable reference to a ForwardBitParser and returning either a (accuracy_log, distribution) table or an error if decoding the table failed.

The signature would be:

pub fn parse_fse_table(parser: &mut ForwardBitParser) -> Result<(u8, Vec<i16>)> {
    todo!()
}

The process to decode a FSE table is described with many details in the Zstandard RFC:

  • Read a 4 bits value and add 5. This gives you the accuracy log \(AL\), as well as the table size \(R = 2^{AL}\). You may reject large accuracy log (greater than 9 for example) as you will not encounter them in the real life.
  • As long as the sum of the absolute values of the distribution table entries is strictly lower than \(R\), you must:
    • Compute the maximum value \(m = R-\sum_{d\in\mathcal{D}}|d|\) that the next entry may reach.
    • You need to read a value between 0 and \(m+1\) from the stream โ€“ you will later subtract 1 from this value to get a number between -1 and \m\). Compute the number of bits \(b\) you need to read such a value. โš ๏ธ How many values are between 0 and \(m+1\), inclusive?
    • You may need to read either \(b-1\) bits or \(b\) bits to get the value \(v\) from the bitstream. You can do this in several ways:
      • Add a mechanism to your forward bit parser to reinject a bit that you haven't needed (hard)
      • Add a peek() method which returns the next bit in the bitstream without consuming it (easy).
      • Determine from the \(b-1\) value that you have read if you need an extra bit or not (easiest).
    • \(p = v-1\) is the new table coefficient, add it to the table. If the total exceeds \(R\), return an error.
    • If \(p = 0\), you must read 2 extra bits in the bitstream to get the number of extra zeroes to insert in the table. If you get 3 (0b11), you must repeat the process for extra zeroes.

Does your implementation work?

Add some tests for your implementation, such as:

#[test]
fn parse_fse_table() {
    let mut parser = ForwardBitParser::new(&[0x30, 0x6f, 0x9b, 0x03]);
    let (accuracy_log, table) = parse_fse_table(&mut parser).unwrap();
    assert_eq!(5, accuracy_log);
    assert_eq!(&[18, 6, 2, 2, 2, 1, 1][..], &table);
    assert_eq!(parser.len(), 6);
}

This test shows that:

  • the table which is described has \(32 = 2^5\) states;
  • those 32 states can outputs symbols 0 to 6;
  • the approximate respective probabilities of the symbols in the output stream will be \(\frac{18}{32}\) for symbol 0, \(\frac{6}{32}\) for symbol 1, โ€ฆ (meaning that 18 states will output symbol 0, 6 states will output symbol 1, โ€ฆ);
  • 6 bits were unused in the provided bistream.
1

Note that the way adopted by the Zstandard RFC is not the only way to represent a FSE table. In particular, the way chosen here, through a distribution of probabilities, assumes that any symbol (present in the uncompressed stream) can follow any symbol, while more specialized tables could allow the full scope of FSE decoders as shown in the introductory FSE example.

FSE decoder

Now that the FSE distribution table is available, the FSE table itself must be built. As you saw in the example, those 7 coefficients will be used to build a 32 states table.

โœ… In the decoders::fse module, add a FseTable type containing the states (from a State type) of a FSE decoder.

The symbols decoded by the FSE decoder are not necessarily bytes, it is suggested that you use u16 as the symbol type to ensure you will be able to decode all sequences. Tables may contain up to \(2^{20}\) states, but you are allowed to set a limit. No common Zstandard file uses more than \(2^{14}\) states, so using a u16 for the baseline looks reasonable.

Building the FSE decoder tables

A FSE decoder table is built from a distribution of symbols, i.e. the number of states that will decode to each symbol. The states corresponding to a symbol will be roughly evenly spaced in the table, making it possible to branch to a new state to emit any other symbol afterwards after reading only a small number of bits.

For every symbol, the distribution list contains:

  • \(-1\) if the symbol is not frequent at all. In this case, it will be placed at the end of the states table, and will read all AL (accuracy log) bits to get the following state, making all further states reachable.
  • \(0\) if the symbol does not appear in the output. No state will be built for this symbol.
  • \(n > 0\): the symbol will be emitted by \(n\) states spread through the table. Enough bits will be read by every state so that by carefully choosing the selected state for the current symbol it is possible to reach any other state in the table.

โœ… Build a FseTable::from_distribution() method building the FSE states table given the accuracy log and the distribution list.

One possible signature is:

impl FseTable {
    pub fn from_distribution(accuracy_log: u8, distribution: &[i16]) -> Self {
        todo!()
    }
}

The steps to build the table are described below, and are also described in the RFC.

Symbols with "less than 1" probability

Some symbols have a "less than 1" probability, represented with the -1 distribution value. They will be emitted by one state only. Since you must be able to jump to any state after emitting such a symbol, its baseline will be 0 and the number of bits to read will be the accuracy log.

โœ… After building an empty table, fill it starting from the end with states emitting those symbols with a -1 distribution value (the lowest symbol with a -1 distribution value will be stored last in the table).

The rest of the table must now be filled with states emitting symbols with a positive distribution value, in order.

Spreading the entries for a given symbol with a positive distribution value

The states corresponding to a symbol will be spread throughout the remaining table entries. The idea behind spreading them, rather agglomerating them near each other, is that for a given symbol you will need less bits to reach any other state.

Aside: why are the entries for a symbol spread in a FSE table?

Let us assume that your output uses four symbols A, B, C, and D, and that A density is \(\frac 12\), B density is \(\frac 14\), and C and D both have density \(\frac 18\). The chosen accuracy log is 3, so there exist 8 states.

The table may then be built as-is:

StateOutput symbolBase line (BL)Number of bits to read
0A01
1B02
2A21
3C03
4A41
5B42
6A61
7D03

Let us assume we want to emit a symbol then jump to a given state, there always exists a state which has this property. An encoder would start from the end of the stream of symbols to encode, and go backward to find an appropriate state. For example, let us encode "AACBAABD" and build the stream (which will be parsed by a backward bit parser when decoding) in the process. Underscores (_) will be used to separate bytes once they have been filled entirely.

  • Only state 7 emits D. We want the previous state to jump there and stop.
  • State 5 emits B and jumps to 7 when reading 11. Stream: 11.
  • State 4 emits A and jumps to 5 when reading 1. Stream: 111.
  • State 4 emits A and jumps to 4 when reading 0. Stream: 0111.
  • State 5 emits B and jumps to 4 when reading 00. Stream: 000111.
  • State 3 emits C and jumps to 4 when reading 100. Stream: 00000111_1.
  • State 2 emits A and jumps to 3 when reading 1. Stream: 00000111_11.
  • State 2 emits A and jumps to 2 when reading 0. Stream: 00000111_011.
  • The initial value is taken by reading a number of bits equal to the accuracy log (3): 010 will designate state 2. Stream is: 00000111_010011.
  • A 1 is added to mark the beginning is the stream, then zeroes are added as necessary to complete the last byte (here only one is needed). Stream is 00000111_01010011, or [0b00000111, 0b01010011], or [0x07, 0x53].

The advantage of spreading appears clearly in this example: by spreading the most frequent symbol A accross the state, every time a A appears in the output only one bit will be needed to describe the following symbol.

The Zstandard RFC specifies how the symbols will be spread:

  • Start with filling state 0
  • If state \(s\) was the last state filled and \(R\) is the table size, compute \[s' = (s + \frac R2 + \frac R8 + 3) \land (R-1)\].
  • If state \(s'\) is already filled by a symbol with a "less than 1" probability, apply this operation again, otherwise you have your new state.

Given that \(R\) is a power of 2, \(x \land (R-1)\) is equivalent to x % R in Rust. Also, the accuracy log is defined as being at least 5, which means that \(R \geq 32\). This means that by applying repeatidly the operation to get from one state \(s\) to the next one \(s'\) is a generating function which will reach all states once before starting from 0 again.

๐Ÿ’ก Using functions such as std::iter::successors() and combinators such as .filter(), you can easily build an iterator whose .next() result is the next state to use.

โœ… Determine the list of states that will be used for the current symbol (there will be as many states as the distribution value). Sort the list in natural order.

If the table size is \(R\) and the distribution value for the current symbol is \(d\), starting with a baseline of \(\frac {kR}d\) for \(k \in [0, d[\) and reading \(\log_2{\frac Rd}\) bits to compute an offset would allow you to reach any other state. However, that works only when \(d\) is itself a power of 2.

If \(d\) is not a power of 2, some of the states will need to read one more bit in order to ensure the coverage of the whole state table starting from the current symbol.

โœ… Given the distribution value \(d\):

  • Compute \(p\) as being the smallest power of two greater than or equal to \(d\). Note that \(p\) may be equal to \(d\).
  • Compute the number of bits \(b = log_2\frac Rp\) you will need to read to be able to globally reach all states if had \(p\) states available for your symbol instead of \(d\).
  • Compute \(e = p - d\), the number of states to which you will need to add one extra bit provided that, in reality, you only have \(d\) states available for your symbol. \(e\) may be 0 if \(p=d\).

Once you have done that, you will be able to:

  • reach \(2^{b+1}\) states from each of \(e\) of your states
  • reach \(2^b\) states from each of \(d-e\) of your states

which allows you to reach a total of \(e \times 2^{b+1} + (d-e) \times 2^b = R\) states. The whole table is covered provided the baselines are chosen correctly.

The Zstandard mandates that:

  • the first \(e\) states for the current symbol will use a Number of bits to read of \(b+1\);
  • the remaining \(d-e\) states for the current symbol will use a Number of bits to read of \(b\);
  • the base line starts at 0 for the first state with \b\ bits;
  • after setting the base line of a state, it is incremented by 2 to the power of the number of bits corresponding to this state, then the next state is processed (wrapping around the list if needed) until we have completed all states for the current symbol.

A very detailed example is given in the RFC itself. You see in the table describing the 5 states allocated for a given symbol:

  • The first 3 of those states require 5 bits (4 + 1 extra) and the other 2 require 4 bits.
  • The baseline:
    • starts at 0 in the first state requiring 4 bits, then is incremented by \(2^4=16\)
    • is set to 16 in the next state, then incremented by \(2^4=16\), and \(16+16=32\)
    • is set to 32 in the next state (wrapped around), then incremented by \(2^5=32\), and \(32+32=64\)
    • is set to 64 in the next state, then incremented by \(2^5=32\), and \(64+32=96\)
    • is set to 96 in the next state, then incremented by \(2^5=32\), and \(96+32=128\)
  • All states have their baseline filled, and the baseline has reached 128 which is the table size anyway.

โœ… Follow a similar process and fill the states assigned to your symbol.

โœ… Do this for all symbols with a distribution value greater than 0.

You can return the table, the process is complete.

Building a FSE table from a bitstream

โœ… Add a FseTable::parse() constructor which takes a mutable reference to a ForwardBitParser and returns either the newly built table or an error.

โœ… Add tests.

You may, for example, check that the table built from [0x30, 0x6f, 0x9b, 0x03] decodes as the table shown in Nigel Tao's example.

The FSE decoder

โœ… In the fse module, add a FseDecoder struct which encapsulates everything a FSE decoder needs.

Your decoder will need to hold the following information:

  • the FSE table used to decode
  • the current baseline
  • the next symbol size
  • the symbol to emit when requested, as an Option<u16> (you may encounter more than 256 symbols while decoding sequences, but never more than 65535)

โœ… In the decoders module, add the BitDecoder trait with the following signature:

/// A (possibly) stateful bit-level decoder
pub trait BitDecoder<T = u8, E> {
    /// Initialize the state.
    ///
    /// # Panics
    ///
    /// This method may panic if the decoder is already initialized.
    fn initialize(&mut self, bitstream: &mut BackwardBitParser) -> Result<(), E>;

    /// Return the next expected input size in bits
    ///
    /// # Panics
    ///
    /// This method may panic if no bits are expected right now
    fn expected_bits(&self) -> usize;

    /// Retrieve a decoded symbol
    ///
    /// # Panics
    ///
    /// This method may panic if the state has not been updated
    /// since the last state retrieval.
    fn symbol(&mut self) -> T;

    /// Update the state from a bitstream by reading the right
    /// number of bits, silently completing with zeroes if needed.
    /// Return `true` if zeroes have been added.
    ///
    /// # Panics
    ///
    /// This method may panic if the symbol has not been retrieved since
    /// the last update.
    fn update_bits(&mut self, bitstream: &mut BackwardBitParser) -> Result<bool, E>;

    /// Reset the table at its state before `initialize` is called. It allows
    /// reusing the same decoder.
    fn reset(&mut self);
}

This trait will be implemented by several bit decoders, starting by FseDecoder. You might note that some methods may panic: this is expected, as this would represent a logic error (for example initializing a decoder twice without resetting it), and not something that can be due to the input of our program. While it is not advised to panic if the user feeds our program bad data, we are allowed to do so when our program has an internal bug.

The expected use of a type implementing this trait is:

  1. Call initialize().
  2. Extract the decoded symbol from the decoder using symbol() and add it to the output stream.
  3. If you have previously executed step 4 and it returned Ok(true), meaning that the input stream had to be completed with zeroes, stop there, you have reached the end of the decoding process.
  4. Update the decoder state with update_bits() (this will attempt to consume expected_bits()).
  5. Go back to 2.

โœ… Implement the BitDecoder trait for FseDecoder.

โœ… Write some tests, using some hand-crafted pathological cases, and Nigel Tao's examples.

๐Ÿน Where do we stand?

Why did we implement all this? In order to decode Zstandard compressed data, we will need to manipulate several entities that belong to the current block being decoded. Please refer to the Zstandard in a nutshell explanation if you do not remember all the terms used here.

We need to first decode the literals. Those literals, which serve as the initial source of data copied to the output, are often compressed using a Huffman table whose weights are encoded using FSE.

Once we have decoded the literals section, sequences will be executed, one after the other, indicating what to do, such as "extract the first M unused bytes from the literals into the decompressed file, then copy N bytes from the already decoded output starting P bytes before the current point". Those M, N and P values are encoded using FSE, and their decoding is followed by some post-processing to obtain the correct values.

We have all the building blocks needed to achieve the Zstandard decompression process.

The decoding context

Within a Zstandard frame, blocks may rely on information provided by preceding blocks. For example, a block may reuse the same Huffman table as the previous block to decode literals. In this case, instead of describing the table again using FSE and recreating the decoder, an indication will be made to reuse the same decoder.

You will create a decoding_context module to hold this state that can be shared between blocks, through a DecodingContext. For now, the only things you will store in this DecodingContext are:

  • The frame window size which defines how far a sequence will be able to look back to output again some previously outputted data.
  • The data that has been output already as a Vec<u8>, since sequences will be able to refer to this data.
  • The HuffmanDecoder used in previous blocks if any.

โœ… Create a DecodingContext structure inside the decoding_context module with three fields so far, as shown below:

pub struct DecodingContext {
    pub huffman: Option<HuffmanDecoder>,
    pub decoded: Vec<u8>,
    pub window_size: u64,
}

โœ… Write a DecodingContext::new() constructor which receives a window size and checks it. If it is too large, an error (to be defined in the current module as usual) will be returned. Otherwise, a new DecodingContext with no Huffman table and an empty decoded output will be used.

64MiB is enough and exceeds what is required by most encoders.

โœ… Modify Block::decode() so that it takes a DecodingContext as a mutable reference. Instead of returning the decoded data, Block::decode() will add it to the current context's decoded field, and return Ok(()) if everything went well.

โœ… Modify Frame::decode() so that it creates a new DecodingContext with the current frame size, and pass this context to Block::decode(). It will then return the content of the context's decoded field if everything went well.

Don't forget to modify the tests as needed. Your program should still work as usual and pass the tests, even though it does nothing more than before, except check the window size.

Literals

Compressed blocks are made of two parts:

  • the literals section
  • the sequences to act on those literals

The literals section

The literals section can be of three different types: a raw (as-is) representation, a RLE encoding (a byte repeated many times), or a compressed representation.

โœ… In a literals module, create a LiteralsSection enumeration with three variants: RawLiteralsBlock, RLELiteralsBlock and CompressedLiteralsBlock.

Each variant will be a struct-like one with some fields:

RawLiteralsBlock

A raw literals block encapsulate a reference to a slice of bytes. It means that the LiteralsSection type itself will have to take a lifetime as a generic parameter.

RLELiteralsBlock

A run-length-encoded literals block contains a byte and a number of repetitions of this byte.

CompressedLiteralsBlock

A compressed literals block contains more information:

  • huffman: an optional Huffman decoder. When the current block will be decoded, the decoder present in the context will be updated to this one and used. Otherwise, the decoder present in the context will be reused for the current block.
  • regenerated_size: the size of the regenerated data once decoded.
  • jump_table: the literals section is made of either one stream of compressed data, or four streams which are interleaved; in this case, the size of the 1st to the 3rd stream can be found in this field, while the one of the 4th stream can be deduced by subtracting thoses sizes from the total compressed size.
  • data: the compressed data as a reference to a slice of bytes.

Note that we do not create a separate TreelessLiteralsBlock variant to represent tree-less literals block as described in the standard: we use a CompressedLiteralsBlock with a huffman field set to None.

Parsing the literals section

โœ… Implement LiteralsSection::parse() with the following signature:

impl<'a> LiteralsSection<'a> {
    pub fn parse(input: &mut ForwardByteParser<'a>) -> Result<Self> {
        todo!
    }
}

Parsing the section involves decoding several steps:

  • The literals section header is one byte.
  • From this byte, several information can be extracted:
    • The block type
    • The format in which the regenerated size is stored
    • The regenerated size itself (most formats will require reading more bytes from the streaรน)

In case of a compressed literals block, some more info must be extracted:

  • The compressed data size. Each size (regenerated and compressed) can be encoded either on 10, 14, or 18 bits.
  • A Huffman table, unless the block is a tree-less one which reuses the Huffman decoder from the previous block. The Huffman coefficients are compressed using FSE.
  • The jump table, as explained above.

โš ๏ธ To be able to parse a compressed literals block, you need to parse Huffman tables coefficients from FSE, as is explained below. Use a todo!() or unimplemented!() while you implement Huffman table parsing.

Decoding the Huffman table

The description of a Huffman table in Zstandard compressed data is can take two forms. First, a byte \(N\) is read from the stream.

Direct encoded Huffman weights

If \(N \geq 128\), it means that \(N-127\) coefficients follow, each one of them using 4 bits. If the number of coefficients is odd, 4 bits are lost at the end. The rest of the stream is the Huffman-encoded data.

Compressed Huffman weights

If \(N \lt 128\), coefficients are described using an FSE encoding. \(N\) describes the number of bytes used to describe the FSE table and the Huffman table. They will be organized as follows in memory:

     FSE coefficients    Huffman table coefficients
     --------------->    <-------------------------
         F bytes                 N-F bytes
     <-------------------------------------------->
                    N bytes in total

The FSE table will be decoded and will consume \(F\) bytes in the process (the FSE table knows when it has reached the end of its coefficients). Then the Huffman table will be decoded from two alternating FSE decoders (see below) using the same FSE table, using the remaining \(N-F\) bytes in a backward bit parser.

The rest of the original stream, after those \(N\) bytes, is the Huffman-encoded data.

Alternating FSE decoders

To decode the Huffman table coefficients, two FSE decoders are used in succession. They are built from the same FSE table, which has just been decoded, but will hold separate states. Bopth decoders will consume bits from the same stream.

The first decoder will get initialized from the stream, then the second one. The first decoder will emit its initial state and consume some bits to get the next state, then the second decoder will do so as well. Back to the first decoder, which will emit a state and consume some bits, then the second one, and so on.

When a decoder tries to consume more bits than is available, it will complete them with zeroes to get to its last state, and indicate to the caller that the stream has been exhausted.

โœ… Implement a AlternatingDecoder structure in a decoders::alternating module which stores two FseDecoder objects as well as a boolean indicating which decoder has been updated last.

โœ… Implement AlternatingDecoder::new() which takes a FseTable. You might want to make the FseTable clonable, or find another way to share it.

โœ… Implement trait BitDecoder for AlternatingDecoder.

โœ… Write some tests for AlternatingDecoder.

Decode and build a Huffman table

โœ… Write a HuffmanDecoder::parse() method with the signature shown below. You might want to also build the parse_direct() and parse_fse() functions but feel free to organize your code as you wish.

impl HuffmanDecoder {
    /// Build a Huffman table from the given stream. Only the bytes needed to
    /// build the table are consumed from the stream.
    pub fn parse(input: &mut ForwardByteParser) -> Result<Self> {
        let header = input.u8()?;
        let weights = if header < 128 {
            Self::parse_fse(input, header)?
        } else {
            Self::parse_direct(input, header as usize - 127)?
        };
        Self::from_weights(weights)
    }

    /// Parse the Huffman table weights directly from the stream, 4
    /// bits per weights. If there are an odd number of weights, the
    /// last four bits are lost. `number_of_weights/2` bytes (rounded
    /// up) will be consumed from the `input` stream.
    fn parse_direct(input: &mut ForwardByteParser, number_of_weights: usize)
            -> Result<Vec<u8>>
    {
       todo!()
    }

    /// Decode a FSE table and use an alternating FSE decoder to parse
    /// the Huffman table weights. `compressed_size` bytes will be
    /// consumed from the `input` stream.
    fn parse_fse(input: &mut ForwardByteParser, compressed_size: u8)
            -> Result<Vec<u8>>
    {
       todo!()
    }
}

Back to parsing the literals section

You should now be able to parse the literals section and remove the todo!() markers you have set earlier in your code.

โœ… Complete the LiteralsSection::parse() code and test it.

Decoding the literals section

โœ… Implement LiteralsSection::decode().

impl<'a> LiteralsSection<'a> {
   /// Decompress the literals section. Update the Huffman decoder in
    /// `context` if appropriate (compressed literals block with a
    /// Huffman table inside).
    pub fn decode(self, context: &mut DecodingContext) -> Result<Vec<u8>> {
        todo!()
    }
}

Remember that you might have four streams to decode with the same Huffman decoder if the jump_table is not empty. Don't both doing it in parallel right now, you can decode them successively.

โœ… Add a CompressedBlock variant to Block, containing for now just a literals_section field.

โœ… When the block is a compressed block, add a call to LiteralsSection::parse() to Blocks::parse(). Discard the rest of the block for now, as we haven't implemented sequences parsing yet, this will be the next step.

โœ… When the block is a compressed block, add a call to LiteralsSection::decode() to Blocks::decode(), and display the decoded literals using compressed text files (most of them will be using compressed literal blocks). Even though you will not be able to rebuild the decompressed data, you should be able to make sense of the decoded literals.

Sequences

Sequences are the last pieces needed to decode Zstandard compressed frames. They will contain instructions on to how to combine data coming from the literals section of the current block.

Each sequence is made of three parts:

  1. The length in byte of the data to copy from the literals section to the output.
  2. How far behind in the already output data you can find data to emit, and
  3. how many bytes from this already output data to emit again.

Decoding example

Let's assume that your literals section decodes as "abcdefgh", and that your sequences section contains two sequences which, when fully decoded, contain: (3, 2, 3) and (2, 8, 1).

The first sequence (once decoded) is (3, 2, 3):

  • You copy 3 bytes of data from the literals section to the output: "abc"

  • The data to copy starts two bytes behind (one byte would be at "c", two bytes is at "b")

  • You repeatedly copy 3 bytes starting from this position:

    Current output: abc        Bytes repeated: 0
                     ^
    Current output: abcb       Bytes repeated: 1
                      ^
    Current output: abcbc      Bytes repeated: 2
                       ^
    Current output: abcbcb     Bytes repeated: 3
                        ^
    

Now, if the second sequence is (2, 8, 1):

  • You copy 2 bytes of data from the literals section to the output: "de"
  • You copy 1 byte of data, starting 8 bytes behind the current end of output:
    Current output: abcbcbde        Bytes repeated: 0
                    ^
    Current output: abcbcbdea       Bytes repeated: 1
                     ^
    

If there are no more sequences, you end up by copying the unused literal data ("fgh") to the output, giving the final output "abcbcbdeafgh".

Sequence decoding

Sequence decoding is described in section 3.1.1.3.2 of the RFC. You will first find a header describing the number of sequences with a variable size encoding. Then you will encounter a byte describing the three decoders:

  • 2 bits will describe the way literal lengths (the number of bytes to copy from the literals section) are encoded
  • 2 bits will describe the way offsets (how much to look back for data to repeat) are encoded
  • 2 bits will describe the way the match lengths (how much to repeat) are encoded

There are four ways of encoding each tables:

  • Predefined mode: a predefined FSE table, described in the RFC, must be used. No extra data is to be read from the stream for this table. Each table kind (literal lengths, offsets, and match lengths) get its own specific predefined table.
  • RLE (run length encoding) mode: the next byte in the stream is the value that must be repeatedly produced by this decoder.
  • FSE compressed mode: the FSE table to use must be read from the stream, using a forward bit parser.
  • Repeat mode: the last decoder used for this table kind must be used again. No extra data is needed from the stream.

Once the three decoders have been set, the data stream containing the sequences themselves must be decoded using a backwards bit parser on the rest of the block.

Decoding process

All three decoders are initialized by reading the number of bits they need for their respective initialization (the accuracy log for a FSE decoder, nothing for a RLE decoder) in this order:

  • the literals lengths decoder
  • the offsets decoder
  • the match lengths decoder

Each decoder emit a symbol in the order described below. Those symbols are not the value of the literals length, offset and match lengths: they use lookup tables, described in the RFC, that may require reading some more bits from the stream to determine the value. The order in which symbols are decoded is:

  • the offset
  • the match length
  • the literals length

Then the decoders are updated from the bit stream in this order:

  • the literals lengths decoder
  • the match lengths decoder
  • the offsets decoder

The final offset decoding process

The offset value must undergo one final transformation before it can be used, as described in the RFC: three repeat offsets (Offset1, Offset2, and Offset3) variables must be kept in the decoding context and used throughout the whole frame decoding. They are initialized to 1, 4, and 8 at the beginning of the frame.

The following process updates the three repeat offset values depending on the decoded Offset value and the literals length LL value. The real value to use for the offset is the one of Offset1 after this transformation:

Decoded Offset valueLLOffset1Offset2Offset3
30Offset1 - 1Offset1Offset2
3!= 0Offset3Offset1Offset2
20Offset3Offset1Offset2
2!= 0Offset2Offset1Offset3
10Offset2Offset1Offset3
1!= 0Offset1Offset2Offset3
> 3anyOffset - 3Offset1Offset2

Suggestions on code organization

For sequence execution

โœ… Execute decoded sequences with a decoded literals section from the decoding context.

In addition to the three repeat offset fields and the three literals lengths, offsets, and match lengths decoder, the following methods can be added to DecodingContext:

impl DecodingContext {
   /// Decode an offset and properly maintain the three repeat offsets
    pub fn decode_offset(&mut self, offset: usize, literals_length: usize)
        -> Result<usize>
    {
        todo!()
    }

    /// Execute the sequences while updating the offsets
    pub fn execute_sequences(
        &mut self,
        sequences: Vec<(usize, usize, usize)>,
        literals: &[u8],
    ) -> Result<()> {
        todo!()   // Using the `Self::decode_offset()` method
    }
}}

For low-level sequence decoding

โœ… Create a decoders::sequence module and a decoders::sequence::SequenceDecoder type. This type will contain references to the three bit decoders to use when decoding sequences.

A suggested implementation could be (in addition to the classical module-specific Error type):

pub struct SequenceDecoder<'d> {
    literals_lengths_decoder: &'d mut dyn BitDecoder<u16>,
    offsets_decoder: &'d mut dyn BitDecoder<u16>,
    match_lengths_decoder: &'d mut dyn BitDecoder<u16>,
}

impl BitDecoder<(u16, u16, u16), Error> for SequenceDecoder<'_> {
    โ€ฆ
}

A SequenceDecoder would produce triples of literals length, offset, and match lengths. The offset at this stage would not have been transformed using the repeat offsets described above, this would be the job of the DecodingContext::execute_sequences() method.

โœ… Also, since sequences can use a RLE decoder instead of a FSE one, implement a RLEDecoder type in a decoders::rle module. This decoder will implement BitDecoder, and never require any bit from the bit stream, and always produce the same value.

For high-level sequence decoding

โœ… Create a sequences module to parse and decode sequences from the following partial skeleton, and implement the methods (and other methods and types as needed):

pub struct Sequences<'a> {
    pub number_of_sequences: usize,
    pub literal_lengths_mode: SymbolCompressionMode,
    pub offsets_mode: SymbolCompressionMode,
    pub match_lengths_mode: SymbolCompressionMode,
    pub bitstream: &'a [u8],
}

impl<'a> Sequences<'a> {
    /// Parse the sequences data from the stream
    pub fn parse(mut input: ForwardByteParser<'a>) -> Result<Self> {
        todo!()
    }

    /// Return vector of (literals length, offset value, match length) and update the
    /// decoding context with the tables if appropriate.
    pub fn decode(self, context: &mut DecodingContext)
        -> Result<Vec<(usize, usize, usize)>>
    {
        todo!()
    }
}

pub enum SymbolCompressionMode {
    PredefinedMode,
    RLEMode(u8),
    FseCompressedMode(FseTable),
    RepeatMode,
}

At the block level

โœ… Add a sequences field to the CompressedBlock variant of Block. Parse and decode the sequence when parsing and decoding a compressed block.

โœ… Call the execute_sequences() method of the DecodingContext when decoding a Block to apply the decoded sequences to the decoded literals.

๐Ÿ‘ Congratulations, you are now able to decode any Zstandard compressed file!

Checksum

Even though this is not strictly necessary in order to decode compressed Zstandard files, it is advised to implement checksum verification. Remember, you have already [read the checksum value][whole-frame.md#building-the-type] if it is present.

You can compute the checksum as you decode data by adding an appropriate hasher to the decoding context and update it every time you decode a byte. The XxHash64 type of the twox_hash crate implements the right algorithm. You can truncate the final result to 32 bit, and compare it to the value found in the compressed Zstandard file: if they match, the data is likely not corrupt.

โœ… Implement the checksum verification.

๐Ÿน Where do we stand?

You have implemented a full Zstandard decoder, even capable of verifying the integrity of uncompressed data.

Of course, you might want to optimize your code, for example:

  • Each frame is independently encoded and can be decoded in parallel.
  • Literal sections and sequences decoding can progress simultaneously, while producing the decoded data. This will reduce the number of bytes to be processed.
  • Outputting data as decoding progresses would allow starting emitting data sooner.

But more importantly, your decoder must be made robust. This is what the next section is for.

Fuzzing the decompressor

While panicking is acceptable in case of a programming error in your code, it should never be due to an incorrect user input, which is what fuzzing will test.

โœ… Use cargo fuzz to feed arbitrary data to your decoder.

Your decoder should never crash. Of course, it might return an error, but the error should be meaningful and not result in a memory corruption.

๐Ÿ’ซ Congratulations, you now have a robust Zstandard decoder!