General principles
You are free to organize your project as your wish, but some general principles will be respected (unless you convince us by discussing it on the mailing-list).
💡 Do not hesitate to come back to this page frequently to ensure you respect those design principles.
⚠️ You are not supposed to implement everything described in the current page at the beginning of the project. Various parts will be described in more details in subsequent sections.
Library vs. program
While we ultimately want to produce an executable program that will take arguments from the command line, we also want all capabilities we offer to be available as a library for at least two reasons:
- We want other programs to be able to include our decoder as a dependency.
- We want to be able to test various functions using unit tests.
Your project will contain, in the same crate:1
- a library
- an executable
The executable should be reduced to handling the command line arguments and calling the right library functions. Its job should be kept to a minimum.
Error handling
Every module whose functions return a Result
should have its own Error
type, and possibly its own Result
shortcut. Using the thiserror
crate in the library allows easy embedding and conversion of one error type to another:
// File "mymod.rs"
#[derive(Debug, thiserror::Error)]
pub enum Error {
// Encapsulate an I/O error without adding any more context
// and add a `impl From<std::io::Error> for Error` implementation.
#[error(transparent)]
Io(#[from] std::io::Error)
// Custom error originating from this module
#[error("bad magic found {found:#06x} instead of expected {expected:#06x}")]
BadMagic { expected: u32, found: u32 }
// Encapsulate an error from another module while adding context
#[error("corrupted Huffman table weights")]
CorruptedWeights(#[source] fse::Error)
}
// Define a local `Result` type alias which can be used with one argument,
// the default for the second type being defined as the local `Error` type.
pub type Result<T, E = Error> = std::result::Result<T, E>;
// Example function returning a `Result` without repeating the `Error` type.
// The return type can be written as:
// - `Result<Vec<u8>>` in this module, or `Result<Vec<u8>, Error>`
// - `mymod::Result<Vec<u8>>` or `Result<Vec<u8>, mymod::Error>` from outside
fn read_file(filename: &str) -> Result<Vec<u8>> {
Ok(std::file::read(filename)?)
}
The executable can implement generic error handling by using, for example, anyhow, or the combination of eyre and color-eyre for a more pleasant and colorful error output2.
Consuming the input
We will need to use several parser to handle the compressed content:
- A main forward byte-oriented one, which delivers the bytes in the input in order.
- A forward bit-oriented sub-parser to decode the coefficients of a FSE table.
- A backward bit-oriented sub-parser to decode FSE-encoded or Huffman-encoded data.
The main parser should have the following properties:
- It can preload data on-demand, for example if all input bytes have been consumed, or if the decoder needs to access bytes located later in the input.
- It can lend a reference to the data to a sub-parser without copying it.
- It may discard data that has been either consumed directly or through a sub-parser.
💡 At the beginning of the project, having the data to decode in memory as a slice of bytes might be sufficient to satisfy the requirements of the main parser.
Producing the output
Producing the output is simpler than parsing the input, as all transactions are byte-oriented and going forward. Our output producer, dedicated to a frame, should have the following properties:
- It should emit data as soon as possible but not sooner: decoding a frame should only output its uncompressed data when previous frames have finished outputing theirs.
- It should discard data as soon as possible, but not sooner: later blocks in the same frame may need to copy already-produced data.
Decoding a frame
Within a frame, later blocks might refer to decoding tables or data structures used in previous ones. For example, if two successive blocks use the same Huffman weights to encode the literals in both blocks, the second block will only contain a "use the previous Huffman decoder" marker.
You will need to maintain a shared state between successive blocks of the same frame. This state must contain:3
- The three repeat offsets.
- The latest decoders used to decode the constituents of a sequence (literals lengths, offsets, and match lenghts).
- The latest Huffman decoder used to decode literals.
- Some high-level information about the frame such as the window size.
- The current checksum of the data being decoded in this frame.
Do not worry if you do not yet understand the Rust-specific parts of the project specification at first (e.g., crates, Result
, thiserror
, etc.), you will learn about all of them later during the course.
If your program depends on the color-eyre
crate, you may omit the eyre
dependency if you wish, as color-eyre
depends on eyre
and reexports it. You can use color_eyre::eyre;
in your code then use eyre::Result
.
Do not worry if some concepts seem obscure at the beginning of the project, you will discover and understand them later while reading the zstd standard.