The decompressor: part 2

Your decompressor is already able to handle a subset of the Zstandard file format. However, it does not yet take advantage of the most efficient compression techniques used by the standard.

Zstandard utilizes two efficient compressors: Huffman codes and FSE.

Huffman codes are used to encode byte-oriented sequences with \(2^8\) possible output symbols. These codes are employed in the literals section, which consists of portions of data to be copied to the output stream during decompression.
FSE codes are used to encode data with a smaller number of output symbols. They are used for instructions that guide the overall decoding process, known as sequences, as well as coefficients used to initialize the Huffman decoder.

💡 Ensure you understand fully the Huffman and FSE examples, and that you obtain the same result while decoding the bit strings by hand.

Huffman code

The Huffman code is a prefix code. Bits are read one-by-one. Either the current sequence of bits designates a symbol to add to the output (for example a byte), or it is incomplete and more bits need to be read. Most frequent symbols get assigned shorter bit sequences, while less frequent symbols get assigned longer bit sequences. Once a symbol has been decoded, the decoder restarts from scratch and handles the next input bits.

Of course, no bit sequence for a symbol may correspond to the prefix of another bit sequence.

Practical Huffman code example

For example, if we only had 3 symbols A, B and C in our ouput, and B is more frequent than A and C, Huffman coding could use a dictionary such as:

Input bits	Output symbol
00	`A`
01	`C`
1	`B`

The input bit sequence 10010111 would get parsed as 1_00_1_01_1_1 and decoded as BABCBB. 8 input bits decode as 6 output symbols, taking approximately 1.33 bit by symbol.

Note that while B is represented as a single 1, all other bit sequences start with 0. No ambiguity can arise while adding bits as needed. The table can be represented as a prefix tree:

Huffman tree representation

In Zstandard, longer prefixes start with zeroes and shorter prefixes with ones. Also, for the same prefix width, symbols are ordered, which is why A uses 00 because it comes before C at 01.

FSE code

While the Huffman decoders starts from scratch once a symbol has been emitted, a FSE (finite-state entropy) decoder keeps an internal state. Upon entering a state, the decoder emits an output symbol, then reads a number of input bits to switch to another state, and so on. States are consecutively numbered starting from 0. Each state contains three pieces of information:

the output symbol to emit
the number of bits to read to determine the next state
the base line (a number) to add to the number read from the new bits to get the new state number

A FSE code has an accuracy log (AL) which represents the base 2 logarithm of the number of states. For example, a FSE decoder with 32 states will have an AL of 5, because \(2^5\) is 32. The AL is also the number of bits to read to determine the initial state.

Practical FSE example

Let us assume that we have a FSE table with 4 states. The accuracy log (AL) is 2.

State	Output symbol	Base line (BL)	Number of bits to read
0	`A`	1	0
1	`D`	2	1
2	`B`	0	1
3	`A`	2	1

Let us decode the 1110000101 bit sequence (11_1_0_0_0_0_1_0_1 for clarity of this example). AL bits are read from the string, giving 11. The initial state is 3 (0b11):

Entering state 3 outputs symbol A. 1 bit (1) is added to the BL 2: new state is (0b1 + 2)
Entering state 3 outputs symbol A. 1 bit (0) is added to the BL 2: new state is 2 (0b0 + 2).
Entering state 2 outputs symbol B. 1 bit (0) is added to the BL 0: new state is 0 (0b0 + 0).
Entering state 0 outputs symbol A. 0 bits are read, the new state is the BL 1: new state is 1.
Entering state 1 outputs symbol D. 1 bit (0) is added to the BL 2: new state is 2 (0b0 + 2).
Entering state 2 outputs symbol B. 1 bit (0) is added to the BL 0: new state is 0 (0b0 + 0).
Entering state 0 outputs symbol A. 0 bits are read, the new state is the BL: new state is 1.
Entering state 1 outputs symbol D. 1 bit (1) is added to the BL 2: new state is 3 (0b1 + 2).
Entering state 3 outputs symbol A. 1 bit (0) is added to the BL 2: new state is 2 (0b0 + 2).
Entering state 2 outputs symbol B. 1 bit (1) is added to the BL 0: new state is 1 (0b1 + 0).
Entering state 1 outputs symbol D. No 1 bit is available, stopping

The decoded symbol sequence is AABADBADABD. One may note that:

One state may encode more than one output symbols, since the number of bits to read to get to the following state may be 0.
Contrary to a Huffman code, some symbols need less than one bit in the compressed representation. Indeed, here we have a 10 bits input string decoding into a 11 symbols output, that is approximately 0.9 bit per symbol.
Several states may encode the same symbol (states 0 and 3 both emit A).

NET7212 project

The decompressor: part 2

Huffman code

Practical Huffman code example

FSE code

Practical FSE example