The decompressor: part 2
Your decompressor is already able to handle a subset of the Zstandard file format. However, it does not yet take advantage of the most efficient compression techniques used by the standard.
Zstandard utilizes two efficient compressors: Huffman codes and FSE.
- Huffman codes are used to encode byte-oriented sequences with \(2^8\) possible output symbols. These codes are employed in the literals section, which consists of portions of data to be copied to the output stream during decompression.
- FSE codes are used to encode data with a smaller number of output symbols. They are used for instructions that guide the overall decoding process, known as sequences, as well as coefficients used to initialize the Huffman decoder.
💡 Ensure you understand fully the Huffman and FSE examples, and that you obtain the same result while decoding the bit strings by hand.
Huffman code
The Huffman code is a prefix code. Bits are read one-by-one. Either the current sequence of bits designates a symbol to add to the output (for example a byte), or it is incomplete and more bits need to be read. Most frequent symbols get assigned shorter bit sequences, while less frequent symbols get assigned longer bit sequences. Once a symbol has been decoded, the decoder restarts from scratch and handles the next input bits.
Of course, no bit sequence for a symbol may correspond to the prefix of another bit sequence.
Practical Huffman code example
For example, if we only had 3 symbols A, B and C in our ouput, and B is more frequent than A and C, Huffman coding could use a dictionary such as:
| Input bits | Output symbol |
|---|---|
| 00 | A |
| 01 | C |
| 1 | B |
The input bit sequence 10010111 would get parsed as 1_00_1_01_1_1 and decoded as BABCBB. 8 input bits decode as 6 output symbols, taking approximately 1.33 bit by symbol.
Note that while B is represented as a single 1, all other bit sequences start with 0. No ambiguity can arise while adding bits as needed. The table can be represented as a prefix tree:
In Zstandard, longer prefixes start with zeroes and shorter prefixes with ones. Also, for the same prefix width, symbols are ordered, which is why A uses 00 because it comes before C at 01.
FSE code
While the Huffman decoders starts from scratch once a symbol has been emitted, a FSE (finite-state entropy) decoder keeps an internal state. Upon entering a state, the decoder emits an output symbol, then reads a number of input bits to switch to another state, and so on. States are consecutively numbered starting from 0. Each state contains three pieces of information:
- the output symbol to emit
- the number of bits to read to determine the next state
- the base line (a number) to add to the number read from the new bits to get the new state number
A FSE code has an accuracy log (AL) which represents the base 2 logarithm of the number of states. For example, a FSE decoder with 32 states will have an AL of 5, because \(2^5\) is 32. The AL is also the number of bits to read to determine the initial state.
Practical FSE example
Let us assume that we have a FSE table with 4 states. The accuracy log (AL) is 2.
| State | Output symbol | Base line (BL) | Number of bits to read |
|---|---|---|---|
| 0 | A | 1 | 0 |
| 1 | D | 2 | 1 |
| 2 | B | 0 | 1 |
| 3 | A | 2 | 1 |
Let us decode the 1110000101 bit sequence (11_1_0_0_0_0_1_0_1 for clarity of this example). AL bits are read from the string, giving 11. The initial state is 3 (0b11):
- Entering state 3 outputs symbol
A. 1 bit (1) is added to the BL2: new state is (0b1 + 2) - Entering state 3 outputs symbol
A. 1 bit (0) is added to the BL2: new state is 2 (0b0 + 2). - Entering state 2 outputs symbol
B. 1 bit (0) is added to the BL0: new state is 0 (0b0 + 0). - Entering state 0 outputs symbol
A. 0 bits are read, the new state is the BL1: new state is 1. - Entering state 1 outputs symbol
D. 1 bit (0) is added to the BL2: new state is 2 (0b0 + 2). - Entering state 2 outputs symbol
B. 1 bit (0) is added to the BL0: new state is 0 (0b0 + 0). - Entering state 0 outputs symbol
A. 0 bits are read, the new state is the BL: new state is 1. - Entering state 1 outputs symbol
D. 1 bit (1) is added to the BL2: new state is 3 (0b1 + 2). - Entering state 3 outputs symbol
A. 1 bit (0) is added to the BL2: new state is 2 (0b0 + 2). - Entering state 2 outputs symbol
B. 1 bit (1) is added to the BL0: new state is 1 (0b1 + 0). - Entering state 1 outputs symbol
D. No 1 bit is available, stopping
The decoded symbol sequence is AABADBADABD. One may note that:
- One state may encode more than one output symbols, since the number of bits to read to get to the following state may be 0.
- Contrary to a Huffman code, some symbols need less than one bit in the compressed representation. Indeed, here we have a 10 bits input string decoding into a 11 symbols output, that is approximately 0.9 bit per symbol.
- Several states may encode the same symbol (states 0 and 3 both
emit
A).