The decompressor: part 2
Your decompressor is already able to handle a subset of the Zstandard file format. However, it does not yet take advantage of the most efficient compression techniques used by the standard.
Zstandard utilizes two efficient compressors: Huffman codes and FSE.
- Huffman codes are used to encode byte-oriented sequences with \(2^8\) possible output symbols. These codes are employed in the literals section, which consists of portions of data to be copied to the output stream during decompression.
- FSE codes are used to encode data with a smaller number of output symbols. They are used for instructions that guide the overall decoding process, known as sequences, as well as coefficients used to initialize the Huffman decoder.
💡 Ensure you understand fully the Huffman and FSE examples, and that you obtain the same result while decoding the bit strings by hand.
Huffman code
The Huffman code is a prefix code. Bits are read one-by-one. Either the current sequence of bits designates a symbol to add to the output (for example a byte), or it is incomplete and more bits need to be read. Most frequent symbols get assigned shorter bit sequences, while less frequent symbols get assigned longer bit sequences. Once a symbol has been decoded, the decoder restarts from scratch and handles the next input bits.
Of course, no bit sequence for a symbol may correspond to the prefix of another bit sequence.
Practical Huffman code example
For example, if we only had 3 symbols A
, B
and C
in our ouput, and B
is more frequent than A
and C
, Huffman coding could use a dictionary such as:
Input bits | Output symbol |
---|---|
00 | A |
01 | C |
1 | B |
The input bit sequence 10010111
would get parsed as 1_00_1_01_1_1
and decoded as BABCBB
. 8 input bits decode as 6 output symbols, taking approximately 1.33 bit by symbol.
Note that while B
is represented as a single 1
, all other bit sequences start with 0
. No ambiguity can arise while adding bits as needed. The table can be represented as a prefix tree:
In Zstandard, longer prefixes start with zeroes and shorter prefixes with ones. Also, for the same prefix width, symbols are ordered, which is why A
uses 00
because it comes before C
at 01
.
FSE code
While the Huffman decoders starts from scratch once a symbol has been emitted, a FSE (finite-state entropy) decoder keeps an internal state. Upon entering a state, the decoder emits an output symbol, then reads a number of input bits to switch to another state, and so on. States are consecutively numbered starting from 0. Each state contains three pieces of information:
- the output symbol to emit
- the number of bits to read to determine the next state
- the base line (a number) to add to the number read from the new bits to get the new state number
A FSE code has an accuracy log (AL) which represents the base 2 logarithm of the number of states. For example, a FSE decoder with 32 states will have an AL of 5, because \(2^5\) is 32. The AL is also the number of bits to read to determine the initial state.
Practical FSE example
Let us assume that we have a FSE table with 4 states. The accuracy log (AL) is 2.
State | Output symbol | Base line (BL) | Number of bits to read |
---|---|---|---|
0 | A | 1 | 0 |
1 | D | 2 | 1 |
2 | B | 0 | 1 |
3 | A | 2 | 1 |
Let us decode the 1110000101
bit sequence (11_1_0_0_0_0_1_0_1
for clarity of this example). AL bits are read from the string, giving 11
. The initial state is 3 (0b11
):
- Entering state 3 outputs symbol
A
. 1 bit (1
) is added to the BL2
: new state is (0b1 + 2
) - Entering state 3 outputs symbol
A
. 1 bit (0
) is added to the BL2
: new state is 2 (0b0 + 2
). - Entering state 2 outputs symbol
B
. 1 bit (0
) is added to the BL0
: new state is 0 (0b0 + 0
). - Entering state 0 outputs symbol
A
. 0 bits are read, the new state is the BL1
: new state is 1. - Entering state 1 outputs symbol
D
. 1 bit (0
) is added to the BL2
: new state is 2 (0b0 + 2
). - Entering state 2 outputs symbol
B
. 1 bit (0
) is added to the BL0
: new state is 0 (0b0 + 0
). - Entering state 0 outputs symbol
A
. 0 bits are read, the new state is the BL: new state is 1. - Entering state 1 outputs symbol
D
. 1 bit (1
) is added to the BL2
: new state is 3 (0b1 + 2
). - Entering state 3 outputs symbol
A
. 1 bit (0
) is added to the BL2
: new state is 2 (0b0 + 2
). - Entering state 2 outputs symbol
B
. 1 bit (1
) is added to the BL0
: new state is 1 (0b1 + 0
). - Entering state 1 outputs symbol
D
. No 1 bit is available, stopping
The decoded symbol sequence is AABADBADABD
. One may note that:
- One state may encode more than one output symbols, since the number of bits to read to get to the following state may be 0.
- Contrary to a Huffman code, some symbols need less than one bit in the compressed representation. Indeed, here we have a 10 bits input string decoding into a 11 symbols output, that is approximately 0.9 bit per symbol.
- Several states may encode the same symbol (states 0 and 3 both
emit
A
).