# The decompressor: part 2

Your decompressor is already able to handle a subset of the Zstandard file format. However, it does not yet take advantage of the most efficient compression techniques used by the standard.

Zstandard utilizes two efficient compressors: Huffman codes and FSE.

- Huffman codes are used to encode byte-oriented sequences with \(2^8\) possible output symbols. These codes are employed in the literals section, which consists of portions of data to be copied to the output stream during decompression.
- FSE codes are used to encode data with a smaller number of output symbols. They are used for instructions that guide the overall decoding process, known as sequences, as well as coefficients used to initialize the Huffman decoder.

💡 Ensure you understand fully the Huffman and FSE examples, and that you obtain the same result while decoding the bit strings by hand.

## Huffman code

The Huffman code is a prefix code. Bits are read one-by-one. Either the current sequence of bits designates a symbol to add to the output (for example a byte), or it is incomplete and more bits need to be read. Most frequent symbols get assigned shorter bit sequences, while less frequent symbols get assigned longer bit sequences. Once a symbol has been decoded, the decoder restarts from scratch and handles the next input bits.

Of course, no bit sequence for a symbol may correspond to the prefix of another bit sequence.

### Practical Huffman code example

For example, if we only had 3 symbols `A`

, `B`

and `C`

in our ouput, and `B`

is more frequent than `A`

and `C`

, Huffman coding could use a dictionary such as:

Input bits | Output symbol |
---|---|

00 | `A` |

01 | `C` |

1 | `B` |

The input bit sequence `10010111`

would get parsed as `1_00_1_01_1_1`

and decoded as `BABCBB`

. 8 input bits decode as 6 output symbols, taking approximately 1.33 bit by symbol.

Note that while `B`

is represented as a single `1`

, all other bit sequences start with `0`

. No ambiguity can arise while adding bits as needed. The table can be represented as a prefix tree:

In Zstandard, longer prefixes start with zeroes and shorter prefixes with ones. Also, for the same prefix width, symbols are ordered, which is why `A`

uses `00`

because it comes before `C`

at `01`

.

## FSE code

While the Huffman decoders starts from scratch once a symbol has been emitted, a FSE (finite-state entropy) decoder keeps an internal state. Upon entering a state, the decoder emits an output symbol, then reads a number of input bits to switch to another state, and so on. States are consecutively numbered starting from 0. Each state contains three pieces of information:

- the output symbol to emit
- the number of bits to read to determine the next state
- the base line (a number) to add to the number read from the new bits to get the new state number

A FSE code has an accuracy log (AL) which represents the base 2 logarithm of the number of states. For example, a FSE decoder with 32 states will have an AL of 5, because \(2^5\) is 32. The AL is also the number of bits to read to determine the initial state.

### Practical FSE example

Let us assume that we have a FSE table with 4 states. The accuracy log (AL) is 2.

State | Output symbol | Base line (BL) | Number of bits to read |
---|---|---|---|

0 | `A` | 1 | 0 |

1 | `D` | 2 | 1 |

2 | `B` | 0 | 1 |

3 | `A` | 2 | 1 |

Let us decode the `1110000101`

bit sequence (`11_1_0_0_0_0_1_0_1`

for clarity of this example). AL bits are read from the string, giving `11`

. The initial state is 3 (`0b11`

):

- Entering state 3 outputs symbol
`A`

. 1 bit (`1`

) is added to the BL`2`

: new state is (`0b1 + 2`

) - Entering state 3 outputs symbol
`A`

. 1 bit (`0`

) is added to the BL`2`

: new state is 2 (`0b0 + 2`

). - Entering state 2 outputs symbol
`B`

. 1 bit (`0`

) is added to the BL`0`

: new state is 0 (`0b0 + 0`

). - Entering state 0 outputs symbol
`A`

. 0 bits are read, the new state is the BL`1`

: new state is 1. - Entering state 1 outputs symbol
`D`

. 1 bit (`0`

) is added to the BL`2`

: new state is 2 (`0b0 + 2`

). - Entering state 2 outputs symbol
`B`

. 1 bit (`0`

) is added to the BL`0`

: new state is 0 (`0b0 + 0`

). - Entering state 0 outputs symbol
`A`

. 0 bits are read, the new state is the BL: new state is 1. - Entering state 1 outputs symbol
`D`

. 1 bit (`1`

) is added to the BL`2`

: new state is 3 (`0b1 + 2`

). - Entering state 3 outputs symbol
`A`

. 1 bit (`0`

) is added to the BL`2`

: new state is 2 (`0b0 + 2`

). - Entering state 2 outputs symbol
`B`

. 1 bit (`1`

) is added to the BL`0`

: new state is 1 (`0b1 + 0`

). - Entering state 1 outputs symbol
`D`

. No 1 bit is available, stopping

The decoded symbol sequence is `AABADBADABD`

. One may note that:

- One state may encode more than one output symbols, since the number of bits to read to get to the following state may be 0.
- Contrary to a Huffman code, some symbols need less than one bit in the compressed representation. Indeed, here we have a 10 bits input string decoding into a 11 symbols output, that is approximately 0.9 bit per symbol.
- Several states may encode the same symbol (states 0 and 3 both
emit
`A`

).