Data compression
There is a very high chance that your browser downloaded this website as compressed data. I can say this with confidence because web standards define gzip as a standard compressor for web content. So what did compressing this webpage achieve? As with any data compression algorithm, it reduced the data’s size with no or very little data loss. Data compression is a bigger deal than you might imagine; some works conclude that compression is equivalent to intelligence. There are contests (the Hutter Prize) that ask only one question: “Can you compress this better?”. But how do data compressors work, what algorithms are used, what mathematics lies behind them, and why are people who believe we will achieve infinite compression just as wrong as people who believe free energy exists?
Claude Shannon and the definition of a bit
In the 1940s, while working at Bell Labs, Claude Shannon wrote a paper in which he devised a tool for measuring information. Up until that point, we didn’t know what information was or how to measure it, much like how we talk about consciousness today. Building on previous work, he defined information as a measure of probability and called it information entropy. The name itself is a little pompous, but the theory behind it is instrumental to understanding the limits of information transfer and data compression. Entropy is measured in bits, a term many people are accustomed to, for example when we talk about the speed of our internet connection.

To understand the idea we are going to follow three examples. In the first example, suppose we’re flipping a coin that has a 50% chance to land on heads and an equal chance to land on tails. How would you encode which outcome happened? Well, you could use just a single bit: store a zero when it lands heads and a one when it lands tails. To store that information we need just a single bit, so the entropy for that scenario is equal to 1 bit. Now let’s suppose that we are throwing a four-sided die with equal probabilities for each side. To store its result we would need 2 bits, one two-bit pattern per side (00, 01, 10, 11 -> 0, 1, 2, 3). But now consider something different: we randomly pick a year and want to know if it’s a leap year. Because a leap year happens roughly once every four years, the probabilities are 25% for a leap year and 75% for it not being a leap year. You might imagine that we need to store an entire bit to represent it, just as in the coin flip case (0 is a leap year, 1 is not). But the surprising fact is that we would need to store only around 0.811 bits on average.

Right about now you’re either intrigued and want to know more, or you think it’s just some stupid theory that has no real-world use case. How can you even store a non-integer number of bits in a computer? All I’ll say right now is that you can, and I’ll provide the evidence very soon. Okay, but you want to know where I got that number from, so here’s how to calculate it.
I’m not going to explain how and why it works; if you want to see the derivation, I recommend Khan Academy. Remember that this is the formula for the number of bits required on average while coding any alphabet:

$$H(X) = -\sum_{x \in X} p(x) \log_2 p(x)$$

To know how much information is in a single event (symbol), like a non-leap year appearing with 75% probability, we compute:

$$I(x) = -\log_2 p(x)$$

First example

This is our alphabet: two equally likely outcomes, heads and tails, each with probability 0.5.

$$H = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1 \text{ bit}$$

(Since the second example is very similar, I’m going to skip it for brevity.)

Third example

This is our alphabet: leap year with probability 0.25 and non-leap year with probability 0.75.

$$H = -(0.25 \log_2 0.25 + 0.75 \log_2 0.75) \approx 0.25 \cdot 2 + 0.75 \cdot 0.415 \approx 0.811 \text{ bits}$$
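If you prefer to see it as code, here is a quick sketch that recomputes the numbers from the three examples:

```python
import math

def entropy(probabilities):
    """Average bits per symbol for an alphabet with the given probabilities."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def information(p):
    """Bits of information carried by a single symbol with probability p."""
    return -math.log2(p)

print(entropy([0.5, 0.5]))                # coin flip       -> 1.0
print(entropy([0.25, 0.25, 0.25, 0.25]))  # four-sided die  -> 2.0
print(entropy([0.25, 0.75]))              # leap year       -> ~0.811
print(information(0.75))                  # a non-leap year -> ~0.415
```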
Randomness and order
From the third example, you can start to see how you might be able to take some data and reduce its size. You might also notice that if we have a two-symbol alphabet where each symbol is equally likely, just like in the first example, each symbol is going to require 1 bit to represent. That means if we have a stream of bits where each one is equally likely (purely random), we can’t compress it. There is some subtlety behind that statement, but first I want to prove two things: we can’t compress random data, and there is no universal compressor. I’m going to focus on the second one. What does it mean? It means that it’s impossible for any compressor to take any data stream and always compress it to something smaller. The proof is cool and short, and it’s called the counting proof.
Define $S_n$ as the set of all messages that are exactly $n$ bits long; there are $2^n$ of them. A lossless compressor has to map each of them to a different output, otherwise two different messages would decompress to the same thing. But there are only $2^0 + 2^1 + \dots + 2^{n-1} = 2^n - 1$ possible outputs shorter than $n$ bits, one fewer than we need. So at least one $n$-bit input must map to an output that is $n$ bits or longer, which means no compressor can shrink every input.
So how is it that we can compress anything? The messages that we can compress are redundant: they are ordered, and we can exploit this to reduce their size. Random messages must stay the same size (or even get bigger) in exchange for the ability to compress ordered ones. Order is a tiny fraction of the sea of randomness. If you want to read more about the mathematical theory at the basis of compression, or anything connected to compression in general, I recommend Matt Mahoney’s book Data Compression Explained.
How are compression algorithms structured?
Everybody and their grandma has heard about zip, which is probably the most popular compression format in use right now. It’s quite old: the algorithm that most often powers it was developed in the 1990s. The algorithm itself is called DEFLATE and, quoting Wikipedia:
In computing, Deflate is a lossless data compression file format that uses a combination of LZ77 and Huffman coding.
So what are those two things, LZ77 and Huffman coding? They form the most popular structure for an everyday compression algorithm: a data transform (LZ77) combined with an entropy coder (Huffman coding).
Entropy coding
The three most popular are:
- Huffman Coding
- Arithmetic Coding
- Asymmetric Number Systems (ANS)
These algorithms allow us to create a direct mapping where each symbol in the alphabet gets its own bit pattern. This mapping allows us to encode symbols that appear very often in the message with a smaller bit size. As we learned from information theory, if we have a stream where a single outcome (called a symbol in the compression world) has a very high probability, the information for that symbol is low. Another way to think about this is that your level of surprise is low when you see a symbol that is very likely to appear. Entropy coders assign a lower bit count to probable symbols, thus reducing the message size. In other words, entropy coders give each symbol a new bit size equal (or close) to the entropy of that symbol.
Huffman Coding
Huffman coding is the simplest entropy coder to understand and implement, and that low complexity translates into some of the highest, if not the highest, compression and decompression performance. We are going to build a binary tree where each leaf is a symbol, and the bit sequence assigned to that symbol is the path taken to reach it.
In the image above you can see a Huffman tree for an example message: the more frequent a symbol, the closer it sits to the root, and the shorter its code.
Building such a tree is extremely simple, storing it is relatively simple, and variants like Canonical Huffman Coding make working with these trees heavenly.
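To make the procedure concrete, here is a small sketch of the classic tree-building loop; the symbol frequencies are made up for the example and the codes are returned as bit strings:

```python
import heapq
import itertools

def huffman_codes(frequencies):
    """Build prefix codes by repeatedly merging the two least frequent nodes."""
    counter = itertools.count()  # tie-breaker so dicts are never compared
    heap = [(freq, next(counter), {sym: ""}) for sym, freq in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Going left prepends a 0 bit, going right prepends a 1 bit.
        merged = {s: "0" + code for s, code in left.items()}
        merged.update({s: "1" + code for s, code in right.items()})
        heapq.heappush(heap, (f1 + f2, next(counter), merged))
    return heap[0][2]

# Hypothetical frequencies: 'a' is the most common, so it gets the shortest code.
print(huffman_codes({"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}))
```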
Okay, so Huffman coding can’t do any wrong, it’s simple, proven, and high-performance.
Where’s the catch?
You might’ve spotted it already, but ask yourself this question: how would you encode a symbol to take a fractional number of bits, like the 0.415 bits of a non-leap year in the information entropy example above?
Well, you can’t.
Huffman coding is always going to spend a whole number of bits per symbol, roughly ceil(I(symbol)).
Even if your symbol is so frequent that it’s going to take 0.1 bits to encode, Huffman can’t do it.
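As a quick worked example (the 93% figure is just for illustration): a symbol with probability 0.93 carries

$$I = -\log_2 0.93 \approx 0.105 \text{ bits},$$

yet the shortest code Huffman can give it is a single bit, almost ten times the theoretical minimum.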
Is this a problem?
Not really. For everyday use we care more about performance than anything else, so Huffman coding is used in compressors like DEFLATE (found in almost everything: PNG, gzip, zip, and anywhere zlib is used), zstd, and brotli.
There are also compressors like lz4 and snappy that, for the sake of decode speed, skip the entropy coding stage entirely.
Arithmetic Coding
Arithmetic coding is the first coder we’ll look at that can spend a fractional number of bits on a symbol, so for a symbol that appears 7 times in 1000 symbols we would need $-\log_2(7/1000) \approx 7.16$ bits rather than the 8 bits an integer-length code would round up to.
I’ll present the simple example of coding individual bits and only briefly touch on the more expanded concept of coding multi-bit symbols.
The way we are going to code data is by narrowing down a range and picking a number inside it, so that when we decode we can tell which bit bin that number lies in and emit the corresponding bits.
Each bit bin’s width is equal to the probability of seeing that bit.
To make it click, let’s walk through an example with a handful of symbols and their probabilities. You can follow along with your finger: going from left to right we start with the range $[0, 1)$ and, for every symbol we encode, we shrink the current range down to that symbol’s bin inside it. When we run out of symbols, we pick any number inside the final range; the decoder repeats the same narrowing and simply checks which bin the number falls into at every step.
So far I explained this in floating point, but every single time you read about a compression strategy that involves floats, you can assume there will be a step that converts everything to integers, because integers are faster, less error-prone, offer higher resolution, and are generally easier to work with.
The same applies here: in the real world we map this $[0, 1)$ range onto an integer range (for example 32-bit), renormalizing as the range narrows.
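Here is a toy sketch of the idea, kept in floating point for readability (a real coder would use the integer mapping just described); the alphabet, probabilities, and message are made up:

```python
def build_bins(probabilities):
    """Map each symbol to its [low, high) bin inside [0, 1)."""
    bins, low = {}, 0.0
    for symbol, p in probabilities.items():
        bins[symbol] = (low, low + p)
        low += p
    return bins

def encode(message, bins):
    """Narrow [0, 1) once per symbol; any number in the final range works."""
    low, high = 0.0, 1.0
    for symbol in message:
        b_low, b_high = bins[symbol]
        width = high - low
        low, high = low + width * b_low, low + width * b_high
    return (low + high) / 2

def decode(code, bins, length):
    """Repeatedly find which bin the number falls into and re-narrow."""
    out = []
    for _ in range(length):
        for symbol, (b_low, b_high) in bins.items():
            if b_low <= code < b_high:
                out.append(symbol)
                code = (code - b_low) / (b_high - b_low)
                break
    return "".join(out)

probs = {"a": 0.6, "b": 0.3, "c": 0.1}    # hypothetical probabilities
bins = build_bins(probs)
code = encode("aababc", bins)
print(code, decode(code, bins, 6))        # -> the same "aababc"
```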
So why don’t we use this? BECAUSE IT IS SUPER SLOW! On generic data, this coder is around 5 times slower than Huffman coding. It does produce a better compression ratio, but the trade-off is most likely not worth it. Algorithms like the original bzip used it, but after that its history is kinda muffled. Want to know why? Because it got patented so hard that no one wanted any trouble, so no one used it. All of those patents have expired by now, but we still don’t use it because…
ANS (Asymmetric numeral systems)
ANS can achieve the same compression ratio as arithmetic coding while being at least twice as fast, or it can get very close to Huffman’s speed if we sacrifice just a little bit of compression ratio.
To understand asymmetric numeral systems, let’s first look at a symmetric numeral system. Consider an alphabet in which every symbol is equally likely, for instance the decimal digits 0 through 9. To encode a message we keep appending digits to a single number that acts as our state: encoding digit $d$ turns the state $x$ into $10x + d$. To decode, we peek at the last digit with $x \bmod 10$ and pop it with $\lfloor x / 10 \rfloor$. Every digit grows the state by the same $\log_2 10 \approx 3.32$ bits, which is only optimal when every digit really is equally likely.

Now let’s make it asymmetric by letting more probable symbols occupy more digit slots. If I write the following description string: AAB, then symbol A owns slots 0 and 1 while symbol B owns slot 2. Which slot to choose for A? There are two ways to encode the same symbol.

Before choosing, let’s do some information entropy math. The information in A (probability 2/3) is $\log_2(3/2) \approx 0.585$ bits, while the information in B (probability 1/3) is $\log_2 3 \approx 1.585$ bits. If we grow the state so that the new value modulo 3 lands in A’s slots, then after that modulus it’s either going to be zero or one, so either way it’s going to code the symbol A, and the state only grows by about 0.585 bits instead of a whole base-3 digit.
Now, to discover the equation that will allow us to encode any symbol from any alphabet, consider the following description: a string of length $M$ in which symbol $s$ occupies $f_s$ consecutive slots starting at position $c_s$, so that its probability is $f_s / M$. For the AAB example, $M = 3$, $f_A = 2$, $c_A = 0$, $f_B = 1$, $c_B = 2$. And this is our encoding table, listing the next state for every current state $x$:

| x       | 0 | 1 | 2 | 3  | 4  | 5  |
| C(x, A) | 0 | 1 | 3 | 4  | 6  | 7  |
| C(x, B) | 2 | 5 | 8 | 11 | 14 | 17 |

Our quest now is to find that function $C(x, s)$. There is a pattern here! We are always rescaling the state and then adding the symbol’s starting slot plus a remainder:

$$C(x, s) = M \left\lfloor \frac{x}{f_s} \right\rfloor + c_s + (x \bmod f_s)$$
To decode, peek at the last digit of the state in the base equal to the length of the description, $d = x \bmod M$. Figure out into which symbol’s range of slots this digit falls; that is the decoded symbol $s$. This last digit then gets popped from the state, which is scaled back down:

$$D(x) = \left( s,\; f_s \left\lfloor \frac{x}{M} \right\rfloor + (d - c_s) \right)$$
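Here is a minimal sketch of those two formulas in code, using an arbitrary-precision integer state for clarity; a real implementation would renormalize a fixed-width state instead. The frequencies are the hypothetical A/B example from above.

```python
def rans_encode(message, freqs):
    """Push symbols onto an integer state: C(x, s) = M*floor(x/f_s) + c_s + (x mod f_s)."""
    M = sum(freqs.values())
    starts, c = {}, 0
    for sym, f in freqs.items():
        starts[sym] = c
        c += f
    x = 0
    for sym in reversed(message):          # encode in reverse so decoding pops in order
        f, cs = freqs[sym], starts[sym]
        x = M * (x // f) + cs + (x % f)
    return x

def rans_decode(x, freqs, length):
    """Pop symbols back off the state: d = x mod M identifies the symbol's slot range."""
    M = sum(freqs.values())
    slots, starts = [], {}                  # slot index -> symbol lookup
    for sym, f in freqs.items():
        starts[sym] = len(slots)
        slots.extend(sym * f)
    out = []
    for _ in range(length):
        d = x % M
        sym = slots[d]
        x = freqs[sym] * (x // M) + (d - starts[sym])
        out.append(sym)
    return "".join(out)

freqs = {"A": 2, "B": 1}                    # A is twice as likely as B
state = rans_encode("AABAB", freqs)
print(state, rans_decode(state, freqs, 5))  # -> the same "AABAB"
```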
What we have just derived is a subset of the ANS coder family called range ANS (rANS); for an example implementation check out Fabian Giesen’s work. There are of course more types of ANS coders, like tabled ANS (tANS or FSE), where we can choose how the description’s slots are spread to achieve an even higher compression ratio. The tabled nature of that variant also allows it to run branchless while decoding. For more about tANS read Yann Collet’s work or Charles Bloom’s introduction to ANS. What’s interesting is that ANS contains Huffman coding within itself, so it’s a more generic approach to entropy coding (read about it here). If you’re really into it and want to understand everything, try giving the white paper a crack.
Data transforms
While entropy coding allows us to represent more frequent symbols using fewer bits, data transforms aim to deduplicate symbols and/or put the data into a form that allows better compression.
Let’s imagine that we want to send a message that is just the same chunk of symbols written four times in a row. Instead of spelling it out symbol by symbol, we can send the chunk once along with an instruction: repeat it four times.
If encoding this repetition takes less space than encoding the symbols directly, we have just reduced the size of the message beyond what a per-symbol entropy coder alone could achieve. Let’s look at some examples, in order of their computational complexity and compression performance:
- Run length encoding (RLE)
- Lempel-Ziv (LZ77 or variations)
- Burrows-Wheeler transform (BWT)
RLE
It’s the simplest possible data transform: each entry has two parts, the symbol and how many times to repeat it.
For example, a control byte of 0b00101000 would indicate that, in the next 8 entries, the two marked by set bits will have non-one run lengths encoded as well.
Sometimes you need just a touch of compression and you can’t be bothered to mess with full-blown compressors.
RLE is a perfect algorithm for that: its low complexity might be useful, for example, when trying to reduce the size of a binary.
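A minimal sketch of the plain (symbol, count) form, ignoring the control-byte variant described above:

```python
from itertools import groupby

def rle_encode(data):
    """Collapse each run of identical symbols into a (symbol, count) pair."""
    return [(symbol, len(list(run))) for symbol, run in groupby(data)]

def rle_decode(pairs):
    """Expand each (symbol, count) pair back into a run."""
    return "".join(symbol * count for symbol, count in pairs)

pairs = rle_encode("aaaabbbccd")
print(pairs)              # [('a', 4), ('b', 3), ('c', 2), ('d', 1)]
print(rle_decode(pairs))  # "aaaabbbccd"
```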
LZ77
Just like ANS is a generalization of Huffman coding, one might say that LZ77, under the right conditions, is a generalization of RLE. In LZ77 each entry codes an offset into the already decoded bytes and the number of bytes to copy starting from that offset. In data compression lingo, entries for LZ compressors are called matches, because some previously seen data matches the data we are about to emit.
You might notice my preference for talking about the decompression side; that’s because it’s easier. If you have a feeling that the decode step is way faster than the encode step, you are correct. The decompressor just blindly copies bytes as fast as it can, while the compressor has to make choices, like which match to use. The memory needed to compress is in most cases much higher than the memory needed to decompress. A compiler behaves the same way. An interesting way to look at compression is to imagine that compression is compilation while decompression is execution that regenerates the uncompressed data, where the decompression algorithm is the virtual machine or CPU and the compressed stream is the list of ops to execute.
Going back to the topic at hand, LZ is so simple that it lends itself to many, and by many I mean many, different versions. Each version adds something of its own to make it fit a given need better. Since data compression is a field where diminishing returns are just part of the game, LZ produces very good compression while offering simple decompression. One of the most widely known compressors, lz4, and its lesser-known sibling snappy use only an LZ transform to achieve their compression; they skip the entropy coding stage. And while you won’t get a big compression ratio out of them, they have very fast decompression: at the time of writing, lz4 boasts a staggering 5 GB/s per-core decode speed. Most other compressors pretty much always use some kind of LZ transform as well; gzip, for example, uses LZ together with Huffman coding.
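To show just how simple the decode side is, here is a sketch of a toy LZ decoder. The token format (a chunk of literals followed by an offset/length match) is made up for the example; it is not the real lz4 or DEFLATE format.

```python
def lz_decode(tokens):
    """Each token is (literals, offset, length): append the literals,
    then copy `length` bytes starting `offset` bytes back in the output."""
    out = bytearray()
    for literals, offset, length in tokens:
        out += literals
        for _ in range(length):       # byte-by-byte so overlapping,
            out.append(out[-offset])  # RLE-style copies still work
    return bytes(out)

# "abcabcabcabcXYZ": emit "abc" once, then copy 9 bytes from 3 bytes back.
tokens = [(b"abc", 3, 9), (b"XYZ", 0, 0)]
print(lz_decode(tokens))  # b'abcabcabcabcXYZ'
```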
BWT
This transform does not reduce the size of the message in any way; instead, it reorders the message in a reversible way.
For example, take the message ABCABC$. We write out every rotation of it (unsorted) and then sort those rotations:
unsorted sorted
| ABCABC$ | | $ABCABC |
| BCABC$A | | ABC$ABC |
| CABC$AB | | ABCABC$ |
| ABC$ABC | => | BC$ABCA |
| BC$ABCA | | BCABC$A |
| C$ABCAB | | C$ABCAB |
| $ABCABC | | CABC$AB |
The $ character is there to signal an EOF, because without it we could not tell which row of the matrix is the original message.
To be able to decode, we also need to keep track of which row of the sorted matrix holds the original message; in this case, it’s the third row.
But after that, we can take the transformed string, which is the last column of the sorted matrix (CC$AABB here), and compress it using RLE or LZ77.
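Here is a naive sketch of both directions; it materializes every rotation, while real implementations use suffix arrays, and it relies on the $ marker to find the original row instead of storing the row index.

```python
def bwt_encode(text, eof="$"):
    """Sort all rotations of text+EOF and keep the last column."""
    text += eof
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)

def bwt_decode(last_column, eof="$"):
    """Rebuild the sorted matrix one column at a time, then find the row ending in EOF."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(last_column[i] + table[i] for i in range(len(last_column)))
    original = next(row for row in table if row.endswith(eof))
    return original[:-1]

encoded = bwt_encode("ABCABC")
print(encoded)              # "CC$AABB"
print(bwt_decode(encoded))  # "ABCABC"
```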
To see how the transform produces output where the same characters very often end up next to each other, play around with this site.
Copy some paragraph from this website and paste it there, you’ll be surprised how well this transform groups together characters.
This is because of the lexicographic sort: if a message contains many occurrences of a word like “the”, every single one of them will end up grouped together after the rotations are sorted.
Because sorting is not exactly fast, usage of BWT is quite low; you only find it in things like bzip2 or programs that compete for the Hutter Prize.
That does not mean people fail to recognize how great this transform is: projects like bzip3 use BWT and claim impressive results.
People like Charles Bloom even wrote about how to speed up the inverse transform of BWT, so if you’re interested in coding a semi-serious implementation have a look.
How to use these parts
Right now you might think that you can easily mix and match the different parts: first apply a BWT, then an LZ77, and after that perform an entropy coding stage using ANS. And you’re right, but don’t be fooled into thinking that designing data compression algorithms is easy. DEFLATE, for example, uses values above 256 in the Huffman-decoded stream to indicate that an LZ77 match follows. lz4 emits, for each entry, an array of literals followed by an LZ match. zstd compresses the weights of a canonical Huffman tree using FSE, uses that tree to decode four streams that it stitches back together, and then decodes LZ matches (themselves encoded with FSE) and applies them to the stitched data. As you can see, the amount of complexity and clever tricks grows and grows, and as with any architecture, the decisions are not arbitrary. For field-tested compressors these choices were carefully selected to align with each compressor’s goals. If you want to understand this more deeply, try implementing an lz4 decoder from its spec; it’s really easy and great fun.