Data compression
There is a very high chance that your browser downloaded this website as compressed data. I can say this with confidence because web standards define gzip as a standard compressor for web content. So what did compressing this webpage achieve? As with any data compression algorithm, it reduced the data’s size with no or very little data loss. Data compression is a bigger deal than you might imagine; some works conclude that compression is equivalent to intelligence. There are contests (the Hutter Prize) that ask only one question: “Can you compress this better?”. But how do data compressors work, what algorithms are used, what mathematics lies behind them, and why are people who believe we will achieve infinite compression just as wrong as people who believe free energy exists?
Claude Shannon and the definition of a bit
In the 1940s, while working at Bell Labs, Claude Shannon wrote a paper in which he devised a tool for measuring information. Up until that point, we didn’t know what information was or how to measure it, much like how we talk about consciousness today. Building on previous work, he defined information as a measure of probability and called it information entropy. The name itself is a little pompous, but the theory behind it is instrumental to understanding the limits of information transfer and data compression. Entropy is measured in bits, a term many people are accustomed to, for example when we talk about the speed of our internet connection.

To understand the idea we are going to follow three examples. In the first example, suppose we’re flipping a coin that has a 50% chance to land on heads and an equal chance to land on tails. How would you encode which outcome happened? Well, you could use just a single bit: store a zero when it lands heads and a one when it lands tails. To store that information we need just a single bit, so the entropy for that scenario is equal to 1 bit. Now let’s suppose that we are throwing a four-sided die with equal probabilities for each side. To store its result we would need 2 bits, one two-bit pattern per side (00, 01, 10, 11 -> 0, 1, 2, 3). But now consider something different: we randomly pick a year and want to know if it’s a leap year. Because a leap year happens roughly once every four years, the probabilities are 25% for a leap year and 75% for it not being a leap year. You might imagine that we need to store an entire bit to represent it, just as in the coin flip case (0 is a leap year, 1 is not). But the surprising fact is that we would need to store only around 0.811 bits on average.

Right about now you’re either intrigued and want to know more, or you think it’s just some stupid theory that has no real-world use case. How can you even store a non-integer number of bits in a computer? All I’ll say right now is that you can, and I’ll provide the evidence very soon. Okay, but you want to know where I got that number from, so here’s how to calculate it.
I’m not going to explain how and why it works; if you want to see the derivation, I recommend Khan Academy. Remember that this is the formula for the number of bits required on average while coding any alphabet:

$$H(X) = -\sum_{x \in X} p(x) \log_2 p(x)$$

To know how much information is in a single event (symbol), like a non-leap year appearing with 75% probability, we compute:

$$I(x) = -\log_2 p(x)$$

First example

This is our alphabet: two equally likely outcomes, heads and tails, each with probability 0.5.

$$H = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1 \text{ bit}$$

(Since the second example is very similar, I’m going to skip it for brevity.)

Third example

This is our alphabet: leap year with probability 0.25 and non-leap year with probability 0.75.

$$H = -(0.25 \log_2 0.25 + 0.75 \log_2 0.75) \approx 0.25 \cdot 2 + 0.75 \cdot 0.415 \approx 0.811 \text{ bits}$$
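If you prefer to see it as code, here is a quick sketch that recomputes the numbers from the three examples:

```python
import math

def entropy(probabilities):
    """Average bits per symbol for an alphabet with the given probabilities."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def information(p):
    """Bits of information carried by a single symbol with probability p."""
    return -math.log2(p)

print(entropy([0.5, 0.5]))                # coin flip       -> 1.0
print(entropy([0.25, 0.25, 0.25, 0.25]))  # four-sided die  -> 2.0
print(entropy([0.25, 0.75]))              # leap year       -> ~0.811
print(information(0.75))                  # a non-leap year -> ~0.415
```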
Randomness and order
From the third example, you can start to see how you might be able to take some data and reduce its size. You might also notice that if we have a two-symbol alphabet where each symbol is equally likely, just like in the first example, each symbol is going to require 1 bit to represent. That means if we have a stream of bits where each one is equally likely (purely random), we can’t compress it. There is some subtlety behind that statement, but first I want to prove two things: we can’t compress random data, and there is no universal compressor. I’m going to focus on the second one. What does it mean? It means that it’s impossible for any compressor to take any data stream and always compress it to something smaller. The proof is cool and short, and it’s called the counting proof.
Define $S_n$ as the set of all messages that are exactly $n$ bits long; there are $2^n$ of them. A lossless compressor has to map each of them to a different output, otherwise two different messages would decompress to the same thing. But there are only $2^0 + 2^1 + \dots + 2^{n-1} = 2^n - 1$ possible outputs shorter than $n$ bits, one fewer than we need. So at least one $n$-bit input must map to an output that is $n$ bits or longer, which means no compressor can shrink every input.
So how is it that we can compress anything? The messages that we can compress are redundant: they are ordered, and we can exploit this to reduce their size. Random messages must stay the same size (or even get bigger) in exchange for the ability to compress ordered ones. Order is a tiny fraction of the sea of randomness. If you want to read more about the mathematical theory at the basis of compression, or anything connected to compression in general, I recommend Matt Mahoney’s book Data Compression Explained.
How are compression algorithms structured?
Everybody and their grandma has heard about zip, which is probably the most popular compression format in use right now. It’s quite old: the algorithm that most often powers it was developed in the 1990s. The algorithm itself is called DEFLATE and, quoting Wikipedia:
In computing, Deflate is a lossless data compression file format that uses a combination of LZ77 and Huffman coding.
So what are those two things, LZ77 and Huffman coding? They form the most popular structure for an everyday compression algorithm: a data transform (LZ77) combined with an entropy coder (Huffman coding).
Entropy coding
The three most popular are:
- Huffman Coding
- Arithmetic Coding
- Asymmetric Number Systems (ANS)
These algorithms allow us to create a direct mapping where each symbol in the alphabet gets its own bit pattern. This mapping allows us to encode symbols that appear very often in the message with a smaller bit size. As we learned from information theory, if we have a stream where a single outcome (called a symbol in the compression world) has a very high probability, the information for that symbol is low. Another way to think about this is that your level of surprise is low when you see a symbol that is very likely to appear. Entropy coders assign a lower bit count to probable symbols, thus reducing the message size. In other words, entropy coders give each symbol a new bit size equal (or close) to the entropy of that symbol.
Huffman Coding
Huffman coding is the simplest entropy coder to understand and implement, and that low complexity translates into some of the highest, if not the highest, compression and decompression performance. We are going to build a binary tree where each leaf is a symbol, and the bit sequence assigned to that symbol is the path taken to reach it.
In the image above you can see a Huffman tree for an example message: the more frequent a symbol, the closer it sits to the root, and the shorter its code.
Building such a tree is extremely simple, storing it is relatively simple, and variants like Canonical Huffman Coding make working with these trees heavenly.
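To make the procedure concrete, here is a small sketch of the classic tree-building loop; the symbol frequencies are made up for the example and the codes are returned as bit strings:

```python
import heapq
import itertools

def huffman_codes(frequencies):
    """Build prefix codes by repeatedly merging the two least frequent nodes."""
    counter = itertools.count()  # tie-breaker so dicts are never compared
    heap = [(freq, next(counter), {sym: ""}) for sym, freq in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Going left prepends a 0 bit, going right prepends a 1 bit.
        merged = {s: "0" + code for s, code in left.items()}
        merged.update({s: "1" + code for s, code in right.items()})
        heapq.heappush(heap, (f1 + f2, next(counter), merged))
    return heap[0][2]

# Hypothetical frequencies: 'a' is the most common, so it gets the shortest code.
print(huffman_codes({"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}))
```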
Okay, so Huffman coding can’t do any wrong, it’s simple, proven, and high-performance.
Where’s the catch?
You might’ve spotted it already, but ask yourself this question: how would you encode a symbol to take a fractional number of bits, like the 0.415 bits of a non-leap year in the information entropy example above?
Well, you can’t.
Huffman coding is always going to spend a whole number of bits per symbol, roughly ceil(I(symbol)).
Even if your symbol is so frequent that it’s going to take 0.1 bits to encode, Huffman can’t do it.
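As a quick worked example (the 93% figure is just for illustration): a symbol with probability 0.93 carries

$$I = -\log_2 0.93 \approx 0.105 \text{ bits},$$

yet the shortest code Huffman can give it is a single bit, almost ten times the theoretical minimum.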
Is this a problem?
Not really. For everyday use we care more about performance than anything else, so Huffman coding is used in compressors like DEFLATE (found in almost everything: PNG, gzip, zip, and anywhere zlib is used), zstd, and brotli.
There are also compressors like lz4 and snappy that, for the sake of decode speed, skip the entropy coding stage entirely.
Arithmetic Coding
Arithmetic coding is the first coder we’ll look at that can spend a fractional number of bits on a symbol, so for a symbol that appears 7 times in 1000 symbols we would need $-\log_2(7/1000) \approx 7.16$ bits rather than the 8 bits an integer-length code would round up to.
I’ll present the simple example of coding individual bits and only briefly touch on the more expanded concept of coding multi-bit symbols.
The way we are going to code data is by narrowing down a range and picking a number inside it, so that when we decode we can tell which bit bin that number lies in and emit the corresponding bits.
Each bit bin’s width is equal to the probability of seeing that bit.
To make it click, let’s walk through an example with a handful of symbols and their probabilities. You can follow along with your finger: going from left to right we start with the range $[0, 1)$ and, for every symbol we encode, we shrink the current range down to that symbol’s bin inside it. When we run out of symbols, we pick any number inside the final range; the decoder repeats the same narrowing and simply checks which bin the number falls into at every step.
So far I explained this in floating point, but every single time you read about a compression strategy that involves floats, you can assume there will be a step that converts everything to integers, because integers are faster, less error-prone, offer higher resolution, and are generally easier to work with.
The same applies here: in the real world we map this $[0, 1)$ range onto an integer range (for example 32-bit), renormalizing as the range narrows.
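Here is a toy sketch of the idea, kept in floating point for readability (a real coder would use the integer mapping just described); the alphabet, probabilities, and message are made up:

```python
def build_bins(probabilities):
    """Map each symbol to its [low, high) bin inside [0, 1)."""
    bins, low = {}, 0.0
    for symbol, p in probabilities.items():
        bins[symbol] = (low, low + p)
        low += p
    return bins

def encode(message, bins):
    """Narrow [0, 1) once per symbol; any number in the final range works."""
    low, high = 0.0, 1.0
    for symbol in message:
        b_low, b_high = bins[symbol]
        width = high - low
        low, high = low + width * b_low, low + width * b_high
    return (low + high) / 2

def decode(code, bins, length):
    """Repeatedly find which bin the number falls into and re-narrow."""
    out = []
    for _ in range(length):
        for symbol, (b_low, b_high) in bins.items():
            if b_low <= code < b_high:
                out.append(symbol)
                code = (code - b_low) / (b_high - b_low)
                break
    return "".join(out)

probs = {"a": 0.6, "b": 0.3, "c": 0.1}    # hypothetical probabilities
bins = build_bins(probs)
code = encode("aababc", bins)
print(code, decode(code, bins, 6))        # -> the same "aababc"
```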
So why don’t we use this? BECAUSE IT IS SUPER SLOW! On generic data, this coder is around 5 times slower than Huffman coding. It does produce a better compression ratio, but the trade-off is most likely not worth it. Algorithms like the original bzip used it, but after that its history is kinda muffled. Want to know why? Because it got patented so hard that no one wanted any trouble, so no one used it. All of those patents have expired by now, but we still don’t use it because…
ANS (Asymmetric numeral systems)
ANS can achieve the same compression ratio as arithmetic coding while being at least twice as fast, or it can get very close to Huffman’s speed if we sacrifice just a little bit of compression ratio.
To understand asymmetric numeral systems, let’s first look at a symmetric numeral system. Consider an alphabet in which every symbol is equally likely, for instance the decimal digits 0 through 9. To encode a message we keep appending digits to a single number that acts as our state: encoding digit $d$ turns the state $x$ into $10x + d$. To decode, we peek at the last digit with $x \bmod 10$ and pop it with $\lfloor x / 10 \rfloor$. Every digit grows the state by the same $\log_2 10 \approx 3.32$ bits, which is only optimal when every digit really is equally likely.

Now let’s make it asymmetric by letting more probable symbols occupy more digit slots. If I write the following description string: AAB, then symbol A owns slots 0 and 1 while symbol B owns slot 2. Which slot to choose for A? There are two ways to encode the same symbol.

Before choosing, let’s do some information entropy math. The information in A (probability 2/3) is $\log_2(3/2) \approx 0.585$ bits, while the information in B (probability 1/3) is $\log_2 3 \approx 1.585$ bits. If we grow the state so that the new value modulo 3 lands in A’s slots, then after that modulus it’s either going to be zero or one, so either way it’s going to code the symbol A, and the state only grows by about 0.585 bits instead of a whole base-3 digit.
Now, to discover the equation that will allow us to encode any symbol from any alphabet, consider the following description: a string of length $M$ in which symbol $s$ occupies $f_s$ consecutive slots starting at position $c_s$, so that its probability is $f_s / M$. For the AAB example, $M = 3$, $f_A = 2$, $c_A = 0$, $f_B = 1$, $c_B = 2$. And this is our encoding table, listing the next state for every current state $x$:

| x       | 0 | 1 | 2 | 3  | 4  | 5  |
| C(x, A) | 0 | 1 | 3 | 4  | 6  | 7  |
| C(x, B) | 2 | 5 | 8 | 11 | 14 | 17 |

Our quest now is to find that function $C(x, s)$. There is a pattern here! We are always rescaling the state and then adding the symbol’s starting slot plus a remainder:

$$C(x, s) = M \left\lfloor \frac{x}{f_s} \right\rfloor + c_s + (x \bmod f_s)$$
To decode, peek at the last digit of the state in the base equal to the length of the description, $d = x \bmod M$. Figure out into which symbol’s range of slots this digit falls; that is the decoded symbol $s$. This last digit then gets popped from the state, which is scaled back down:

$$D(x) = \left( s,\; f_s \left\lfloor \frac{x}{M} \right\rfloor + (d - c_s) \right)$$
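Here is a minimal sketch of those two formulas in code, using an arbitrary-precision integer state for clarity; a real implementation would renormalize a fixed-width state instead. The frequencies are the hypothetical A/B example from above.

```python
def rans_encode(message, freqs):
    """Push symbols onto an integer state: C(x, s) = M*floor(x/f_s) + c_s + (x mod f_s)."""
    M = sum(freqs.values())
    starts, c = {}, 0
    for sym, f in freqs.items():
        starts[sym] = c
        c += f
    x = 0
    for sym in reversed(message):          # encode in reverse so decoding pops in order
        f, cs = freqs[sym], starts[sym]
        x = M * (x // f) + cs + (x % f)
    return x

def rans_decode(x, freqs, length):
    """Pop symbols back off the state: d = x mod M identifies the symbol's slot range."""
    M = sum(freqs.values())
    slots, starts = [], {}                  # slot index -> symbol lookup
    for sym, f in freqs.items():
        starts[sym] = len(slots)
        slots.extend(sym * f)
    out = []
    for _ in range(length):
        d = x % M
        sym = slots[d]
        x = freqs[sym] * (x // M) + (d - starts[sym])
        out.append(sym)
    return "".join(out)

freqs = {"A": 2, "B": 1}                    # A is twice as likely as B
state = rans_encode("AABAB", freqs)
print(state, rans_decode(state, freqs, 5))  # -> the same "AABAB"
```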
What we have just derived is a subset of the ANS coder family called range ANS (rANS); for an example implementation check out Fabian Giesen’s work. There are of course more types of ANS coders, like tabled ANS (tANS or FSE), where we can choose how the description’s slots are spread to achieve an even higher compression ratio. The tabled nature of that variant also allows it to run branchless while decoding. For more about tANS read Yann Collet’s work or Charles Bloom’s introduction to ANS. What’s interesting is that ANS contains Huffman coding within itself, so it’s a more generic approach to entropy coding (read about it here). If you’re really into it and want to understand everything, try giving the white paper a crack.
Data transforms
While entropy coding allows us to represent more frequent symbols using fewer bits, data transforms aim to deduplicate symbols and/or put the data into a form that allows better compression.
Let’s imagine that we want to send a message that is just the same chunk of symbols written four times in a row. Instead of spelling it out symbol by symbol, we can send the chunk once along with an instruction: repeat it four times.
If encoding this repetition takes less space than encoding the symbols directly, we have just reduced the size of the message beyond what a per-symbol entropy coder alone could achieve. Let’s look at some examples, in order of their computational complexity and compression performance:
- Run length encoding (RLE)
- Lempel-Ziv (LZ77 or variations)
- Burrows-Wheeler transform (BWT)
RLE
It’s the simplest possible data transform: each entry has two parts, the symbol and how many times to repeat it.
For example, a control byte of 0b00101000 would indicate that, in the next 8 entries, the two marked by set bits will have non-one run lengths encoded as well.
Sometimes you need just a touch of compression and you can’t be bothered to mess with full-blown compressors.
RLE is a perfect algorithm for that: its low complexity might be useful, for example, when trying to reduce the size of a binary.
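A minimal sketch of the plain (symbol, count) form, ignoring the control-byte variant described above:

```python
from itertools import groupby

def rle_encode(data):
    """Collapse each run of identical symbols into a (symbol, count) pair."""
    return [(symbol, len(list(run))) for symbol, run in groupby(data)]

def rle_decode(pairs):
    """Expand each (symbol, count) pair back into a run."""
    return "".join(symbol * count for symbol, count in pairs)

pairs = rle_encode("aaaabbbccd")
print(pairs)              # [('a', 4), ('b', 3), ('c', 2), ('d', 1)]
print(rle_decode(pairs))  # "aaaabbbccd"
```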
LZ77
Just like ANS is a generalization of Huffman coding, one might say that LZ77, under the right conditions, is a generalization of RLE. In LZ77 each entry codes an offset into the already decoded bytes and the number of bytes to copy starting from that offset. In data compression lingo, entries for LZ compressors are called matches, because some previously seen data matches the data we are about to emit.
You might notice my preference for talking about the decompression side; that’s because it’s easier. If you have a feeling that the decode step is way faster than the encode step, you are correct. The decompressor just blindly copies bytes as fast as it can, while the compressor has to make choices, like which match to use. The memory needed to compress is in most cases much higher than the memory needed to decompress. A compiler behaves the same way. An interesting way to look at compression is to imagine that compression is compilation while decompression is execution that regenerates the uncompressed data, where the decompression algorithm is the virtual machine or CPU and the compressed stream is the list of ops to execute.
Going back to the topic at hand, LZ is so simple that it lends itself to many, and by many I mean many, different versions. Each version adds something of its own to make it fit a given need better. Since data compression is a field where diminishing returns are just part of the game, LZ produces very good compression while offering simple decompression. One of the most widely known compressors, lz4, and its lesser-known sibling snappy use only an LZ transform to achieve their compression; they skip the entropy coding stage. And while you won’t get a big compression ratio out of them, they have very fast decompression: at the time of writing, lz4 boasts a staggering 5 GB/s per-core decode speed. Most other compressors pretty much always use some kind of LZ transform as well; gzip, for example, uses LZ together with Huffman coding.
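To show just how simple the decode side is, here is a sketch of a toy LZ decoder. The token format (a chunk of literals followed by an offset/length match) is made up for the example; it is not the real lz4 or DEFLATE format.

```python
def lz_decode(tokens):
    """Each token is (literals, offset, length): append the literals,
    then copy `length` bytes starting `offset` bytes back in the output."""
    out = bytearray()
    for literals, offset, length in tokens:
        out += literals
        for _ in range(length):       # byte-by-byte so overlapping,
            out.append(out[-offset])  # RLE-style copies still work
    return bytes(out)

# "abcabcabcabcXYZ": emit "abc" once, then copy 9 bytes from 3 bytes back.
tokens = [(b"abc", 3, 9), (b"XYZ", 0, 0)]
print(lz_decode(tokens))  # b'abcabcabcabcXYZ'
```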
BWT
This transform does not reduce the size of the message in any way; instead, it reorders the message in a reversible way.
For example, take the message ABCABC$. We write out every rotation of it (unsorted) and then sort those rotations:
unsorted sorted
| ABCABC$ | | $ABCABC |
| BCABC$A | | ABC$ABC |
| CABC$AB | | ABCABC$ |
| ABC$ABC | => | BC$ABCA |
| BC$ABCA | | BCABC$A |
| C$ABCAB | | C$ABCAB |
| $ABCABC | | CABC$AB |
The $ character is there to signal an EOF, because without it we could not tell which row of the matrix is the original message.
To be able to decode, we also need to keep track of which row of the sorted matrix holds the original message; in this case, it’s the third row.
But after that, we can take the transformed string, which is the last column of the sorted matrix (CC$AABB here), and compress it using RLE or LZ77.
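Here is a naive sketch of both directions; it materializes every rotation, while real implementations use suffix arrays, and it relies on the $ marker to find the original row instead of storing the row index.

```python
def bwt_encode(text, eof="$"):
    """Sort all rotations of text+EOF and keep the last column."""
    text += eof
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)

def bwt_decode(last_column, eof="$"):
    """Rebuild the sorted matrix one column at a time, then find the row ending in EOF."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(last_column[i] + table[i] for i in range(len(last_column)))
    original = next(row for row in table if row.endswith(eof))
    return original[:-1]

encoded = bwt_encode("ABCABC")
print(encoded)              # "CC$AABB"
print(bwt_decode(encoded))  # "ABCABC"
```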
To see how the transform produces output where the same characters very often end up next to each other, play around with this site.
Copy some paragraph from this website and paste it there, you’ll be surprised how well this transform groups together characters.
This is because of the lexicographic sort: if a message contains many occurrences of a word like “the”, every single one of them will end up grouped together after the rotations are sorted.
Because sorting is not exactly fast, usage of BWT is quite low; you only find it in things like bzip2 or programs that compete for the Hutter Prize.
That does not mean people fail to recognize how great this transform is: projects like bzip3 use BWT and claim impressive results.
People like Charles Bloom even wrote about how to speed up the inverse transform of BWT, so if you’re interested in coding a semi-serious implementation have a look.
How to use these parts
Right now you might think that you can easily mix and match the different parts: first apply a BWT, then an LZ77, and after that perform an entropy coding stage using ANS. And you’re right, but don’t be fooled into thinking that designing data compression algorithms is easy. DEFLATE, for example, uses values above 256 in the Huffman-decoded stream to indicate that an LZ77 match follows. lz4 emits, for each entry, an array of literals followed by an LZ match. zstd compresses the weights of a canonical Huffman tree using FSE, uses that tree to decode four streams that it stitches back together, and then decodes LZ matches (themselves encoded with FSE) and applies them to the stitched data. As you can see, the amount of complexity and clever tricks grows and grows, and as with any architecture, the decisions are not arbitrary. For field-tested compressors these choices were carefully selected to align with each compressor’s goals. If you want to understand this more deeply, try implementing an lz4 decoder from its spec; it’s really easy and great fun.