The Simple Beauty of XOR Floating Point Compression

I recently implemented a small program to visualize the inner workings of a scheme that compresses floating point timeseries by XORing subsequent values. The resulting visualizations are quite neat and made it much easier for me to understand this beautiful algorithm than any of the explanations that I had previously encountered.


The algorithm

The algorithm is simple. (This particular version of XOR floating point compression was first described in the Gorilla paper and is often referred to as “Gorilla compression”.) We start by writing out the first floating point number in full. All subsequent numbers are XORed with the previous number and then encoded in one of three ways:

  1. We write a single 0₂ bit. This indicates that the number is identical to the previous number.
  2. We write the two-bit sequence 11₂. We write a 5-bit integer that indicates the number of leading zeros in the XORed value. We write a 6-bit integer that indicates the number of bits in the subsequence that starts at the first 1 bit and ends at (and includes) the last 1 bit. We then write out this subsequence, omitting all leading and trailing zeros.
  3. We write the two-bit sequence 10₂. We then immediately write out the subsequence of the XORed value that has the same starting index and length as the last subsequence we wrote.

How do we choose between (2) and (3)? The advantage of (3) is that we omit the 5+6=11 bits we would otherwise require for storing the number of leading zeros and the length of the sequence, but we can only use (3) if our nonzero subsequence fits within the window defined by the previous offset and length values. The Gorilla paper proposes a simple heuristic: if the previously used subsequence of bits includes all nonzero bits, we go with (3) and omit writing out the offset and length bits. If not, we go with (2) and write out a new offset and length. Note that this heuristic minimizes the number of times we have to write out the 11 bits of offset/length values, but doesn’t necessarily give optimal compression, since we might be writing much longer subsequences than necessary on subsequent values (more on that later).
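To make this concrete, here is a minimal sketch of the encoding loop in Rust. It is an illustration of the scheme, not the exact implementation linked below, and it assumes a `BitWriter` with a `write_bits(value, count)` method that writes the `count` low bits of `value` (a naive version is sketched in the Performance section further down):

```rust
fn encode(values: &[f64], w: &mut BitWriter) {
    let mut prev = values[0].to_bits();
    w.write_bits(prev, 64); // the first value is written out in full

    // Window of meaningful bits from the last case-(2) encoding.
    let mut window_leading: u32 = 0;
    let mut window_length: u32 = 0;

    for &value in &values[1..] {
        let bits = value.to_bits();
        let xor = bits ^ prev;
        prev = bits;

        if xor == 0 {
            w.write_bits(0, 1); // case (1): identical to the previous value
            continue;
        }

        let leading = xor.leading_zeros();
        let trailing = xor.trailing_zeros();

        if window_length > 0
            && leading >= window_leading
            && 64 - trailing <= window_leading + window_length
        {
            // Case (3): the nonzero bits fit inside the previous window,
            // so we reuse its offset and length.
            w.write_bits(0b10, 2);
            w.write_bits(xor >> (64 - window_leading - window_length), window_length);
        } else {
            // Case (2): write a new window (5-bit leading-zero count,
            // 6-bit length minus one, then the nonzero bits).
            window_leading = leading.min(31); // the 5-bit field caps at 31
            window_length = 64 - window_leading - trailing;
            w.write_bits(0b11, 2);
            w.write_bits(window_leading as u64, 5);
            w.write_bits((window_length - 1) as u64, 6);
            w.write_bits(xor >> trailing, window_length);
        }
    }
}
```

The decoder mirrors this exactly: it reads the control bits and then either repeats the previous value, reads a subsequence into the stored window, or reads a new offset and length first.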

While the algorithm is simple, it’s not immediately obvious from its description why it should lead to good compression. Case (1) is easy: repeated values are compressed to a single bit. But why does slicing off leading and trailing zeros of the XOR of subsequent floating point values work well? Let’s apply the algorithm to some real data and see what it looks like!

[Interactive visualization: decimal value, 64-bit float representation, XOR with the previous value, and encoded output for each data point]

The first column shows the decimal representation of the timeseries we are compressing. This particular timeseries happens to measure the time spent in a particular code section of enn-trainer on each iteration of a training run.

The second column is the 64-bit floating point representation of the same numbers, with the sign bit in red, exponent in green, and mantissa in blue. Note that the exponent bits are identical for almost all of the numbers in the series, which is expected when values vary by a lot less than 2x. Note also that the numbers have a lot of trailing zero bits, which is not something that’s obvious from the decimal representation. It looks like these timings were recorded at microsecond resolution, which means that values smaller than 0.02 all fit into just 15 bits (0.02s is 20,000µs, which is less than 2^15 = 32,768).
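You can inspect this layout directly with Rust’s `f64::to_bits`. A small sketch (the two values are made up, but have the same magnitude as the timings above):

```rust
fn main() {
    // Two made-up values of the same magnitude as the timings above.
    for x in [0.017467_f64, 0.018319] {
        let bits = x.to_bits();
        let sign = bits >> 63;
        let exponent = (bits >> 52) & 0x7ff; // 11 bits
        let mantissa = bits & ((1u64 << 52) - 1); // 52 bits
        // Both values fall in the same binade, so the exponent fields match.
        println!("sign={sign} exponent={exponent:011b} mantissa={mantissa:052b}");
    }
}
```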

The third column shows the XOR of each value with the previous value, with the region of nonzero bits highlighted in yellow. The XORed values have many leading zeros (because the sign bit and exponent don’t change on subsequent values) and many trailing zeros (because the original values already shared a lot of trailing zeros). These regular long runs of leading and trailing zero bits are what makes the compression work!
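The two counts the encoder works with fall straight out of standard integer intrinsics on the XORed bits; for example (again with made-up values):

```rust
fn main() {
    // Made-up pair of adjacent values.
    let (a, b) = (0.017467_f64, 0.018319_f64);
    let xor = a.to_bits() ^ b.to_bits();
    println!("{xor:064b}");
    // A zero XOR would be handled by case (1); otherwise the encoder
    // only needs these two counts:
    println!("leading zeros:  {}", xor.leading_zeros());
    println!("trailing zeros: {}", xor.trailing_zeros());
    println!("window length:  {}", 64 - xor.leading_zeros() - xor.trailing_zeros());
}
```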

The fourth column shows the bits output by the algorithm. On the first row, we simply have the full bit representation of the first number without any compression. On the second row, we first write the 11₂ control bits, the number of leading zeros (8₁₀ = 01000₂), the length of the nonzero region minus 1 (17₁₀ = 010001₂), and the nonzero bits 111111110110110011. (We subtract 1 from the length because otherwise we might overflow the 6 bits allocated for the length when all bits are different and the length is 64. This is sound because the length is always at least 1.) On the third row, the region of nonzero bits falls into the same window, so we instead write the 10₂ control bits and the nonzero subsequence 111111100100111000. Note that this subsequence has three trailing 0s, but these 3 bits still take up less space than the 11 bits we would need to write out a new offset and length. We continue to omit the offset and length until the ninth row, where we reset to a new offset and length, since the region of nonzero bits in the XOR does not fit within the previously used window.

This entire series (including more values not shown above) is compressed by a factor of 3.03. Even if we compare to just using 32-bit single precision floating point values to represent these numbers, we’re still compressing by a factor of 1.51.

See here for a simple implementation of both the encoding and decoding algorithms in less than 100 lines of Rust (not including the BitReader/BitWriter structs). It is adapted from https://github.com/jeromefroe/tsz-rs, which has somewhat more complicated code but much better comments. My implementation contains two additional options: lossy compression that masks out some portion of the least significant bits in the mantissa, and a simple tweak, described in the next section, that can significantly improve compression for some time series.

Tweaking the algorithm

As we noted in the previous section, the Gorilla compression algorithm uses a particular heuristic for resetting the window size. It greedily minimizes the number of window resets, which works well in practice most of the time, but is not guaranteed to yield maximum compression. In particular, there is one pathological failure mode: if a series contains an outlier value that requires a very large window to encode, and all subsequent values have nonzero bits only in a much smaller subwindow, the inefficient larger window gets locked in, because we never hit the condition that would trigger a reset of the window size. Here’s an example of this happening:

[Visualization: an outlier value locks in an oversized window]

As a simple fix, I added a “regret” counter that keeps track of how many leading and trailing zeros we have encountered in the subsequences the algorithm outputs. If this sum exceeds some threshold, we force a reset of the window. A “max regret” threshold of 100 seems to work quite well in practice, substantially improving compression for some series while only slightly worsening compression on others. Here is what compression of the same timeseries looks like with a “max regret” of 100.

[Visualization: the same series compressed with a “max regret” threshold of 100]
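In code, the tweak is little more than some extra bookkeeping around case (3). A sketch of the decision, with names matching the encoder sketch above (`MAX_REGRET` and the exact accounting are illustrative and may differ from my implementation):

```rust
const MAX_REGRET: u32 = 100;

// Decide whether to reuse the previous window (case 3) or force a reset
// (case 2), tracking "regret": the number of zero bits we write out
// because the reused window is larger than the nonzero region.
fn reuse_window(xor: u64, window_leading: u32, window_length: u32, regret: &mut u32) -> bool {
    let leading = xor.leading_zeros();
    let trailing = xor.trailing_zeros();
    let fits = window_length > 0
        && leading >= window_leading
        && 64 - trailing <= window_leading + window_length;
    if fits && *regret < MAX_REGRET {
        // Count the wasted zero bits at both ends of the reused window.
        *regret += (leading - window_leading)
            + (window_leading + window_length) - (64 - trailing);
        true
    } else {
        *regret = 0; // the window is reset, so the regret counter is too
        false
    }
}
```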

I haven’t given this too much thought, so there’s a good chance there exist better and more principled heuristics (EDIT: Hacker News commenter EdSchouten has a neat approach). It would also be interesting to figure out the most efficient algorithm for finding the optimal choices of when to reset the window, and to what range, when shown the entire timeseries. This would be useful in cases where you care much more about minimizing compressed size than about compression throughput (decompression could still use the same algorithm and would have similar throughput as with the heuristic compression).

Performance

My naive implementation achieves a throughput of more than 100MiB/s for encoding and 200MiB/s for decoding random floats on an i7-10875H (more compressible data achieves more than 200MiB/s encode, 350MiB/s decode). Not amazing, but pretty decent for such little code, and still a lot faster than most network connections (my main use case for this was reducing the size of network transfers). There’s a SIMD Rust implementation that achieves multiple GB/s, though it seems to use unstable features that need some updating to make it compatible with the most recent version of Rust.

Profiling the naive implementation reveals that it spends most of its time in the BitWriter/BitReader methods.
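For reference, a naive bit writer along these lines might look like the following (a sketch, not my actual code):

```rust
/// Naive bit writer: packs bits MSB-first into a byte buffer,
/// one bit per loop iteration.
pub struct BitWriter {
    bytes: Vec<u8>,
    used: u32, // bits already used in the last byte (0..8)
}

impl BitWriter {
    pub fn new() -> Self {
        BitWriter { bytes: Vec::new(), used: 0 }
    }

    /// Write the `count` least significant bits of `value`, most significant first.
    pub fn write_bits(&mut self, value: u64, count: u32) {
        for i in (0..count).rev() {
            if self.used == 0 {
                self.bytes.push(0);
            }
            let bit = ((value >> i) & 1) as u8;
            *self.bytes.last_mut().unwrap() |= bit << (7 - self.used);
            self.used = (self.used + 1) % 8;
        }
    }
}
```

Processing a single bit per iteration like this is exactly why these methods dominate the profile.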

Swapping these out for bitbuffer and adding some other small optimizations gets us to 370MiB/s encode and 1GiB/s decode. I tried bitter as well: an early (broken) prototype that didn’t correctly handle reads of more than 56 bits reached > 2GiB/s decode even in auto mode, and a (probably correct) version was still something like 30% faster than my initial bitbuffer implementation at > 1GiB/s.

More examples

Finally, here’s a collection of different timeseries for your viewing pleasure. Compression for all of these uses the default “max regret” threshold of 100. You can reproduce them by running cargo run --example gorilla_time -- --verbose at 4cfbda210. All the timeseries here contain 246 values in total; only the first few are shown.

Low-bit floating point

Values in this time series are mean values calculated by dividing a small integer by a power of two, which results in very few significant bits (5.68x compression).
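You can see why such values compress well with a quick check (the value below is made up but has the same structure):

```rust
fn main() {
    let x = 13.0_f64 / 32.0; // a small integer divided by a power of two
    let mantissa = x.to_bits() & ((1u64 << 52) - 1);
    // 13/32 = 1.101₂ × 2⁻², so only the first few mantissa bits are nonzero.
    println!("{mantissa:052b}"); // 101 followed by 49 zeros
    println!("trailing zeros in the mantissa: {}", mantissa.trailing_zeros());
}
```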

[Visualization]

Highly redundant integers

Values in this series report the average throughput since program start rounded to an integer, which causes the timeseries to mostly converge to a single value as time goes on (13.5x compression).

[Visualization]

Incrementing integers

An integer that increments by 1 at every timestep. Treating this as an integer rather than a float and using double delta compression would be much more effective, but the XOR algorithm still manages to reduce this to just over one byte per data point (6.98x compression).
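For comparison, double delta (delta-of-delta) encoding stores the first value, the first delta, and then only the change in delta at each step, which is zero for a counter that increments by a constant. A quick sketch:

```rust
// Sketch of delta-of-delta encoding on integers.
fn double_delta(values: &[i64]) -> Vec<i64> {
    let mut out = Vec::with_capacity(values.len());
    let mut prev = 0i64;
    let mut prev_delta = 0i64;
    for (i, &v) in values.iter().enumerate() {
        match i {
            0 => out.push(v),
            1 => out.push(v - prev),
            _ => out.push((v - prev) - prev_delta),
        }
        if i > 0 {
            prev_delta = v - prev;
        }
        prev = v;
    }
    out
}

fn main() {
    let series: Vec<i64> = (100..110).collect();
    // An incrementing counter becomes all zeros after the first two entries:
    println!("{:?}", double_delta(&series)); // [100, 1, 0, 0, 0, 0, 0, 0, 0, 0]
}
```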

[Visualization]

Unix timestamps

This series is a sequence of unix timestamps with ~microsecond precision. Our choice of using just 5 bits to store the number of leading 0s fails us here, since it limits us to at most 2^5 - 1 = 31 leading zeros. This series would also benefit from being stored as an integer and compressed with double delta encoding (1.80x compression).
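The failure is easy to reproduce (the timestamps below are made up but have roughly the right magnitude and spacing):

```rust
fn main() {
    // Made-up microsecond-precision unix timestamps.
    let a: f64 = 1_700_000_000.000001;
    let b: f64 = 1_700_000_000.000002;
    let xor = a.to_bits() ^ b.to_bits();
    // Far more than the 31 leading zeros a 5-bit field can express:
    println!("leading zeros: {}", xor.leading_zeros());
}
```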

[Visualization]

Noisy single precision float

This is a fairly typical floating point time series, which records the gradient norm at each iteration of a machine learning run. Most of the significant bits are incompressible noise. We still get some compression because the exponent bits don’t change and the numbers only use single precision, leaving the least significant 29 bits of the mantissa unused (2.37x compression).
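The 29 unused mantissa bits are easy to verify (example value made up):

```rust
fn main() {
    let x = 0.123456789_f32; // a single precision value...
    let mantissa = (x as f64).to_bits() & ((1u64 << 52) - 1); // ...widened to f64
    // An f32 mantissa has 23 bits; in the 52-bit f64 mantissa the
    // bottom 52 - 23 = 29 bits are therefore always zero.
    println!("trailing zeros: {}", mantissa.trailing_zeros()); // at least 29
}
```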

[Visualization]

We can save a tiny amount of extra space if we treat it as a single precision floating point number to begin with, since in that case we need one less bit to store the length of the nonzero subsequence. This makes very little difference here because the length field is stored rarely (2.41x compression).

[Visualization]

If we’re willing to compress lossily, we can get significantly better compression by reducing the size of the mantissa even further, to 4 bits (8.48x compression).
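Masking the mantissa is a one-liner on the bit representation. A sketch, with the number of kept bits as a parameter (this mirrors the lossy option mentioned above, though the exact interface in my implementation may differ):

```rust
/// Zero out all but the `keep` most significant mantissa bits of an f64.
fn truncate_mantissa(x: f64, keep: u32) -> f64 {
    let mask = !((1u64 << (52 - keep)) - 1);
    f64::from_bits(x.to_bits() & mask)
}

fn main() {
    let x = 0.7236181192_f64;
    // With a 4-bit mantissa the relative error is at most 2⁻⁴ ≈ 6%:
    println!("{} -> {}", x, truncate_mantissa(x, 4));
}
```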

[Visualizations: original and 4-bit mantissa]

8 thoughts on “The Simple Beauty of XOR Floating Point Compression”

  1. Pete

Cool idea. Have you tried using standard compression algorithms, like gzip or xz, on the pattern after you apply your XORs (center column in the screenshots above)? Many compressors are already adept at reducing long strings of 0-bits mixed with short bursts of higher entropy.

    1. bulat ziganshin

Their LZ77 parts are completely useless here, since they find only 2+ byte repetitions aligned to byte boundaries, and their entropy coding will be mostly useless since it also deals with byte-granular values. If you want to employ the usual byte-oriented algorithms, you are better off using plain XOR or XOR+transpose for preprocessing.

    1. clemenswinter Post author

      Oh very cool, thanks for the tip!

I gave it a quick try; I’m getting around 150MiB/s encode and 1.5GiB/s decode for random f64 with pco (my XOR implementation is around 370MiB/s encode, 1GiB/s decode). Compression ratios are often similar, but in some cases a lot better for pco. Will have to take a closer look later!

      Code: https://github.com/cswinter/LocustDB/commit/46edbce88f2aceb3f5c12da2fcc123486390ff58
      Compression ratios: https://gist.github.com/cswinter/853e53f58c5745bc84db3ddcbe6abf35

    1. clemenswinter Post author

Interesting idea. The details of the proposed algorithm and the analysis are a bit unclear to me. In the general case it’s not possible to losslessly convert floating point numbers into fixed point; it seems like this method would reduce precision on small values in series with large dynamic range? I’d also be curious to see how it performs across a wider corpus of timeseries data.

