The Lost Quarter-Century in Data Compression
Charles Bloom just posted a new blog entry: The Lost Huffman Paper. He found an optimization for faster Huffman decoding and later discovered it had been published thirteen years earlier in a paper by Alistair Moffat and Andrew Turpin: "On the Implementation of Minimum Redundancy Prefix Codes".
Knowledge becoming lost or overlooked is a frequent occurrence. Even if you're an expert in your field (and Charles certainly qualifies), there is so much to know and so much gets published that it's impossible to keep up with everything. Information overload is a kind of tax on progress.
What happens when a paper gets widely distributed and reaches the status of a seminal paper that is cited by everyone in the field? Surely, this is exactly the outcome we want and would run the least risk of impeding progress. Right? Surprisingly, in rare but important cases, it can actually have the opposite effect. Let me give you an example.
1951: David Huffman, a graduate student studying under Robert Fano (co-inventor of the Shannon-Fano coding method), is struggling with a term paper. He's trying to solve the problem of assigning codes to symbols in such a way that the more probable symbols (such as the letter 'E', which appears more frequently than the letter 'V') are assigned shorter codes. The goal is to do this optimally, so that the resulting codes are provably the shortest possible. Scientific American described it like this:
Huffman worked on the problem for months, developing a number of approaches, but none that he could prove to be the most efficient. Finally, he despaired of ever reaching a solution and decided to start studying for the final. Just as he was throwing his notes in the garbage, the solution came to him. “It was the most singular moment of my life,” Huffman says. “There was the absolute lightning of sudden realization.”
That epiphany added Huffman to the legion of largely anonymous engineers whose innovative thinking forms the technical underpinnings for the accoutrements of modern living—in his case, from facsimile machines to modems and a myriad of other devices. “Huffman code is one of the fundamental ideas that people in computer science and data communications are using all the time,” says Donald E. Knuth of Stanford University, who is the author of the multivolume series The Art of Computer Programming.
What David Huffman didn't know at the time was that his professor had already tried and failed to solve the same problem. Fano offered the term paper as an optional assignment that could substitute for the final exam. He certainly didn't expect Huffman to succeed. Fortunately, he didn't reveal this. From the same article:
Huffman says he might never have tried his hand at the problem—much less solved it at the age of 25—if he had known that Fano, his professor, and Claude E. Shannon, the creator of information theory, had struggled with it. “It was my luck to be there at the right time and also not have my professor discourage me by telling me that other good people had struggled with this problem,” he says.
Intimidation would have gotten in the way. He probably wouldn't even have attempted it had he known these distinguished scientists had already tried and failed.
I have a couple of anecdotes to share. They are not first-hand (I wasn't even born at the time), so they may not be true, but they are too entertaining not to share. I don't even recall where I first heard them, so please consider them strictly apocryphal.
The first: When David Huffman threw his notes in the trash, some of them scattered on the floor, and one, with diagrams of symbol-coding trees scribbled on it, happened to land upside-down. He instantly realized he had been approaching the problem from the wrong direction. The solution was obvious: work from the bottom up, not the top down! (A sketch of that bottom-up construction follows the second anecdote.)
The second: Claude Shannon, who had struggled with the same problem for a long time, was told of Huffman's solution. He responded almost with disbelief, "That's it? That's all there is to it?"
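To make the bottom-up idea concrete, here is a minimal sketch of the construction in Python (my own illustration, not Huffman's notation): start with every symbol as its own node and repeatedly merge the two least-frequent nodes, so the rarest symbols end up deepest in the tree and get the longest codes.

```python
import heapq

def huffman_code_lengths(freqs):
    """Bottom-up Huffman construction: repeatedly merge the two
    least-frequent nodes. Returns {symbol: code length in bits}."""
    # Each heap entry is (weight, tiebreaker, {symbol: depth so far}).
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}  # everything sinks one level
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

print(huffman_code_lengths({'E': 12, 'T': 9, 'A': 8, 'V': 1}))
# {'E': 1, 'T': 2, 'V': 3, 'A': 3} -- the rarer the symbol, the longer its code
```

A top-down splitter (the Shannon-Fano approach) has to guess how to divide the symbols at every step; the bottom-up merge never guesses, which is what makes it provably optimal.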
September 1952: David Huffman publishes "A Method for the Construction of Minimum-Redundancy Codes".
The problem of assigning optimal codes to symbols is not only solved but provably solved. Huffman's paper quickly becomes one of the most cited papers in the fields of information theory and data compression.
And then... nothing. For the next 25 years the data compression field goes silent and no significant progress is made. Why should there be any progress? There's a proof in Huffman's paper. It's optimal.
Finally, in 1977, Abraham Lempel and Jacob Ziv publish the first of two papers that jump-started modern data compression: "A Universal Algorithm for Sequential Data Compression". It was quickly followed by "Compression of Individual Sequences via Variable-Rate Coding".
These papers spawned two popular classes of compression algorithms known, respectively, as LZ77 and LZ78. The popular PKZip and my own Quantum compressor are both based on LZ77.
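For readers who haven't seen the idea, here is a toy greedy LZ77 sketch in Python (an illustration only; real implementations are far more elaborate): replace upcoming bytes that repeat earlier input with an (offset, length) reference into a sliding window, and emit literals otherwise.

```python
def lz77_compress(data, window=4096, min_match=3):
    """Greedy LZ77 sketch: emit ('match', offset, length) when the upcoming
    bytes repeat earlier input (matches may overlap the current position,
    as in real LZ77), otherwise emit ('lit', byte)."""
    i, out = 0, []
    while i < len(data):
        best_len, best_off = 0, 0
        for j in range(max(0, i - window), i):   # candidate match starts
            k = 0
            while i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_len, best_off = k, i - j
        if best_len >= min_match:
            out.append(('match', best_off, best_len))
            i += best_len
        else:
            out.append(('lit', data[i]))
            i += 1
    return out

print(lz77_compress(b'abcabcabcabx'))
# [('lit', 97), ('lit', 98), ('lit', 99), ('match', 3, 8), ('lit', 120)]
```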
What caused the field to go silent for so long? Twenty-five years with nothing to show for it is a very long time. I think there are at least two significant reasons:
1. Huffman's paper contained a proof that it had solved the problem. It's foolish to expend effort on a solved problem, and that discouraged other people from even thinking much about it.
2. The distinction between modelling probabilities and coding them was not appreciated in 1952. It's easy for a novice to conflate the two (any time you hear the phrase "Huffman compression" you're hearing that mistake). This was the leap that Lempel and Ziv made. Coding the probabilities is one thing, and Huffman's method addresses only that part of the problem. Modelling the probabilities is something else entirely, and it is where the hard work of compression really happens (see the sketch after this list).
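To illustrate the modelling/coding split with a toy example (my own sketch; the order-0 model and function names are purely illustrative): the model's only job is to assign probabilities to symbols, and the coder's only job is to turn those probabilities into bits. Huffman addressed the second job; Lempel and Ziv attacked the first.

```python
import math
from collections import Counter

def model_order0(data):
    """Modelling: estimate a probability for each symbol (here, a trivial
    order-0 frequency count; real compressors use far richer models)."""
    counts = Counter(data)
    total = len(data)
    return {sym: c / total for sym, c in counts.items()}

def ideal_code_cost(data, probs):
    """Coding: an ideal entropy coder spends -log2(p) bits per symbol.
    Huffman approximates this with whole bits; arithmetic coding can get
    arbitrarily close to it."""
    return sum(-math.log2(probs[sym]) for sym in data)

data = b'abracadabra'
probs = model_order0(data)
print(f"{ideal_code_cost(data, probs):.1f} bits vs {len(data) * 8} raw bits")
```

The same coder does better or worse depending entirely on the probabilities the model feeds it, which is why the modelling side is where the interesting work lives.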
There you have it: a ground-breaking paper is so successful that it brings progress to a halt. It makes you wonder what other well-known results are impeding progress today. What else are we missing?
Finally, although it's a minor point, Huffman coding isn't quite as optimal as it seems. It is optimal only among codes that spend a whole number of bits on each symbol, and it matches the entropy exactly only when the probabilities are powers of two. If you're willing to use fractional bits (arithmetic coding), you can approximate the probabilities more closely and do better.
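A toy calculation makes the gap visible (my own example, not from the papers above): for a very skewed two-symbol source, the best Huffman code still spends a whole bit per symbol, while an arithmetic coder can approach the entropy.

```python
import math

# Two-symbol source with p('a') = 0.95, p('b') = 0.05.
# A Huffman code must give each symbol at least one whole bit, so it
# costs 1.0 bit/symbol no matter how skewed the probabilities are.
p = {'a': 0.95, 'b': 0.05}
huffman_cost = 1.0  # best possible with whole-bit codes for two symbols
entropy = -sum(q * math.log2(q) for q in p.values())
print(f"Huffman: {huffman_cost:.3f} bits/symbol")
print(f"Entropy: {entropy:.3f} bits/symbol (what arithmetic coding approaches)")
```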
Comments
Useful dictionary-based compression schemes contain an entropy-encoder back end; this can be more or less something of the Huffman type, or it can be something else, such as arithmetic coding. The use of a very large dictionary window plus arithmetic coding is the reason 7zip compresses so much better than gzip, for instance.
Related -- once you 'know' a problem is solved, it can be much easier to work out the solution yourself -- I'm sure Huffman encoding has been assigned as an exercise in many a CS class over the last 50 years.
So, I guess this 'so good it crushes innovation' dynamic happens in the arts as well as the sciences...