Relation of Word Order and Compression Ratio and Degree of Structure

Having a habit of compulsively wondering approximately every 34.765th day how zip compression (bzip2 in this case) might be used to measure the information contained in data – this time the question that popped into my head was whether, and if so how, permuting a text's words would affect its compression ratio. The answer is – it does – and it does so in a very weird and fascinating way.

Lo and behold James Joyce’s “A Portrait of the Artist as a Young Man” and its peculiar distribution of compression ratios …

[Figure: distribution of compression ratios for randomly permuted word orders of "A Portrait of the Artist as a Young Man"]

So obviously the compression ratio is indeed affected by word order and even comes along with a very nice-looking distribution. But the best part is that the compression ratio of the original word order is effectively statistically mega-super-über-hyper-significant! The same is true for all the books' texts I checked and, I guess, it is true for every natural-language text in the world. Intuitively I thought that maybe there is a distribution shape which is typical for text, but from what I see, they are pretty different from content to content. It even seems to me that there is a resemblance of shape per author ("authorship attribution"). Though Ulysses – the literary troublemaker of the last century – is of course not playing along nicely. Nonetheless – Twain features a plateau, the other two Joyces a double hill and Dickens a bell shape (see six-pack below).
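The significance claim can be made concrete with a one-sided permutation test: count how many random word orders compress at least as well as the original. A minimal Python sketch of that idea (the analysis in the post itself was done in R; the function names here are my own):

```python
import bz2
import random

def compression_ratio(text):
    # bzip2-compressed size divided by original size (smaller = more compressible)
    data = text.encode("utf-8")
    return len(bz2.compress(data)) / len(data)

def empirical_p_value(words, n_perm=200):
    # Fraction of random word permutations whose compression ratio is
    # at most that of the original order (one-sided permutation test).
    original = compression_ratio(" ".join(words))
    hits = sum(
        compression_ratio(" ".join(random.sample(words, len(words)))) <= original
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)  # add-one smoothing avoids a p-value of exactly 0
```

For a real book the estimate is of course limited by n_perm; the extreme significance shows up as the original ratio lying far outside the whole permutation distribution.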

Compression Ratio as a Measure for Structure

The difference between a normal text and a permutation of it is of course that the permutation kills its structure. This makes me wonder whether it would be possible to measure a text's degree of structure by relating its original compression ratio to its distribution of compression ratios. E.g.:

\text{degree of structure} = \frac{\text{original compression ratio}}{\text{1st decile of distribution of compression ratios}}

A low DOS value would indicate a high degree of structure, because it means the original compression ratio lies far below even the lower boundary of the distribution of compression ratios for randomly permuted word orders. Looking at the DOS column of the table below, this interpretation seems reasonable when we compare the degree for, e.g., Dickens vs. Joyce – assuming that Dickens narrates rather straightforwardly, as opposed to Joyce, who is notorious for his demanding literary style.
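The proposed measure can be sketched in Python (rather than the R used in the post; the number of permutations and the decile computation are my own assumptions):

```python
import bz2
import random

def compression_ratio(text):
    # bzip2-compressed size divided by original size
    data = text.encode("utf-8")
    return len(bz2.compress(data)) / len(data)

def degree_of_structure(words, n_perm=100):
    # DOS = original compression ratio divided by the 1st decile of the
    # compression-ratio distribution over random word permutations.
    original = compression_ratio(" ".join(words))
    ratios = sorted(
        compression_ratio(" ".join(random.sample(words, len(words))))
        for _ in range(n_perm)
    )
    first_decile = ratios[n_perm // 10]
    return original / first_decile
```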

[Figure: six-pack of compression-ratio histograms for the six books]

Technicalities

All texts are taken from Project Gutenberg, relieved of sections not belonging to the book itself, and reduced to the letters a to z and A to Z by replacing anything else with a space. The result is split at whitespace and all “words” of length 1 are discarded. This vector of words or string atoms is then permuted using sample().
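In Python (instead of the R used here), the preprocessing and permutation could be sketched like this – random.sample takes the role of R's sample():

```python
import random
import re

def words_from_text(text):
    # Replace anything that is not a-z or A-Z with a space,
    # split at whitespace and discard all "words" of length 1.
    cleaned = re.sub(r"[^a-zA-Z]", " ", text)
    return [w for w in cleaned.split() if len(w) > 1]

def permuted(words):
    # random.sample with k = len(words) returns a shuffled copy
    return random.sample(words, len(words))
```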

The compression applied is of type bzip2 and is performed using memCompress(), with the calculation of the compression ratio implemented as follows:


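The original R snippet did not survive here; a Python equivalent of the memCompress()-based calculation, using the standard bz2 module, might look like this (I'm assuming the ratio is compressed size divided by original size, so smaller means better compression – consistent with "higher ratio = less compression efficiency" further down):

```python
import bz2

def compression_ratio(words):
    # Join the word vector back into one string, bzip2-compress it
    # and divide the compressed size by the original size.
    raw = " ".join(words).encode("utf-8")
    return len(bz2.compress(raw)) / len(raw)
```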
(original article published on www.joyofdata.de)

6 thoughts on “Relation of Word Order and Compression Ratio and Degree of Structure”

  1. Interesting results, but as expected.

    I’ve actually done something very similar (comparing random sorts of lines of my website) in http://www.gwern.net/Archiving%20URLs#sort—key-compression-trick but I didn’t use Bzip2, I used xz because that seems to be a smarter algorithm on text and I think that yields more meaningful results. What happens if you try xz? Or better yet, zpaq on maximal settings?

    (I’ve also tried considering the length/difficulty of different fictional works by comparing not their wordcounts but their compressed sizes: http://www.gwern.net/Book%20reviews#textual-lengthcomplexity But I’m not sure that’s relevant to your points here.)

    • Hi Gwern,

      thanks for your input!

      Would you elaborate on why the original word ordering is that significant (consistently at level 10⁻¹⁰ – to give a rule of thumb estimate) and what the shape of the distribution might depend on? Or is “expected” referring just to the result that word ordering affects compression?

      What makes xz and zpaq “smarter”? (not in detail, just an intuitive take on it)

      As you suggest, the compression ratio itself is, for the given specimens, also related to the content’s difficulty: as you can take from the table, the original compression ratio is higher (less compression efficiency) for Dubliners and Ulysses than for Huckleberry Finn and David Copperfield. This observation I found quite expectable, which is why I didn’t mention it.

      BTW your site seems quite … interesting … almost mesmerizing … reminds me somehow of isomorphisms. Will definitely have a closer look!
      Also interesting that you run mirrors for Silk Road … what’s on sale right now? ;)

      • > Would you elaborate on why the original word ordering is that significant (consistently at level 10⁻¹⁰ – to give a rule of thumb estimate)

        This doesn’t surprise me, because from my own experiment I know the loss from randomizing order is substantial, and so running on a whole book of text could be expected to yield extremely low p-values (how many lines is a book, 10k?).

        > and what the shape of the distribution might depend on?

        Now that I have no idea about.

        > Or is “expected” referring just to the result that word ordering affects compression?

        Yes.

        > What makes xz and zpaq “smarter”? (not in detail, just an intuitive take on it)

        It’s been a long time since I read Mahoney’s data compression book (_Data Compression Explained_ http://mattmahoney.net/dc/dce.html), so offhand I don’t have any rigorous definition of how xz (LZMA https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Markov_chain_algorithm) improves on bzip2, and how zpaq improves on xz. I think the short answer for the latter is ‘zpaq is crazy and allows for things like neural networks as part of the compression process’.

        > Because as you can take from the table the original compression ratio is higher (less compression efficiency) for Dubliners and Ulysses than for Huckleberry Finn and David Copperfield. This observation I found quite expectable which is why I didn’t mention it.

        I actually didn’t notice that, but that is in line with my expectations from comparing _Umineko_ to _Wheel of Time_ and Shakespeare – it certainly would be surprising if _Ulysses_ compressed better than the others! I’m surprised _Dubliners_ was the highest of the normal books since I don’t remember the language being complex.

        > Also interesting that you run mirrors for Silk Road … what’s on sale right now? ;)

        There’s a 2-for-1 sale on early-finalization scams which has been going gangbusters for a month or two now…

        • > … and so running on a whole book of text could be expected to yield extremely low p-values (how many lines is a book, 10k?).

          Well, that merely offers the mathematical potential but does not at all imply that you’re going to see that significance in the wild – and consistently at that. When I look at those distributions and how far off the original CR is, I am totally fascinated, because it means there is a pattern in language so simple that bzip2 can pick up on it. That is actually also a reason why I think bzip2 is a good choice. Of course a compression algorithm applying NLP technology would show this performance – it would not be surprising at all. But bzip2 is – as you say – comparatively simple. That offers the potential to think about what is going on. An ANN would place a black box between you and insight.

          Additionally, the shapes of the distributions cue me into believing that something quite interesting is going on underneath.
