April 04, 2008

Comparing communication efficiency across languages

In response to last week's post on comparative vocabulary size ("Ask Language Log: Comparing the vocabularies of different languages", 3/31/2008), a number of readers sent observations about a related but different topic, namely the comparative efficiency of communication. At least as measured by crude metrics such as bit counts, there are differences among languages that are not easy to explain.

Alex Baumans described a bilingual magazine's problems in equalizing space and word-count allocations between Dutch and French:

I read your discussion about the proportion of words of a language that is actually in use. A very thought provoking piece. My view is that any attempt to compare languages will fail, if the word formation rules of the languages differ too much.

I work as a journalist for an HVAC magazine in Belgium. As Belgium is bilingual, our publication exists in parallel Dutch and French versions. For obvious reasons, articles are supposed to be about as long in both languages. However, this provides endless problems, the Dutch text being on average 15-20% shorter, and the word count is way out.

One of the reasons (besides French orthography insisting on writing lots of letters that are not pronounced) is that Dutch, like German, can form compounds on the spot. Usually these are written together. Especially in technical terms, this is useful. A wall hung gas fired boiler is simply gaswandketel, as opposed to chaudière murale à gaz. How many words is that?

Similarly, if you accept these compounds as words, the upper limit to the number of Dutch words, is a lot less clear than, say, in French. Your example of Arabic and Spanish dealt with writing conventions and morphological variation. One can compensate for this in a way in a counting system. It becomes much more difficult if there are differences between languages as to what is exactly a word.

Alex's discussion of Dutch compounds underlines a point that I made in the earlier post, namely that spaces are not a very helpful way to define the boundaries of words, especially in comparisons across languages. But what I'd like to follow up on today is his observation about comparisons of word and character counts.

As discussed in a post a few years ago ("One world, how many bytes?", 8/5/2005), based on a variety of large collections of English-Chinese parallel texts, English texts are larger than their Chinese counterparts by a factor of between 1.37 and 2.27 before compression, or 1.19 to 1.41 after compression.
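For readers who want to try this kind of measurement on their own parallel texts, here is a minimal sketch. It is not the procedure used in the 2005 post: the function name size_ratios, the choice of bz2 as the compressor, and the default encodings (UTF-8 for English, GBK for Chinese) are all assumptions made for illustration.

```python
import bz2

def size_ratios(english: str, chinese: str,
                en_encoding: str = "utf-8",
                zh_encoding: str = "gbk") -> tuple[float, float]:
    """Return (raw_ratio, compressed_ratio) of English size to Chinese size.

    Sizes are byte counts. The encodings and the bz2 compressor are
    illustrative assumptions, not the settings of the original study.
    """
    en_bytes = english.encode(en_encoding)
    zh_bytes = chinese.encode(zh_encoding)
    raw_ratio = len(en_bytes) / len(zh_bytes)
    compressed_ratio = len(bz2.compress(en_bytes)) / len(bz2.compress(zh_bytes))
    return raw_ratio, compressed_ratio
```

The compressed ratio only means something on large collections. On short strings the compressor's fixed overhead dominates, and the choice of encoding (two bytes per Chinese character in GBK, typically three in UTF-8) shifts the uncompressed ratio considerably.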

My impression is that there are several different factors at work here -- but they don't seem to me to account fully for the differences in length, especially in comparing compressed texts.

Consider a crude estimate of differences in character-encoding efficiency in uncompressed texts. Whether GB or Big-5, I believe that fewer than 5,000 characters are actually used in these Chinese texts, representing less than 10% of the 65,536 that could be encoded in 16-bit characters. In the English texts, most of the 95 printable ASCII characters will be used, representing about 37% of the 256 possibilities afforded by 8-bit characters.
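The arithmetic behind this estimate can be sketched as follows. The helper name code_space_usage is hypothetical, and the second figure it returns (bits needed for a uniform choice among the symbols used, divided by bits spent) is an added way of looking at the same counts rather than part of the original estimate.

```python
import math

def code_space_usage(symbols_used: int, bits_per_symbol: int) -> tuple[float, float]:
    """Return (fraction of code points used, bits needed / bits spent).

    "Bits needed" assumes every symbol in use is equally likely, which
    real text does not satisfy; this is a crude upper bound.
    """
    fraction_used = symbols_used / 2 ** bits_per_symbol
    bit_efficiency = math.log2(symbols_used) / bits_per_symbol
    return fraction_used, bit_efficiency

# 5,000 is the assumed ceiling on distinct Chinese characters in use.
chinese = code_space_usage(5000, 16)
# 95 is the number of printable ASCII characters.
english = code_space_usage(95, 8)
```

Counted in code points, Chinese uses a much smaller share of its code space. Counted in bits, the gap narrows: about 12.3 of 16 bits for Chinese against about 6.6 of 8 bits for English, under the uniform-usage assumption.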

However, this difference should be eliminated by standard compression techniques, which will also take advantage of the entropy-reduction implicit in the non-uniform frequency of the characters as used.
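The entropy reduction from non-uniform character frequencies can be made concrete with the order-0 Shannon entropy of a text, that is, the average number of bits per character an ideal coder would need if it knew only the individual character frequencies. This is a minimal sketch with a hypothetical function name. Real compressors also exploit context, so they generally do better than this figure suggests.

```python
import math
from collections import Counter

def char_entropy(text: str) -> float:
    """Order-0 Shannon entropy of the character distribution, in bits per character."""
    counts = Counter(text)
    total = len(text)
    # Sum of -p * log2(p) over the characters that actually occur.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

A text that uses four characters equally often needs 2 bits per character by this measure. Any skew in the frequencies lowers the figure.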

English text puts spaces between words, and as a result, 15-20% of the characters in English text are white-space characters. Chinese doesn't bother with spaces, which should give it an advantage in concision. But again, compression techniques should eliminate most of this advantage, given that the spaces in English text are mostly redundant.
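One rough way to check both claims on a given English text is to measure the share of whitespace characters, then compare compressed sizes with and without the whitespace. The function names are hypothetical, and zlib stands in here for whatever compressor one prefers.

```python
import zlib

def whitespace_share(text: str) -> float:
    """Fraction of characters in the text that are whitespace."""
    if not text:
        return 0.0
    return sum(ch.isspace() for ch in text) / len(text)

def compressed_cost_of_spaces(text: str) -> float:
    """Fractional growth in compressed size when whitespace is kept.

    Compares zlib output for the text as given against the same text
    with all whitespace removed. zlib is an illustrative choice.
    """
    with_spaces = len(zlib.compress(text.encode("utf-8"), 9))
    stripped = "".join(text.split())
    without_spaces = len(zlib.compress(stripped.encode("utf-8"), 9))
    return with_spaces / without_spaces - 1
```

If the spaces are mostly redundant, the second figure should come out well below the first on a large sample. On short samples the comparison is unreliable.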

Chinese lacks the equivalent of the English articles "a" and "the". These are common words, but (even with their separating spaces) they only amount to about 5% of English text, so not very much of the difference can be blamed on them. The impact of plural marking and verbal inflection must also be quite small, no more than a couple of percent in uncompressed text -- and Chinese loses some of this ground because of obligatory classifiers.

So I remain puzzled about why English texts, even after state-of-the-art compression, are 20%-40% longer than their Chinese equivalents.

A topic for another time: how do typical speech rates differ between languages? Do these interact with per-syllable measures of information content so as to equalize the average rate of information transmission?

Posted by Mark Liberman at April 4, 2008 06:35 AM