August 05, 2005

One world, how many bytes?

Victor Mair sent me some interesting observations about the slogan for the 2008 Beijing Olympics. The English version is "One World, One Dream", while the Chinese version is "tong2 yi1 ge shi4jie4, tong2 yi1 ge meng4xiang3" in pinyin, or 同一个世界, 同一个梦想 in simplified characters, or 同一個世界, 同一個夢想 in traditional characters.

Victor has interesting things to say about the source of the slogan (it was devised in English and then translated into Mandarin), the slogan's division into words, the history of the words, and so on. But early in his note he makes a quantitative comparison

Mandarin:   10 syllables, 8 words
  75 pen(cil) strokes (traditional) / 58 (simplified)
English:   4 syllables, 4 words
  approximately 25 pen(cil) strokes

and asks (what I take to be) a rhetorical question about it:

In cybernetic / IT terms, which is more economical? This is NOT even taking into account that there are only 26 letters of the alphabet to deal with, in contrast to at least 26,000 characters that have to be separately considered when determining memory size.

Now, to a first approximation, I reckon that the cost of text storage is zero -- a compressed copy of all the text I've ever written is roughly the size of one high resolution digital photograph. So the answer to Victor's question may not matter very much: a mere factor of two or three is negligible when pictures, audio and video consume many orders of magnitude more storage than text does. However, I was still curious about the facts of the matter in a larger sample than just this one slogan. So I turned to the LDC's catalog of Chinese/English parallel text.

One of our offerings in that area is a body of United Nations documents. There are 7,070 pairs of documents. I believe that in most cases, the documents were written in English and then translated into Chinese. These are essentially plain text documents -- no mark-up -- and they are not compressed. The Chinese is GB encoded. The total byte counts are:

Chinese: 54,640,469
English: 123,301,197

Now, the fact that English puts spaces between words, while Chinese does not, accounts for some of this difference. But in any case, the direction of the difference in bytes is the opposite of Victor's counts of syllables, words, strokes etc., and the magnitude of the difference is a factor of about 2.26.
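The factor quoted above is just the quotient of the two byte counts. A minimal sketch of that arithmetic in Python (the variable names are illustrative, not part of any LDC tooling):

```python
# Total byte counts for the UN parallel documents, as reported above.
chinese_bytes = 54_640_469
english_bytes = 123_301_197

# English/Chinese size ratio.
ratio = english_bytes / chinese_bytes
print(f"English/Chinese byte ratio: {ratio:.2f}")  # prints 2.26
```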

In the Olympic slogan, the Chinese version is 11 characters, including the comma. Even encoded as 2 bytes per character, that's only 22 bytes. The English version is 20 characters -- at one byte per character, that's 20 bytes. This suggests that the slogan is not typical of other material.
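These slogan counts are easy to check. The sketch below assumes the Chinese version uses a full-width comma (U+FF0C), which GB2312 stores in two bytes like any other character; with a plain ASCII comma the GB total would be 21 bytes instead. The UTF-8 figure is included only for comparison and is not part of the counts above.

```python
# Simplified-character slogan, assuming a full-width comma (U+FF0C).
chinese = "同一个世界，同一个梦想"
english = "One World, One Dream"

gb_len = len(chinese.encode("gb2312"))    # 11 characters x 2 bytes each
ascii_len = len(english.encode("ascii"))  # 20 characters x 1 byte each
utf8_len = len(chinese.encode("utf-8"))   # 11 characters x 3 bytes each

print(len(chinese), gb_len, ascii_len, utf8_len)  # prints 11 22 20 33
```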

Another database on which we can make comparisons is some material from Hong Kong. There are three subcorpora: the "Hansards", which are the parliamentary records; the legal code; and an archive of news stories. This data includes some formatting information, but it's largely the same in both languages. This table shows the disk usage in megabytes for the various subcorpora:

            Chinese   English   English/Chinese ratio
Hansards    158.454   270.472   1.71
Laws         50.094    68.796   1.37
News         78.898   117.890   1.49

I'm not sure why the ratios vary so much, nor why they're all lower than the UN ratio (perhaps because these were written in Chinese and translated to English?), but they certainly all favor Chinese texts as being smaller than the corresponding English texts.

Several people have written in to ask about the size relationship once the files have been compressed. I wondered too, but didn't have time earlier to check. The results of a couple of experiments suggest that compression reduces but does not eliminate the discrepancy in size. For example, the Hong Kong News corpus, put into a tar archive and compressed with gzip, is 33,535,939 bytes in English and 29,291,135 bytes in Chinese, for a ratio of 1.14. This is smaller than 1.49, but it's not 1.
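For anyone who wants to try this kind of comparison on their own parallel text, here is a rough sketch using Python's gzip module. It compresses byte strings in memory rather than building a tar archive, so it will not reproduce the corpus figures exactly. The function name and the sample inputs are hypothetical, and a sample this repetitive says nothing about real corpora.

```python
import gzip

def english_chinese_ratios(english: bytes, chinese: bytes) -> dict:
    """Return English/Chinese size ratios before and after gzip compression."""
    return {
        "raw": len(english) / len(chinese),
        "gzipped": len(gzip.compress(english)) / len(gzip.compress(chinese)),
    }

# Tiny illustrative inputs (hypothetical; substitute real parallel files).
english_sample = ("One World, One Dream\n" * 200).encode("ascii")
chinese_sample = ("同一个世界，同一个梦想\n" * 200).encode("gb2312")
ratios = english_chinese_ratios(english_sample, chinese_sample)
print(ratios)

# The gzipped Hong Kong News sizes reported above give the quoted ratio.
reported_ratio = 33_535_939 / 29_291_135
print(f"{reported_ratio:.2f}")  # prints 1.14
```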

[Update: Xiaoyi Ma observes that the LDC parallel Chinese/English corpora in general amount to some 218M English words and 370M Chinese characters, or about 1.7 Chinese characters per English word. In terms of byte count, he gets the following English/Chinese ratios:

 
            text   gzipped
FBIS        2.27   1.41
Sinorama    1.95   1.19
UN          1.96   1.24

All FBIS and Sinorama text was translated Chinese to English, while 90% of the UN data was translated English to Chinese. The ratios again are variable, but clearly show that Chinese texts are smaller than the corresponding English texts, with the difference shrinking but not disappearing under compression.]

Posted by Mark Liberman at August 5, 2005 07:45 PM