“Entropy of Natural Languages: Theory and Experiment”, 1994-05-01:
The concept of the entropy of natural languages, first introduced by Shannon in 1948, is discussed, together with its importance. A review of known approaches to, and results of, previous studies of language entropy is presented.
A new, improved method for evaluating both lower and upper bounds on the entropy of printed texts is developed. The method is a refinement of Shannon's 1951 prediction (guessing) method. Evaluation of the lower bound is shown to reduce to a classical linear-programming problem.
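As background to the refinement, Shannon's original 1951 bounds can be computed directly from guessing-experiment statistics. The sketch below illustrates the classical formulas only (not the paper's refined method or its linear-programming formulation); the guess-rank frequencies `q` are hypothetical numbers chosen for illustration.

```python
import math

def shannon_bounds(q):
    """Shannon's (1951) upper and lower bounds on entropy per symbol.

    q[i-1] = relative frequency with which the correct symbol was
    identified on guess number i in a prediction experiment
    (hypothetical values here; real data come from human guessing).
    """
    # Upper bound: the entropy of the guess-rank distribution itself.
    upper = -sum(p * math.log2(p) for p in q if p > 0)
    # Lower bound: sum over i of i*(q_i - q_{i+1})*log2(i),
    # with q_{n+1} taken as 0.
    ext = list(q) + [0.0]
    lower = sum(i * (ext[i - 1] - ext[i]) * math.log2(i)
                for i in range(1, len(q) + 1))
    return lower, upper

# Hypothetical guess-rank frequencies for a 5-symbol alphabet.
q = [0.5, 0.2, 0.15, 0.1, 0.05]
lo, hi = shannon_bounds(q)  # lo < hi; the true entropy lies between them
```

The gap `hi - lo` is what the paper's refined method narrows; the abstract reports a reduction of the gap by a factor of 2.25 for Hebrew texts.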
A statistical analysis of the bound estimates is given, and procedures for the statistical treatment of the experimental data (including verification of statistical validity and statistical significance) are elaborated.
The method has been applied to printed Hebrew texts in a large experiment (1,000 independent samples) in order to evaluate the entropy and other information-theoretic characteristics of the Hebrew language. The results demonstrate the efficiency of the new method: the gap between the upper and lower bounds on the entropy is reduced by a factor of 2.25 compared with the original Shannon approach.
A comparison with other languages is given.
Possible applications of the method are briefly discussed.