“The Distribution of N-Grams”, Leo Egghe 2000-02:

n-grams are generalized words consisting of n consecutive symbols, as they are used in a text. This paper determines the rank-frequency distribution for redundant n-grams. For entire texts this is known to be Zipf’s law (i.e. an inverse power law). For n-grams, however, we show that the rank (r)-frequency distribution is

$$P_n(r) = \frac{C}{\left(\psi_n(r)\right)^{\beta}}$$

where $\psi_n$ is the inverse function of $f_n(x) = x \ln^{n-1} x$. (Here we assume that the rank-frequency distribution of the symbols follows Zipf’s law with exponent $\beta$.)
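
[The inverse function has no elementary closed form for n > 1, so a numerical illustration may help. The following is a minimal sketch, not the paper's method: it assumes f_n(x) = x (ln x)^(n-1) restricted to x ≥ 1, where it is increasing, and inverts it by bisection. The function names `f_n`, `psi_n`, and `p_n`, the tolerance, and the default values of C and β are all illustrative assumptions.]

```python
import math


def f_n(x: float, n: int) -> float:
    """f_n(x) = x * (ln x)^(n-1), evaluated for x >= 1."""
    return x * math.log(x) ** (n - 1)


def psi_n(r: float, n: int, tol: float = 1e-12) -> float:
    """Return x >= 1 with f_n(x) = r, found by bisection (assumes r > 0)."""
    if n == 1:
        return r  # f_1(x) = x, so the inverse is the identity
    lo, hi = 1.0, 2.0
    # Grow the upper bracket until f_n(hi) >= r; f_n is increasing on [1, inf).
    while f_n(hi, n) < r:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f_n(mid, n) < r:
            lo = mid
        else:
            hi = mid
        if hi - lo <= tol * hi:
            break
    return 0.5 * (lo + hi)


def p_n(r: float, n: int, beta: float = 1.0, c: float = 1.0) -> float:
    """Rank-frequency value C / psi_n(r)^beta (C and beta are placeholders)."""
    return c / psi_n(r, n) ** beta


if __name__ == "__main__":
    # Compare the n = 1 case (plain Zipf) with n = 2 and n = 3 at a few ranks.
    for r in (1, 10, 100, 1000):
        print(r, [p_n(r, n) for n in (1, 2, 3)])
```

[For n = 1 this reduces to C / r^β, i.e. Zipf's law for the symbols themselves. For n > 1 the logarithmic factor makes ψ_n(r) grow more slowly than r, so the frequencies fall off more slowly than a pure power law in r.]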

[Sums of power laws are not themselves power laws, so the resulting distribution can be quite different.]