“N-Gram Counts and Language Models from the Common Crawl”, Christian Buck, Kenneth Heafield, Bas van Ooyen (2014)⁠:

We contribute 5-gram counts and language models trained on the Common Crawl corpus, a collection of over 9 billion web pages.

This release improves upon the Google n-gram counts in two key ways: the inclusion of low-count entries and deduplication to reduce boilerplate. By preserving singletons, we were able to use Kneser-Ney smoothing to build large language models. This paper describes how the corpus was processed with emphasis on the problems that arise in working with data at this scale. Our unpruned Kneser-Ney English 5-gram language model, built on 975 billion deduplicated tokens, contains over 500 billion unique n-grams.
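The two improvements above, deduplication and singleton preservation, can be illustrated with a minimal sketch. The exact-line filter below is an assumed stand-in for the paper's boilerplate removal (the real pipeline is more involved), and the function name is hypothetical; the point is that low-count entries survive into the final counts.

```python
from collections import Counter

def ngram_counts(lines, n=5):
    """Count n-grams over deduplicated lines, keeping singletons.

    Exact-duplicate lines (a crude proxy for web boilerplate) are
    dropped; every surviving n-gram is counted, including those
    seen only once, since Kneser-Ney smoothing needs singletons.
    """
    counts = Counter()
    seen = set()
    for line in lines:
        if line in seen:        # skip exact duplicates (boilerplate)
            continue
        seen.add(line)
        toks = line.split()
        for i in range(len(toks) - n + 1):
            counts[tuple(toks[i:i + n])] += 1
    return counts
```

Pruning pipelines like Google's n-gram release would discard the count-1 entries this function keeps; preserving them is what makes unpruned Kneser-Ney estimation possible downstream.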

We show gains of 0.5–1.4 BLEU by using large language models to translate into various languages.

[Keywords: web corpora, language models, multilingual]

…By using disk-based streaming (Heafield et al. 2013) we are able to efficiently estimate language models much larger than the physical memory of our machines. For example, estimating a language model on 535 billion tokens took 8.2 days on a single machine with 140 GiB RAM. For all languages for which we have sufficient data and a preprocessing pipeline, we produce unpruned 5-gram models using interpolated modified Kneser-Ney smoothing (Kneser & Ney 1995; Chen & Goodman 1998).
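To make the smoothing step concrete, here is a small self-contained sketch of interpolated Kneser-Ney for the bigram case (the paper's models are 5-gram and use the modified variant with multiple discounts; the single fixed discount below is a simplification, and the function name is mine):

```python
from collections import Counter

def kn_bigram_model(tokens, discount=0.75):
    """Interpolated Kneser-Ney bigram probabilities (single-discount sketch)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    context = Counter(tokens[:-1])                 # c(v): count as left context
    left_types = Counter(w for (_, w) in bigrams)  # N1+(.w): distinct left contexts of w
    followers = Counter(v for (v, _) in bigrams)   # N1+(v.): distinct continuations of v
    total_types = len(bigrams)                     # N1+(..)

    def prob(v, w):
        p_cont = left_types[w] / total_types       # continuation probability
        c_v = context[v]
        if c_v == 0:
            return p_cont                          # unseen context: back off fully
        lam = discount * followers[v] / c_v        # interpolation weight
        return max(bigrams[(v, w)] - discount, 0) / c_v + lam * p_cont

    return prob
```

Note the continuation probability counts *distinct contexts* rather than raw frequency, which is exactly why singleton n-grams must be kept: pruning them would corrupt the `N1+` type counts that Kneser-Ney relies on.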

We give details only for the largest model. Table 4 shows n-gram counts for the English language model, which was estimated on almost a trillion tokens; the resulting model is 5.6 TB.