This paper reports on the benefits of large-scale statistical language modeling in machine translation. We describe a distributed infrastructure that we use to train on up to 2 trillion tokens, resulting in language models with up to 300 billion n-grams. The infrastructure provides smoothed probabilities for fast, single-pass decoding. We introduce a new smoothing method, dubbed Stupid Backoff, that is inexpensive to train on large data sets and approaches the quality of Kneser-Ney smoothing as the amount of training data increases.
Figure 5, as modified by Chris Dyer in a 2020 talk: translation quality (BLEU score) as a function of training-data size, for n-gram language models and, later, RNNs.
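To make the Stupid Backoff method mentioned in the abstract concrete: it scores an n-gram by the ratio of raw counts, and when the n-gram is unseen it backs off to the (n-1)-gram score scaled by a fixed factor (0.4 in the paper), yielding relative scores rather than normalized probabilities. The sketch below is a minimal illustration, not the paper's distributed implementation; the count-table layout (`counts`, `total_tokens`) is an assumption for the example.

```python
# Minimal sketch of Stupid Backoff scoring (illustrative only).
# Assumes `counts[k]` maps k-gram tuples to raw frequencies and
# `total_tokens` is the corpus size; these names are hypothetical.

ALPHA = 0.4  # fixed backoff factor from the paper; scores are not normalized


def stupid_backoff_score(words, counts, total_tokens):
    """Return the (unnormalized) Stupid Backoff score S(w_n | w_1 .. w_{n-1})."""
    ngram = tuple(words)
    if len(ngram) == 1:
        # Base case: relative frequency of the unigram.
        return counts.get(1, {}).get(ngram, 0) / total_tokens
    numer = counts.get(len(ngram), {}).get(ngram, 0)
    denom = counts.get(len(ngram) - 1, {}).get(ngram[:-1], 0)
    if numer > 0 and denom > 0:
        # Seen n-gram: ratio of raw counts, with no discounting.
        return numer / denom
    # Unseen n-gram: drop the earliest context word and scale by ALPHA.
    return ALPHA * stupid_backoff_score(words[1:], counts, total_tokens)


# Toy usage with made-up counts:
counts = {
    1: {("the",): 5, ("cat",): 2, ("sat",): 1},
    2: {("the", "cat"): 2, ("cat", "sat"): 1},
    3: {("the", "cat", "sat"): 1},
}
print(stupid_backoff_score(["the", "cat", "sat"], counts, total_tokens=8))  # 0.5
```

The appeal of this scheme at trillion-token scale is that it needs only raw counts, so no held-out discounting statistics have to be computed or stored.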