“Recurrent Neural Network Based Language Model”, Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, Sanjeev Khudanpur (2010-09-26):

[slides] A new recurrent neural network based language model (RNN LM) with applications to speech recognition is presented.

Results indicate that it is possible to obtain around a 50% reduction of perplexity by using a mixture of several RNN LMs, compared to a state-of-the-art backoff language model. [see their dynamic evaluation results] Speech recognition experiments show around an 18% reduction of word error rate on the Wall Street Journal task when comparing models trained on the same amount of data, and around 5% on the much harder NIST RT05 task, even when the backoff model is trained on much more data than the RNN LM.
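The “mixture of several RNN LMs” is linear interpolation: each model assigns a probability to every word of the test text, the probabilities are averaged with mixture weights, and perplexity is computed over the averaged stream. A minimal sketch (the per-word probabilities and equal weights below are toy values for illustration, not taken from the paper):

```python
import math

def perplexity(word_probs):
    """Perplexity = exp of the average negative log-probability per word."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

def interpolate(prob_streams, weights):
    """Linearly interpolate per-word probabilities from several LMs."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return [sum(w * p for w, p in zip(weights, probs))
            for probs in zip(*prob_streams)]

# Toy per-word probabilities from two hypothetical models on a 3-word text:
model_a = [0.10, 0.20, 0.05]
model_b = [0.05, 0.25, 0.10]
mixed = interpolate([model_a, model_b], [0.5, 0.5])
print(perplexity(model_a))  # 10.0
print(perplexity(mixed))    # lower than either toy model alone
```

Because log-loss is convex in the probability, the interpolated stream often has lower perplexity than any single component, which is why mixing several RNN LMs compounds the gains over the backoff baseline.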

We provide ample empirical evidence to suggest that connectionist language models are superior to standard n-gram techniques, except for their high computational (training) complexity.

[Keywords: language modeling, recurrent neural networks, speech recognition, dynamic evaluation, Elman networks, backpropagation, hidden Markov models, overfitting]

…In our experiments, networks do not over-train substantially, even if very large hidden layers are used—regularization of networks to penalize large weights did not provide any substantial improvements…For comparison, it takes around 6 hours for our basic implementation to train an RNN model on the Brown corpus (800K words, 100 hidden units, and vocabulary threshold 5), while Bengio reports 113 days for a basic implementation and 26 hours with importance sampling, when using similar data and a similarly sized neural network. We use only a BLAS library to speed up computation…As it is very time-consuming to train an RNN LM on large data, we have used only up to 6.4M words for training RNN models (300K sentences)—it takes several weeks to train the most complex models.

…All LMs in the preceding experiments were trained on only 6.4M words, which is much less than the amount of data used by others for this task. To provide a comparison with Xu [8] and Filimonov [9], we have used a backoff model trained on 37M words (the same data were used by Xu; Filimonov used 70M words). Results are reported in Table 3, and we can conclude that RNN-based models can reduce WER by around 12% relatively, compared to a backoff model trained on 5× more data.
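The “12% relative” figure is a relative reduction: the absolute WER drop divided by the baseline WER. A one-line helper makes the arithmetic explicit (the example WER values are hypothetical, not the paper's Table 3 numbers):

```python
def relative_reduction(baseline, improved):
    """Relative error-rate reduction in percent: (baseline - improved) / baseline."""
    return 100.0 * (baseline - improved) / baseline

# Hypothetical WERs: dropping from 17.2% to 15.1% absolute is ~12.2% relative.
print(round(relative_reduction(17.2, 15.1), 1))
```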

5. Conclusion & future work: Recurrent neural networks significantly outperformed state-of-the-art backoff models in all our experiments, most notably even when the backoff models were trained on much more data than the RNN LMs. In the WSJ experiments, the word error rate reduction is around 18% for models trained on the same amount of data, and 12% when the backoff model is trained on 5× more data than the RNN model. For NIST RT05, we can conclude that models trained on just 5.4M words of in-domain data can outperform big backoff models trained on hundreds of times more data. The obtained results break the myth that language modeling is just about counting n-grams, and that the only reasonable way to improve results is by acquiring new training data.