“Recurrent Neural Network Based Language Model § Dynamic Evaluation”, 2010-09-26:
[context; cf. 2021] …Note that the training and testing phases in statistical language modeling usually differ in that models do not get updated while test data are being processed. So, if a new person's name occurs repeatedly in the test set, it will repeatedly receive a very small probability, even if it is composed of known words.
It can be assumed that such long-term memory should not reside in the activations of context units (as these change very rapidly), but rather in the synapses themselves: the network should continue training even during the testing phase. We refer to such a model as dynamic. For the dynamic model, we use a fixed learning rate α = 0.1. While in the training phase all data are presented to the network several times in epochs, the dynamic model gets updated just once as it processes the testing data. This is of course not an optimal solution but, as we shall see, it is enough to obtain large perplexity reductions over static models. Note that this modification is very similar to cache techniques for backoff models, with the difference that neural networks learn in a continuous space, so if ‘dog’ and ‘cat’ are related, frequent occurrence of ‘dog’ in the testing data will also increase the probability of ‘cat’.
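The evaluation scheme above can be sketched in a few dozen lines. This is a hedged illustration, not the paper's implementation: the model class, its sizes, and all names (`TinyRNNLM`, `perplexity`, etc.) are invented for the example; only the idea — evaluate each test word, then take a single SGD step on it with a fixed learning rate — follows the text.

```python
# Minimal sketch of static vs. dynamic evaluation for a toy Elman RNN LM.
# NumPy only; all names and hyperparameters are illustrative assumptions.
import numpy as np

class TinyRNNLM:
    """A tiny Elman-style RNN language model over integer word ids."""
    def __init__(self, vocab, hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.Wxh = rng.normal(0.0, 0.1, (hidden, vocab))   # input -> hidden
        self.Whh = rng.normal(0.0, 0.1, (hidden, hidden))  # hidden -> hidden
        self.Why = rng.normal(0.0, 0.1, (vocab, hidden))   # hidden -> output
        self.h = np.zeros(hidden)

    def step(self, x):
        """Consume word id x; return the softmax distribution P(next word)."""
        self.h_prev = self.h
        self.h = np.tanh(self.Wxh[:, x] + self.Whh @ self.h)
        z = self.Why @ self.h
        e = np.exp(z - z.max())
        self.p = e / e.sum()
        return self.p

    def update(self, x, y, lr=0.1):
        """One SGD step on the last prediction (truncated BPTT with tau = 1)."""
        dz = self.p.copy()
        dz[y] -= 1.0                          # softmax cross-entropy gradient
        dh = self.Why.T @ dz
        dpre = dh * (1.0 - self.h ** 2)       # backprop through tanh
        self.Why -= lr * np.outer(dz, self.h)
        self.Whh -= lr * np.outer(dpre, self.h_prev)
        self.Wxh[:, x] -= lr * dpre

def perplexity(model, data, dynamic=False, lr=0.1):
    """Static evaluation, or dynamic evaluation (keep training on test data)."""
    nll = 0.0
    for x, y in zip(data[:-1], data[1:]):
        p = model.step(x)
        nll -= np.log(p[y])
        if dynamic:
            model.update(x, y, lr)            # single pass, fixed learning rate
    return np.exp(nll / (len(data) - 1))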
…The training algorithm described here is also referred to as truncated backpropagation through time with τ = 1. It is not optimal, as the weights of the network are updated based on an error vector computed only for the current time step. To overcome this simplification, the backpropagation through time (BPTT) algorithm is commonly used.
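To make the contrast concrete, here is a hedged sketch of unrolling the gradient through more than one step. It is a simplified variant (it backpropagates only the last step's error through the preceding inputs, rather than accumulating errors from every step); the function name, signatures, and sizes are assumptions for illustration. With one input word it reduces to the τ = 1 rule described above.

```python
# Sketch of truncated BPTT: backpropagate the last step's error through
# tau = len(xs) unrolled steps. NumPy only; all names are illustrative.
import numpy as np

def bptt_grads(Wxh, Whh, Why, xs, y):
    """xs: word ids consumed so far (unrolling window), y: target word id.
    Returns gradients of the cross-entropy loss at the last step."""
    H = Whh.shape[0]
    hs = [np.zeros(H)]
    for x in xs:                             # forward pass, storing states
        hs.append(np.tanh(Wxh[:, x] + Whh @ hs[-1]))
    z = Why @ hs[-1]
    p = np.exp(z - z.max())
    p /= p.sum()
    dz = p.copy()
    dz[y] -= 1.0                             # softmax cross-entropy gradient
    dWhy = np.outer(dz, hs[-1])
    dWxh = np.zeros_like(Wxh)
    dWhh = np.zeros_like(Whh)
    dh = Why.T @ dz
    for t in range(len(xs) - 1, -1, -1):     # unroll backwards through time
        dpre = dh * (1.0 - hs[t + 1] ** 2)   # through tanh at step t
        dWxh[:, xs[t]] += dpre
        dWhh += np.outer(dpre, hs[t])
        dh = Whh.T @ dpre                    # push error one step further back
    return dWxh, dWhh, dWhy
```

Because the error signal flows through `Whh` at every unrolled step, a τ > 1 window lets the recurrent weights learn dependencies that the one-step rule cannot see.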
…The apparently strange result [in Table 3] obtained with dynamic models on the evaluation set is probably due to the fact that sentences in the eval set do not follow each other. As the dynamic changes in the model try to capture longer-context information between sentences, sentences must be presented to dynamic models consecutively.
…The improvement keeps growing with increasing training data, suggesting that even larger improvements may be achieved simply by using more data. As shown in Table 2, the WER reduction when using a mixture of 3 dynamic RNN LMs against a 5-gram with modified Kneser-Ney smoothing is about 18%. The perplexity reductions are also among the largest ever reported: almost 50% when comparing the KN 5-gram and the mixture of 3 dynamic RNN LMs. In fact, by mixing static and dynamic RNN LMs with a larger learning rate used when processing the testing data (α = 0.3), the best perplexity result was 112.
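The "mixture" results above come from linearly interpolating per-word probabilities from several models. A minimal sketch of that scoring step, with made-up numbers (the function name, weights, and toy streams are assumptions, not the paper's values):

```python
# Sketch: perplexity of a linear interpolation of language models.
# Each stream holds one model's probability for each test word in order.
import math

def mixture_perplexity(streams, weights):
    """streams: per-word probability sequences, one per model;
    weights: interpolation weights summing to 1."""
    n = len(streams[0])
    nll = 0.0
    for i in range(n):
        p = sum(w * s[i] for w, s in zip(weights, streams))
        nll -= math.log(p)                 # mixture probability of word i
    return math.exp(nll / n)
```

When the component models make complementary errors (as static and dynamic RNN LMs or an RNN and a KN 5-gram do), the interpolated perplexity can be lower than that of any single component.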
…The perplexity improvements reported in Table 2 are among the largest ever reported on a similar data set, with a very large effect from on-line learning (also called dynamic models in this paper; in the context of speech recognition it is very similar to unsupervised LM training techniques). While WER is affected only slightly and requires correct ordering of the testing data, on-line learning should be further investigated, as it provides a natural way to obtain cache-like and trigger-like information (note that for data compression, on-line techniques for training predictive neural networks had already been studied, for example by 2000). If we want to build models that can really learn language, then on-line learning is crucial: acquiring new information is definitely important.