We present a methodology for using dynamic evaluation to improve neural sequence models. RNN models are adapted to recent history via a gradient-descent-based mechanism, causing them to assign higher probabilities to re-occurring sequential patterns.
Dynamic evaluation outperforms existing adaptation approaches in our comparisons. It improves the state-of-the-art word-level perplexities on the Penn Treebank and WikiText-2 datasets to 51.1 and 44.3 respectively, and the state-of-the-art character-level cross-entropies on the text8 and Hutter Prize datasets to 1.19 bits/char and 1.08 bits/char respectively.
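The core mechanism can be illustrated with a minimal sketch. The paper's models are RNNs; here a softmax bigram table stands in for the network so the example is self-contained, and the function name, segment length, and learning rate are illustrative assumptions, not the paper's settings. The key step is the same: after scoring each segment of the test sequence, take one gradient step on that segment's loss before scoring the next, so the model adapts to recent history as it goes.

```python
import numpy as np

def dynamic_eval_nll(W0, seq, lr=0.5, seg_len=5):
    """Sketch of dynamic evaluation on a softmax bigram model.

    W0: (V, V) logit table; row W[prev] gives next-symbol logits
        (a stand-in for the RNN used in the paper).
    seq: list of integer symbol ids (the test sequence).
    After evaluating each segment, the parameters are updated by one
    SGD step on that segment's cross-entropy loss, so re-occurring
    patterns receive higher probability later in the sequence.
    Setting lr=0 recovers ordinary static evaluation.
    Returns the average negative log-likelihood in nats per symbol.
    """
    W = W0.copy()
    total_nll, n = 0.0, 0
    for start in range(0, len(seq) - 1, seg_len):
        seg = seq[start:start + seg_len + 1]
        grad = np.zeros_like(W)
        for prev, nxt in zip(seg[:-1], seg[1:]):
            # softmax over next-symbol logits (numerically stabilized)
            p = np.exp(W[prev] - W[prev].max())
            p /= p.sum()
            total_nll += -np.log(p[nxt])
            n += 1
            # cross-entropy gradient w.r.t. logits: p - one_hot(nxt)
            g = p.copy()
            g[nxt] -= 1.0
            grad[prev] += g
        # one SGD step on this segment's mean loss (the adaptation step)
        W -= lr * grad / max(len(seg) - 1, 1)
    return total_nll / n
```

On a repetitive sequence such as `[0, 1, 0, 1, ...]` starting from uniform logits, the dynamically evaluated model's average loss falls below the static model's constant log V, mirroring the advantage plotted in Figure 2.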
Figure 2: Average losses in bits/char of dynamic evaluation and static evaluation plotted against the number of characters processed, on sequences from the Hutter Prize test set (left) and the European Parliament dataset in Spanish (right), averaged over 500 trials for each.
Losses at each data point are averaged over sequence segments of length 100, and are not cumulative. Note the different y-axis scales in the two plots.
7.3 Time-Scales of Dynamic Evaluation
The Spanish experiments measure how dynamic evaluation handles large distribution shifts between training and test time, as the Hutter Prize data contains very little Spanish. We used the first 5 million characters of the Spanish European Parliament data in place of the Hutter Prize test set. The Spanish experiments used the same base model and dynamic evaluation settings as the Hutter Prize experiments. Plots of the average cross-entropy errors against the number of Spanish characters processed are given in Figure 2b.
On both datasets, dynamic evaluation gave a very noticeable advantage after a few hundred characters. For Spanish, this advantage continued to grow as more of the sequence was processed, whereas for Hutter Prize, the advantage was maximized after viewing around 2–3k characters. The advantage of dynamic evaluation was also much greater on Spanish sequences than on Hutter Prize sequences.
We also drew 300-character conditional samples from the static and dynamic versions of our model after viewing 10k characters of Spanish. For the dynamic model, we continued to apply dynamic evaluation during sampling as well, using the process described in §3. The conditional samples are given in the Appendix. The static samples quickly switched to English that resembled Hutter Prize data. The dynamic model generated data with some Spanish words and a number of made-up words with characteristics of Spanish words for the entirety of the sample. This is an example of the kinds of features that dynamic evaluation was able to learn to model on the fly.
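Sampling with continued adaptation can be sketched as follows. As above, a bigram logit table stands in for the paper's RNN, and the function name and learning rate are illustrative assumptions; the point is the loop structure: each sampled character is treated as newly observed data and triggers a gradient step before the next character is drawn, so the sampler keeps adapting to its own output.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_with_dynamic_eval(W0, context, n_chars, lr=0.5):
    """Sketch of conditional sampling with dynamic evaluation.

    W0: (V, V) bigram logit table (a stand-in for the RNN in the
    paper). Each sampled symbol is treated as observed data and
    triggers one SGD step on its cross-entropy loss, so the model
    continues adapting while it generates.
    """
    W = W0.copy()
    prev = context[-1]  # condition on the last observed symbol
    out = []
    for _ in range(n_chars):
        # softmax over next-symbol logits (numerically stabilized)
        p = np.exp(W[prev] - W[prev].max())
        p /= p.sum()
        nxt = rng.choice(len(p), p=p)
        # adapt on the freshly sampled symbol: gradient is p - one_hot
        g = p.copy()
        g[nxt] -= 1.0
        W[prev] -= lr * g
        out.append(int(nxt))
        prev = nxt
    return out
```

Because the update reinforces whichever transitions are sampled, generated patterns tend to persist, which is consistent with the dynamic model retaining Spanish-like characteristics throughout its samples.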