“Mind the Gap: Assessing Temporal Generalization in Neural Language Models § Dynamic Evaluation”, Angeliki Lazaridou, Adhiguna Kuncoro, Elena Gribovskaya, Devang Agrawal, Adam Liska, Tayfun Terzi, Mai Gimenez, Cyprien de Masson d’Autume, Tomas Kocisky, Sebastian Ruder, Dani Yogatama, Kris Cao, Susannah Young, Phil Blunsom (2021-02-03):

[background] …6. Keeping models up-to-date: Online learning through dynamic evaluation:

One way to mitigate LMs’ degradation over time is to continually update the models’ knowledge as new documents arrive in the stream. One approach is dynamic evaluation (Mikolov et al. 2010; Graves 2013; Krause et al. 2019), a form of online learning that continually updates the parameters of a pretrained model by performing gradient descent on the new data. While most prior work used dynamic evaluation to perform updates within a document, and thus adapt to local topical shifts, here we use it to adapt to the temporal dynamics that occur within a stream of chronologically ordered documents, and thus to temporal trends across documents. Appendix B has more details on dynamic evaluation and our empirical settings.
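The mechanics can be illustrated with a minimal sketch: score each incoming document with the current model, then take one gradient step on that document before moving to the next, so the model tracks distribution shift in the stream. This is not the paper's implementation (which applies the update to a Transformer LM); a toy unigram model with per-token logits stands in here, and all names are illustrative.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def dynamic_eval(logits, stream, lr=0.5, update=True):
    """Score each document, then (if `update`) take one gradient step
    on it. Returns the average per-token negative log-likelihood.

    `logits` is a list with one logit per vocabulary token (a toy
    stand-in for model parameters); `stream` is a chronologically
    ordered list of documents, each a list of token ids.
    """
    nll, n_tokens = 0.0, 0
    for doc in stream:
        probs = softmax(logits)
        for tok in doc:                  # score before updating
            nll -= math.log(probs[tok])
            n_tokens += 1
        if update:                       # one SGD step on the just-seen doc
            counts = [doc.count(v) for v in range(len(logits))]
            for v in range(len(logits)):
                # d(NLL)/d(logit_v) for a softmax over token counts
                grad = len(doc) * probs[v] - counts[v]
                logits[v] -= lr * grad / len(doc)
    return nll / n_tokens
```

On a stream whose token distribution shifts over time, the updated model achieves lower average NLL than the frozen one, mirroring the reduced upward slope reported in the paper.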

Figure 5: Relative perplexity increase with (solid lines) and without (dotted lines) dynamic evaluation, for the TIME-STRATIFIED model.

We plot the results in Figure 5: Dotted lines reflect the perplexity increase when comparing the CONTROL model to the TIME-STRATIFIED model (i.e., the same graph as in Figure 1), whereas solid lines reflect the perplexity increase when comparing the same CONTROL model with the TIME-STRATIFIED model augmented with dynamic evaluation (TIME-STRATIFIEDdyn). On all datasets, dynamic evaluation slows the rate at which the model becomes outdated, as evidenced by the reduced upward slope, with a statistically significant effect for arXiv and WMT (p < 0.05, assessed using a t-test on the slopes found by fitting a linear regression). The improvements are most pronounced for arXiv, where a more granular analysis over weeks reveals that the model needs only about one week's worth of data to overtake the CONTROL model. Moreover, we see much larger improvements for predicting EMERGING NEW WORDS, which exhibit strong temporal dynamics (§3.1, see Figure 3): we observe a 39.62% perplexity reduction (109.73 → 66.2) for EMERGING NEW WORDS, compared to the overall perplexity reduction (a 1.25% reduction, 22.45 → 22.17, for WMT; Figure 4).

When aiming to keep models up-to-date (especially larger models), lightweight yet effective approaches are preferable because they allow the model to rapidly digest new information with minimal time, computation, and carbon costs. We thus experiment with updating only the embedding layer (i.e., 52M parameters), capturing lexical semantic changes, as well as updating only the bias terms at all layers (i.e., 0.198M parameters), as recently introduced by Ben Zaken et al. 2021. Figure 4 presents the results: in line with the findings of Ben Zaken et al. 2021, updating only the bias terms performs nearly as well as updating the full model. [Large models are sample-efficient & good at continual-learning…]
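The lightweight variants amount to restricting the gradient update to a subset of named parameters: embeddings only, or every bias vector (the BitFit-style update of Ben Zaken et al. 2021). A hedged sketch of that selection logic, with hypothetical parameter names and shapes (not the paper's 52M/0.198M model):

```python
def select_trainable(param_shapes, mode):
    """Return the {name: shape} subset of parameters to update during
    dynamic evaluation. `param_shapes` maps parameter names to shape
    tuples, as in a typical deep-learning framework's named parameters.
    """
    if mode == "full":
        keep = lambda name: True
    elif mode == "embedding":            # lexical-semantic drift only
        keep = lambda name: name.startswith("embedding")
    elif mode == "bias":                 # BitFit-style bias-only update
        keep = lambda name: name.endswith(".bias")
    else:
        raise ValueError(f"unknown mode: {mode}")
    return {name: shape for name, shape in param_shapes.items() if keep(name)}

def n_params(param_shapes):
    """Total parameter count across a {name: shape} mapping."""
    total = 0
    for shape in param_shapes.values():
        n = 1
        for dim in shape:
            n *= dim
        total += n
    return total
```

In a framework like PyTorch, the same effect is obtained by setting `requires_grad = False` on every parameter outside the selected subset before the dynamic-evaluation gradient steps, so only the chosen slice of the model pays the update cost.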