One way to mitigate LMs’ degradation over time is to continually update the models’ knowledge as new documents arrive in the stream. One way to do this is through dynamic evaluation (Mikolov et al., 2010; Graves, 2013; Krause et al., 2019)—a form of online learning that continually updates the parameters of a pretrained model by performing gradient descent on the new data. While most prior work used dynamic evaluation to perform updates within a document, hence adapting to local topical shifts, here we use it to adapt to the temporal dynamics that occur within a stream of chronologically ordered documents, hence adapting to temporal trends across documents. Appendix B has more details on dynamic evaluation and our empirical settings.
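The core loop can be illustrated with a minimal sketch: each incoming document is first scored with the current parameters (so the loss reflects how outdated the model is), and only then used for a gradient step. The toy unigram model and the function name below are illustrative assumptions, not the paper's actual implementation, which applies the same evaluate-then-update scheme to a large Transformer LM.

```python
import numpy as np

def dynamic_evaluation(stream, vocab_size, lr=0.5):
    """Evaluate-then-update over a chronologically ordered document stream.

    Toy stand-in for dynamic evaluation: a unigram LM whose logits are
    updated by one gradient step per document, after that document has
    been scored with the *current* (possibly outdated) parameters.
    """
    logits = np.zeros(vocab_size)            # "pretrained" parameters
    losses = []
    for doc in stream:                       # doc: array of token ids
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        losses.append(-np.log(probs[doc]).mean())  # NLL before updating
        # gradient of mean NLL w.r.t. logits: softmax - empirical dist.
        counts = np.bincount(doc, minlength=vocab_size) / len(doc)
        logits -= lr * (probs - counts)
    return losses
```

Because each update happens after scoring, the reported losses are honest online perplexities: the model never sees a document before being evaluated on it.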
Figure 5: Relative perplexity increase with (solid lines) and without (dotted lines) dynamic evaluation, for the TIME-STRATIFIED model.
We plot the results in Figure 5: Dotted lines reflect the perplexity increase when comparing the CONTROL model to the TIME-STRATIFIED model, i.e., the same graph as in Figure 1, whereas solid lines reflect the perplexity increase achieved when comparing the same CONTROL model with the TIME-STRATIFIED model augmented with dynamic evaluation (TIME-STRATIFIEDdyn). In all datasets, dynamic evaluation slows the rate at which the model becomes outdated, as evidenced by the reduced upward slope, with a statistically significant effect for arXiv and WMT (p < 0.05, assessed using a t-test on the slopes found by fitting a linear regression). The improvements are more pronounced for arXiv, where a more granular analysis over weeks reveals that the model needs only about one week’s worth of data to overtake the CONTROL model. Moreover, we see much larger improvements for predicting EMERGING NEW WORDS, which exhibit strong temporal dynamics (§3.1, see Figure 3): We observe a 39.62% perplexity reduction (109.73 → 66.2) for EMERGING NEW WORDS, compared to the overall perplexity reduction (a 1.25% reduction, 22.45 → 22.17, for WMT; Figure 4).
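The significance test on the slopes can be reproduced with ordinary least squares: fit a line to the perplexity-increase series over time and test whether its slope differs from zero. The helper below is a self-contained numpy sketch of that t-statistic (the paper does not specify its exact fitting code; `slope_t_statistic` is a hypothetical name).

```python
import numpy as np

def slope_t_statistic(y):
    """t-statistic for H0 "the trend is flat" from an OLS fit of the
    series y against time indices 0..n-1. Large |t| means the upward
    (or downward) slope is unlikely to be noise."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    x = np.arange(n, dtype=float)
    xm, ym = x.mean(), y.mean()
    sxx = ((x - xm) ** 2).sum()
    slope = ((x - xm) * (y - ym)).sum() / sxx
    resid = y - (ym - slope * xm) - slope * x
    se = np.sqrt((resid ** 2).sum() / (n - 2) / sxx)  # std. error of slope
    return slope / se
```

Comparing the statistic against the t-distribution with n − 2 degrees of freedom yields the quoted p-value; with enough time steps, |t| above roughly 2 corresponds to p < 0.05.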
When aiming to keep models up-to-date (especially for larger models), lightweight yet effective approaches are preferable because they allow the model to rapidly digest new information with minimal time, computation, and carbon costs. We thus experiment with updating only the embedding layer (i.e., 52M parameters), capturing lexical semantic changes, as well as updating only the bias terms at all layers (i.e., 0.198M parameters), as recently introduced by Ben Zaken et al. (2021). Figure 4 presents the results: In line with the findings of Ben Zaken et al. (2021), updating only the bias terms performs nearly as well as updating the full model. [Large models are sample-efficient & good at continual-learning…]
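Mechanically, these lightweight variants amount to masking the update so that only one named parameter group (the embedding layer, or the BitFit-style bias terms) receives gradient steps while everything else stays frozen. The sketch below assumes a flat dict of named parameter arrays; the function name and grouping are illustrative, not the paper's code.

```python
import numpy as np

def sgd_step(params, grads, lr, trainable):
    """One SGD step that updates only the parameter groups named in
    `trainable` (e.g. {"bias"} for BitFit-style updates, or
    {"embedding"} for embedding-only updates); all others are frozen."""
    return {name: p - lr * grads[name] if name in trainable else p
            for name, p in params.items()}
```

In a framework like PyTorch the same effect is usually achieved by setting `requires_grad = False` on the frozen tensors, which also skips their gradient computation entirely.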