“Mind the Gap: Assessing Temporal Generalization in Neural Language Models § Scaling”, Angeliki Lazaridou, Adhiguna Kuncoro, Elena Gribovskaya, Devang Agrawal, Adam Liska, Tayfun Terzi, Mai Gimenez, Cyprien de Masson d’Autume, Tomas Kocisky, Sebastian Ruder, Dani Yogatama, Kris Cao, Susannah Young, Phil Blunsom (2021-02-03):

Our world is open-ended, non-stationary, and constantly evolving; thus what we talk about and how we talk about it change over time. This inherent dynamic nature of language contrasts with the current static language modeling paradigm, which trains and evaluates models on utterances from overlapping time periods.

Despite impressive recent progress, we demonstrate that Transformer-XL language models perform worse in the realistic setup of predicting future utterances from beyond their training period, and that model performance becomes increasingly worse with time. We find that, while increasing model size alone—a key driver behind recent progress—does not solve this problem, having models that continually update their knowledge with new information can indeed mitigate this performance degradation over time.

Hence, given the compilation of ever-larger language modeling datasets, combined with the growing list of language-model-based NLP applications that require up-to-date factual knowledge about the world, we argue that now is the right time to rethink the static way in which we currently train and evaluate our language models, and develop adaptive language models that can remain up-to-date with respect to our ever-changing and non-stationary world.

We publicly release our dynamic, streaming language modeling benchmarks for WMT and arXiv to facilitate language model evaluation that takes temporal dynamics into account.

4. The effect of outdated models persists even when increasing model sizes: …can increasing model size also improve temporal generalization? To this end, we train a bigger TIME-STRATIFIED model with 448M parameters—a 60% increase over the previous 287M model, and 30% larger than GPT-2 Medium.

Figure 3: Relative perplexity increase of the TIME-STRATIFIED models with 287M (dotted line) and 448M parameters (solid line), respectively, over the CONTROL model with 287M parameters, for WMT and CustomNews (§4).

…If increasing the model size were able to delay temporal degradation, we would expect the solid lines produced by the bigger models to have reduced (ie. flatter) slopes compared to the dotted lines produced by the smaller models. While larger TIME-STRATIFIED models, as expected, achieve lower absolute perplexities (5.5% improvement), model size has no statistically significant effect on the slope of these lines (p > 0.05, assessed using a t-test on the slopes found by fitting a linear regression). On both datasets, by the end of the test period (ie. late-2019), a smaller but more up-to-date CONTROL (287M) model outperforms a 60% larger but 2-year out-of-date TIME-STRATIFIED (448M) model. Hence, building models that perform well in this setup requires solutions that more directly tackle the specific challenges we emphasized through our findings so far, and update the model’s knowledge with new information.
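[The significance test they describe—fitting a linear regression of relative perplexity increase against time for each model size, then t-testing whether the two slopes differ—can be sketched in a few lines of plain Python. The data here is made up for illustration; the paper does not release the per-month perplexity values.]

```python
import math

def fit_slope(xs, ys):
    """OLS slope and its standard error for the model y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    s2 = sum(r * r for r in resid) / (n - 2)  # residual variance
    return b, math.sqrt(s2 / sxx)

def slope_difference_t(xs1, ys1, xs2, ys2):
    """t-statistic for H0: the two regression slopes are equal
    (eg. small vs. large model's perplexity-degradation rate over time)."""
    b1, se1 = fit_slope(xs1, ys1)
    b2, se2 = fit_slope(xs2, ys2)
    return (b1 - b2) / math.sqrt(se1 ** 2 + se2 ** 2)
```

[A non-significant t here is exactly the "p > 0.05" the authors report—which, as noted below, tells us only that the slope difference was not detectably nonzero at their sample size.]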

[This is a poor evaluation that does not justify their claims. They did not scale up dataset size or compute, by their description, so this is an improper inefficient scale-up which was not compute-optimal. They didn’t even increase the model size by an OOM, so it’s a tiny scale-up which couldn’t show much. And evaluating the benefit by ‘p > 0.05’ is fallacious in the standard NHST way: a failure to reject the null does not support the null. In fact, while they don’t provide the actual effect size, eyeballing their graph, it looks like the effect was in the predicted direction of the larger model scaling better! Not that this would be a great way to evaluate it even if they had done a real scale-up of multiple orders of magnitude and enough samples that they had genuine statistical power for their NHST tests, because the sensible way to handle temporal decay is to finetune on new data—and given the greater sample-efficiency of larger models, we can be sure that the larger models will do much better in staying up to date. Strange that they prefer the much more complex dynamic evaluation approach over simple scaling approaches, given that no one seems to want to deploy dynamic evaluation…]

6. Keeping models up-to-date: Online learning through dynamic evaluation:

One way to mitigate LMs’ degradation over time is to continually update the models’ knowledge with new information as new documents arrive into the stream. One way to do this is through dynamic evaluation (Mikolov et al 2010; Graves 2013; Krause et al 2019)—a form of online learning that continually updates the parameters of a pretrained model by performing gradient descent on the new data. While most prior work used dynamic evaluation to perform updates within a document, hence adapting to local topical shifts, here we use it to adapt to the temporal dynamics that occur within a stream of chronologically ordered documents, hence adapting to temporal trends across documents. Appendix B has more details on dynamic evaluation and our empirical settings.
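[The mechanism of dynamic evaluation—score each incoming document with the current parameters, then take a gradient step on that document's loss before moving to the next—is model-agnostic. A toy illustration with a unigram LM (softmax over per-word logits) rather than their Transformer-XL; the gradient of the cross-entropy with respect to a logit is the familiar predicted-probability minus empirical-frequency:]

```python
import math

class UnigramLM:
    """Toy unigram LM over a fixed vocabulary; logits trained by SGD."""
    def __init__(self, vocab):
        self.logits = {w: 0.0 for w in vocab}

    def probs(self):
        z = sum(math.exp(v) for v in self.logits.values())
        return {w: math.exp(v) / z for w, v in self.logits.items()}

    def nll(self, doc):
        """Per-token negative log-likelihood (log-perplexity) of a document."""
        p = self.probs()
        return -sum(math.log(p[w]) for w in doc) / len(doc)

    def dynamic_eval_step(self, doc, lr=0.1):
        """One online update: a gradient-descent step on this document's NLL,
        taken *after* the document has been scored, as in dynamic evaluation."""
        p = self.probs()
        counts = {w: 0 for w in self.logits}
        for w in doc:
            counts[w] += 1
        n = len(doc)
        # d(NLL)/d(logit_w) = p_w - count_w / n  (softmax cross-entropy)
        for w in self.logits:
            self.logits[w] -= lr * (p[w] - counts[w] / n)
```

[Looping `nll`-then-`dynamic_eval_step` over a chronologically ordered stream reproduces the paper's protocol in miniature: later documents are scored by a model that has already absorbed the earlier ones.]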

Figure 5: Relative perplexity increase with (solid lines) and without (dotted lines) dynamic evaluation, for the TIME-STRATIFIED model.

We plot the results in Figure 5: Dotted lines reflect the perplexity increase when comparing the CONTROL model to the TIME-STRATIFIED model, ie. the same graph as in Figure 1, whereas solid lines reflect the perplexity increase achieved when comparing the same CONTROL model with the TIME-STRATIFIED model augmented with dynamic evaluation (TIME-STRATIFIED-dyn). On all datasets, dynamic evaluation reduces the speed at which the model becomes outdated, as evidenced by the reduced upward slope, with a statistically significant effect for arXiv and WMT (p < 0.05, assessed using a t-test on the slopes found by fitting a linear regression). The improvements are more pronounced for arXiv, where a more granular analysis over weeks reveals that the model needs only about one week’s worth of data to overtake the CONTROL model. Moreover, we see much larger improvements for predicting EMERGING NEW WORDS, which exhibit strong temporal dynamics (§3.1, see Figure 3): We observe a 39.62% ppl. reduction (109.73 → 66.2) for EMERGING NEW WORDS, compared to the overall ppl. reduction (a 1.25% reduction, 22.45 → 22.17, for WMT; Figure 4).
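[The quoted percentage reductions follow directly from the before/after perplexities, up to rounding:]

```python
def relative_ppl_change(ppl_before, ppl_after):
    """Percentage change in perplexity (negative = improvement)."""
    return 100 * (ppl_after - ppl_before) / ppl_before

# Numbers quoted above:
# EMERGING NEW WORDS: 109.73 -> 66.2   (~ -39.7%)
# WMT overall:         22.45 -> 22.17  (~ -1.25%)
```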

When aiming to keep models up-to-date (especially for larger models), lightweight yet effective approaches are preferable because they allow the model to rapidly digest new information with minimal time, computation, and carbon costs. We thus experiment with updating only the embedding layer (ie. 52M parameters), capturing lexical semantic changes, as well as updating only the bias terms at all layers (ie. 0.198M parameters), as recently introduced by Ben-Zaken et al 2021. Figure 4 presents the results: In line with the findings of Ben-Zaken et al 2021, updating only the bias terms performs nearly as well as updating the full model. [Large models are sample-efficient & good at continual-learning…]
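[Their lightweight variants amount to freezing everything except a named subset of parameters before the online updates. A sketch of the subset selection, assuming PyTorch-style parameter names in which embedding tables contain "embed" and bias vectors end in "bias" (the exact names depend on the model implementation):]

```python
def trainable_params(named_params, mode):
    """Select which parameters to update when keeping a model current.

    mode: 'full'      -> every parameter (standard dynamic evaluation)
          'embedding' -> only the token-embedding table (lexical change)
          'bias'      -> only bias vectors at every layer (BitFit-style,
                         Ben-Zaken et al 2021)
    """
    if mode == "full":
        return dict(named_params)
    if mode == "embedding":
        return {n: p for n, p in named_params.items() if "embed" in n}
    if mode == "bias":
        return {n: p for n, p in named_params.items() if n.endswith("bias")}
    raise ValueError(f"unknown mode: {mode}")
```

[The appeal is the parameter ratio: in the paper's model, bias-only updating touches ~0.198M of 287M parameters, yet performs nearly as well as updating everything.]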