"Do Long-Range Language Models Actually Use Long-Range Context?", 2021-09-19 ():
Language models are generally trained on short, truncated input sequences, which limits their ability to use discourse-level information present in long-range context to improve their predictions. Recent efforts to improve the efficiency of self-attention have led to a proliferation of long-range Transformer language models, which can process much longer sequences than models of the past. However, the ways in which such models take advantage of the long-range context remain unclear.
In this paper, we perform a fine-grained analysis of two long-range Transformer language models (the Local Transformer and the Routing Transformer, the latter of which achieves state-of-the-art perplexity on the PG-19 long-sequence language modeling benchmark) that accept input sequences of up to 8K tokens.
Our results reveal that providing long-range context (i.e., beyond the previous 2K tokens) to these models improves their predictions on only a small set of tokens (e.g., those that can be copied from the distant context, such as proper nouns) and does not help at all on sentence-level prediction tasks.
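A minimal sketch of the kind of per-token comparison this analysis implies (not the authors' code): score each token once with at most 2K tokens of preceding context and once with up to 8K, then look at which positions actually benefit. It assumes a causal LM with a HuggingFace-style forward pass returning `.logits`, and `ids` is a 1D tensor of token ids for one long document; the re-scoring loop is deliberately simple and slow.

```python
import torch

def per_token_nll(model, ids, max_context):
    """Negative log-likelihood of each token given at most `max_context`
    preceding tokens (one forward pass per position; slow but simple)."""
    nlls = []
    for t in range(1, len(ids)):
        start = max(0, t - max_context)
        window = ids[start : t + 1].unsqueeze(0)       # context + target token
        with torch.no_grad():
            logits = model(window).logits              # [1, seq_len, vocab]
        # The second-to-last position's logits predict the final token ids[t].
        log_probs = torch.log_softmax(logits[0, -2], dim=-1)
        nlls.append(-log_probs[ids[t]].item())
    return torch.tensor(nlls)

# Which token positions improve when context grows from 2K to 8K?
# nll_short = per_token_nll(model, ids, max_context=2048)
# nll_long  = per_token_nll(model, ids, max_context=8192)
# improved  = (nll_short - nll_long) > 0.1   # boolean mask over positions
```

Aggregating the `improved` mask by token type (e.g., proper nouns vs. other words) is one way to reproduce the kind of breakdown the paper reports.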
Finally, we discover that PG-19 contains a variety of document types and domains, and that long-range context helps most for literary novels (as opposed to textbooks or magazines).