“Compressive Transformers for Long-Range Sequence Modeling”, 2019-11-13:
We present the Compressive Transformer, an attentive sequence model which compresses past memories for long-range sequence learning.
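The paper evaluates several compression functions for mapping old memories to a smaller compressed memory, including pooling and strided convolutions. As a rough illustration only, here is a minimal PyTorch sketch of the memory-compression step, assuming a strided 1D convolution as the compression function; the class name, sizes, and compression rate are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CompressiveMemory(nn.Module):
    """Sketch: compress evicted memory slots at rate c.

    When activations are evicted from the regular memory, a
    compression function maps every c old slots to 1 compressed
    slot, which is appended to a secondary compressed memory.
    Here the compression function is a strided 1D convolution.
    """
    def __init__(self, d_model: int, compression_rate: int = 3):
        super().__init__()
        # kernel_size == stride == c: each output slot summarizes
        # c consecutive evicted slots, shrinking length by a factor c.
        self.compress = nn.Conv1d(d_model, d_model,
                                  kernel_size=compression_rate,
                                  stride=compression_rate)

    def forward(self, evicted: torch.Tensor) -> torch.Tensor:
        # evicted: (batch, seq_len, d_model), seq_len divisible by c
        x = evicted.transpose(1, 2)        # (batch, d_model, seq_len)
        compressed = self.compress(x)      # (batch, d_model, seq_len // c)
        return compressed.transpose(1, 2)  # (batch, seq_len // c, d_model)

# Usage: 6 evicted slots -> 2 compressed slots at rate c = 3.
mem = CompressiveMemory(d_model=512, compression_rate=3)
old = torch.randn(1, 6, 512)
print(mem(old).shape)  # torch.Size([1, 2, 512])
```

The attention layers then attend over both the regular and the compressed memory, extending the effective context without a proportional growth in memory cost.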
We find the Compressive Transformer obtains state-of-the-art language modeling results on the WikiText-103 and enwik8 benchmarks, achieving 17.1 perplexity (ppl) and 0.97 bits per character (bpc) respectively. We also find it can model high-frequency speech effectively and can serve as a memory mechanism for RL, demonstrated on an object-matching task.
To promote research in long-range sequence learning, we propose PG-19, a new open-vocabulary language modeling benchmark derived from books.