“Mogrifier LSTM”, 2019-09-04:
Many advances in Natural Language Processing have been based upon more expressive models for how inputs interact with the context in which they occur. Recurrent networks, which have enjoyed a modicum of success, still lack the generalization and systematicity ultimately required for modeling language.
In this work, we propose an extension [Mogrifier] to the venerable Long Short-Term Memory (LSTM) in the form of mutual gating of the current input and the previous output. This mechanism affords the modeling of a richer space of interactions between inputs and their context. Equivalently, our model can be viewed as making the transition function given by the LSTM context-dependent.
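The mutual gating described above can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation: the input `x` and the previous LSTM output `h` alternately rescale each other through sigmoid gates before being fed to an ordinary LSTM cell. The function name `mogrify`, the projection matrices `Q`/`R`, and the number of rounds are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mogrify(x, h, Q, R, rounds=5):
    """Mutually gate input x and previous output h for `rounds` steps.

    Odd-numbered steps rescale x by a gate computed from h; even-numbered
    steps rescale h by a gate computed from x. The gated pair would then
    be passed to a standard LSTM cell in place of (x, h).
    Q and R are lists of projection matrices (illustrative names).
    """
    for i in range(1, rounds + 1):
        if i % 2 == 1:
            # Gate the input using the previous output (2*sigmoid keeps
            # the expected scale of x roughly unchanged at initialization).
            x = 2.0 * sigmoid(Q[i // 2] @ h) * x
        else:
            # Gate the previous output using the (already gated) input.
            h = 2.0 * sigmoid(R[i // 2 - 1] @ x) * h
    return x, h

# Toy usage with random projections (dimensions chosen arbitrarily):
rng = np.random.default_rng(0)
m, n = 4, 3                                   # input dim, hidden dim
Q = [rng.standard_normal((m, n)) * 0.1 for _ in range(3)]
R = [rng.standard_normal((n, m)) * 0.1 for _ in range(2)]
x, h = rng.standard_normal(m), rng.standard_normal(n)
x_gated, h_gated = mogrify(x, h, Q, R, rounds=5)
```

With `rounds=0` the function reduces to the vanilla LSTM's direct use of `x` and `h`, which is why the mechanism can be read as making the LSTM's transition function depend on the context `h`.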
Experiments demonstrate markedly improved generalization on language modeling in the range of 3–4 perplexity points on Penn Treebank and Wikitext-2, and 0.01–0.05 bpc on 4 character-based datasets. We establish a new state of the art on all datasets with the exception of enwik8, where we close a large gap between the LSTM and Transformer models.
…Of particular note is the comparison to Transformer-XL (Dai et al 2019), a state-of-the-art model on larger datasets such as Wikitext-103 and enwik8. On PTB, without dynamic evaluation, the Transformer-XL is on par with our LSTM baseline, which puts it about 3.5 perplexity points behind the Mogrifier. On enwik8, also without dynamic evaluation, the Transformer-XL has a large, 0.09 bpc advantage at similar parameter budgets, but with dynamic evaluation this gap disappears. However, we did not test the Transformer-XL ourselves, so a fair comparison is not possible due to differing experimental setups and the rather sparse result matrix for the Transformer-XL.