“Block-Recurrent Transformers”, DeLesley Hutchins, Imanol Schlag, Yuhuai Wu, Ethan Dyer, Behnam Neyshabur (2022-03-11):

[GitHub] We introduce the Block-Recurrent Transformer, which applies a transformer layer in a recurrent fashion along a sequence, and has linear complexity with respect to sequence length.

Our recurrent cell operates on blocks of tokens rather than single tokens, and leverages parallel computation within a block in order to make efficient use of accelerator hardware. The cell itself is strikingly simple. It is merely a transformer layer: it uses self-attention and cross-attention to efficiently compute a recurrent function over a large set of state vectors and tokens. Our design was inspired in part by LSTM cells, and it uses LSTM-style gates, but it scales the typical LSTM cell up by several orders of magnitude.
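The cell described above can be sketched in a few lines of NumPy. This is a hypothetical simplification for illustration (single attention head, shared projections, toy dimensions), not the paper's actual implementation: tokens attend over the block plus the incoming state, the state vectors cross-attend to the token block, and the state is then updated through LSTM-style forget/input gates.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention(q, k, v):
    # single-head scaled dot-product attention
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

rng = np.random.default_rng(0)
d, block_len, num_states = 16, 8, 4  # toy sizes

# hypothetical shared projections (a real layer has separate per-head weights)
Wq, Wk, Wv = (0.1 * rng.normal(size=(d, d)) for _ in range(3))
Wg = 0.1 * rng.normal(size=(d, 2 * d))  # produces forget and input gates

def block_recurrent_cell(x, state):
    """One recurrent step over a block of tokens.

    x:     (block_len, d) token embeddings for the current block
    state: (num_states, d) state vectors carried over from the previous block
    """
    # tokens self-attend within the block and cross-attend to the state
    kv = np.concatenate([x, state], axis=0)
    y = x + attention(x @ Wq, kv @ Wk, kv @ Wv)

    # states cross-attend to the token block to gather new information
    update = attention(state @ Wq, x @ Wk, x @ Wv)

    # LSTM-style gated update: forget part of the old state,
    # admit part of the proposed update
    gates = (state @ Wg).reshape(num_states, 2, d)
    f, i = sigmoid(gates[:, 0]), sigmoid(gates[:, 1])
    new_state = f * state + i * np.tanh(update)
    return y, new_state

# process a long sequence one block at a time: cost is linear in length
state = np.zeros((num_states, d))
seq = rng.normal(size=(4 * block_len, d))
outputs = []
for t in range(0, len(seq), block_len):
    y, state = block_recurrent_cell(seq[t : t + block_len], state)
    outputs.append(y)
out = np.concatenate(outputs)  # (32, 16)
```

Because each block only attends over its own tokens plus a fixed number of state vectors, the per-block cost is constant, which is where the linear scaling in sequence length comes from.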

Our implementation of recurrence has the same cost in both computation time and parameter count as a conventional transformer layer, but offers dramatically improved perplexity in language modeling tasks over very long sequences.

Our model outperforms a long-range Transformer-XL baseline by a wide margin, while running twice as fast. We demonstrate its effectiveness on PG-19 (books), arXiv papers, and GitHub source code.

…Appendix G: Qualitative Analysis Results: The following are excerpts from our qualitative study. We selected 5 books at random from the PG-19 test set, and ran two different models on each book…For each token, we compute the difference between the cross-entropy loss (i.e., the negative log-likelihood (NLL)) output by the two models, and then sort the results. Figure 4 shows an example of the per-token difference in NLL between the two models on the first book; the x-axis is the index of the token. On average, the recurrent model does slightly better than Transformer-XL, but it does not necessarily make a better prediction for any individual token.
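The per-token comparison above can be sketched as follows. The function name `top_nll_gaps` and the toy NLL arrays are hypothetical, but the procedure (per-token NLL difference, then sort) matches the one described in the text.

```python
import numpy as np

def top_nll_gaps(nll_recurrent, nll_baseline, k=4):
    """Indices of the k tokens where the first model beats the second
    by the largest NLL margin, plus the mean per-token difference."""
    diff = np.asarray(nll_baseline) - np.asarray(nll_recurrent)
    top = np.argsort(diff)[::-1][:k]  # positive diff = recurrent model better
    return top, diff.mean()

# toy per-token NLLs for illustration
nll_rec = [2.0, 1.0, 0.5, 3.0]
nll_xl = [2.1, 4.0, 0.6, 3.0]
top, mean_gap = top_nll_gaps(nll_rec, nll_xl, k=2)
# token 1 is the biggest win for the recurrent model, while the
# mean gap across all tokens stays small, as the study observes
```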

The following excerpts show the top 4 tokens where the Block-Recurrent Transformer made a better prediction than Transformer-XL; these tokens correspond to spikes in Figure 4. We show the token number, the NLL returned by the recurrent model, the NLL returned by Transformer-XL, and an excerpt of text, with the token itself marked with |token|. Almost all of the top tokens are proper names of characters and places. In all cases except one, the mis-predicted name does not appear within the attention window of the previous 512 tokens. These names are thus invisible to Transformer-XL, but visible to the recurrent model.

Note that these are not cherry-picked examples; the 5 books were chosen at random. Moreover, the same pattern holds if the search is expanded to the top 40 tokens for each book. In fact, even the names are often the same: Transformer-XL seems to mispredict the same names over and over again, and these are likely the names of main characters.

…Our second qualitative study is structured similarly to the first, except that instead of comparing two different models, we compare two different runs of the same model…The first run processes the book normally, while the second run clears the recurrent states at the beginning of each 4,096-token segment. In the second run, the model can use recurrence within a segment to look beyond the local attention window of 512 tokens, but it cannot use recurrence to carry information from one segment to the next…The overall pattern is very similar to the first qualitative experiment: most of the tokens involve proper names. We verified that in most cases, the mis-predicted name not only fails to occur within the 512-token attention window, but does not occur within the 4,096-token segment at all. In addition to proper names, chapter titles and illustration captions occur frequently within the top 40 results; the recurrent model seems to be remembering these from a previous occurrence in the table of contents…Perhaps most interestingly, in two of the books, one of the highest-ranked mispredictions was the title and author of the book itself. Project Gutenberg inserts boilerplate at both the beginning and end of each book; the title and author are listed multiple times at the beginning, and once at the end. This experiment thus shows that the model is able to ā€œrememberā€ this information in the recurrent state, across a distance of 60,000 tokens or more.

Figure 6: Cumulative cross-entropy on PG-19 of a 13-layer Transformer-XL and a Block-Recurrent model. Though comparable over the first few thousand tokens, the recurrent model performs better on longer sequences. In red we show the number of documents at a given token length.

…In Figure 6 we plot the cumulative cross-entropy, i.e., the bits-per-token (log2 perplexity) averaged over all tokens up to the given length, comparing the Block-Recurrent Transformer against the Transformer-XL baseline. Performance of the two architectures is comparable for the first few thousand tokens, but the recurrent architecture clearly outperforms Transformer-XL at longer document lengths.
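The quantity plotted in Figure 6 can be computed as a running average of per-token NLLs, converted from nats to bits; the function name and input convention below are assumptions for illustration.

```python
import numpy as np

def cumulative_bits_per_token(nll_nats):
    """Mean bits-per-token over the prefix ending at each position.

    nll_nats: per-token negative log-likelihoods in nats.
    Entry t of the result is log2-perplexity averaged over tokens 0..t.
    """
    nll_bits = np.asarray(nll_nats, dtype=float) / np.log(2.0)
    return np.cumsum(nll_bits) / np.arange(1, len(nll_bits) + 1)

# a model that assigns probability 1/2 to every token costs exactly 1 bit each
curve = cumulative_bits_per_token([np.log(2.0)] * 4)  # -> [1., 1., 1., 1.]
```

Plotting this curve for both models makes late-sequence differences visible even when early-sequence performance is identical, which is exactly the comparison Figure 6 draws.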