“How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers”, Michael Hassid, Hao Peng, Daniel Rotem, Jungo Kasai, Ivan Montero, Noah Smith, Roy Schwartz (2022-11-07):

The attention mechanism is considered the backbone of the widely-used Transformer architecture. It contextualizes the input by computing input-specific attention matrices.

We find that this mechanism, while powerful and elegant, is not as important as typically thought for pretrained language models. We introduce PAPA, a new probing method that replaces the input-dependent attention matrices with constant ones—the average attention weights over multiple inputs. We use PAPA to analyze several established pretrained Transformers on 6 downstream tasks.
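The core of the probing method can be sketched concretely: compute each head's attention matrix on a sample of inputs, average those matrices, and then use the average as a fixed, input-independent attention matrix. The following is a minimal single-head NumPy sketch under the simplifying assumption of a fixed sequence length; the function names (`papa_constant_matrix`, `attend`) are illustrative, not from the paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_matrix(q, k):
    # Standard input-dependent attention weights: softmax(QK^T / sqrt(d)).
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d))

def papa_constant_matrix(inputs, wq, wk):
    # PAPA-style constant matrix: the attention weights averaged
    # over multiple inputs (here, all of the same sequence length).
    mats = [attention_matrix(x @ wq, x @ wk) for x in inputs]
    return np.mean(mats, axis=0)

def attend(x, attn, wv):
    # Contextualize values with a given attention matrix, which may be
    # the usual input-dependent one or a frozen constant one.
    return attn @ (x @ wv)

# Illustrative shapes: 4 inputs of 8 tokens, hidden size 16.
rng = np.random.default_rng(0)
inputs = [rng.normal(size=(8, 16)) for _ in range(4)]
wq, wk, wv = (rng.normal(size=(16, 16)) for _ in range(3))
const = papa_constant_matrix(inputs, wq, wk)
out = attend(inputs[0], const, wv)   # attention is now input-independent
```

Since each per-input matrix is row-stochastic, the averaged constant matrix is too, so it can be dropped in wherever the original attention matrix was used.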

We find that without any input-dependent attention, all models achieve competitive performance—an average relative drop of only 8% from the probing baseline. Further, little or no performance drop is observed when replacing half of the input-dependent attention matrices with constant (input-independent) ones.

Interestingly, we show that better-performing models lose more from applying our method than weaker models, suggesting that the utilization of the input-dependent attention mechanism might be a factor in their success.

Our results motivate research on simpler alternatives to input-dependent attention, as well as on methods for better utilization of this mechanism in the Transformer architecture.

Figure 2: Probing results (y-axis) with decreasing number of attention heads (x-axis). BASE models are shown in Figure 2a, and LARGE models are shown in Figure 2b. Higher is better in all cases.

…Half of the attention matrices can be replaced without loss in performance: We note that in almost all cases, replacing half of the models’ attention matrices leads to no major drop in performance. In fact, in some cases performance even improves compared to the original model (e.g. BERTBASE and DeBERTaLARGE), suggesting that some of the models’ heads have a slight preference for constant matrices. This result is consistent with findings from recent hybrid models that use both constant and regular attention (Liu et al 2021; Lee-Thorp et al 2021) to build efficient models.

…We first notice a diagonal pattern, in which each token mostly attends to itself or to its neighboring words. This pattern is observed in about 90% of the constant matrices produced by PAPA. Second, about 40% of the heads put most of their weight mass on the [CLS] and/or [SEP] tokens (perhaps in combination with the diagonal pattern described above). Lastly, while for some heads the weight mass is concentrated in a single entry per row (corresponding to a single token), in most cases it is distributed over several entries (corresponding to several different tokens). These patterns are similar to those identified by Clark et al 2019, and partly explain our findings: many attention heads mostly focus on fixed patterns that can also be captured by a constant matrix.
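The two dominant patterns described above are simple enough to detect mechanically. Below is a small NumPy sketch of such checks; the function names, the window size, and the 0.5 threshold are our own illustrative choices, not the paper's criteria.

```python
import numpy as np

def is_diagonal_pattern(attn, window=1, threshold=0.5):
    # Diagonal pattern: does each row put most of its weight mass on
    # the token itself and its immediate neighbors?
    n = attn.shape[0]
    mass = np.zeros(n)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mass[i] = attn[i, lo:hi].sum()
    return mass.mean() > threshold

def special_token_mass(attn, special_cols):
    # Average weight placed on given columns, e.g. the positions
    # of [CLS] and/or [SEP] in the sequence.
    return attn[:, special_cols].sum(axis=1).mean()

# An identity matrix is trivially diagonal; a uniform matrix is not.
diag = np.eye(6)
uniform = np.full((6, 6), 1 / 6)
```

Run over the constant matrices produced by PAPA, checks like these would recover counts comparable to the ~90% diagonal and ~40% special-token figures reported above.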

Figure 3: Stronger-performing PLMs use their attention capability more. y-axis: original model average performance; x-axis: relative reduced score when all attention matrices are replaced with constant ones.

…Performant models rely more on attention: Figure 3 shows, for each model, the relation between the original performance (averaged across tasks) and the average (relative) reduced score when all attention heads are replaced. We observe a clear trend between the models’ performance and their relative reduced score, which suggests that better-performing models make more use of their attention mechanism.