The attention mechanism is considered the backbone of the widely-used Transformer architecture. It contextualizes the input by computing input-specific attention matrices.
We find that this mechanism, while powerful and elegant, is not as important as typically thought for pretrained language models. We introduce PAPA, a new probing method that replaces the input-dependent attention matrices with constant ones: the average attention weights over multiple inputs. We use PAPA to analyze several established pretrained Transformers on 6 downstream tasks.
We find that without any input-dependent attention, all models achieve competitive performance, with an average relative drop of only 8% from the probing baseline. Further, little or no performance drop is observed when replacing half of the input-dependent attention matrices with constant (input-independent) ones.
Interestingly, we show that better-performing models lose more from applying our method than weaker models, suggesting that the utilization of the input-dependent attention mechanism might be a factor in their success.
Our results motivate research on simpler alternatives to input-dependent attention, as well as on methods for better utilization of this mechanism in the Transformer architecture.
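The core substitution behind PAPA can be sketched in a few lines: average the input-specific attention matrices of a head over many inputs, then use that single constant matrix in place of the softmax attention. A minimal NumPy sketch of this idea (function names are ours, not the authors'; the actual method operates head-by-head inside a pretrained Transformer):

```python
import numpy as np

def average_attention(attention_maps):
    """Average input-specific attention matrices into one constant matrix.

    attention_maps: iterable of (seq_len, seq_len) row-stochastic arrays,
    one per input (assumed already padded to a common length).
    """
    const = np.mean(np.stack(list(attention_maps)), axis=0)
    # Renormalize rows so the averaged matrix stays row-stochastic.
    return const / const.sum(axis=1, keepdims=True)

def constant_attention_output(const_matrix, values):
    """With input-dependent attention removed, a head's output is just a
    fixed, input-independent mixture of the value vectors."""
    return const_matrix @ values
```

Note that the constant matrix no longer depends on the queries and keys at all; only the value projection of the current input enters the head's output.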
Figure 2: Probing results (y-axis) with decreasing number of attention heads (x-axis). BASE models are shown in Figure 2a, and LARGE models are shown in Figure 2b. Higher is better in all cases.
…Half of the attention matrices can be replaced without loss in performance: We note that in almost all cases, replacing half of the models' attention matrices leads to no major drop in performance. In fact, in some cases performance even improves compared to the original model (e.g., BERTBASE and DeBERTaLARGE), suggesting that some of the models' heads have a slight preference for constant matrices. This result is consistent with some of the findings of recent hybrid models that use both constant and regular attention (Liu et al., 2021; Lee-Thorp et al., 2021) to build efficient models.
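A hybrid layer of the kind those works describe can be sketched as follows: each head either computes regular softmax attention or falls back to a precomputed constant matrix. This is a hypothetical standalone sketch, not code from any of the cited models:

```python
import numpy as np

def hybrid_attention(queries, keys, values, const_matrices, use_constant):
    """Per-head attention where some heads use regular input-dependent
    softmax attention and others a fixed (input-independent) matrix.

    queries, keys, values: (num_heads, seq_len, d_head)
    const_matrices: (num_heads, seq_len, seq_len), row-stochastic
    use_constant: length-num_heads booleans; True -> constant head
    """
    num_heads, seq_len, d_head = queries.shape
    outputs = []
    for h in range(num_heads):
        if use_constant[h]:
            attn = const_matrices[h]  # same matrix for every input
        else:
            scores = queries[h] @ keys[h].T / np.sqrt(d_head)
            scores -= scores.max(axis=1, keepdims=True)  # numerical stability
            weights = np.exp(scores)
            attn = weights / weights.sum(axis=1, keepdims=True)
        outputs.append(attn @ values[h])
    return np.stack(outputs)
```

Constant heads skip the query-key product entirely, which is where the efficiency of such hybrids comes from.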
…We first notice a diagonal pattern, in which each token mostly attends to itself or to its neighboring words. This pattern is observed in about 90% of the constant matrices produced by PAPA. Second, about 40% of the heads put most of their weight mass on the [CLS] and/or [SEP] tokens (perhaps in combination with the diagonal pattern described above). Lastly, while for some of the heads the weight mass is concentrated in a single entry per row (corresponding to a single token), in most cases it is distributed over several entries (corresponding to several different tokens). These patterns are similar to those identified by Clark et al. (2019), and partly explain our findings: many of the attention heads mostly focus on fixed patterns that can also be captured by a constant matrix.
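The diagonal pattern in particular is easy to express as a constant matrix: each row puts most of its mass on the corresponding position and some on its neighbors. A small illustrative sketch (the weights are made-up defaults, not values measured in the paper):

```python
import numpy as np

def diagonal_constant_matrix(seq_len, self_weight=0.6, neighbor_weight=0.2):
    """Constant attention matrix for the diagonal pattern: each position
    attends mostly to itself and somewhat to its immediate neighbors."""
    M = np.zeros((seq_len, seq_len))
    for i in range(seq_len):
        M[i, i] = self_weight
        if i > 0:
            M[i, i - 1] = neighbor_weight
        if i + 1 < seq_len:
            M[i, i + 1] = neighbor_weight
    # Renormalize rows (edge rows have fewer neighbors) to keep them stochastic.
    return M / M.sum(axis=1, keepdims=True)
```

A [CLS]/[SEP]-focused pattern would instead concentrate every row's mass on those fixed positions; both are input-independent, which is why a constant matrix can capture them.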
Figure 3: Stronger-performing PLMs use their attention capability more. y-axis: original model average performance; x-axis: relative reduced score when all attention matrices are replaced with constant ones.
…Performant models rely more on attention: Figure 3 shows, for each model, the relation between its original performance (averaged across tasks) and its average (relative) reduced score when all attention heads are replaced. We observe a clear trend between the models' performance and their relative reduced score, suggesting that better-performing models make greater use of their attention mechanism.