It is notoriously difficult to train Transformers on small datasets; typically, large pre-trained models are instead used as the starting point. We examine the weights of such pre-trained Transformers (particularly for vision) in search of reasons for this discrepancy.
Surprisingly, we find that simply initializing the weights of self-attention layers so that they “look” more like their pre-trained counterparts allows us to train vanilla Transformers faster and to higher final accuracies, particularly on vision tasks such as CIFAR-10 and ImageNet classification, where we see gains in accuracy of over 5% and 4%, respectively.
Our initialization scheme is closed-form, learning-free, and very simple: we set the product of the query and key weights to be the identity, and the product of the value and projection weights to the negative identity.
As this mimics the patterns we saw in pre-trained Transformers, we call the technique mimetic initialization.
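The scheme above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' exact code: we assume a single head of width d_head inside a model of width d_model, and factor the target product W_Q W_K^T ≈ I into two low-rank matrices via a truncated SVD (a small noise term makes the rank cut non-degenerate); the value/projection target -I is full-rank, so it factors exactly.

```python
import numpy as np

def mimetic_attention_init(d_model, d_head, beta=0.5, seed=0):
    """Sketch of a mimetic init for one attention head (hypothetical names).

    Targets: W_Q @ W_K.T ~= I (rank-d_head approximation),
             W_V @ W_proj == -I (exact).
    """
    rng = np.random.default_rng(seed)
    # Noisy identity target for the query-key product; the noise term
    # makes the top-d_head singular directions well defined.
    Z = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    target_qk = np.eye(d_model) + beta * Z
    # Truncated SVD: keep the top d_head singular directions and split
    # the singular values evenly between the two factors.
    U, S, Vh = np.linalg.svd(target_qk)
    W_Q = U[:, :d_head] * np.sqrt(S[:d_head])      # (d_model, d_head)
    W_K = Vh[:d_head, :].T * np.sqrt(S[:d_head])   # (d_model, d_head)
    # Value/projection product is full-rank, so -I factors exactly.
    W_V = np.eye(d_model)
    W_proj = -np.eye(d_model)
    return W_Q, W_K, W_V, W_proj
```

With this factorization, W_Q @ W_K.T is exactly the best rank-d_head approximation of the noisy identity target, so pre-softmax attention scores are dominated by token self-similarity.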
Figure 2: Attention maps computed from one CIFAR-10 batch for ViT-Tiny: (a) untrained; (b) CIFAR-10-trained; (c) ImageNet-pretrained; (d) using our init; (e) our init, then CIFAR-10-trained. Rows: layers 1, 4, 11.
…In Figure 2, we show the attention maps in a ViT-Tiny for a variety of training settings, averaged over the 3 heads and over a batch of CIFAR-10 inputs. Note the difference between the untrained model (a) and the untrained one using our initialization (d). Further, there is some degree of similarity between the ImageNet-pretrained model (c) and our untrained one (d).
After training our initialized ViT on CIFAR-10, the early layers are similar to those of the ImageNet-pretrained ViT, while the later layers more closely resemble those of the ViT trained only on CIFAR-10 (b).
The last layers of the ImageNet-pretrained ViT implement a kind of broadcasting operation, which we do not attempt to mimic.
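The diagonal-like maps produced by our init can be seen in a toy example. The sketch below (illustrative; variable names and dimensions are our own, not from the paper) computes single-head attention maps softmax(X W_QK X^T / sqrt(d)) for a random W_QK versus the identity product our init targets; with W_QK = I, scores reduce to token self-similarity, so each token attends mostly to itself.

```python
import numpy as np

def attention_map(X, W_QK):
    """Single-head attention map: row-wise softmax of X @ W_QK @ X.T / sqrt(d)."""
    d = X.shape[1]
    scores = X @ W_QK @ X.T / np.sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 64))  # 8 hypothetical token embeddings of width 64
# Random init: W_Q W_K^T is an arbitrary matrix; the map is unstructured.
A_random = attention_map(X, rng.standard_normal((64, 64)) / 8.0)
# Mimetic init: W_Q W_K^T = I; the map is diagonal-dominant.
A_mimetic = attention_map(X, np.eye(64))
```

The diagonal dominance of A_mimetic corresponds to the near-diagonal maps in Figure 2(d).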
…7. Language Modeling

While our method was primarily inspired by pretrained Vision Transformers, in this section we investigate its potential for use in language models. As noted in Sec. 3 and seen in Figure 9, we do not see precisely the same pattern in a pre-trained GPT-2 model as we do in a ViT. Nonetheless, we use the same technique here without modification; we saw no improvement from, e.g., attempting to model the positive diagonals of W_V W_proj.
Figure 7: A pretrained GPT-2 shows considerably different patterns in the products W_Q W_K^T and W_V W_proj compared to ViTs.
…While our initialization does not make as large a difference on these small-scale language tasks as it does on vision tasks, it does yield a small improvement. We suspect that a mimetic initialization scheme more finely tuned to the language setting could perform still better…We speculate that it may be possible to use domain knowledge to “program” models before training in order to reach desirable optima that would be out of reach from a completely random initialization. With better-structured initialization techniques like our own, perhaps Transformers really are the universal architecture.