"Not All Layers Are Equally As Important: Every Layer Counts BERT", 2023-11-03 ():
This paper introduces a novel modification of the transformer architecture, tailored for the data-efficient pretraining of language models. Our approach allows each transformer layer to select which outputs of previous layers to process. [It's just a DenseNet, but for LTG-BERT.]
This aspect is evaluated by participating in the BabyLM challenge, where our solution won both the strict and strict-small tracks.
The empirical results verify the potential of this simple modification and show that not all layers are equally as important.
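The layer-selection idea described above amounts to a learned weighted combination of all previous layers' outputs (DenseNet-style). A minimal NumPy sketch, assuming softmax-normalized scalar weights per previous layer; the function name and the softmax parameterization are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def combine_layers(layer_outputs, weights):
    """Combine the hidden states of all previous layers into one input.

    layer_outputs: list of arrays, each of shape (seq_len, hidden_dim),
                   one per previous layer (including the embedding layer).
    weights: learned scalar logits, one per previous layer (hypothetical
             parameterization; the paper's exact scheme may differ).
    """
    # Softmax-normalize so the combination is a convex mixture.
    w = np.exp(weights - weights.max())
    w = w / w.sum()
    # Weighted sum over previous layers: layers with near-zero weight
    # are effectively "not selected" -- not all layers matter equally.
    return sum(wi * h for wi, h in zip(w, layer_outputs))

# Toy usage: three previous layers, uniform logits -> plain average.
outs = [np.ones((4, 8)) * i for i in range(1, 4)]
mixed = combine_layers(outs, np.zeros(3))
```

With per-layer learned logits, the model can downweight uninformative layers during pretraining, which is the mechanism behind the "not all layers are equally important" claim.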
To our surprise, the winning approach beat LLaMA-2 70B (trained on 2 trillion tokens [rather than 0.001t words]) on 3–4 evals! How'd they do it?
Flashy LTG-BERT architecture (et al 2023)
Some small architecture mods
Train for ~500 epochs 😱
They also won strict-small!