“Not All Layers Are Equally As Important: Every Layer Counts BERT”, Lucas Georges Gabriel Charpentier, David Samuel (2023-11-03):

This paper introduces a novel modification of the transformer architecture, tailored for the data-efficient pretraining of language models. Our approach allows each transformer layer to select which outputs of previous layers to process. [It’s just a DenseNet, but for LTG-BERT.]

We evaluate this modification by participating in the BabyLM challenge, where our solution won both the strict and strict-small tracks.

The empirical results verify the potential of this simple modification and show that not all layers are equally as important.
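The layer-selection idea described in the abstract (each layer consuming a learned combination of all earlier layers' outputs, DenseNet-style) can be sketched roughly as follows. This is an illustrative simplification, not the paper's exact formulation: the function and parameter names (`elc_forward`, `alphas`) and the softmax weighting are assumptions made for the sketch.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def elc_forward(x, layers, alphas):
    """DenseNet-style layer selection (illustrative sketch).

    Each layer i receives a softmax-weighted sum of the outputs of all
    previous layers (including the embedding), with one learnable weight
    vector `alphas[i]` per layer. `layers` is a list of callables.
    """
    outputs = [x]  # index 0: the embedding output
    for i, layer in enumerate(layers):
        w = softmax(alphas[i][: len(outputs)])
        combined = sum(wj * oj for wj, oj in zip(w, outputs))
        outputs.append(layer(combined))
    return outputs[-1]

# Toy usage: two "layers" over a 2-dim hidden state.
x = np.array([1.0, 2.0])
layers = [lambda h: h * 2, lambda h: h + 1]
alphas = [np.array([0.0]), np.array([0.0, 0.0])]  # uniform weights here
out = elc_forward(x, layers, alphas)
```

In a real model the `alphas` would be trained jointly with the layer weights, so each layer learns *which* earlier outputs matter — the "not all layers are equally important" effect the title refers to.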


Alex Warstadt:

To our surprise, the winning approach beat LLaMA-2 70B (trained on 2 trillion tokens [rather than 0.001t words]) on 3⁄4 evals! How’d they do it?

  1. Flashy LTG-BERT architecture (Samuel et al 2023)

  2. Some small architecture mods

  3. Train for ~500 epochs đŸ˜±

They also won strict-small!