Teaching Question: What are the technical ideas around Scaling (but not Alignment) that grad-level students should know or study? Is it thinking about systems? Order-of-magnitude thinking? Abstraction? Experimental design?
Replying to @srush_nlp
I find people unfamiliar with scaling are shocked by this:


Also, all modern scaling strategies are highly synchronous, which means one bad node can tank the entire system. I would love our future researchers to be thinking about this.
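
A minimal sketch of why this happens (pure Python, hypothetical step times): in synchronous data parallelism the gradient all-reduce can't complete until every worker finishes its step, so step time is the max over workers rather than the mean, and one degraded node slows the whole job.

```python
import random

def synchronous_step_time(per_worker_times):
    # The all-reduce waits for the slowest worker, so the step is gated by the max.
    return max(per_worker_times)

random.seed(0)
n_workers = 256
# Hypothetical per-step compute times: healthy workers take ~1.0s each.
times = [random.gauss(1.0, 0.02) for _ in range(n_workers)]
healthy = synchronous_step_time(times)

times[17] = 5.0  # one bad node running ~5x slower
degraded = synchronous_step_time(times)

print(f"step time, all healthy:  {healthy:.2f}s")
print(f"step time, one bad node: {degraded:.2f}s")  # entire job runs ~5x slower
```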
is this really true? it's not worth ddp
Would be nice to have a column for the number of layers. If you change the hidden dim but keep the number of layers the same, then some of this is less surprising?
Hidden from view in the screenshot. Left as an intentional exercise for the reader
Tbh building a FLOPs calculator is a pretty good homework assignment…
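
A rough sketch of what such a calculator could look like, using the standard per-layer accounting (2 FLOPs per multiply-accumulate, backward ≈ 2x forward). The configs at the bottom are illustrative only, not the ones in the hidden screenshot.

```python
def transformer_flops_per_token(n_layers, d_model, seq_len, d_ff=None, vocab=50257):
    """Approximate forward-pass FLOPs per token for a decoder-only transformer."""
    d_ff = d_ff or 4 * d_model
    per_layer = (
        2 * 4 * d_model * d_model      # Q, K, V, O projections
        + 2 * 2 * seq_len * d_model    # attention scores + weighted sum (the n^2 term, per token)
        + 2 * 2 * d_model * d_ff       # MLP up- and down-projections
    )
    unembed = 2 * d_model * vocab      # final logit matmul
    return n_layers * per_layer + unembed

# Illustrative model shapes (hypothetical):
for name, L, d, n in [("small", 12, 768, 2048), ("large", 96, 12288, 2048)]:
    fwd = transformer_flops_per_token(L, d, n)
    print(f"{name}: ~{fwd/1e9:.1f} GFLOPs/token fwd, ~{3*fwd/1e9:.1f} incl. backward")
```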
Yeah, this makes a ton of sense. Quantized, backward, and forward memory as well.
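
In the same spirit, a crude memory tally could sit next to the FLOPs numbers. Assumptions here (mine, not from the thread): Adam, mixed precision with fp32 master weights, and a very rough per-layer activation estimate that ignores attention maps and checkpointing.

```python
def training_memory_gb(n_params, n_layers, d_model, seq_len, batch, act_bytes=2):
    # fp16 weights + fp16 grads + fp32 Adam m, v + fp32 master weights
    states = n_params * (2 + 2 + 4 + 4 + 4)
    # Rough activation estimate: ~16 (batch, seq, d_model)-sized tensors per layer.
    activations = n_layers * batch * seq_len * d_model * act_bytes * 16
    return states / 1e9, activations / 1e9

states_gb, act_gb = training_memory_gb(1.3e9, 24, 2048, 2048, batch=8)
print(f"weights/optimizer states: ~{states_gb:.0f} GB, activations: ~{act_gb:.0f} GB")
```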
This is pretty interesting. If the original Transformer has almost linear computational cost wrt seq len (at scale), why do so many people spend time trying to make the attention cost O(n) (e.g., LongT5, etc.)?
I suspect bc so many people are still thinking in BERT regime
Time to start pushing up those sequence lengths? :)
It’s still n^2 and one of the most memory-heavy operations. But yeah, O(n) attention doesn’t have the same appeal as it did at BERT scale.
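
A quick back-of-the-envelope check of why the n^2 term barely registers at GPT-3-like widths (numbers illustrative): per layer, the weight matmuls cost roughly 24·n·d^2 FLOPs versus ~4·n^2·d for the attention map, so the quadratic term's share scales like n/(6d).

```python
def quadratic_share(seq_len, d_model):
    linear = 24 * seq_len * d_model**2    # QKVO + MLP weight matmuls (forward)
    quadratic = 4 * seq_len**2 * d_model  # attention scores + weighted sum
    return quadratic / (linear + quadratic)

for d in (768, 12288):                    # BERT-base width vs GPT-3-like width
    for n in (512, 2048, 8192):
        print(f"d={d:>5}, n={n:>5}: n^2 term ≈ {quadratic_share(n, d):.1%} of layer FLOPs")
```

At BERT-base width the attention map is already a noticeable slice at a few thousand tokens, while at GPT-3-like width it stays in the low single digits of compute; the memory of the n^2 attention map is a separate story, which is the point above.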