Sasha Rush · Oct 11, 2022 · 5:10 PM UTC

Teaching Question: What are the technical ideas around Scaling (but not Alignment) that grad-level students should know or study? Is it thinking about systems? Order of magnitude thinking? Abstraction? Experimental design?

Stephen Roller · Oct 12, 2022 · 12:31 AM UTC

Stephen Roller @stephenroller

Replying to @srush_nlp

I find people unfamiliar with scaling are shocked by this:

Stephen Roller · Oct 12, 2022 · 1:45 AM UTC

Stephen Roller @stephenroller

Replying to @stephenroller @srush_nlp

Also all modern scaling strategies are highly synchronous. Which means one bad node can tank the entire system. I would love our future researchers to be thinking about this.
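A toy illustration of the point above, assuming fully synchronous data-parallel training where every step waits on the slowest worker. The function names and timings are hypothetical, and the model ignores communication overlap and any fault-tolerance tricks:

```python
from statistics import median


def sync_step_time(node_times):
    """Step time when every node must finish before the next step starts."""
    return max(node_times)


def slowdown_from_stragglers(node_times):
    """How much slower the synchronous step is than a typical (median) node."""
    return sync_step_time(node_times) / median(node_times)


# Hypothetical cluster: 63 healthy nodes at 1.0 s/step, one degraded node at 3.0 s/step.
times = [1.0] * 63 + [3.0]
print(sync_step_time(times))            # the whole job runs at the slow node's pace
print(slowdown_from_stragglers(times))  # slowdown factor relative to a healthy node
```

Under these assumptions, one degraded node out of 64 makes the whole job run at that node's pace.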

Jules Gagnon-Marchand · Oct 14, 2022 · 6:04 PM UTC

Jules Gagnon-Marchand @julesgm4

is this really true? it's not worth ddp

Andrew Drozdov · Oct 12, 2022 · 2:43 AM UTC

Andrew Drozdov @mrdrozdov

Replying to @stephenroller @srush_nlp

Would be nice to have a column for number of layers. If you change hidden dim, but keep number of layers same, then some of this is less surprising?

Stephen Roller · Oct 12, 2022 · 2:47 AM UTC

Stephen Roller @stephenroller

Hidden from view in screenshot. Left as an intentional exercise to the reader

Stephen Roller · Oct 12, 2022 · 12:59 AM UTC

Stephen Roller @stephenroller

Replying to @stephenroller @srush_nlp

Tbh building a flops calculator is a pretty good homework assignment…

Sasha Rush · Oct 12, 2022 · 1:54 AM UTC

Sasha Rush @srush_nlp

Yeah, this makes a ton of sense. Quantized, backward, and forward memory as well.
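A minimal sketch of what such a FLOPs calculator could look like. This is not the calculator behind the screenshot in the thread. It assumes a standard Transformer block (dense multi-head attention plus a 4x MLP), counts one multiply-add as 2 FLOPs, and ignores layer norms, softmax, biases, embeddings, and any savings from causal masking. All names (`TransformerConfig`, `forward_flops_per_layer`, and so on) are made up for illustration:

```python
from dataclasses import dataclass


@dataclass
class TransformerConfig:
    hidden_dim: int      # d
    num_layers: int
    seq_len: int         # n
    mlp_ratio: int = 4   # MLP inner width = mlp_ratio * d (assumed)


def forward_flops_per_layer(cfg):
    """Forward-pass matmul FLOPs for one layer over one sequence."""
    d, n, r = cfg.hidden_dim, cfg.seq_len, cfg.mlp_ratio
    return {
        "qkv_proj": 6 * n * d * d,     # three d x d projections
        "attn_scores": 2 * n * n * d,  # Q @ K^T, quadratic in n
        "attn_values": 2 * n * n * d,  # softmax(QK^T) @ V, quadratic in n
        "out_proj": 2 * n * d * d,     # attention output projection
        "mlp": 4 * r * n * d * d,      # d -> r*d -> d
    }


def total_forward_flops(cfg):
    """Forward FLOPs for the whole stack, for one sequence."""
    return cfg.num_layers * sum(forward_flops_per_layer(cfg).values())


def quadratic_attention_fraction(cfg):
    """Share of per-layer FLOPs spent in the two n^2 attention matmuls."""
    parts = forward_flops_per_layer(cfg)
    return (parts["attn_scores"] + parts["attn_values"]) / sum(parts.values())


bert_like = TransformerConfig(hidden_dim=768, num_layers=12, seq_len=512)
gpt3_like = TransformerConfig(hidden_dim=12288, num_layers=96, seq_len=2048)
print(quadratic_attention_fraction(bert_like))
print(quadratic_attention_fraction(gpt3_like))
```

With these assumptions the per-layer cost is 24nd² + 4n²d, so the quadratic share simplifies to n / (6d + n): 10% for the BERT-base-like shape and about 2.7% for the GPT-3-175B-like shape. A common rule of thumb, assumed here rather than derived, is that training costs roughly 3x the forward FLOPs once the backward pass is included.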

Rodrigo Nogueira · Oct 13, 2022 · 7:40 PM UTC

Rodrigo Nogueira @rodrigfnogueira

Replying to @stephenroller @srush_nlp

This is pretty interesting. If the original Transformer has an almost linear computational cost wrt seq len (at scale), why do so many people spend time trying to make the attention cost O(n) (e.g., LongT5, etc)?

Stephen Roller · Oct 13, 2022 · 7:47 PM UTC

Stephen Roller @stephenroller

I suspect bc so many people are still thinking in BERT regime

Ross Wightman · Oct 12, 2022 · 12:55 AM UTC

Ross Wightman @wightmanr

Replying to @stephenroller @srush_nlp

Time to start pushing up those sequence lengths? :)

Stephen Roller · Oct 12, 2022 · 12:57 AM UTC

Stephen Roller @stephenroller

It’s still n^2 and one of the most memory heavy operations. But yeah, O(n) attention doesn’t have the same appeal as it did at BERT scale.
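To make the memory side of the n^2 point concrete, here is a rough sketch that assumes a naive attention implementation materializing the full n x n score matrix for every head. Fused or tiled kernels can avoid storing it, so treat this as an upper-bound-style estimate. The function name and the example shapes are hypothetical:

```python
def attn_score_bytes(batch, num_heads, seq_len, bytes_per_element=2):
    """Bytes needed to hold one layer's attention score matrix (naive attention)."""
    return batch * num_heads * seq_len * seq_len * bytes_per_element


# Hypothetical shapes: batch 1, 16 heads, fp16 (2 bytes per element).
for n in (2048, 4096, 8192):
    mib = attn_score_bytes(1, 16, n) / 2**20
    print(f"seq_len={n}: {mib:.0f} MiB per layer")
```

Each doubling of the sequence length quadruples this buffer, whereas activations that are linear in n only double.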
