Teaching Question: Which technical ideas around Scaling (but not Alignment) should grad-level students know or study? Is it thinking about systems? Order-of-magnitude thinking? Abstraction? Experimental design?
Also, all modern scaling strategies are highly synchronous, which means one bad node can tank the entire system. I would love for our future researchers to be thinking about this.
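To make the "one bad node" point concrete, here is a minimal sketch in Python. It assumes a toy model that is not from the discussion above: every worker must finish before a synchronous step completes, and node failures are independent. The function names and the numbers (1024 workers, a 3x straggler, a 0.1% per-node failure rate) are purely illustrative.

```python
def sync_step_time(worker_times):
    """A synchronous step finishes only when the slowest worker finishes."""
    return max(worker_times)


def prob_any_failure(num_nodes, per_node_failure_prob):
    """Chance that at least one node fails, assuming independent failures."""
    return 1.0 - (1.0 - per_node_failure_prob) ** num_nodes


# Hypothetical cluster: 1024 workers that each take 1.0 s per step.
healthy = [1.0] * 1024
# Same cluster, but a single worker is 3x slower (assumed straggler).
one_straggler = [1.0] * 1023 + [3.0]

# The whole step slows down by the straggler's factor, not by 1/1024 of it.
slowdown = sync_step_time(one_straggler) / sync_step_time(healthy)

# Even a small per-node failure rate compounds quickly with node count.
p_fail_1000 = prob_any_failure(1000, 0.001)

print(f"slowdown from one straggler: {slowdown:.1f}x")
print(f"P(at least one of 1000 nodes fails): {p_fail_1000:.2f}")
```

Under these assumptions, a single slow worker sets the pace for everyone, and the chance that some node misbehaves grows with cluster size even when each node is individually reliable.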
This is pretty interesting. If the original Transformer has almost linear computational cost with respect to sequence length (at scale), why do so many people spend time trying to make the attention cost O(n) (e.g., LongT5)?
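A rough back-of-the-envelope sketch of the "almost linear at scale" claim, in Python. It assumes a standard Transformer layer with an MLP hidden width of 4 * d_model, counts only matrix-multiply FLOPs in the forward pass (one multiply-add = 2 FLOPs), and ignores softmax, layer norm, and embeddings. The function names and the example sizes are illustrative, not taken from any specific model.

```python
def layer_flops(seq_len, d_model):
    """Return (linear_in_n, quadratic_in_n) forward FLOPs for one layer."""
    # Q, K, V, and output projections: 4 matmuls of (n x d) by (d x d).
    proj = 8 * seq_len * d_model**2
    # MLP: two matmuls with hidden width 4 * d_model (assumed).
    mlp = 16 * seq_len * d_model**2
    # Attention scores (Q K^T) plus the attention-weighted sum over V.
    attn = 4 * seq_len**2 * d_model
    return proj + mlp, attn


def attention_fraction(seq_len, d_model):
    """Share of per-layer FLOPs spent in the n^2 attention term."""
    linear, quadratic = layer_flops(seq_len, d_model)
    return quadratic / (linear + quadratic)


# Wide model, moderate context: the quadratic term is a small share.
wide_short = attention_fraction(seq_len=2048, d_model=12288)
# Narrow model, long context: the quadratic term dominates.
narrow_long = attention_fraction(seq_len=16384, d_model=512)

print(f"n=2048,  d=12288: {wide_short:.1%} of FLOPs are quadratic in n")
print(f"n=16384, d=512:   {narrow_long:.1%} of FLOPs are quadratic in n")
```

Under these assumptions the quadratic share simplifies to n / (n + 6d), so it only reaches half of the compute once the sequence length is about six times the model width. That suggests one possible answer to the question: the O(n) attention work matters most for small or medium models at long context, and this FLOP count leaves out the memory cost of the attention matrix entirely.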