“Ethan Caballero on Private Scaling Progress”, Ethan Caballero, Michaël Trazzi 2022-05-05:

  1. [cf. Broken Neural Scaling Laws] Alignment as an Inverse Scaling Problem:

    Ethan Caballero: All alignment is inverse scaling problems. It’s all downstream inverse scaling problems. All of alignment is stuff that doesn’t improve monotonically as compute, data, and parameters increase […] because sometimes there are certain things where it improves for a while, but then at a certain point, it gets worse. Interpretability and controllability are the two thought-experiment examples: you could imagine models getting more interpretable and more controllable for a long time, until they get superintelligent. At that point, they’re less interpretable and less controllable.

    …Then the hard problem, though, is measurement: finding out what the downstream evaluations are. Say you’ve got some fancy deceptive AI that wants to do a treacherous turn or whatever; how do you even find the downstream evaluations to know whether it’s gonna try to deceive you? Because when I say it’s all a downstream scaling problem, that assumes you have the downstream test, the downstream thing that you’re evaluating it on. But if it’s some weird deceptive thing, it’s hard to even find the downstream thing to evaluate it on to know whether it’s trying to deceive you.
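    [To make the “improves for a while, then gets worse” shape concrete, here is a minimal Python sketch of the broken neural scaling law functional form from Caballero et al 2022, with toy parameters (not fitted to any real benchmark) chosen so that test error first falls with scale and then rises past a break:]

    ```python
    import numpy as np

    def bnsl(x, a, b, c0, breaks):
        """Broken neural scaling law (Caballero et al 2022):
        y = a + b*x**(-c0) * prod_i (1 + (x/d_i)**(1/f_i))**(-c_i*f_i),
        where each (c_i, d_i, f_i) triple is one 'break' in the power law."""
        scale = b * x ** (-c0)
        for c_i, d_i, f_i in breaks:
            scale = scale * (1.0 + (x / d_i) ** (1.0 / f_i)) ** (-c_i * f_i)
        return a + scale

    x = np.logspace(0, 8, 9)  # toy scale axis (compute/data/parameters; arbitrary units)
    # One break at d=1e3 with c1 = -1.0: past the break, the effective exponent
    # flips sign, so error starts *increasing* with scale; that is the
    # non-monotonic, inverse-scaling shape Caballero describes.
    err = bnsl(x, a=0.0, b=1.0, c0=0.5, breaks=[(-1.0, 1e3, 0.5)])
    print(np.round(err, 3))  # falls until ~1e3, then rises again
    ```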

  2. On Private Research at Google, DeepMind:

    Caballero: I know a bunch of people at Google who said, yeah, we have language models that are way bigger than GPT-3, but we just don’t put them in papers…The DeepMind language model papers, Gopher and Chinchilla, were a year old when they finally put them out on arXiv. They had finished training the language model a year before the paper came out.

  3. On Thinking about the Fastest Path:

    Caballero: You have to be thinking in terms of the fastest path, because there are extremely huge economic and military incentives selecting for the fastest path, whether you want it to be that way or not. So you’ve got to be thinking in terms of: what is the fastest path, and then how do you minimize the alignment tax on that fastest path? Because the fastest path is the way it’s probably gonna happen no matter what.

    …The person who wins AGI is whoever has the best funding model for supercomputers. You have to assume all entities have the nerve of ‘we’re gonna do the biggest training run ever’; but given that’s your pre-filter, then it’s just whoever has the best funding model for supercomputers.

  4. On the Funding of Large Language Models:

    • Caballero: A zillion Googlers have left Google to start large language model startups. There are literally 3 large language model startups by ex-Googlers now [Adept.ai, Character.ai, and Inflection]. OpenAI is a small actor in this now, because there are multiple large language model startups founded by ex-Googlers, all founded in the last 6 months. There are a zillion VCs throwing money at large language model startups right now. The funniest thing is Leo Gao; he’s like: ‘we need more large language model startups, because the more startups we have, the more it splits up all the funding, so no organization can have all the funding to get the really big supercomputer’ […] They were founded by famous people: one by the founder of the DeepMind scaling team, another by the inventor of the Transformer, another by a different person on the Transformer paper. In some ways, they have more clout than OpenAI had.

    • Caballero: …Most entities won’t be willing to do the largest training run they can, given their funding.

    • Michaël Trazzi: So maybe China, but I see Google as being more helpful because they publish it in papers, but maybe I’m wrong.

    • Caballero: Jared Kaplan says Anthropic and OpenAI are kind of unique in that they’re like, “okay, we’re gonna throw all our funding into this one big training run.” But Google and Amazon, he said, have at least 10× or 100× the compute that OpenAI and Anthropic have, but they never use all the compute for single training runs. They just have all these different teams that use the compute for all these different things.

    • Trazzi: Yeah, so they have a different hypothesis. OpenAI’s is that scale is all that matters, that scale is itself the secret, and you just scale things up and you’re going to get better results; whereas at Google maybe there’s more bureaucracy and it’s maybe harder to get a massive budget.

  5. Scaling Exponent for Different Modalities:

    • Trazzi: What do you think is the exponent for video? Would it be much worse?

    • Caballero: I know the model size relation; the model size relation was the big point of the scaling laws. For autoregressive generative models, the paper [“Broken Neural Scaling Laws”, Caballero et al 2022; cf. Henighan et al 2020] says that the rate at which the model size grows, as your compute budget grows, is the same for every modality. That’s a big unexplained thing; that was the biggest part of that paper, and no one’s been able to explain why that is yet.

    • Trazzi: So there might be some universal law for how scaling goes across all modalities, and nobody knows why.

    • Caballero: The rate at which your model size grows, as your compute budget increases, is the same for every modality, which is kind of weird, and I haven’t really heard a good explanation of why. [A toy numeric illustration of this compute-optimal sizing relation follows this list.]

      …In my mind, the video scaling was a lot worse than text, basically. That’s the main reason why I think AGI will probably take longer than the 5 years or whatever.
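
    [For concreteness: Henighan et al 2020 fit compute-optimal model size as a power law, roughly N_opt ∝ C^0.7, with approximately the same exponent for every autoregressive modality they measured. Below is a minimal Python sketch under that assumed form; the prefactor is modality-specific and left out, since it cancels in the ratio, and only the shared exponent is the point:]

    ```python
    # Compute-optimal model size under an assumed power-law fit
    # N_opt = k * C**beta, with beta ~ 0.7 (roughly what Henighan et al
    # 2020 report); k is a modality-dependent constant left unspecified.
    BETA = 0.7

    def optimal_size_ratio(compute_ratio: float, beta: float = BETA) -> float:
        """How much bigger the compute-optimal model gets when the compute
        budget grows by `compute_ratio`; the prefactor k cancels out."""
        return compute_ratio ** beta

    # A 100x larger compute budget implies a ~25x larger model; per the
    # unexplained observation above, it is the same ~25x whether the
    # modality is text, images, video, or math.
    print(optimal_size_ratio(100.0))  # ~25.1
    ```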