“What Do We Mean by ‘Diminishing Returns’ in Scaling?”, 2024-11-14:
[Scott Sumner:] …In the past few years, I’ve had a number of interesting conversations with younger people who are involved in the field of artificial intelligence. These people know much more about AI than I do, so I would encourage readers to take the following with more than a grain of salt. During the discussions, I sometimes expressed skepticism about the future pace of improvement in large language models such as ChatGPT. My argument was that there were some pretty severe diminishing returns to exposing LLMs to additional data sets. Think about a person who read and understood 10 well-selected books on economics, perhaps a macro and micro principles text, as well as some intermediate and advanced textbooks. If you fully absorbed this material, you would actually know quite a bit of economics. Now have them read 100 more well-chosen textbooks. How much more economics would they actually know? Surely not 10× as much. Indeed I doubt they would even know twice as much economics. I suspect the same could be said for other fields like biochemistry or accounting.
…Rather, my point is that the advancement to some sort of super general intelligence may happen more slowly than some of its proponents expect. Why might I be wrong?
[previously; comments: 1, 2] The key point here is that the ‘severe diminishing returns’ were well-known and had been quantified extensively, and the power-laws were what were being used to forecast and design the LLMs. So when you told anyone in AI “well, the data must have diminishing returns”, this was definitely true—but you weren’t telling anyone anything they shouldn’t’ve already known in detail. The returns have always diminished, right from the start. There has never been a time in AI where the returns did not diminish. (And in computing in general: “We sent men to the moon with less total compute than we waste to animate your browser tab’s favicon now!” Nevertheless, computers are way more important to the world now than they were back then. The returns diminished, but Moore’s law kept lawing.)
The all-important questions are exactly how much the returns diminish and why, what the other scaling laws are (eg. any specific diminishing returns in data would diminish more slowly if you were able to use more compute to extract more knowledge from each datapoint), how they inter-relate, and what the consequences are.
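To put a concrete shape on “how much”, here is a minimal sketch using the parametric loss curve fitted in the Chinchilla paper (Hoffmann et al. 2022), L(N, D) = E + A/N^α + B/D^β; the constants below are that paper’s published point estimates, used purely to illustrate the shape of the diminishing returns and the compute/data trade-off, not to predict any particular model:

```python
# Chinchilla-style parametric scaling law (Hoffmann et al. 2022):
#   L(N, D) = E + A / N^alpha + B / D^beta
# where N = parameters and D = training tokens. The constants are the
# paper's fitted values, used only to show the *shape* of the curve.

E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def loss(N: float, D: float) -> float:
    """Predicted training loss for N parameters and D tokens."""
    return E + A / N**alpha + B / D**beta

N = 70e9  # hold the model fixed at 70B parameters
for D in [0.1e12, 1e12, 10e12]:  # 0.1T, 1T, 10T tokens
    print(f"D = {D/1e12:>4.1f}T tokens -> loss {loss(N, D):.3f}")

# Each 10x of data shaves off a smaller absolute amount of loss
# (the returns diminish), but the gains never hit zero; and a larger N
# (more compute spent per token) lowers the whole curve, which is the
# sense in which the scaling laws "inter-relate".
print(f"Same 1T tokens, 7B vs 70B params: "
      f"{loss(7e9, 1e12):.3f} vs {loss(70e9, 1e12):.3f}")
```

The data term keeps shrinking forever, just ever more slowly, and spending more compute per datapoint shifts the whole curve down: that is exactly the kind of inter-relation the questions above are about.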
The importance of the current rash of rumors about Claude/Gemini/GPT-5 is that they seem to suggest that something has gone wrong above and beyond the predicted power law diminishing returns of data.
The rumors are vague enough, however, that it’s unclear where exactly things went wrong. Did the LLMs explode during training? Did they train normally, but just not learn as well as they were supposed to, winding up not predicting text that much better, and did that happen at some specific point in training? Did they just not train enough, because the datacenter constraints appear to have blocked any of the real scaleups we have been waiting for, like systems trained with 100×+ the compute of GPT-4? (That is the sort of leap which takes you from GPT-2 to GPT-3, and from GPT-3 to GPT-4. It’s unclear how much “GPT-5” is over GPT-4; if it was only 10×, say, then we would not be surprised if the gains are relatively subtle and potentially disappointing.) Are they predicting raw text as well as they were supposed to, but the more relevant benchmarks like GPQA are stagnant, and they just don’t seem to act more intelligently on specific tasks, the way past models were clearly more intelligent in close proportion to how well they predicted raw text? Are the benchmarks better, but the end-users are shrugging their shoulders and complaining that the new models don’t seem any more useful? Right now, seen through a glass darkly of journalists paraphrasing second-hand simplifications, it’s hard to tell.
Each of these has totally different potential causes, meanings, and implications for the future of AI. Some are bad if you are hoping for continued rapid capability gains; others are not so bad.
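To put rough numbers on the 10× vs. 100×+ point: here is a back-of-envelope sketch assuming a Chinchilla-style compute-optimal recipe (parameters and tokens each scaling roughly as the square root of compute, with the usual C ≈ 6·N·D approximation) and reusing the same published Chinchilla constants purely for illustration; the reference budget is a placeholder, since GPT-4’s actual training compute is not public.

```python
# Back-of-envelope: how much does a 10x vs. 100x compute scaleup buy
# under a Chinchilla-style compute-optimal recipe? Assumes C ~ 6*N*D
# and that N and D each scale as sqrt(C); constants are the fitted
# values from Hoffmann et al. 2022, used only for illustration.

E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def optimal_loss(C: float) -> float:
    """Loss at compute budget C (FLOPs), splitting C evenly (in the
    exponent) between parameters N and training tokens D."""
    N = (C / 6) ** 0.5
    D = (C / 6) ** 0.5
    return E + A / N**alpha + B / D**beta

C0 = 2e25  # placeholder reference budget; GPT-4's real compute is not public
for k in [1, 10, 100]:
    L = optimal_loss(k * C0)
    print(f"{k:>3}x compute -> loss {L:.3f} (reducible part {L - E:.3f})")

# The reducible loss falls roughly as C^-0.15, so a 10x scaleup removes
# only about half as much (on a log scale) as a 100x one: small enough
# that downstream gains could easily look "subtle".
```

On that curve, a merely-10× scaleup buys only around half the log-loss improvement of a 100× one, which is consistent with a 10× “GPT-5” looking underwhelming even if nothing at all went wrong during training.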
Thanks for that very helpful comment. It seems my skepticism about the pace of improvement may have been correct, but perhaps for the wrong reason.
But I do recall one or two people I spoke with claiming that more data alone would produce big gains. So my sense is I was more pessimistic than some even on the specific topic of diminishing returns.
My guess is that when they said more data would produce big gains, they were referring to the Chinchilla scaling-law breakthrough. They were right, but there may have been some miscommunication there.
First, more data produced big gains in the sense that cheap small models suddenly got way better than anyone was expecting in 2020 by simply training them on a lot more data, and this is part of why ChatGPT-3 is now free and a Claude-3 or GPT-4 can cost like $10/month for unlimited use and you have giant context windows and can upload documents and whatnot. That’s important. In a Kaplan-scaling scenario, all the models would be far larger and thus more expensive, and you’d see much less deployment or ordinary people using them now. (I don’t know exactly how much but I think the difference would often be substantial, like 10×. The small model revolution is a big part of why token prices can drop >99% in such a short period of time.)
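As a rough illustration of the “far larger and thus more expensive” point: under the older Kaplan et al. 2020 fit the compute-optimal parameter count grows roughly as C^0.73, versus roughly C^0.5 under Chinchilla, so the same training budget buys a much fatter, costlier-to-serve model. The exponents below are the approximate published values; the sizes are normalized to an arbitrary shared reference budget and are not real model sizes.

```python
# Relative model size under Kaplan-optimal vs. Chinchilla-optimal
# allocation as the training compute budget grows. Exponents are the
# rough published fits (N* ~ C^0.73 for Kaplan et al. 2020,
# N* ~ C^0.50 for Hoffmann et al. 2022); sizes are normalized so both
# recipes agree at the reference budget -- illustrative only.

KAPLAN_EXP, CHINCHILLA_EXP = 0.73, 0.50

def relative_size(compute_multiple: float, exponent: float) -> float:
    """Model size relative to the reference budget's model."""
    return compute_multiple ** exponent

for mult in [10, 100, 1000]:
    kaplan = relative_size(mult, KAPLAN_EXP)
    chinchilla = relative_size(mult, CHINCHILLA_EXP)
    print(f"{mult:>5}x compute: Kaplan model {kaplan:7.1f}x bigger, "
          f"Chinchilla model {chinchilla:6.1f}x bigger "
          f"-> ~{kaplan / chinchilla:.1f}x more expensive to serve")
```

Since the cost of generating each token scales roughly with parameter count, that last ratio is, very loosely, the inference-price gap the Chinchilla recipe avoided.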
Secondly, you might have heard one thing when they said ‘more data’ while they were thinking something entirely different: you might reasonably have assumed that ‘more data’ had to mean something small, whereas what they meant, because it was just obvious to them in a scaling context, was not 10% or 50% more data but more like 1,000% more. The datasets being used for things like GPT-3 were still very small compared to the datasets possible, contrary to the casual summary of “training on all of the Internet” (which gives a good idea of the breadth and diversity, but is not even close to being quantitatively true). Increasing them 10× or 100× was feasible, and that would lead to a lot more knowledge.
It was popular in 2020–2022 to claim that all the text had already been used up, that scaling had hit a wall, and that such dataset increases were impossible, but it was just not true if you thought about it. I did not care to argue about it with proponents because it didn’t matter and there was already too much appetite for capabilities rather than safety, but I thought it was very obviously wrong if you weren’t motivated to find a reason scaling had already failed. For example, a lot of people seemed to think that Common Crawl contains ‘the whole Internet’, but it doesn’t: it doesn’t even contain basic parts of the Western Internet like Twitter, which is completely excluded from Common Crawl. Or you could look at the book counts: the papers report training LLMs on a few million books, which might seem like a lot, but Google Books has closer to a few hundred million books-worth of text, and a few million more books get published each year on top of that. And then you have all of the newspaper archives going back centuries, and institutions like the BBC, whose data is locked up tight; but if you have billions of dollars, you can negotiate some licensing deals. Then you have millions of users each day providing unknown amounts of data. Then, if you have a billion dollars in cash, you can hire hard-up grad students or postdocs at $20/hour to write a thousand high-quality words apiece, and that goes a long way (see the back-of-envelope sketch below). And if your models get smart enough, you can start using them in various ways to curate or generate data. And if you have more raw data, you can filter it more heavily for quality/uniqueness so you get more bang per token. And so on and so forth.
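Here is the kind of back-of-envelope arithmetic I mean; every per-book and per-word constant below is a loose assumption for illustration, not a figure from anywhere in particular:

```python
# Loose back-of-envelope on two of the data sources mentioned above.
# Every constant here is an assumption for illustration (average book
# length, words-per-token ratio, writer throughput), not a measured fact.

WORDS_PER_BOOK = 70_000   # assumed average book length
TOKENS_PER_WORD = 1.3     # rough English tokenizer ratio

# Books: "a few million" books trained on vs. Google Books' holdings.
trained_books = 5e6       # "a few million"
google_books = 3e8        # "closer to a few hundred million"
print(f"Books trained on : {trained_books * WORDS_PER_BOOK * TOKENS_PER_WORD / 1e12:.2f}T tokens")
print(f"Google Books-ish : {google_books * WORDS_PER_BOOK * TOKENS_PER_WORD / 1e12:.2f}T tokens")

# Commissioned text: $1B at roughly $20 per thousand high-quality words
# (i.e., about an hour of a hard-up grad student's time per thousand words).
budget = 1e9
dollars_per_word = 20 / 1000
words_bought = budget / dollars_per_word
print(f"Commissioned text: {words_bought * TOKENS_PER_WORD / 1e9:.0f}B tokens for ${budget:,.0f}")
```

On those made-up numbers, the untouched book supply alone is tens of trillions of tokens, and a billion dollars of commissioned prose is a nontrivial fraction of an early-2020s pretraining corpus.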
There was a lot of stuff you could do if you wanted it badly enough. If there was demand for the data, supply would be found for it. Back then, LLM creators didn’t invest much in creating data because it was so easy to just grab Common Crawl etc. If we ranked them on a scale of research diligence from “student making stuff up in class based on something they heard once” to “hedge fund flying spy planes and buying cellphone tracking and satellite surveillance data and hiring researchers to digitize old commodity-market archives”, they were at the “read one Wikipedia article and looked at a reference or two” level. These days, they’ve leveled up their data game a lot and can train on far more data than they did back then.