“Context on the NVIDIA ChatGPT Opportunity—And Ramifications of Large Language Model Enthusiasm”, Morgan Stanley, 2023-02-10:

…We think that GPT-5 is currently being trained on 25k GPUs—$225m or so of NVIDIA hardware—and the inference costs are likely much lower than some numbers we have seen. Further, reducing inference costs will be critical in resolving the “cost of search” debate from cloud titans…we have talked to several industry participants about these workloads and do think we have some important context.

…(1) How much does ChatGPT drive incremental training demand?

It doesn’t—the training opportunity for ChatGPT is large, but the hardware used to train the next-generation model sits in a facility that was equipped over the last two years; growing model complexity will drive higher investment over time. GPT-3, a version of the model that ChatGPT is based on, was trained years ago. GPT-4 training was also completed some time ago. The current version of the model, GPT-5, will be trained in the same facility—announced in 2020, the supercomputer designed specifically for OpenAI has 285k CPU cores, 10k GPU cards, and 400 Gb/s connectivity for each GPU server; our understanding is that there has been substantial expansion since then. From our conversations, GPT-5 is being trained on about 25k GPUs, mostly A100s, and training takes multiple months; that’s about $225m of NVIDIA hardware, but importantly this is not their only use, and many of the same GPUs were used to train GPT-3 and GPT-4.
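The fleet-cost figure can be sanity-checked with back-of-envelope arithmetic; the per-A100 unit price below is an assumption chosen to be consistent with the note's $225m total, not a figure from the note:

```python
# Back-of-envelope check of the note's training-fleet cost.
# ASSUMPTION: ~$9,000 per A100 is picked to reproduce the $225m
# total; actual A100 pricing varies widely by SKU and volume.
NUM_GPUS = 25_000
PRICE_PER_A100_USD = 9_000

fleet_cost_usd = NUM_GPUS * PRICE_PER_A100_USD
print(f"${fleet_cost_usd / 1e6:.0f}m")  # → $225m
```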

…We also would expect the number of large language models under development to remain relatively small. If the training hardware for GPT-5 is $225m worth of NVIDIA hardware, that’s close to $1b of overall hardware investment; that isn’t something that will be undertaken lightly. We see large language models at a similar scale being developed at every hyperscaler, and at multiple startups.
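The gap between the two figures implies a multiplier of total investment over GPU spend alone; the ratio below is derived from the note's numbers, not stated in it:

```python
# Implied ratio of total hardware investment to GPU spend, using
# only the note's figures ($225m of GPUs, ~$1b overall). The ~4.4x
# multiplier (servers, networking, storage, facility) is derived
# here, not a number the note gives directly.
gpu_cost_usd = 225e6
total_investment_usd = 1e9

multiplier = total_investment_usd / gpu_cost_usd
print(f"{multiplier:.1f}x")  # → 4.4x
```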

…But the major scaling factor will be the complexity of the models. There is ultimately a diminishing return to these investments, and we are hearing comments such as models that are 15% more intelligent are 10× as complex. [This just sounds like repeating the scaling laws on perplexity, which is not that useful a comment because no one can predict intelligence/emergence from perplexity…]
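One way to read the "15% more intelligent at 10× the complexity" comment is as a power law, quality ∝ complexity^α; the sketch below solves for the implied exponent under that assumption, which is illustrative and not something the note specifies:

```python
import math

# Illustrative reading of "15% better at 10x the complexity" as a
# power law: quality = complexity ** alpha. Solving 1.15 = 10 ** alpha
# gives the implied exponent.
# ASSUMPTION: the power-law form itself is ours, not the note's.
alpha = math.log(1.15) / math.log(10)
print(f"alpha ≈ {alpha:.3f}")       # → alpha ≈ 0.061

# Under that exponent, doubling complexity buys only ~4% improvement:
gain_from_doubling = 2 ** alpha - 1
print(f"{gain_from_doubling:.1%}")  # → 4.3%
```

A tiny exponent like this is exactly the diminishing-returns pattern the note describes.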

…The A100 is a chip designed for training, and while it can handle inference workloads, it is a very cost-inefficient approach; we note that Azure, in its cloud offerings, suggests that inference workloads run on the NVIDIA T4 rather than the A100; the T4's hardware cost is 80% lower than the A100's, with better power efficiency.

…We have talked with multiple industry contacts who have said that there are a variety of inference implementations in different regions, with some CPU, some GPU, and some specialty silicon.

We found a recent quote from the CEO of Character.ai [Noam Shazeer] quite interesting on this topic, from an interview in The Information: training is “completely different from inference costs, of course. You want inference to be cheap. And fundamentally, I think it can be cheap. Theoretically, if you can serve something on the GPT-3 scale model efficiently, you could produce like a million words per dollar. You just need to know how to trim the fat and do things…efficiently and you can serve the world.” That’s a tiny fraction of the types of numbers we are hearing today for cost/word. This points to a very substantial semiconductor opportunity…
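Taken literally, a million words per dollar pins down the per-word cost; the restatement in per-1k-token terms below uses a rough 0.75-words-per-token rule of thumb, which is our assumption, not from the note:

```python
# Shazeer's claim taken literally: one million words served per dollar.
WORDS_PER_DOLLAR = 1_000_000
cost_per_word_usd = 1 / WORDS_PER_DOLLAR

# ASSUMPTION: ~0.75 words per token (a common English-text rule of
# thumb) to restate the figure in per-1k-token terms.
WORDS_PER_TOKEN = 0.75
cost_per_1k_tokens_usd = cost_per_word_usd * WORDS_PER_TOKEN * 1_000
print(f"${cost_per_word_usd:.6f}/word, ${cost_per_1k_tokens_usd:.5f}/1k tokens")
# → $0.000001/word, $0.00075/1k tokens
```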

NVIDIA’s data center business was about $15b in revenue this year. We think that >65% of that—$10b—came from various deep learning businesses, with the balance coming from Mellanox (networking), academic supercomputers, and other hardware.

Within that mix, we would estimate that 90% of that AI revenue—$9b—comes from various forms of training, and about $1b from inference. On the training side, some of that is in card form, and some of that—the smaller portion—is DGX servers, which monetize at 10× the revenue level of the card business. A variety of workloads are trained, and large language models are only one of them, but in our view likely the largest portion.
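The stated percentages can be turned back into the dollar split directly; all figures are approximate, and the note rounds ">65% of $15b" (≈$9.75b) up to $10b:

```python
# Reconstructing the note's revenue split (figures approximate).
datacenter_rev = 15e9        # NVIDIA data center revenue, per the note
deep_learning_rev = 10e9     # ">65%" of datacenter, rounded up in the note

training_rev = 0.90 * deep_learning_rev           # ~$9b from training
inference_rev = deep_learning_rev - training_rev  # ~$1b from inference
print(f"training: ${training_rev/1e9:.0f}b, inference: ${inference_rev/1e9:.0f}b")
# → training: $9b, inference: $1b
```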