all 9 comments

[–]StartledWatermelon 16 points

Andrej has generously put the value of his time working on this at 0 dollars per hour. But I doubt I can hire him at this rate, even if I asked super nicely.

Training GPT-2 (1.5B) on 10B tokens in 2019 cost $50,000. I think it is pretty evident that the so-called "soft costs" (the talent cost of developing the model) were at least an order of magnitude higher. And, unfortunately, we haven't seen a comparable cost reduction in this area over the past 5 years.

Another important thing to consider is that Andrej has reproduced the model, not the research effort that was needed to bring this model to the frontier of knowledge, which involves a lot of exploration and a lot of experiments. For instance, I'm not certain the community knew the optimal learning rates and batch sizes for training language models on large-scale corpora back then.

Anyway, the pace of progress in ML is such that a frontier model in 2019 is a toy problem in 2024 (or at least a toy problem for a brilliant researcher with few resources). Hope we'll keep up the pace. GPT-4o for twenty bucks in 2029 doesn't sound bad.

[–]ResidentPositive4122 6 points

> GPT-4o for twenty bucks in 2029 doesn't sound bad.

Ha, exactly! And it might be even closer than that: I saw a post today about an L3-8b + visual model trained for ~$500, claiming pretty good results against the other VLMs out there.

[–]gwern (gwern.net) 2 points

> I saw a post today about L3-8b + visual model for ~500bucks, claiming pretty good results over the other VLMs out there.

I believe that one turned out to be fraudulent: they plagiarized MiniCPM (and the author blamed by the co-authors turns out to have a history)?

[–]furrypony2718[S] 4 points

This is not to demonstrate the cost of technology as it is first developed, but the *eventual* cost. It's the learning curve for technology.

https://en.wikipedia.org/wiki/File:Learning_curve_example_from_WWII_production_in_the_US_airframe_industry.jpg

[–]az226 2 points

And this is even GPT-2.

We have made about 400-1000x improvement in training efficiency since what was known/done for GPT-3.

I’m experimenting with some infrastructure and think the training cost could go down another 15x. So the training of GPT-2 1.5B could be done for $150 in 15 hours.
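Back-of-envelope on those numbers (taking the claimed 15x at face value; the "implied current cost" below is just arithmetic on the comment's own figures, not a measured number):

```python
# Figures from the comment above: a further 15x cost reduction that
# lands at $150 implies a current cost of 150 * 15.
further_speedup = 15
projected_cost_usd = 150
projected_hours = 15

implied_current_cost = projected_cost_usd * further_speedup  # $2,250
implied_hourly_rate = projected_cost_usd / projected_hours   # $10/hour of compute
print(implied_current_cost, implied_hourly_rate)
```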

[–]TenshiS 1 point

Not to mention it's easy to train small models using instruction input from the big models. RLHF for frontier models required armies of people giving feedback.

[–]damhack 0 points

Sure, if you want a model that you can’t legally distribute for commercial use and are happy with a higher incidence of mode collapse.

[–]KallistiTMP 1 point

Yeah, also any model that is small enough to be trained on a single host is going to be absurdly faster and easier to train. It's not a linear equation. Once you go past a certain point, the GPUs aren't even the bottleneck anymore; the bottleneck becomes the inter-node communication.
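To make the communication bottleneck concrete, here's a back-of-envelope sketch. The ring all-reduce cost formula is standard; the model size, gradient precision, and GPU count below are illustrative assumptions, not figures from any particular setup:

```python
def ring_allreduce_bytes_per_gpu(num_params, bytes_per_param, num_gpus):
    """Bytes each GPU must send (and receive) to all-reduce one gradient
    buffer with a ring all-reduce: 2 * (N - 1) / N times the buffer size."""
    buffer_bytes = num_params * bytes_per_param
    return 2 * (num_gpus - 1) / num_gpus * buffer_bytes

# Illustrative: a 1.5B-parameter model with fp16 gradients across 64 GPUs.
per_step = ring_allreduce_bytes_per_gpu(1.5e9, 2, 64)
print(per_step / 1e9)  # ~5.9 GB of gradient traffic per GPU per step
```

On a single host that traffic rides NVLink; split across nodes it has to cross the network every optimizer step, which is why scaling past one host changes the whole problem.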

That's not even getting into the automation required to actually keep the damn thing running. A100 and H100 GPUs are notoriously prone to hardware failures. And at hero-job scale, manual intervention is not feasible: you have to have automated remediation and frequent checkpointing to minimize the impact whenever a GPU or any other hardware component fails. And that's assuming it fails loudly; a bad GPU that fails silently can silently corrupt your training run results. So now you need a burn-in process and comprehensive validation testing. Also storage. Also all the bottlenecks you run into trying to simultaneously spin up thousands of workers for anything. Also the default slow progressive rollout strategies used by cloud providers, designed to minimize disruption to generic web apps, are murder for large training clusters. Etc, etc, etc.
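The checkpoint-and-restart pattern described above can be sketched in plain Python. This is purely illustrative (a counter stands in for model state, a random raise stands in for a GPU fault, and real remediation involves swapping hardware, not just retrying), but the shape is the same: resume from the last checkpoint, write checkpoints atomically, and bound the work lost per failure to one checkpoint interval:

```python
import os
import pickle
import random
import tempfile

def run_job(state_path, total_steps, ckpt_every, fail_prob, rng):
    """One attempt at the job: resume from the last checkpoint if it
    exists, checkpoint every `ckpt_every` steps, raise on a fault."""
    step = 0
    if os.path.exists(state_path):
        with open(state_path, "rb") as f:
            step = pickle.load(f)
    while step < total_steps:
        if rng.random() < fail_prob:
            raise RuntimeError("simulated GPU failure")
        step += 1  # stand-in for one real training step
        if step % ckpt_every == 0 or step == total_steps:
            # Write to a temp file and rename, so a crash mid-write
            # can never leave a corrupt checkpoint behind.
            tmp = state_path + ".tmp"
            with open(tmp, "wb") as f:
                pickle.dump(step, f)
            os.replace(tmp, state_path)
    return step

def train_with_remediation(total_steps=1000, ckpt_every=50,
                           fail_prob=0.01, seed=0):
    """Automated remediation loop: on failure, restart the job; at most
    `ckpt_every` steps of progress are lost per failure."""
    rng = random.Random(seed)
    restarts = 0
    with tempfile.TemporaryDirectory() as d:
        state_path = os.path.join(d, "ckpt.pkl")
        while True:
            try:
                done = run_job(state_path, total_steps, ckpt_every,
                               fail_prob, rng)
                return done, restarts
            except RuntimeError:
                restarts += 1

steps_done, restarts = train_with_remediation()
print(steps_done, restarts)
```

The atomic rename is the detail that matters most here: without it, a failure during checkpointing corrupts the very file you need to recover.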

Hero job training clusters are a whoooole different ballgame.