Addendum: Evaluation of My Model

As a mercifully short addendum, I’d like to quickly address a few questions about my model. Please read my update post for my updated beliefs on this situation, because I believe the details of how powerful my model is (or isn’t) are not actually very important to the overall picture.

As described in my technical post, my model is not identical to OpenAI’s, because I simply didn’t have all the details of what they did. The truth is also that the samples and metrics I have shown aren’t 100% accurate. For one, my metric code is flawed: I made several rookie mistakes in setting up an accurate evaluation (I let train and eval data mix, used metrics whose math I didn’t understand, etc.), and the model I used to generate the samples is in fact not the final trained model, but a checkpoint from about halfway through training. I didn’t take the time to properly evaluate the strength of my model; I simply saw that I had the same amount of hardware as OpenAI and code as close to the paper as possible, and went with it. The reason for this is a simple human flaw: I got cold feet once I realized what I was sitting on, and I acted rashly. I made a mistake, I did something stupid, and that’s all there is to it.
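
To give a concrete example of the first kind of mistake: keeping an evaluation clean means holding out evaluation documents before any tokenization or shuffling happens, so no eval text can leak into the training set. Here is a minimal sketch of that idea (a hypothetical illustration, not my actual data pipeline):

```python
# Minimal sketch (hypothetical, not my actual pipeline): hold out evaluation
# documents *before* tokenization and shuffling, so no eval text can leak
# into the training set.
import random

def split_documents(documents, eval_fraction=0.01, seed=0):
    """Split raw documents into disjoint train/eval sets."""
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    n_eval = max(1, int(len(docs) * eval_fraction))
    return docs[n_eval:], docs[:n_eval]  # (train_docs, eval_docs)

# Only the train split should ever be tokenized and packed into training
# examples; the eval split is reserved purely for computing metrics.
```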

Thanks to help from OpenAI, it is now safe to say that my model is not as powerful as theirs. The metric results for WikiText2, LAMBADA, and PTB are (lower is better):

GPT2: 18.67 / 8.63 / 36.51
Mine: 43.79 / 109.47 / 202.29
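
For readers unfamiliar with these benchmarks: the numbers are perplexities, essentially the exponential of the model’s average negative log-likelihood per token on the evaluation text (the exact tokenization conventions vary by benchmark), so lower means the model predicts the text better. A minimal sketch of the calculation (the `token_log_probs` input is a hypothetical list of per-token log-probabilities, not output from my evaluation code):

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# A model that assigns probability 0.25 to every token has perplexity 4.
print(perplexity([math.log(0.25)] * 100))  # ~4.0
```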

Although I used the same amount of hardware (or more), the differences in my training setup and hyperparameters made a significant difference, which is an unfortunate reality familiar to anyone who has tried to reproduce deep learning papers. I don’t think my model in its current state is even as dangerous as the 117M model in its text-generating abilities. But I believe I have found the quirks in my setup that held the model back, and they are easy to fix. I am very tempted to continue tinkering with the model and seeing if I can improve it… but I will be holding back for now.

I think that even if the model isn’t perfect, it might be a useful “shortcut” to a more powerful model. In other words, I think someone could save a good chunk of compute when creating a more powerful 1.5B by starting from my model. Even if it isn’t very useful for generating text, it may still be useful for creating a model that is.
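
Concretely, the “shortcut” would be warm-starting: loading my model’s weights as the initialization for a new training run instead of starting from random initialization. Here is a toy sketch of the idea (a generic PyTorch illustration, not code from my repository):

```python
# Toy sketch (hypothetical; not code from my repository): warm-starting a new
# training run from saved weights instead of from random initialization.
import torch
import torch.nn as nn

def build_model():
    # Stand-in for the real 1.5B-parameter transformer.
    return nn.Sequential(nn.Embedding(50257, 64), nn.Linear(64, 50257))

# My (imperfect) model: its weights are saved to disk.
halfway_model = build_model()
torch.save(halfway_model.state_dict(), "halfway_checkpoint.pt")

# Someone else builds the same architecture, loads those weights, and then
# continues training with a better setup, skipping the compute it took to
# get this far.
improved_model = build_model()
improved_model.load_state_dict(torch.load("halfway_checkpoint.pt"))
optimizer = torch.optim.Adam(improved_model.parameters(), lr=1e-4)
# ...training loop would continue from here rather than from scratch.
```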

So while its release definitely wouldn’t have the same impact as releasing the original 1.5B, I have good reason to suspect it would still have a non-zero effect, and releasing it would undermine my overall message either way.

I think the power of my model doesn’t actually affect my main (updated) arguments, though. Please read the main update post.


