Addendum: Evaluation of My Model

As a mercifully short addendum, I’d like to quickly address a few questions about my model. Please read my update post for my updated beliefs on this situation, because I believe the details of how powerful my model is (or isn’t) are not actually very important to the overall picture.

As described in my technical post, my model is not identical to OpenAI’s, because I simply didn’t have all the details of what they did. The truth is also that the samples and metrics I have shown aren’t 100% accurate. For one, my metric code is flawed: I made several rookie mistakes in setting up an accurate evaluation (I let train and eval data mix, used metrics whose math I didn’t understand, etc.), and the model I used to generate the samples is in fact not the final trained model, but a checkpoint from about halfway through training. I didn’t take the time to properly evaluate the strength of my model; I simply saw that I had the same amount of hardware as OpenAI and code as close to the paper as possible, and went with it. The reason for this is a simple human flaw: I got cold feet once I realized what I was sitting on, and I acted rashly. I made a mistake, I did something stupid, and that’s all there is to it.
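
To give a concrete example of the first kind of mistake: keeping an evaluation clean means holding out evaluation documents before any tokenization or shuffling happens, so no eval text can leak into the training set. Here is a minimal sketch of that idea (a hypothetical illustration, not my actual data pipeline):

```python
# Minimal sketch (hypothetical, not my actual pipeline): hold out evaluation
# documents *before* tokenization and shuffling, so no eval text can leak
# into the training set.
import random

def split_documents(documents, eval_fraction=0.01, seed=0):
    """Split raw documents into disjoint train/eval sets."""
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    n_eval = max(1, int(len(docs) * eval_fraction))
    return docs[n_eval:], docs[:n_eval]  # (train_docs, eval_docs)

# Only the train split should ever be tokenized and packed into training
# examples; the eval split is reserved purely for computing metrics.
```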

Thanks to help from OpenAI, it is now safe to say that my model is not as powerful as theirs. The metric results for WikiText2, LAMBADA, and PTB are (lower is better):

GPT2: 18.67 / 8.63 / 36.51
Mine: 43.79 / 109.47 / 202.29
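
For readers unfamiliar with these benchmarks: the numbers are perplexities, essentially the exponential of the model’s average negative log-likelihood per token on the evaluation text (the exact tokenization conventions vary by benchmark), so lower means the model predicts the text better. A minimal sketch of the calculation (the `token_log_probs` input is a hypothetical list of per-token log-probabilities, not output from my evaluation code):

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# A model that assigns probability 0.25 to every token has perplexity 4.
print(perplexity([math.log(0.25)] * 100))  # ~4.0
```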

Although I used the same amount of hardware (or more), the differences in my training setup and hyperparameters made a significant difference, which is an unfortunate reality familiar to anyone who has tried to reproduce deep learning papers. I don’t think my model in its current state is even as dangerous as the 117M model in its text-generating abilities. But I believe I have found the quirks in my setup that held the model back, and they are easy to fix. I am very tempted to continue tinkering with the model and seeing if I can improve it… but I will be holding back for now.

I think that even if the model isn’t perfect, it might be a useful “shortcut” to a more powerful model. In other words, I think someone could save a good chunk of compute when creating a more powerful 1.5B by starting from my model. Even if it isn’t very useful for generating text, it may still be useful for creating a model that is.
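
Concretely, the “shortcut” would be warm-starting: loading my model’s weights as the initialization for a new training run instead of starting from random initialization. Here is a toy sketch of the idea (a generic PyTorch illustration, not code from my repository):

```python
# Toy sketch (hypothetical; not code from my repository): warm-starting a new
# training run from saved weights instead of from random initialization.
import torch
import torch.nn as nn

def build_model():
    # Stand-in for the real 1.5B-parameter transformer.
    return nn.Sequential(nn.Embedding(50257, 64), nn.Linear(64, 50257))

# My (imperfect) model: its weights are saved to disk.
halfway_model = build_model()
torch.save(halfway_model.state_dict(), "halfway_checkpoint.pt")

# Someone else builds the same architecture, loads those weights, and then
# continues training with a better setup, skipping the compute it took to
# get this far.
improved_model = build_model()
improved_model.load_state_dict(torch.load("halfway_checkpoint.pt"))
optimizer = torch.optim.Adam(improved_model.parameters(), lr=1e-4)
# ...training loop would continue from here rather than from scratch.
```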

So while its release definitely wouldn’t have the same impact as releasing the original 1.5B, I have good reason to suspect it would still have a non-zero effect, and releasing it would undermine my overall message either way.

I think the power of my model doesn’t actually affect my main (updated) arguments, though. Please read the main update post.


