"Addendum: Evaluation of My Model", 2019-06-12:
As a mercifully short addendum, I'd like to quickly address a few questions about my model. Please read my update post for my important updated beliefs on this situation, because I believe the details of how powerful my model is or isn't are not actually very important to the overall picture.
As described in my technical post, my model is not identical to OpenAI's, because I simply didn't have all the details of what they did. The truth is also that the samples and metrics I have shown aren't 100% accurate. For one, my metric code is flawed: I made several rookie mistakes in setting up an accurate evaluation (I let train and eval data mix, used metrics whose math I didn't understand, etc.), and the model I used to generate the samples is in fact not the final trained model, but one from about halfway through training. I didn't take the time to evaluate the strength of my model; I simply saw that I had the same amount of hardware as OpenAI and code as close to the paper as possible, and went with it. The reason for this is a simple human flaw: I got cold feet once I realized what I was sitting on, and I acted rashly. I made a mistake, I did something stupid, that's all there is to it.
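To illustrate the train/eval mixing mistake: the minimal safeguard is to deduplicate the evaluation set against the training set before computing any metrics. This is a hedged sketch, not my original code; the function names and the whitespace-normalized whole-document hashing are my own illustrative choices, a crude first-pass check rather than a thorough decontamination.

```python
# Sketch of the leakage check I should have run before reporting metrics:
# drop any eval document that also appears (modulo whitespace) in training data.
import hashlib


def _doc_hash(doc):
    """Hash a document after collapsing whitespace, so trivial formatting
    differences don't hide a duplicate."""
    return hashlib.sha256(" ".join(doc.split()).encode("utf-8")).hexdigest()


def deduplicate_eval(train_docs, eval_docs):
    """Return only the eval documents that never occur in the training set."""
    seen = {_doc_hash(d) for d in train_docs}
    return [d for d in eval_docs if _doc_hash(d) not in seen]


train = ["the quick brown fox", "jumps over the lazy dog"]
eval_set = ["the quick  brown fox", "an unseen sentence"]
# The first eval document is a training document with extra whitespace,
# so it is flagged as leaked and removed.
print(deduplicate_eval(train, eval_set))  # -> ['an unseen sentence']
```

Real decontamination would also need to catch near-duplicates and overlapping n-grams, but even this exact-match pass would have caught the mixing I describe above.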
Thanks to help from OpenAI, it is now safe to say that my model is not as powerful as theirs. The metric results for WikiText2, LAMBADA and PTB are (lower is better):
GPT-2: 18.67 / 8.63 / 36.51
Mine: 43.79 / 109.47 / 202.29
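For context, these are perplexity-style scores: the exponential of the average per-token negative log-likelihood, so lower means the model assigns higher probability to held-out text. A minimal sketch of the relationship, with made-up token probabilities rather than anything from my actual eval code:

```python
import math


def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


# A model that assigns probability 0.25 to every token has perplexity ~4:
# it is as uncertain as a uniform choice among 4 tokens.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # ≈ 4.0
```

Seen this way, the gap in the numbers above is stark: on LAMBADA my model is roughly as uncertain as picking among ~109 tokens where GPT-2 picks among ~9.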
Although I used the same amount of hardware (or more), the differences in my training setup and hyperparameters made a substantial difference, an unfortunate reality familiar to anyone who has tried to reproduce a deep learning paper. I don't think my model in its current state is even as dangerous as 117M in its text-generating abilities. But I believe I have found the quirks in my setup that held the model back, and they are easy to fix. I am very tempted to keep tinkering with the model and see if I can improve it… but I will be holding back for now.