I'm starting to think loss is harmful. Our loss has been a flat 2.2 for the past five days training GPT-2 1.5B. Yet according to human testing, it's been getting noticeably better every day. It's now good enough to amuse /r/dota2: teddit.net/r/DotA2/comments/…

Dec 31, 2019 · 11:40 PM UTC

The dota2 data is only 0.73% of the overall training data (73MB out of 10GB). Yet the bot is adept enough to convince /r/dota2 that it's talking about dota. Again: Loss has been a flat 2.2 for the last five days. And five days ago, the model wasn't this good.
The history of science shows that when something seems out of place, we should pay close attention. Loss != quality. This topic deserves thorough analysis. And as far as I know, no one has done it yet. Otherwise people wouldn't be using loss as a quality metric.
Replying to @theshawwn
I wonder if there is a connection to "deep double descent" openai.com/blog/deep-double-…
Replying to @theshawwn
Are there other interesting metrics that are still visibly evolving? Distribution of the weights, some entropy measures, etc.
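(A rough sketch of the kind of thing being suggested, not code from the thread; `checkpoint` is assumed to be a dict of parameter name -> numpy array loaded from a saved model, and the histogram entropy is just one crude way to see whether the weight distribution keeps changing while the loss sits at 2.2.)

```python
import numpy as np

def weight_stats(checkpoint):
    """Per-parameter distribution metrics for one saved checkpoint."""
    stats = {}
    for name, w in checkpoint.items():
        flat = w.ravel()
        # Entropy of a 256-bin histogram of the weight values: a coarse
        # signal for whether the distribution is still reshaping itself.
        hist, _ = np.histogram(flat, bins=256)
        hist = hist[hist > 0] / hist.sum()
        entropy = float(-np.sum(hist * np.log(hist)))
        stats[name] = {
            "mean": float(flat.mean()),
            "std": float(flat.std()),
            "l2": float(np.linalg.norm(flat)),
            "entropy": entropy,
        }
    return stats

# Logging these once a day alongside the loss would show whether the
# weights are still moving in a structured way despite the flat curve.
```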
Replying to @theshawwn @heghbalz
This would suggest that the weights still travel toward a "better" arrangement despite near-constant loss. I thought that at that point it's more of a random walk.
Replying to @theshawwn
Is the loss evaluated on the dota2 data only, or on the whole corpus? It may only be progressing on dota2 and stalling on the rest.
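(One way to check this, sketched here rather than taken from the thread: evaluate the loss separately on each slice of the corpus. This assumes a Hugging Face-style causal LM and tokenizer, and that the corpus has been split into per-source text files; the names are placeholders.)

```python
import torch

@torch.no_grad()
def mean_loss(model, tokenizer, path, block_size=1024, device="cuda"):
    ids = tokenizer.encode(open(path).read())
    losses = []
    for i in range(0, len(ids) - block_size, block_size):
        block = torch.tensor([ids[i:i + block_size]], device=device)
        # Passing `labels` makes the model return the mean cross-entropy
        # for the block (the usual LM loss).
        losses.append(model(block, labels=block).loss.item())
    return sum(losses) / len(losses)

# Tracking these two numbers separately over the five days would show
# whether the flat 2.2 hides progress on one slice and a stall on the other:
# mean_loss(model, tok, "dota2.txt") vs. mean_loss(model, tok, "everything_else.txt")
```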
Replying to @theshawwn
Something I don't understand here: the optimizer's goal is to minimize the loss, no? So how can it converge towards better solutions without actually reducing the loss? What tells it "this is a better solution" if the only signal it has for the direction is the gradient of the loss?
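(One toy way to see how flat loss and improving samples can coexist; this example is an addition, not something from the thread. The LM cross-entropy only scores the probability assigned to the observed token, so two models with identical loss can spread the remaining probability mass very differently, and that difference is exactly what you see when sampling.)

```python
import math

# Two next-token distributions over a tiny 4-token vocabulary, where
# token 0 is the "correct" next token in the training data.
p_model_a = [0.40, 0.30, 0.20, 0.10]
p_model_b = [0.40, 0.55, 0.04, 0.01]
true_token = 0

# The training loss only looks at the probability of the true token,
# so both models get exactly the same cross-entropy here (~0.916)...
loss_a = -math.log(p_model_a[true_token])
loss_b = -math.log(p_model_b[true_token])
print(loss_a, loss_b)

# ...yet the remaining 60% of probability mass is allocated very
# differently, so text sampled from the two models can differ a lot.
```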
Replying to @theshawwn
Need to train a model to detect the 'getting better' trend here.