I'm glad that Tomas Mikolov and his co-authors won the Test of Time Award. It's well deserved, and I already congratulated him on Facebook a few days ago.
He is Schmidhuber-ing a little, but it's understandable.
As is often the case in academia: success has many parents, while failure is an orphan. Often, intelligent and creative minds see connections and isomorphisms between ideas and models.
Once a topic has excited smart and creative minds, they tend to have similar ideas. Many ideas are sort of "in the air." It's also true that a lot of ideas have been mentioned at some point but never implemented or executed on at scale.
I try not to attribute malice where ignorance can suffice. When I first started playing around with nnets in 2009, I tried to merge ideas from standard neural nets (which all worked on fixed-length inputs: images, sound snippets, fixed text windows) with ideas from NLP (like grammar). I was just excited and exploring, and had this idea to merge word vectors one pair at a time in a syntactic tree structure. Then I thought I should call those recursive neural nets, since I apply the same nnet composition function to its own outputs inside a DAG... A few days later I realized I should google the term "Recursive Neural Network" (now I'd use YOU.com of course ;) and was bummed that the same idea had already been invented in the 1980s with RAAMs. So just by playing around with the ideas, I had reinvented the wheel. At the same time, those old RAAMs didn't work at all; they were applied to binary vectors over vocabularies of ~32 words or so. There was no proper SGD, no dynamic tree structures, no large data, no useful objective functions on top, etc. Now, I did cite the old work of course, but it wasn't really influential for me at all. I had grown much more attached to the idea by having invented it myself from first principles.
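For anyone who wasn't around then, here is a minimal sketch of that recursive composition idea (in Python/NumPy, not the original code): one small net merges two child vectors into a parent vector and is reused at every node of a parse tree. The names (`compose`, `W`, `b`), the dimensions, and the toy phrase are illustrative assumptions, not the original setup.

```python
import numpy as np

# Sketch of a recursive neural net: the same composition function is applied
# to its own outputs as we move up a binary parse tree.
# Dimensions, initialization, and the example phrase are assumptions.

d = 50                                       # embedding size (assumed)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(d, 2 * d))  # shared composition weights
b = np.zeros(d)

def compose(left, right):
    """Merge two child vectors into one parent vector with the shared net."""
    return np.tanh(W @ np.concatenate([left, right]) + b)

# Toy parse of "very good movie": ((very good) movie)
very, good, movie = (rng.normal(size=d) for _ in range(3))
phrase = compose(compose(very, good), movie)  # same function reused at every node
```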
When Mikolov writes in his somewhat disgruntled fashion that we "copied many tricks," it's a bit unclear why, since we actually cited him 7 times in the GloVe paper. I also think that when you do a linear computation of embeddings, it's still a clever idea to do it over the aggregated statistics rather than for each sample individually 🤷 Among many other things, it made it easier to scale.
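To make the "over the statistics" point concrete, here is a rough, simplified sketch (not the actual GloVe code): collect a co-occurrence table in one pass over the corpus, then fit word vectors so that dot products approximate log co-occurrence counts. The real GloVe objective also has bias terms and a weighting function f(X_ij); the tiny corpus, window size, learning rate, and variable names below are assumptions for illustration.

```python
import numpy as np
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

# 1) One pass to collect the co-occurrence statistics (window of 1 each side).
counts = Counter()
for i, w in enumerate(corpus):
    for j in (i - 1, i + 1):
        if 0 <= j < len(corpus):
            counts[(idx[w], idx[corpus[j]])] += 1

# 2) Fit embeddings to the aggregated counts rather than to individual samples:
#    make w_i . c_j approximate log X_ij (unweighted, no biases, for brevity).
d, lr = 10, 0.05
rng = np.random.default_rng(0)
Wv = rng.normal(scale=0.1, size=(len(vocab), d))   # target word vectors
Cv = rng.normal(scale=0.1, size=(len(vocab), d))   # context word vectors
for _ in range(200):
    for (i, j), x in counts.items():
        err = Wv[i] @ Cv[j] - np.log(x)
        Wv[i], Cv[j] = Wv[i] - lr * err * Cv[j], Cv[j] - lr * err * Wv[i]
```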
Scaling is still under-appreciated in academia. Less so now, thanks to OpenAI, but still. I often think transformers are just easier to scale than LSTMs, but one could achieve similar results with both. Having the conviction that the effort of scaling will bear fruit has alpha in itself. One might argue that scaling the hardware + data + network + engineering + teams + processes is just as important for overall progress as finding a more scalable architecture.
In 2010, there were only a few folks really focused on and actively working on nnets for language: Ronan Collobert, Jason Weston, Tomas Mikolov, Yoshua Bengio, myself, Chris Manning, and a handful of others. The field moves so fast now that unless you keep doing amazing work, the new generation will quickly forget.
The fields of AI and deep NLP have expanded so much in the last year, and many folks who are just now joining or noticing them think "it came out of nowhere." Understandably, that upsets some folks who have been at it for a while and laid the groundwork. But hey, we should be mostly stoked that our ideas are scaling at this massive rate :)
Since language is inextricably connected to thought, there's still so much more to be explored. When we combine NLP with exploration, simulation, search and tool usage, we will be able to do even more than what's possible if we rely only on predicting the next token in human text.
Let's keep accelerating and building.
Dec 16, 2023 · 11:08 PM UTC