I'm glad that Tomas Mikolov and his co-authors won the Test of Time Award. It's well deserved, and I already congratulated him on Facebook a few days ago.
He is Schmidhuber-ing a little, but it's understandable.
As is often the case in academia: success has many parents, while failure is an orphan. Often, intelligent and creative minds see connections and isomorphisms between ideas and models.
Once a topic has excited smart and creative minds, they tend to have similar ideas. Many ideas are sort of "in the air." It's also true that a lot of ideas have been mentioned at some point but never implemented or executed on at scale.
I try not to attribute malice where ignorance can suffice. When I first started playing around with nnets in 2009, I tried to merge ideas from standard neural nets (which all worked on fixed-length inputs: images, sound snippets, fixed text windows) with ideas from NLP (like grammar). I was just excited and exploring, and had this idea to merge word vectors one pair at a time in a syntactic tree structure. Then I thought I should call those recursive neural nets, since I apply the same nnet composition function to its own outputs inside a DAG... A few days later I realized I should google the term "Recursive Neural Network" (now I'd use YOU.com of course ;) and was bummed that the same idea had already been invented in the 1980s with RAAMs. So just by playing around with the ideas, I had reinvented the wheel. At the same time, those old RAAMs didn't work at all; they were applied to binary vectors over vocabularies of ~32 words or so. There was no proper SGD, no dynamic tree structures, no large data, no useful objective functions on top, etc. Now, I did cite the old work of course, but it wasn't really influential for me at all. I had grown much more attached to the idea by having invented it myself from first principles.
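For anyone who wasn't around then, here is a minimal sketch of that recursive composition idea (in Python/NumPy, not the original code): one small net merges two child vectors into a parent vector and is reused at every node of a parse tree. The names (`compose`, `W`, `b`), the dimensions, and the toy phrase are illustrative assumptions, not the original setup.

```python
import numpy as np

# Sketch of a recursive neural net: the same composition function is applied
# to its own outputs as we move up a binary parse tree.
# Dimensions, initialization, and the example phrase are assumptions.

d = 50                                       # embedding size (assumed)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(d, 2 * d))  # shared composition weights
b = np.zeros(d)

def compose(left, right):
    """Merge two child vectors into one parent vector with the shared net."""
    return np.tanh(W @ np.concatenate([left, right]) + b)

# Toy parse of "very good movie": ((very good) movie)
very, good, movie = (rng.normal(size=d) for _ in range(3))
phrase = compose(compose(very, good), movie)  # same function reused at every node
```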
When Mikolov writes in his somewhat disgruntled fashion that we "copied many tricks," it's a bit unclear why, since we actually cited him 7 times in the GloVe paper. I also think that when you do a linear computation of embeddings, it's still a clever idea to do it over the aggregated statistics rather than for each sample individually 🤷 Among many other things, it made it easier to scale.
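To make the "over the statistics" point concrete, here is a rough, simplified sketch (not the actual GloVe code): collect a co-occurrence table in one pass over the corpus, then fit word vectors so that dot products approximate log co-occurrence counts. The real GloVe objective also has bias terms and a weighting function f(X_ij); the tiny corpus, window size, learning rate, and variable names below are assumptions for illustration.

```python
import numpy as np
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

# 1) One pass to collect the co-occurrence statistics (window of 1 each side).
counts = Counter()
for i, w in enumerate(corpus):
    for j in (i - 1, i + 1):
        if 0 <= j < len(corpus):
            counts[(idx[w], idx[corpus[j]])] += 1

# 2) Fit embeddings to the aggregated counts rather than to individual samples:
#    make w_i . c_j approximate log X_ij (unweighted, no biases, for brevity).
d, lr = 10, 0.05
rng = np.random.default_rng(0)
Wv = rng.normal(scale=0.1, size=(len(vocab), d))   # target word vectors
Cv = rng.normal(scale=0.1, size=(len(vocab), d))   # context word vectors
for _ in range(200):
    for (i, j), x in counts.items():
        err = Wv[i] @ Cv[j] - np.log(x)
        Wv[i], Cv[j] = Wv[i] - lr * err * Cv[j], Cv[j] - lr * err * Wv[i]
```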
Scaling is still under-appreciated in academia. Less so now, thanks to OpenAI, but still. I often think transformers are just easier to scale than LSTMs, but one could achieve similar results with both. Having the conviction that the effort of scaling will bear fruit has alpha in itself. One might argue that scaling the hardware + data + network + engineering + teams + processes is just as important for overall progress as finding a more scalable architecture.
In 2010, there were only a few folks really focused on and actively working on nnets for language: Ronan Collobert, Jason Weston, Tomas Mikolov, Yoshua Bengio, myself, Chris Manning, and a handful of others. The field moves so fast now that unless you keep doing amazing work, the new generation will quickly forget.
The fields of AI and deep NLP have expanded so much in the last year, and many folks who are just now joining or noticing them think "it came out of nowhere." Understandably, that upsets some folks who have been at it for a while and laid the groundwork. But hey, we should be mostly stoked that our ideas are scaling at this massive rate :)
Since language is inextricably connected to thought, there's still so much more to be explored. When we combine NLP with exploration, simulation, search and tool usage, we will be able to do even more than what's possible if we rely only on predicting the next token in human text.
Let's keep accelerating and building.
Dec 16, 2023 · 11:08 PM UTC