“8 Google Employees Invented Modern AI. Here’s the Inside Story: They Met by Chance, Got Hooked on an Idea, and Wrote the Transformers Paper—The Most Consequential Tech Breakthrough in Recent History”, Steven Levy, 2024-03-20:

[A history of the developments inside Google leading to Vaswani et al 2017 (WP): low-level optimization by Noam Shazeer, trial-and-error, lots of compute & data. Interviews with the 8 authors: “Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin”.]

…All 8 authors have since left Google. Like millions of others, they are now working in some way with systems powered by what they created in 2017. I talked to the Transformer 8 to piece together the anatomy of a breakthrough, a gathering of human minds to create a machine that might well save the last word for itself.

…Apple had just announced Siri, a virtual assistant that promised to deliver one-shot answers in casual conversation, and the Google brass smelled a huge competitive threat: Siri could eat up their search traffic. They started paying a lot more attention to Uszkoreit’s new group.

“It was a false panic”, Uszkoreit says. Siri never really threatened Google. But he welcomed the chance to dive into systems where computers could engage in a kind of dialog with us. At the time, recurrent neural networks—once an academic backwater—had suddenly started outperforming other methods of AI engineering. The networks consist of many layers, and information is passed and repassed through those layers to identify the best responses. Neural nets were racking up huge wins in fields such as image recognition, and an AI renaissance was suddenly underway. Google was frantically rearranging its workforce to adopt the techniques. The company wanted systems that could churn out human-like responses—to auto-complete sentences in emails or create relatively simple customer service chatbots.

But the field was running into limitations. Recurrent neural networks struggled to parse longer chunks of text…“The methods we were applying were basically Band-Aids”, Uszkoreit says. “We could not get the right stuff to really work at scale.”

Around 2014, he began to concoct a different approach that he referred to as self-attention. This kind of network can translate a word by referencing any other part of a passage. Those other parts can clarify a word’s intent and help the system produce a good translation. “It actually considers everything and gives you an efficient way of looking at many inputs at the same time and then taking something out in a pretty selective way”, he says. Though AI scientists are careful not to confuse the metaphor of neural networks with the way the biological brain actually works, Uszkoreit does seem to believe that self-attention is somewhat similar to the way humans process language.

Uszkoreit thought a self-attention model could potentially be faster and more effective than recurrent neural nets. The way it handles information was also perfectly suited to the powerful parallel processing chips that were being produced en masse to support the machine learning boom. Instead of using a linear approach (look at every word in sequence), it takes a more parallel one (look at a bunch of them together). If done properly, Uszkoreit suspected, you could use self-attention exclusively to get better results…Say goodbye to recurrent neural nets? Heresy! “From dinner-table conversations I had with my dad, we weren’t necessarily seeing eye to eye.”
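[For concreteness: a minimal NumPy sketch of scaled dot-product self-attention, the mechanism described here—every position attends to every other position in one batched matrix operation, which is why it maps so well onto parallel hardware. Variable names and dimensions are illustrative, not taken from any Google code.]

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over one sequence.

    x            : (seq_len, d_model) input word vectors
    w_q, w_k, w_v: (d_model, d_k) learned projection matrices (random here)
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Affinity of every position with every other position, computed all at once.
    scores = q @ k.T / np.sqrt(k.shape[-1])           # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    # Each output vector is a selectively weighted mix of all positions.
    return weights @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                          # 5 "words", 16-dim embeddings
w_q, w_k, w_v = (rng.normal(size=(16, 8)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)                # (5, 8) context-aware vectors
```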

…Uszkoreit persuaded a few colleagues to conduct experiments on self-attention. Their work showed promise, and in 2016 they published a paper about it. Uszkoreit wanted to push their research further—the team’s experiments used only tiny bits of text—but none of his collaborators were interested. Instead, like gamblers who leave the casino with modest winnings, they went off to apply the lessons they had learned. “The thing worked”, he says. “The folks on that paper got excited about reaping the rewards and deploying it in a variety of different places at Google, including search and, eventually, ads. It was an amazing success in many ways, but I didn’t want to leave it there.”

Uszkoreit felt that self-attention could take on much bigger tasks. There’s another way to do this, he’d argue to anyone who would listen, and some who wouldn’t, outlining his vision on whiteboards in Building 1945, named after its address on Charleston Road on the northern edge of the Google campus.

…The transformer crew set about building a self-attention model to translate text from one language to another. They measured its performance using a benchmark called BLEU, which compares a machine’s output to the work of a human translator. From the start, their new model did well. “We had gone from no proof of concept to having something that was at least on par with the best alternative approaches to LSTMs by that time”, Uszkoreit says. But compared to long short-term memory, “it wasn’t better.”
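[A rough illustration of what BLEU measures, using NLTK’s implementation; the sentences are invented and the exact score is incidental—the point is that overlap with the human reference in 1- to 4-grams drives the number up:]

```python
from nltk.translate.bleu_score import sentence_bleu

# One human reference translation and one machine hypothesis (toy data).
reference  = [["the", "cat", "sat", "on", "the", "mat"]]
hypothesis =  ["the", "cat", "sat", "on", "a", "mat"]

# BLEU is a precision over overlapping n-grams (by default 1- to 4-grams),
# with a penalty for translations that are too short.
print(f"BLEU: {sentence_bleu(reference, hypothesis):.3f}")
```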

They had reached a plateau—until one day in 2017, when Noam Shazeer heard about their project, by accident. Shazeer was a veteran Googler—he’d joined the company in 2000—and an in-house legend, starting with his work on the company’s early ad system. Shazeer had been working on deep learning for 5 years and recently had become interested in large language models. But these models were nowhere close to producing the fluid conversations that he believed were possible…Shazeer found the existing recurrent neural networks “irritating” and thought: “Let’s go replace them!”

Shazeer’s joining the group was critical. “These theoretical or intuitive mechanisms, like self-attention, always require very careful implementation, often by a small number of experienced ‘magicians’, to even show any signs of life”, says Uszkoreit. Shazeer began to work his sorcery right away. He decided to write his own version of the transformer team’s code. “I took the basic idea and made the thing up myself”, he says. Occasionally he asked Kaiser questions, but mostly, he says, he “just acted on it for a while and came back and said, ‘Look, it works.’” Using what team members would later describe with words like “magic” and “alchemy” and “bells and whistles”, he had taken the system to a new level.

“That kicked off a sprint”, says Gomez. They were motivated, and they also wanted to hit an upcoming deadline—May 19, the filing date for papers to be presented at the biggest AI event of the year, the Neural Information Processing Systems conference in December. As what passes for winter in Silicon Valley shifted to spring, the pace of the experiments picked up. They tested two models of transformers: one that was produced with 12 hours of training and a more powerful version called Big that was trained over 3 and a half days. They set them to work on English-to-German translation.

The basic model outperformed all competitors—and Big earned a BLEU score that decisively shattered previous records while also being more computationally efficient. “We had done it in less time than anyone out there”, Parmar says. “And that was only the beginning, because the number kept improving.” When Uszkoreit heard this, he broke out an old bottle of champagne he had lying around in his mountain expedition truck.

The last two weeks before the deadline were frantic. Though officially some of the team still had desks in Building 1945, they mostly worked in Building 1965 because it had a better espresso machine in the micro-kitchen. “People weren’t sleeping”, says Gomez, who, as the intern, lived in a constant debugging frenzy and also produced some diagrams for the paper. It’s common in such projects to do ablations—taking things out to see whether what remains is enough to get the job done.

“There was every possible combination of tricks and modules—which one helps, which doesn’t help. Let’s rip it out. Let’s replace it with this”, Gomez says. “Why is the model behaving in this counterintuitive way? Oh, it’s because we didn’t remember to do the masking properly. Does it work yet? OK, move on to the next. All of these components of what we now call the transformer were the output of this extremely high-paced, iterative trial and error.” The ablations, aided by Shazeer’s implementations, produced “something minimalistic”, Jones says. “Noam is a wizard.”
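[The “masking” Gomez mentions is presumably the decoder’s causal mask, which stops each position from attending to later positions; without it, a decoder trained on translation can simply copy the next target token instead of learning to predict it. A minimal sketch of such a mask (names illustrative), applied to attention scores before the softmax:]

```python
import numpy as np

def causal_mask(seq_len):
    """True above the diagonal: position i may only attend to positions <= i."""
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

scores = np.random.default_rng(0).normal(size=(4, 4))  # raw attention scores
scores[causal_mask(4)] = -np.inf                        # masked positions get zero weight after softmax
```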

…Gomez was there, and Vaswani told him that what they were working on would transcend machine translation. “Ultimately, like with the human brain, you need to unite all these modalities—speech, audio, vision—under a single architecture”, he says. “I had a strong hunch we were onto something more general.”

In the higher echelons of Google, however, the work was seen as just another interesting AI project. I asked several of the transformers folks whether their bosses ever summoned them for updates on the project. Not so much. But “we understood that this was potentially quite a big deal”, says Uszkoreit. “And it caused us to actually obsess over one of the sentences in the paper toward the end, where we comment on future work.” That sentence anticipated what might come next—the application of transformer models to basically all forms of human expression. “We are excited about the future of attention-based models”, they wrote. “We plan to extend the transformer to problems involving input and output modalities other than text” and to investigate “images, audio and video.”

…When the transformer crew heard back from the conference peer reviewers, the response was a mix. “One was positive, one was extremely positive, and one was, ‘This is OK’”, says Parmar. The paper was accepted for one of the evening poster sessions.

By December, the paper was generating a buzz. [especially on Twitter & Reddit] Their 4-hour session on December 6 was jammed with scientists wanting to know more. The authors talked until they were hoarse. By 10:30 pm, when the session closed, there was still a crowd. “Security had to tell us to leave”, says Uszkoreit. Perhaps the most satisfying moment for him was when computer scientist Sepp Hochreiter came up and praised the work—quite a compliment, considering that Hochreiter was the co-inventor of long short-term memory, which transformers had just booted as the go-to hammer in the AI toolkit.

…Transformers did not instantly take over the world, or even Google. Kaiser recalls that around the time of the paper’s publication, Shazeer proposed to Google executives that the company abandon the entire search index and train a huge network with transformers—basically to transform how Google organizes information. [Presumably he didn’t mean direct Q&A of everything? but maybe predicting relevant document URLs—which does work.] At that point, even Kaiser considered the idea ridiculous. Now the conventional wisdom is that it’s a matter of time.

…A startup called OpenAI was much faster to pounce. Soon after the paper was published, OpenAI’s chief scientist, Ilya Sutskever—who had known the transformer team during his time at Google—suggested that one of its scientists, Alec Radford, work on the idea. The results were the first GPT products [GPT-1]. As OpenAI CEO Sam Altman told me last year, “When the transformer paper came out, I don’t think anyone at Google realized what it meant.”

The picture internally is more complicated. “It was pretty evident to us that transformers could do really magical things”, says Uszkoreit. “Now, you may ask the question, why wasn’t there ChatGPT by Google back in 2018? Realistically, we could have had GPT-3 or even GPT-3.5 probably in 2019, maybe 2020. The big question isn’t, did they see it? The question is, why didn’t we do anything with the fact that we had seen it? The answer is tricky.”…When I asked CEO Sundar Pichai last year why his company wasn’t first to launch a large language model like ChatGPT, he argued that in this case Google found it advantageous to let others lead. “It’s not fully clear to me that it might have worked out as well. The fact is, we can do more after people had seen how it works”, he said.

…Kaiser is the only one who hasn’t founded a company. He joined OpenAI and is one of the inventors of a new technology called Q*, which Altman said last year will “push the veil of ignorance back and the frontier of discovery forward.” (When I attempted to quiz Kaiser on this in our interview, the OpenAI PR person almost leaped across the table to silence him.)

…Without that environment: no transformer. Not only were the authors all Google employees, they also worked out of the same offices. Hallway encounters and overheard lunch conversations led to big moments. The group is also culturally diverse. 6⁄8 authors were born outside the United States; the other two are, respectively, the child of two green-card-carrying Germans who were temporarily in California and a first-generation American whose family had fled persecution.