“How Tech Giants Cut Corners to Harvest Data for AI: OpenAI, Google and Meta Ignored Corporate Policies, Altered Their Own Rules and Discussed Skirting Copyright Law As They Sought Online Information to Train Their Newest Artificial Intelligence Systems”, 2024-04-06:
In late 2021, OpenAI faced a supply problem. The artificial intelligence lab had exhausted every reservoir of reputable English-language text on the internet as it developed its latest AI system. It needed more data to train the next version of its technology—lots more.
So OpenAI researchers created a speech recognition tool called Whisper. It could transcribe the audio from YouTube videos, yielding new conversational text that would make an AI system smarter. Some OpenAI employees discussed how such a move might go against YouTube’s rules, three people with knowledge of the conversations said. YouTube, which is owned by Google, prohibits use of its videos for applications that are “independent” of the video platform.
Ultimately, an OpenAI team transcribed more than one million hours of YouTube videos, the people said. The team included Greg Brockman, OpenAI’s president, who personally helped collect the videos, two of the people said. The texts were then fed into a system called GPT-4, which was widely considered one of the world’s most powerful AI models and was the basis of the latest version of the ChatGPT chatbot.
…Like OpenAI, Google transcribed YouTube videos to harvest text for its AI models, five people with knowledge of the company’s practices said. That potentially violated the copyrights to the videos, which belong to their creators.
Last year, Google also broadened its terms of service. One motivation for the change, according to members of the company’s privacy team and an internal message viewed by The Times, was to allow Google to be able to tap publicly available Google Docs, restaurant reviews on Google Maps and other online material for more of its AI products.
…Transcribing YouTube: In May, Sam Altman, the chief executive of OpenAI, acknowledged that AI companies would use up all viable data on the internet. “That will run out”, he said in a speech at a tech conference. Altman had seen the phenomenon up close…“As long as you can get over the synthetic data event horizon, where the model is smart enough to make good synthetic data, everything will be fine”, Altman said…To combat this, OpenAI and others are investigating how two different AI models might work together to generate synthetic data that is more useful and reliable. One system produces the data, while a second judges the information to separate the good from the bad. Researchers are divided on whether this method will work. AI executives are barreling ahead nonetheless. “It should be all right”, Altman said at the conference.
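The two-model arrangement described above (one system generates candidate data, a second judges it to separate the good from the bad) can be sketched roughly as follows. This is a hypothetical illustration, not OpenAI’s actual pipeline: the toy generator, the arithmetic-checking judge and the threshold are all placeholder stand-ins for real language models.

```python
import random

def generate_candidates(n, seed=0):
    """Stand-in for a generator model: emit n synthetic Q&A samples."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        x = rng.randint(1, 9)
        # Deliberately produce some wrong answers so the judge has work to do.
        y = 2 * x if rng.random() < 0.7 else 2 * x + 1
        samples.append((f"Q: What is {x}+{x}? A: {y}", x, y))
    return samples

def judge(sample):
    """Stand-in for a judge model: score a sample (here, verify the arithmetic)."""
    _, x, y = sample
    return 1.0 if y == 2 * x else 0.0

def filter_synthetic(n=100, threshold=0.5):
    """Keep only candidates whose judge score clears the threshold."""
    return [text for (text, x, y) in generate_candidates(n)
            if judge((text, x, y)) > threshold]

kept = filter_synthetic()
print(f"kept {len(kept)} of 100 candidates")
```

The open research question the article alludes to is whether a judge model can reliably filter a generator’s errors at scale; in this toy version the judge has a perfect verifier, which real language models do not.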
…At OpenAI, researchers had gathered data for years, cleaned it and fed it into a vast pool of text to train the company’s language models. They had mined the computer code repository GitHub, vacuumed up databases of chess moves and drawn on data describing high school tests and homework assignments from the website Quizlet.
By late 2021, those supplies were depleted, said eight people with knowledge of the company, who were not authorized to speak publicly. OpenAI was desperate for more data to develop its next-generation AI model, GPT-4. So employees discussed transcribing podcasts, audiobooks and YouTube videos, the people said. They talked about creating data from scratch with AI systems. They also considered buying start-ups that had collected large amounts of digital data. OpenAI eventually made Whisper, the speech recognition tool, to transcribe YouTube videos and podcasts, six people said. But YouTube prohibits people not only from using its videos for “independent” applications but also from accessing them by “any automated means (such as robots, botnets or scrapers).”
OpenAI employees knew they were wading into a legal gray area, the people said, but believed that training AI with the videos was fair use. Brockman, OpenAI’s president, was listed in a research paper as a creator of Whisper. He personally helped gather YouTube videos and fed them into the technology, two people said.
Brockman referred requests for comment to OpenAI, which said it uses “numerous sources” of data.
…Some Google employees were aware that OpenAI had harvested YouTube videos for data, two people with knowledge of the companies said. But they didn’t stop OpenAI because Google had also used transcripts of YouTube videos to train its AI models, the people said. That practice may have violated the copyrights of YouTube creators. So if Google made a fuss about OpenAI, there might be a public outcry against its own methods, the people said.
…At the time, Google’s privacy policy said the company could use publicly available information only to “help train Google’s language models and build features like Google Translate.” The privacy team wrote new terms so Google could tap the data for its “AI models and build products and features like Google Translate, Bard and Cloud AI capabilities”, which was a wider collection of AI technologies. “What is the end goal here?” one member of the privacy team asked in an internal message. “How broad are we going?” The team was told specifically to release the new terms on the 4th of July weekend, when people were typically focused on the holiday, the employees said. The revised policy debuted on July 1, at the start of the long weekend.
…In August, two privacy team members said, they pressed managers on whether Google could start using data from free consumer versions of Google Docs, Google Sheets and Google Slides. They were not given clear answers, they said. Bryant said that the privacy policy changes had been made for clarity and that Google did not use information from Google Docs or related apps to train language models “without explicit permission” from users, referring to a voluntary program that allows users to test experimental features. “We did not start training on additional types of data based on this language change”, he said.
…At Meta, which owns Facebook and Instagram, managers, lawyers and engineers last year discussed buying the publishing house Simon & Schuster to procure long works, according to recordings of internal meetings obtained by The Times. They also conferred on gathering copyrighted data from across the internet, even if that meant facing lawsuits. Negotiating licenses with publishers, artists, musicians and the news industry would take too long, they said.
…Ahmad Al-Dahle, Meta’s vice president of generative AI, told executives that his team had used almost every available English-language book, essay, poem and news article on the internet to develop a model, according to recordings of internal meetings, which were shared by an employee.
Meta could not match ChatGPT unless it got more data, Al-Dahle told colleagues. In March and April 2023, some of the company’s business development leaders, engineers and lawyers met nearly daily to tackle the problem.
Some debated paying $10 a book for the full licensing rights to new titles. They discussed buying Simon & Schuster, which publishes authors such as J. K. Rowling and Stephen King, according to the recordings.
They also talked about how they had summarized books, essays and other works from the internet without permission and discussed sucking up more, even if that meant facing lawsuits. One lawyer warned of “ethical” concerns around taking intellectual property from artists but was met with silence, according to the recordings.
Mark Zuckerberg demanded a solution, employees said.
“The capability that Mark is looking for in the product is just something that we currently aren’t able to deliver”, one engineer said.
…While Meta operates giant social networks, it didn’t have troves of user posts at its disposal, two employees said. Many Facebook users had deleted their earlier posts, and the platform wasn’t where people wrote essay-type content, they said.
Meta was also limited by privacy changes it introduced after a 2018 scandal over sharing its users’ data with Cambridge Analytica, a voter-profiling company.
Zuckerberg said in a recent investor call that the billions of publicly shared videos and photos on Facebook and Instagram are “greater than the Common Crawl data set.”
During their recorded discussions, Meta executives talked about how they had hired contractors in Africa to aggregate summaries of fiction and nonfiction. The summaries included copyrighted content “because we have no way of not collecting that”, a manager said in one meeting.
Meta’s executives said OpenAI seemed to have used copyrighted material without permission. It would take Meta too long to negotiate licenses with publishers, artists, musicians and the news industry, they said, according to the recordings.
“The only thing that’s holding us back from being as good as ChatGPT is literally just data volume”, Nick Grudin, a vice president of global partnership and content, said in one meeting.
OpenAI appeared to be taking copyrighted material and Meta could follow this “market precedent”, he added.