“Why YouTube Could Give Google an Edge in AI”, Jon Victor, 2023-06-14:

The video site, which Google owns, is the single biggest and richest source of imagery, audio and text transcripts on the internet. And Google’s researchers have been using YouTube to develop its next large-language model, Gemini, according to a person with knowledge of the situation. The value of YouTube hasn’t been lost on OpenAI, either: The startup has secretly used data from the site to train some of its artificial intelligence models, said one person with direct knowledge of the effort.

OpenAI Digs YouTube: It’s possible such techniques would lead Google to OpenAI. The Microsoft-backed startup found so much value in YouTube videos that it previously used them to train its AI, including a model called Whisper that automatically converts speech to text, according to a person with direct knowledge of the practice. OpenAI also used podcasts to develop Whisper, this person said, though the sources of those podcasts couldn’t be learned. OpenAI open-sourced Whisper, but some data used to train the Whisper model were later used to train GPT-4, the LLM that powers ChatGPT, the company’s biggest revenue generator, the person with direct knowledge said.

It isn’t clear whether OpenAI violated any YouTube policies, but the site’s terms of service prohibit viewing or listening to content for anything other than “personal, non-commercial use”, as well as accessing the service using automated means. Spokespeople for OpenAI and Google declined to comment.

Google, however, took similar liberties with OpenAI’s data. At one point Google used… [OA API transcripts to train an LLM]