“OpenAI Destroyed a Trove of Books Used to Train AI Models. The Employees Who Collected the Data Are Gone.”, Darius Rafieyan & Hasan Chowdhury, 2024-05-07:

…Lawyers for the Authors Guild said in court filings that the datasets probably contained “more than 100,000 published books” and were central to its allegations that OpenAI used copyrighted materials to train AI models. For months the Guild has been seeking information from OpenAI about the datasets. The company initially resisted, citing confidentiality concerns, before ultimately disclosing that it had deleted all copies of the data, according to the legal filings reviewed by Business Insider.

…In a 2020 white paper, OpenAI described the books1 and books2 datasets as “internet-based books corpora” and said they made up 16% of the training data that went into creating GPT-3. The white paper also says books1 and books2 together contained 67 billion tokens of data, roughly the equivalent of 50 billion words. For comparison, the King James Bible contains 783,137 words.

…The unsealed letter from OpenAI’s lawyers, which is labeled “highly confidential—attorneys’ eyes only”, says that the use of books1 and books2 for model training was discontinued in late 2021 and that the datasets were deleted in mid-2022 because of their nonuse. The letter goes on to say that none of the other data used to train GPT-3 has been deleted and offers attorneys for the Authors Guild access to those other datasets.

The unsealed documents also disclose that the two researchers who created books1 and books2 are no longer employed by OpenAI. OpenAI initially refused to share the identities of the two employees.

The startup has since identified the employees to lawyers for the Authors Guild but hasn’t publicly disclosed their names. OpenAI has petitioned the court to keep the names of the two employees, as well as information about the datasets, under seal. The Authors Guild has opposed this, arguing for the public’s right to know. The dispute is ongoing.

[OA was always notably reticent about discussing books1/books2 at all.]