“Inside Big Tech’s Underground Race to Buy AI Training Data”, 2024-04-05:
At its peak in the early 2000s, Photobucket was the world’s top image-hosting site. The media backbone for once-hot services like MySpace and Friendster, it boasted 70 million users and accounted for nearly half of the US online photo market. Today only 2 million people still use Photobucket, according to analytics tracker Similarweb. But the generative AI revolution may give it a new lease of life.
CEO Ted Leonard, who runs the 40-strong company out of Edwards, Colorado, told Reuters he is in talks with multiple tech companies to license Photobucket’s 13 billion photos and videos to be used to train generative AI models that can produce new content in response to text prompts. He has discussed rates of between $0.05 and $1 per photo and more than $1 per video, he said, with prices varying widely by both the buyer and the type of imagery sought. “We’ve spoken to companies that have said, ‘we need way more’”, Leonard added, with one buyer telling him they wanted over a billion videos, more than his platform has. “You scratch your head and say, where do you get that?”
…At the same time, these tech companies are also quietly paying for content locked behind paywalls and login screens, giving rise to a hidden trade in everything from chat logs to long-forgotten personal photos from faded social media apps. “There is a rush right now to go for copyright holders that have private collections of stuff that is not available to be scraped”, said Edward Klaris from law firm Klaris Law, which says it’s advising content owners on deals worth tens of millions of dollars apiece to license archives of photos, movies and books for AI training. Reuters spoke to more than 30 people with knowledge of AI data deals, including current and former executives at companies involved, lawyers and consultants, to provide the first in-depth exploration of this fledgling market, detailing the types of content being bought, the prices materializing, plus emerging concerns about the risk of personal data making its way into AI models without people’s knowledge or explicit consent.
…Many major market research firms say they have not even begun to estimate the size of the opaque AI data market, where companies often don’t disclose agreements. Those researchers who do, such as Business Research Insights, put the market at roughly $2.5 billion now and forecast it could grow close to $30 billion within a decade.
…In the months after ChatGPT debuted in late 2022, for instance, companies including Meta, Google, Amazon and Apple all struck agreements with stock image provider Shutterstock to use hundreds of millions of images, videos and music files in its library for training, according to a person familiar with the arrangements. The deals with Big Tech firms initially ranged from $25 million to $50 million each, though most were later expanded, Shutterstock’s Chief Financial Officer Jarrod Yahes told Reuters. Smaller tech players have followed suit, spurring a fresh “flurry of activity” in the past two months, he added.
…A Shutterstock competitor, Freepik, told Reuters it had struck agreements with two large tech companies to license the majority of its archive of 200 million images at $0.02-$0.04 per image. There are 5 more similar deals in the pipeline, said CEO Joaquin Cuenca Abela, declining to identify buyers.
OpenAI, an early Shutterstock customer, has also signed licensing agreements with at least 4 news organizations, including The Associated Press and Axel Springer. Thomson Reuters, the owner of Reuters News, separately said it has struck deals to license news content to help train AI large language models, but didn’t disclose details.
‘Ethically Sourced’ Content
An industry of dedicated AI data firms is emerging too, securing rights to real-world content like podcasts, short-form videos and interactions with digital assistants, while also building networks of short-term contract workers to produce custom visuals and voice samples from scratch, akin to an Uber-esque gig economy for data.
Seattle-based Defined.ai licenses data to a range of companies including Google, Facebook, Apple, Amazon and Microsoft, CEO Daniela Braga told Reuters.
Rates vary by buyer and content type, but Braga said companies are generally willing to pay $1-$2 per image, $2-$4 per short-form video and $100-$300 per hour of longer films. The market rate for text is $0.001 per word, she added.
Images of nudity, which require the most sensitive handling, go for $5-$7, she said.
Defined.ai splits those earnings with content providers, Braga said. It markets its datasets as “ethically sourced”, as it obtains consent from people whose data it uses and strips out personally identifying information, she added.
One of the firm’s suppliers, a Brazil-based entrepreneur, said he pays owners of the photos, podcasts and medical data he sources about 20%–30% of total deal amounts.
The priciest images in his portfolio are those used to train AI systems that block content like graphic violence barred by the tech companies, said the supplier, who spoke on condition his company wasn’t identified, citing commercial sensitivity.
To fulfill those requests, he obtains images of crime scenes, conflict violence and surgeries—mainly from police, freelance photojournalists and medical students, respectively—often in places in South America and Africa where distributing graphic images is more common, he said. He said he has received images from freelance photographers in Gaza since the start of the war there in October, plus some from Israel at the outset of hostilities.
His company hires nurses accustomed to seeing violent injuries to anonymize and annotate the images, which are disturbing to untrained eyes, he added.
…Photobucket CEO Leonard says he is on solid legal ground, citing an update to the company’s terms of service in October that grants it the “unrestricted right” to sell any uploaded content for the purpose of training AI systems. He sees licensing data as an alternative to selling ads. “We need to pay our bills, and this could give us the ability to continue to support free accounts”, he said.
Defined.ai’s Braga said she avoids acquiring content from “platform” companies like Photobucket and prefers to source social media photos from influencers who create them, who she said have a clearer claim to licensing rights. “I would find it very risky”, Braga said of platform content. “If there’s some AI that generates something that resembles a picture of someone who never approved that, that’s a problem.”
Photobucket is not alone among platforms in embracing licensing. Tumblr’s parent company Automattic said last month it was sharing content with “select AI companies.” In February, Reuters reported Reddit struck a deal with Google to make its content available for training the latter’s AI models.