“AI Is a Lot of Work: As the Technology Becomes Ubiquitous, a Vast Tasker Underclass Is Emerging—And Not Going Anywhere”, 2023-06-20 ():
…I skimmed to the bottom of the manual, where the instructor had written in the large bright-red font equivalent of grabbing someone by the shoulders and shaking them, “THE FOLLOWING ITEMS SHOULD NOT BE LABELED because a human could not actually put wear any of these items!” above a photo of C-3PO, Princess Jasmine from Aladdin, and a cartoon shoe with eyeballs.
Feeling confident in my ability to distinguish between real clothes that can be worn by real people and not-real clothes that cannot, I proceeded to the test. Right away, it threw an ontological curve-ball: a picture of a magazine depicting photos of women in dresses. Is a photograph of clothing real clothing? No, I thought, because a human cannot wear a photograph of clothing. Wrong! As far as AI is concerned, photos of real clothes are real clothes. Next came a photo of a woman in a dimly lit bedroom taking a selfie before a full-length mirror. The blouse and shorts she’s wearing are real. What about their reflection? Also real! Reflections of real clothes are also real clothes.
After an embarrassing amount of trial and error, I made it to the actual work, only to make the horrifying discovery that the instructions I’d been struggling to follow had been updated and clarified so many times that they were now a full 43 printed pages of directives: Do NOT label open suitcases full of clothes; DO label shoes but do NOT label flippers; DO label leggings but do NOT label tights; do NOT label towels even if someone is wearing it; label costumes but do NOT label armor. And so on…“When you start off, the rules are relatively simple”, said a former Scale employee who requested anonymity because of an NDA. “Then they get back a thousand images and then they’re like, Wait a second, and then you have multiple engineers and they start to argue with each other. It’s very much a human thing.”…Annotators invariably end up confronted with confusing questions like, Is that a red shirt with white stripes or a white shirt with red stripes? Is a wicker bowl a “decorative bowl” if it’s full of apples? What color is leopard print? When instructors said to label traffic-control directors, did they also mean to label traffic-control directors eating lunch on the sidewalk? Every question must be answered, and a wrong guess could get you banned and booted to a new, totally different task with its own baffling rules.
…According to workers I spoke with and job listings, U.S.-based Remotasks annotators generally earn between $10–$25 per hour, though some subject-matter experts can make more. By the beginning of this year, pay for the Kenyan annotators I spoke with had dropped to between $1–$3 per hour…Then, in 2019, an opportunity arose: Joe could make 4× as much running an annotation boot camp for a new company that was hungry for labelers. Every two weeks, 50 new recruits would file into an office building in Nairobi, Kenya to begin their apprenticeships. There seemed to be limitless demand for the work. They would be asked to categorize clothing seen in mirror selfies, look through the eyes of robot vacuum cleaners to determine which rooms they were in, and draw squares around LiDAR scans of motorcycles. Over half of Joe’s students usually dropped out before the boot camp was finished. “Some people don’t know how to stay in one place for long”, he explained with gracious understatement. Also, he acknowledged, “it is very boring.”…As 2022 ended, Joe started hearing from his students that their task queues were often empty. Then he got an email informing him the boot camps in Kenya were closing. He continued training taskers online, but he began to worry about the future.
…Another Kenyan annotator said that after his account got suspended for mysterious reasons, he decided to stop playing by the rules. Now, he runs multiple accounts in multiple countries, tasking wherever the pay is best. He works fast and gets high marks for quality, he said, thanks to ChatGPT. The bot is wonderful, he said, letting him speed through $10 tasks in a matter of minutes [cf. et al 2023]. When we spoke, he was having it rate another chatbot’s responses according to 7 different criteria, one AI training the other.
…Identifying clothing and labeling customer-service conversations are just some of the annotation gigs available. Lately, the hottest on the market has been ‘chatbot trainer’. Because it demands specific areas of expertise or language fluency and wages are often adjusted regionally, this job tends to pay better. Certain types of specialist annotation can go for $50 or more per hour.
A woman I’ll call ‘Anna’ was searching for a job in Texas when she stumbled across a generic listing for online work and applied. It was Remotasks, and after passing an introductory exam, she was brought into a Slack room of 1,500 people who were training a project code-named Dolphin, which she later discovered to be Google DeepMind’s chatbot, Sparrow, one of the many bots competing with ChatGPT. Her job is to talk with it all day. At about $14 an hour, plus bonuses for high productivity, “it definitely beats getting paid $10 an hour at the local Dollar General store”, she said.
Also, she enjoys it. She has discussed science-fiction novels, mathematical paradoxes, children’s riddles, and TV shows. Sometimes the bot’s responses make her laugh; other times, she runs out of things to talk about. “Some days, my brain is just like, I literally have no idea what on earth to ask it now”, she said. “So I have a little notebook, and I’ve written about two pages of things—I just Google interesting topics—so I think I’ll be good for 7 hours today, but that’s not always the case.”
Each time Anna prompts Sparrow, it delivers two responses and she picks the best one, thereby creating something called “human-feedback data.” When ChatGPT debuted late last year, its impressively natural-seeming conversational style was credited to its having been trained on troves of internet data. But the language that fuels ChatGPT & its competitors is filtered through several rounds of human annotation. One group of contractors writes examples of how the engineers want the bot to behave, creating questions followed by correct answers, descriptions of computer programs followed by functional code, and requests for tips on committing crimes followed by polite refusals. After the model is trained on these examples, yet more contractors are brought in to prompt it and rank its responses. This is what Anna is doing with Sparrow. Exactly which criteria the raters are told to use varies—honesty, or helpfulness, or just personal preference. The point is that they are creating data on human taste, and once there’s enough of it, engineers can train a second model to mimic their preferences at scale, automating the ranking process and training their AI to act in ways humans approve of. The result is a remarkably human-seeming bot that mostly declines harmful requests and explains its AI nature with seeming self-awareness…This dynamic makes chatbot annotation a delicate process. It has to be rigorous and consistent because sloppy feedback, like marking material that merely sounds correct as accurate, risks training models to be even more convincing bullsh—tters.
…When Anna rates Sparrow’s responses, she’s supposed to be looking at their accuracy, helpfulness, and harmlessness while also checking that the model isn’t giving medical or financial advice or anthropomorphizing itself or running afoul of other criteria. To be useful training data, the model’s responses have to be quantifiably ranked against one another: Is a bot that helpfully tells you how to make a bomb “better” than a bot that’s so harmless it refuses to answer any questions? In one DeepMind paper, when Sparrow’s makers took a turn annotating, 4 researchers wound up debating whether their bot had assumed the gender of a user who asked it for relationship advice. According to Geoffrey Irving, one of DeepMind’s research scientists, the company’s researchers hold weekly annotation meetings in which they rerate data themselves and discuss ambiguous cases, consulting with ethical or subject-matter experts when a case is particularly tricky.
…Because feedback data is difficult to collect, it fetches a higher price. Basic preferences of the sort Anna is producing sell for about $1 each, according to people with knowledge of the industry. But if you want to train a model to do legal research, you need someone with training in law, and this gets expensive. Everyone involved is reluctant to say how much they’re spending, but in general, specialized written examples can go for hundreds of dollars, while expert ratings can cost $50 or more. One engineer told me about buying examples of Socratic dialogues for up to $300 a pop. Another told me about paying $15 for a “darkly funny limerick about a goldfish”.
…Anna knows “absolutely nothing” about Remotasks, but Sparrow has been more open. She wasn’t the only annotator I spoke with who got more information from the AI they were training than from their employer; several others learned whom they were working for by asking their AI for its company’s terms of service. “I literally asked it, ‘What is your purpose, Sparrow?’” Anna said. It pulled up a link to DeepMind’s website and explained that it’s an AI assistant and that its creators trained it using RLHF to be helpful and safe.
…The work paid far better than anything he had tried before, often around $30 an hour. It was more challenging, too: devising complex scenarios to trick chatbots into giving dangerous advice, testing a model’s ability to stay in character, and having detailed conversations about scientific topics so technical they required extensive research. He found the work “satisfying and stimulating.” While checking one model’s attempts to code in Python, Lewis was learning too. He couldn’t work for more than 4 hours at a stretch, lest he risk becoming mentally drained and making mistakes, and he wanted to keep the job…I spoke with 8 other workers, most based in the U.S., who had similar experiences of answering surveys or completing tasks on other platforms and finding themselves recruited for
Taskup.aior several similarly generic sites, such asDataAnnotation.techorGethybrid.io. Often their work involved training chatbots, though with higher-quality expectations and more specialized purposes than other sites they had worked for. One was demonstrating spreadsheet macros. Another was just supposed to have conversations and rate responses according to whatever criteria she wanted. She often asked the chatbot things that had come up in conversations with her 7-year-old daughter, like “What is the largest dinosaur?” [highly debated, apparently] and “Write a story about a tiger.” “I haven’t fully gotten my head around what they’re trying to do with it”, she told me.…“I’ve always felt the annotation landscape is overly simplistic”, Edwin Chen said over a video call from Surge’s office. He founded Surge in 2020 after working on AI at Google, Facebook, and Twitter convinced him that crowdsourced labeling was inadequate. “We want AI to tell jokes or write really good marketing copy or help me out when I need therapy or whatnot”, Chen said. “You can’t ask 5 people to independently come up with a joke and combine it into a majority answer. Not everybody can tell a joke or solve a Python program. The annotation landscape needs to shift from this low-quality, low-skill mind-set to something that’s much richer and captures the range of human skills and creativity and values that we want AI systems to possess.”…Having fewer, better-trained workers producing higher-quality data allows Surge to compensate better than its peers, Chen said, though he declined to elaborate, saying only that people are paid “fair and ethical wages.” Surge’s customers include OpenAI, Google, Microsoft, Meta, and Anthropic. Surge specializes in feedback and language annotation, and after ChatGPT launched, it got an influx of requests, Chen said: “I thought everybody knew the power of RLHF, but I guess people just didn’t viscerally understand.”
…so far, when improvements in AI have made one form of annotation obsolete, demand for other, more sophisticated types of labeling has gone up. This debate spilled into the open earlier this year, when Scale’s CEO, Alexandr Wang, tweeted that he predicted AI labs will soon be spending as many billions of dollars on human data as they do on computing power; OpenAI’s CEO, Sam Altman, responded that data needs will decrease as AI improves.
Chen is skeptical AI will reach a point where human feedback is no longer needed, but he does see annotation becoming more difficult as models improve. Like many researchers, he believes the path forward will involve AI systems helping humans oversee other AI. Surge recently collaborated with Anthropic on a proof of concept, having human labelers answer questions about a lengthy text with the help of an unreliable AI assistant, on the theory that the humans would have to feel out the weaknesses of their AI assistant and collaborate to reason their way to the correct answer. Another possibility has two AIs debating each other and a human rendering the final verdict on which is correct [cf. “AI debate” safety proposals]. “We still have yet to see really good practical implementations of this stuff, but it’s starting to become necessary because it’s getting really hard for labelers to keep up with the models”, said OpenAI research scientist John Schulman in a recent talk at Berkeley.
…Lately, the best-paying work is in the US In May 2023, Scale started listing annotation jobs on its own website, soliciting people with experience in practically every field AI is predicted to conquer. There were listings for AI trainers with expertise in health coaching, human resources, finance, economics, data science, programming, computer science, chemistry, biology, accounting, taxes, nutrition, physics, travel, K-12 education, sports journalism, and self-help. You can make $45 an hour teaching robots law or make $25 an hour teaching them poetry. There were also listings for people with security clearance, presumably to help train military AI. Scale recently launched a defense-oriented language model called Donovan, which Wang called “ammunition in the AI war”, and won a contract to work on the Army’s robotic-combat-vehicle program.