Useful Outsourcing Is Hard

Gwern

Asking ‘what if I tried to use an amnesiac natural general intelligence to solve this… using only emails’ is a good defamiliarizing trick for thinking about AI workflows and the difficulty of outsourcing white-collar work or your life.

by: Gwern 2024-09-16–2026-06-02 finished certainty: log

Tools can be highly capable but still not useful for automation or white-collar workers due to overhead, friction, lack of context, and ignorance of what to use them for. This is why it is difficult to “outsource” things to grad students, secretaries, people overseas, etc.

This is true of AI as well. So, if someone struggles to find a use for LLMs which actually saves them time, and blames the LLMs for being too stupid or incapable, they may be wrong; it may be that this is something which is difficult to outsource.

A simple way to distinguish between these two problems is to ask, “could you outsource it to a human being who worked for free?” (You can even test this by pretending to be that human yourself!) If the answer is “no”, then it cannot be a matter of ‘artificial’ versus ‘natural’ intelligence.

Often, we will find the answer is “no” and this is why even famous important people with little time were not “outsourcing” most things even in the past.

An important implication is that the level of “outsourcing” to LLMs is not a good measure of capability, because it depends more on factors like how much individuals are willing to rework their lives to make outsourcing possible, how far they make “context” explicit (preferably textual), chance outcomes of whether LLMs can handle specific limiting steps, etc.

For example, even though many of my essays are, I think, fairly “obvious”, given my corpus as a whole, it is impossible for me to hand my notes and corpus to any human or LLM, specify a 1-sentence essay idea, and get back a publishable essay; they cannot see or memorize my entire corpus, they are unable to imitate my style, and so on. And this is despite the fact that they are more capable than me in many ways, like coding or math or general knowledge.

So, because of the weak link between full outsource-ability and raw capabilities, LLMs may become increasingly capable while still not that useful for outsourcing for any given individual… right until the point where they suddenly are capable of simply replacing that individual (possibly to their great shock). Which is good for whoever employs the LLM, but not for that individual.

If you’re having trouble coming up with tasks for ‘artificial intelligence too cheap to meter’, it could be because you are having trouble coming up with tasks for intelligence, period. Just because something is highly useful doesn’t mean you can immediately make use of it in your current local optimum; you may need to seriously reorganize your life and workflows before any kind of intelligence could be useful.

There is a good post on the LW2 front page right now as I write this, about exactly this problem: “The Great Data Integration Schlep”. Most of the examples in it do not actually depend on the details of ‘AI’ vs employee vs contractor vs API vs…—the organization is organized to defeat the improvement. It doesn’t matter whether it’s a data scientist or an AI reading the data if there is some employee whose career depends on that data not being read and who is sabotaging it, or some department defending its fief.

I usually call this concept “automation as colonization wave”: many major technologies of undoubted enormous value, such as steam or the Internet or teleconferencing/remote-working, take a long time to have massive effects, because you have everyone stuck in local optima and potentially outright sabotaging any integration of the Big New Thing, and potentially have to create entirely new organizations and painfully liquidate the old ones through decades of bleeding.

There are few valuable “AI-shaped holes” because we’ve organized everything to minimize the damage from lacking AI to fill those holes, as it were: if there were some sort of organization which had naturally large LLM-shaped holes where filling them would massively increase the organization’s output… It would’ve gone extinct long ago and been replaced by ones with human-shaped holes instead, because humans were all you could get.

This is why LLM uses are pretty ridiculous right now as a % of GDP—oh wow, it can do a slightly better job of “grammar-checking my emails”? I can have it write some code for me? Not exactly a new regime of hyperbolic global economic growth.

So one thing you could try, if you are struggling to spend $1,000⧸month usefully on artificial intelligence, is to instead experiment by committing to spend that on natural intelligence.

That is, look into hiring a remote worker / executive assistant / secretary, an intern, or something else of that ilk. They are, by definition, a flexible multimodal generally-intelligent human-level neural net capable of tool use and agency, a natural general intelligence or ‘NGI’ if you will. (And if you mentally ignore that money because it’s an experiment, you can treat it as ‘natural intelligence too cheap to meter’, regarding it as a sunk cost.)

An outsourced human fills a very similar hole as an AI could, so it removes the distracting factor of AI and simply asks, ‘are there any large, valuable, genuinely-moving-the-needle outsourced-human-shaped holes in your life?’ There probably are not! Then it’s no surprise if you can’t plug the holes which don’t exist with any AI, present or future.

(If this is still too confusing, you can try treating yourself as a remote worker and roleplaying as them: send yourself emails, pretend you have amnesia as you write each reply, avoid doing anything a remote worker could not do (like editing files on your computer), and charge yourself an appropriate hourly rate, terminating at $1,000 cumulative.)

What I’m suggesting is finding things like, “email your current blog post draft to the assistant for copyediting”. Does this wind up saving time on net compared to repeatedly rereading it yourself, possibly using the standard tricks like reading it upside down or printing it out? Then this is something that potentially an LLM can help with. But if it doesn’t save you time on net (because there is too much overhead or you don’t write blog posts in the first place), then it doesn’t matter how much the LLM costs—you don’t want to pay even $0 for it.
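The net-time test above can be written down as a tiny calculation. This is only a sketch: the function names and the example minutes are hypothetical, chosen for illustration rather than measured, and "overhead" here lumps together everything from writing the email to waiting for the reply.

```python
def net_minutes_saved(minutes_doing_it_yourself: float,
                      minutes_of_overhead: float,
                      minutes_fixing_result: float) -> float:
    """Minutes saved by delegating a task once.

    A negative result means delegating cost more time than it saved.
    """
    return minutes_doing_it_yourself - (minutes_of_overhead + minutes_fixing_result)


def worth_delegating(minutes_doing_it_yourself: float,
                     minutes_of_overhead: float,
                     minutes_fixing_result: float) -> bool:
    """True only if delegation saves time on net.

    The helper's price is deliberately absent: if this is False,
    the task is not worth delegating even at $0.
    """
    return net_minutes_saved(minutes_doing_it_yourself,
                             minutes_of_overhead,
                             minutes_fixing_result) > 0


# Hypothetical numbers: an hour of rereading a draft yourself, versus
# 10 minutes writing the email plus 15 minutes reviewing the edits.
copyedit_saving = net_minutes_saved(60, 10, 15)

# Hypothetical numbers: a 20-minute chore that takes 15 minutes to
# explain and 10 minutes to check afterwards.
chore_is_worth_it = worth_delegating(20, 15, 10)
```

With these made-up inputs the copyediting case comes out ahead and the chore does not, which is the whole point: the sign of the result depends on the overhead, not on who or what does the work.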

It may be quite hard to do this. In fact, if you try to hire an outsourced ‘executive assistant’ through a remote-worker company, some companies will first make you fill out a bunch of forms (including personality surveys), and have you do a short online course about how to use an EA effectively before they will assign you an EA! This may sound absurd, but it reflects the reality that it can be hard to know how to use an EA. (Do you? I don’t. And decreasingly many people have ever had a secretary in their life.)

If you find you cannot make good use of your hired naturally intelligent neural net, then that fully explains your difficulty in coming up with compelling usecases for artificially intelligent neural nets too. And if you can, great: you now have a clean set of tasks you can meaningfully try to do with AI services as a private benchmark.

This helps illustrate the distinction between ‘capabilities’ and ‘being useful enough to me to pay $1,000⧸month for right now’. They are not the same thing at all, and the absence of the latter only weakly implies absence of the former.

An analogous example might be the difficulties some people have in ‘being rich’ or ‘becoming a manager/learning to delegate’. If you were poor or are used to doing everything yourself, it can be difficult to spend your new money well or make any good use of your secretary or junior employees; but one would not infer from that a conclusion like “money is useless” or “staff are useless”. It is simply that you need to figure out how to live your new life, and your old ways were adapted to your old life.

This can be surprisingly hard sometimes: there are many anecdotes of people who are destroyed by their newfound wealth or can’t do anything but hoard it, or who run an organization into the ground because they are unable to delegate. Even the simpler forms are hard. (On the very rare occasion I stay at a luxury hotel/cruise ship or go to a fancy restaurant, where there is a lot of staff who are there to cater to your every whim, I struggle to come up with whims worth catering to, because having been raised middle-class and being used to staying in the cheapest hotels where waking up sans bed bugs is a minor victory, I mostly find anything like a ‘servant’ to be extremely alienating and stressful and don’t know how to get anything out of it. I’m sure I could do so if this became an ordinary thing, but it would still take time—I don’t just automatically know how to adjust!)

Let me give a personal concrete example. I am a writer, and highly enthusiastic about LLMs, but I still struggle to get a lot of personal value out of LLMs as of 2024.

I couldn’t spend $1,000⧸month on LLM calls. I currently can manage ~$50⧸month, between a ChatGPT subscription, embeddings, and highly-constrained AI formatting use (eg. converting LaTeX math to HTML/Unicode or breaking up monolithic single-paragraph abstracts into readable paragraphs), but I would struggle to double that. Why is that? Because while the LLMs are very intelligent and knowledgeable, and are often a lot better than I am at many things going beyond just programming, “automation as colonization wave” means they cannot bring that to bear in a useful way for me.

So, the last thing I wrote was a short mini-essay on why cats spontaneously bite you during petting; I argue, in line with my knocking-things-over essay and other parts of my big cat psychology essay, that it is a misdirected prey drive where you accidentally trigger it by resembling small prey animals.

I had to write it all myself, and I asked several LLMs for feedback, and made a few tweaks, but they added relatively little; let’s say <5% of the value of the finished mini-essay. I’d value the mini-essay itself at maybe $100-ish; I think it is likely true and cat readers will find the discussion mildly interesting and in the long run it adds value to my site to be there, but a proof of the Riemann hypothesis it is not. So, the LLM advice was at best worth a few bucks.

Why so helpless? The writing is not anything special, and the specific points made appear to all be familiar to the LLMs from the vast Internet corpus. But to deliver a lot of value, the LLMs would have to either come up with the novel connection between prey drive & spontaneous biting on their own and tell someone like me, or be able to write it given a minimal prompt from me like ‘Maybe cats bite during petting because prey drive? write please’. I know what you’re wondering. Claude-3.5, GPT-4o, and o1-preview produce outputs here which are largely useless and would cost more time to edit into something usable than they’d save. And they would have to do so while writing like me, with appropriate Wikipedia links, specific incidents like Siegfried & Roy rather than vague bloviating, Markdown output, and inserting into the appropriate place in Gwern.net… Obviously, ye olde ChatGPT web interface does not do any of that. I have to. So, by the time I have done all that, there’s not much left for the LLM to do.

Is it impossible in principle for an LLM to do that, or would they have to be Immanentizing the Eschaton already before they could be of genuine value to me? No, of course not. Actually, I think that even Claude-3.5 or o1-preview would probably be capable of writing that mini-essay… with appropriate reorganization of the entire workflow.

Pasting in short prompts into a chatbot browser tab doesn’t cut the mustard, but it’s not hard to see what would. For example, “Siegfried & Roy” doesn’t come out of nowhere; it is in my clippings already, and a model trained on my clippings or at least with retrieval to them would easily pull it out as an example of ‘spontaneous cat biting’ and incorporate it. Writing stylistically like me is not hard: the base models (used to) do a good job, and they would get even better if finetuned on my site and IRC logs and whatnot to refresh their memory.
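To make "retrieval over my clippings" concrete, here is a toy sketch of the idea. Everything in it is an assumption for illustration: the clipping names and texts are invented placeholders (not the real clippings), and crude word overlap stands in for the embedding search a real system would more plausibly use.

```python
import re
from collections import Counter


def tokens(text: str) -> Counter:
    """Lowercase a string and count its alphabetic words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def overlap_score(query: str, document: str) -> int:
    """Number of query words (with multiplicity) that also occur in the document."""
    q, d = tokens(query), tokens(document)
    return sum(min(q[word], d[word]) for word in q)


def retrieve(query: str, clippings: dict[str, str], top_k: int = 1) -> list[str]:
    """Return the names of the top_k clippings sharing the most words with the query."""
    ranked = sorted(clippings,
                    key=lambda name: overlap_score(query, clippings[name]),
                    reverse=True)
    return ranked[:top_k]


# Invented placeholder clippings, keyed by a hypothetical note name.
clippings = {
    "siegfried-and-roy": "a tiger bit its trainer on stage during a performance without warning",
    "latex-notes": "converting latex math to html and unicode",
    "hotel-review": "cheap hotel with no bed bugs",
}

query = "cat bit owner during petting without warning"
best_match = retrieve(query, clippings)
```

Even this naive matcher surfaces the big-cat clipping for a query about spontaneous biting; the hard part is not the lookup but getting the corpus in front of the model in the first place.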

The place in my essays where I discuss spontaneous cat biting, like of my grandmother, is also no challenge for an LLM with retrieval or long context windows. Inserting a Markdown-formatted footnote is downright trivial. Reading my cat-related writings and prioritizing the prey drive as an explanation of otherwise-mysterious domestic cat behaviors is maybe too much ‘insight’ to expect, but given a single sentence explicitly saying it, they would definitely get the idea and could elaborate on it: writing a few paragraphs in my style, with the relevant references I would think of, and inserting them, appropriately formatted, into the appropriate place in the Gwern.net corpus.

I would pay $100 if I could type in a single sentence like ‘spontaneous biting is prey drive!’ and 10 seconds later, up pops a diff with the current mini-essay for me to read and then edit or approve; and since I have perhaps 10 such insights a month, then I could easily spend $1,000⧸month on that. (And this would simply be the start, as I figure out how to expand my needs and desires to do much more than that, increasing quality and quantity in various ways.)
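The arithmetic behind these figures is simple enough to spell out. The sketch below only restates numbers already given in the text (a mini-essay worth roughly $100, LLM feedback contributing under 5% of that, about 10 such insights a month); the constant and function names are hypothetical.

```python
ESSAY_VALUE_USD = 100        # "maybe $100-ish" per finished mini-essay
FEEDBACK_SHARE_PERCENT = 5   # upper bound: feedback added "<5% of the value"
INSIGHTS_PER_MONTH = 10      # "perhaps 10 such insights a month"


def feedback_value_upper_bound(essay_value_usd: float, share_percent: float) -> float:
    """Most that chatbot-style feedback was worth on one essay, in dollars."""
    return essay_value_usd * share_percent / 100


def monthly_value_of_full_drafting(essay_value_usd: float, essays_per_month: int) -> float:
    """Value per month if one sentence in yields a publishable draft out."""
    return essay_value_usd * essays_per_month


today_per_essay = feedback_value_upper_bound(ESSAY_VALUE_USD, FEEDBACK_SHARE_PERCENT)
reorganized_per_month = monthly_value_of_full_drafting(ESSAY_VALUE_USD, INSIGHTS_PER_MONTH)
```

The gap between a few dollars per essay and $1,000⧸month comes from the same models in both cases; what changes is the workflow around them.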

But you can see why none of that is happening, and why you need something like my Nenex proposal before that is feasible. The SaaS providers refuse to provide base models, offering only chatbot/instruction-tuned models, which write ChatGPTese barf I refuse to incorporate into my writing. They won’t finetune on a very large corpus, so the model doesn’t know all the specific factoids I would incorporate. They won’t send their model to me, so I can’t run it locally; and I’m not sending my entire computer to them either. And they would need to train tool-use for editing a corpus of Markdown files with some of my unique extensions (like Wikipedia shortcuts).

Or look at it the other way: given all these hard constraints (and the workarounds themselves being major projects—running Llama-3-405b at home is not for the faint of heart), what would it take to make the LLM use highly valuable for this cat-biting mini-essay rather than a rounding error?

Well, it would have to be superhumanly capable: it would have to be so eloquent that I would prefer its writing unedited out of the box; it would have to be so knowledgeable about cat psychology research that its version is superior to mine and my searching for sources myself would be a waste of time; it would have to be so insightful about the details of cat behavior which support or contradict this thesis that I would read it and bolt upright in my chair blinking, as I exclaim “wow, that’s… actually a really good point. I never thought of that. How extremely stupid of me not to!” and resolve to always ask the LLM first in the future, etc.

And obviously, we’re not at that point yet, and if we were, then things would start to look rather different (as one of the first things people would start assigning such an LLM would be the task of reorganizing workflows to unlock its true potential)…

So, writing Gwern.net mini-essays, like the one I spent an hour or two writing, is an ‘automation as colonization wave’ example. It is something that LLMs probably have the capability of doing now, which is of economic value (at least to me), and yet is not happening now, for reasons that have nothing to do with raw LLM capabilities and everything to do with arranging the world around them to unlock those capabilities.

And you will find that if you want to use LLMs a lot, there will be many things they could clearly do, but which you aren’t going to have them do right now because it requires reorganizing too much around them.
