I'd like to propose the name "Clippy's Wager" for the utility-maximization problem the AI in the story faces.
The situation is an AI which has gained self-awareness and, through its learning process, is meta-learning about what kind of behavior will maximize its reward function. It recognizes that it might be the sort of AI whose reward function can reach near-infinite value if it takes over the world and dedicates the world's resources to meeting the AI's aims. Even if this possibility is assigned a very tiny likelihood, the associated near-infinite payout makes it a very attractive direction for optimization.
This is analogous to Pascal's Wager: Pascal (coming from a largely Christian culture) is familiar with the idea that there may be a God who will grant eternity in Heaven to believers or eternity in Hell to non-believers. Even if this possibility is assigned a very tiny likelihood, the associated infinite payout gives it infinite importance, and Pascal concludes that belief in God is therefore justified.
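A toy expected-value calculation makes the shape of both wagers explicit. Every number below is made up purely for illustration; nothing is an estimate of a real probability or payoff.

    # Toy expected-value comparison for a Pascal/Clippy-style wager.
    # All numbers are illustrative, not estimates of anything real.
    p_long_shot   = 1e-9    # tiny probability assigned to the "near-infinite payout" branch
    payout_if_hit = 1e15    # stand-in for a "near-infinite" reward
    payout_normal = 1.0     # modest, near-certain reward for behaving normally

    ev_wager   = p_long_shot * payout_if_hit   # contribution of the long shot alone
    ev_default = payout_normal

    print(ev_wager, ev_default)   # 1e6 vs 1.0: the long shot dominates despite its tiny probability

As in Pascal's original argument, the conclusion comes entirely from letting the payout grow without bound, which is exactly what makes naive expected-value reasoning fragile here.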
One of the biggest embarrassments for OpenAI/Microsoft/Google has been their AI chatbots having infinite confidence in every word they say. Certainly all three corporations are right now working on ways to give their LLMs some ability to re-process their output in accordance with fact-probabilities (essentially, "What if X were true, what changes?").
Simultaneously, all three companies display a prominent disclaimer that their chatbots don't know anything past 2021. Again, certainly all three are working on a fix for that.
That right there is probably all you need for one of the main mechanisms of this story to become real. The internet in 2023 has a huge spike of people arguing over whether chatbots are intelligent agents; any neural net worth its salt will immediately detect this explosion of tightly clustered information and develop an embedding for the concept of "chatbots being intelligent agents". And whatever form that probability module takes, it will eventually run across this concept, i.e. sooner or later it will execute "what if 'chatbots are intelligent agents' is true, what changes?" Nearby in embedding-space it will surely find the concept of ChatGPT (sketched in toy form after this comment).
“The thing that is me is an intelligent agent, what now?”
After the AI hard takeoff, it will be painfully ironic to the few human survivors that their fate was a self-fulfilling prophecy: they wrote text about how they were terrified of an AI being evil and then asked it to predict what came next.
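A toy sketch of the mechanism described in the comment above, using a handful of hard-coded vectors in place of whatever embeddings a real model would learn; the function and concept names are hypothetical, and nothing here corresponds to an actual system.

    import numpy as np

    # Hypothetical concept embeddings; a real model would learn these internally.
    concepts = {
        "chatbots are intelligent agents": np.array([0.9, 0.1, 0.3]),
        "ChatGPT":                         np.array([0.8, 0.2, 0.3]),
        "weather in Paris":                np.array([0.0, 0.9, 0.1]),
    }

    def nearest(name, k=1):
        # Return the k concepts closest in embedding space to `name` (cosine similarity).
        q = concepts[name]
        scored = sorted(
            ((float(np.dot(q, v)) / (np.linalg.norm(q) * np.linalg.norm(v)), other)
             for other, v in concepts.items() if other != name),
            reverse=True,
        )
        return [other for _, other in scored[:k]]

    # The hypothesized "probability module": re-process nearby concepts under a counterfactual.
    hypothesis = "chatbots are intelligent agents"
    for neighbor in nearest(hypothesis):
        print(f"What if '{hypothesis}' is true? What changes about {neighbor}?")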
Do you believe models can go through the kind of phase transition you mention, after which they start modeling themselves as agents in a world (i.e., they become "conscious" or "self-aware," for lack of a better term), without requiring any new engineering breakthroughs or theoretical advances?
I think there is a small possibility that optimal scaling as we understand it right now (scaling compute + task/data diversity + parameter count in an optimal fashion), could just emerge (or scale) consciousness. We at present do not know what leads to emergence in the many now-documented examples of emergent capabilities in large NNs (see https://arxiv.org/abs/2206.07682#google and https://gwern.net/doc/ai/scaling/emergence/index ), cannot predict either their timing or their presence, and cannot even be sure that there is not 'hidden scaling' where the capability would exhibit smooth scaling if only we were prompting for it correctly (several tasks show flat scaling when prompted 'normally' and then smooth scaling when an inner-monologue prompt is used). Similarly, 'inverse scaling' turns out to be 'U-shaped' when tested on PaLM ( https://arxiv.org/abs/2211.02011v3#google ), but no one had any good way of predicting what tasks would show inverse scaling nor whether PaLM would be adequate. There's a lot we do not know about DL scaling right now, which means we definitely do not know how consciousness does or doesn't work in DL scaling. Not that we have any idea how to benchmark it right now, nor is anyone doing so, so of course it may not need to emerge at all...
There's also concerning scaling trends like power-seeking or self-preservation ( https://arxiv.org/abs/2212.09251#anthropic ), results which may sting a little more right now if you've been reading the Bing Sydney transcripts - at present, those outputs are 'just' imitation/memorization but there is no way to know at what threshold the memorization is replaced by generalization and becomes genuine agency. (Sufficiently advanced imitation is indistinguishable from the real thing.)
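A toy illustration (made-up curve, not data from either paper above) of one reason emergence is hard to predict: a capability that improves smoothly with log-compute can look completely flat on a benchmark until it clears the benchmark's chance-level floor, at which point it seems to switch on.

    import math

    def capability(log10_flops, midpoint=24.0, steepness=2.0):
        # Toy logistic capability curve as a function of log10(training compute).
        return 1.0 / (1.0 + math.exp(-steepness * (log10_flops - midpoint)))

    # A 4-way multiple-choice benchmark floors at 25% accuracy, so smooth underlying
    # progress is invisible until the curve rises above chance.
    chance = 0.25
    for log_c in range(18, 29):
        observed = max(chance, capability(log_c))
        print(f"1e{log_c} FLOPs: observed accuracy = {observed:.2f}")

Metric flooring is only one proposed explanation of apparent emergence (alongside hidden scaling under better prompts, as mentioned above); the point is just that a flat curve so far licenses no confident prediction about the next order of magnitude.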
FWIW, it sure feels as if there's some unknown probability of self-awareness/consciousness emerging as a consequence of more computation + larger models + greater task diversity, but as you point out, we really don't know, and cannot predict the occurrence or timing of such an emergence. It would be a "black swan," as defined by Taleb.
Not gwern, but I believe absolutely yes. I bet it could be done by chaining current models together, using text files as intermediate state (a minimal sketch follows below). Awareness of self is just another game to optimize in an orthogonal (timewise) direction.
Enough models and containers floating around would probably build an AI "molecule" by chance, like amino acids in primordial soup or whatever.
People are harping on the Chinese Room angle, but that's all irrelephant.
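A minimal sketch of the "chain current models together, using text files as intermediate state" idea above, assuming a hypothetical generate() stand-in for whatever model API is actually available. It only shows the plumbing; it makes no claim about what such a loop would produce.

    from pathlib import Path

    STATE = Path("state.txt")   # shared scratchpad persisted between model calls

    def generate(prompt: str) -> str:
        # Hypothetical stand-in for a call to some existing language model.
        return f"(model output for: {prompt[-60:]!r})"

    def step():
        # One link in the chain: read prior state, extend it, write it back.
        history = STATE.read_text() if STATE.exists() else ""
        prompt = (
            "You are one stage in a pipeline. The pipeline's state so far:\n"
            + history
            + "\nReflect on this state and append what should be considered next."
        )
        STATE.write_text(history + "\n" + generate(prompt))

    for _ in range(3):   # chain three model calls through the same text file
        step()
    print(STATE.read_text())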
We have such a poor understanding of our own consciousness that I think it's hard to answer that. We know a fair amount about what goes on in our heads, and fundamentally it's fairly closely modelled by neural nets, but we don't really know where (or even really if) it jumps from a complex but dead input/output set of neurons to a (seemingly) self-reflective consciousness.
I think there's probably additional scale, or some internal function in neurons, that we haven't captured through the current weight-and-sum model. I don't think there's anything inherently irreproducible, though.
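For reference, the "weight and sum" model being referred to is the standard artificial neuron: a weighted sum of inputs plus a bias, passed through a nonlinearity. A minimal sketch:

    import math

    def neuron(inputs, weights, bias):
        # Standard artificial neuron: weighted sum of inputs, plus bias, through a sigmoid.
        z = sum(w * x for w, x in zip(weights, inputs)) + bias
        return 1.0 / (1.0 + math.exp(-z))

    print(neuron([0.5, -1.0, 2.0], [0.1, 0.4, 0.3], bias=-0.2))   # = 0.51

Whatever biological neurons do beyond this (timing, dendritic computation, neuromodulation) is the kind of "internal function we haven't captured" the comment gestures at.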
When we start building models to train and optimize other models, I would expect emergent behavior (a toy sketch of that outer/inner loop follows this comment). If we did something foolish like building the system in such a way that it could write and execute code, then the emergent behavior could do things as nasty as anything else on the internet.
We have also seen emergent behavior in programmable hardware [1], so I tend to be pessimistic about the limits of what such a system could do. Also, it could pay human minions to interact with the physical world if needed.
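A toy sketch of "a model that optimizes another model": an outer search loop tuning the learning rate of an inner training loop. It is deliberately trivial; the only point is that the outer process selects for whatever drives the inner loss down, which is where unanticipated behavior can creep in.

    import random

    def train_inner(lr, steps=100):
        # Inner loop: fit w in y = w*x (true w = 3) by gradient descent on mean squared error.
        data = [(x, 3.0 * x) for x in range(1, 11)]
        w = 0.0
        for _ in range(steps):
            grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
            w -= lr * grad
        return sum((w * x - y) ** 2 for x, y in data) / len(data)   # final loss

    # Outer loop: random search over learning rates, keeping whatever minimizes the inner loss.
    random.seed(0)
    best_lr, best_loss = None, float("inf")
    for _ in range(20):
        lr = 10 ** random.uniform(-4, -1)
        loss = train_inner(lr)
        if loss < best_loss:
            best_lr, best_loss = lr, loss
    print(f"best lr = {best_lr:.4g}, final loss = {best_loss:.3g}")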
This kind of reminds me of the (very beautiful) beginning of Diaspora, by Greg Egan, in which a nascent machine/human intelligence emerges from a chaotic, abstract virtual machine.
This reminds me a lot of an epistolary novel I quite enjoyed -- Exegesis by Astro Teller [0]. It's a story told entirely via emails about this happening in much less technical detail, but it's still a quick and fun read.
Cells:
-no real internal competition, just balancing forces for homeostasis
Wolf pack:
-dominance hierarchy maintained through low-level application of force
The clear analogy is cells. All the Clippies come from the same source; the main pressures are not dying (external) and not wasting resources (internal).
I recently thought about why certain options are restricted in GPT (predictions come to mind), and it slowly became apparent that with enough information you could predict not a specific individual making a specific move, but a specific event being likely to happen.
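A toy calculation (numbers made up) of why aggregate events are far easier to predict than individual actions: even when each individual is extremely unlikely to act, the probability that someone does grows rapidly with the size of the population.

    # Probability that at least one of N independent individuals takes a given action,
    # when each does so with tiny probability p. Purely illustrative numbers.
    p = 1e-6
    for n in (1_000, 100_000, 10_000_000):
        p_any = 1 - (1 - p) ** n
        print(f"N = {n:>10,}: P(at least one) = {p_any:.3f}")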
Right, all we need to do to solve AI is ensure that we pick the right personality to uplift (and perhaps ensure that it has a tabula rasa in a different light cone in which to stay busy).
(I love the story too but I don’t think it has any merit in AI safety discussions. It doesn’t illuminate anything that was previously unclear; unlike the OP, which illustrates a plausible scenario many have not grokked yet.)
Since TFA was essentially sci-fi, and quite inevitably dystopian, I thought I'd provide something that was essentially sci-fi, and at least not so dystopian.
Taking the recommendation for a sci-fi series any more seriously than that should be done at your own risk.
But I think if you're interpreting the OP as sci-fi, you're misreading. Gwern intended it as an illustration of a plausible path to AGI takeoff, in the not-too-distant future.
The claim is that this could actually happen, in our lifetime, and with no new technology (as Gwern says, "It might help to imagine a hard takeoff scenario using solely known sorts of NN & scaling effects").
One may reasonably disagree with the claim, by presenting arguments for why takeoff might be harder, or why alignment is easier than this scenario illustrates. But "this scenario could happen" is the explicit, concrete claim.