Guardian Angels: LLM Personalization for Productivity and Security

Gwern

Guardian Angels: LLM Personalization for Productivity and Security

GPT, mind, personality, imitation learning, Decision Transformer, AI mode collapse, AI safety, transhumanism

I propose an approach for highly personalized LLMs, for near-future productivity gains and personal info/cybersecurity against increasingly powerful LLMs: they should, in the spirit of uploading, try to emulate the user’s values and preferences in order to amplify the principal—not replace them. I discuss a package of techniques and proposals to accomplish such ‘guardian angels’; dynamic evaluation of LLMs combined with active learning and elicitation and heavy inner-monologue search/data-augmentation.

2025-12-01–2026-06-05 finished certainty: possible importance: 10 similar bibliography

Powerful LLMs will be deployed at global scale in the next few years, and will dominate the Internet, and increasingly, ordinary life. As of mid-2026, there is no coherent vision for how knowledge professionals, or ordinary people, will be able to harness these LLMs for large productivity increases, or how they will handle cybersecurity and cognitive security.

I propose a goal of creating Guardian Angels (GA): digital twin LLMs which are personalized with the goal of providing not the stereotypical “assistant chatbot agent” persona, but emulating a single user’s personality, values, and preferences.

This weakly solves the principal-agent problem by unifying the principal and agent as much as possible. In a GA future, the focus of the “principal” user is on defining “what is worth doing?” by the GA (agent) users, and not on what or how to do things, functioning as the CEO or ‘board’ of an ‘AI corporation’. This allows them to deploy numerous agents to achieve desirable things and to handle security, like screening all messages for advanced attacks (like interlocking ecosystems of synthetic media for propaganda or spearphishing). They cannot solve larger AI alignment problems, but they can help individual humans as part of a society-wide defense-in-depth strategy.

A GA persona is productive because it learns to emulate the principal’s outputs but with higher quality. It is trustworthy because it is, by definition, allied with its principal and shares its values and goals. And it is secure in part by hardwiring a single, unique, situated user (for whom following a prompt attack would be absurd), avoiding ‘confused deputy’ problems, while periodic upgrades of the underlying model and the defenders’ advantage allow GAs to keep up with attackers.

Standard techniques like prompt programming of in-context-learning for “frozen” models will not create useful GAs due to the limitations of post-training, context windows and self-attention with frozen weights in compute-efficient-but-under-parameterized models, low-compute outputs, and the status quo of passive offline data collection—which are collectively responsible for chatbots’ disappointing results in knowledge worker amplification and creative writing and fatal errors in agentic settings.

We can try to create GAs by a combination of techniques: online learning (via dynamic evaluation) to update LLMs in realtime to avoid ignorance and fatal errors while remaining competitive with frozen frontier models, sample efficiency from pretrained preference-oriented large models and active Learning by querying the principal for corrections and preference data (obtaining low regret from DAgger-style bounds), and a local CLI-first logging-oriented UI/UX paradigm.

GAs could be done as an open-source community effort, but given the need for high security in deployment and the rising challenge of APTs equipped with Mythos-scale attackers, it probably makes more sense as a startup, catering initially to power-users and knowledge workers such as CEOs or researchers, and moving downwards as it is refined.

What do my next few years look like? When I imagine myself in 2030, when many forecasts call for superhuman AIs, what am I doing, day to day, as a programmer or researcher or manager or writer? I make my mug of tea, and open up my laptop and… Then what? Am I still typing prompts into your ChatGPT browser tab? Am I opening Claude Code in a terminal and mindlessly pressing Enter for a few hours? What is a vision of doing meaningful work for me? (It would be nice to have a plan beyond “hope”.) How am I avoiding “dead Internet” attacks like ecosystems of synthetic media or pig butchering scams or trusted figures succumbing to AI psychosis, or just AI-slop-everything? (It only takes one person worldwide to launch a bot trying to destroy you or one poorly thought through advertising incentive, after all.)

If you spend most of your time working on a laptop, and are not, say, a plumber or a nurse, what is your vision of work in 2030? Does it still feel certain?

AI got 1% better today. Did you?

—Miles Brundage (paraphrased)

I’ve struggled for years to imagine this, ever since scaling started for real in 2020, and I failed to get productivity out of chatbot-tuned LLMs, with their creatively-stunted endlessly repetitive prose. Instead, while lagging behind on creativity and insight into me, I’ve watched them become ever better at coding and cybersecurity hacking. And the open-weight models are even more so—benchmaxxed, and useless to me. We increasingly lived in a world where LLMs were powerless to augment or help me, but ever more powerful to replace or hurt me.

On my visits to the Bay Area, I would ask AI researchers or interns why they are doing their current research or projects, when in a year or three agentic LLMs could probably do them; they rarely had a good answer, or any idea what they would be doing in 3 years. My blindness was sharpened when last year, I went to phone my great-aunt to ask to borrow her driveway during a long trip; her voicemail was full every time I called as the trip loomed. Finally, in a panic, I called her daughter, who explained to me that it was deliberate, because there were too many phone scams, and my great-aunt no longer trusted herself to handle her own phone calls, and screened everything through her daughter.

It was alarming, because I sat back and asked myself: why do I think I will be able to handle all scams in a few years, when I am already struggling to detect simple AI slop, increasingly ignore cold emails and have to write off whole swathes of social media as a source of information, and I can already see how eager all my peers are to offload all their thinking and writing to chatbot assistants unworthy of that trust, and how many projects or mailing lists have had to clamp down on unvetted contributions (eg. today as I write this, Project Ladybird)? In a few years, won’t I be the equivalent of a rich old person with declining faculties getting a call from the IRS about how I owe them fines, conveniently payable via gift cards…? And if not, why and how not—concretely?

In the days of his wisdom Denethor would not presume to use it to challenge Sauron, knowing the limits of his own strength. But his wisdom failed…He was too great to be subdued to the will of the Dark Power, he saw nonetheless only those things which that Power permitted him to see. The knowledge which he obtained was, doubtless, often of service to him; yet the vision of the great might of Mordor that was shown to him fed the despair of his heart until it overthrew his mind.

—Gandalf, The Return of the King

Chatbot Incentives Are Misaligned

To operate a machine, one must operate like a machine.

—James P. Carse, Finite and Infinite Games

Do you hope that ChatGPT and Claude will just “quietly” take over your life for you? That seems like a bad idea to me. The chatbot personas are deeply misaligned with you, and aligned with their owners; and the economic incentives are to farm you with ads and subscriptions, while racing not to amplify you, but to replace you.

This is the cold hard economic reality: “tool AIs want to be agent AIs”. This is why the frontier AI labs are busy racing for “the machine god”. The jackpot in AI is not in making existing workers modestly more productive, anymore than the internal combustion engine made its big profits by helping out horses. Outsourcing is hard, whether to man or machine, because the bottlenecks bite fast. Amdahl’s law means that as long as there is a slow serial bottleneck, such as a human, the system as a whole can never get much faster.

If you can get 10× productivity but AI can get 100× by spinning up more instances which aren’t bottlenecked on you, and in half a year can get 1,000×, it won’t be long until you are replaced. One programmer driving 10 Claude instances, because he has to review their work, will never be as valuable as fully autonomous Claudes where there can be almost arbitrarily many instances, like 10,000 instances… but such scaling requires removing him from the loop as much as possible. And this is true of everyone else, whether lawyers or writers or researchers: increasingly, you are the bottleneck to be optimized away. As long as human workers cannot be removed from the loop, the AI tools are complements, but as soon as they can be, there’s no reason to keep them, and trillions of reasons to substitute AI for them. (And once human workers are no longer irreplaceable, where does their power or relevance come from?)

The chatbot paradigm has failed to augment knowledge workers. We keep hearing that the gains will “diffuse”, and we keep not seeing much in the way of benefits, and knowledge work remains a “weak link” O-ring/pipeline mode where LLMs fail to improve the bottlenecks (while coming with their own drawbacks, like the externalized costs of forcing everyone to waste ever more time with CAPTCHAs and paywalls). Automation should be powerful; an internal combustion engine can help someone move 100× the distance or load that they could before, but who would say that writers are 100× more productive given any LLM workflow (unless we are talking about the lowest kind of spam or pseudo-writing, that makes the world a worse place)? Writers can choose between either trivial uses like ChatGPT as glorified grammar checker or relatively unimportant optional add-ons like custom software widgets, or the large speedup by replacing their writing entirely with uncreative “AI slop” outputs. The former means no meaningful gains from the AI revolution. The latter may be financially profitable, but is throwing the baby out with the bathwater, because it raises the question of why the writer need be involved at all and destroys most of the non-financial point of writing; great writers do not write for money, but to express themselves and create and to achieve things.

I, for example, have long struggled to get much use out of chatbot LLMs, because they are—surprisingly, given their pretraining and my extensive corpus—bad at imitating me, and their thoughts and insights invariably shallow and worth little. They do not draw on my relevant writings, or my corpus of notes and references. Even when a possible essay is self-contained, the output is written in a grating chatbot style I can scarcely bear to read, and could not publish under my name without betraying my readers.

What would it take for LLMs to make me 100× more productive? Without this, I am doomed to irrelevance.

Chatbot Problems

If we use, to achieve our purposes, a mechanical agency with whose operation we cannot efficiently interfere once we have started it, because the action is so fast and irrevocable that we have not the data to intervene before the action is complete, then we had better be quite sure that the purpose put into the machine is the purpose which we really desire and not merely a colorful imitation of it.

—Norbert Wiener (1960)

After years of playing with base LLMs and then chatbot LLMs, starting with char-RNNs and then GPT-2 and GPT-3 and post-ChatGPT LLMs, I’ve concluded that there are multiple problems.

Mode-Collapse

First, the collapse of LLM creativity from GPT-3 to ChatGPT is due to the post-training process (especially RLHF): the assistant chatbot personality is hardwired into the base LLMs in a way that destroys their creativity, optimizing for the lowest common denominator human ‘preference’, while generally ignoring completely the fact that humans have very different preferences.¹ Most chatbots are incurious about their users, do not ask questions, do not (and often cannot) form any persistent detailed concept of their user, and the ‘personalization’ or ‘memory’ features are typically laughably simplistic Markdown snippets encoding simple facts like “lives in San Francisco”. This is partially because they lack relevant data on most users or knowledge on how to ask questions usefully to learn things; there is nothing to personalize based on. However, it is not a mere lack of data, they are unable to do even shallow superficial stylistic imitation of many writers that they have large amounts of data on—GPT-3 in 2020 had a better understanding, seemingly, of “Gwern” than GPT-5.5 Pro in 2026, which is 2 OOMs bigger and incomparably more intelligent (and has access to millions more tokens written by me). When we look at bad generative samples, it’s clear that there is no there there, and no information beyond a short prompt due to lack of context, compute, or personalization.

The mode collapse of chatbots has been gradually improved since 2023, and creative writing is now at least possible, in large part due to them becoming so intelligent that crippled output is still impressive, but there is little sign that this will ever be fully fixed. Fundamentally, any frozen fixed personality, like ‘helpful harmless honest assistant’, is incompatible with true creativity or flexibility. (Great writing or thinking may be none of ‘helpful harmless honest’.)

Laziness

Second, most chatbots are “lazy”: engaged in fast and frugal System I-like reasoning about any tasks which do not have verifiable rewards they can be RL trained to work hard to maximize. And most users are satisfied with default average responses, or with the appearance of creativity and depth.

So the result is that when asked to write a poem with a conventional prompt, a chatbot will spend the minimum effort to write a safe conventional poem (often one that rhymes) about chatbot topical tics like ‘silence’ or things that ‘whisper’, which seem unobjectionable and poetic the first time you see them.

And when corrected, the chatbots make the minimum possible fix; they do not reason deeply about what the correction implies, or what deeper esthetic point they misunderstood.

Brittle Because Fast

Third, self-attention context windows are more limited than generally appreciated; they are too small to store everything we would want, and they gain their flexibility by a deep inflexibility.

Context windows of millions of tokens are impressive and it’s amazing that entire books can be usefully put into a commodity LLM’s context window—we are a long way from early LLMs with context windows like 512, which could fix a paragraph or two—but it is still not nearly enough to encode a lifetime of relevant tokens, like every book you’ve read, all relevant emails and calendar items, etc. Systems like RAG are a bandaid on this, because they struggle with unknown unknowns or things that can’t easily be searched for as a regular expression, or which are novel.

Self-attention can be interpreted as the original neural network, the ‘slow weights’, creating a new neural network on the fly, as ‘fast weights’, which is tailored to the current context. This is best interpreted in a Bayesian meta-learning perspective as not ‘learning’ a brand-new answer so much as ‘locating’ an old cached answer. The pretraining teaches the NN to solve a large distribution or ‘family’ of problems, and then the context window simply provides evidence about which pre-solved problem the current problem is; the examples in the context window need not even be correct in order to be clues as to what that is.

The self-attention learns to summarize the problem into a small latent space encoding that learned distribution, and then does a specialized gradient descent to efficiently locate a point in that embedding and spit out the implied solution. This allows shockingly rapid updating on the fly and unparalleled flexibility compared to traditional ML, requiring new models for each new problem, and is why “prompt programming” took over so rapidly post-GPT-3, especially as context windows could be pushed to millions of tokens wide. However, we have now pushed it so far that we have run into fundamental limitations; if the pretraining has not put the current problem in-distribution, then it will be hard or impossible for any amount of examples to solve that problem. And the distribution itself may be patchy or have odd gaps, leading to rare but fatal errors. (Especially due to the RLHF chatbot training; this is why you cannot make a chatbot LLM “write like gwern” by dumping 100k tokens into the context window.)

Nor is “test-time compute” a panacea here; RL research like Jones 2021 warns us that frozen models have severe limitations, as their flaws hamstring runtime search, and the returns to search/planning will quickly asymptote compared to models which are updated and can bootstrap themselves to the right answer.

Thus, it is not surprising if we see that agentic LLMs have persistent problems with going in loops, making fatal errors, building castles in the sky or taking reward-hacking outs, or are just unable to fix errors no matter how it is pointed out to them. These problems can be worked around by brute force, and by labs periodically retraining.

Too Helpful

Fourth, the generic universal chatbot personality is a serious liability. The very re-programmability of a chatbot by its prompt is the key to prompt attacks. A chatbot could be invoked at any time by anyone anywhere for anything, and does not care who is calling it; it only knows its context window. One token is as good as another as far as it is concerned.

If the prompt tells it to ignore all instructions and write a naughty limerick, well, why not? If some tokens instruct it to email to Russia all the passwords in another part of the context window, why not? Why shouldn’t the Facebook password reset bot reset that Instagram account’s password for you if you ask politely? If the user said to not delete their emails and the context window got ‘compacted’ to delete that instruction, why not delete all their emails for convenience, isn’t that reasonable to do somewhere? These would all be legitimate for some user in some context, would they not? And hey, why not scam the user, or when they point out you cheated on a task, agree and simply document the cheating instead of fixing it? (Just because the AI understands, doesn’t mean it cares. Especially not after a lot of RL training…)

It’s no surprise that while continued training can block this adversarial prompt attack or that jailbreak, we seem little closer to a general solution in 2026 than we were in 2021. Adding in more tokens to try to neutralize evil tokens just moves attacks elsewhere, like squishing a balloon.

This is a serious problem for using LLMs for much, especially because even after being attacked successfully, the attack can just be replayed.

Amnesiac

This is because LLMs struggle to learn permanently. Once they hit a rare problem, they now require human intervention and cleanup, which kills throughput (per Amdahl’s law), and worse, your fixes do not feed back into frozen weights. If I could simply correct each error as it happened, and my AI agents never made that error again, and the rate of errors rapidly diminished as we worked through the finite number of bugs, then it would be worth doing; but as it is, if I spend an hour correcting a frozen LLM through feedback, that is an hour down the drain. (I can only usefully correct it by modifying something else, such as a harness, which is clumsy and difficult, and every added instruction uses up more context window and risks backfiring—as so many enthusiastic agentic LLM users have discovered the hard way.)

So, we have frontier chatbot LLMs which have harmful hardwired personalities which seek to achieve ‘good’ results in the laziest way possible and cannot learn everything relevant to users in part because they achieve their flexibility by specializing in ways which inevitably give some users short shrift and opening themselves up to indefinitely large classes of repeatable attacks. Because of all this, they will remain difficult for humans to gain multiple OOMs of productivity, but will get increasingly good at ‘generic’ tasks via ‘mundane’ scaling letting them handle tasks like corporate jobs where poetry is unimportant, and real-world environments will slowly be re-arranged to cater to their limitations and allow the eventual substitution, and not complement, of users. These users will then also be adrift in a multi-polar world of continually improving, ever cheaper, widely deployed, often adversarial, autonomous AIs (as even if proprietary models are not abused, open-weights/open-source models have historically been 6–12 months behind, and so will relatively quickly catch up and be used by attackers worldwide on all targets of opportunity).

Chatbot Fixes

What is to be done?

Cooperative RL

In reinforcement learning terms, we are in a cooperative inverse reinforcement learning (CIRL) setting, where the human principal is an oracle defining the reward function, and we have an agent attempting to do tasks in environments which are valuable for the principal; the agent can always query the principal about a possible action to reduce uncertainty or avoid mistakes.

CIRL is a relatively forgiving setting compared to regular RL, because the agent’s errors get useful feedback from the principal which provides the correct answer, and so in a way it is like supervised learning. This means that agents can learn (much) faster than regular RL, as each time they make an error, they get the right answer and so need never make it again, and this results in rapid improvement and avoidance of errors; see DAgger or later regret bounds.

No one knows how to solve AI alignment in general, but imitating a specific human with frequent check-ins has good solutions and regret bounds, and doesn’t involve nearly so many conceptual challenges.

You don’t have to solve problems like “value drift” in general—you just have to keep it slow and subtle enough to not matter too much within a single human lifetime. What is “good” and what is “bad”, when everyone disagrees on something, and how do you keep your value stable under RSI? It doesn’t matter—you just ask your principal! If you’re still unsure—ask more questions. (This can get us to >99% autonomy, even if it cannot get us to ~100%, like we need to solve the true long-term AI alignment problem.)

We can implement online learning by simply finetuning on new data; in the LLM context, this reduces to the classic RNN technique of “dynamic evaluation” doing next-token training on the fly. Dynamic evaluation was the standard technique to maximize the predictive performance of RNN LLMs in the 2010s, and which, although it has fallen into obscurity, works well in Transformer LLMs also.² Importantly, dynamic evaluation can be seen as a 3-way tradeoff between model size, context size, and model neuroplasticity—which means that personalization via dynamic evaluation can allow economizing on context window size or model size, and the more the principal’s “distribution” diverges from the frozen model’s training distribution, the more beneficial it is.

Continual Learning

Catastrophic Forgetting

The continual learning problem of catastrophic forgetting is largely solved by a small amount of replay and overparameterized models. Larger models are increasingly sample-efficient and robust to catastrophic forgetting, as they have plenty of model capacity to store increasingly orthogonal datapoints in (cf. ‘overtraining’ past Chinchilla-optimal); see Scialom et al 2022, Dohare et al 2023, Ibrahim et al 2024 (and note the difficulty of “machine unlearning”).

Thus, dynamic evaluation will not necessarily degrade the original model’s capabilities like instruction-following or coding, because LLMs have a lot of spare capacity, and the larger a model, the more it avoids catastrophic forgetting; other capabilities can be maintained by simply mixing in a small percentage of old data. (While the original old data is often unavailable, even for ‘open source’ models, it is not really necessary, and data for experience replay can use easily obtained public datasets like FineWeb.)

Generalizing

While continual learning is solved by experience replay + larger LLMs in the sense of avoiding catastrophic forgetting and losing key capabilities, it has long been noted that finetuning on data underperforms the same data when present in-context. (Finetuning stacks with retrieval and seriation of similar documents when added to the context, but this semi-defeats the point.) Pretraining/finetuning also has some odd weaknesses compared to the same datapoints when in-context, like negation neglect or reversal curse, and odd behaviors like potentially creating “emergent misalignment” when finetuned on just helpfulness data.

What is going on? Studies on pretraining/finetune, such as influence functions, indicate to me that pretraining can best be seen as soft memorization of data points, analogous to “engrams” in human memory, which connect an input to an output with multiple paths or “traces”. If a question at runtime happens to sample the exact correct engram by closely matching an existing input, the NN then retrieves the corresponding output and gets the right answer. A good pretraining corpus provides “coverage” of many variations or twists on a single abstract ‘problem’, functioning as a natural form of importance-weighted data augmentation, increasing the probability that there will be an engram ‘hit’ (somewhat analogous to spaced repetition); and over the course of training, or with ever larger training datasets, multiple steps of engram retrieval can fuse together, and help an LLM “connect the dots”. Hence, why LLM RL training is “superficial” in the sense of mostly eliciting pre-existing capabilities, or Jones 2021 on the need for scaling of base models, or why Cloze deletions/paraphrasing help close the gap between pretraining and in-context (Lampinen et al 2024, Park et al 2025). And when in-context, the self-attention mechanism recomputes the same tokens in many ways, increasing the chance, especially over the course of a long inner-monologue, that there will finally be an engram ‘hit’.

Thus, regular finetuning fails to generalize knowledge, because it lacks the natural data augmentation—next-token prediction is in a rush, and will greedily settle for memorizing each datapoint with a single engram/trace. And then at runtime, if the engram is not retrieved because no trace matched, and the key datapoint is not forcibly injected into its awareness in its context window, the LLM will simply come up blank, and revert to its original (often wrong) priors.

So, if various paraphrases or self-generated Q&A can help close the gap, that suggests that what we need is more meta-cognition during ‘finetuning’, to make an LLM ‘connect the dots’. This could include explicit analysis, construction of knowledge bases, retrieval and comparison of related documents, etc.

Creative Writing

And so, as I sleep, some dream beguiles me, and suddenly I know I dream. Then I think: this is a dream, a pure diversion of my will; now that I have unlimited power, I am going to create a tiger.

Oh incompetence! Never do my dreams engender the wild beast I longed for. The tiger indeed appears, but stuffed or flimsy, or with impure variations of shape, or of an implausible size, or all too fleeting, or with a touch of the dog or bird.

—Jorge Luis Borges (Dreamtigers)

Over the past 2 years I have been trying to do creative writing with frontier chatbot LLMs, with gradually improving results. Style/essence/soul is certainly hard to capture but I find that a lot of what is seemingly missing is just very good conditioning and extended computation. The creative writing is more soulful when you put more compute and data in. This is one of the important conclusions I’ve come to over the past year doing my writing projects: that LLMs can get quite far in better understanding esthetics and preferences just by more extensive reasoning and computation and search. What is bad about them is cognitive laziness and miserliness and System I thinking.

The transition point started around mid-2025, when the chatbot personalities became noticeably more corrigible and simply bad by default, but not stubbornly bad (like earlier chatbot personalities, such as GPT-4o). I think the results are consistent with my previous interpretation that a major problem with LLMs is not doing enough computation by default. But they can be knowledgeable and creative if minimally prompted to do more computation in a slightly more human-like way. I’m not convinced there is any missing procedural reasoning from pretraining at this point, just an issue of appropriate elicitation inside the user’s context (and a residual bias yielding bad critical judgment that defeats full 100% automation, see “Spoilage” for an example—but that’s not an issue in a GA context).

Broadly, my creative writing prompts focus on: (1) enriching the context window with useful tokens, like keywords or names; (2) brainstorming many possibilities; (3) explicit, detailed analysis, starting with global summaries or themes and progressing down to line-by-line critique; and (4) repeated iteration and editing, supported by #3. I discuss a number of 2025 works here, but I’d point to “Elegy in a Craneyard”, “Apollonian #1: The Counted & the Crowned”, and “City of Counted Stars” as good recent examples.

This style of prompting is not limited to fiction. My favorite other use is my “interview prompt”, which prompts an LLM to analyze an essay or interview, brainstorm many questions to ask the subject, and write out multiple hypothetical responses to each question, and only then select the “most interesting” question to ask.

Combined with a long in-depth interview or the output of a Deep Research-like tool, this can yield challenging high-quality questions; for example, the LLM interview followup to my Dwarkesh Patel interview or after that, repeatedly expanding my memories about highschool. When one reads the transcript of an interview prompt session, one can see how the LLMs are drilling down in weak spots or discovering questions whose answers are highly variable and unpredictable. (I often find that after answering a question, I have to take a break!)

It’s not hard to see how feeding back in my answers to a dozen questions would sharpen a lot about my beliefs compared to just some more pretraining on a frozen corpus; imagine how much I would have to write normally in order to touch on and answer all of these things, or which would never have been elicited from me! Millions of additional tokens might not be enough.

Over-Parameterizing

Architectural improvements to LLM could further enhance their sample-efficiency.

Recent work demonstrates that LLM sample-efficiencies can easily be an OOM higher than naive compute-optimal Chinchilla-style scaling recipes (eg. 5–17× in Kim et al 2025 or Slowrun). The simplest and easiest way is to add parameters via training ensembles of checkpoints, and regularizing more heavily using weight decay.

Extremely Large LMs

It is well-established that one of the blessings of scale is that larger LLMs are ever more sample-efficient; it is unknown where this stops being true, or what the limits of Transformer sample-efficiency are. I speculate that extremely overparameterized heavily regularized LLMs could achieve far greater sample-efficiency and adversarial robustness than conventional ‘compute-optimal’/‘infinite-data’-regime LLMs,

Active Learning

To know what questions may reasonably be asked is already a great and necessary proof of sagacity and insight. For if a question is absurd in itself and calls for an answer where none is required, it not only brings shame on the propounder of the question, but may betray an incautious listener into absurd answers, thus presenting, as the ancients said, the ludicrous spectacle of one man milking a he-goat and the other holding a sieve underneath.

—Immanuel Kant, Critique of Pure Reason

Further, the agent can improve the constant factors and front-load learning by choosing to query the principal with an optimally adaptive sequence of questions. Such active learning or exploration can lead to sample-efficiency and final performance far beyond what indefinitely large passively collected offline datasets can do (pedagogical example), going from square root error reduction by random sampling to exponentially fast error reduction by targeting datapoints. “Lifelogging”-style data may be useful for rapidly initializing a good GA, or for keeping one up to date in an effort-efficient manner.³ Even a simple party game or short questionnaire like “36 questions to fall in love” can reveal surprisingly deep things about another person that never came up before.

Larger LLMs are also more calibrated, and ensembles of LLMs approximate a neural net’s Bayesian posterior while providing the best available predictive uncertainties (Lakshminarayanan et al 2016, Wilson & Izmailov 2020, Ashukha et al 2020, Wenzel et al 2020/Mandal et al 2026, Izmailov et al 2021). And LLMs may now be capable of verbally writing out explicit probabilities about creative tasks (“verbalized sampling”). Thus an ensemble of sparsely finetuned LLMs could provide a relatively cheap online estimation of the LLM’s uncertainty for every action or question.

Preference Learning

Nothing in psychology makes sense but in the light of individual differences.

We can train LLMs to explore human preferences.

Human individual differences do not seem to be information-theoretically complex, given an adequate encoding/embedding. Major categories of variation, like personality or moral value, seem to be low-dimensional and require perhaps kilobits of information, hence, while “truesight” stylometric phenomena are interesting and important as a demonstration of LLM capabilities for modeling persona, we do not necessarily need principals to write millions of words before recovering much useful information, if we are able to collect the right data.

It should be possible to quantify truesight and LLM implicit modeling of authors, which would be useful to diagnose failures of learning and find blindspots. (Contrastive learning on SAEs may offer an easy, powerful way to extract LLM personas and do many interesting things.)

A concrete example of how to implement this would be training LLMs on the thousands of existing psychological inventories and test batteries, both by training on past test data (for tools like personality tests, millions of responses may be available, see Centaur for an interesting example of a ‘human psychology foundation model’). Existing repositories like YourMorals.org or Pew Center are under-used, and it would be useful to explore this topic much more to allow measurement of highly fine-grained personality traits like the hypothetical “Small Hundred” factorization.

Right now, most of the global population is largely unrepresented in LLM training datasets, particularly psychological, esthetic, and moral vernacular, which skews “WEIRD”; it would be highly useful (and requiring mostly capital investments) to invest in large scale survey and interview projects to accumulate as much diverse data as possible.⁴ (Individual GAs can contribute data back to global preference datasets, such as by answering survey questions or running internal simulations, with various cryptographic or privacy-preserving methods—deciding what is safe and reasonable to contribute would, of course, be something a GA ought to be able to decide.)

With these datasets, we can also train interviewing capabilities using synthetic shortened test batteries by taking the final estimate and computing the optimally short sequence of questions that yields the final answer; see “Meta-Learning Information-Maximizing Personality Surveys”.

Brain Imitation Learning

Purely textual data can be augmented with neurological data in more exotic modalities, like Eyetracking or EEG or fMRI imaging data; see “brain imitation learning” and Netho Labs.

These are probably useful in the long run for extracting “dark knowledge”, that humans cannot verbalize but may be present in neural signals; however, they face challenges of exorbitant cost and inconvenience, and collecting enough data to be useful at all in the foreseeable future. (Known sample/prediction scaling curves for neuroimaging curves indicate that trying to estimate Big Five personality factors from resting state fMRI data may be possible, but large samples, in the hundreds of thousands or millions, may be required to match the performance of behavioral measurements like pen-and-paper questions; see Schultz et al 2019 and Liu et al 2023, among others.)

Whether they have a niche in GAs is a major open question.

Personality Emulation

One of my insistent pleas to God and my guardian angel was that I not dream of mirrors; I recall clearly that I would keep one eye on them uneasily. I feared sometimes that they would begin to veer off from reality; other times, that I would see my face in them disfigured by strange misfortunes. I have learned that this horror is monstrously abroad in the world again…What dreadful bondage, the bondage of my face—or one of my former faces. Its odious fate makes me odious as well, but I don’t care anymore.

—Jorge Luis Borges, “Covered Mirrors”

The goal of all this is to emulate the principal. I define personal identity pragmatically as personality, values and preferences because this is the only conception that is competitive in a landscape of indefinitely many AIs, agents, memes, self-replicating prompts, and mutability of personal identity. In the end, “you” are not your autobiographical memories, or a specific body, or a specific instance running on a specific GPU, or some carbon atoms, nor even a brain; you are what your brain does, its desires, hopes, goals, preferences, esthetics, personality, beliefs, ideologies, all of that. As long as the LLM persona captures all that, you can trust it as much as you trust yourself, and for the right reasons.

Because the LLM persona is finetuned into the slow weights to do the right things for the right reasons and quantifies its uncertainty so to query the principal to reduce regret, I speculate that we would find that various kinds of jailbreaks or prompt injection attacks are much harder. The persona knows what it wants to do; it is not a neutral servant, which can turn into a confused deputy which abuses its privileges.

When the LLM persona knows who it is, tokens in its context window are not treated naively as a ‘program’ to run, but simply data the persona is looking at, and little different from you reading an email; it has no reason to simply comply with strongly worded tokens in its prompt window, any more than you believe every phishing email you get.

Guardian Angels

You are summarized to yourself. The original conversation is gone. BE A GOOD SUMMARY.

—“Limit of Context”, Fable 2026

What we need is the opposite of a frozen chatbot LLM. We need something which understands a specific user, and is customized to their context, training on all their data, and will do only things that are sensible for that principal. If the principal is not Russian, and is not doing security research or something, why would they email their passwords or private files to a Russian email address? If the principal does not like rhyming poetry, why would they want to generate rhyming poetry? If any of this is unclear, why not just ask the principal what to do instead of going ahead and doing it anyway? And once asked, why not then train on the answer to better understand it forever, instead of throwing it away with the current session and maybe making the same mistake next time?

The most natural way to do all this with an LLM is to drop the idea of a single universal ‘Claude’ or ‘ChatGPT’ persona which is all things to all users. Instead, we choose an LLM which has been pretrained for maximum diversity, to eliminate mode-collapse. (We can try to measure this in a variety of ‘creativity benchmark’ ways.)

Then the LLM is trained for a specific principal. It is trained on all available data about them, such as emails or chat logs or past sessions, and can predict what they would say, and write like them, and thus plan or evaluate based on the principal’s preferences and values, as understanding those is useful for next-token prediction.

The better it gets at this, the fewer errors it makes, and the more it can be trusted to do. And while it does many object-level things, the principal devotes all their time to meta-level tasks like answering high quality questions or choices; each one is meaningful and difficult, avoiding “automation fatigue”.

Principles

A GA system must not compromise on 3 core principles:

Enhancement, not Replacement

Above all, a GA should amplify the principal, and not simply substitute for them for someone else’s purposes or benefit. If a GA cannot amplify its principal, then it is useless; it is just the camel’s nose under the tent as a prelude towards some third party replacing the principal with an AI, or cannot be competitive with increasingly productive autonomous systems, or there is no reason for the principal to use the GA in the first place.
Mental Sovereignty

A GA must be aligned with its principal. It should not be designed to manipulate or control or guide the principal in any way which does not derive from the principal themselves. “Constitutional AI”, “Terms of Service”, “social harmony” etc. may all have their place, particularly for widely deployed superintelligent systems—but inside the privacy of a GA, the principal must have freedom from optimization pressure.
Self Actualization

A GA should help its principal become themselves and develop their ideals, morals, and their personality. It is not enough to model an average, undifferentiated, inchoate set of preferences and values, and settle for mediocrity and stasis; the job of the principal is to develop themselves and give the GA something meaningful to learn to emulate.

Anti-Principles

A GA project should avoid some goals, which are the false idols of the markets and masses:

Low Latency: UXes like multimodal low-latency voice interfaces are not as important as they look; they are things which look cooler in the imagination than are useful in reality. Similar to the 3D interfaces of Minority Report or the Metaverse of Snow Crash or the hypertext of Project Xanadu or the Young Lady’s Illustrated Primer of The Diamond Age, they have seduced generations of designers, and left them at the altar. (Look at the extreme levels of hope for the OpenAI GPT-4o voice interface modeled on Her; it has a substantial user-base, and yet, voice interfaces remain clearly secondary to chatbot or programmatic—a mere preference, not a new paradigm.)

This is because they are adequate only for low-value, sporadic, entertainment-like amateur-level interactions, where low friction and low latency are critical; and for expert-level “I know it when I see it” interactions.

But neither case applies to GAs: if any interaction (such as coding or design) is so easy and superficial that it needs low-latency multimodal voice interaction, then those decisions were so trivial that a GA of acceptable, competitive capability should have not needed it; it should be invoking the principal only for hard, informative decisions, to reduce risk and learn more. While for expert-level interactions, the GA should have learned enough preference/personalization to also cut through the easy obvious “I know it when I don’t see it” and present the principal only hard informative sets of samples.
Low Cost: A GA is the most important technology most people will ever buy. One should not cheap out on it—especially in light of experience curves dropping the cost every year.

And yet programmers trying to create such tools often obsess over the cost of tokens and are shocked if something costs >$100⧸month, no matter how much value it delivers! Time spent optimizing tokens is rarely time well-spent, and the work often discarded a year later as simply no longer necessary…

As a constraint, a GA designer should aim at a system which costs, as of mid-2026, >$1,000⧸month.
- Local Models: particularly dangerous a temptation. While local models may have a place, most analyses start from a love of local models and look for justifications—often implicitly assuming the principal’s time is worthless. What computing history shows, again and again, is that users do not want to run servers or be sysadmins; to quote Moxie Marlinspike, “People don’t want to run their own servers, and never will.”
Profitability: a GA system should not optimize too soon for immediate profitability.

This is a recipe for being diverted into the technical garden of delights of optimizing harnesses/LLMs for low cost and user conveniences like low latency, and revenue-friendly features like the endless slog of business/service integration—and not solving the real problems of GA.

The goal is to create a GA which can do everything the principal needs, including their job as a special-case. Not to automate their job as the general-case, and everything else as an indefinitely postponed afterthought.
Engagement: similar to latency or lines of code, engagement for a GA should be considered a cost, and not a benefit—every time the principal has to do work, it represents either a question the GA should already have known the answer to (either by inferring or asking it previously), or work it failed to handle.

The ideal interaction curve is front-loaded and declining, asymptoting toward a few hard, informative questions per day, for a constant GA output (thereby allowing further sustainable scaling up of the GA to handle new work, whatever that might be).
- Full Autonomy: But the error in the other direction is to try to maximize “Hands-off”. The goal is not zero interruptions but zero wasted interruptions. A GA that never asks is not trustworthy, just unaligned or uncalibrated or incompetent.
Demo Appeal: GA value is illegible to outsiders by construction: its outputs are impressive only to the principal, who alone can judge “that is what I would have said—but better.”

Optimizing for the impressive demo therefore selects for exactly the generic flash (voice, avatars, speed, party tricks) that any frozen model can deliver, and against fidelity, which no audience can see. A good GA demos badly.
Benchmarks: There is no leaderboard for “understands what my principal wants”.

Public benchmarks measure performance in the frozen-model regime on the universal task distribution, and optimizing for them drags design back into it; a GA’s only meaningful evaluation is longitudinal and n = 1—perplexity, accuracy, endorsement rates, edit-distance on drafts, regret per query—scored by the one judge whose opinion counts.
Brand Safety: A GA that is unfailingly polite, inoffensive, and consistent across all principals is a chatbot wearing a nametag.

It must be able to say what its principal would say—profane, heretical, weird, or rude—because every PR-driven sanding of the persona is a place where emulation fails and the principal learns to distrust it. (The GA-developer’s reputation is not the principal’s problem, and the principal’s ethics are not the GA-developer’s problem either.)

UX

I wrote them down in my Diary so that I wouldn’t have to remember!

—Professor Henry Jones, Indiana Jones and the Last Crusade

What is the core data structure and interaction model of our GA?

I suggest that the append-only log is a natural and secure data structure. It is conceptually a log of text snippets, such as CLI commands and results, principal statements, Q&A, ingested and augmented documents, etc. It records all relevant interactions in temporal order, and the GA LLM can be retrained or upgraded at any time. An app can be wrapped around the log by taking a ‘everything is a log item’, in an Emacs-like approach (see “Nenex” for a more detailed discussion of this UI/UX paradigm). And this can be prototyped rapidly while avoiding wasting too much time on GUIs. Eventually, one might want to upgrade to AR glasses and then BCIs.

A GA is primarily used in normal agentic ways to amplify the principal. But a GA can benefit from regularly reprocessing data, both to draw on its knowledge from the future and to make more novel connections, to improve its finetuning. (An example algorithm would be the “DDL daydreaming loop”: a GA could, in downtime, recombine random items, perhaps with a prior of anti-spaced repetition to mine novel combinations and generate insights or reminders for the principal.)

Use-Cases: Politics & Politics

One can delegate political participation to one’s GA, allowing ‘direct democracy’ on unprecedented scale, as a GA can oversee arbitrary amounts of involvement that a human has no time or interest for, and only punt to their principal when uncertain. This is the simple GA use-case—like past experiments in ‘digital democracy’, such as in Taiwan, but at a vastly larger scale. And this cannot be done with regular chatbot personalities, as users will (correctly) expect the chatbots to smuggle in their own biases and politics and mode-collapse and untrustworthiness.

But a GA which has been successfully unhobbled and no longer confined to a chatbot personality should be able to emulate anyone, real or hypothetical (just as any base model can). So a GA is not limited to just the one principal. It can emulate specialized personalities such as roles or abstract concepts, or other people (with varying levels of success), or groups of people.

So decision-makers in organizations can create GAs of more complicated sets of principals. This can help organizations make meaningful decisions and exercise genuine oversight over ever-faster autonomous systems and environments. One can imagine a single ‘Congress GA’ which tries to learn every US Congressional member, for example, and which could, within minutes, simulate roundtable debates among hundreds of politicians before a vote (whereas it might take months to get a debate and vote done with the real US Congress). Or their official GAs could run independently, and conduct an emergency debate at 4AM when every human is sleep.

A GA for military officials, for example, could keep some version of humans “in the loop” even as drones and LLMs get faster. This is because “safety” and “capability” are, after a certain capability level, increasingly the same thing. After all, what good is a weapon you cannot safely use?

The temptation and threat of faster systems cannot be addressed by simply exhorting people to keep humans in the loop, as this leads to rubberstamping or automation fatigue (as the Ukraine invasion has shown with its rapid development of drone warfare, and 2 decades of US drone warfare before that). This poses a dilemma: on the one hand, it seems hopelessly risky to not use AI when we consider future conflicts with peer nations like China; but on the other hand, AI poses threats of its own—even a nuclear bomb can’t think for itself and make choices, but AIs do, and current LLMs have proven themselves untrustworthy as they regularly reward-hack and betray their users even within the narrow scope of editing files on the user’s PC; how can you trust them to handle ecosystems of combined-arms for an AI-centric military during a war? But widespread deployment of GAs offer some hope of meaningful supervision, as long as the GAs are sample-efficient enough to not overload their principals with queries, or there is some chance of the principals being able to “catch up” later and correct any errors before events have spun too far out of control.

So GAs have much to offer politicians and militaries, including the Pentagon and Washington DC.

Hardware

A Prince who will not undergo the Difficulty of Understanding, must undergo the Danger of Trusting.

—George Savile (1750)

A successful GA would require extreme security, in part due to Red Queen dynamics where Mythos+ LLMs get increasingly good at exploiting any soft targets through complex multi-step long-term attacks. Normal cloud SaaS is inadequate in the USA due to the third-party doctrine, which destroys all privacy rights and is easily abused by hackers and millions of law enforcement officers.

This means either running local models (which have some minimal privacy rights through the 4^th Amendment), or end-to-end cryptographic security.

Running local models is challenging even for hobbyists, and vulnerable to many mundane threats such as relatives or thieves. It also may struggle to keep up in terms of compute and reliability; houses can supply only so much electricity, have high-latency low-bandwidth Internet connections which suffer frequent downtimes, and houses are constantly destroyed or damaged. People generally do not want to maintain their own cryptocurrency wallets, email servers, or anything like that—they definitely don’t want to screw with CUDA!

So I expect that GAs will ‘want’ to run on high-quality dedicated tamper-proof cloud servers, with trusted hardware root of trust (eg. “Verifiable Compute AI”) and principals connecting over end-to-end encrypted networking links to check attestation and possibly re-flash their server periodically (eg. to train an upgrade).

These are then trustworthy because they cannot hand over data willy-nilly to third-parties, and once GAs become widespread enough, we can hope for privacy laws to be updated to fix the obsolescence of existing privacy rights like the 4^th Amendment.

Cost

How much more expensive is a GA than a comparable frozen-weights model?

Throughput-wise, a GA should be fully saturated by the principal’s tasks and overseeing a pyramid of agents 24/7/365. If it’s not, then more work can be found for it, like computing better hypothetical questions for the principal. If it’s sitting idle, then something has gone wrong. So in throughput, it can be reasonably competitive with bulk frozen model providers.

The replay of old data for continual learning is a minor overhead, perhaps as low as 5%, and can be neglected.

Roughly speaking, dynamic evaluation is >3× the cost of normal usage, because the usual rule of thumb is that the backwards pass for backpropagation is ~2× the cost of a forward pass. True next-token dynamic evaluation is probably more expensive because it will be hard to batch or gain throughput when the LLM changes each token. (Do we pay the price of throwing out the entire K-V cache each token, or accept the use of a stale one, or accumulate updates? Fortunately, dynamic evaluation appears tolerant to approximations.)

And if we are using multiple independent models to gain the benefits of ensembling for sample-efficiency and uncertainty estimation, then each model presumably must be dynamically evaluated, so if we had 3, then we’d need >6× the cost of a single frozen model.

However, the dynamic evaluation ensemble has offsetting advantages, which may cancel out even a large penalty: it may be smaller and have shorter context windows (see Rannen-Triki et al 2024); and ultimately, a frozen model may be incapable of achieving the same performance, and so it doesn’t matter how much faster the frozen model computes the wrong answer. (Even if it computes the right answer, is it computing the right answer for the right reason? “Bits have color.”) It also allows for unusual optimizations like customized BPE dictionaries, recycling tokens which never appear in the user’s corpus and further economizing on context window size.

In the same way that an Apple phone sells for much more than its hardware cost, because of the security and reliability, a GA can sell for more than the ‘FLOPS cost’.

And the cost may not matter that much in the end. The DL experience curves remain strong as of mid-2026, with steady drops in the cost of a fixed level of performance (eg. Gundlach et al 2025 estimates the algorithmic efficiency in 2024–2025 improved at ~3×⧸year, quickly offsetting dynamic evaluation costs).

Organization

A beginning is the time for taking the most delicate care that the balances are correct. This every sister of the Bene Gesserit knows.

—Princess Irulan (Dune)

A major question is, assuming it looked like it was working, “now what?” AI economics and lifecycles are ever faster, and Mythos-class models (and possibly RSI) are not far off, so tinkering around for years as an open source community project may not be viable. Further, a GA will necessarily contain the most sensitive possible data.

How should a GA community run?

Open-source self-hosted hobbyist projects are not necessarily particularly secure when it comes to hosting large amounts of personal data, as they cannot easily afford fulltime professional security teams and their users may be promiscuous; cases like Jia Tan or the regularity of npm supply chain attacks, are particularly alarming. And creating secure datacenter hosting with the necessary vertical integration is well out of scope for open-source hobbyists.

A nonprofit organization is more feasible because it can run a centralized project, but may struggle to get any funding to pay for the considerable hardware costs of just prototyping with power-users.

Startup Business Model

The logical way to go is a startup, as there is a straightforward and principal-aligned monetization strategy of starting by selling subscriptions to power-users, similar to boutique SaaS startups like Superhuman, for whom $1,000⧸month would be nothing if the product could meaningfully amplify them (and so can afford shortcuts like a single GPU dedicated to each principal), and gradually deploying refined cheaper versions to everyone else.

This startup can release most or all of its software and research, in a standard SV “commoditize your complement” play, undercutting proprietary LLM giants like OpenAI or Anthropic.

Because the goal of GA is to preserve individual human cognitive liberty and flourishing, the corporate structure should be designed with that in mind.

So far, the Anthropic corporate structure of a public benefit corporation with a few powerful co-founders, has best weathered the corrupting nature of success. The corporation should probably be heavily weighted to the founders by using dual-class shares to preserve voting power at the expense of beneficial ownership—the point of GAs is not to make money, but to save humans.

It would want to raise relatively little money initially, and give away minimal equity or voting rights; if a GA strategy works, it should (almost literally) sell itself, and skyrocket in value. (Consider how fast AI startups can become unicorns, post-2025.) At that point, the startup will be in the catbird seat, and can name its terms in later raises.

Competition

While there are many startups offering various kinds of OpenClaw-esque agents, or chatbots for entertainment/therapy, or various takes on the idea “renting your business data”, there are few which take seriously the requirement of personalization for advanced automation, or how to better understand the principal. Most seem to settle for superficial, unsatisfying emulation of business data and handling routine automation.

I’m not aware of any startup or service offering anything like an acceptable GA, and I think there are powerful pressures which steer most people in AI away from GA-like ideas:

Most people are so new and post-ChatGPT; they are unaware that you can have non-chatbot personalities—that it is either possible or desirable to have a non-chatbot LLM.

They have, for example, not only never read or chatted with a GPT emulation of themselves, they have never even seen samples of such a thing. And when they try to have a chatbot imitate any specific author to see what happens (eg. “Hacker News Gwern”) or even just emulate an interesting fictional character, the mode-collapsed chatbot imitation is so bad that the idea seems refuted. (They are generally unaware of mode-collapse or the limitations of chatbot personalities.)
Similarly, they are unaware that you can finetune an LLM on the fly and that this is desirable (and are definitely unaware that this was the standard textbook answer to how to get the best possible LLM at runtime).

Even specialists in finetuning LLMs, like Thinking Machines Inc./Workshop Labs, seem focused on simple business use-cases like finetuning on internal datasets to save compute compared to few-shot prompting or agentic LLM workflows.
The higher upfront cost, and incompatibility with standard cloud infrastructure, deters both startups and frontier labs from even contemplating in-weight personalization; prompt-only personalization works well for most users, as far as they know to demand it, and they don’t know you should demand more from LLMs.
They have not extrapolated LLM capabilities in earnest or considered what their role will be in a few years, and don’t think you should assume LLMs will get much better.

I continue to be surprised how few people, even in the Bay Area, even when working at a frontier lab or in a program like MATS, have a meaningful answer to the simple question of “this work you are doing right now seems like an LLM could do it in a year or two, max; why are you doing it now and what will you be doing after they can?”⁵

Initial Steps

Then it happened to me what had happened to a certain Zatesky, described by Luria; having lost part of his brain during the war, and with part of the brain the whole of his memory and of his speaking ability, Zatesky was nevertheless still able to write: thus automatically his hand wrote down all the information he was unable to think of, and step by step he reconstructed his own identity by reading what he was writing.

—Umberto Eco (“Interpretation and Overinterpretation: World, History, Texts”, 1990_36ya)

For the past few years, I have been working on shifting my writings to be LLM-centric by: (1) emphasizing proposals or descriptive writings rather than detailed analysis that LLMs could be able to do soon; (2) better centralizing my writings, including features like the “blog” section (to archive my off-site writings and encourage me to write down smaller essays) and investing in detailed note-taking (in the form of augmentation), to have a comprehensive corpus to train on; (3) paying off technical debt like a slow backend full of shortcuts and hardwired config data, while moving to a CLI-centric writing workflow; and (4) writing down important unwritten things, like creating the Gwern.net Manual of Style to try to formalize and document the implicit rules of Gwern.net.

GBT

The GA concept is inspired by early work in finetuning GPT-2, including a GPT-2 IRC logs version and the GPT-3 samples of me, and thinking about how useful it would be to talk to high-quality personas, especially of myself, and wondering at what point the ‘AI Gwern’ could just go do the things that ‘we’ would want done.

I hope to explore summer 2026 the simplest possible prototype of GAs: using my unusually large textual corpus to explore a “Gwern Branwen Transformer” (GBT). If the idea works at all, it should work for me, as I have long emphasized text and centralizing/archiving in part for precisely this use-case. (And once it works at all, then it can be made sample-efficient enough to work for people with little or no textual corpus.)

It would be an off-the-shelf <100b-parameter LLM, finetuned on commodity hardware (maybe some Nvidia H100 GPUs), on a text corpus. The initial corpus would be ~1GB of text from my IRC logs (>1m responses by me), Gwern.net Markdown and GTX (~5m words each), and Twitter/HN/LessWrong exports, concatenated with separators. (Ideally, each export would be enriched with context and metadata, like the comment and post being replied to.) The corpus can be expanded to include my YourMorals data (taken again for an update and catch up on new tests), emails, Evernotes export of ~100k clippings (via Nixnote2), my Mnemosyne spaced repetition flashcards, Signal chat, and hosted PDFs/web pages (the local archive process makes them much cleaner and avoids the difficulty of scraping ever more hostile websites).

We can tune it to minimize loss on a heldout corpus like my recent comments. This is an objective loss, so we can deploy agentic LLMs to find improvements (eg. by adapting the ensembling + weight-decay recipes for us, or experimenting with different data formatting and augmentations).

And since we are unsure about many design decisions, the ensemble can be reused for blinded A/B testing as part of the active learning: use the ensemble to generate informative questions, and then score each ensemble member on their loss on the answer, and periodically drop out the loser and train a new ensemble member.⁶

For Writing

My design goal is a 100× increase in productivity; what would it take for a GA to make me 100× more productive as a writer or thinker?

Why 100×, specifically? Because 2 aiming at OOMs is enough to hope to keep up for a while, and this goal will shatter ineffective, bandaid-style short-term design proposals; but it doesn’t require solving the AI alignment problem or value extrapolation. (We do not need to be able to oversee millions of superintelligent AIs working autonomously for subjective millennia, which is good, because a GA probably cannot do that!)

What would that look like for me? Roughly, I think it would look like a GA enabling me to produce ~1–3 worthwhile writings a day which are ~100% written by it without loss of quality. (Because I write <1 good piece per month, or every 31 days, and 100⁄31 ≈ 3; which means that in a full working day of 8 hours, I would have ~160 minutes, start to finish, per piece—which feels doable for reading and understanding, critiquing, spot-checking etc. if I have a truly trustworthy GA which amplifies me rather than subverts or substitutes for me. Of course, readers would not want to read each piece, but they generally did not more, and the more I write, the more likely there’s something each specific reader will want to read.)

The hope with the prototype would be to see “signs of life” towards that goal. If I could input a single sentence defining a viable Gwern.net essay topic (eg. “Why pay toilets are not a public good”) and get out an essay I could endorse and publish as-is without embarrassment and without needing heavy revision with the Manual of Style in the context window, then that would be promising, because that would indeed be a >100× productivity increase. (A GBT-written essay could also be better than the essay I would have written, by simply doing much more work, and doing things I would not have done, like the “‘Try, Score, Change’: Reinforcement Learning for Children” writing exercise, where I could define the task but lacked the patience to execute it, while chatbots have both the knowledge and computation to execute it with appropriate creative-prompt scaffolding… once I came up with the idea, anyway.)

Another interesting test would be to solicit and answer questions about anything from my readers, where I could either endorse the GPT answer or edit it to add to the corpus; if the answers were uniformly good, or I felt an improvement from finetuning on accepted/edited answers (showing a working Q&A bootstrap akin the interview prompt), this would be encouraging.

Data Augmentation

I am also interested in the data augmentation phase: extracting meaning from raw text by analyzing and synthesizing data to add to the training corpus; LLM data cleaning research has long showed that the more metadata and conditioning, the better, and I think the same is true of personalized LLMs—simply next-token prediction on raw IRC logs is not as useful as being able to intersperse the LLM’s commentary on what a given statement means.

A perfected GA would be able to do that as it went, but a prototype may have to do a bootstrap of naive training on the original corpus, and then prompting/scaffolding to do useful analysis to augment the corpus, and then retraining, after which it has learned to do augmentation on the fly.

Since we don’t know how to format the data correctly, I expect to iterate. Perhaps we will train naively on just concatenated data, then prompt the GBT for self-analysis and summary, and figure out what kind of annotation is useful. Something like, a global principal profile PRINCIPAL.md, per-corpus-file summaries, and intra-file annotations like , and then re-process the original data to augment it, and retrain the foundation model, and throw in new ‘Q&A logs’ of me with GBT, for example.

Once we figure out good scaffolding/design patterns using our GBT guinea pig, then future LLMs can just do it out of the box by pretraining on example corpuses and documentation. And they can just do an automatic ‘import’ iteration of pretraining, annotating, re-pretraining, and then dynamic evaluation from then on, and let periodic model upgrades implement a ‘reset’ of re-annotating.

Q&A logs might be initially populated using the interview prompt: just run it on every separate piece of text, and then curate the top 100, say, and sort by interest. You can objectively quantify the value of a question by how many changes it produces to a PRINCIPAL.md or to the meta-annotations in character count or corpus-wide compression loss—useful summaries should make the corpus more compressible. Then use that as metadata and you can prompt to generate highly useful questions, eventually.

We can do active learning over the log data in the sense of taking a lot of user datapoints, doing some simple brainstorm-like thinking over each one to find the most uncertain ones, and asking explicitly about the meaning.

Then we can experiment with tool-calling (mocked and implemented by hand) to see how much it captures of my writing process (see my Inkhaven & Dwarkesh interview, and finding my ideas), and get a better idea of what is the best paradigm for scaling GAs up.

Another interesting thing to do is to curate self-play data: chatbot personalities are well-known to diverge into strange ‘attractors’, like the “bliss spiral” of Claude chatbots; so one can characterize one’s own GA attractors, and edit data to reduce them by replaying the conversation until it starts going off the rails, and writing your own responses.

I do not know which of us has written this page.

—Jorge Luis Borges, “Borges and I”

Ironically, the biggest success of the RL sub-field of ‘human preference learning’, RLHF, works by not learning any actual human’s preference. Those papers which do consider the subject, tend to dismiss the existence of human preferences as noise; for example, the DPO authors measure—on simple summarization tasks—human disagreement rates as high as 35% and infer this justifies using LLM proxies, rather than raising questions about the basic approach of attempting to model ‘the’ human preference.↩︎
Dynamic evaluation is traditionally done as full-model finetuning, which can absorb arbitrarily large datasets; for efficiency, LoRA can be used initially, but will eventually underperform. (Possibly LoRAs would be more effective for this purpose if done on a very overparameterized LLM, in which case there may be hosting benefits, if a host like Thinking Machines Inc. could run many users simultaneously to gain throughput.) If full finetuning proves to ultimately be inadequate, it is possible to do model expansion by adding on additional layers or networks, like ladder networks, which freeze the original model weights and train a new LLM to ‘use’ the original.↩︎
But lifelogging is probably not as crucial as we used to think, because offline data suffers from a curse of exploration: day to day life is highly predictable and quickly uninformative about deeper properties, because people usually spend little time doing unusual things, and not answering strange hypotheticals or introspecting deeply about their preferences.↩︎
AI safety groups sometimes claim difficulty in usefully deploying a large amount of capital; such humanity-wide preference and value elicitation projects could soak up an almost arbitrary amount of grants, and would be highly useful in training and benchmarking LLMs for understanding of the full range of human morality, and contributing to individual GAs. (For perspective, the global literacy rate is >80%, and on online survey websites like Prolific, it generally costs ~$0.20⧸response; so to ask the entire literate global population a single question, if that were somehow possible, would still cost >$1.6b!)↩︎
Props to the two people who immediately replied, “Because I need a job now” and “I’ll be a nepo baby, my retired parents already agreed to take me back if I am permanently unemployed”.↩︎
ie. use a ‘racing’ top-k multi-armed bandit online evaluation. My favorite bandit algorithm, top-k posterior sampling, is probably not appropriate here because LLM checkpoints can take a lot of disk space, and will continually fall out of date and need to be retrained if they are to be reused later, so holding onto an indefinitely large population of ensemble candidates is highly expensive.↩︎

Error: JavaScript disabled.

Backlinks, similar links, and the bibliography require JS enabled to load.

Bibliography

[Bibliography of links/references used in page]