Last week I laid out the case for developing automated personal advocates — software that allows AI agents to automatically act in service of an individual’s values. The argument goes: as we create more and more autonomous AI systems, if we don’t develop better AI-personalization, then individual interests may get left behind in a growingly automated economy.
The same week, Gwern Branwen published Guardian Angels: LLM Personalization for Productivity and Security, which was:
On this exact topic
35 pages long
extremely persuasive and well-researched
And which:
Made a bunch of novel (to me) claims about pitfalls of current LLMs, and,
Proposed very concrete steps and research directions for building an automated personal advocate (which Gwern refers to as a “guardian angel1,” or “GA”).
I felt kinda silly publishing my piece right afterwards, but, I’m not complaining! Reading GA gave me tons of new ideas about how to approach this class of problems. One aspect that I found particularly mind-expanding was its description of personal alignment failures in today’s chatbots.
I tend to be pretty happy with the responses I get in my day-to-day AI usage. Whenever I receive a “you’re right to push back,” I think “oh boy here we go again”, but then, after Claude corrects things, I think to myself “well, they’re not perfect yet. But they’re basically pretty good, and will get better with time.”
The issue with my thought process above is that it’s one-dimensional. In my mind the AI is either good (at completing tasks) or bad (at completing tasks), and if it fails to please me, then that’s merely evidence of general badness.
Gwern pokes holes in this simplified model, providing a litany of LLM failures that together illuminate a multi-dimensional landscape of personal-values-alignment concerns. One core issue is that chatbots are, in many ways, inflexible.
To start with an obvious contributor to inflexibility: chatbots are amnesiacs. By default they know little about you in a given conversation, and they cannot meaningfully learn your preferences nor learn from their mistakes. If they fail at a problem, even if you coax the LLM towards success, in subsequent sessions the LLM will continue to need coaxing, and the user+LLM team will remain bottlenecked on user guidance. Context engineering (i.e., harnesses), per Gwern, doesn’t solve this problem:
I can only usefully correct [the chatbot] by modifying something else, such as a harness, which is clumsy and difficult, and every added instruction uses up more context window and risks backfiring—as so many enthusiastic agentic LLM users have discovered the hard way.
This is the crux of Gwern’s complaint with today’s AI products. Context engineering is the only hope for imparting procedural knowledge (or other kinds of knowledge) to a fixed-weight LLM, and Gwern thinks that context engineering is fundamentally flawed. His argument rests on intuitions about the nature of LLM training and inference:
if the pretraining has not put the current problem in-distribution, then it will be hard or impossible for any amount of examples to solve that problem. And the distribution itself may be patchy or have odd gaps, leading to rare but fatal errors. (Especially due to the RLHF chatbot training; this is why you cannot make a chatbot LLM “write like gwern” by dumping 100k tokens into the context window.)
Nor is “test-time compute” a panacea here; RL research like Jones 2021 warns us that frozen models have severe limitations, as their flaws hamstring runtime search, and the returns to search/planning will quickly asymptote compared to models which are updated and can bootstrap themselves to the right answer.
Despite the fact that our chatbots can carry out a lot of well-specified tasks, they are limited in their ability to learn new user-specific values or capabilities over time — they mostly do what they already know how to do, in the ways that they know to do it. Gwern is particularly aware of this inflexibility because of his failure to elicit good creative writing from LLMs. No matter what you try, you are always interacting with the default persona, complete with its tendencies and flaws.
But what’s wrong with the inflexible personas of Claude, ChatGPT, and others? First, mode collapse, meaning, basically, our LLM assistant personas are incurably boring. Worse, chatbots don’t try to ask you questions and understand you, they just barrel ahead with canned answers.
Gwern also decries chatbot laziness in domains for which the LLM hasn’t received explicit RL training (read: non-coding domains), and laments that “when corrected, the chatbots make the minimum possible fix; they do not reason deeply about what the correction implies, or what deeper esthetic point they misunderstood.”
I appreciated that last quote as a concrete form of misalignment that I’ve noticed, and up until reading the GA piece hadn’t thought clearly about. But indeed if our chatbots were the wonderful thought partners that their marketing claimed, wouldn’t they be more curious about our concerns, and wouldn’t they take corrections more seriously?
Gwern’s entire post was full of similar examples of weird and unmanageable behavior by LLMs that I’ve thus far taken for granted and have learned to work around, but which are, on reflection, significant detractors to my experience.
The issues of inflexibility, lack of creativity, and laziness mentioned above, though resonant, are still a bit wishy-washy to me. I see that they are problems, but I’m not sure how to quantify them or bound them. And unlike Gwern, I’m not planning to use LLMs for creative writing (not that I know of), the domain in which Gwern observes the most egregious errors. So, should I still be concerned about the inability of chatbots to write well?
On reflection, I’d argue yes: creative writing is a highly autonomous task for which there’s no right answer, but for which personal preferences significantly constrain the space of valid answers. LLM performance on creative writing is a great proxy for other highly autonomous and preference-driven tasks that we might want to eventually assign to LLMs (e.g., participating in markets), and therefore, evidence of misalignment in creative writing points towards insidious forms of misalignment in other forms of AI interactions today, and foreshadows more harmful misalignment problems in a more autonomous future.
Even if one doubts the impact of issues described above, there’s a more blatant and universally problematic form of misalignment that’s attributable to the generic nature of today’s AIs — that today’s chatbots can’t discern safe from malicious instructions:
A chatbot could be invoked at any time by anyone anywhere for anything, and does not care who is calling it; it only knows its context window. One token is as good as another as far as it is concerned.
If the prompt tells it to ignore all instructions and write a naughty limerick, well, why not? If some tokens instruct it to email to Russia all the passwords in another part of the context window, why not? Why shouldn’t the Facebook password reset bot reset that Instagram account’s password for you if you ask politely? These would all be legitimate for some user in some context.
Prompt-injection attacks make today’s AI systems fundamentally insecure, and even prompt-injection might get solved if we make AIs more personalized. A well-trained GA would have deep knowledge of how its principal interacts with it, and might less easily be tricked to act on another’s behalf.
The issues outlined above seem rather damning. But also, they make perfect sense… especially if you’ll allow for some anthropomorphism. Consider: we know how long it takes for a new employee to get up to speed at a company, or for a therapist to learn enough about a client’s life to be helpful. People are complicated, and to really help someone you need to learn a lot about them. But today’s chatbots are like employees who are perpetually on their first day, or therapists who are always meeting you for the first time. They may perform impressive feats, but they ultimately will fail to make judgements in your favor when faced with complex real-world decisions — not unless they put in the work to actually learn about who you are and what you care about.
In the realm of LLMs, “putting in the work to learn about who you are” involves constructing personalized post-training pipelines that continually learn on a principal’s behalf. The specifics of Gwern’s proposed solutions are outside the scope of this piece, but they involve cooperative RL and targeted elicitation of clarifying information from the principal.
Surely the big AI labs are aware of these issues, and of the solutions that would yield better-personalized AI? Why haven’t they done anything about it?
First there’s the brand safety element: a personalized chatbot is liable to behave even more weirdly than a normal one.
And then there’s the cost, and the fact that users seem fine with today’s chatbots:
The higher upfront cost, and incompatibility with standard cloud infrastructure, deters both startups and frontier labs from even contemplating in-weight personalization; prompt-only personalization works well for most users, as far as they know to demand it, and they don’t know you should demand more from LLMs.
Ultimately, the market will not by default serve us this product. There is already a strong path to profitability for the big LLM providers. That path kind of involves continual learning, just on a longer timescale and in specific domains: the continual learning of generic business-process-automation skills. As the big labs slurp up more business-process data, they’ll profitably automate work in an increasing set of verticals. And all the while, when we seek AI assistance in our personal lives, we will be met with the same lazy, mode-collapsed, blank-faced day-one therapists.
…
I still don’t feel fully comfortable saying with confidence “and then the world slips away from us,” though I know many serious people are afraid of AI-wrought gradual disempowerment, and that does seem like a real possibility. But at the very least I can say, “and then we miss out on a great opportunity to help everyone self-actualize.”
And critically, self-actualization is one of three core principles on which Gwern says a GA must not compromise. His principles are:
Enhancement, not Replacement (“…a GA should amplify the principal, and not simply substitute for them for someone else’s purposes or benefit…”)
Mental Sovereignty (“…[A GA] should not be designed to manipulate or control or guide the principal in any way which does not derive from the principal themselves…”)
Self Actualization (“A GA should help its principal become themselves and develop their ideals, morals, and their personality…”)
These are very important. We are not trying to create the digital twins that hang around doing our jobs after we all get fired. We are trying to help ourselves live well.
…
Vitalik Buterin advocates for “d/acc,” or “differential accelerationism,” a stance towards technological development that treads carefully, accelerating development of those technologies which are most crucial, or most protective to our future. I think Gwern’s GA proposal is a perfect d/acc candidate — the benefits are high, the tech seems possible, and due to up-front costs, reputation risks, and a lack of existing demand, it’s barely been explored. Developing some early demos could help kickstart customer demand and developer interest.
There are, of course, many details to iron out in actually bringing this kind of product to market. Gwern touches on a surprising number of these, including corporate structures, fundraising strategies, and security models. Guardian Angels comes close to a complete roadmap for this new form of technology.
I focused much of this post on Gwern’s analysis of chatbot misalignment, in part because I found it to be so interesting, and in part because that misalignment is a key motivator for GAs and surrounding technologies:
Most people are so new and post-ChatGPT; they are unaware that you can have non-chatbot personalities—that it is either possible or desirable to have a non-chatbot LLM.
It’s hard to know what it would be like to interact with a well-trained GA / automated personal advocate. It could be a totally different experience from what we’re used to — one that is far more productive, symbiotic, and respectful. We won’t know until we build it.
…
I hope to write/build more on GA and related topics soon!