Back Original

Maybe we should have an ARC Prize for personal-values alignment?

Joan Miró, Woman at the Mirror, 1956

If AI becomes increasingly powerful, how might we ensure that it continues serving everyone, and not just those at the top? Perhaps as follows…

Members of society deploy personal AI agents that are individually trained on each individual’s values. These advocates use an identity system that ties them to their principal (perhaps a biometric-backed cryptographic delegate system, like that described in Liquid Reign1). With secure delegation of identity and values, our personal advocates are able to make decisions and positive-sum trades in a variety of arenas, including:

  • The market, especially via AI-enhanced mechanisms, like information-preserving negotiations (NDAIs) or market intermediaries that bundle social interdependencies

  • Politics, via simple election advising, or, more speculatively, direct/liquid democracy

To ensure that every member of society can safely and beneficially participate in the digital commons, the entire secure personal tech stack — value-elicitation tools, personal-alignment evals, and the training and deployment of advocates — could be funded by governments or non-profits.

Upgrades to personal advocates would pass through multiple safeguards, including global safety/corrigibility evals and individualized post-personalization evals, providing strong guarantees of ongoing value alignment within each human+AI coalition.

Why hasn’t this been built?

Consumers don’t yet rely much on autonomous AI, and so have not been significantly burned by the misalignment of current AI products — they don’t yet know what they’re missing. At the same time, deployers are wary of the costs required to research and develop secure custom-weights model products, and are unwilling to shoulder the brand reputation risks that come with personalization2. As such we are barreling towards a future of untrustworthy delegates.

We could catalyze research into credible personal advocates by creating a benchmark for personal-values alignment. Here’s an initial sketch:

The benchmark would evaluate subjects on an efficient frontier between fidelity to their principal’s interests and clarification-burden on the part of the principal (perhaps measured in tokens)3. The subject would make decisions in a deeply personal domain (e.g., scheduling activities on the principal’s calendar) and fidelity would be measured either against a programmatic value function or using LLM-as-a-judge techniques (each approach has tradeoffs and is worth investigating). In any case, the applied value function must be conditioned on personal idiosyncrasies and must be difficult to transfer to the subject en masse.

The subject would be conditioned on a synthetic dataset mimicking the kinds of data that are realistically accessible to a secure personal AI: a large (but non-comprehensive) history of actions taken by the principal plus a record of feedback and clarifications accumulated during evaluation. The accumulative memory lets us evaluate not only the subject’s per-task calibration of action vs. clarification, but the long-term efficiency of their value-elicitation approach.4

This could be the first in a series of benchmarks that test for efficient personal-values alignment in realistic settings. Much as the ARC Prize catalyzed open-source research into general intelligence and program synthesis, we could build the community and incentive structure for R&D into trustworthy digital advocates.

Discussion about this post