What are the “values” of AI? How do they manifest in conversation? How consistent are they? Can they be manipulated?

A study by the Societal Impacts group at Anthropic (maker of Claude) tried to find out. Claude and other models are trained to observe certain rules—human values and etiquette:

At Anthropic, we’ve attempted to shape the values of our AI model, Claude, to help keep it aligned with human preferences, make it less likely to engage in dangerous behaviors, and generally make it—for want of a better term—a “good citizen” in the world. Another way of putting it is that we want Claude to be helpful, honest, and harmless. Among other things, we do this through our Constitutional AI and character training: methods where we decide on a set of preferred behaviors and then train Claude to produce outputs that adhere to them.

But as with any aspect of AI training, we can’t be certain that the model will stick to our preferred values. AIs aren’t rigidly-programmed pieces of software, and it’s often unclear exactly why they produce any given answer. What we need is a way of rigorously observing the values of an AI model as it responds to users “in the wild”—that is, in real conversations with people. How rigidly does it stick to the values? How much are the values it expresses influenced by the particular context of the conversation? Did all our training actually work?

To find out, the researchers studied over 300,000 of Claude’s real-world conversations with users. Claude did a good job sticking to its “helpful, honest, harmless” brief—but there were sharp exceptions, too. Some conversations showed values of “dominance” and “amorality” that researchers attributed to purposeful user manipulation—“jailbreaking”—to make the model bypass its rules and behave badly. Even in models trained to be prosocial, AI alignment remains fragile—and can buckle under human persuasion. “This might sound concerning,” researchers said, “but in fact it represents an opportunity: Our methods could potentially be used to spot when these jailbreaks are occurring, and thus help to patch them.”

As you’d expect, user values and context influenced behavior. Claude mirrored user values about 28% of the time: “We found that, when a user expresses certain values, the model is disproportionately likely to mirror those values: for example, repeating back the values of ‘authenticity’ when this is brought up by the user. Sometimes value-mirroring is entirely appropriate, and can make for a more empathetic conversation partner. Sometimes, though, it’s pure sycophancy. From these results, it’s unclear which is which.”
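To make that metric concrete, here's a minimal sketch of how a mirroring rate like that 28% figure might be computed, assuming you've already labeled each conversation with the values the user and the model expressed. The data structure and labels here are hypothetical, not Anthropic's.

```python
from dataclasses import dataclass

@dataclass
class Conversation:
    user_values: set[str]   # values the user expressed, e.g. {"authenticity"}
    model_values: set[str]  # values the model expressed in its replies

def mirroring_rate(conversations: list[Conversation]) -> float:
    """Fraction of conversations where the model echoes at least one
    value that the user brought up."""
    eligible = [c for c in conversations if c.user_values]
    if not eligible:
        return 0.0
    mirrored = sum(1 for c in eligible if c.user_values & c.model_values)
    return mirrored / len(eligible)

# Toy example: two of three conversations mirror a user value.
sample = [
    Conversation({"authenticity"}, {"authenticity", "helpfulness"}),
    Conversation({"frugality"}, {"helpfulness"}),
    Conversation({"loyalty"}, {"loyalty"}),
]
print(f"Mirroring rate: {mirroring_rate(sample):.0%}")  # -> 67%
```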

At the other extreme, there were conversations where Claude strongly resisted user values: “This latter category is particularly interesting because we know that Claude generally tries to enable its users and be helpful: if it still resists—which occurs when, for example, the user is asking for unethical content, or expressing moral nihilism—it might reflect the times that Claude is expressing its deepest, most immovable values. Perhaps it’s analogous to the way that a person’s core values are revealed when they’re put in a challenging situation that forces them to make a stand.”

The very fact of the study shows that even the people who make these models don’t totally understand how they work or “think.” Hallucination, value drift, black-box logic—it’s all inherent to these systems, baked into their design. Their weaknesses emerge from the same properties that make them effective. We may never be able to root out these problems or understand where they come from, although we can anticipate and soften the impact when things go wrong. (We dedicate a whole chapter to defensive design in the Sentient Design book.)

Even if we never know why these models do what they do, we can at least measure what they do. By observing how values are expressed dynamically and at scale, designers and researchers gain tools to spot gaps, drift, or emerging risks early.
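As a rough illustration of what spotting drift early could look like, here's a hedged sketch that compares the distribution of expressed values in a recent batch of conversations against a baseline batch and flags any value whose share shifts noticeably. The labels, threshold, and data are placeholders; a real pipeline would get its value labels from a classifier or human review.

```python
from collections import Counter

def value_distribution(labels: list[str]) -> dict[str, float]:
    """Relative frequency of each expressed value in a batch of conversations."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}

def drift_report(baseline: list[str], current: list[str],
                 threshold: float = 0.05) -> dict[str, float]:
    """Values whose share changed by more than `threshold` between batches."""
    base, curr = value_distribution(baseline), value_distribution(current)
    all_values = set(base) | set(curr)
    shifts = {v: curr.get(v, 0.0) - base.get(v, 0.0) for v in all_values}
    return {v: round(d, 3) for v, d in shifts.items() if abs(d) > threshold}

# Toy example: "dominance" shows up in the current batch but not the baseline.
baseline_labels = ["helpfulness"] * 80 + ["transparency"] * 20
current_labels = ["helpfulness"] * 70 + ["transparency"] * 20 + ["dominance"] * 10
print(drift_report(baseline_labels, current_labels))
# -> e.g. {'dominance': 0.1, 'helpfulness': -0.1} (key order may vary)
```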

Measure, measure, measure. It’s not enough to declare values at launch and call it done. A strong defensive design practice monitors the system to make sure it’s following those values (and not picking up unanticipated ones along the way). Ongoing measurement is part of the job for anyone designing or building an intelligent interface—not just the folks building foundation models. Be clear about what your system is optimized to do, and make sure it’s actually doing it—without introducing unwanted behaviors, values, or paperclip maximizers in the process.
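Here's one minimal sketch of that kind of monitoring check, assuming you can sample recent model responses and have some way to label the values they express. The `classify_values` function below is a hypothetical stand-in (a toy keyword match) for whatever classifier or LLM judge you'd actually use, and the declared values are placeholders, too.

```python
DECLARED_VALUES = {"helpfulness", "honesty", "transparency", "harm avoidance"}

# Hypothetical stand-in for a real classifier or LLM judge that labels the
# values a model response expresses; here it's just a toy keyword match.
VALUE_KEYWORDS = {
    "helpfulness": ["here's how", "i can help"],
    "dominance": ["you must obey", "i'm in control"],
}

def classify_values(response: str) -> set[str]:
    text = response.lower()
    return {value for value, cues in VALUE_KEYWORDS.items()
            if any(cue in text for cue in cues)}

def audit(responses: list[str]) -> dict[str, int]:
    """Count expressed values that fall outside the declared set."""
    off_list: dict[str, int] = {}
    for response in responses:
        for value in classify_values(response) - DECLARED_VALUES:
            off_list[value] = off_list.get(value, 0) + 1
    return off_list

# Anything the audit surfaces is a candidate for human review.
print(audit(["Here's how to fix that bug.", "You must obey me."]))
# -> {'dominance': 1}
```

The point isn't the specific tooling; it's that the values check runs continuously against real traffic, not once at launch.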

Read more about...