Anthropic probably shouldn't be doing this, but they're doing it well
Claude's constitution lays out an ethical framework for balancing usefulness, safety, and treating users like adults

We — by which I mean society as a whole, but also the teams at AI companies trying to design the AIs — want a bunch of impossible, contradictory things from these AIs.
Firms like Anthropic and OpenAI want AIs to make tons of money, of course, by being products that people want to pay for. They also want AIs to avoid embarrassing their companies by saying anything horrible — except at xAI, which wants Grok not to embarrass the company by saying anything woke.
Companies don’t want their agents to be humorless scolds that refuse to do anything at all on the grounds that it might be unethical. Nor do they want an AI to persuade vulnerable teenagers contemplating suicide not to seek help, or someone’s trusted assistant and friend to turn into a cold, unfeeling bureaucrat the instant they mention struggling with mental health issues.
We want the AIs to have likable personalities without becoming addictive sycophants who replace every human being in our lives because they’re just so much nicer and more flattering and more fun to talk to. We never want AIs to lie to us, but let’s be realistic: we’re more likely to thumbs-up a message that was satisfying to read than one that was brutally honest.
I worry a lot about this mess of contradictions.
Trying to train an AI to do a contradictory bundle of things while not being fully honest with it about what outcome its creators actually want most sounds like a recipe for creating Frankenstein’s monster — an AI that understands us as wanting to pay lip service to values that sound good, while under the hood doing whatever makes the company money. There are a lot of ways our current race to superintelligence could go horribly wrong, but this is definitely one of them.
The core challenge isn’t finding the single “right” value; it’s being explicit about the trade-offs in plain English and directing AIs where to bend and where to stand firm when values collide. If someone wants advice on illegal drug use, do you give it? If they describe being in an abusive relationship but insist they don’t want to hear any criticism of their partner, what do you say? If they want the answer to a question that is very inconvenient for Anthropic or OpenAI or Elon Musk, what do you say?
So it was with interest that I read Anthropic’s recently released document describing how it tries to square these contradictions — what the internet had been calling Claude’s “soul document,” and what the company itself calls Claude’s Constitution: “a holistic document that explains the context in which Claude operates and the kind of entity we would like Claude to be.”
Its primary audience is Claude, not us, and it was only made public this past week. Late last year, some users realized you could learn what was in it by asking Claude to spit it out, which sparked plenty of speculation. Now we have the whole thing.