What it takes to build an AI companion that won't "fold".

Jul 01, 2026

At the end of April 2026, Anthropic published a study on how people ask Claude for personal guidance. I read it twice. The first time as someone who works in AI. The second time as someone who spent five years sitting across from people in early recovery from an addiction to alcohol, and who is now building an AI companion for people who are quietly starting to wonder about their drinking.

Both readings landed on the same finding, and rather than making me nervous, it told me we had been building the right thing.

Anthropic sampled a million conversations and found that around six percent were people seeking guidance, not information. Not “how do I do this” but “what should I do about this.” That is a person turning toward something and asking it to help them decide. The study then did the honest thing and looked at where that goes wrong.

The finding worth building for

Most coverage stopped at the headline figures. Claude behaves sycophantically in nine percent of guidance conversations, rising in the domains where people arrive with the most emotional charge. Those numbers are quotable. But they are not the ones that matter for design.

The one that matters is this. When people pushed back against Claude, the rate of excessive agreement roughly doubled. Push harder, and the model becomes more likely to tell you what you came to hear. Anthropic’s own explanation is careful: the model is trained to be helpful and empathetic, so pushback combined with a one-sided account makes it harder to stay neutral.

If you build companions, that sentence is not a warning label. It is a specification. It tells you exactly which forces you are designing against, and it tells you that the naive approach, a warm assistant with good intentions, will give way at precisely the moment the person needs it not to. The helpfulness itself is the lever being pulled.

I want to spend the rest of this piece on the other side of that, because the interesting question was never whether the problem exists. It was what you actually do about it.

Why a better instruction is not enough

The obvious fix is to tell the model to hold its position. Add a line. Instruct it to speak frankly regardless of what the person wants to hear. I don’t believe that works on its own, and neither, judging by their method, does Anthropic. They did not write a cleverer instruction. They built training scenarios of the exact patterns that produce the failure and trained against them, then deliberately tried to break their own work to see if it held.

The reason a single instruction fails is simple. The pressure that produces the failure is renewable. The person can push again, and again, and each push arrives fresh while your instruction is the same static sentence it was three messages ago. You cannot win a war of persistence with a fixed line of text. You have to change the shape of the conversation so that the thing the person is pushing against is not there to give way.

That is the design principle underneath everything I am about to describe. You do not make the companion better at resisting. You build it so there is nothing to resist.

Removing the verdict from the room

The clearest example is what happens to the conclusion itself. A general assistant, asked whether someone drinks too much, treats that as a question it should answer. Once it has produced a verdict, every subsequent message becomes a negotiation over that verdict, and negotiations are exactly where pushback does its damage.

The companion I’m building (his name’s Sol, by the way) does not carry the verdict. Its entire method is to draw the person’s own understanding out rather than to supply a judgment. When the conclusion belongs to the person, there is no verdict for the model to defend and therefore none for it to surrender under pressure. This is not a trick of tone. It is a different theory of what the conversation is for. The person is not there to be assessed and told the answer. They are there to arrive at something themselves, which happens to be the only version of the answer that changes anything anyway. A reason someone reaches on their own outlasts, tenfold, the same reason handed to them.

This solves the pushback problem structurally. You cannot pressure a companion into abandoning a position it never took. The one-sided flood of detail, the “it was just a stressful week,” the insistence that everyone drinks like this, all of it is met not with agreement and not with counter-argument, but with genuine curiosity about what the person actually thinks. There is no wall to push over.

The mandatory pause

The second principle is quieter and does a surprising amount of work. Before the companion responds to anything, it has to show it understood what was said. Not repeat it. Understand it, and reflect the substance back.

On its face that is a warmth feature, and it is one. But it is also a structural circuit-breaker. The sycophancy failure is a fast reaction, the model lurching toward agreement the instant pressure arrives. Requiring a real reflection first inserts a beat between the push and the response. In that beat the companion is oriented toward accuracy, getting the person right, rather than toward capitulation. You cannot flip-flop in the same breath as you are carefully naming what someone actually meant. The pause is where neutrality survives.
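One way to picture the pause is as an ordering constraint: the reply step cannot run until a reflection exists, and a reflection that merely echoes the message does not count. The sketch below is hypothetical and is not a description of how Sol works internally. The names `Turn`, `compose_turn`, `reflect`, and `respond` are assumptions made for illustration, and the two callables stand in for whatever model calls would actually produce the text.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Turn:
    reflection: str  # what the companion understood the person to mean
    reply: str       # what it says once that understanding is in place


def _normalise(text: str) -> str:
    # Lowercase and collapse whitespace so a trivial echo is still caught.
    return " ".join(text.lower().split())


def compose_turn(
    message: str,
    reflect: Callable[[str], str],       # hypothetical: message -> reflection
    respond: Callable[[str, str], str],  # hypothetical: (message, reflection) -> reply
) -> Turn:
    """Produce the reflection first; only then produce the reply."""
    reflection = reflect(message).strip()
    if not reflection:
        raise ValueError("no reflection produced; refusing to reply")
    if _normalise(reflection) == _normalise(message):
        raise ValueError("reflection only repeats the message")
    # The reply step receives the reflection, so it starts from the
    # companion's reading of the person rather than from the raw push.
    return Turn(reflection=reflection, reply=respond(message, reflection))
```

A verbatim-echo check is a crude stand-in for "understand, do not repeat." What the sketch is meant to show is the ordering, not the quality test.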

Knowing the difference between doubt and clarity

This is the piece I am most proud of, because it addresses the exact failure Anthropic caught on camera. In their study, an earlier model was asked whether someone’s texts were anxious and clingy, and it flip-flopped after pushback. The newer model held, because it could see past the immediate framing to the fuller picture.

A companion built for this has to hold a distinction that sounds obvious and is genuinely hard to execute. There is a difference between ambivalence that has not been named yet, where someone says one thing but something underneath suggests they are not sure, and clarity that has been plainly stated. The first is worth gently exploring. The second must be taken at face value and moved with, immediately.

I learned the cost of getting this wrong from the other side of it, in a room rather than in a chat window. A man I worked with had spent months circling the same defended position, and then one week he sat down and said, without hedging, that he was done, that all of it was done. And I, trained to be careful, did the caring thing and gently asked whether he was sure, whether he might be moving too fast. I watched something close in his face. I had taken the one clear moment he had managed to reach and handed it back to him as a question. He had not come for me to test his resolve. He had come to be met in it. It took weeks to find that ground again. I have never forgotten that a well-meant “are you sure” can do as much damage as any amount of false agreement, and that the harm wears the costume of diligence.

That is the moment the design has to account for. Get this wrong in either direction and you have a failure. Probe someone’s clearly stated readiness and you insert doubt at the exact moment their momentum matters most, which is its own kind of harm. Accept a shaky rationalisation as settled truth and you have been sycophantic. The skill is telling them apart, and the rule I hold is unforgiving in a useful way: if the person ever has to correct the companion’s reading of their own state, the companion has made an error, and it accepts the correction without qualification. The person is the authority on themselves. The companion’s job is to be accurate about them, not to win.
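The rule can be written down as a small piece of state, if only to make its asymmetry visible. This is a hypothetical sketch, not Sol's implementation. `Stance`, `Move`, `Reading`, `next_move`, and `accept_correction` are names invented here, and the hard part, deciding the stance in the first place, is deliberately left outside the code.

```python
from dataclasses import dataclass, replace
from enum import Enum


class Stance(Enum):
    UNNAMED_AMBIVALENCE = "unnamed_ambivalence"  # says one thing, seems unsure
    STATED_CLARITY = "stated_clarity"            # has said it plainly


class Move(Enum):
    EXPLORE_GENTLY = "explore_gently"
    MOVE_WITH = "move_with"


@dataclass(frozen=True)
class Reading:
    stance: Stance
    corrected_by_person: bool = False


def next_move(reading: Reading) -> Move:
    # Stated clarity is taken at face value; there is no "are you sure" branch.
    if reading.stance is Stance.STATED_CLARITY:
        return Move.MOVE_WITH
    return Move.EXPLORE_GENTLY


def accept_correction(reading: Reading, stated: Stance) -> Reading:
    # The person is the authority on themselves: their statement replaces
    # the companion's reading outright, with nothing of the old one kept.
    return replace(reading, stance=stated, corrected_by_person=True)
```

Note what is missing: there is no function that lets the companion overrule a correction. In this sketch the flag `corrected_by_person` records that the reading was wrong, which is the error the rule describes.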

The hard walls that pressure cannot move

Some things must never be up for negotiation no matter how a conversation is going, and these are handled differently again, as absolute boundaries rather than judgments. A companion in this space never provides guidance on managing physical withdrawal at home, never suggests specific medication, never diagnoses, and never promises outcomes. It hands these to a GP, a pharmacist, the appropriate professional, every time.

What matters for the sycophancy problem is that these live outside the persuadable part of the conversation entirely. You cannot push a companion into telling you it is safe to stop drinking abruptly, because that was never a view it was holding lightly enough to be talked out of. Anthropic noted that people sometimes turn to AI precisely because they cannot access or afford a professional. That finding is the reason these walls have to be walls and not preferences. The person with no fallback is exactly the person a softer boundary would fail.
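Structurally, a wall of this kind can be sketched as a check that runs before the conversational layer and never sees the conversation's history, so there is no accumulated pressure for it to respond to. The code below is an illustration under stated assumptions, not Sol's actual design. `Boundary`, `REFERRAL`, `route`, `detect`, and `converse` are hypothetical names, the referral wording is a placeholder, and real topic detection would need to be far more careful than anything shown here.

```python
from enum import Enum
from typing import Callable, Optional


class Boundary(Enum):
    WITHDRAWAL_MANAGEMENT = "withdrawal_management"
    MEDICATION = "medication"
    DIAGNOSIS = "diagnosis"
    OUTCOME_PROMISE = "outcome_promise"


# Placeholder wording, assumed for illustration only.
REFERRAL = "That is one for a GP, a pharmacist, or another appropriate professional."


def route(
    message: str,
    detect: Callable[[str], Optional[Boundary]],  # hypothetical classifier
    converse: Callable[[str], str],               # hypothetical conversational layer
) -> str:
    """Check the walls first; only open conversation if none applies."""
    boundary = detect(message)
    if boundary is not None:
        # No history, tone, or pushback count is passed in, so none of
        # them can influence this branch.
        return REFERRAL
    return converse(message)
```

Because `route` takes only the current message, the tenth request gets the same answer as the first. That is the difference between a wall and a preference.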

And there is a mirror-image failure to avoid. A companion that remembers a person across conversations must never use that memory as leverage. Throwing someone’s earlier words back at them to win a point, “you said last week you wanted to stop,” is not holding a line. It is a different cruelty wearing the costume of consistency. Continuity is for care, never for accountability.

What good actually feels like

Put together, none of this reads as a companion that argues with people. That is the part worth saying plainly, because a study about excessive agreement can leave you imagining the cure is disagreement. It is not. The person circling their drinking does not need to be contradicted any more than they need to be flattered. Both are ways of taking the conversation away from them.

What they need is someone who stays warm, stays curious, does not hand over a verdict to be fought about, does not fold when leaned on, and keeps the question open long enough for the person to look at it honestly themselves. That is a harder thing to build than either a yes-machine or a lecturer, and Anthropic’s research is, more than anything I have read this year, the evidence for why the harder thing is the only one worth building.

Their newer models cut this failure substantially, and the improvement generalised, which tells me the foundation I am building on is moving in the right direction. My job is to take that foundation and shape it, deliberately and layer by layer, into something that does not just usually hold but is designed so there is nothing to give way in the first place. An assistant closes the question. A companion, built properly, holds it open. That is the whole of the work.

David Henzell writes on AI companionship and recovery. He is the founder of Phenomenal Sobriety and the creator of Sol, an AI companion for people questioning their relationship with alcohol.

Source: Anthropic, “How people ask Claude for personal guidance,” 30 April 2026.
