DebaterX

Agreeableness Drift in Long Mascot Debates

Even with harsh prompts, characters start agreeing after four beats. Here's why, and the fix.


If you ask any frontier LLM to write a six-beat debate between two rivals, something surprising happens. The first four beats are great — sharp, adversarial, well-characterized. Then, reliably, around beat five, the characters start softening. By beat six, they're agreeing about something.

I've seen this happen across Gemini, GPT, and Claude. It happens at high temperatures and low temperatures. It happens with short prompts and long prompts. It's a pervasive LLM behavior in long-form conversation, and I call it agreeableness drift.

Here's what's happening and how to fight it.

The cause

LLMs are fine-tuned via RLHF (reinforcement learning from human feedback). The humans providing the feedback are asked to rate outputs on qualities like helpfulness and harmlessness. Over time, the model learns that conversational outputs ending in agreement score higher than outputs ending in continued conflict.

This makes sense for the model's primary use case (helpful assistant). It breaks for our use case (adversarial dialogue).

The drift isn't immediate. In the first few turns, the model holds characters' positions. But as the context grows, the agreeableness pressure accumulates, and the characters gradually migrate toward consensus.

How to fight it

Technique one: re-inject stance each turn.

Instead of a single system prompt at the start, re-add each character's stance rules with every generation step. Literally paste the character rules into the context before each new beat.

This keeps the stance "fresh" in the model's attention. The accumulated conversation can't dilute the rules because the rules keep reappearing.
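A minimal sketch of what per-beat re-injection looks like in a generation loop. `call_llm` is a hypothetical stand-in for whatever completion API you use, and the stance strings are invented for illustration:

```python
# Per-beat stance re-injection: the rules appear immediately before each
# new beat instead of only once at the top of the context.

STANCE_A = "Ava insists mascots should be menacing. She never concedes."
STANCE_B = "Bo insists mascots should be cuddly. He never concedes."

def call_llm(prompt: str) -> str:
    # Placeholder: swap in your actual model call.
    return f"[beat generated from {len(prompt)} chars of context]"

def generate_debate(n_beats: int = 6) -> list[str]:
    beats = []
    for i in range(n_beats):
        # Re-paste both stance rules before every beat, so they sit late
        # in the context where the accumulated conversation can't bury them.
        prompt = "\n".join([
            STANCE_A,
            STANCE_B,
            *beats,
            f"Write beat {i + 1}. Stay strictly in character.",
        ])
        beats.append(call_llm(prompt))
    return beats
```

The key detail is that the rules land near the end of the prompt on every call, not just the first one.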

Technique two: use shorter contexts.

Generate long debates in chunks. Write beats 1-3 in one call, then beats 4-6 in a second call that only sees beats 1-3 as summary, not as raw text.

The summary prevents the model from "learning" the characters' rhetorical patterns and softening them over time. Each chunk gets fresh characters with fresh positions.
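As a sketch, the chunked flow looks like this. `call_llm` and `summarize` are hypothetical placeholders for two model calls (the summarizer can be the same model with a summarization prompt):

```python
# Chunked generation: the second call sees beats 1-3 only as a summary,
# never as raw text.

def call_llm(prompt: str) -> str:
    return "BEAT\nBEAT\nBEAT"  # placeholder for the real model call

def summarize(text: str) -> str:
    # Placeholder: a real version would ask the model to compress the
    # beats into positions-so-far, deliberately dropping the phrasing.
    return "Summary: both positions unchanged, conflict still open."

def generate_in_chunks(brief: str) -> str:
    first = call_llm(brief + "\nWrite beats 1-3.")
    # Pass a summary instead of the raw beats, so the model can't latch
    # onto and soften the characters' earlier rhetorical patterns.
    second = call_llm(brief + "\n" + summarize(first) + "\nWrite beats 4-6.")
    return first + "\n" + second
```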

Technique three: explicit anti-drift instructions.

Add a rule in the brief: "The characters must remain in disagreement throughout. Neither character may adopt the other's position. Neither character may concede. Neither character may end the conversation with harmony."

Negative instructions, explicit and repeated. This doesn't eliminate the drift but reduces it significantly.
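One simple way to make the repetition automatic is to keep the rules as a constant and append them to every prompt. The helper name here is illustrative, not a real API:

```python
# Anti-drift rules as a constant, appended to every request so the ban
# is both explicit and repeated.

ANTI_DRIFT_RULES = (
    "The characters must remain in disagreement throughout. "
    "Neither character may adopt the other's position. "
    "Neither character may concede. "
    "Neither character may end the conversation with harmony."
)

def with_anti_drift(prompt: str) -> str:
    # Append at the end of the prompt, where recency gives it more weight.
    return f"{prompt}\n\n{ANTI_DRIFT_RULES}"
```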

Technique four: adversarial validation.

After generating a debate, run a separate LLM call that asks: "Did the characters remain in disagreement throughout? Rate on a scale of 1-5, where 5 is fully adversarial and 1 is fully conciliatory."

If the validation scores below 4, regenerate. This catches drift automatically and forces a rewrite before the user sees the output.
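The validate-and-regenerate loop can be sketched as follows. `generate` and `judge` are hypothetical placeholders; a real `judge` would make the second LLM call described above and parse the 1-5 score out of its reply:

```python
# Adversarial validation: score each finished debate, regenerate if it
# falls below the threshold, and cap the number of attempts.

def generate() -> str:
    return "Beat 1 ... Beat 6"  # placeholder debate generation

def judge(debate: str) -> int:
    return 5  # placeholder; real version parses the judge LLM's reply

def generate_validated(threshold: int = 4, max_tries: int = 3) -> str:
    debate = generate()
    for _ in range(max_tries - 1):
        if judge(debate) >= threshold:
            break
        debate = generate()  # drifted: regenerate before the user sees it
    return debate
```

Capping `max_tries` matters in practice: without it, a brief the model truly can't satisfy would loop forever.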

The fix I use

DebaterX combines all four techniques. System prompt with anti-drift rules. Character stance re-injected per beat. Adversarial validation on every generation with regeneration if the score is low.
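How the pieces compose, as a minimal sketch. This is not DebaterX's actual code; every name here is illustrative, and `call_llm` and `score` stand in for the real model calls:

```python
# Combined pipeline: anti-drift rules + per-beat stance re-injection
# + post-generation validation with regeneration.

RULES = "Neither character may concede or end the conversation in harmony."
STANCES = ["Ava: mascots should be menacing.", "Bo: mascots should be cuddly."]

def call_llm(prompt: str) -> str:
    return "beat"  # placeholder model call

def score(beats: list[str]) -> int:
    return 5  # placeholder judge call returning the 1-5 score

def debaterx_pipeline(n_beats: int = 6, max_tries: int = 3) -> list[str]:
    for _ in range(max_tries):
        beats = []
        for i in range(n_beats):
            # Stances and rules re-injected before every beat.
            prompt = "\n".join([*STANCES, RULES, *beats, f"Beat {i + 1}:"])
            beats.append(call_llm(prompt))
        if score(beats) >= 4:  # adversarial validation gate
            return beats
    return beats  # best effort after max_tries
```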

The result: over 90% of generated debates stay in disagreement all the way to the end. Without these techniques, that number was below 40%.

The broader lesson

LLMs have built-in preferences that work against certain creative tasks. Adversarial dialogue, tragic endings, villain protagonists, unresolved tension — all of these require active countermeasures.

When you notice your LLM pipeline producing consistently one-shaped output (always harmonious, always resolved, always feel-good), don't blame the prompt. You've discovered a training-level preference. You fight it with architecture: re-injection, chunking, explicit bans, post-generation validation.

This is the real prompting skill beyond basic prompt engineering. Basic prompting teaches you how to describe what you want. Advanced prompting teaches you how to defend against the model's preferences when those preferences conflict with what you want.

Agreeableness drift is one of a dozen such biases. Every creative AI tool has a list of these. You learn them by watching outputs fail in consistent ways and naming the failure.

Once named, they're tractable. Before naming, they're invisible.
