DebaterX

AI Is Bad at Disagreeing. I Spent Weeks Trying to Fix That.

Why most LLM-written debates feel like two people politely agreeing, and the specific prompting moves that break the pattern.

5 min read

The first hundred debates I generated for DebaterX were useless. They were well-written, grammatically correct, thematically coherent. They were also, every single one, collaborative. Two characters assigned to opposite sides of an argument would reliably find common ground by turn four. Everyone left the scene friends. Everyone had learned something.

This is a disaster for a product whose entire purpose is brand rivalries.

The shape of the failure

The pattern was consistent across every frontier model I tested. Gemini, GPT, Claude, all of them. Prompt the model for a debate between Ronald McDonald and the Burger King, and within six turns you'd get something like:

"I think we both bring something valuable to fast food."

"That's a really fair point. Maybe we're more similar than we thought."

"Different customers, different needs, right?"

"I agree completely. Let's put aside our differences."

This is not a debate. It's a corporate retreat. The mascots have lost all conflict and are now discussing business synergy.
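The drift is easy to spot mechanically, too. As an illustration, here is a minimal sketch that flags harmony-seeking turns; the phrase list and function name are my own and are nowhere near exhaustive, but the idea is what I used to eyeball failures at scale:

```python
# Minimal harmony-drift detector: flags turns that reach for agreement.
# AGREEMENT_MARKERS is illustrative, not a complete list.
AGREEMENT_MARKERS = [
    "fair point",
    "we both",
    "common ground",
    "i agree",
    "more similar than",
    "put aside our differences",
]

def flag_harmony_turns(turns: list[str]) -> list[int]:
    """Return the indices of turns that contain an agreement marker."""
    flagged = []
    for i, turn in enumerate(turns):
        lowered = turn.lower()
        if any(marker in lowered for marker in AGREEMENT_MARKERS):
            flagged.append(i)
    return flagged

dialogue = [
    "Flame-grilled beats frozen patties every day of the week.",
    "That's a really fair point. Maybe we're more similar than we thought.",
]
print(flag_harmony_turns(dialogue))  # → [1]
```

A crude check like this is enough to measure how quickly a generated debate collapses into agreement across a batch of runs.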

Why this happens

It took me three weeks to diagnose the cause, and another two weeks to find a reliable fix. The cause is structural: LLMs are fine-tuned with reinforcement learning from human feedback (RLHF), and human raters reward helpfulness, harmlessness, and harmony. Over millions of training examples, the models learn that conversations ending in agreement score higher than conversations ending in continued conflict.

This is correct for the models' primary use case — helpful general assistants. It breaks in specific creative contexts where conflict is the point. You need to actively fight the training.

What doesn't work

Before I found the fix, I tried several approaches that seem reasonable but don't actually help:

Adding "make this adversarial" to the prompt. The model hears this and writes two turns of adversarial dialogue, then drifts back to harmony anyway. Positive instructions are too soft to override RLHF.

Raising temperature. Higher temperature produces more random output, not more conflict-preserving output. You get jittery agreement instead of stable agreement, which is worse.

Writing more detailed character descriptions. More adjectives don't help. The model averages the adjectives into a reasonable character and then makes the reasonable character reach for reasonable agreement.

Using more capable models. GPT-5, Claude 4.7, Gemini 2.5: all of them exhibit the same drift toward harmony. Extra capability doesn't change the biases that training baked in.

What works: explicit negative rules

The fix turned out to be embarrassingly simple once I found it. Instead of telling the model to be adversarial (positive), tell it what it's forbidden from doing (negative).

The specific rule that broke the pattern:

"Neither character may agree with the other for the full video. Neither character may reach for common ground. The scene must end with both characters in the same positions they held at the start. Disagreement is the entire point of this exercise."

Aggressive language. Explicit prohibitions. Repeated clarification. With these sentences in the system prompt, output quality shifted dramatically. The mascots started staying in conflict. Endings stopped reaching for morals. Debates felt like debates.
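In practice those prohibitions go straight into the system prompt. A sketch of the shape (the helper name and scenario text are mine; the rule text is quoted from above):

```python
# Assemble a debate system prompt with explicit negative rules.
# The rules are the prohibitions quoted in the article; the
# scenario string is an illustrative placeholder.
NEGATIVE_RULES = [
    "Neither character may agree with the other for the full video.",
    "Neither character may reach for common ground.",
    "The scene must end with both characters in the same positions "
    "they held at the start.",
    "Disagreement is the entire point of this exercise.",
]

def build_system_prompt(scenario: str) -> str:
    """Append the non-negotiable prohibitions below the scenario."""
    rules = "\n".join(f"- {rule}" for rule in NEGATIVE_RULES)
    return f"{scenario}\n\nHard rules (non-negotiable):\n{rules}"

prompt = build_system_prompt(
    "Write a debate between Ronald McDonald and the Burger King."
)
```

The exact framing ("Hard rules", the bullet layout) is my assumption; what matters is that every rule is a prohibition, stated bluntly, and kept in the system prompt rather than the user turn.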

Why negative rules beat positive rules

The reason this works: RLHF bias operates on what the model defaults to. When you tell the model "be adversarial," it hears that and tries to comply, but its training keeps pulling back toward harmony. The baseline bias wins.

When you tell the model "do not agree, do not reach for common ground, do not conclude in harmony," the baseline bias has nothing to pull toward. The negative rules close off the default paths. The model is forced to generate conflict-preserving output because that's what's left.

Negative prompts narrow the output space more sharply than positive prompts. For tasks where the model's defaults work against you, negative is the right lever.

The rules I use now

The DebaterX system prompt has 17 negative rules, each one earned by watching the model fail in a specific way. Each rule blocks a failure mode I saw repeatedly, and each produces visibly better output when added.

The broader pattern

The principle extends beyond debates. Any creative task where the model's defaults fight against you benefits from the same approach: identify the failure pattern, write a negative rule against it, repeat.
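One way to make that loop concrete is to keep a running registry that maps each observed failure mode to the negative rule it earned. Everything below is illustrative (the names and entries are mine, not the DebaterX internals):

```python
# Failure-mode -> negative-rule registry: every observed drift
# earns a prohibition, and the prompt is rebuilt from the registry.
registry: dict[str, str] = {}

def add_rule(failure_mode: str, negative_rule: str) -> None:
    """Record a prohibition earned by watching a specific failure."""
    registry[failure_mode] = negative_rule

def render_rules() -> str:
    """Render the accumulated prohibitions as a prompt fragment."""
    return "\n".join(f"- {rule}" for rule in registry.values())

add_rule(
    "characters reach common ground",
    "Neither character may reach for common ground.",
)
add_rule(
    "scene ends with a moral",
    "The scene must not end with a lesson or a moral.",
)
```

Rebuilding the system prompt from the registry keeps the rule set honest: every line traces back to a failure you actually watched happen.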

This is the real prompting skill beyond basic engineering. You're fighting the training. The more you understand what the model wants to do, the better you can prevent it from doing exactly that.

AI models don't just need good instructions. They need specific prohibitions against their own preferences. Mascot debates are one category where this is acutely true. Villain dialogue, tragic endings, unresolved tension, adversarial narratives — all benefit from the same approach.

The takeaway

If your AI creative pipeline is producing soft, harmony-seeking, conclusion-reaching output, the model isn't broken. It's doing exactly what it was trained to do. Your prompt is the problem.

Add explicit negative rules. Be aggressive. Tell the model what it may not do. The output will shift within one generation.

I spent three weeks staring at collaborative mascots before I figured this out. You don't have to. Start with the negative rules. Build from there.
