DebaterX

Stop Grading AI Writing on Grammar

Mascots don't speak in complete sentences. Your evaluator shouldn't require them to.

·4 min read

I've seen too many AI creative pipelines where the output evaluator penalizes sentence fragments. The scoring system flags any response that doesn't end with a period. It treats "Grrrreat!" as an error. It marks "Nope." as incomplete.

This is the wrong evaluator. And it's why a lot of AI creative output sounds stilted. Here's what to use instead.

The problem with grammar evaluators

Grammar evaluators are designed for formal writing tasks — essays, reports, emails. They measure things like sentence completeness, clause structure, and punctuation correctness. For those tasks, they're useful.

For mascot dialogue, they're actively harmful. Mascot dialogue is full of:

- Sentence fragments
- Elongated exclamations ("Grrrreat!")
- One-word replies ("Nope.")
- Lines that don't end with a period

All of these score low on grammar evaluators. Yet all of them are correct for the medium. A script that scores high on grammar probably reads as stiff, formal, and un-mascot-like.

What a dialogue evaluator should measure

If you need automatic evaluation of dialogue, here's what to measure instead:

Character specificity. Does the line sound like it was written specifically for this character, or could any character say it? Character-specific lines score high. Portable lines score low.

Register consistency. Does the line's register (formal, casual, absurd, threatening) match the character's established voice? Consistency is a feature; drift is a bug.

Beat efficiency. How much narrative weight is the line carrying per word? Dense lines score high. Filler lines score low.

Structural placement. Is this a setup line, a bridge, a punchline, a tag? Each type has different quality signatures. Evaluate accordingly.

None of these are measurable by grammar. All of them are measurable by a second LLM call designed specifically for the task.

LLM-as-judge, done right

The honest version of automated evaluation: use a second LLM to grade the output of the first.

Write an evaluator prompt that asks: "Given these character rules and this generated dialogue, rate each line on character specificity, register consistency, beat efficiency, and structural placement. Use a 1-5 scale."

Run this evaluator on every generation. Use the scores as a quality signal. Regenerate low-scoring output automatically.
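A minimal sketch of that loop, with the model calls stubbed out. The `criterion: score` reply format, the `parse_scores` helper, and the 3.0 regeneration threshold are illustrative assumptions, not a fixed recipe:

```python
# Sketch of an LLM-as-judge scoring loop. The judge's reply is assumed
# to contain one "criterion: score" line per rubric item; in a real
# pipeline `judge_reply` would come from a second LLM call.

RUBRIC = ["character_specificity", "register_consistency",
          "beat_efficiency", "structural_placement"]

def parse_scores(judge_reply: str) -> dict:
    """Extract `criterion: score` pairs from the judge's reply."""
    scores = {}
    for line in judge_reply.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip().lower().replace(" ", "_")
        if sep and key in RUBRIC:
            scores[key] = int(value.strip())
    return scores

def needs_regeneration(scores: dict, threshold: float = 3.0) -> bool:
    """Flag output whose average rubric score falls below the threshold."""
    return sum(scores.values()) / len(scores) < threshold

judge_reply = """character specificity: 4
register consistency: 5
beat efficiency: 2
structural placement: 3"""
scores = parse_scores(judge_reply)
print(scores, needs_regeneration(scores))
```

The threshold is the knob to tune: set it too high and you burn generation budget on regenerations, too low and weak lines slip through.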

This is more expensive than grammar checking (you're making a second LLM call) but much more accurate for creative work. The cost is justified by the quality gain.

The failure mode of LLM-as-judge

LLM judges have their own biases. They tend to reward polish and penalize rawness, which means they're too conservative for comedy.

Mitigate this by showing the judge examples of known good dialogue (canonical lines from actual brand commercials). Anchor the judge's scale to real output, not to its own imagination of what dialogue should be.

"Here's an example of an 8/10 line: 'I'm lovin' it.' Here's an example of a 3/10 line: 'Our product delivers superior value.' Rate the following line on this scale."

With calibration, the judge becomes much more accurate.
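Assembling that calibrated prompt is mostly string templating. A sketch, using the two anchor lines quoted above — the exact prompt wording is an assumption:

```python
# Sketch of few-shot calibration: prepend known anchor lines and their
# scores so the judge rates against real output, not its own taste.

ANCHORS = [
    ("I'm lovin' it.", 8),
    ("Our product delivers superior value.", 3),
]

def calibrated_prompt(candidate: str) -> str:
    """Build a judge prompt anchored to scored example lines."""
    examples = "\n".join(
        f"Here's an example of a {score}/10 line: \"{line}\""
        for line, score in ANCHORS
    )
    return f"{examples}\nRate the following line on this scale:\n\"{candidate}\""

print(calibrated_prompt("Nope."))
```

Swap the anchors per character: the calibration examples should come from that mascot's own canonical lines, not from a generic pool.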

The human panel alternative

For important generations — high-stakes content, launch campaigns, first cuts — skip the auto-evaluation and use a small panel of humans. Three to five people who know the brand. Each rates on the same criteria. Average the scores.

Panels are more accurate than LLM judges for creative quality. They're slower and more expensive. For a small percentage of your total output, use them. For the rest, use LLM judges with panel-calibrated baselines.
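Averaging the panel is simple; the useful output is a per-criterion mean you can later feed back as the judge's calibration baseline. A sketch, assuming each panelist scores the same four criteria on a 1–5 scale (the ratings below are made up for illustration):

```python
# Sketch of aggregating a small human panel: one dict of rubric scores
# per panelist, averaged per criterion.
from statistics import mean

ratings = [
    {"character_specificity": 4, "register_consistency": 5,
     "beat_efficiency": 3, "structural_placement": 4},
    {"character_specificity": 5, "register_consistency": 4,
     "beat_efficiency": 3, "structural_placement": 5},
    {"character_specificity": 4, "register_consistency": 4,
     "beat_efficiency": 2, "structural_placement": 4},
]

panel_scores = {
    criterion: round(mean(r[criterion] for r in ratings), 2)
    for criterion in ratings[0]
}
print(panel_scores)
```

A low-variance criterion (everyone agrees) is a trustworthy anchor for the LLM judge; a high-variance one signals the rubric wording itself needs work.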

The meta-lesson

Measuring AI creative output requires creative measurement. Grammar, completeness, and formal correctness are easy to measure and mostly irrelevant. Character, register, and structural placement are hard to measure and totally relevant.

Don't let the easy-to-measure metric win. It'll optimize you toward the wrong output. Measure what matters, even if measuring is harder.

For mascot dialogue specifically: throw out grammar evaluators. Replace them with character-based judges. Calibrate against real examples. Your output quality will double within a week.

← Back to all posts