Brand Image Descriptions That Actually Produce Good Videos

How you describe a mascot to an AI decides whether it survives the render. Here's the formula.

May 7, 2025 · 4 min read

The description you hand to a video model matters almost as much as the model itself. With a bad description, even the best model produces a generic cartoon. With a good one, even a mid-tier model produces something recognizable.

After a few thousand generations and a lot of failed outputs, I've landed on a specific structure. It's boring, but it works.

Lead with silhouette

Describe the silhouette first: the character's outline, shape, and proportions. In my experience, video models lock onto silhouette before anything else. If you lead with color or clothing, the model paints a generic figure and adds details as afterthoughts. If you lead with silhouette, the model builds the right shape first and fills in detail correctly.

Example: Ronald McDonald.

Bad description: "Cheerful clown with red hair and yellow jumpsuit."

Good description: "Tall, slender humanoid with oversized shoes, bright red curly wig that's proportionally one-third of head height, elongated face with rounded chin, shoulders broader than hips, hands gesturing outward."

The second description gives the model a shape to work with. The only color it names is the wig's; the rest can be added afterward. Shape is load-bearing.

Color comes second

After silhouette, color. Specific colors, with reference anchors.

"Red (Pantone 485, approximately) for the wig. Yellow (close to mustard, not lemon) for the jumpsuit. White makeup base with red mouth paint that extends past the natural lip line. Red nose that's spherical, not bulbous."

The parenthetical references help. "Red" is ambiguous. "Red, Pantone 485, approximately" gives the model a specific point in color space.

Props and gestures come third

What the character is holding. What they're doing with their hands. Their stance. These are all secondary details, and they should be prompted last because models prioritize whatever comes first in the description.

"Standing with feet shoulder-width apart, hands on hips, slight lean forward."

Or: "Holding a paper cup in right hand, other hand gesturing toward camera."

Props and gestures are where the model has the most creative latitude, but only if the silhouette and color are locked down by the earlier sections.
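
One way to keep that ordering from drifting is to build the prompt from three named fields instead of writing it freehand. Below is a minimal Python sketch. The MascotDescription class and its field names are hypothetical, made up for illustration, and not part of any video model's API. All it does is join the parts in silhouette, color, pose order.

```python
from dataclasses import dataclass


@dataclass
class MascotDescription:
    """Hypothetical container for the three parts, in priority order."""

    silhouette: str
    color: str
    pose: str = ""  # props and gestures; optional

    def to_prompt(self) -> str:
        # Order matters: silhouette first, color second, props/gestures last.
        parts = [self.silhouette, self.color, self.pose]
        # Drop empty parts and strip trailing periods so the join stays clean.
        cleaned = [p.strip().rstrip(".") for p in parts if p.strip()]
        return ". ".join(cleaned) + "." if cleaned else ""


ronald = MascotDescription(
    silhouette="Tall, slender humanoid with oversized shoes, shoulders broader than hips",
    color="Red (Pantone 485, approximately) for the wig. "
          "Yellow (close to mustard, not lemon) for the jumpsuit",
    pose="Standing with feet shoulder-width apart, hands on hips, slight lean forward",
)
prompt = ronald.to_prompt()
print(prompt)
```

Because the order lives in to_prompt rather than in whoever types the description, a new mascot cannot accidentally lead with color.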

The anti-pattern: vibe descriptions

The worst thing you can do is describe a mascot by vibe. "Iconic beloved American fast-food mascot." That's a vibe. The model will paint an average of every fast-food mascot in its training data. You'll get a blob.

Vibe descriptions fail because they give the model no concrete handles. It has to translate "iconic beloved" into something visual, and the range of visuals that fit that vibe is too broad. The result is generic.

Every adjective you use should translate cleanly into a visual attribute. "Slender" = thin body. "Oversized" = large relative to the rest of the figure. "Curly" = tightly coiled hair. "Iconic" = nothing visual. Skip it.
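
The adjective check can be roughed out as a tiny linter. This is a sketch under a big assumption: the NON_VISUAL_WORDS list is a hypothetical, hand-maintained set that you would grow from your own failed descriptions. It flags words and nothing more; it does not judge whether the remaining words are visual enough.

```python
import re

# Hypothetical starter list of words that carry no visual information.
NON_VISUAL_WORDS = {"iconic", "beloved", "famous", "classic", "fun", "memorable"}


def find_vibe_words(description: str) -> list[str]:
    """Return non-visual words in the description, in order of first appearance."""
    # Tokenize into lowercase words, keeping hyphenated words like "fast-food" whole.
    words = re.findall(r"[a-z]+(?:-[a-z]+)*", description.lower())
    flagged: list[str] = []
    for word in words:
        if word in NON_VISUAL_WORDS and word not in flagged:
            flagged.append(word)
    return flagged


print(find_vibe_words("Iconic beloved American fast-food mascot."))
print(find_vibe_words("Tall, slender humanoid with oversized shoes"))
```

An empty result does not mean the description is good, only that it avoids the words you already know are dead weight.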

Reference images beat descriptions

The best approach is to skip text descriptions entirely for existing mascots and use a reference image. Image-to-video preserves character fidelity far better than text-to-video because the model has a concrete visual to anchor on.

For DebaterX, I use image-to-video wherever possible. The reference image is generated once per mascot and stored. Every subsequent video for that mascot re-uses the same reference. Consistency across videos becomes automatic.
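
The generate-once, reuse-everywhere pattern is essentially a cache keyed by mascot. Here is a minimal sketch, not the DebaterX implementation. The get_reference_image function, the generate callback, and the one-PNG-per-mascot file layout are all assumptions made for illustration, and mascot_id is assumed to be safe to use as a filename.

```python
from pathlib import Path
from typing import Callable


def get_reference_image(
    mascot_id: str,
    generate: Callable[[str], bytes],
    cache_dir: Path,
) -> Path:
    """Return the stored reference image for a mascot, generating it only once.

    `generate` is a hypothetical callback that returns image bytes for a mascot;
    plug in whatever image-generation call you actually use.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{mascot_id}.png"
    if not path.exists():
        # First request for this mascot: generate the image and store it.
        path.write_bytes(generate(mascot_id))
    # Every later request reuses the same file, so all videos share one reference.
    return path
```

Every image-to-video job for a mascot then asks this function for the reference path, so two videos can never disagree about what the character looks like.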

For brand-new mascots, text descriptions are unavoidable. In that case, the silhouette/color/gesture structure is the best tool.

The test

After generating the first clip with a new description, ask: does the character survive the render? Is the mascot recognizable? Would someone who has never seen the mascot describe the rendered character the same way I did?

If yes: lock the description, use it everywhere.

If no: the description has a gap. Find it by comparing the output to the intent. Usually the gap is silhouette — the model got shape wrong. Fix the silhouette language and regenerate.

Descriptions are a compounding asset. Write good ones once, use them for a hundred videos.