DebaterX

Image-to-Video Works Better Than Text-to-Video for Characters

If you want a recognizable mascot, start from an image. Always.

4 min read

When I first started generating mascot videos, I assumed text-to-video would be the main workflow. Describe the mascot in the prompt, let the model render them, animate them, ship it.

After a few hundred failed generations, I switched to image-to-video for every mascot-bearing shot. Text-to-video is now a fallback, not a primary. Here's why.

The fidelity problem

Text-to-video models interpret descriptions. They turn words into visual tokens. For generic subjects — a generic person, a generic city, a generic dog — this works fine. The model has a reasonable mental model of what you're asking for.

For specific subjects — this specific mascot, with its proprietary color palette, silhouette, and proportions — text fails. The model's mental model of "Ronald McDonald" is an averaged abstraction from its training data. You'll get a red-haired figure in a yellow outfit who could plausibly be Ronald to someone who squinted, but who wouldn't survive legal review.

No description is detailed enough to produce exact fidelity. The model fills in gaps with its own averaging, and the averaging is exactly what you don't want.

The image-to-video solution

Image-to-video takes a reference image as input. The model's job becomes: "animate this specific image." It doesn't have to guess what the character looks like — it has the character right in front of it.

The result is much higher fidelity. Proportions are preserved. Colors match. Silhouette stays locked. Across multiple generations, the character looks consistent.

How to build the reference image

You still need to create the reference image. Options:

Option one: official art. If you have access to licensed brand art, use it directly. This is the highest-fidelity path.

Option two: image generation. Use an image model (Midjourney, DALL-E, Imagen) to generate the reference image. Iterate until the image matches the mascot precisely. Save the final image as the canonical reference for all future generations.

Option three: composite and touch-up. Start with a base image (generated or licensed), then use image editing to align it exactly. This is time-consuming but produces the best fidelity when the first two options don't work.

For DebaterX, I generate reference images via image models, iterate until they're right, and then lock them. Every video of that mascot uses the same reference image. Consistency across videos becomes automatic.
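Locking references can be as simple as a small registry mapping each mascot to one canonical image path. This is a minimal sketch of that idea; the mascot name, file path, and function are hypothetical, not part of any real pipeline.

```python
from pathlib import Path

# Hypothetical reference library: exactly one canonical image per mascot.
# "Locking" means every generation reads from this mapping and never
# from an ad-hoc regenerated image.
REFERENCE_LIBRARY = {
    "debaterx": Path("refs/debaterx_canonical_v3.png"),
}

def get_reference(mascot: str) -> Path:
    """Return the locked canonical reference image for a mascot."""
    try:
        return REFERENCE_LIBRARY[mascot]
    except KeyError:
        raise KeyError(
            f"No canonical reference for {mascot!r}. "
            "Add one to the library before generating video."
        )
```

The point of the explicit lookup (rather than passing image paths around loosely) is that consistency becomes a property of the system: if a shot gets generated, it used the canonical image.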

The prompt pattern

For image-to-video generation, the prompt focuses on motion and scene, not appearance. The appearance is handled by the reference image.

A good image-to-video prompt:

"The mascot in the reference image, standing in an empty diner, 2 AM lighting, sliding a burger across a counter toward someone offscreen. Camera at medium distance, slight low angle."

Notice: no description of the mascot. The reference carries that. The prompt only adds context the reference doesn't provide — the setting, the action, the camera.
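The pattern — motion and scene only, appearance delegated to the reference — can be enforced mechanically with a tiny prompt builder. The field names here are my own sketch, not any model's API:

```python
def build_i2v_prompt(setting: str, action: str, camera: str) -> str:
    """Compose an image-to-video prompt that describes only what the
    reference image cannot: setting, action, and camera. Appearance
    is deliberately absent -- the reference image carries it."""
    return (
        f"The mascot in the reference image, {setting}, "
        f"{action}. {camera}."
    )

prompt = build_i2v_prompt(
    setting="standing in an empty diner, 2 AM lighting",
    action="sliding a burger across a counter toward someone offscreen",
    camera="Camera at medium distance, slight low angle",
)
```

Funneling every prompt through a builder like this makes it structurally impossible to sneak appearance details back in, which is where fidelity drift starts.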

The gotcha: scene consistency

Image-to-video models handle the mascot well but can still drift on the scene. The diner in frame 1 might have different lighting than frame 60. Background elements shift. Props teleport.

This is a smaller problem than mascot drift, but it's real. Mitigate by keeping shots short (fewer frames = less drift) and editing in cuts rather than long takes.
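The short-shot mitigation can be sketched as a simple planner that splits a scene into cut-length segments. The 4-second cap below is an assumption for illustration, not a measured drift threshold:

```python
def plan_shots(total_seconds: float, max_shot_seconds: float = 4.0) -> list[float]:
    """Split a scene into short shots so each generation stays under
    the drift-prone length; the final shot absorbs the remainder."""
    shots = []
    remaining = total_seconds
    while remaining > 0:
        shots.append(min(max_shot_seconds, remaining))
        remaining -= shots[-1]
    return shots

plan_shots(10.0)  # three shots: [4.0, 4.0, 2.0]
```

Each planned shot becomes its own generation from the same reference image, and the cuts hide whatever scene drift does occur at the seams.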

The model-level differences

Not all image-to-video models are equally good, and as of late 2026 their strengths vary noticeably by shot type.

I route shots based on type, using each model for its strength.
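Routing by shot type can be expressed as a plain lookup table. The shot types and model slots below are placeholders, not an actual ranking — the real mapping has to come from your own per-model evaluations:

```python
# Hypothetical routing table: shot type -> model slot. The slot names
# stand in for whichever models your own testing favors per shot type.
SHOT_ROUTER = {
    "character_closeup": "model_a",
    "character_action": "model_b",
    "establishing_no_character": "text_to_video_fallback",
}

def route(shot_type: str) -> str:
    """Pick a model for a shot type, defaulting to the text-to-video
    fallback for shots that don't involve the mascot."""
    return SHOT_ROUTER.get(shot_type, "text_to_video_fallback")
```

Keeping the routing declarative means re-ranking models later is a one-line table edit, not a code change.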

The takeaway

If your product generates video featuring recognizable characters — mascots, brand faces, celebrity likenesses — image-to-video is the correct workflow. Text-to-video is a fallback for scenes that don't involve characters.

Build your reference library first. Lock the canonical image per mascot. Use it everywhere. Fidelity becomes a system property, not a per-generation gamble.

Text is too lossy a description format for specific visual identity. Images are the right input for visual tasks. Use images.
