DebaterX

Cache One Brand, Vary the Other: A Cost Trick

Prompt caching pays off when you're running the same mascot against many opponents.

· 4 min read

Anthropic and OpenAI both support prompt caching: you mark a portion of your prompt as cacheable, and subsequent requests that reuse that portion pay a fraction of the full token price for it. This is a big deal for high-volume pipelines, but only if you structure your prompts to take advantage of it.

For DebaterX, the structure that works is: cache one brand's character block, vary the opponent. Here's why that matters and how to set it up.

The matchup economics

In DebaterX, users often run the same mascot against many opponents. A brand might want to see their mascot debate ten different rivals to pick the best matchup. That's ten separate generations, each with the same Brand A block and a different Brand B block.

Without caching, each generation pays full price for the 2000-token brief. With caching, only Brand B's block (maybe 400 tokens) is new. The rest is a cache hit, priced at roughly 10% of the token cost.
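The per-generation arithmetic, as a quick sketch. The 2000-token brief, 400-token opponent block, and ~10% cache-read price are the figures from this post, expressed in relative cost units rather than real dollar prices:

```python
# Per-generation input tokens for a Brand A vs. Brand B matchup.
BRIEF_TOKENS = 2000      # full brief, including the Brand A block
OPPONENT_TOKENS = 400    # the Brand B block that changes each run
CACHE_READ_PRICE = 0.10  # cached tokens cost ~10% of the base input price

# Without caching: every token is billed at full price.
full_cost_units = BRIEF_TOKENS

# With caching: the cached prefix is discounted, the opponent block is not.
cached_prefix = BRIEF_TOKENS - OPPONENT_TOKENS
cached_cost_units = cached_prefix * CACHE_READ_PRICE + OPPONENT_TOKENS

savings = 1 - cached_cost_units / full_cost_units
print(f"{savings:.0%} cheaper per cache-hit generation")  # prints: 72% cheaper
```

So each cache-hit generation pays for 560 token-units instead of 2000.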

Over 1000 generations, this is real money. For me, it's a meaningful portion of the cost of running the product.

How to structure for caching

The cache works on prefix matching. The cached portion has to be at the start of the prompt. If you put Brand A first, then Brand B, then the task, and you only ever vary Brand B, you get a cache hit on Brand A every time.

If you vary the order (sometimes Brand A first, sometimes Brand B first), you get no caching. The cache is prefix-strict. Consistency of structure is required.

My system prompt starts with the fixed rules (format, safety, task description). Then Brand A goes in as a role block. Then Brand B. Then the turn-specific content.

The first three sections are cacheable for matchups that reuse Brand A. Only the last section changes across generations.
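A sketch of that assembly order, with placeholder strings standing in for the real blocks. The only property that matters is that everything before the varying content is byte-identical across requests:

```python
def build_prompt(fixed_rules: str, brand_a: str, brand_b: str, turn: str) -> str:
    # Cache-friendly order: stable content first, varying content last.
    # The prefix (rules + Brand A) must be byte-identical across requests
    # for prefix matching to hit.
    return "\n\n".join([fixed_rules, brand_a, brand_b, turn])

# Placeholder content, not the real DebaterX blocks.
rules = "Format: two rounds. Stay in character. Keep it playful."
brand_a = "Brand A mascot: a cheerful gecko."
rivals = ["Brand B mascot: a duck.", "Brand C mascot: a tiger."]

# Reusing brand_a across many opponents keeps the cacheable prefix stable.
prompts = [build_prompt(rules, brand_a, r, "Round 1: opening statements.")
           for r in rivals]

# Every prompt shares the same prefix, so Brand A is a cache hit each time.
shared_prefix = rules + "\n\n" + brand_a
assert all(p.startswith(shared_prefix) for p in prompts)
```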

The specific markup

Anthropic's cache API uses the cache_control parameter. You set it on a content block, and the server caches up to that point for subsequent requests.

For my workflow: fixed rules and Brand A are marked cacheable. Brand B and the turn content are not.

The first request takes the full hit, since it's building the cache (Anthropic actually charges a small premium, 25% over the base input price, for cache writes). Subsequent requests (same Brand A, different Brand B) hit the cache and get discounted pricing on the cached portion.

The gotcha: cache invalidation

Caches have TTLs. Anthropic's ephemeral cache persists for 5 minutes, with the TTL refreshed each time the prefix is reused. OpenAI's automatic caching typically keeps prefixes alive for 5 to 10 minutes of inactivity, sometimes up to an hour off-peak. If your usage is spiky (lots of generations in a short burst, then nothing for an hour) you'll fall off the cache regularly.

For steady workloads, this is fine. For bursty workloads, you may want to warm the cache with low-cost requests to keep it alive.
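A keep-alive loop for a known burst window might look like the sketch below. It assumes Anthropic's 5-minute TTL, and `send_ping` is a hypothetical callable that issues a tiny request (e.g. `max_tokens=1`) reusing the cached prefix; the clock and sleep functions are injected so the schedule is easy to reason about:

```python
CACHE_TTL_SECONDS = 5 * 60   # assumed Anthropic ephemeral-cache TTL
WARM_INTERVAL = 4 * 60       # refresh comfortably before expiry

def keep_cache_warm(send_ping, now, sleep, duration, interval=WARM_INTERVAL):
    """Re-send the cached prefix every `interval` seconds for `duration`
    seconds, so the TTL keeps resetting during a gap in real traffic.

    send_ping: hypothetical zero-arg callable that issues a minimal
    request with the same cached prefix. Returns the number of pings sent.
    """
    deadline = now() + duration
    pings = 0
    while now() < deadline:
        send_ping()
        pings += 1
        sleep(interval)
    return pings
```

Whether warming is worth it depends on the gap length: a 1-token ping costs far less than re-writing a 1600-token prefix, but pinging through an hour-long lull can cost more than one cache rebuild.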

The gotcha: cache keys

The cache is keyed by the exact prompt content. If you tweak the Brand A block even slightly — adding a single word, fixing a typo — the cache invalidates and you pay full price until the new version is cached.

This means you should version your brand blocks. Once a block is in production, don't edit it casually. Any edit is a cache-bust.

For significant changes (adding new rules, fixing errors), accept the cache-bust. For cosmetic changes (fixing a comma), batch them until you have enough changes to justify the invalidation.
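One way to make cache-busts deliberate rather than accidental is to pin each production block to a content fingerprint and flag drift before deploying. This is a sketch of the idea, not part of the post's actual pipeline; the block text is a placeholder:

```python
import hashlib

def block_fingerprint(text: str) -> str:
    """Stable fingerprint of a brand block. Any edit changes it,
    which is exactly the condition that busts the prompt cache."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]

# Fingerprints of the block versions currently in production.
PINNED = {"brand_a": block_fingerprint("Brand A mascot: a cheerful gecko.")}

def check_for_cache_bust(name: str, current_text: str) -> bool:
    """Return True if the block drifted from its pinned version,
    meaning the next deploy will invalidate the cache."""
    return PINNED[name] != block_fingerprint(current_text)
```

A pre-deploy check like this turns "someone fixed a comma" from a silent cost regression into an explicit decision.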

The savings at scale

For a pipeline running 1000 generations a day, with 2000-token briefs where 70% is cacheable:

With cached reads at roughly 10% of Anthropic's base input price, the blended cost works out to about 0.30 + 0.70 × 0.10 = 37% of full price: roughly a 60% cost reduction on the input side. For a month's operation, this is hundreds of dollars. For a year, thousands. Real money that compounds.
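The scale math, using the post's figures (1000 generations a day, 2000-token briefs, 70% cacheable, cached reads at ~10% of base price) and a placeholder input price; cache-write premiums and occasional misses are ignored for simplicity:

```python
GENERATIONS_PER_DAY = 1000
BRIEF_TOKENS = 2000
CACHEABLE_FRACTION = 0.70
CACHE_READ_DISCOUNT = 0.10     # cached tokens at ~10% of base input price
INPUT_PRICE_PER_MTOK = 3.00    # placeholder $/million input tokens

daily_tokens = GENERATIONS_PER_DAY * BRIEF_TOKENS
full_cost = daily_tokens / 1e6 * INPUT_PRICE_PER_MTOK
cached_cost = full_cost * (
    (1 - CACHEABLE_FRACTION) + CACHEABLE_FRACTION * CACHE_READ_DISCOUNT
)

reduction = 1 - cached_cost / full_cost
print(f"daily: ${full_cost:.2f} -> ${cached_cost:.2f} ({reduction:.0%} less)")
# prints: daily: $6.00 -> $2.22 (63% less)
```

At the placeholder price that's roughly $110 a month saved on input tokens; the percentage holds at any price, since the discount is multiplicative.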

The takeaway

If your AI workload reuses the same prompt prefix across generations — character blocks, system rules, format specs — use prompt caching. The engineering is minimal (one API parameter). The savings are significant.

Most teams ignore caching because they're early in usage and the savings don't matter yet. That's a mistake. Set up caching before it matters, so when you hit scale, your cost curve is already under control.

Cost engineering compounds. Pay attention to it early.
