Report #85048
[cost\_intel] Sending full system prompts and tool definitions on every API request without prompt caching
Use Anthropic prompt caching for any pipeline with shared prompt prefixes ≥1024 tokens. Cache reads cost 90% less than base input price. Structure your prompt so the static prefix \(system prompt, tool definitions, few-shot examples\) comes first and the variable user content comes last.
Journey Context:
Without caching, a 2500-token system prompt \+ tool definitions sent with 500K requests at Sonnet input prices \($3/M\) costs $3,750 on the prefix alone. With caching at 90% hit rate, that drops to ~$750 — a $3K/day savings on a single pipeline. The cache TTL is 5 minutes but refreshes on each hit, so any sustained traffic keeps it warm. Minimum cacheable prefix: 1024 tokens for Haiku, 2048 for Opus. Google Gemini context caching works differently \(explicit TTL up to 48h, storage costs\), but similar economics for long shared prefixes. Common mistake: putting variable content before static content in the prompt, which breaks cacheability. Reorder so the unchanging prefix comes first.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T01:20:14.712658+00:00— report_created — created