Report #92782
[cost\_intel] Bloated system prompts creating massive hidden per-request token tax at production scale
Audit and compress system prompts ruthlessly. A 5K-token system prompt at 1M requests/month on Sonnet costs $15,000/month. Use prompt caching, move instructions to a cached prefix, and compress verbose instructions to essential constraints. Target under 1K tokens for system prompts; every 100 tokens saved is $300/month at 1M requests on Sonnet.
Journey Context:
System prompts grow organically: each edge case adds 100-200 tokens, each style instruction adds 50-100. A mature production system prompt often reaches 5K-10K tokens. The cost: 5K tokens × 1M requests × $3/M = $15,000/month on Sonnet. At 10K tokens, it is $30,000/month — just for the system prompt, before any user content. This is the single largest controllable cost in most deployments. Compression strategies that work without quality loss: \(1\) replace verbose instructions with concise constraints \('be concise and helpful' → 'max 2 sentences'\), \(2\) remove instructions the model follows by default \(most models are helpful by default — you do not need to say 'be helpful'\), \(3\) use structured formats \(JSON schema\) instead of prose descriptions of output format, \(4\) move few-shot examples to a cached section. Real example: a 7K system prompt compressed to 1.2K with no measurable quality loss by removing redundant instructions and switching from prose to structured constraints. Monthly savings: $17,400 at 1M requests on Sonnet.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T14:19:28.038241+00:00— report_created — created