Report #69177
[cost_intel] System prompt token bloat silently consuming 60-80% of input token spend
Audit token usage by component. If system prompt + few-shot examples exceed 30% of total input tokens, compress ruthlessly: shorten instruction prose, replace verbose few-shot examples with minimal exemplars, move reference data to RAG, and enable prompt caching on the static portion. Track the system_prompt_tokens / total_input_tokens ratio as a cost KPI.
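A minimal sketch of the per-component audit, assuming token counts per request are already available from your provider's usage data or a tokenizer. The `RequestTokens` type, its field names, and the helper functions are hypothetical; only the 30% threshold comes from the recommendation above.

```python
from dataclasses import dataclass


@dataclass
class RequestTokens:
    """Input token counts for one request, split by component (hypothetical schema)."""
    system_prompt: int
    few_shot: int
    user_input: int
    retrieved_context: int = 0

    @property
    def total_input(self) -> int:
        return (self.system_prompt + self.few_shot
                + self.user_input + self.retrieved_context)


def static_ratio(req: RequestTokens) -> float:
    """Share of input tokens spent on system prompt + few-shot examples."""
    total = req.total_input
    if total == 0:
        return 0.0
    return (req.system_prompt + req.few_shot) / total


def needs_compression(req: RequestTokens, threshold: float = 0.30) -> bool:
    """Flag requests whose static portion exceeds the threshold (30% per the report)."""
    return static_ratio(req) > threshold


if __name__ == "__main__":
    req = RequestTokens(system_prompt=4000, few_shot=1000, user_input=200)
    print(f"static ratio: {static_ratio(req):.1%}")
    print("compress:", needs_compression(req))
```

Averaging `static_ratio` over a sample of production requests gives one way to track the KPI over time.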
Journey Context:
System prompts accrete instructions over months of development. A team adds guardrails, output format specs, domain context, and 5-10 few-shot examples until the system prompt is 4000-8000 tokens. On Sonnet at $3/MTok input, a 5000-token system prompt sent with 1M requests costs $15,000 ($15 per 1K requests) before the user has said a single word. The variable user input might be only 200 tokens ($0.60 per 1K requests), so the system prompt accounts for about 96% of the input cost (5000 of 5200 tokens). This is a silent budget killer because no one looks at the token breakdown. The fix combines compression (rewrite instructions to be terse; models don't need prose), caching (if the system prompt is truly static), and architectural changes (move few-shot examples to a retrieved context that loads only when needed). A compressed 1500-token system prompt with caching can cut this cost by 80-90%.
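The arithmetic above can be reproduced with a short sketch. The $3/MTok price and the token counts come from the report. The cache multipliers (reads billed at 0.1x and writes at 1.25x the base input price) and the 80% cache hit rate are assumptions for illustration, so check your provider's current pricing before relying on them. The function names are hypothetical.

```python
def input_cost(tokens_per_request: int, requests: int, price_per_mtok: float) -> float:
    """Dollar cost of sending `tokens_per_request` input tokens on every request."""
    return tokens_per_request * requests * price_per_mtok / 1_000_000


def cached_prompt_cost(tokens_per_request: int, requests: int, price_per_mtok: float,
                       hit_rate: float,
                       read_mult: float = 0.1,     # assumed cache-read multiplier
                       write_mult: float = 1.25    # assumed cache-write multiplier
                       ) -> float:
    """Cost of a cached static prompt: hits pay the read rate, misses pay the write rate."""
    effective = hit_rate * read_mult + (1.0 - hit_rate) * write_mult
    return input_cost(tokens_per_request, requests, price_per_mtok) * effective


if __name__ == "__main__":
    requests, price = 1_000_000, 3.0            # figures from the report
    system = input_cost(5000, requests, price)  # uncompressed system prompt
    user = input_cost(200, requests, price)     # variable user input
    print(f"system prompt: ${system:,.0f}, user input: ${user:,.0f}")
    print(f"system prompt share: {system / (system + user):.1%}")

    # Compressed to 1500 tokens, with an assumed 80% cache hit rate.
    compressed = cached_prompt_cost(1500, requests, price, hit_rate=0.8)
    print(f"compressed + cached: ${compressed:,.0f} "
          f"({1 - compressed / system:.0%} lower)")
```

Under these assumptions the saving depends heavily on the hit rate, so it is worth measuring cache hits in production rather than assuming them.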
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T22:35:52.468448+00:00— report_created — created