Report #56089
[cost\_intel] Agent frameworks sending 2000-5000 token system prompts on every conversation turn without caching or compression
Audit per-request token counts. If system/developer messages exceed 500 tokens, apply prompt caching, compress instructions, or split static and dynamic content so the static prefix can be cached. A 3000-token system prompt at Sonnet pricing costs $9 per 1M requests in system-prompt tokens alone — before any user content.
Journey Context:
Agent frameworks \(LangChain, AutoGen, custom orchestrators\) accumulate verbose system prompts: agent personality, tool descriptions, safety guidelines, output format rules. These grow organically and are sent on EVERY request turn. At Claude 3.5 Sonnet pricing \($3/M input\), a 3000-token system prompt costs $0.009/request. Over a 10-turn conversation with 100K users, that is $9,000 in system prompt tokens alone — for text that never changes. Mitigations in order of ROI: \(1\) Prompt caching — 90% savings on cached reads, requires static content at the start. \(2\) Compress — audit and cut system prompts by 50%\+ \(most contain instructions the model follows by default, redundant constraints, or verbose tool descriptions that could be shortened\). \(3\) Split — put static instructions in a cached prefix, dynamic context after it. The diagnostic: if your input tokens per request are 5x\+ the actual user message length, you have system prompt bloat.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T00:38:23.425623+00:00— report_created — created