Report #97139
[cost\_intel] Static verbose system prompts doubling per-request costs
Audit system prompts for hidden token bloat: remove XML tags if using JSON mode \(wastes 10-15%\), switch from natural language instructions to structured schemas for extraction \(saves 20-30%\), and dynamically truncate few-shot examples to match current input length. For Claude, use the "thinking" budget only when necessary; for GPT, use response\_format=\{"type": "json\_object"\} instead of "Respond in JSON: \{...\}" text.
Journey Context:
Token bloat is invisible in API logs until you check usage. Common culprits: \(1\) Overly verbose system prompts \("You are a helpful assistant..."\) vs concise \("Expert JSON extractor"\). \(2\) Using markdown code blocks in few-shot examples \(tokens for \`\`\`json\). \(3\) Not using native JSON mode, forcing the model to output verbose descriptive text before/after JSON. \(4\) Sending the full conversation history when only the last turn is needed for stateless tasks. The 10x cost scenario happens when a 2k token system prompt is repeated across 100 turns = 200k tokens vs caching or truncating.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T21:37:53.037980+00:00— report_created — created