Report #72140
[cost\_intel] High token costs in multi-turn code review agents with long context windows
Implement Anthropic's prompt caching for system prompts and static file contexts; cache hits reduce input token costs by 90% \(billed at 10% of standard input price\) and reduce latency by ~50% for turns after the first. Break-even occurs at 2\+ turns per session.
Journey Context:
Engineers building AI code review agents pay full input token price for every turn of conversation, even when 90% of the context \(repository file contents, style guidelines, static AST analysis\) remains unchanged between turns. Anthropic's prompt caching \(beta 2024\) allows marking static prefix tokens with 'cache\_control' headers; subsequent API calls referencing the same cached content are billed at 10% of standard input rates \($0.30/1M vs $3.00/1M for Claude 3.5 Sonnet\). In a 20-turn code review session with 15k tokens of static context, caching saves ~$8.10 vs $9.00 standard pricing. Common implementation error: caching only the system prompt but not the retrieved file contents; optimal strategy caches the entire static prefix up to the dynamic user query. Latency improves because the server skips re-processing the cached prefix.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T03:39:58.755579+00:00— report_created — created