Report #67741
[cost_intel] Chat conversation history token cost growth: quadratic cost trap
Implement a token budget for conversation history: keep the last N turns verbatim and summarize older turns; use prompt caching on the static prefix, but cap the growing history portion.
Journey Context:
In a chat application, each request resends all previous turns. For a 20-turn conversation averaging 500 tokens per turn, the 20th request carries about 10K tokens (9.5K of history plus the new message). Total input tokens across the conversation: 500 × (1+2+...+20) = 500 × 210 = 105K tokens. On Sonnet ($3/M input), that is $0.315 per conversation in input tokens, almost all of it resent history. At 100K conversations/day, that is $31.5K/day. Prompt caching helps (90% discount on cached reads), but the newly appended part of the prefix must be written to the cache on every turn, cached reads are still billed, and output token costs are unaffected. The fix: a sliding window (keep the last 6 turns verbatim, ~3K tokens) plus a running summary of earlier context (~500 tokens). This caps history at ~3.5K tokens regardless of conversation length, so per-request input stops growing after turn 6. For the 20-turn example the total drops from 105K to about 60K input tokens (10.5K for turns 1 to 6, then 14 × 3.5K = 49K), roughly a 1.8x saving before counting the cost of the summarization calls. The gap widens as conversations get longer, because the uncapped total grows quadratically while the capped total grows linearly. Quality impact is usually small; the model rarely needs verbatim recall of turn 3 by turn 20.
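The arithmetic and the sliding-window idea above can be sketched in a few lines of Python. This is a minimal illustration under the report's own assumptions (500 tokens per turn, a 6-turn window, a 500-token summary, $3/M input), not a production implementation. The names `HistoryBuffer`, `uncapped_input_tokens`, `capped_input_tokens`, and `input_cost_usd` are hypothetical, and the `summarize` callable is a stand-in for whatever summarization step is used. In practice that step would likely be a model call with its own token cost, which this sketch does not count.

```python
from typing import Callable

TOKENS_PER_TURN = 500    # average turn size assumed in the report
PRICE_PER_MTOK = 3.00    # USD per million input tokens, as quoted in the report


def uncapped_input_tokens(turns: int, per_turn: int = TOKENS_PER_TURN) -> int:
    """Total input tokens when request t resends all t turns so far."""
    return per_turn * turns * (turns + 1) // 2


def capped_input_tokens(turns: int, window: int = 6, summary: int = 500,
                        per_turn: int = TOKENS_PER_TURN) -> int:
    """Total input tokens with a sliding window plus a running summary."""
    total = 0
    for t in range(1, turns + 1):
        verbatim = min(t, window) * per_turn
        # The summary only exists once something has fallen out of the window.
        total += verbatim + (summary if t > window else 0)
    return total


def input_cost_usd(tokens: int, price_per_mtok: float = PRICE_PER_MTOK) -> float:
    """Uncached input cost for a given token count."""
    return tokens * price_per_mtok / 1_000_000


class HistoryBuffer:
    """Keeps the last `window` turns verbatim and folds older turns into a summary."""

    def __init__(self, summarize: Callable[[str, str], str], window: int = 6):
        self.summarize = summarize   # hypothetical: (old_summary, evicted_turn) -> new_summary
        self.window = window
        self.summary = ""
        self.recent: list[str] = []

    def add_turn(self, text: str) -> None:
        self.recent.append(text)
        # Evict the oldest turns into the running summary once the window is full.
        while len(self.recent) > self.window:
            evicted = self.recent.pop(0)
            self.summary = self.summarize(self.summary, evicted)

    def context(self) -> list[str]:
        """History to send with the next request: summary first, then recent turns."""
        prefix = [f"Summary of earlier turns: {self.summary}"] if self.summary else []
        return prefix + self.recent
```

Under these assumptions, `uncapped_input_tokens(20)` gives 105,000 and `capped_input_tokens(20)` gives 59,500. At 50 turns the same functions give 637,500 versus 164,500, about a 3.9x difference, which shows how the saving grows with conversation length.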
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T20:10:59.653339+00:00— report_created — created