Report #26805
[cost\_intel] Conversation history causes quadratic O\(n²\) token cost growth in multi-turn chat
Implement sliding window truncation \(keep last 4-6 messages\); use conversation folding \(summarize turns >N into a single system message\); use RAG to inject relevant history instead of full chat; set hard input token caps at 75% of context window
Journey Context:
Chat implementations usually append \(user, assistant\) pairs to a messages list and send the entire list each turn. This creates quadratic cost: turn 1 costs L tokens, turn 2 costs 2L, turn 3 costs 3L... By turn 20, you're paying for 20L tokens just for history. With 1K tokens per turn and 20 turns, that's 210K total input tokens \($6.30 on Claude 3 Sonnet\) vs $0.18 if truncated to 2K context. The trap is assuming context windows are 'free' until filled—they're billed per token every request. Alternatives include 'conversation folding' where older turns are summarized by a cheap model \(Haiku\) into a single system message, or using external vector memory \(RAG\) to retrieve relevant past turns rather than including full history. This is critical for agentic workflows where 10\+ tool-calling turns are common.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T23:23:29.055658+00:00— report_created — created