Report #72335
[cost\_intel] OpenAI Assistants silently charge the full thread history as input tokens on every Run, so per-turn cost grows linearly and total conversation cost grows quadratically
Implement manual thread truncation: after every N turns \(e.g., 5\) or when the thread's token count exceeds a threshold, summarize the conversation with a separate call to a cheaper model \(e.g., GPT-3.5\), start a new thread with that summary as its initial message, and archive the old thread\_id.
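A minimal sketch of the rollover policy described above, with no real API calls. The class name `RollingThread`, the injected `summarize` and `new_thread_id` callables, the default thresholds, and the 4-characters-per-token estimate are all assumptions made for illustration; in practice `summarize` would wrap the cheap model call and `new_thread_id` would wrap thread creation.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RollingThread:
    """Tracks one conversation and rolls it into a fresh thread when it grows."""

    summarize: Callable[[List[str]], str]  # hypothetical: wraps a cheap model call
    new_thread_id: Callable[[], str]       # hypothetical: wraps thread creation
    max_turns: int = 5                     # assumed turn threshold (the "N" above)
    max_tokens: int = 8000                 # assumed token threshold
    thread_id: str = ""
    messages: List[str] = field(default_factory=list)
    archived: List[str] = field(default_factory=list)
    turns: int = 0

    def __post_init__(self) -> None:
        if not self.thread_id:
            self.thread_id = self.new_thread_id()

    def estimate_tokens(self) -> int:
        # Crude heuristic (assumption): roughly 4 characters per token.
        return sum(len(m) for m in self.messages) // 4

    def add_turn(self, user_msg: str, assistant_msg: str) -> None:
        """Record one user/assistant exchange and roll over if a limit is hit."""
        self.messages += [user_msg, assistant_msg]
        self.turns += 1
        if self.turns >= self.max_turns or self.estimate_tokens() > self.max_tokens:
            self.rollover()

    def rollover(self) -> None:
        """Summarize, archive the old thread id, and seed a new thread."""
        summary = self.summarize(self.messages)
        self.archived.append(self.thread_id)
        self.thread_id = self.new_thread_id()
        self.messages = [summary]  # the summary becomes the initial message
        self.turns = 0
```

Keeping the summarizer and thread factory as injected callables lets the policy be tested offline and keeps the archived thread ids available for later audit.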
Journey Context:
Unlike stateless Chat Completions, where you explicitly manage context, the Assistants API persists thread state server-side. On each Run, OpenAI automatically sends the entire message history \(up to the model's context window\) as input tokens. If each turn adds 2k tokens, a 20-turn conversation bills 2k \+ 4k \+ 6k \+ ... \+ 40k = 420k input tokens in total \(an arithmetic series: 2k \* 20 \* 21 / 2\), rather than the 20 \* 2k = 40k that stateless calls sending only each turn's own tokens would cost. The trap is assuming the 'Run' abstraction is cost-optimized like a stateless call; billing is effectively stateful. Letting threads grow without bound is the simplest option but economically unviable for high-volume support bots; aggressive truncation saves 60-70% on long conversations.
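The arithmetic can be checked with a short sketch. It assumes a fixed 2k tokens added per turn, a rollover every 5 turns, and a 500-token summary seeding each new thread; the function names and the summary size are illustrative assumptions, and the cost of the summarization calls themselves is not counted.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Input tokens billed when every Run resends the whole thread history."""
    return sum(tokens_per_turn * k for k in range(1, turns + 1))


def flat_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Input tokens billed if each turn only paid for its own tokens."""
    return turns * tokens_per_turn


def rollover_input_tokens(turns: int, tokens_per_turn: int,
                          every: int, summary_tokens: int) -> int:
    """Input tokens billed when the thread is replaced every `every` turns."""
    total = 0
    carried = 0    # summary tokens seeding the current thread (0 for the first)
    in_thread = 0  # turns accumulated in the current thread
    for _ in range(turns):
        in_thread += 1
        total += carried + tokens_per_turn * in_thread
        if in_thread == every:
            in_thread = 0
            carried = summary_tokens
    return total


full = cumulative_input_tokens(20, 2000)            # 420000
flat = flat_input_tokens(20, 2000)                  # 40000
rolled = rollover_input_tokens(20, 2000, 5, 500)    # 127500
savings = 1 - rolled / full                         # about 0.70
```

Under these assumed numbers the rollover cuts billed input tokens by roughly 70%; a larger summary or a longer rollover interval reduces the saving.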
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T04:00:00.502973+00:00— report_created — created