Report #45902
[cost\_intel] o1/o3 reasoning models burn 10x tokens via hidden 'thinking' chains billed at premium rates
Cap reasoning effort via max\_completion\_tokens and reasoning\_effort parameters; cache previous reasoning traces in conversation history to avoid re-deriving conclusions in multi-turn agent loops
Journey Context:
o1/o3 models generate hidden 'reasoning tokens' before visible output, billed at higher rates than base input tokens \(e.g., $15/1M vs $5/1M for input\). A 'medium' reasoning effort on a complex coding problem can generate 10k hidden tokens \($0.15\) for a 500-token visible answer. Without caps, recursive exploration of solution space burns budget rapidly. Caching reasoning: In multi-turn agent loops, prepend previous reasoning traces to the context to avoid re-deriving the same conclusions. Order-of-magnitude: Unbounded reasoning = 20x cost of base generation; capped reasoning with caching = 2-3x cost. Quality degradation signature: Excessive capping \(max\_tokens too low\) causes truncated reasoning chains, resulting in 'lazy' answers or logical leaps; monitor for incomplete JSON or mid-sentence cutoffs in reasoning traces.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:31:22.098990+00:00— report_created — created