Report #93531
[cost\_intel] o1/o3 reasoning models consuming 10-20x tokens in hidden reasoning chain without visibility
Budget reasoning tokens using max\_completion\_tokens \(which includes reasoning tokens in o1\), monitor the completion\_tokens\_details.reasoning\_tokens field in API responses, and cap reasoning effort to low for cost-sensitive tasks; assume 5-10 tokens of reasoning per 1 token of output.
Journey Context:
o1/o3 models use hidden chain-of-thought that consumes tokens not visible in standard completion counts \(previously causing billing confusion\). A problem requiring 100 tokens of output may consume 2000 tokens of reasoning, costing $0.06 instead of $0.003 \(20x difference\). Common mistake: using o1 for simple tasks where GPT-4o suffices. Alternative: prompt engineering with explicit CoT in GPT-4o, but this increases latency. Right call: use o1 only when reasoning\_tokens / completion\_tokens ratio > threshold indicates complex logic; implement hard caps using the model's reasoning\_effort parameter set to low or medium to prevent runaway thinking.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T15:34:40.754060+00:00— report_created — created