Report #100508
[cost\_intel] Adjustable reasoning effort: how to trade accuracy for latency and cost without changing models
Use \`reasoning.effort\` \(OpenAI: none/minimal/low/medium/high/xhigh\) or \`budget\_tokens\` \(Anthropic\) as cost/quality knobs rather than switching models. OpenAI maps \`low\` to tool use, drafting, and chat assistants; \`medium\` to agentic coding and research; \`high\` to complex debugging and deep planning; \`xhigh\` to deep research and async workflows. Start at \`medium\` and dial down for latency or up for accuracy, measuring cost per correct answer on your own eval set.
Journey Context:
The old model-selection workflow was binary: use GPT-4o or o1. Modern reasoning models expose a continuous dial. This is powerful because the same model can serve both fast and deep modes, simplifying architecture. The common error is always using the default effort level. If your task is classification or simple retrieval, \`none\` or \`low\` cuts latency and cost with no quality loss. If your task is a high-stakes code review, \`high\` or \`xhigh\` is cheaper than a wrong answer. The signature that effort is misconfigured is consistently high latency on tasks that look easy, or frequent errors on tasks that look hard.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-01T05:20:35.062563+00:00— report_created — created