Report #85708
[cost_intel] How does forcing Chain-of-Thought reasoning silently increase API costs by 10x on classification tasks?
Strip Chain-of-Thought prompts for binary classification; use logprobs to derive confidence instead of generating reasoning text, reducing output tokens from about 500 to about 5 per query.
Journey Context:
Engineers often apply the 'let's think step by step' prompt from the Chain-of-Thought literature to every task, including simple binary classification. The model then generates 200-500 tokens of reasoning before the final 'YES' or 'NO'. At $10 per million output tokens (the GPT-4 rate assumed here), that is $0.002-$0.005 per query. Asking only for 'YES' or 'NO' without CoT reduces the output to 1 token ($0.00001). The accuracy drop is often <1% for binary tasks. The fix is to use the logprobs API: query without CoT and check the logprob of the top token. If it is above -0.1 (a probability of roughly 0.90 or higher, i.e. high confidence), accept the answer. If the model is uncertain, fall back to CoT. This hybrid approach cuts output costs by about 90% while preserving accuracy.
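The gating logic and the cost arithmetic above can be sketched as follows. This is a minimal illustration, not a tested production recipe. The `fast` and `slow` callables are hypothetical wrappers around whatever API client is in use: `fast` is assumed to return the top token and its logprob from a one-token, no-CoT call, and `slow` is assumed to run the CoT prompt and return only the parsed final label. The $10 per million output tokens price, the 500-token CoT length, and the 10% fallback rate are assumptions taken from or added to the figures above. Input-token cost is ignored.

```python
import math
from typing import Callable, Tuple

# Assumed price: USD per million output tokens (the $10 figure quoted above).
PRICE_PER_MTOK = 10.0


def output_cost(tokens: int, price_per_mtok: float = PRICE_PER_MTOK) -> float:
    """Output-side cost in USD for a single response of `tokens` tokens."""
    return tokens * price_per_mtok / 1_000_000


def is_confident(logprob: float, threshold: float = -0.1) -> bool:
    """True when the top token's logprob clears the threshold.

    A logprob of -0.1 is a probability of exp(-0.1), roughly 0.905.
    """
    return logprob > threshold


def classify(
    text: str,
    fast: Callable[[str], Tuple[str, float]],
    slow: Callable[[str], str],
    threshold: float = -0.1,
) -> Tuple[str, str]:
    """Return (label, path): try the one-token call, fall back to CoT.

    `fast` and `slow` are hypothetical wrappers supplied by the caller.
    """
    label, logprob = fast(text)
    if is_confident(logprob, threshold):
        return label.strip().upper(), "fast"
    return slow(text).strip().upper(), "cot"


def blended_cost(
    fallback_rate: float, fast_tokens: int = 1, cot_tokens: int = 500
) -> float:
    """Expected output cost per query under the hybrid scheme.

    Every query pays for the fast call; a `fallback_rate` fraction of
    queries additionally pays for a CoT call.
    """
    return output_cost(fast_tokens) + fallback_rate * output_cost(cot_tokens)


if __name__ == "__main__":
    always_cot = output_cost(500)
    hybrid = blended_cost(fallback_rate=0.10)  # assumed 10% fallback rate
    print(f"always CoT: ${always_cot:.5f} per query")
    print(f"hybrid:     ${hybrid:.5f} per query")
    print(f"saving:     {1 - hybrid / always_cot:.1%}")
    print(f"threshold probability: {math.exp(-0.1):.3f}")
```

Under these assumptions the saving comes out a little under 90%, and it shrinks as the fallback rate grows, so the fallback rate on real traffic is the number worth measuring before relying on the estimate.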
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T02:27:03.088111+00:00 - report_created - created