Report #37831
[cost\_intel] Applying chain-of-thought prompting to classification and extraction tasks
Use CoT only for tasks requiring mathematical reasoning, logical deduction, or multi-step planning. For classification, extraction, formatting, and lookup tasks, direct prompting achieves equivalent quality with 3-5x fewer output tokens. A/B test with and without CoT on 200 examples: if CoT improves accuracy by less than 2 percentage points, remove it.
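The A/B rule above can be sketched as a small check. This is a minimal illustration, not a prescribed harness: the function names `accuracy` and `should_remove_cot` and the `min_gain` parameter are hypothetical, and it assumes you have already collected predictions from both prompt variants on the same labeled examples.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions that exactly match their gold label."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def should_remove_cot(
    labels: Sequence[str],
    cot_predictions: Sequence[str],
    direct_predictions: Sequence[str],
    min_gain: float = 0.02,  # hypothetical name for the 2-point threshold
) -> bool:
    """Return True when CoT does not beat direct prompting by at least min_gain.

    A negative gain (direct prompting is more accurate) also returns True.
    """
    gain = accuracy(cot_predictions, labels) - accuracy(direct_predictions, labels)
    return gain < min_gain
```

With 200 examples, a 2-point gap is only 4 examples, so it may be worth repeating the comparison on a second sample before acting on a borderline result.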
Journey Context:
CoT typically increases output tokens by 3-5x \(the reasoning chain\) for a 0-5% quality improvement on deterministic tasks, and the multiplier is far larger when the answer itself is only a short label. On Sonnet \($15/M output\), a classification task producing 500 tokens of reasoning plus 20 tokens of answer costs $7.80 per 1,000 calls vs $0.30 for a direct 20-token answer, which is 26x more expensive. The quality improvement from CoT is well-documented for math and reasoning \(GSM8K: 30-40% improvement\) but near-zero for pattern-matching. Non-obvious finding: for some classification tasks, CoT actually degrades quality because the model overthinks and talks itself out of the correct first-impression answer. The signature of CoT waste: if removing CoT changes accuracy by less than 2% on your held-out test set, it is pure cost. Worse, the reasoning chain is billed entirely as output tokens, the most expensive tier \(output pricing is 3-5x input pricing on most models\).
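The arithmetic for the Sonnet example can be worked through directly from the stated figures \(500 reasoning tokens, 20 answer tokens, $15 per million output tokens\). The sketch below assumes a batch of 1,000 calls as the unit of comparison; the helper name `output_cost_usd` is hypothetical, and input-token cost is ignored because it is roughly the same for both variants.

```python
def output_cost_usd(tokens_per_call: int, calls: int, usd_per_million_tokens: float) -> float:
    """Output-token cost in USD for a batch of calls."""
    return tokens_per_call * calls * usd_per_million_tokens / 1_000_000


PRICE_PER_MILLION = 15.0  # Sonnet output price quoted in the report
CALLS = 1_000             # assumed batch size for the comparison

cot_cost = output_cost_usd(500 + 20, CALLS, PRICE_PER_MILLION)  # reasoning + answer
direct_cost = output_cost_usd(20, CALLS, PRICE_PER_MILLION)     # answer only
ratio = cot_cost / direct_cost

print(f"CoT: ${cot_cost:.2f}  direct: ${direct_cost:.2f}  ratio: {ratio:.0f}x")
```

Swapping in your own token counts and price shows how quickly the ratio grows as the direct answer gets shorter relative to the reasoning chain.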
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T17:58:49.443968+00:00— report_created — created