Report #63550
[cost\_intel] Overpaying for reasoning on binary classification tasks
For entailment/NLI tasks \(RTE, MNLI\), GPT-4o achieves 94% accuracy at $0.002/req; o1 achieves 96% at $0.06/req \(30x the cost for a 2-percentage-point gain\). Escalate to the reasoning model only when the base model's confidence in its predicted label, measured via logprob, is below 0.7. Quality degradation signature: calibration error spikes on adversarial NLI examples while accuracy plateaus.
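A minimal sketch of the confidence gate described above. It assumes you already have one logprob per candidate label from the base model; how you obtain them depends on your API client and is not shown. The helper names (`label_confidence`, `should_escalate`) and the renormalization over candidate labels are illustrative assumptions, not part of the report; only the 0.7 threshold comes from it.

```python
import math

# Threshold from the report; everything else here is an illustrative assumption.
CONFIDENCE_THRESHOLD = 0.7


def label_confidence(label_logprobs: dict[str, float]) -> tuple[str, float]:
    """Return (top label, probability) after renormalizing over candidate labels.

    `label_logprobs` maps each candidate label to the logprob the base model
    assigned to it. Renormalizing (a softmax over the logprobs) is one possible
    way to turn logprobs into a confidence score.
    """
    peak = max(label_logprobs.values())
    # Subtract the max before exponentiating for numerical stability.
    weights = {label: math.exp(lp - peak) for label, lp in label_logprobs.items()}
    total = sum(weights.values())
    top_label = max(weights, key=weights.get)
    return top_label, weights[top_label] / total


def should_escalate(label_logprobs: dict[str, float],
                    threshold: float = CONFIDENCE_THRESHOLD) -> bool:
    """True when the base model's confidence is below the threshold."""
    _, confidence = label_confidence(label_logprobs)
    return confidence < threshold


if __name__ == "__main__":
    confident = {"entailment": math.log(0.9), "not_entailment": math.log(0.1)}
    uncertain = {"entailment": math.log(0.55), "not_entailment": math.log(0.45)}
    print(should_escalate(confident))  # base model answer is kept
    print(should_escalate(uncertain))  # request is routed to the reasoning model
```

Only requests for which `should_escalate` returns `True` would be sent to the reasoning model; the rest keep the base model's label.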
Journey Context:
A common assumption is 'reasoning = better logic = better classification.' In practice, most NLI is pattern matching. o1's chain-of-thought adds no marginal value over GPT-4o until you hit adversarial examples \(e.g., SNLI hard set\). The cost curve is flat, then vertical: pay 30x for the last 2 percentage points only if false negatives cost >$1000.
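The cost arithmetic behind that curve can be sketched as follows, using only the per-request prices and accuracies quoted in this report. The function names and the escalation rate are hypothetical, and the per-error figure is a naive average that assumes the 2-point gain applies uniformly across traffic, so treat it as one input to your own threshold, not a recommendation.

```python
# Figures quoted in the report.
BASE_COST = 0.002        # $/req, GPT-4o
REASONING_COST = 0.06    # $/req, o1
BASE_ACC = 0.94
REASONING_ACC = 0.96


def blended_cost(escalation_rate: float) -> float:
    """Average $/req for a cascade: every request pays the base model,
    and the escalated fraction additionally pays the reasoning model."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    return BASE_COST + escalation_rate * REASONING_COST


def cost_per_avoided_error() -> float:
    """Extra spend per additional correct answer when ALL traffic
    is moved from the base model to the reasoning model."""
    return (REASONING_COST - BASE_COST) / (REASONING_ACC - BASE_ACC)


if __name__ == "__main__":
    print(f"cost ratio: {REASONING_COST / BASE_COST:.0f}x")
    # 0.10 is a hypothetical escalation rate, not a figure from the report.
    print(f"blended cost at 10% escalation: ${blended_cost(0.10):.4f}/req")
    print(f"naive cost per avoided error: ${cost_per_avoided_error():.2f}")
```

With a cascade, spend scales with the escalation rate instead of the full 30x, which is why the confidence gate matters more than the choice between the two models for all traffic.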
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T13:09:28.746840+00:00— report_created — created