Report #35153
[cost_intel] Chaining cheap instruct + reasoning check vs end-to-end reasoning: wrong threshold on error rate
Use a verification chain (cheap generation + reasoning critique) when base model accuracy is 60-85%; use a pure reasoning model when base accuracy is <40% or when verification is as complex as generation (e.g., math proofs).
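The rule above can be sketched as a small routing function. This is a minimal illustration, not a tested policy: the function name, parameter names, and return labels are hypothetical, and the report gives no guidance for accuracy between 40% and 60% or above 85%, so those ranges are returned as "unspecified" rather than guessed.

```python
def choose_architecture(base_accuracy: float,
                        verification_as_hard_as_generation: bool) -> str:
    """Pick an architecture from the thresholds stated in the report.

    base_accuracy is the cheap model's measured accuracy in [0, 1].
    All names and return labels here are illustrative assumptions.
    """
    if not 0.0 <= base_accuracy <= 1.0:
        raise ValueError("base_accuracy must be between 0 and 1")

    # Verification as hard as generation (e.g., math proofs): no verifier gap.
    if verification_as_hard_as_generation:
        return "pure_reasoning"

    # Below 40%, too few cheap samples are correct for a judge to pick from.
    if base_accuracy < 0.40:
        return "pure_reasoning"

    # The 60-85% band is where the report recommends the verifier chain.
    if 0.60 <= base_accuracy <= 0.85:
        return "verifier_chain"

    # 40-60% and above 85% are not covered by the report.
    return "unspecified"
```

For example, `choose_architecture(0.70, False)` selects the verifier chain, while the same accuracy with `verification_as_hard_as_generation=True` selects pure reasoning.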
Journey Context:
The "verifier gap" (how much easier it is to check an answer than to produce it) determines the optimal architecture. If a cheap model (e.g., GPT-4o) solves math problems correctly 70% of the time, routing every query to o1-preview ($60 vs $2.50 per 1M tokens) is wasteful. Instead, generate 3-4 samples with the cheap model ($10 total for 4 samples), then use the reasoning model as a judge ($5) to pick the best sample or verify correctness. The chain costs $15 vs $60 for pure reasoning. However, if cheap-model accuracy drops below 40%, the probability that at least one of N samples is correct falls too low: even at exactly 40% accuracy, all 4 samples are wrong with probability 0.6^4 ≈ 13%, so reaching a usable hit rate requires many more samples, which negates the savings. A second cliff occurs when verification is as hard as generation (e.g., formal theorem proving), where judging a proof requires the same depth of reasoning as writing it. Signature: if the verification prompt looks like "Explain why this is wrong" and requires multi-step reasoning, use pure o1; if it looks like "Check if output matches regex/JSON", use the verifier chain.
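The arithmetic above can be reproduced with a short sketch. It rests on assumptions the report does not state: samples are treated as independent, the judge is assumed to always recognize a correct sample when one exists, and the judge cost is held flat regardless of how many samples it reads. The function names are hypothetical, and the dollar figures are the report's own.

```python
def p_at_least_one_correct(accuracy: float, n_samples: int) -> float:
    """Probability that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - accuracy) ** n_samples


def chain_cost(cost_per_sample: float, n_samples: int, judge_cost: float) -> float:
    """Total cost of n cheap samples plus one judging pass (flat judge cost assumed)."""
    return cost_per_sample * n_samples + judge_cost


def samples_needed(accuracy: float, target: float, max_samples: int = 1000) -> int:
    """Smallest n such that P(at least one correct sample) >= target."""
    if not 0.0 < accuracy <= 1.0 or not 0.0 < target < 1.0:
        raise ValueError("accuracy must be in (0, 1] and target in (0, 1)")
    for n in range(1, max_samples + 1):
        if p_at_least_one_correct(accuracy, n) >= target:
            return n
    raise ValueError("target not reachable within max_samples")


if __name__ == "__main__":
    # Report's figures: $2.50 per cheap sample, $5 judge, $60 pure reasoning.
    print(chain_cost(2.50, 4, 5.00))             # cost of the 4-sample chain
    print(1.0 - p_at_least_one_correct(0.4, 4))  # chance all 4 samples are wrong
    print(samples_needed(0.4, 0.99))             # samples for a 99% hit rate
```

Under these assumptions a 70%-accurate model reaches a 99% chance of at least one correct sample with 4 samples, while a 40%-accurate model needs 10, which is where the sample count starts eating into the savings.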
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T13:28:50.280266+00:00 - report_created - created