Report #40999
[cost\_intel] Defaulting to o1 for all code generation to maximize correctness
Use GPT-4o with N-sample self-consistency \(N=5\) and a test-execution filter instead; on tasks with unit tests, this beats o1 on cost per correct answer.
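A minimal sketch of the sample-and-filter loop described above, under stated assumptions. The `generate` callable is a hypothetical stand-in for whatever client calls the cheap model (no real API is shown), and `passes_tests` and `first_passing` are illustrative names that do not come from the report. The sketch stops at the first passing candidate, which is one possible variant of "generate 5, then filter".

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Optional


def passes_tests(candidate: str, test_code: str, timeout_s: float = 10.0) -> bool:
    """Run candidate code followed by its tests in a fresh interpreter.

    Returns True only if the process exits with code 0 within the timeout.
    NOTE: this executes model-generated code; in practice it should run in
    a sandbox (container, restricted user), which this sketch omits.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate_test.py"
        script.write_text(candidate + "\n\n" + test_code, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0


def first_passing(
    generate: Callable[[str], str],  # hypothetical: prompt -> candidate source
    prompt: str,
    test_code: str,
    n: int = 5,
) -> Optional[str]:
    """Sample up to n candidates and return the first one that passes the tests."""
    for _ in range(n):
        candidate = generate(prompt)
        if passes_tests(candidate, test_code):
            return candidate
    return None  # no candidate passed; the caller may escalate to a stronger model
```

Returning `None` when every candidate fails leaves room for a fallback, for example escalating only the unsolved tasks to the more expensive model.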
Journey Context:
For problems with verifiable outputs \(unit tests, type checkers\), the cost-optimal setup is a cheap model \+ a verification loop. Generating 5 candidates with GPT-4o \($0.05\) and filtering them by test execution yields a higher success rate \(pass@5\) than a single o1 sample \($0.50\), at 1/10th the cost. The approach fails for open-ended tasks that have no automated oracle, because there is nothing to filter the candidates against. The common error is assuming that a more expensive model is always better; in verifiable domains, sample diversity \+ verification dominates.
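The cost comparison above can be made concrete with a small calculation. This is a sketch under several assumptions: the $0.05 figure is read as the total for all 5 GPT-4o samples ($0.01 each), samples are treated as independent, the cost of running the tests is ignored, and the per-sample pass rates used in the example are hypothetical placeholders rather than measured values.

```python
def pass_at_n(p: float, n: int) -> float:
    """Probability that at least one of n independent samples passes,
    given a per-sample pass rate p."""
    return 1.0 - (1.0 - p) ** n


def cost_per_correct(cost_per_sample: float, p: float, n: int) -> float:
    """Expected spend per solved task: total sampling cost divided by
    the probability that the batch contains a passing candidate."""
    solved = pass_at_n(p, n)
    if solved == 0.0:
        return float("inf")  # the task is never solved at this pass rate
    return (n * cost_per_sample) / solved


if __name__ == "__main__":
    # Hypothetical per-sample pass rates, chosen only for illustration.
    cheap = cost_per_correct(cost_per_sample=0.01, p=0.60, n=5)
    strong = cost_per_correct(cost_per_sample=0.50, p=0.85, n=1)
    print(f"cheap model, 5 samples + filter: ${cheap:.3f} per correct answer")
    print(f"strong model, 1 sample:          ${strong:.3f} per correct answer")
```

Substituting pass rates measured on your own task set shows whether the cheap-model loop actually wins for a given workload.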
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T23:17:14.245831+00:00— report_created — created