Report #29515
[cost_intel] Using maximum reasoning effort for all test cases in evaluation suites
Calibrate reasoning effort to each test case's difficulty using a cheap critic model: run o1-high only on the 'hard' bucket, identified by a 4o-mini failure plus complexity heuristics (cyclomatic complexity, dependency depth).
Journey Context:
Evaluating coding agents on SWE-bench with o1 on every instance is prohibitively expensive ($30-50 per instance). Many instances are 'easy' (one-line fixes) that 4o-mini solves on its own. Pattern: use 4o-mini as a filter. If its patch passes the tests, the instance is done ($0.10). If it fails and the code complexity exceeds the threshold, escalate to o1 ($5.00). This cuts eval costs by 80% while maintaining accuracy on hard cases. Mistake: running o1 on every ticket 'to be safe.' Heuristic: an import count >20 or cyclomatic complexity >10 triggers the reasoning tier.
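The routing described above can be sketched roughly as follows. This is a minimal illustration, not the reporter's implementation: `Instance`, `is_complex`, `evaluate`, and the two callables `cheap_passes` and `reasoning_passes` are hypothetical names standing in for whatever harness runs the models and the test suite. The thresholds and per-tier costs are the figures quoted in the report. Two things are assumptions, since the report does not specify them: an instance that fails the cheap tier but falls below the complexity threshold is recorded as a failure without escalation, and an escalated instance is charged for both tiers.

```python
from dataclasses import dataclass
from typing import Callable

# Thresholds taken from the report's heuristic.
IMPORT_COUNT_THRESHOLD = 20
CYCLOMATIC_THRESHOLD = 10

# Per-instance costs quoted in the report (USD).
CHEAP_COST = 0.10
REASONING_COST = 5.00


@dataclass
class Instance:
    """Hypothetical record for one evaluation instance."""
    instance_id: str
    import_count: int
    cyclomatic_complexity: int


def is_complex(inst: Instance) -> bool:
    """Report heuristic: either signal above its threshold marks the instance as hard."""
    return (
        inst.import_count > IMPORT_COUNT_THRESHOLD
        or inst.cyclomatic_complexity > CYCLOMATIC_THRESHOLD
    )


def evaluate(
    inst: Instance,
    cheap_passes: Callable[[Instance], bool],
    reasoning_passes: Callable[[Instance], bool],
) -> tuple[str, bool, float]:
    """Return (tier_used, passed, cost_usd) for one instance."""
    # Tier 1: the cheap model acts as the filter.
    if cheap_passes(inst):
        return ("cheap", True, CHEAP_COST)
    # Assumption: simple instances that fail are not escalated.
    if not is_complex(inst):
        return ("cheap", False, CHEAP_COST)
    # Tier 2: escalate only failed, complex instances.
    # Assumption: the cheap attempt is still paid for.
    passed = reasoning_passes(inst)
    return ("reasoning", passed, CHEAP_COST + REASONING_COST)
```

In practice the two callables would wrap a model call plus a test run, and the complexity numbers would come from a static-analysis pass over the files the instance touches. Summing the third element of each result over a suite gives the total spend to compare against running the reasoning tier on everything.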
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T03:55:55.998918+00:00— report_created — created