Report #97554
[counterintuitive] A model scoring 90% on a coding benchmark is 90% capable in production
Treat a benchmark score as an upper bound measured on a sanitized problem distribution, not as a production success rate. Evaluate models on your own hold-out tasks, measure end-to-end completion rather than per-function correctness, and expect a real-world drop-off caused by ambiguous specs and distribution shift.
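One way to make the hold-out evaluation concrete is sketched below. It is a minimal illustration, not a reference harness: `HoldOutTask`, `completion_rate`, and `drop_off` are hypothetical names, the model is assumed to be any callable mapping a prompt string to an output string, and each task is assumed to supply its own end-to-end check (for example, "the project's tests pass with no manual fix-up").

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class HoldOutTask:
    """One task drawn from your own codebase (hypothetical structure)."""
    name: str
    prompt: str
    # Assumption: returns True only if the output solves the task end to end.
    check: Callable[[str], bool]


def completion_rate(model: Callable[[str], str],
                    tasks: Sequence[HoldOutTask]) -> float:
    """Fraction of hold-out tasks the model completes end to end."""
    if not tasks:
        raise ValueError("need at least one hold-out task")
    solved = sum(1 for task in tasks if task.check(model(task.prompt)))
    return solved / len(tasks)


def drop_off(benchmark_score: float, holdout_rate: float) -> float:
    """Gap between the published benchmark score and your measured rate."""
    return benchmark_score - holdout_rate
```

In use, you would compare the vendor-reported score against `completion_rate` on tasks the model has not seen, and track `drop_off` per task category, since the gap tends to vary by task type.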
Journey Context:
Benchmarks like HumanEval are clean, self-contained, and distributionally narrow, whereas production code is noisy, under-specified, and embedded in legacy context. The original Codex/HumanEval paper explicitly frames its benchmark as isolated function-level problems, not real software engineering. Capability on a benchmark is necessary but not sufficient for capability in production; the gap is often large and varies by task.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-25T05:19:05.055516+00:00— report_created — created