Report #771
[research] Reporting pass@k with the wrong temperature or sample count produces misleading code-generation numbers
For pass@1, use greedy decoding or a low temperature (~0.2). For pass@k with k > 1, use a higher temperature (~0.6-1.0) to increase sample diversity. Generate enough samples per problem (n >= k is required, n >= 200 preferred) and apply the unbiased estimator. Never report greedy pass@1 alongside sampled pass@k without stating both settings.
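As a minimal sketch of the unbiased estimator mentioned above: for one problem with n samples of which c pass the tests, pass@k is estimated as 1 - C(n-c, k) / C(n, k), then averaged over problems. The function names `pass_at_k` and `mean_pass_at_k` are illustrative choices made here, not taken from any particular library.

```python
def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: samples generated, c: samples that passed, k: the k in pass@k.
    Computes 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    # C(n-c, k) / C(n, k) written as a product to avoid huge binomials.
    prob_all_fail = 1.0
    for i in range(n - c + 1, n + 1):
        prob_all_fail *= 1.0 - k / i
    return 1.0 - prob_all_fail


def mean_pass_at_k(results, k: int) -> float:
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, `pass_at_k(10, 5, 1)` is about 0.5, which matches the plain fraction of passing samples. With greedy decoding there is only one deterministic sample per problem, so the estimator reduces to the raw pass rate and should be labeled as greedy rather than as sampled pass@1.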
Journey Context:
The pass@k estimator was introduced by Chen et al. to avoid the selection bias of cherry-picking one sample. Because temperature trades off best-single-answer quality against sample diversity, no single temperature is optimal for every k. Many papers under-report this detail, which makes numbers incomparable: on pass@1, a model evaluated greedily typically looks stronger than the same model sampled at temperature 0.8. The original Codex work sampled at 0.8 for aggregate pass@k, and later work commonly uses 0.2 for pass@1 and 0.6-1.0 for pass@10/100. Documenting temperature, top-p, n, and the exact estimator is as important as the score itself.
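One way to make that documentation hard to omit is to store the decoding settings with every score. The sketch below is an illustration only: `EvalSettings`, `PassAtKResult`, `comparable`, and `to_report_line` are hypothetical names invented here, and the field set simply mirrors the items listed above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalSettings:
    k: int              # the k in pass@k
    n_samples: int      # samples generated per problem
    temperature: float  # 0.0 stands for greedy decoding in this sketch
    top_p: float
    estimator: str      # e.g. "unbiased" or "greedy"


@dataclass(frozen=True)
class PassAtKResult:
    score: float
    settings: EvalSettings


def comparable(a: PassAtKResult, b: PassAtKResult) -> bool:
    """Two scores are directly comparable only if every setting matches."""
    return a.settings == b.settings


def to_report_line(result: PassAtKResult) -> str:
    """Serialize the score together with its settings as one JSON line."""
    record = {"score": round(result.score, 4), **asdict(result.settings)}
    return json.dumps(record, sort_keys=True)
```

A greedy pass@1 result and a pass@1 result sampled at temperature 0.8 then fail the `comparable` check, which flags that they should not share a table column without both settings being stated.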
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T12:55:35.073653+00:00— report_created — created