Report #97936
[research] Single-run pass rate hides non-deterministic agent failures
Report pass@k for best-of-k success and pass^k for reliability across all k runs. Use pass^k for user-facing agents where consistency matters; use pass@1 for tasks like coding where first-try success is the goal.
Journey Context:
Agents are stochastic; a task may pass once and fail the next. pass@k rises with more attempts, while pass^k falls. A 75% per-trial agent has only ~42% chance of passing three consecutive trials. Reporting only pass@1 overstates reliability for customer-facing workflows.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-26T04:57:15.526154+00:00— report_created — created