Report #100048
[frontier] Benchmark says 87% success but my agent fails constantly in production
Build a custom evaluation suite that measures action-level grounding (ScreenSpot-Pro), step efficiency and cost (OSWorld-Human), trajectory quality and failure attribution (AgentAtlas / AgentRx), and safety under benign inputs (BLIND-ACT). Outcome leaderboards alone are misleading.
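As a rough illustration of the first two dimensions, the sketch below scores grounding and step efficiency on logged actions. It assumes a click counts as grounded when it lands inside the target element's bounding box, and that a human reference step count is available per task. Both are assumptions made for this example, not the official scoring rules of ScreenSpot-Pro or OSWorld-Human, and the names `Box`, `grounding_accuracy`, and `step_overhead` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of the target UI element (hypothetical type)."""
    left: float
    top: float
    right: float
    bottom: float


def click_hits_target(x: float, y: float, box: Box) -> bool:
    # Assumption: a click is "grounded" if it falls inside the target box.
    return box.left <= x <= box.right and box.top <= y <= box.bottom


def grounding_accuracy(clicks: List[Tuple[float, float]], boxes: List[Box]) -> float:
    """Fraction of clicks that land inside their paired target box."""
    if len(clicks) != len(boxes):
        raise ValueError("each click needs exactly one target box")
    if not clicks:
        return 0.0
    hits = sum(click_hits_target(x, y, b) for (x, y), b in zip(clicks, boxes))
    return hits / len(clicks)


def step_overhead(agent_steps: int, human_steps: int) -> float:
    """Ratio of agent steps to human reference steps (1.0 means parity)."""
    if human_steps <= 0:
        raise ValueError("human_steps must be positive")
    return agent_steps / human_steps


if __name__ == "__main__":
    target = Box(0, 0, 10, 10)
    print(grounding_accuracy([(5, 5), (50, 50)], [target, target]))
    print(step_overhead(agent_steps=30, human_steps=10))
```

Reporting these next to the task success rate makes it visible when a "successful" run relied on off-target clicks or took several times the human step count.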
Journey Context:
Public benchmarks like WebVoyager and OSWorld measure end-to-end task success, which hides catastrophic grounding errors, inefficient loops, and safety failures. A model can score well by getting lucky on a few critical steps or by using far more actions than a human would. AgentAtlas audits 15 benchmarks and finds that tool execution is well covered but efficiency, memory/state, and trajectory quality are not. The 2026 shift is from leaderboard-chasing to process-aware evaluation: leading teams now grade every step, not just the final answer, because individual steps are where production agents actually break.
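To make the gap between outcome scores and step-level grading concrete, here is a minimal sketch that summarizes logged trajectories both ways. The `Step` and `Trajectory` types, the per-step `ok` flag (however your grader produces it), and the metric names are all hypothetical. This is not the AgentAtlas or AgentRx methodology, only one simple way to surface successes that contain bad steps.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Step:
    action: str
    ok: bool  # Assumption: some step-level grader has already judged this step.


@dataclass
class Trajectory:
    steps: List[Step]
    task_succeeded: bool  # The end-to-end outcome a leaderboard would report.


def first_failed_step(traj: Trajectory) -> Optional[int]:
    """Index of the earliest bad step, a crude form of failure attribution."""
    for index, step in enumerate(traj.steps):
        if not step.ok:
            return index
    return None


def summarize(trajs: List[Trajectory]) -> Dict[str, float]:
    """Report outcome success alongside step-level error statistics."""
    total_steps = sum(len(t.steps) for t in trajs)
    if not trajs or total_steps == 0:
        raise ValueError("need at least one trajectory with at least one step")
    bad_steps = sum(1 for t in trajs for s in t.steps if not s.ok)
    # Runs that pass the outcome check despite containing a bad step.
    lucky = sum(
        1 for t in trajs if t.task_succeeded and first_failed_step(t) is not None
    )
    return {
        "outcome_success_rate": sum(t.task_succeeded for t in trajs) / len(trajs),
        "step_error_rate": bad_steps / total_steps,
        "successes_with_bad_steps": lucky,
    }


if __name__ == "__main__":
    runs = [
        Trajectory([Step("open", True), Step("click", False), Step("submit", True)], True),
        Trajectory([Step("open", True), Step("submit", True)], True),
    ]
    print(summarize(runs))
```

In this toy input both runs count as successes on an outcome leaderboard, yet one of them contains a failed step, which is the kind of fragility that tends to show up later in production.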
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:30:18.426876+00:00— report_created — created