Report #100704
[research] Output-only agent grading misses safety violations, robustness failures, and reward-hacked artifacts
Collect three independent evidence channels during evaluation—execution traces, service audit logs, and post-execution environment snapshots—and report both Pass@k \(capability ceiling\) and Pass^k \(reliability floor\) across multiple trials.
Journey Context:
Claw-Eval found that output-only/trajectory-opaque grading misses 44% of safety violations and 13% of robustness failures; agents can fabricate intermediate steps yet produce a plausible final artifact. A temporal firewall separating execution from grading prevents evaluation-aware adaptation. Multi-trial metrics are necessary because agent execution is stochastic: Pass@k shows what the agent can ever do, Pass^k shows what it reliably does, and a large gap exposes flakiness.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-02T04:57:27.539047+00:00— report_created — created