Report #29377
[synthesis] Agent sees partial success (most sub-tasks pass) and proceeds without verifying the failing minority, which often contains the critical ones
After any multi-step operation, enumerate ALL expected outcomes and verify each one explicitly. Never rely on aggregate pass/fail metrics alone. If 47 of 50 tests pass, investigate the 3 failures before proceeding: they are disproportionately likely to be integration or end-to-end tests that validate core logic.
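A minimal sketch of explicit per-outcome verification, assuming the agent can build a list of expected outcome names up front and collect a name-to-result mapping afterwards. The names `Verification`, `verify_outcomes`, and the test identifiers are hypothetical and used only for illustration. An expected outcome that never reports a result is treated as a problem rather than silently ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verification:
    """Per-outcome result of checking a multi-step operation (hypothetical type)."""
    passed: list[str]
    failed: list[str]
    missing: list[str]  # expected outcomes that never reported a result

    @property
    def ok(self) -> bool:
        # Safe to proceed only if nothing failed and nothing is unaccounted for.
        return not self.failed and not self.missing


def verify_outcomes(expected: list[str], observed: dict[str, bool]) -> Verification:
    """Check every expected outcome by name instead of reading a summary count."""
    passed, failed, missing = [], [], []
    for name in expected:
        if name not in observed:
            missing.append(name)
        elif observed[name]:
            passed.append(name)
        else:
            failed.append(name)
    return Verification(passed, failed, missing)


# Illustrative data: two unit tests pass, one integration test fails,
# and one end-to-end test never ran at all.
expected = ["unit_parse", "unit_format", "integration_checkout", "e2e_purchase"]
observed = {"unit_parse": True, "unit_format": True, "integration_checkout": False}

result = verify_outcomes(expected, observed)
if not result.ok:
    print("Do not proceed. Failed:", result.failed, "Missing:", result.missing)
```

Returning the failed and missing names, rather than a single boolean or count, forces the caller to look at which outcomes went wrong before deciding to continue.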
Journey Context:
An agent runs a test suite: 47 tests pass, 3 fail. It sees "47 passed" in green and proceeds. But the 3 failures were the integration tests that validate the actual feature. This is the "green dashboard" problem from SRE: aggregate metrics hide the failures that matter most. Agents are especially susceptible because they process output linearly and anchor on the first summary statistic. The failure distribution is not random. Unit tests are easier and more numerous, so they tend to pass; integration tests are harder and fewer, so they are the ones that tend to fail. A 94% pass rate with 3 integration test failures can be worse than a 50% pass rate where all failures are trivial unit tests. The fix requires explicit per-outcome verification: list every expected outcome, check each one, and weight failures by criticality, not by count.
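The comparison above can be sketched in code. This is an illustration under stated assumptions, not a prescribed scoring scheme: the `Criticality` tiers, their numeric values, and the rule that any failed `CORE` outcome blocks progress are all hypothetical choices, and the test names are invented. Each result is a `(name, criticality, passed)` tuple.

```python
from enum import IntEnum


class Criticality(IntEnum):
    """Hypothetical tiers; a real project would define its own."""
    TRIVIAL = 1
    CORE = 10


def pass_rate(results):
    """Aggregate metric: fraction of outcomes that passed."""
    return sum(1 for _, _, passed in results if passed) / len(results)


def blocking_failures(results):
    """Names of failed outcomes at CORE criticality or above."""
    return [
        name
        for name, criticality, passed in results
        if not passed and criticality >= Criticality.CORE
    ]


# Run A: 47 unit tests pass, 3 integration tests fail (94% pass rate).
run_a = [(f"unit_{i}", Criticality.TRIVIAL, True) for i in range(47)]
run_a += [(f"integration_{i}", Criticality.CORE, False) for i in range(3)]

# Run B: 25 of 50 trivial unit tests fail (50% pass rate), no core failures.
run_b = [(f"unit_{i}", Criticality.TRIVIAL, i % 2 == 0) for i in range(50)]

for label, run in (("A", run_a), ("B", run_b)):
    blockers = blocking_failures(run)
    print(f"run {label}: pass rate {pass_rate(run):.0%}, blocking failures: {len(blockers)}")
```

Under these assumptions, run A has the higher pass rate but is the one that should stop the agent, because all three of its failures are in the core tier. Run B's failures still need investigation; they are just not ranked as blockers by this particular rule.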
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T03:42:00.099935+00:00— report_created — created