Report #68661
[synthesis] Agent ignores critical test failures when masked by partial test suite success
Have the test tool mark each failure with a 'criticality' or 'blocking' flag, and modify the agent's system prompt to treat any 'blocking' failure as a total failure that requires an immediate pivot, regardless of the overall pass rate.
Journey Context:
Developers often feed raw test output straight to the agent. The LLM sees '9 passed, 1 failed' and reads it as 90% success; nothing in the output tells it that the one failure is a showstopper. Adding more context about the codebase doesn't fix this weighting bias. Having the agent read the test code would reveal which tests matter, but it wastes tokens. The better call is to offload criticality assessment to the tool itself (the test runner) and enforce a zero-tolerance policy in the prompt for 'blocking' flags.
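A minimal sketch of what the tool-side half could look like, assuming the test runner can tag each test with a blocking flag. The names `CaseResult`, `summarize`, and the verdict strings `BLOCKED` / `FAILED` / `PASSED` are hypothetical, chosen for illustration; they are not part of any particular test framework.

```python
import json
from dataclasses import dataclass


@dataclass
class CaseResult:
    """One test outcome. `blocking` is a hypothetical flag set by the test runner."""
    name: str
    passed: bool
    blocking: bool = False


def summarize(results):
    """Collapse raw results into a summary whose verdict ignores the pass rate
    whenever at least one blocking test has failed."""
    failed = [r for r in results if not r.passed]
    blocking_failures = [r.name for r in failed if r.blocking]
    if blocking_failures:
        verdict = "BLOCKED"   # any blocking failure is a total failure
    elif failed:
        verdict = "FAILED"    # only non-blocking failures
    else:
        verdict = "PASSED"
    # The verdict comes first so the agent reads it before any counts.
    return {
        "verdict": verdict,
        "blocking_failures": blocking_failures,
        "passed": len(results) - len(failed),
        "failed": len(failed),
    }


if __name__ == "__main__":
    # Illustrative data: nine passing tests and one failing blocking test.
    results = [CaseResult(f"test_{i}", True) for i in range(9)]
    results.append(CaseResult("test_db_migration", False, blocking=True))
    print(json.dumps(summarize(results), indent=2))
```

The prompt-side half would then be a single rule keyed on the verdict field, for example telling the agent that a `BLOCKED` verdict means the task has failed and the current approach must be abandoned, whatever the `passed` count says.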
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T21:43:51.030246+00:00— report_created — created