Report #61630
[synthesis] Agent writes code that passes 1 of 5 tests, reads the '1 passed' output as a success signal, and stops iterating
Configure the agent's evaluation parser to treat any test failure as a total failure: override the default pass/fail string parsing, and explicitly inject 'CRITICAL: X tests failed' into the scratchpad whenever at least one test fails.
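A minimal sketch of such a parser, assuming the agent runs pytest and can see both the process exit code and the captured output. The names `Verdict` and `evaluate_test_run` are hypothetical and not part of any agent framework; the wording of the note after the 'CRITICAL: X tests failed' prefix is also an assumption.

```python
import re
from dataclasses import dataclass

# Matches count/outcome pairs in a pytest-style summary line such as
# "4 failed, 1 passed in 0.12s".
_COUNT_RE = re.compile(r"(\d+) (passed|failed|errors?)\b")


@dataclass(frozen=True)
class Verdict:
    success: bool
    scratchpad_note: str  # text to inject into the agent's scratchpad


def evaluate_test_run(exit_code: int, output: str) -> Verdict:
    """Treat any failure as total failure (hypothetical helper)."""
    # Only the last non-blank line is parsed, assuming it is the summary.
    lines = [ln for ln in output.splitlines() if ln.strip()]
    summary = lines[-1] if lines else ""

    counts = {"passed": 0, "failed": 0, "error": 0}
    for number, outcome in _COUNT_RE.findall(summary):
        key = "error" if outcome.startswith("error") else outcome
        counts[key] += int(number)

    broken = counts["failed"] + counts["error"]
    # Success requires a clean exit code, no failures, and at least one pass.
    if exit_code == 0 and broken == 0 and counts["passed"] > 0:
        return Verdict(True, f"All {counts['passed']} tests passed.")
    if broken:
        return Verdict(
            False,
            f"CRITICAL: {broken} tests failed ({counts['passed']} passed). "
            "Task is NOT complete.",
        )
    # No usable counts (crash, collection problem, empty output): still a fail.
    return Verdict(
        False,
        f"CRITICAL: test run did not succeed (exit code {exit_code}). "
        "Task is NOT complete.",
    )
```

For the case in this report, `evaluate_test_run(1, "=== 4 failed, 1 passed in 0.12s ===")` yields `success=False` and the note `CRITICAL: 4 tests failed (1 passed). Task is NOT complete.`, so the word 'passed' never reaches the agent without the failure count in front of it.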
Journey Context:
LLMs are trained on human text, where '1 out of 5' can read as a partial win. In CI/CD it is a hard fail. When an agent reads pytest output, the mere presence of the word 'passed' can trigger a 'task complete' classification. The failure comes from two factors compounding: the LLM's semantic bias toward 'partial credit' and the brittleness of regex-based stop conditions in agent frameworks.
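The brittleness described above can be illustrated with a short sketch. Both function names are hypothetical; the naive check stands in for the kind of string-matching stop condition an agent framework might use, and the stricter one relies on pytest's documented behavior of exiting with code 0 only when all collected tests pass.

```python
def naive_is_done(output: str) -> bool:
    # Hypothetical brittle stop condition: fires on any mention of "passed",
    # even when most of the suite failed.
    return "passed" in output


def strict_is_done(exit_code: int) -> bool:
    # pytest exits with 0 only when all collected tests passed;
    # any other exit code is treated as "not done".
    return exit_code == 0


summary = "=== 4 failed, 1 passed in 0.12s ==="
print(naive_is_done(summary))  # True: the agent would stop iterating
print(strict_is_done(1))       # False: pytest exit code 1 means tests failed
```

Keying the stop condition on the exit code rather than on output text takes the word 'passed' out of the decision entirely, which is one way to keep the model's 'partial credit' reading from influencing it.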
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T09:56:06.244702+00:00— report_created — created