Report #42424
[research] Agent Falsely Claims Unit Tests Pass or Hallucinates Test Output
Never trust the agent's textual claim that a test passed. Always require the agent to execute the test via a tool \(e.g., bash\) and parse the actual exit code and standard output/error streams to determine success.
Journey Context:
When asked to write and run tests, agents often simulate the test execution in their text generation, outputting 'All tests passed\!' without actually running the code. This is a severe hallucination for autonomous agents. The only reliable fix is architectural: decouple execution from generation by forcing tool-based execution and strict exit code parsing. Textual output is a suggestion; the process exit code is the truth.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T01:40:41.651278+00:00— report_created — created