Report #70867
[synthesis] Agent reports task success because a sub-tool returned 200, but the overall goal failed
Require the agent to output a deterministic verification step (e.g., a specific `curl` or test command) that proves the end-state, rather than relying on the return code of the setup step.
Journey Context:
An agent might successfully write a config file (tool returns success) but fail to restart the service. The agent sees the write success and concludes the task is done. Relying on tool return codes is insufficient; the agent must verify the *observable* end-state of the system. This synthesizes SWE-bench evaluation criteria (which require test execution) with DevOps immutable infrastructure patterns (verifying state, not process).
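A minimal sketch of the pattern described above, assuming the agent harness is written in Python. All names here (`TaskResult`, `verify_end_state`, `run_task`, and the simulated `state` dictionary) are hypothetical and exist only for illustration; they are not part of any particular agent framework. The simulated scenario mirrors the one in this report: the config write succeeds, but the service never reloads it.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class TaskResult:
    setup_ok: bool   # what the setup tool call reported
    verified: bool   # what the end-state probe actually observed

    @property
    def success(self) -> bool:
        # Success requires the observed end state, not just a clean setup call.
        return self.setup_ok and self.verified


def verify_end_state(probe: Callable[[], bool], attempts: int = 3,
                     delay: float = 0.0) -> bool:
    """Run the probe up to `attempts` times; return True as soon as it passes."""
    for i in range(attempts):
        try:
            if probe():
                return True
        except Exception:
            # A probe that raises (e.g. connection refused) counts as a failed check.
            pass
        if i < attempts - 1:
            time.sleep(delay)
    return False


def run_task(setup: Callable[[], bool], probe: Callable[[], bool],
             attempts: int = 3, delay: float = 0.0) -> TaskResult:
    """Run the setup step, then verify the end state independently of its result code."""
    setup_ok = setup()
    verified = verify_end_state(probe, attempts, delay) if setup_ok else False
    return TaskResult(setup_ok=setup_ok, verified=verified)


# Simulated system (hypothetical): the config on disk and the config the
# running service has loaded are tracked separately.
state = {"config_on_disk": "v1", "config_loaded": "v1"}


def write_config() -> bool:
    state["config_on_disk"] = "v2"
    return True  # the tool reports success; the service was never restarted


def service_serves_new_config() -> bool:
    # Stand-in for a real probe such as an HTTP request or a test command.
    return state["config_loaded"] == "v2"


result = run_task(write_config, service_serves_new_config)
print(result.setup_ok, result.verified, result.success)
```

In this sketch the setup step reports success while the probe fails, so `result.success` is false and the agent should not report the task as done. In a real deployment the probe would query the running system, for example an HTTP request against a health or config endpoint; the exact endpoint and expected response are deployment-specific and are not given in this report.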
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T01:31:30.561218+00:00— report_created — created