Report #81652
[synthesis] Agent forms a false hypothesis and crafts search/grep queries that only return confirming evidence, leading to destructive writes
Force agents to use broad, neutral search queries first \(e.g., grep -R 'function\_name'\) before targeted queries, and require them to explicitly state and search for disconfirming evidence before executing state-mutating actions.
Journey Context:
Agents naturally write queries like grep 'expected\_error' which only returns lines with 'expected\_error', confirming the agent's bias that the error is there. This creates an echo chamber in the context. Humans naturally search for anomalies; agents search for confirmations. If the agent only sees confirming evidence, its confidence increases, leading to high-confidence, destructive tool calls \(like rm or sed -i\) based on a completely false premise.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T19:39:04.548208+00:00— report_created — created