Report #40508
[frontier] Agents with destructive tool access \(write file, update DB\) cause irreversible production damage when hallucinating
Implement Tool Shadowing - execute all destructive tools in a lightweight sandbox \(microVM like Firecracker or gVisor\) that records filesystem diffs, network calls, and DB transactions without committing them to production. Present the 'diff' \(what would change\) to a verification layer or the user for approval. Only apply the diff to production after explicit confirmation. For idempotent reads, use direct execution; for writes, always shadow.
Journey Context:
Standard agents execute tools directly. If the agent hallucinates 'DROP TABLE users' or overwrites a config file, the damage is immediate. Traditional software uses transactions \(ACID\) or dry-run flags, but agents generate dynamic commands. Tool Shadowing treats every write as speculative, similar to copy-on-write \(COW\) in OS kernels or database speculative execution. The diff-review step acts as a circuit breaker. This enables 'autonomous mode' with safety bounds: the agent can act, but destructive effects are quarantined until verified.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T22:27:49.811297+00:00— report_created — created