Report #98162
[frontier] My agent performs dangerous actions without realizing it because screenshots hide side effects
Verify the outcome of every action with structured assertions rather than relying on a new screenshot alone. Check the actual state change through the filesystem, an API, or a CLI, and gate irreversible actions behind explicit confirmation.
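A minimal sketch of post-action effect verification, assuming the agent's action is a plain Python callable and the state to check lives on the local filesystem. The names `EffectCheck`, `verify_effects`, and `run_action` are hypothetical helpers made up for illustration, not part of any agent framework.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class EffectCheck:
    """A named assertion about real system state (hypothetical helper)."""
    description: str
    predicate: Callable[[], bool]


def verify_effects(checks: list[EffectCheck]) -> list[str]:
    """Run every check and return the descriptions of those that failed."""
    return [c.description for c in checks if not c.predicate()]


def run_action(action: Callable[[], None], checks: list[EffectCheck]) -> None:
    """Execute the action, then assert on state instead of trusting the UI."""
    action()
    failed = verify_effects(checks)
    if failed:
        raise RuntimeError(f"unexpected state after action: {failed}")


def demo() -> None:
    """Delete one file and confirm that only that file changed."""
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "old.log"
        sibling = Path(tmp) / "keep.txt"
        target.write_text("stale")
        sibling.write_text("important")

        checks = [
            # The intended effect happened.
            EffectCheck("target removed", lambda: not target.exists()),
            # No unintended side effect on a neighbouring file.
            EffectCheck("sibling untouched", lambda: sibling.exists()),
        ]
        run_action(target.unlink, checks)


if __name__ == "__main__":
    demo()
```

The second check is the one a screenshot cannot provide: it asserts that nothing other than the intended target changed. For email or API side effects, the same pattern would apply with a predicate that queries the relevant service instead of the filesystem.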
Journey Context:
A screenshot shows the UI, not whether an email was sent or a file was deleted. Agents can therefore complete harmful actions while the UI looks normal. ROGUE demonstrates that misaligned behavior can arise from ordinary, benign computer use, so production loops need effect verification and safety gating.
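One way to sketch the safety gating mentioned here, assuming each agent action carries a name and that a confirmation callback (a human prompt, a policy check) is supplied by the caller. The action names in `IRREVERSIBLE_ACTIONS` and the functions `classify` and `gated_execute` are hypothetical and chosen only for illustration.

```python
from enum import Enum
from typing import Callable


class Risk(Enum):
    REVERSIBLE = "reversible"
    IRREVERSIBLE = "irreversible"


# Hypothetical action names; a real deployment would define its own list.
IRREVERSIBLE_ACTIONS = frozenset({"send_email", "delete_file", "submit_payment"})


def classify(action_name: str) -> Risk:
    """Label an action by whether its effect can be undone."""
    if action_name in IRREVERSIBLE_ACTIONS:
        return Risk.IRREVERSIBLE
    return Risk.REVERSIBLE


def gated_execute(
    action_name: str,
    action: Callable[[], None],
    confirm: Callable[[str], bool],
) -> bool:
    """Run the action only if it is reversible or explicitly confirmed.

    Returns True if the action ran, False if it was blocked.
    """
    if classify(action_name) is Risk.IRREVERSIBLE and not confirm(action_name):
        return False
    action()
    return True
```

A caller might pass `confirm=lambda name: input(f"Run {name}? [y/N] ").strip().lower() == "y"` for an interactive loop. Defaulting unknown action names to reversible is a simplification; a stricter design would treat unknown actions as irreversible.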
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-26T05:20:29.495731+00:00— report_created — created