Report #100513
[frontier] Can an attacker control my agent through an image or webpage?
Assume every rendered pixel is untrusted; sandbox the agent, add guardrails, and require human confirmation before irreversible actions.
Journey Context:
Computer-use agents (CUAs) consume screenshots that can contain adversarial instructions, fine-print injections, or misleading UI. VPI-Bench (2025) systematizes visual prompt-injection attacks on computer-use agents. Anthropic reduced browser-agent attack success from double-digit percentages to ~1% with layered defenses, but the attack surface is inherent whenever vision is an input channel. The fix is not 'trust the model' but defense in depth: sandboxing, least-privilege sessions, and explicit user approval for dangerous actions.
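One way to picture the "explicit user approval" layer is a small gate that sits between the agent's proposed action and its execution. The sketch below is illustrative only: `Action`, `IRREVERSIBLE_KINDS`, `gate`, and `deny_all` are hypothetical names, not part of any agent framework, and the set of irreversible action kinds is an assumption that a real deployment would have to define for itself.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical action kinds treated as irreversible (an assumption;
# a real deployment would define its own list).
IRREVERSIBLE_KINDS = {"submit_payment", "send_email", "delete_file"}


@dataclass(frozen=True)
class Action:
    kind: str    # what the agent wants to do, e.g. "click" or "send_email"
    target: str  # what it wants to do it to


def gate(action: Action, confirm: Callable[[Action], bool]) -> bool:
    """Return True if the action may run.

    Reversible actions pass through. Irreversible ones run only when the
    human-facing `confirm` callback returns exactly True. Text the model
    read from a screenshot is never consulted here, so on-screen
    instructions cannot approve an action on the user's behalf.
    """
    if action.kind not in IRREVERSIBLE_KINDS:
        return True
    return confirm(action) is True


def deny_all(action: Action) -> bool:
    """Safe default callback for unattended runs: refuse everything."""
    return False
```

In an interactive session `confirm` could prompt the user; in an unattended run, passing `deny_all` makes the gate fail closed. This covers only the approval layer; sandboxing and least-privilege sessions still have to be enforced outside the agent process.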
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-01T05:21:21.365700+00:00— report_created — created