Report #100041
[frontier] My screenshot-based agent clicks the wrong thing or misses state changes
Treat vision-only perception as lossy and supplement it explicitly with DOM state: interactive status (enabled, disabled, focused), element hierarchy, and off-screen content. Alongside each screenshot, inject a compact accessibility snapshot and a textual diff of what changed since the previous step.
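A minimal sketch of the snapshot-plus-diff idea, assuming the agent harness can already hand over a flat list of element records. The record shape (`id`, `role`, `name`, `disabled`, `visible`) and the function names `compact_snapshot` and `state_diff` are hypothetical, chosen for illustration; they are not any particular library's API.

```python
from typing import Dict, List

# Hypothetical element record, e.g. extracted from the accessibility tree:
# {"id": str, "role": str, "name": str, "disabled": bool, "visible": bool}
Element = Dict[str, object]


def compact_snapshot(elements: List[Element]) -> Dict[str, str]:
    """Map each element id to a one-line description of role, name, and state."""
    snapshot: Dict[str, str] = {}
    for el in elements:
        flags = []
        if el.get("disabled"):
            flags.append("disabled")
        if not el.get("visible", True):
            flags.append("off-screen")
        suffix = f" [{', '.join(flags)}]" if flags else ""
        snapshot[str(el["id"])] = f'{el["role"]} "{el["name"]}"{suffix}'
    return snapshot


def state_diff(prev: Dict[str, str], curr: Dict[str, str]) -> List[str]:
    """Return text lines for elements added (+), removed (-), or changed (~)."""
    lines: List[str] = []
    for key in sorted(curr.keys() - prev.keys()):
        lines.append(f"+ {key}: {curr[key]}")
    for key in sorted(prev.keys() - curr.keys()):
        lines.append(f"- {key}: {prev[key]}")
    for key in sorted(prev.keys() & curr.keys()):
        if prev[key] != curr[key]:
            lines.append(f"~ {key}: {prev[key]} -> {curr[key]}")
    return lines
```

The diff lines can be appended to the prompt next to the screenshot, so the model is told in text that, say, a button went from disabled to enabled instead of having to infer it from a subtle color change.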
Journey Context:
Screenshot pipelines flatten the HTML hierarchy on the way through render → pixel → vision interpretation, losing semantic state at each stage. rtrvr.ai estimates that information drops to ~30% by the time a vision model interprets the screen. Real-world failure modes include OCR hallucinations, mistaking ads for content, missing elements hidden under overlays, and failing to notice that a button is disabled. Anthropic's own docs warn that scrolling, dragging, and zooming remain hard for Claude. The fix is not 'better prompts' but feeding the model the symbolic state it cannot see, because a pixel is a terrible API.
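Several of these failure modes (disabled buttons, elements under overlays, off-screen targets) can be checked symbolically before the agent clicks. The sketch below assumes bounding boxes for the target, any overlays, and the viewport are already available in the same coordinate space; the `click_hazards` helper and its input shape are hypothetical and only illustrate the check.

```python
from typing import Dict, List, Tuple

# (left, top, right, bottom) in page pixels; an assumed convention.
Box = Tuple[float, float, float, float]


def overlaps(a: Box, b: Box) -> bool:
    """True if the two rectangles share any interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def click_hazards(target: Dict, overlays: List[Box], viewport: Box) -> List[str]:
    """List reasons a click on `target` may not do what the screenshot suggests."""
    hazards: List[str] = []
    if target.get("disabled"):
        hazards.append("disabled")
    if not overlaps(target["box"], viewport):
        hazards.append("off-screen")
    if any(overlaps(target["box"], o) for o in overlays):
        hazards.append("covered by overlay")
    return hazards
```

An empty list means none of the three checks fired; it does not prove the click is safe. Any non-empty result can be passed to the model as text, for example "target is covered by overlay", so it can dismiss the overlay or scroll first.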
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:29:23.922835+00:00— report_created — created