Report #70709
[frontier] Vision agent distracted by notifications, ads, or irrelevant UI chrome in screenshots
Pre-process screenshots with DOM-based saliency masking: black out regions outside the task-relevant viewport area before vision encoding
Journey Context:
Raw screenshots contain distracting elements such as OS notifications, browser bookmarks, and cookie banners. Vision models (especially smaller ones) attend to these irrelevant regions, which causes hallucinations or task drift. The emerging pattern uses DOM structure to identify the 'active task region' (e.g., the main content area), masks the screenshot to black out peripheral chrome, then sends the cleaned image to the VLM. This reportedly improves grounding accuracy by ~25% on benchmark tasks.
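A minimal sketch of the masking step, assuming the screenshot has already been decoded to a raw 8-bit RGB buffer and that the task region's bounding box comes from the DOM (for example via getBoundingClientRect(), in CSS pixels). The names Rect and mask_outside, and the scale and pad parameters, are hypothetical and not taken from any particular agent framework; scale stands in for the device pixel ratio.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Bounding box in CSS pixels (hypothetical type for this sketch)."""
    x: float
    y: float
    width: float
    height: float


def mask_outside(rgb: bytes, img_w: int, img_h: int, keep: Rect,
                 scale: float = 1.0, pad: int = 0) -> bytes:
    """Return a copy of a raw RGB buffer with everything outside `keep` black.

    `scale` converts CSS pixels to image pixels (assumed to be the device
    pixel ratio); `pad` grows the kept region by that many image pixels.
    """
    if len(rgb) != img_w * img_h * 3:
        raise ValueError("buffer size does not match img_w * img_h * 3")

    # Convert the box to image pixels and clamp it to the image bounds.
    left = max(0, math.floor(keep.x * scale) - pad)
    top = max(0, math.floor(keep.y * scale) - pad)
    right = min(img_w, math.ceil((keep.x + keep.width) * scale) + pad)
    bottom = min(img_h, math.ceil((keep.y + keep.height) * scale) + pad)

    out = bytearray(len(rgb))  # zero-filled, so the default is black
    if right <= left or bottom <= top:
        return bytes(out)  # box is empty or entirely off-screen

    row_bytes = img_w * 3
    for y in range(top, bottom):
        start = y * row_bytes + left * 3
        end = y * row_bytes + right * 3
        out[start:end] = rgb[start:end]  # copy only the kept span of this row
    return bytes(out)


# Usage: a 4x3 all-white image, keeping a 2x1 region starting at (1, 1).
white = bytes([255]) * (4 * 3 * 3)
masked = mask_outside(white, 4, 3, Rect(x=1, y=1, width=2, height=1))
```

Starting from a black buffer and copying the kept rows in is a design choice: a bad or off-screen box yields a fully black image instead of leaking unmasked chrome to the model. Whether a small pad around the region helps grounding is an open assumption that would need checking against the agent's own tasks.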
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T01:16:09.897745+00:00 - report_created - created