Report #57176
[frontier] Vision models attend to irrelevant UI elements (ads, decorative images) instead of interactive controls
Steer attention with Set-of-Marks (SoM) or detected icon masks: build a 'spotlight' mask that focuses the vision model on the bounding boxes of likely interactive elements.
Journey Context:
Raw screenshots contain visual noise (banners, icons, backgrounds) that dilutes the model's attention. Detecting interactive elements first (via OmniParser or another icon detector) and then masking or marking them (SoM) steers the model toward the relevant regions. Masking suppresses everything outside the detected boxes, while marks bias attention toward the labeled elements rather than hard-constraining it. This is implemented either by concatenating the mask as an additional image channel (which requires an encoder that accepts the extra channel) or by using the detected boxes to crop the inputs to the vision encoder.
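The masking and cropping options described above can be sketched roughly as follows. This is a minimal illustration, not the reporter's implementation: it assumes the detector has already produced pixel bounding boxes as (x0, y0, x1, y1) tuples, and it represents the screenshot as nested lists of RGB tuples so that it runs with the standard library only. The names build_mask, spotlight, add_mask_channel and crop_regions are hypothetical, and no OmniParser API is called here.

```python
from typing import List, Sequence, Tuple

# Assumed formats (not taken from any specific detector's output):
Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1) in pixels; x1 and y1 exclusive
Pixel = Tuple[int, ...]           # e.g. (R, G, B) with values 0-255
Image = List[List[Pixel]]         # rows of pixels, indexed image[y][x]


def clamp_box(box: Box, width: int, height: int) -> Box:
    """Clip a box to the image bounds so overshooting detections are safe."""
    x0, y0, x1, y1 = box
    return max(0, x0), max(0, y0), min(width, x1), min(height, y1)


def build_mask(width: int, height: int, boxes: Sequence[Box]) -> List[List[int]]:
    """Return a height x width binary mask: 1 inside any box, 0 elsewhere."""
    mask = [[0] * width for _ in range(height)]
    for box in boxes:
        x0, y0, x1, y1 = clamp_box(box, width, height)
        for y in range(y0, y1):
            for x in range(x0, x1):
                mask[y][x] = 1
    return mask


def spotlight(image: Image, mask: List[List[int]], dim: float = 0.25) -> Image:
    """Keep pixels inside the mask unchanged and darken everything else."""
    return [
        [px if m else tuple(int(c * dim) for c in px) for px, m in zip(row, mrow)]
        for row, mrow in zip(image, mask)
    ]


def add_mask_channel(image: Image, mask: List[List[int]]) -> Image:
    """Append the mask as an extra channel (0 or 255) to every pixel."""
    return [
        [px + (255 * m,) for px, m in zip(row, mrow)]
        for row, mrow in zip(image, mask)
    ]


def crop_regions(image: Image, boxes: Sequence[Box]) -> List[Image]:
    """Return one cropped sub-image per box, clipped to the image bounds."""
    height = len(image)
    width = len(image[0]) if image else 0
    crops = []
    for box in boxes:
        x0, y0, x1, y1 = clamp_box(box, width, height)
        crops.append([row[x0:x1] for row in image[y0:y1]])
    return crops
```

In practice the same logic would usually be written with an array library, and the boxes would come from whichever detector is in use. The extra-channel variant only helps if the vision encoder was built or fine-tuned to accept a fourth channel; the dimming and cropping variants work with an unmodified encoder.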
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T02:27:33.501786+00:00— report_created — created