Report #39963
[frontier] Agent misses critical details in UI screenshots that contain both text and images because vision and text attention compete
Use 'sequential unimodal processing': first prompt the model to describe the image content in isolation (vision-only), then inject that description as text context for the reasoning step. Avoid simultaneous text+image prompts for complex analysis.
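A minimal sketch of the two-step flow, assuming the caller supplies its own model clients. The names `analyze_screenshot`, `describe_image`, `reason_over_text`, and `VISION_ONLY_PROMPT` are hypothetical and do not belong to any real library; the two callables stand in for whatever vision and text endpoints are actually in use.

```python
from typing import Callable

# Hypothetical neutral prompt for the vision-only step. It asks for a
# description and deliberately contains no task-specific question.
VISION_ONLY_PROMPT = (
    "Describe this screenshot exhaustively. List every visible text label, "
    "button, and icon (including small ones such as a close 'X'), with its "
    "approximate position. Do not answer any question yet."
)


def analyze_screenshot(
    image_bytes: bytes,
    question: str,
    describe_image: Callable[[bytes, str], str],
    reason_over_text: Callable[[str], str],
) -> str:
    """Run the vision-only step, then a text-only reasoning step."""
    # Step 1 (vision-only): the image is sent with the neutral prompt;
    # the user's question is withheld from this call.
    description = describe_image(image_bytes, VISION_ONLY_PROMPT)

    # Step 2 (text-only): the description is injected as plain text
    # context, and no image is attached to this call.
    prompt = (
        "Screenshot description:\n"
        f"{description}\n\n"
        f"Question: {question}\n"
        "Answer using only the description above."
    )
    return reason_over_text(prompt)
```

Passing the two model calls in as functions keeps the sketch independent of any particular provider and makes it easy to confirm that the question never reaches the vision step.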
Journey Context:
When a screenshot contains both embedded text (labels, buttons) and visual icons, multimodal models can suffer 'attention collision': the model focuses on the text in the prompt and ignores a visual icon, or vice versa. The naive approach is 'single-shot multimodal': sending the image and the question together in one prompt. Models are reported to perform better on visual question answering when the task is decomposed. The frontier pattern is 'cascaded perception': a vision model extracts a structured text description, then an LLM reasons over that text. This is slower, but more accurate than end-to-end prompting for UI automation, where missing a small 'X' icon is catastrophic.
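One possible shape for the structured description handed from the vision model to the LLM is an explicit element inventory, so that a text-free icon gets its own line instead of being lost in free prose. This is a sketch under assumptions: the `UIElement` fields and the `render_inventory` helper are hypothetical, not a format the report prescribes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UIElement:
    kind: str      # e.g. "button", "icon", "label"
    text: str      # visible text, or "" for a purely visual element
    visual: str    # short appearance note, e.g. "small X"
    position: str  # coarse location, e.g. "top-right"


def render_inventory(elements: List[UIElement]) -> str:
    """Render extracted elements as numbered plain-text lines for the LLM."""
    lines = []
    for i, el in enumerate(elements, start=1):
        # Elements without text are marked explicitly so the reasoning
        # step cannot silently skip them.
        label = f'"{el.text}"' if el.text else "(no text)"
        lines.append(f"{i}. {el.kind} {label}, {el.visual}, at {el.position}")
    return "\n".join(lines)
```

For example, a dialog with a "Save" button and a close icon would render as two numbered lines, with the icon listed as `(no text)`.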
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T21:32:56.388703+00:00— report_created — created