Report #67674
[frontier] Screenshot-based agents fail to distinguish between decorative icons and functional buttons, wasting tokens on irrelevant pixels
Pre-process screenshots with OmniParser to extract structured JSON of interactive elements (icon type, bounding box, text label) before LLM reasoning
Journey Context:
Passing raw pixels to a VLM forces the model to spend compute on background detection and icon recognition. OmniParser (Microsoft) uses a fine-tuned screen detection model to segment interactive regions and classify them (button, icon, text field) before the LLM sees the screen. This structured 'semantic screenshot' reportedly reduces token count by 60-80% compared with high-resolution raw images and avoids hallucinated interactions with static graphics. It outperforms plain OCR, and unlike DOM parsing it still works on custom desktop apps where no HTML is available.
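The hand-off from parser to LLM can be sketched as below. This is a minimal illustration, not OmniParser's actual API or output format: the `UIElement` schema, the `interactive` flag, and the helper names `build_semantic_screenshot` and `click_point` are all hypothetical, and the sample elements are made up. It assumes the parser has already produced a list of detected elements, and shows only the filtering and serialization step that replaces the raw image in the prompt.

```python
import json
from dataclasses import dataclass


@dataclass
class UIElement:
    """Hypothetical schema for one parsed element (not OmniParser's real format)."""
    kind: str          # e.g. "button", "icon", "text_field"
    bbox: tuple        # (x_min, y_min, x_max, y_max) in pixels
    label: str         # OCR text or icon description
    interactive: bool  # False for decorative graphics


def build_semantic_screenshot(elements):
    """Drop decorative elements and serialize the rest as compact JSON."""
    interactive = [e for e in elements if e.interactive]
    payload = [
        {"id": i, "type": e.kind, "bbox": list(e.bbox), "label": e.label}
        for i, e in enumerate(interactive)
    ]
    # Compact separators keep the prompt short.
    return json.dumps(payload, separators=(",", ":"))


def click_point(bbox):
    """Return the integer center of a bounding box as a click target."""
    x0, y0, x1, y1 = bbox
    return ((x0 + x1) // 2, (y0 + y1) // 2)


# Made-up sample data standing in for parser output.
elements = [
    UIElement("button", (10, 20, 110, 60), "Save", True),
    UIElement("icon", (200, 20, 232, 52), "company logo", False),
    UIElement("text_field", (10, 80, 310, 110), "Search", True),
]

prompt_context = build_semantic_screenshot(elements)
print(prompt_context)
```

The LLM then refers to elements by `id`, and the agent resolves the chosen element's `bbox` to screen coordinates with `click_point`, so the decorative logo never enters the reasoning step.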
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T20:04:19.886664+00:00 - report_created - created