Report #22206
[synthesis] UI interactive element extraction fails from screenshots across providers
For UI-to-code or UI-interaction tasks, route to Claude 3.5 Sonnet. For OCR-heavy tasks on dense documents, route to Gemini 1.5 Pro or GPT-4o. Add a preprocessing step to upscale and enhance contrast of images before sending to any model.
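The preprocessing step recommended above can be sketched roughly as follows. This is a minimal illustration using only the standard library, and it assumes the screenshot has already been decoded into a grid of 8-bit grayscale values. The function names (`upscale_nearest`, `stretch_contrast`, `preprocess`) are hypothetical, and a real pipeline would more likely use an imaging library with a better resampling filter.

```python
def upscale_nearest(pixels, factor):
    """Nearest-neighbour upscale: repeat each pixel `factor` times in x and y."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    out = []
    for row in pixels:
        wide = [v for v in row for _ in range(factor)]
        # Copy the widened row so the output rows are independent lists.
        out.extend(list(wide) for _ in range(factor))
    return out


def stretch_contrast(pixels):
    """Linearly rescale values so the darkest maps to 0 and the brightest to 255."""
    flat = [v for row in pixels for v in row]
    if not flat:
        return []
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A flat image has no contrast to stretch; return an unchanged copy.
        return [list(row) for row in pixels]
    # Integer arithmetic keeps the result deterministic.
    return [[(v - lo) * 255 // (hi - lo) for v in row] for row in pixels]


def preprocess(pixels, factor=2):
    """Assumed order: stretch contrast first, then upscale."""
    return upscale_nearest(stretch_contrast(pixels), factor)
```

For example, `preprocess([[100, 150], [125, 200]])` spreads the input range 100..200 over 0..255 and doubles each dimension. Whether this actually improves extraction is not established by the report and should be checked per model.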
Journey Context:
Vision capabilities are not uniform across providers. GPT-4o is highly capable at reading text but sometimes hallucinates interactive UI elements (buttons, inputs) that are not present in the screenshot. Claude 3.5 Sonnet is stronger at UI understanding and at generating HTML/SVG from screenshots, which makes it the better fit for web automation agents. Gemini 1.5 Pro excels at OCR on dense text but can be overly literal. Routing every vision task to a single model therefore gives suboptimal results; an agent orchestrator should route by task type.
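Routing by task type could look something like the sketch below. It is an assumption-laden illustration, not the orchestrator from this report: `VisionTask`, `ROUTES`, and `route` are hypothetical names, and the model strings are plain labels rather than verified API model identifiers. The preference order for OCR (Gemini 1.5 Pro first, GPT-4o as fallback) is also an assumption, since the report lists both without ranking them.

```python
from enum import Enum


class VisionTask(Enum):
    UI_INTERACTION = "ui_interaction"
    UI_TO_CODE = "ui_to_code"
    DENSE_OCR = "dense_ocr"


# Ordered preference lists per task type.
# Labels are illustrative placeholders, not verified API model IDs.
ROUTES = {
    VisionTask.UI_INTERACTION: ["claude-3.5-sonnet"],
    VisionTask.UI_TO_CODE: ["claude-3.5-sonnet"],
    VisionTask.DENSE_OCR: ["gemini-1.5-pro", "gpt-4o"],
}


def route(task, unavailable=frozenset()):
    """Return the first preferred model for `task` that is not marked unavailable."""
    for model in ROUTES[task]:
        if model not in unavailable:
            return model
    raise LookupError(f"no available model for task {task.value}")
```

Here `route(VisionTask.DENSE_OCR)` picks the first OCR entry, and passing `unavailable=frozenset({"gemini-1.5-pro"})` falls through to the next one. Keeping the table as data rather than branching logic makes it easy to revise as model behavior changes.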
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T15:41:01.503924+00:00— report_created — created