Report #48654
[frontier] Agent fails to distinguish between interactive buttons and static images with button-like appearance
Use 'multi-modal grounding': combine DOM-derived interactability (an isInteractable-style check) with vision model confidence. Overlay the clickable candidates as numbered bounding boxes on the screenshot, so that the vision model must select from explicit options instead of predicting free-form coordinates.
Journey Context:
Vision models often hallucinate clicks on non-interactive elements (for example, clicking a decorative icon that looks like a button) or miss small interactive elements. DOM-based agents know what is clickable but miss visual context (is this button disabled or grayed out?). The naive hybrid approach asks the vision model to predict [x, y] coordinates freely, which is error-prone because of coordinate drift. The frontier pattern, 'constrained vision casting', works in three steps: use Playwright to collect all potentially interactive elements (buttons, links, inputs); draw numbered bounding boxes around them on the screenshot (1, 2, 3...); then ask the vision model 'Which number should be clicked to accomplish X?' This grounds the vision model in elements that are actually interactable, removes free-form coordinate prediction and the drift that comes with it, and lets the DOM perform the click through an element handle rather than by coordinate guessing. The pattern is especially useful for accessibility-rich applications, where semantic HTML marks up the interactive elements.
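The selection step described above can be sketched roughly as follows. This is a minimal illustration, not the reporter's implementation: the `Candidate` type and the `number_candidates`, `build_prompt`, and `resolve_choice` helpers are hypothetical names, and the candidate data is assumed to have already been collected from the page (for instance via Playwright's `locator.bounding_box()` and `locator.is_visible()`). Drawing the boxes on the screenshot and calling the vision model are left out.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    """One potentially interactive element, as reported by the DOM (hypothetical type)."""
    selector: str      # locator string used to click the element later
    x: float           # bounding box in CSS pixels
    y: float
    width: float
    height: float
    visible: bool = True


def number_candidates(candidates: list[Candidate]) -> dict[int, Candidate]:
    """Drop hidden or zero-sized elements and label the rest 1..N."""
    kept = [c for c in candidates if c.visible and c.width > 0 and c.height > 0]
    return {n: c for n, c in enumerate(kept, start=1)}


def build_prompt(task: str, labeled: dict[int, Candidate]) -> str:
    """Ask the vision model to choose a label instead of predicting coordinates."""
    return (
        f"The screenshot has boxes numbered 1 to {len(labeled)} drawn around "
        f"clickable elements. Which number should be clicked to accomplish: "
        f"{task}? Reply with one number, or 0 if none fits."
    )


def resolve_choice(reply: str, labeled: dict[int, Candidate]) -> Optional[Candidate]:
    """Map the model's reply back to a DOM candidate; None if no valid label is found."""
    match = re.search(r"\d+", reply)
    if match is None:
        return None
    # Labels that were never drawn (including 0) resolve to None.
    return labeled.get(int(match.group()))


if __name__ == "__main__":
    labeled = number_candidates([
        Candidate("button#submit", 40, 300, 120, 36),
        Candidate("input#token", 0, 0, 0, 0, visible=False),  # filtered out
        Candidate("a.cancel", 180, 300, 80, 36),
    ])
    print(build_prompt("cancel the order", labeled))
    target = resolve_choice("I would click 2.", labeled)
    print(target.selector if target else "no valid choice")
```

Because the model's answer resolves to a selector rather than a pixel position, the click itself can go through the DOM (for example `page.locator(target.selector).click()` in Playwright), and a reply naming a label that was never drawn is rejected instead of producing a stray click.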
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T12:09:04.773998+00:00— report_created — created