Report #62850
[frontier] Agents generate high-level text plans that cannot be executed reliably because they lack pixel-precise grounding: an instruction such as 'click the submit button' is ambiguous when several similar buttons are on screen.
Adopt 'visual planning': construct plans as sequences of grounded visual operations (bounding boxes, coordinates), and validate each operation against the current screenshot before execution.
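A minimal sketch of what a grounded plan step could look like, assuming pixel coordinates with an exclusive lower-right corner. The names `GroundedStep`, `click_point`, and `is_well_formed` are hypothetical and the coordinates are made up for illustration; they do not come from any particular agent framework.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (x1, y1, x2, y2) in screen pixels; x2 and y2 are exclusive (assumption).
BBox = Tuple[int, int, int, int]


@dataclass(frozen=True)
class GroundedStep:
    """One plan step tied to a screen region instead of free text alone."""
    action: str   # e.g. "click"
    label: str    # human-readable description kept for logging
    bbox: BBox    # where the target was seen at planning time

    def click_point(self) -> Tuple[int, int]:
        """Return the centre of the bounding box as the click target."""
        x1, y1, x2, y2 = self.bbox
        return ((x1 + x2) // 2, (y1 + y2) // 2)


def is_well_formed(step: GroundedStep, screen_w: int, screen_h: int) -> bool:
    """Check that the box is non-empty and lies inside the screen."""
    x1, y1, x2, y2 = step.bbox
    return 0 <= x1 < x2 <= screen_w and 0 <= y1 < y2 <= screen_h


# A plan is an ordered list of grounded steps (illustrative values).
plan: List[GroundedStep] = [
    GroundedStep("click", "blue 'Submit' button", (400, 300, 520, 340)),
]
```

This only checks geometric sanity. Whether the box still contains the expected element has to be verified separately against a fresh screenshot.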
Journey Context:
Traditional agents plan in abstract text space: 'Step 1: Login, Step 2: Navigate to settings'. At execution time they rely on DOM selectors or OCR, which may fail if the UI changes. Vision-native agents are moving toward 'pixel-grounded planning', where the plan itself includes visual markers (e.g., 'click at [x,y], where the blue "Submit" button with bounding box [x1,y1,x2,y2] is located'). Before execution, the system verifies that these coordinates still correspond to the expected visual features, using a small VLM such as Qwen2-VL or template matching. If the UI has changed, the plan is invalidated and regenerated. This prevents 'plan drift', where text plans become stale relative to what is actually on screen, and is essential for reliable computer-use automation.
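The validate-then-execute loop described above can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: screenshots are plain grayscale lists of rows, the check is a naive mean absolute pixel difference standing in for real template matching or a VLM query, and `GroundedStep`, `step_is_valid`, `run`, `make_screen`, and the `take_screenshot`, `click`, and `replan` callbacks are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Image = Sequence[Sequence[int]]  # grayscale rows, values 0..255 (assumption)
BBox = Tuple[int, int, int, int]  # x1, y1, x2, y2 with exclusive x2, y2


@dataclass
class GroundedStep:
    label: str
    bbox: BBox
    template: List[List[int]]  # pixels seen inside bbox at planning time


def crop(img: Image, bbox: BBox) -> List[List[int]]:
    x1, y1, x2, y2 = bbox
    return [list(row[x1:x2]) for row in img[y1:y2]]


def mean_abs_diff(a: List[List[int]], b: List[List[int]]) -> float:
    """Average per-pixel difference; infinite if the shapes differ."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        return float("inf")
    count = sum(len(row) for row in a)
    if count == 0:
        return float("inf")
    total = sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return total / count


def step_is_valid(step: GroundedStep, shot: Image, tolerance: float = 10.0) -> bool:
    """True if the region under bbox still looks like the stored template."""
    return mean_abs_diff(crop(shot, step.bbox), step.template) <= tolerance


def run(plan: List[GroundedStep],
        take_screenshot: Callable[[], Image],
        click: Callable[[int, int], None],
        replan: Callable[[Image], List[GroundedStep]],
        max_replans: int = 2) -> int:
    """Execute steps, regenerating the plan when a step fails validation.

    Returns the number of times the plan was regenerated.
    """
    replans = 0
    queue = list(plan)
    while queue:
        shot = take_screenshot()
        step = queue[0]
        if step_is_valid(step, shot):
            x1, y1, x2, y2 = step.bbox
            click((x1 + x2) // 2, (y1 + y2) // 2)
            queue.pop(0)
        elif replans < max_replans:
            replans += 1
            queue = list(replan(shot))  # stale plan is discarded entirely
        else:
            raise RuntimeError(f"step {step.label!r} no longer matches the screen")
    return replans


def make_screen(button: BBox, width: int = 8, height: int = 6) -> List[List[int]]:
    """Toy screenshot: black background with one bright 'button' rectangle."""
    img = [[0] * width for _ in range(height)]
    x1, y1, x2, y2 = button
    for y in range(y1, y2):
        for x in range(x1, x2):
            img[y][x] = 200
    return img
```

As a usage example, build a step from `make_screen((2, 1, 5, 3))`, then run it against `make_screen((3, 3, 6, 5))`: the stale step fails validation, `replan` is called once with the fresh screenshot, and the click lands on the button's new location. A production check would need to tolerate anti-aliasing, theme changes, and scrolling, which this pixel comparison does not.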
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T11:58:30.248781+00:00— report_created — created