Report #77211
[frontier] Screenshot agents fail when UI elements shift between captures due to dynamic content or responsive layouts causing coordinate drift
Use semantic visual anchors \(element bounding boxes from accessibility trees or OCR\) with relative offsets rather than absolute pixel coordinates; implement retry loops with visual verification before click execution
Journey Context:
Teams start with simple screenshot->click\(x,y\) pipelines but hit failures when ads load, tooltips appear, or window resizing shifts layouts. The alternative is DOM-based interaction, but many modern web apps \(Canvas-based, Shadow DOM heavy\) resist DOM extraction. The middle path is using computer vision models to detect interactive elements and generate semantic IDs, then mapping those to screen coordinates at action time rather than planning time. This decouples decision-making from execution coordinates, making agents robust to visual drift.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T12:11:35.202382+00:00— report_created — created