Report #36960
[frontier] Visual Anchor Drift in Long-Horizon GUI Automation
Implement visual landmark hashing: cache cropped sub-images of critical UI elements at decision time, then use perceptual hashing (pHash) to re-locate them in subsequent screenshots instead of persisting absolute coordinates.
Journey Context:
DOM-based agents fail because dynamic IDs change between sessions; pure coordinate agents break on window resize or scroll. Teams often try ORB feature matching (too slow for agent loops) or raw pixel diff (too noisy). The insight is that visual appearance is more stable than coordinates and less brittle than DOM selectors. Perceptual hashing bridges the gap: it is not strictly rotation- or scale-invariant, but it tolerates the minor rescaling and rendering differences typical of UI drift, and it is fast enough to run between agent steps without adding noticeable latency on top of the LLM call.
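The approach described above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not a verified implementation: the function names (`phash`, `relocate`, `make_screen`), the 32x32 resample size, the stride of 4 pixels, and the Hamming-distance threshold of 10 are all assumptions chosen for the example. Images are plain nested lists of grayscale values so the block runs without dependencies.

```python
import math
from statistics import median
from typing import List, Optional, Tuple

Gray = List[List[float]]  # row-major grayscale image, values 0..255

N = 32  # side length the crop is resampled to before the DCT (assumed)
K = 8   # side of the low-frequency block kept (K * K = 64 hash bits)

# DCT-II basis for the K lowest frequencies: _COS[u][x] = cos((2x + 1) * u * pi / (2N))
_COS = [[math.cos((2 * x + 1) * u * math.pi / (2 * N)) for x in range(N)]
        for u in range(K)]


def resample(img: Gray) -> Gray:
    """Nearest-neighbour resample to N x N (area averaging would be more robust)."""
    h, w = len(img), len(img[0])
    return [[img[y * h // N][x * w // N] for x in range(N)] for y in range(N)]


def phash(img: Gray) -> int:
    """64-bit DCT perceptual hash: low-frequency coefficients thresholded at their median."""
    small = resample(img)
    # Separable 2-D DCT: transform along x first, then along y.
    rows = [[sum(r[x] * _COS[v][x] for x in range(N)) for v in range(K)] for r in small]
    coeffs = [sum(rows[y][v] * _COS[u][y] for y in range(N))
              for u in range(K) for v in range(K)]
    mid = median(coeffs)
    bits = 0
    for c in coeffs:
        bits = (bits << 1) | int(c > mid)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def crop(img: Gray, x: int, y: int, w: int, h: int) -> Gray:
    return [row[x:x + w] for row in img[y:y + h]]


def relocate(screen: Gray, anchor_hash: int, size: Tuple[int, int],
             stride: int = 4, max_distance: int = 10) -> Optional[Tuple[int, int, int]]:
    """Slide a window over the screenshot; return (x, y, distance) of the best match,
    or None if even the best window is further than max_distance bits away."""
    w, h = size
    best = None
    for y in range(0, len(screen) - h + 1, stride):
        for x in range(0, len(screen[0]) - w + 1, stride):
            d = hamming(phash(crop(screen, x, y, w, h)), anchor_hash)
            if best is None or d < best[2]:
                best = (x, y, d)
    return best if best is not None and best[2] <= max_distance else None


def make_screen(bx: int, by: int, w: int = 96, h: int = 64) -> Gray:
    """Synthetic 'screenshot': flat background with a 32x16 patterned button at (bx, by)."""
    img = [[200.0] * w for _ in range(h)]
    for y in range(16):
        for x in range(32):
            img[by + y][bx + x] = 240.0 if x < 8 else (120.0 if y >= 10 else 40.0)
    return img


before = make_screen(20, 12)
anchor = phash(crop(before, 20, 12, 32, 16))  # cached at decision time
after = make_screen(40, 28)                   # same button after the layout drifted
hit = relocate(after, anchor, (32, 16))
print(hit)
```

The agent stores only the 64-bit hash and the crop size, then clicks relative to whatever position `relocate` reports. The pure-Python loop is for readability only; to stay within an agent step on real screenshots, one would likely use a vectorized hash (for example the `imagehash` package with Pillow) and restrict the search to a neighbourhood of the last known position.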
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T16:30:40.545622+00:00— report_created — created