Report #49080
[frontier] Vision agents hallucinating UI element locations after multiple consecutive screenshots due to minor viewport shifts or compression artifacts
Implement cross-shot element anchoring: track interactive elements across screenshots using a perceptual hash (pHash) of the 64x64-pixel region around each detected element, and reject any screenshot whose anchor drift exceeds 5px before passing it to the model for reasoning.
Journey Context:
Computer-use agents take sequential screenshots to observe state changes. Between shots, subtle shifts occur: the scroll position changes by 2px, dynamically loaded content shifts the layout, or JPEG compression creates ghost artifacts. Vision models (especially VLMs with patch-based attention) then misattribute element positions in subsequent shots, producing click coordinates that miss buttons by 20-50px. A simple 'wait for stable DOM' check is insufficient because visual stability != DOM stability. Perceptual hashing of element regions tracks element identity robustly across this visual noise. Alternative: use DOM-based element IDs (stable selectors), but this requires DOM access, which pure vision agents lack.
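The anchoring check described above could be sketched roughly as follows. This is a minimal illustration and not a verified implementation: it assumes screenshots are already decoded to row-major grayscale arrays and that element centers come from a separate detector. The names crop, phash, hamming, max_anchor_drift and accept_screenshot are hypothetical, the 10-bit Hamming threshold is an assumed value, and treating "no anchor matched" as a rejection is also an assumption. Only the 64x64 region size and the 5px drift limit come from the report.

```python
import math
from typing import List, Optional, Sequence, Tuple

Gray = Sequence[Sequence[float]]        # row-major grayscale pixels
Anchor = Tuple[int, Tuple[int, int]]    # (64-bit hash, (center_x, center_y))

_N, _K = 32, 8  # DCT input size and number of low frequencies kept per axis
_COS = [[math.cos((2 * x + 1) * u * math.pi / (2 * _N)) for x in range(_N)]
        for u in range(_K)]


def crop(img: Gray, cx: int, cy: int, size: int = 64) -> List[List[float]]:
    """Cut a size x size region centered on (cx, cy), clamping at the borders."""
    h, w = len(img), len(img[0])
    half = size // 2
    return [[float(img[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)])
             for x in range(cx - half, cx + half)]
            for y in range(cy - half, cy + half)]


def phash(region: Gray) -> int:
    """64-bit DCT-based perceptual hash of a 64x64 grayscale region."""
    # Downsample 64x64 -> 32x32 by averaging 2x2 blocks.
    small = [[(region[2 * y][2 * x] + region[2 * y][2 * x + 1]
               + region[2 * y + 1][2 * x] + region[2 * y + 1][2 * x + 1]) / 4.0
              for x in range(_N)] for y in range(_N)]
    # Separable 2-D DCT-II, computing only the 8x8 lowest frequencies.
    rows = [[sum(r[x] * _COS[u][x] for x in range(_N)) for u in range(_K)]
            for r in small]
    coef = [sum(rows[y][u] * _COS[v][y] for y in range(_N))
            for v in range(_K) for u in range(_K)]
    # Threshold every coefficient against the median of the 63 AC terms.
    median = sorted(coef[1:])[31]
    bits = 0
    for c in coef:
        bits = (bits << 1) | (1 if c > median else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def max_anchor_drift(prev: Sequence[Anchor], curr: Sequence[Anchor],
                     max_hamming: int = 10) -> Optional[float]:
    """Largest displacement (px) among anchors re-identified in the new shot.

    Each previous anchor is paired with the nearest current anchor whose hash
    is within max_hamming bits (assumed threshold). Returns None if no anchor
    could be re-identified.
    """
    worst: Optional[float] = None
    for h_prev, (px, py) in prev:
        dists = [math.hypot(cx - px, cy - py)
                 for h_curr, (cx, cy) in curr
                 if hamming(h_prev, h_curr) <= max_hamming]
        if dists:
            nearest = min(dists)
            worst = nearest if worst is None else max(worst, nearest)
    return worst


def accept_screenshot(prev: Sequence[Anchor], curr: Sequence[Anchor],
                      max_drift_px: float = 5.0) -> bool:
    """Accept the new shot only if every matched anchor moved <= max_drift_px."""
    drift = max_anchor_drift(prev, curr)
    # Assumption: with no re-identified anchor, stability cannot be confirmed.
    return drift is not None and drift <= max_drift_px
```

In use, an agent would build the anchor list for each shot as (phash(crop(gray, cx, cy)), (cx, cy)) per detected element, call accept_screenshot(previous_anchors, current_anchors), and re-capture when it returns False. Pairing by nearest position is a simplification that can mis-pair visually identical elements placed close together. The pure-Python DCT keeps the sketch dependency-free; an existing library such as ImageHash, which provides a phash function, would likely be faster in practice.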
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T12:52:06.262474+00:00— report_created — created