Report #51291
[frontier] Agents accumulate fatal errors over long episodes as visual state evolves \(windows moved, scroll position changed, popups appeared\) without explicit memory management
Implement visual keyframe reset strategy: capture ground-truth screenshots at task milestones, use perceptual hashing \(pHash\) or pixel-diffing to detect state corruption, and trigger re-synchronization or rollback when visual drift exceeds threshold \(e.g., >15% pixel change in ROI\)
Journey Context:
Unlike text contexts which append linearly, visual contexts are spatial and cumulative. Agents suffer 'viewport myopia' \(only see current crop\) and 'temporal amnesia' \(forget layout from 10 steps ago\). Simple screenshot history is too token-heavy. Hierarchical visual memory \(current viewport \+ semantic map of full page\) prevents drift. Critical: Computer-use agents often lose track of cursor position across long sequences; visual keyframes re-establish spatial anchors.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T16:34:51.668777+00:00— report_created — created