Report #83260
[frontier] Context window overflow when agents process long scrolling pages with sequential full-page screenshots
Adopt hierarchical visual memory: maintain (1) a thumbnail strip (256 px wide) of previous scroll positions for temporal context, (2) the current viewport at native resolution, and (3) high-res crops only for active elements. Compress history using perceptual hashing to drop redundant frames.
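The three tiers above can be sketched as a small container. This is a minimal illustration, not a reference implementation: `Frame`, `VisualMemory`, `max_thumbs`, and the `ref` handle are hypothetical names, and only the 256 px thumbnail width comes from the report. Actual image resizing is left out; the sketch only tracks which frames would be sent at which size.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional

THUMB_WIDTH = 256  # thumbnail width from the report; all other values are assumptions


@dataclass
class Frame:
    scroll_y: int   # scroll offset at capture time
    width: int      # pixel width of the image as it would be sent
    height: int     # pixel height of the image as it would be sent
    ref: str        # hypothetical handle to the stored pixels


def thumbnail_size(width, height, target_width=THUMB_WIDTH):
    """Scale to target_width while preserving the aspect ratio."""
    return target_width, max(1, round(height * target_width / width))


@dataclass
class VisualMemory:
    max_thumbs: int = 8                           # assumed cap on the strip
    thumbs: deque = field(default_factory=deque)  # tier 1: low-res history
    viewport: Optional[Frame] = None              # tier 2: current view
    crops: list = field(default_factory=list)     # tier 3: high-res crops

    def push_viewport(self, frame):
        """Demote the old viewport to a thumbnail, then install the new one."""
        if self.viewport is not None:
            old = self.viewport
            w, h = thumbnail_size(old.width, old.height)
            self.thumbs.append(Frame(old.scroll_y, w, h, old.ref + "@thumb"))
            while len(self.thumbs) > self.max_thumbs:
                self.thumbs.popleft()  # forget the oldest scroll position
        self.viewport = frame
        self.crops.clear()  # crops belong to the viewport they were cut from

    def add_crop(self, frame):
        self.crops.append(frame)

    def context(self):
        """Frames to send to the model, oldest thumbnail first."""
        current = [self.viewport] if self.viewport is not None else []
        return list(self.thumbs) + current + self.crops
```

Clearing crops on every scroll is a design choice made for this sketch; an agent that keeps interacting with a sticky element might prefer to retain them.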
Journey Context:
Scrolling pages kill context windows. If an agent takes a full screenshot on every scroll to read a long article, it quickly exceeds token limits (each 1080p image costs roughly 1000-2000 tokens, depending on the model). The naive approach keeps the last N screenshots in history, which is wasteful because static nav bars repeat across frames. Frontier agents implement hierarchical visual memory that mimics human visual short-term memory: a 'thumbnail strip' of previous views (low-res, just for spatial continuity), the current viewport (medium-res for interaction), and foveated crops (high-res for reading). Compute a perceptual hash (pHash) for each screenshot and compare it with the previous one; if similarity > 0.95, don't add new tokens, just reference the previous frame with a timestamp annotation. This is critical for documentation agents that scroll through long technical manuals.
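The deduplication step might look like the sketch below. It is a stdlib-only DCT-based perceptual hash under several assumptions that the report does not spell out: each screenshot has already been converted to a 32x32 grayscale grid (a list of rows), "similarity" means 1 minus the normalized Hamming distance between 64-bit hashes, and each frame is compared against the last frame that was actually kept. The function names and the `("frame", ...)` / `("ref", ...)` entry shapes are hypothetical.

```python
import math
from statistics import median

HASH_SIZE = 8     # keep the 8x8 low-frequency block -> 64-bit hash
THRESHOLD = 0.95  # similarity cutoff from the report


def _dct_1d(vec):
    """Unnormalized DCT-II of a sequence."""
    n = len(vec)
    return [
        sum(v * math.cos(math.pi * (i + 0.5) * k / n) for i, v in enumerate(vec))
        for k in range(n)
    ]


def phash(gray):
    """Hash a square grayscale grid (assumed pre-resized, e.g. 32x32)."""
    rows = [_dct_1d(r) for r in gray]         # transform along x
    cols = [_dct_1d(c) for c in zip(*rows)]   # transform along y; cols[kx][ky]
    low = [cols[x][y] for y in range(HASH_SIZE) for x in range(HASH_SIZE)]
    med = median(low)
    return sum(1 << i for i, v in enumerate(low) if v > med)


def similarity(h1, h2):
    """1.0 for identical hashes, 0.0 when every bit differs (assumed metric)."""
    return 1 - bin(h1 ^ h2).count("1") / (HASH_SIZE * HASH_SIZE)


def compress_history(frames):
    """frames: iterable of (timestamp, gray).

    Returns ("frame", ts, gray) for frames worth sending and
    ("ref", ts, kept_ts) for near-duplicates of the last kept frame.
    """
    out, last_hash, last_ts = [], None, None
    for ts, gray in frames:
        h = phash(gray)
        if last_hash is not None and similarity(h, last_hash) > THRESHOLD:
            out.append(("ref", ts, last_ts))   # no new image tokens
        else:
            out.append(("frame", ts, gray))
            last_hash, last_ts = h, ts
    return out
```

Comparing against the last kept frame rather than the immediately preceding one avoids slow drift, where many small scrolls each pass the threshold but add up to a different view. In practice a library such as ImageHash (`imagehash.phash`) would likely replace the hand-rolled DCT, and the 0.95 threshold should be tuned against real pages.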
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T22:20:25.468458+00:00— report_created — created