Report #48177
[frontier] Long-horizon agents hit context limits storing redundant full-screenshot histories
Implement semantic screen differencing: compute perceptual hashes (pHash) or SSIM between consecutive frames; only retain screenshots where inter-frame delta exceeds 5%; for stable periods, store text extractions (OCR + AXTree) instead of pixels
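A minimal sketch of the keyframe filter, under several assumptions: frames arrive as small grayscale grids (already downscaled, for example to 8x8), a simple average hash stands in for the DCT-based pHash named above, and "inter-frame delta" is read as the fraction of hash bits that differ. The names `average_hash`, `frame_delta`, and `select_keyframes` are hypothetical and do not come from any library.

```python
from typing import List, Sequence

# Assumed frame format: rows of grayscale values in 0-255, already downscaled.
Frame = Sequence[Sequence[int]]


def average_hash(frame: Frame) -> List[int]:
    """One bit per pixel: 1 if the pixel is brighter than the frame mean.

    This is an average hash (aHash), a simpler stand-in for pHash.
    """
    pixels = [p for row in frame for p in row]
    mean = sum(pixels) / len(pixels)
    return [1 if p > mean else 0 for p in pixels]


def frame_delta(a: Frame, b: Frame) -> float:
    """Fraction of hash bits that differ between two equally sized frames."""
    ha, hb = average_hash(a), average_hash(b)
    if len(ha) != len(hb):
        raise ValueError("frames must have the same dimensions")
    return sum(x != y for x, y in zip(ha, hb)) / len(ha)


def select_keyframes(frames: List[Frame], threshold: float = 0.05) -> List[int]:
    """Return indices of frames to keep as pixels.

    The first frame is always kept; later frames are kept only when the
    delta against the immediately preceding frame exceeds the threshold.
    """
    if not frames:
        return []
    kept = [0]
    for i in range(1, len(frames)):
        if frame_delta(frames[i - 1], frames[i]) > threshold:
            kept.append(i)
    return kept
```

Frames whose index is not returned would be candidates for replacement by a text extraction. Comparing against the previous frame, rather than the last kept keyframe, follows the "consecutive frames" wording; slow cumulative drift would need the other comparison.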
Journey Context:
A 1920x1080 screenshot, sent as a base64-encoded image, consumes approximately 1000-1500 tokens. At one screenshot per step, twenty steps already put 20k-30k tokens of pure visual history into a 128k context window, and the total grows linearly until no room is left for reasoning. Simple frame sampling (every Nth frame) misses critical state changes. The solution borrows video-compression logic: compute structural similarity (SSIM) between frame N and frame N-1. If similarity stays above 0.95 for three consecutive ticks, the UI is treated as stable, and the screenshots from that stable period are replaced with a single text state snapshot. Only keyframes where significant pixel deltas occur are retained as images. The reported reduction in context usage is 70-90% on stable web pages.
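The stability rule above can be sketched as a small state machine. This is an illustration under stated assumptions, not a reference implementation: SSIM is computed over the whole frame as a single window (production SSIM implementations typically average over local windows), frames are flat lists of grayscale values, and `global_ssim` and `HistoryCompactor` are hypothetical names.

```python
from typing import List, Optional, Sequence, Tuple


def global_ssim(x: Sequence[float], y: Sequence[float],
                dynamic_range: float = 255.0) -> float:
    """SSIM over the whole frame as one window (a simplification)."""
    if len(x) != len(y) or not x:
        raise ValueError("frames must be non-empty and the same size")
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    c1 = (0.01 * dynamic_range) ** 2  # standard SSIM stabilizing constants
    c2 = (0.03 * dynamic_range) ** 2
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx * mx + my * my + c1) * (vx + vy + c2))


class HistoryCompactor:
    """Keeps keyframes as pixels and collapses stable runs into text."""

    def __init__(self, threshold: float = 0.95, stable_ticks: int = 3) -> None:
        self.threshold = threshold
        self.stable_ticks = stable_ticks
        self.history: List[Tuple[str, object]] = []
        self._prev: Optional[Sequence[float]] = None
        self._run = 0  # consecutive ticks with SSIM above the threshold

    def observe(self, pixels: Sequence[float], text_state: str) -> None:
        similar = (self._prev is not None
                   and global_ssim(self._prev, pixels) > self.threshold)
        self._prev = pixels
        if not similar:
            # First frame or a significant change: keep the pixels.
            self._run = 0
            self.history.append(("keyframe", pixels))
            return
        self._run += 1
        if self._run < self.stable_ticks:
            # Similar, but not yet long enough to call the UI stable.
            self.history.append(("pending", pixels))
            return
        # Stable: drop this run's pending screenshots and any older text
        # snapshot from the same run, then keep one current text snapshot.
        while self.history and self.history[-1][0] in ("pending", "text"):
            self.history.pop()
        self.history.append(("text", text_state))
```

In use, each tick would call `observe(pixels, text_state)` with the downscaled frame and the current OCR or AXTree extraction; `history` then holds only keyframes, at most two pending frames, and one text snapshot per stable period.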
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T11:20:55.023742+00:00— report_created — created