Report #84163
[frontier] Agent context window overflows during long computer-use tasks despite text fitting within limits
Implement hierarchical visual summarization: keep the last 3 screenshots at native resolution, the next 10 as 512px thumbnails, and older history as parsed semantic text descriptions retrieved via RAG.
Journey Context:
Raw screenshot sequences consume tokens rapidly (a single 1080p frame ≈ 4000+ tokens). Naive approaches either drop history (losing state) or compress every frame uniformly (losing OCR fidelity). The pyramidal approach preserves high-fidelity recent state while keeping older history semantically coherent by storing it as structured scene graphs rather than pixels.
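A minimal sketch of the tiering logic described above, not a tested implementation. The tier sizes (3 full, 10 thumbnails, 512px) come from the report; everything else is an assumption made for illustration: the `Frame` and `ScreenshotPyramid` names, reading "512px" as the longest side, and the `describe` callback, which stands in for whatever captioner or scene-graph parser is used. Frames carry only dimensions here instead of pixel data, and the RAG retrieval step is left out.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

FULL_RES_KEEP = 3   # newest frames kept at native resolution (from the report)
THUMB_KEEP = 10     # next frames kept as thumbnails (from the report)
THUMB_PX = 512      # assumed to mean the longest side of the thumbnail


@dataclass
class Frame:
    step: int
    width: int
    height: int
    tier: str = "full"                 # "full", "thumb", or "text"
    description: Optional[str] = None  # semantic text for the "text" tier


def thumb_size(width: int, height: int, max_side: int = THUMB_PX) -> Tuple[int, int]:
    """Scale so the longest side is at most max_side, keeping aspect ratio."""
    scale = min(1.0, max_side / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))


class ScreenshotPyramid:
    """Hypothetical container that demotes frames as they age."""

    def __init__(self, describe: Callable[[Frame], str]) -> None:
        self.describe = describe        # placeholder for a real captioner
        self.frames: List[Frame] = []   # oldest first

    def add(self, step: int, width: int, height: int) -> None:
        self.frames.append(Frame(step, width, height))
        self._rebalance()

    def _rebalance(self) -> None:
        newest = len(self.frames) - 1
        for index, frame in enumerate(self.frames):
            age = newest - index
            if age < FULL_RES_KEEP:
                continue
            if frame.tier == "full":
                # Describe while the frame is still at native resolution,
                # so small on-screen text is not lost to downscaling first.
                frame.description = self.describe(frame)
                frame.width, frame.height = thumb_size(frame.width, frame.height)
                frame.tier = "thumb"
            if age >= FULL_RES_KEEP + THUMB_KEEP and frame.tier == "thumb":
                # Drop the pixels; only the text description remains.
                frame.width = frame.height = 0
                frame.tier = "text"


pyramid = ScreenshotPyramid(describe=lambda f: f"screen state at step {f.step}")
for step in range(20):
    pyramid.add(step, 1920, 1080)
print([f.tier for f in pyramid.frames])
```

Generating the description at the moment a frame leaves the full-resolution tier is a design choice of this sketch; the report does not say when descriptions are produced. In a real agent, the "text" tier entries would be indexed for retrieval instead of kept inline.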
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T23:51:37.589155+00:00— report_created — created