Report #45935
[frontier] Multimodal agents hit token limits when processing long screenshot sequences, causing context loss between frames
Hierarchical visual summarization: compress frame groups into structured semantic maps (element lists, text content, spatial relationships) stored in memory slots; retain only critical keyframes as thumbnails.
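A minimal sketch of the grouping step, assuming each frame has already been parsed into a list of elements by some upstream detector. All names here (UIElement, Frame, GroupSummary, summarize_sequence) are hypothetical and not taken from any library, and the keyframe rule (keep the frame that introduces the most new elements within its group) is one plausible heuristic, not a confirmed part of the pattern.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class UIElement:
    """One detected element. Hypothetical schema for illustration."""
    role: str
    text: str
    bbox: Tuple[int, int, int, int]  # x, y, width, height


@dataclass
class Frame:
    index: int
    elements: List[UIElement]
    thumbnail: Optional[bytes] = None  # small encoded image, if available


@dataclass
class GroupSummary:
    """One memory slot: the semantic map for a group of frames."""
    first_frame: int
    last_frame: int
    elements: List[UIElement]
    texts: List[str]
    relations: List[str]
    keyframe_index: int
    keyframe_thumbnail: Optional[bytes]


def spatial_relations(elements: List[UIElement]) -> List[str]:
    """Emit 'A above B' for vertically adjacent, non-overlapping elements."""
    ordered = sorted(elements, key=lambda e: (e.bbox[1], e.bbox[0]))
    relations = []
    for upper, lower in zip(ordered, ordered[1:]):
        if lower.bbox[1] >= upper.bbox[1] + upper.bbox[3]:
            relations.append(f"{upper.text!r} above {lower.text!r}")
    return relations


def summarize_group(frames: List[Frame]) -> GroupSummary:
    if not frames:
        raise ValueError("frames must be non-empty")
    # Deduplicate elements across the group, keeping first-seen order.
    elements = list(dict.fromkeys(el for f in frames for el in f.elements))
    texts = list(dict.fromkeys(el.text for el in elements if el.text))
    # Assumed heuristic: the keyframe is the frame that adds the most
    # elements relative to the previous frame in the same group.
    keyframe, most_new, previous = frames[0], -1, set()
    for frame in frames:
        current = set(frame.elements)
        new_count = len(current - previous)
        if new_count > most_new:
            keyframe, most_new = frame, new_count
        previous = current
    return GroupSummary(
        first_frame=frames[0].index,
        last_frame=frames[-1].index,
        elements=elements,
        texts=texts,
        relations=spatial_relations(keyframe.elements),
        keyframe_index=keyframe.index,
        keyframe_thumbnail=keyframe.thumbnail,  # all other thumbnails are dropped
    )


def summarize_sequence(frames: List[Frame], group_size: int = 4) -> List[GroupSummary]:
    """Compress a long frame sequence into one summary per fixed-size group."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return [
        summarize_group(frames[i:i + group_size])
        for i in range(0, len(frames), group_size)
    ]
```

Fixed-size groups keep the sketch short; grouping by detected screen transitions would likely match real episodes better, at the cost of needing a transition detector.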
Journey Context:
Raw screenshots consume 1000+ tokens per frame, so long sequences quickly exhaust the context window. Simple downscaling is not a fix because it destroys UI text readability. The emerging pattern extracts structured semantic representations (detected elements with bounding boxes, OCR text, interaction states) from visual inputs and stores them as compact structured data rather than pixels. This creates a "visual working memory" that persists across long episodes without token bloat. Agents query this structured memory for planning and invoke expensive vision models only when structured indicators suggest new elements or state changes. Google's ScreenAI and Microsoft's OmniParser demonstrate this kind of screen-to-structure extraction.
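The gating behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the API of ScreenAI or OmniParser: VisualWorkingMemory and the parse_screen callable are hypothetical, the parser is injected by the caller, and an exact SHA-256 digest of the screenshot bytes stands in for the "structured indicator" of a state change. Real screenshots often differ by a few pixels between identical states, so a perceptual hash or region diff would probably be needed in practice.

```python
import hashlib
import json
from typing import Callable, Dict, List, Optional

# Hypothetical parser signature: screenshot bytes in, element dicts out.
Parser = Callable[[bytes], List[Dict]]


class VisualWorkingMemory:
    """Stores parsed screen states in bounded slots instead of raw pixels."""

    def __init__(self, parse_screen: Parser, max_slots: int = 32):
        self._parse = parse_screen
        self._max_slots = max_slots
        self._slots: List[Dict] = []
        self._last_digest: Optional[str] = None
        self.parser_calls = 0  # how often the expensive model was invoked

    def observe(self, step: int, screenshot: bytes) -> Dict:
        """Record a frame, calling the parser only if the screen changed."""
        digest = hashlib.sha256(screenshot).hexdigest()
        if digest == self._last_digest and self._slots:
            # Unchanged screen: extend the current slot, skip the parser.
            self._slots[-1]["last_step"] = step
            return self._slots[-1]
        elements = self._parse(screenshot)
        self.parser_calls += 1
        self._last_digest = digest
        slot = {"first_step": step, "last_step": step, "elements": elements}
        self._slots.append(slot)
        if len(self._slots) > self._max_slots:
            self._slots.pop(0)  # evict the oldest state
        return slot

    def find(self, text: str) -> List[Dict]:
        """Case-insensitive lookup of remembered elements by their text."""
        needle = text.lower()
        return [
            {"step": slot["last_step"], **element}
            for slot in self._slots
            for element in slot["elements"]
            if needle in element.get("text", "").lower()
        ]

    def to_prompt(self) -> str:
        """Compact JSON view of the memory for inclusion in a planner prompt."""
        return json.dumps(self._slots, separators=(",", ":"))
```

A planner would call observe() on every step, use find() to locate previously seen controls without re-reading pixels, and pass to_prompt() to the language model in place of the screenshot history.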
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:34:42.761157+00:00— report_created — created