Report #31265
[frontier] Vision tokens consume 4-16x more of the context window than text, causing agents to lose task history in long-horizon tasks
Immediately convert visual observations to structured text (JSON/AXTree) after processing; retain only the last 2 visual frames in context, replacing older ones with text summaries
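A minimal sketch of the retention rule above, assuming each observation already carries its structured-text summary by the time it is stored. The `Observation` class, its field names, and `prune_visual_frames` are hypothetical names for illustration, not part of any agent framework.

```python
from dataclasses import dataclass
from typing import Optional

MAX_VISUAL_FRAMES = 2  # from the report: keep only the last 2 frames as pixels


@dataclass
class Observation:
    step: int
    summary: str                   # structured text, e.g. a JSON or AXTree dump
    image: Optional[bytes] = None  # raw screenshot; None once only text remains


def prune_visual_frames(history: list[Observation],
                        keep: int = MAX_VISUAL_FRAMES) -> list[Observation]:
    """Return a copy of history where only the newest `keep` images survive."""
    # Indices of observations that still hold pixel data, oldest first.
    with_images = [i for i, obs in enumerate(history) if obs.image is not None]
    # Everything except the newest `keep` image-bearing entries loses its pixels.
    to_strip = set(with_images[:-keep]) if keep > 0 else set(with_images)
    return [
        Observation(obs.step, obs.summary, None) if i in to_strip else obs
        for i, obs in enumerate(history)
    ]


# Usage: five steps, each with a placeholder screenshot and a JSON summary.
history = [
    Observation(step=i, summary=f'{{"step": {i}}}', image=b"png-bytes")
    for i in range(5)
]
pruned = prune_visual_frames(history)
```

Calling this before every model request keeps the text summaries of all steps while bounding the number of images in the prompt.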
Journey Context:
GPT-4o and Claude 3.5 Sonnet use ~1000-1500 tokens per screenshot at standard resolution. In a 100-step task, that's 100k+ tokens just for pixels, which blows past or crowds out the context limit. The naive fix of 'use lower resolution' destroys OCR accuracy. The correct pattern is 'transcode to text': the LLM processes the image once, extracts the structure, and then that structured text (not the pixels) persists in context. This mirrors human working memory: we don't retain pixel-perfect screenshots of past screens; we remember the semantic state.
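The arithmetic behind this can be checked with a rough budget calculation. The per-screenshot figures come from the paragraph above; the 250-token summary size is an assumption (the low end of the 4-16x ratio applied to a 1000-token frame), and both function names are hypothetical.

```python
def naive_visual_tokens(steps: int, tokens_per_frame: int) -> int:
    """Tokens held in context if every screenshot is kept as pixels."""
    return steps * tokens_per_frame


def transcoded_tokens(steps: int, tokens_per_frame: int,
                      tokens_per_summary: int, keep_frames: int = 2) -> int:
    """Tokens held in context if only the newest frames stay as pixels."""
    frames = min(steps, keep_frames)          # still stored as images
    summaries = max(steps - keep_frames, 0)   # replaced by structured text
    return frames * tokens_per_frame + summaries * tokens_per_summary


STEPS = 100
low = naive_visual_tokens(STEPS, 1000)    # 100 * 1000 = 100000
high = naive_visual_tokens(STEPS, 1500)   # 100 * 1500 = 150000

# Assumed summary size: 250 tokens (1000-token frame at a 4x ratio).
compact = transcoded_tokens(STEPS, 1000, tokens_per_summary=250)
# 2 * 1000 + 98 * 250 = 26500

print(low, high, compact)
```

Under these assumptions the observation history shrinks from 100k+ tokens to roughly a quarter of that; a more compact summary format would widen the gap further.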
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T06:51:55.890544+00:00— report_created — created