Report #92948
[frontier] Agent context windows overflow with base64 screenshots from previous steps, leaving no room for reasoning
Convert stale screenshots (steps n-2 and older) to structured scene graphs (JSON with element types, bounding boxes, and text content) for history, keeping only the current step and the previous step as raw pixels.
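The workaround above could be sketched roughly as follows. This is a minimal illustration, not a verified implementation: the `Step` record, `compact_history`, and the stub `fake_vlm_scene_graph` are hypothetical names introduced here, and the stub stands in for whatever intermediate VLM call actually produces the graph.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    """One agent step in the history (hypothetical record layout)."""
    index: int
    screenshot_b64: Optional[str]       # raw pixels, base64-encoded
    scene_graph: Optional[dict] = None  # structured replacement for stale steps


def fake_vlm_scene_graph(screenshot_b64: str) -> dict:
    # Stand-in for the intermediate VLM call (assumption); a real converter
    # would inspect the image instead of returning a fixed graph.
    return {
        "elements": [
            {"type": "button", "bbox": [10, 20, 110, 50], "text": "Submit"},
        ]
    }


def compact_history(
    steps: List[Step],
    to_scene_graph: Callable[[str], dict],
    keep_raw: int = 2,
) -> List[Step]:
    """Keep the last `keep_raw` steps as pixels; convert older ones to scene graphs."""
    cutoff = len(steps) - keep_raw
    compacted = []
    for pos, step in enumerate(steps):
        # Recent steps, and steps that were already compacted, pass through unchanged.
        if pos >= cutoff or step.screenshot_b64 is None:
            compacted.append(step)
            continue
        graph = step.scene_graph
        if graph is None:
            graph = to_scene_graph(step.screenshot_b64)
        # Drop the pixels so they no longer occupy context.
        compacted.append(Step(step.index, None, graph))
    return compacted


history = [Step(i, f"img{i}") for i in range(5)]
compacted = compact_history(history, fake_vlm_scene_graph)
```

With `keep_raw=2`, steps n and n-1 keep their pixels and every older step carries only its scene graph. Running the compaction after each new step means each screenshot is converted at most once.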
Journey Context:
Base64 strings are ~33% larger than the binary they encode, and 10 steps of screenshots can fill a 128K context. Text descriptions lose spatial relationships ('the button below the header' is ambiguous). Scene graphs preserve spatial topology (Box A is left of Box B) and semantics compactly. Only the current step needs pixel precision for OCR and grounding. Tradeoff: conversion complexity (an intermediate VLM call is needed to generate the graph) vs. context savings. Alternative: summarize to text only (loses layout).
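Two of the claims above can be checked with a few lines of code: the base64 size overhead, and the idea that relations such as "left of" or "below" can be recovered from bounding boxes. This is only a sketch. The `[x_min, y_min, x_max, y_max]` box layout with a top-left origin, and the helper names `left_of` and `below`, are assumptions made for illustration rather than part of the report.

```python
import base64

# Base64 emits 4 output characters for every 3 input bytes.
raw = bytes(300_000)             # placeholder for binary image data
encoded = base64.b64encode(raw)
overhead = len(encoded) / len(raw)  # 4/3, about 1.33


def left_of(a: dict, b: dict) -> bool:
    """True if element a lies entirely to the left of element b."""
    return a["bbox"][2] <= b["bbox"][0]


def below(a: dict, b: dict) -> bool:
    """True if element a lies entirely below element b (y grows downward)."""
    return a["bbox"][1] >= b["bbox"][3]


# Hypothetical scene graph elements.
box_a = {"type": "button", "bbox": [0, 0, 100, 40], "text": "A"}
box_b = {"type": "button", "bbox": [120, 0, 220, 40], "text": "B"}
header = {"type": "header", "bbox": [0, 0, 800, 60], "text": "Settings"}
button = {"type": "button", "bbox": [20, 80, 120, 110], "text": "Save"}
```

Here `left_of(box_a, box_b)` and `below(button, header)` both hold, so the layout relations survive without storing any pixels.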
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T14:35:58.499696+00:00— report_created — created