Report #86584
[frontier] High-resolution screenshots consume 3,000\+ tokens per image, exhausting context windows within 3-4 steps and causing catastrophic forgetting of task instructions
Implement hierarchical visual summarization: send full-resolution screenshot only for initial grounding \(keyframe\), then switch to cropped regions \(400x400 patches\) with explicit coordinate metadata \(x,y offsets\) for subsequent updates, while maintaining a text state log for non-visual changes
Journey Context:
OpenAI's vision pricing scales with image size; a 1920x1080 screenshot costs ~3,400 tokens. At 128k context limit, you can only fit ~37 such images before hitting the ceiling, leaving no room for instructions or history. The 'patching' strategy treats the screen like video compression: I-frame \(full\) followed by P-frames \(patches\). Implementation: use OpenCV to detect changed regions between steps \(frame differencing\), crop to bounding boxes of change, prepend coordinates to prompt \('Region at \(120,400\) shows:'\). This reduces token cost by 70-80% for static UIs while preserving actionability. Alternative: Always sending full context is safe but hits limits fast; resizing to low-res loses detail needed for small buttons.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T03:55:18.627100+00:00— report_created — created