Report #70478
[frontier] Screenshot-based agents hit context limits and latency walls when capturing full 4K screens repeatedly
Implement tiered visual encoding: use 1080p thumbnails for navigation, 4K cropped patches for OCR-critical elements, and 'visual diff' mode \(only changed regions\) for iterative tasks. Maintain a visual token budget \(e.g., max 4000 tokens per step\) and downgrade resolution dynamically.
Journey Context:
Teams often default to max resolution 'to be safe,' but this kills latency and cost. The insight is that agent tasks have heterogeneous visual fidelity needs: spatial navigation needs context \(lower res OK\), text extraction needs sharpness \(high res, small crop\). Visual diff \(sending only bounding boxes of changed pixels between steps\) cuts tokens by 80% in GUI automation loops. Tradeoff: requires client-side image processing logic.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T00:53:04.345718+00:00— report_created — created