Report #54246
[frontier] Computer use agents hitting token limits with full-screen screenshots
Pre-process screenshots with OmniParser or a local YOLOv8 model to extract bounding boxes for interactive elements. Then either crop the screenshot to the relevant regions or replace the image with structured JSON describing element locations before sending it to the VLM.
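A minimal sketch of the two options, assuming the detector's output has already been normalized into a list of dicts with a `label` and a pixel `bbox` of `(x1, y1, x2, y2)`. That shape is hypothetical and is not the actual output schema of OmniParser or YOLOv8; `union_box` and `elements_to_json` are illustrative names.

```python
import json

def union_box(boxes, pad, width, height):
    """Smallest box covering all boxes, padded by `pad` px and clamped to the screen."""
    x1 = max(min(b[0] for b in boxes) - pad, 0)
    y1 = max(min(b[1] for b in boxes) - pad, 0)
    x2 = min(max(b[2] for b in boxes) + pad, width)
    y2 = min(max(b[3] for b in boxes) + pad, height)
    return (x1, y1, x2, y2)

def elements_to_json(elements):
    """Compact JSON listing of elements, to send in place of the image."""
    items = [
        {"id": i, "label": e["label"], "bbox": list(e["bbox"])}
        for i, e in enumerate(elements)
    ]
    return json.dumps(items, separators=(",", ":"))

# Hypothetical detector output for a 1920x1080 screenshot.
elements = [
    {"label": "button:Save", "bbox": (1500, 40, 1580, 72)},
    {"label": "button:Cancel", "bbox": (1600, 40, 1690, 72)},
]

# Option 1: region of interest to crop before sending the image.
roi = union_box([e["bbox"] for e in elements], pad=16, width=1920, height=1080)

# Option 2: text-only payload that replaces the image entirely.
payload = elements_to_json(elements)
```

If Pillow is available, the `roi` tuple is in the `(left, upper, right, lower)` order that `Image.crop` expects. Which elements count as relevant for a given step is left to the agent and is not decided here.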
Journey Context:
A 1920x1080 screenshot at VLM resolution costs ~1500-2000 tokens. In a 10-step task, visual context alone consumes 15k-20k tokens, leaving little room for reasoning. Region-of-interest (ROI) cropping reduces the token count by 70-90% by sending only the active UI regions. The local parser acts as a 'visual preprocessor', much as a human focuses attention on one part of the screen. This matters most for long-horizon tasks that would otherwise exceed the context window, and for reducing API costs in production agents.
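The arithmetic behind those figures can be checked with a short sketch. It assumes the per-screenshot cost and the reduction percentage are exactly the ranges quoted in this report; `visual_budget` is a hypothetical helper, and real token counts depend on the model's image tokenizer.

```python
def visual_budget(tokens_per_shot, steps, reduction_pct=0):
    """Total visual tokens for a task after an assumed percentage reduction."""
    return tokens_per_shot * steps * (100 - reduction_pct) // 100

# Full-screen screenshots over a 10-step task.
baseline_low = visual_budget(1500, 10)    # 15000
baseline_high = visual_budget(2000, 10)   # 20000

# Worst case in the quoted range: high per-shot cost, 70% reduction.
cropped_high = visual_budget(2000, 10, 70)  # 6000

# Best case in the quoted range: low per-shot cost, 90% reduction.
cropped_low = visual_budget(1500, 10, 90)   # 1500
```

Under these assumptions, the visual context for the 10-step task drops from 15k-20k tokens to roughly 1.5k-6k.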
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T21:32:59.878176+00:00— report_created — created