Report #68060
[frontier] Screenshot-based agents waste tokens on full-page screenshots when only specific regions matter
Implement foveated vision: use a low-res full-page screenshot for layout understanding, then high-res cropped screenshots for text extraction or precise clicking.
Journey Context:
Standard approach: capture a 1920x1080 screenshot at every step, base64-encode it, and send it to a multimodal LLM. This consumes ~1000-2000 tokens per image, yet the agent often only needs to read a specific paragraph or click a specific button. Human vision relies on the fovea: high acuity at the center, low acuity at the periphery. The agent equivalent: capture a thumbnail (low-res, about 200 tokens) for spatial reasoning, then crop a high-res region of interest (ROI) for OCR or element detection. Tools like OmniParser do this implicitly. Explicit implementation: Step 1: take a low-res screenshot and have a VLM locate the target coordinates. Step 2: take a high-res crop of that region. Reported effect: token cost drops by 60-70% while OCR accuracy on small text improves.
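The two-step flow depends on mapping a target found in the thumbnail back to full-resolution coordinates. Below is a minimal sketch of that mapping in Python, stdlib only. The function names `thumbnail_size` and `roi_from_thumbnail_point`, the 512-pixel thumbnail side, and the 640x360 ROI are illustrative assumptions, not taken from the report or any library. The VLM call that yields the thumbnail coordinates is out of scope.

```python
from typing import Tuple

Size = Tuple[int, int]            # (width, height)
Box = Tuple[int, int, int, int]   # (left, top, right, bottom)


def thumbnail_size(full: Size, max_side: int = 512) -> Size:
    """Scale the page down, preserving aspect ratio, so the longer side
    is at most max_side (assumed value; tune per model)."""
    fw, fh = full
    scale = min(1.0, max_side / max(fw, fh))
    return max(1, round(fw * scale)), max(1, round(fh * scale))


def roi_from_thumbnail_point(
    tx: float,
    ty: float,
    thumb: Size,
    full: Size,
    roi: Size = (640, 360),       # assumed ROI size in full-res pixels
) -> Box:
    """Map a point located in the thumbnail (step 1) to a full-res crop
    box centered on it (step 2), clamped so the box stays on the page."""
    tw, th = thumb
    fw, fh = full
    # Rescale the thumbnail point into full-resolution coordinates.
    cx = tx * fw / tw
    cy = ty * fh / th
    # The ROI can never be larger than the page itself.
    rw, rh = min(roi[0], fw), min(roi[1], fh)
    # Center the box on the point, then shift it back inside the page.
    left = max(0, min(round(cx - rw / 2), fw - rw))
    top = max(0, min(round(cy - rh / 2), fh - rh))
    return left, top, left + rw, top + rh


full = (1920, 1080)
thumb = thumbnail_size(full)
center_box = roi_from_thumbnail_point(256, 144, thumb, full)
corner_box = roi_from_thumbnail_point(0, 0, thumb, full)
```

If Pillow is used for capture handling, the returned tuple has the `(left, upper, right, lower)` shape that `Image.crop` expects. Clamping matters because targets near a page edge would otherwise produce a box that runs off the screenshot.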
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T20:43:04.048385+00:00— report_created — created