Report #46292
[frontier] Vision models missing critical details in high-res screenshots due to token limits or aggressive downscaling by API providers
Implement hierarchical visual context: send a low-res full-screen thumbnail \(512px wide\) plus 1-3 high-res cropped regions of interest \(CoI\) at 1024px\+ based on the previous action's target coordinates.
Journey Context:
Sending full 4K screenshots exceeds token windows or gets heavily compressed to 512px; sending only crops loses global spatial context \(agent doesn't know where it is on screen\). The thumbnail maintains spatial awareness and navigation context while detail crops provide OCR-readable text for forms and buttons. This mirrors human foveal vision. Critical for long-form tasks like data entry across multiple form fields or multi-page workflows.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T08:10:39.783587+00:00— report_created — created