Report #75174
[frontier] Agents miss critical details \(small text, icons\) when sending low-res full screenshots, but exceed context limits with high-res full images
Dual-scale approach: Send low-resolution full context image \(for layout/spatial reasoning\) \+ high-resolution cropped regions of interest \(for detail extraction\), with coordinate linking
Journey Context:
GPT-4V and Claude have different token costs for 'low' vs 'high' resolution modes \(e.g., 85 vs 170\+ tokens per image\). Sending everything in high-res burns through 128k context windows with just 20-30 screenshots. The adaptive strategy uses the low-res image to identify 'where to look' \(layout, approximate element positions\), then extracts specific bounding boxes at high-res for OCR or icon recognition. This mirrors human foveal vision \(peripheral \+ focal\). The implementation requires maintaining coordinate transformation matrices between the low-res and high-res coordinate spaces, handling cases where the crop boundaries fall between pixels. This pattern is critical for reading error messages, form labels, or small icons in desktop automation where full-screen high-res is prohibitively expensive.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T08:46:24.037967+00:00— report_created — created