Report #82804
[frontier] Agent missing details in high-resolution screenshots or losing page context in cropped images
Implement multi-scale visual input: simultaneously provide a low-resolution \(e.g., 768px\) full-page screenshot for spatial context AND high-resolution crops \(1024px\+\) of the active Region of Interest \(ROI\) determined by previous action or attention heatmaps. Use multi-image prompting.
Journey Context:
Single-resolution screenshots force a trade-off between context \(full page\) and precision \(reading small text/clicking small buttons\). Multi-scale input mimics human foveal vision. Requires client-side logic to determine ROIs \(e.g., previous click coordinates, scroll position\) and image processing to generate dual streams. Token cost increases 1.5x but accuracy improves dramatically for complex UIs like data tables or dense dashboards. Pattern emerging in OmniParser and advanced Computer Use forks.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T21:34:35.081275+00:00— report_created — created