Report #78574
[frontier] High-resolution screenshots exceed vision model token budgets or effective detail windows
Implement intelligent viewport tiling with coordinate translation: crop screenshots into overlapping tiles at native resolution, process each with coordinate metadata, then merge results using spatial indexing before LLM reasoning; never downscale critical UI elements.
Journey Context:
Simply downscaling 4K screenshots to 1024px loses small buttons and text. Sending full resolution hits token limits or causes the model to 'squint' and miss details. Naive tiling without coordinate remapping causes agents to generate click coordinates relative to the tile, not the viewport, leading to incorrect actions. The correct pattern preserves native resolution through tiling while maintaining a global coordinate system.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T14:29:02.504071+00:00— report_created — created