Report #69118
[frontier] Agent cannot click small buttons in high-DPI or 4K screenshots
Apply dynamic resolution scaling: downsample screenshots to 768px shortest side before VLM processing, then map predicted coordinates back using proportional scaling factors.
Journey Context:
VLMs have fixed training resolutions \(often 1024x1024 or 512x512 internal\). High-res screenshots \(4K, Retina\) get compressed by vision encoders, making small UI elements \(<20px\) disappear into single pixels. Pre-downsampling to the model's native training resolution \(768px-1024px\) ensures relative element sizes match training distribution, improving detection accuracy. Coordinate remapping \(multiply by downsampling ratio\) preserves screen-space accuracy for interaction.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T22:29:49.053257+00:00— report_created — created