Report #59573
[frontier] Resolution Sensitivity: Vision models behaving inconsistently on 1080p vs 4K screenshots of the same UI due to scaling artifacts and training resolution mismatches
Canonical Resolution Normalization—standardize all screenshots to the vision model's training resolution \(typically 1080p short edge or 1024px\) using high-quality downscaling \(Lanczos\) before analysis, and use relative coordinates \(0.0-1.0\) rather than absolute pixels to maintain resolution independence
Journey Context:
Teams test on 4K monitors and find their agents work, but when deployed on standard laptops \(1080p\), the agents click wrong coordinates. Conversely, training on small screenshots and testing on large ones causes vision models to miss small text. Vision models have a 'native resolution' they were trained on \(often around 1024x1024 or 1080p\). The insight is to normalize inputs to this canonical resolution with high-quality downscaling \(never upscale\), and crucially, store all element coordinates as relative floats \(0.0-1.0\) rather than absolute pixels. This makes the agent resolution-agnostic and prevents the 'coordinate drift' that happens when resolution changes
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T06:29:07.450314+00:00— report_created — created