Report #45550
[frontier] Vision-language models produce different action plans when given identical screenshots with different compression levels or aspect ratio padding, leading to non-deterministic agent behavior
Standardize on 'vision normalization': resize all screenshots to a fixed canvas (e.g., 1344x896) with letterboxing, use lossless PNG for text-heavy UIs and high-quality JPEG (quality 95) only for photorealistic content, and include a 'resolution token' in the prompt ('Image resolution: 1920x1080') to calibrate coordinate predictions
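A minimal sketch of the normalization steps above, using only the Python standard library. The canvas size is the 1344x896 example from this report, not a recommendation for any particular model. The names `letterbox_geometry`, `choose_format`, and `resolution_line` are hypothetical helpers, and the sketch assumes the resolution token reports the original screenshot size. The actual pixel resize and padding would be done with an imaging library of your choice using the geometry computed here.

```python
from dataclasses import dataclass

# Example canvas from the report; adjust to the target model (assumption).
CANVAS_W, CANVAS_H = 1344, 896


@dataclass(frozen=True)
class Letterbox:
    scale: float  # uniform scale applied to both axes
    new_w: int    # width of the resized screenshot inside the canvas
    new_h: int    # height of the resized screenshot inside the canvas
    pad_x: int    # left padding in canvas pixels
    pad_y: int    # top padding in canvas pixels


def letterbox_geometry(src_w: int, src_h: int,
                       canvas_w: int = CANVAS_W,
                       canvas_h: int = CANVAS_H) -> Letterbox:
    """Compute a centered, aspect-preserving fit of the source in the canvas."""
    if src_w <= 0 or src_h <= 0:
        raise ValueError("source dimensions must be positive")
    # Use the smaller ratio so the whole screenshot fits without stretching.
    scale = min(canvas_w / src_w, canvas_h / src_h)
    new_w = min(canvas_w, max(1, round(src_w * scale)))
    new_h = min(canvas_h, max(1, round(src_h * scale)))
    pad_x = (canvas_w - new_w) // 2
    pad_y = (canvas_h - new_h) // 2
    return Letterbox(scale, new_w, new_h, pad_x, pad_y)


def choose_format(text_heavy: bool) -> tuple[str, dict]:
    """Pick an encoding: lossless PNG for text-heavy UIs, JPEG q=95 otherwise."""
    return ("PNG", {}) if text_heavy else ("JPEG", {"quality": 95})


def resolution_line(src_w: int, src_h: int) -> str:
    """Build the 'resolution token' line to include in the prompt."""
    return f"Image resolution: {src_w}x{src_h}"


if __name__ == "__main__":
    geo = letterbox_geometry(1920, 1080)
    print(geo)
    print(choose_format(text_heavy=True))
    print(resolution_line(1920, 1080))
```

Running the same function on every screenshot gives one fixed geometry per source resolution, which is the property the workaround relies on.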
Journey Context:
VLMs are sensitive to image tokenization artifacts. A screenshot at native 4K resized to 512px produces different feature activations than one at 1080p resized to 512px, even if they show the same UI. Aspect ratio distortion (stretching 16:9 to 1:1) causes coordinate prediction drift. The frontier pattern treats image preprocessing as a deterministic protocol: fixed target resolutions (matched to the resolutions the VLM was trained on), letterboxing to preserve aspect ratios, explicit resolution metadata in prompts so the model can scale coordinates correctly, and format selection based on content type (lossless for text, to prevent compression artifacts on small fonts). This removes a major source of non-determinism in computer-use agents.
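Letterboxing adds an offset and a scale that must be undone before the agent acts on a predicted coordinate. The sketch below assumes the model returns coordinates in canvas pixels and that the screenshot was centered on the canvas. Both are assumptions, since some models return normalized or original-resolution coordinates instead. `canvas_to_source` is a hypothetical helper name, and the 1344x896 default is the example canvas from this report.

```python
def canvas_to_source(x: float, y: float,
                     src_w: int, src_h: int,
                     canvas_w: int = 1344,
                     canvas_h: int = 896) -> tuple[int, int]:
    """Map a point on the letterboxed canvas back to original screen pixels."""
    # Recompute the same aspect-preserving fit used during preprocessing.
    scale = min(canvas_w / src_w, canvas_h / src_h)
    new_w = min(canvas_w, max(1, round(src_w * scale)))
    new_h = min(canvas_h, max(1, round(src_h * scale)))
    pad_x = (canvas_w - new_w) // 2
    pad_y = (canvas_h - new_h) // 2

    # Remove the padding offset, then undo the uniform scale.
    sx = (x - pad_x) / scale
    sy = (y - pad_y) / scale

    # Clamp so points predicted inside the padding land on the screen edge.
    sx = min(max(sx, 0), src_w - 1)
    sy = min(max(sy, 0), src_h - 1)
    return round(sx), round(sy)


if __name__ == "__main__":
    # Center of the canvas for a 1920x1080 screenshot.
    print(canvas_to_source(672, 448, 1920, 1080))
```

Clamping is one possible policy. An agent could instead reject predictions that fall inside the padding bars and ask the model again.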
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T06:55:43.530777+00:00— report_created — created