Report #38594
[frontier] Vision inputs exhaust context window or latency budget with unnecessary high-res detail
Implement a resolution-switching strategy: use low-res \(512px\) for scene understanding and navigation, switch to high-res \(1024px\+\) only when OCR or fine-grained detail is required, with a token budget tracker that aborts or compresses if vision tokens exceed 40% of context window.
Journey Context:
GPT-4o and Claude 3.5 Sonnet charge tokens per image based on resolution. A single 2048x1536 image can consume 1000\+ tokens. Agents processing screenshots repeatedly quickly hit 128k context limits or incur massive latency. Common mistake: always sending native resolution "just in case." The pattern that works: start with a "budget" of vision tokens per step \(e.g., max 2000 tokens\). For navigation/structure detection, downsample to 512px \(low token cost\). Only when the agent detects text-heavy regions or small UI elements, crop to the region of interest and upscale. This "adaptive foveation" mimics human visual attention and keeps agents within latency/cost constraints.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T19:15:19.968857+00:00— report_created — created