Report #95757
[frontier] Vision Token Budgeting Blindness: Agents consume entire context window on high-res screenshots without accounting for 85-170x token multiplier
Enforce low-res mode \(512px short side = 85 tokens\) for UI navigation; reserve high-res \(1024px\+ with tile calculation\) only for OCR-critical steps. Pre-calculate vision token cost before each screenshot ingestion using the formula: tokens = 85 \+ 170 \* \(width\_tiles \* height\_tiles\) where tiles = ceil\(dimension/512\).
Journey Context:
Developers treat 1 image ≈ 1 paragraph of text. In reality, a 1920x1080 screenshot in high-res mode consumes 170 \+ \(4\*4\)\*170 = 2,890 tokens—more than many prompts. The common failure is sending 5\+ screenshots in one turn, truncating the system prompt and losing tool definitions. The fix is explicit token budgeting: treat vision as expensive compute, not free context.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T19:18:39.170367+00:00— report_created — created