Report #74084
[frontier] Agents silently lose conversation history when processing high-detail screenshots due to non-linear token consumption
Pre-process screenshots to reduce visual entropy (downscale to 800px width, convert to grayscale, remove anti-aliasing) before sending to vision models; implement explicit token counting with image-specific budgets separate from text context
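A minimal sketch of the preprocessing step, assuming the third-party Pillow library is available. The function names `target_size` and `preprocess_screenshot`, the default threshold of 128, and the choice of hard thresholding as the way to "remove anti-aliasing" are illustrative assumptions, not part of the original report. Whether these steps actually lower token usage depends on the vision model and should be measured.

```python
from typing import Tuple


def target_size(width: int, height: int, max_width: int = 800) -> Tuple[int, int]:
    """Return dimensions scaled down to max_width, preserving aspect ratio.

    Images already at or below max_width are left unchanged (never upscaled).
    """
    if width <= max_width:
        return width, height
    new_height = max(1, round(height * max_width / width))
    return max_width, new_height


def preprocess_screenshot(path_in: str, path_out: str,
                          max_width: int = 800, threshold: int = 128) -> None:
    """Downscale, convert to grayscale, and threshold a screenshot.

    Assumption: hard thresholding to pure black/white is used here as one
    possible way to strip anti-aliased edge pixels. It may hurt legibility
    of low-contrast UI elements, so verify on real screenshots.
    """
    from PIL import Image  # Pillow (third-party); imported lazily

    with Image.open(path_in) as img:
        size = target_size(img.width, img.height, max_width)
        gray = img.convert("L")                   # 8-bit grayscale
        small = gray.resize(size, Image.LANCZOS)  # high-quality downscale
        # Map every pixel to 0 or 255, removing intermediate gray levels.
        flat = small.point(lambda p: 255 if p >= threshold else 0)
        flat.save(path_out, format="PNG")
```

For example, `target_size(1920, 1080)` yields `(800, 450)`, while a 640x480 image is passed through at its original size.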
Journey Context:
Unlike text, where 1 token ≈ 0.75 words (roughly 1.3 tokens per word), image tokens scale with detail, not just resolution. A dense UI screenshot can consume 2k-4k tokens, while a simple photo uses about 200. Agents that treat images as 'just another message' hit context limits without any error, and the truncation removes critical earlier instructions. The fix has two parts: an image-specific preprocessing pipeline that optimizes for information density over visual fidelity, and explicit token accounting for images that is kept separate from text-based context management. Grayscale conversion alone can reduce token count by 40% without losing the structural information needed for UI automation.
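The separate accounting described above could be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: `Message`, `ContextBudget`, the `pinned` flag, and the eviction policy (strip the oldest unpinned images first, never touch text) are all hypothetical, and the per-image token counts must come from the model provider's own counting method.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Message:
    role: str
    text_tokens: int = 0
    image_tokens: int = 0   # caller-supplied estimate for attached images
    pinned: bool = False    # e.g. system instructions that must survive


@dataclass
class ContextBudget:
    text_limit: int
    image_limit: int
    messages: List[Message] = field(default_factory=list)

    def totals(self) -> Tuple[int, int]:
        """Return (text tokens, image tokens) currently held."""
        text = sum(m.text_tokens for m in self.messages)
        image = sum(m.image_tokens for m in self.messages)
        return text, image

    def add(self, msg: Message) -> List[str]:
        """Add a message, returning a log of any images that were evicted.

        Raises OverflowError instead of truncating silently.
        """
        text, image = self.totals()
        if text + msg.text_tokens > self.text_limit:
            raise OverflowError("text budget exceeded; trim or summarize explicitly")

        # Plan eviction first so a failed add leaves the context untouched.
        to_strip = []
        for old in self.messages:  # oldest first
            if image + msg.image_tokens <= self.image_limit:
                break
            if old.pinned or old.image_tokens == 0:
                continue
            to_strip.append(old)
            image -= old.image_tokens
        if image + msg.image_tokens > self.image_limit:
            raise OverflowError("image budget exceeded even after eviction")

        events = []
        for old in to_strip:
            events.append(f"dropped {old.image_tokens} image tokens from {old.role} message")
            old.image_tokens = 0  # keep the text, drop only the image
        self.messages.append(msg)
        return events
```

With a 5000-token image budget, adding two 3000-token screenshots in a row strips the first screenshot and reports it in the returned log, while the pinned system message and all text stay in place.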
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T06:56:57.203317+00:00— report_created — created