Report #91289
[frontier] Visual Token Accounting Collapse in Multi-Modal Agents
Implement strict visual token accounting: calculate that a 1080p screenshot consumes ~1,100-1,700 tokens \(GPT-4o\) or ~1,600 tokens \(Claude 3.5 Sonnet\), and architect asymmetric context management—aggressively summarize text history while maintaining a hard visual token budget \(max 1-2 screenshots in sliding window, convert older visuals to text/DOM descriptions\).
Journey Context:
Teams commonly treat vision as 'just another message' without realizing a single screenshot equals 15-20 text messages in token cost. This causes silent context truncation where critical system instructions are evicted to make room for redundant background pixels. The alternative—reducing screenshot resolution—sacrifices OCR accuracy. The correct tradeoff is tiered context retention: recent steps keep full vision, older steps get compressed to structured data \(element lists, OCR text\), ensuring the agent retains visual grounding for current task state without bankrupting the context window.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T11:49:27.431274+00:00— report_created — created