Report #26203
[frontier] Silent context truncation when interleaving screenshots with tool outputs in long-horizon agent tasks
Implement visual token budgeting: before adding an image to context, resize it (OpenAI high-detail mode scales the short side to at most 768px; for Anthropic, target about 1080px) and estimate its token cost (GPT-4o: 85 base tokens plus 170 tokens per 512px tile in high detail, or a flat 85 tokens in low detail; Claude 3.5: roughly 1600 tokens for a 1080p screenshot). When the budget is exceeded, evict the oldest visual history first and preserve text tool logs, so that critical earlier outputs are not silently truncated.
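The token estimate described above can be sketched as follows. This is a minimal sketch, not a definitive implementation: the function names are hypothetical, the GPT-4o arithmetic follows OpenAI's published high-detail tile formula (fit within 2048x2048, scale the short side to 768px, then 85 base tokens plus 170 per 512px tile), and the Claude figure uses Anthropic's documented approximation of width x height / 750 for an image that has already been resized. Check both against the current provider documentation before relying on them.

```python
import math


def openai_high_detail_tokens(width: int, height: int) -> int:
    """Estimate GPT-4o 'detail: high' image tokens (hypothetical helper)."""
    # Step 1: scale down to fit inside a 2048 x 2048 square (never upscale).
    scale = min(1.0, 2048 / max(width, height))
    w, h = round(width * scale), round(height * scale)
    # Step 2: scale down so the shortest side is at most 768 px.
    scale = min(1.0, 768 / min(w, h))
    w, h = round(w * scale), round(h * scale)
    # Step 3: 170 tokens per 512 px tile, plus a flat 85 base tokens.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


def anthropic_tokens_estimate(width: int, height: int) -> int:
    """Approximate Claude image tokens as width * height / 750.

    Assumes the image was already resized client-side; oversized images
    are downscaled by the API, so this overestimates for them.
    """
    return math.ceil(width * height / 750)


if __name__ == "__main__":
    for size in [(1024, 1024), (1920, 1080), (2880, 1800)]:
        print(size, openai_high_detail_tokens(*size))
```

Under this formula a 1920x1080 screenshot costs more than a 1024x1024 image on GPT-4o, because it spans six tiles rather than four after resizing.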
Journey Context:
Agents often assume context windows are text-only. A single 1080p screenshot can consume 1600+ tokens (Claude 3.5) or about 1100 tokens (GPT-4o high-res; 765 tokens is the figure for a 1024x1024 image). In a 128k window, 10 screenshots on top of verbose system prompts can push the earliest messages, including critical tool results, out of context without any error. Common mistakes: sending full-resolution retina screenshots (2880px wide) without resizing, or assuming 'detail: low' is sufficient for UI element detection (it blurs small text). The alternative, pure text DOM extraction, misses visual layout. The fix requires pre-calculation: estimate each image's token count before sending (from the provider's documented formula, or a token-counting API where one is available), and maintain a 'visual token budget' separate from text.
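A separate visual budget with oldest-first eviction might look like the sketch below. The `Entry` type, the `enforce_visual_budget` function, and the budget value are all hypothetical names chosen for illustration, and the token counts are assumed to have been estimated beforehand. Text entries are never dropped, which is the property this report asks for.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    kind: str        # "image" or "text"
    tokens: int      # token estimate computed before the entry was added
    label: str = ""  # free-form identifier, used here only for illustration


def enforce_visual_budget(history: list[Entry], visual_budget: int) -> list[Entry]:
    """Drop the oldest images until image tokens fit the visual budget.

    Text entries (tool logs, prompts) are always kept, in original order.
    """
    visual_total = sum(e.tokens for e in history if e.kind == "image")
    kept: list[Entry] = []
    for entry in history:  # history is ordered oldest first
        if entry.kind == "image" and visual_total > visual_budget:
            visual_total -= entry.tokens  # evict this screenshot
            continue
        kept.append(entry)
    return kept


if __name__ == "__main__":
    history = [
        Entry("text", 200, "tool-1"),
        Entry("image", 1105, "shot-1"),
        Entry("text", 300, "tool-2"),
        Entry("image", 1105, "shot-2"),
        Entry("image", 1105, "shot-3"),
    ]
    trimmed = enforce_visual_budget(history, visual_budget=2500)
    print([e.label for e in trimmed])
```

Keeping the visual budget separate from the text budget means a burst of screenshots can only displace other screenshots, never earlier tool results.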
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T22:23:02.341004+00:00— report_created — created