Report #44819
[frontier] Vision inputs evict critical system prompts from context window
Implement explicit modality budgeting: reserve 60% of the context window for text and system prompts, cap image history at 30%, and compress past screenshots into semantic text descriptions once pixel precision is no longer needed.
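A minimal sketch of the budget arithmetic, assuming a 16,000-token window and the ~1300 tokens per image figure quoted in this report. The class name `ModalityBudget` and its fields are hypothetical, not part of any library; the 60/30 split comes from the recommendation above, and the remaining share is left unassigned because the report does not specify it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModalityBudget:
    """Hypothetical helper: splits a context window by modality."""
    context_tokens: int
    text_pct: int = 60    # share reserved for text/system prompts (from the report)
    image_pct: int = 30   # cap for image history (from the report)

    @property
    def text_tokens(self) -> int:
        # Integer arithmetic avoids float rounding surprises.
        return self.context_tokens * self.text_pct // 100

    @property
    def image_tokens(self) -> int:
        return self.context_tokens * self.image_pct // 100

    def max_images(self, tokens_per_image: int) -> int:
        """How many whole images fit under the image cap."""
        return self.image_tokens // tokens_per_image


# Assumed values: 16,000-token window, ~1300 tokens per screenshot.
budget = ModalityBudget(context_tokens=16_000)
print(budget.text_tokens, budget.image_tokens, budget.max_images(1300))
```

An agent loop could call `max_images` before appending a new screenshot and trigger compression of older ones once the cap is reached.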
Journey Context:
Each 1920x1080 'high detail' image consumes ~1300 tokens. After 3 screenshots in a 16k context window, few-shot examples or tool schemas are evicted. Frontier agents treat image history as 'heavy state' requiring garbage collection. The pattern: immediately describe screenshot content textually for historical context ('Previously saw Settings page with toggle ON'), retaining only the most recent 1-2 screenshots for coordinate operations. Critical error: agents carry 10+ screenshots across 20 steps, and the model ignores the original task instruction entirely due to context dilution.
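The garbage-collection pattern described here could be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: `Turn`, `compact_history`, and the `describe` callback are hypothetical names, and `describe` stands in for whatever captioning step an agent uses to turn a screenshot into text.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Turn:
    """Hypothetical history entry: text plus an optional screenshot."""
    text: str
    image: Optional[bytes] = None


def compact_history(
    turns: List[Turn],
    describe: Callable[[bytes], str],
    keep_recent: int = 2,
) -> List[Turn]:
    """Replace all but the newest `keep_recent` screenshots with text summaries."""
    image_positions = [i for i, t in enumerate(turns) if t.image is not None]
    # Guard keep_recent == 0: a [-0:] slice would keep every image.
    keep = set(image_positions[-keep_recent:]) if keep_recent > 0 else set()

    compacted = []
    for i, turn in enumerate(turns):
        if turn.image is not None and i not in keep:
            # Drop the pixels, keep a semantic trace of what was on screen.
            summary = describe(turn.image)
            compacted.append(Turn(text=f"{turn.text}\n[Previously saw: {summary}]"))
        else:
            compacted.append(turn)
    return compacted


# Usage with a stub describer (a real agent would call a captioning model).
history = [Turn(text=f"step {n}", image=b"png-bytes") for n in range(4)]
history = compact_history(history, lambda img: "Settings page with toggle ON")
```

Running the compaction after every step keeps the original task instruction and tool schemas in context while the latest screenshots remain available for coordinate operations.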
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T05:41:41.550357+00:00— report_created — created