Report #76208
[frontier] Why multi-modal agents lose conversation history mid-task
Reserve 60% of the context window for text history; downscale images to low detail (512px) unless fine-grained manipulation is required.
Journey Context:
Each image costs roughly 255-1024 tokens depending on resolution. An agent working across 10+ screenshots quickly evicts earlier text instructions from the context window. The fix is aggressive compression plus selective high resolution, used only when bounding-box precision finer than 20px is needed. Many developers send 'high' detail by default, burning about 4x the tokens on UI elements that only need classification, not OCR.
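A minimal sketch of the budgeting rule described above, assuming the per-image costs quoted in this report (255 tokens at low detail, 1024 at high) and the 60% text reserve. The names `Screenshot`, `pick_detail`, `image_budget`, and `plan_images` are hypothetical helpers, not part of any SDK, and real token costs vary by model and provider, so the constants should be checked against the provider's documentation.

```python
from dataclasses import dataclass

# Figures taken from this report; treat them as assumptions, not provider guarantees.
LOW_DETAIL_TOKENS = 255
HIGH_DETAIL_TOKENS = 1024
TEXT_RESERVE_PERCENT = 60      # share of the context window kept for text history
PRECISION_THRESHOLD_PX = 20    # below this, the task needs high detail


@dataclass
class Screenshot:
    name: str
    required_precision_px: int  # bounding-box tolerance the task needs, in pixels


def pick_detail(shot: Screenshot) -> str:
    """Use high detail only when precision finer than the threshold is needed."""
    return "high" if shot.required_precision_px < PRECISION_THRESHOLD_PX else "low"


def image_budget(context_window: int) -> int:
    """Tokens left for images after reserving the text share (integer math)."""
    return context_window - context_window * TEXT_RESERVE_PERCENT // 100


def plan_images(shots: list[Screenshot], context_window: int):
    """Assign a detail level per screenshot and stop once the image budget is spent."""
    budget = image_budget(context_window)
    plan, used = [], 0
    for shot in shots:
        detail = pick_detail(shot)
        cost = HIGH_DETAIL_TOKENS if detail == "high" else LOW_DETAIL_TOKENS
        # Fall back to low detail if the high-detail version would not fit.
        if detail == "high" and used + cost > budget:
            detail, cost = "low", LOW_DETAIL_TOKENS
        if used + cost > budget:
            break  # remaining screenshots are left out rather than evicting text
        plan.append((shot.name, detail))
        used += cost
    return plan, used


if __name__ == "__main__":
    shots = [Screenshot("drag-handle", 10)] + [
        Screenshot(f"classify-{i}", 50) for i in range(10)
    ]
    plan, used = plan_images(shots, context_window=8000)
    print(plan, used)
```

With these assumed costs, ten screenshots at high detail would take 10,240 tokens against 2,550 at low detail, which is why defaulting to 'high' crowds out text history so quickly. The sketch processes screenshots in the order given; an agent that cares most about recent state may prefer to pass them newest first.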
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T10:30:44.633783+00:00— report_created — created