Report #92489
[frontier] Interleaved image-text history exceeds context window; naive truncation removes critical recent images
Apply modality-aware saliency: keep the current image at full detail, compress other recent images to low-detail thumbnails, replace older images with text descriptions, keep recent text verbatim, and summarize older text.
Journey Context:
Vision models consume roughly 85-1000 tokens per image depending on resolution, so 10 screenshots cost about 6k tokens. Standard message trimming drops the oldest messages first, which can remove goal screenshots while keeping obsolete conversation. Solution: a tiered storage architecture. Current viewport: high-res (1024px). Previous 3 views: thumbnails (256px). Older views: text descriptions generated by a vision model ('Previous view showed login form with fields...'). Text gets standard summarization. This preserves visual grounding for recent steps and semantic memory for older history.
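The tier assignment described above can be sketched as a small planning pass over the history. This is a minimal sketch, not a verified implementation: the `Message` and `Planned` types, the `plan_history` function, and the action names are all hypothetical, and `RECENT_TEXT_COUNT = 4` is an assumption because the report gives no cutoff for "recent" text. The sketch only decides what should happen to each message; the actual resizing, image description, and text summarization calls are left out.

```python
from dataclasses import dataclass
from typing import List, Optional

HIGH_RES_PX = 1024     # current viewport size, from the report
THUMBNAIL_PX = 256     # thumbnail size, from the report
THUMBNAIL_COUNT = 3    # "previous 3 views", from the report
RECENT_TEXT_COUNT = 4  # ASSUMPTION: the report does not say how much text counts as recent


@dataclass
class Message:
    kind: str     # "image" or "text"
    content: str  # image identifier, or the text itself


@dataclass
class Planned:
    kind: str
    action: str            # "high_res", "thumbnail", "describe", "verbatim", or "summarize"
    max_px: Optional[int]  # target size for images that stay images, else None
    content: str


def plan_history(history: List[Message]) -> List[Planned]:
    """Assign a tier to each message; newer messages keep more detail."""
    images_seen = 0
    texts_seen = 0
    planned: List[Planned] = []
    # Walk newest to oldest so recency can be counted per modality.
    for msg in reversed(history):
        px: Optional[int] = None
        if msg.kind == "image":
            if images_seen == 0:
                action, px = "high_res", HIGH_RES_PX
            elif images_seen <= THUMBNAIL_COUNT:
                action, px = "thumbnail", THUMBNAIL_PX
            else:
                action = "describe"  # to be replaced by a text description
            images_seen += 1
        else:
            action = "verbatim" if texts_seen < RECENT_TEXT_COUNT else "summarize"
            texts_seen += 1
        planned.append(Planned(msg.kind, action, px, msg.content))
    planned.reverse()  # restore chronological order
    return planned
```

Counting recency separately for images and text is the point of the design: a long run of chat messages does not push the latest screenshot out of the high-detail tier, which is the failure mode of oldest-first trimming.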
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:49:55.443681+00:00— report_created — created