Report #63653
[frontier] Agents that interleave screenshots and text in conversation history hit context limits rapidly, with images consuming 10-100x as many tokens as text
Implement a modal separation architecture: maintain parallel text-only and vision-enabled LLM instances. Use the text instance for long-context reasoning, and promote to the vision instance only for specific verification queries, passing it compressed context rather than the full history.
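A minimal sketch of the split described above, assuming each model is wrapped as a plain Python callable. The names ModalSeparationAgent, text_model, vision_model and max_summary_chars are hypothetical and not taken from any provider SDK, and the "compression" here is just a tail truncation standing in for a real summarizer.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical signatures: wrap your provider's real client calls to match.
TextModel = Callable[[str], str]            # prompt -> reply
VisionModel = Callable[[str, bytes], str]   # prompt, image bytes -> reply


@dataclass
class ModalSeparationAgent:
    text_model: TextModel
    vision_model: VisionModel
    max_summary_chars: int = 500  # assumed cap on context sent to vision
    history: List[str] = field(default_factory=list)  # text only, never images

    def note(self, event: str) -> None:
        """Record a text-only event in the long-running history."""
        self.history.append(event)

    def compressed_context(self) -> str:
        """Naive compression: keep the most recent text that fits the cap."""
        joined = "\n".join(self.history)
        return joined[-self.max_summary_chars:]

    def reason(self, question: str) -> str:
        """Long-context reasoning on the text instance."""
        prompt = "\n".join(self.history + [question])
        reply = self.text_model(prompt)
        self.note(f"Q: {question}\nA: {reply}")
        return reply

    def verify(self, query: str, screenshot: bytes) -> str:
        """One-shot vision call; only its text answer enters the history."""
        prompt = f"Context:\n{self.compressed_context()}\n\nVerify: {query}"
        answer = self.vision_model(prompt, screenshot)
        self.note(f"[vision check] {query} -> {answer}")
        return answer
```

The screenshot is passed to the vision callable and then discarded, so the history the text instance carries forward grows only by the short text answer.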
Journey Context:
Teams assume GPT-4o or Claude can handle long mixed histories. But 10 screenshots come to roughly 100K tokens, enough to push system instructions out of the context window. The pattern emerging is "text brain + eyes": a cheap text model maintains state and decides when to invoke expensive vision calls, rather than maintaining one fat conversation.
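A rough back-of-the-envelope check of that budget, as a sketch only. The per-screenshot cost is the figure implied by the report (10 screenshots for about 100K tokens); the text-turn size and the context window are assumptions chosen for illustration, so substitute the real numbers for your model.

```python
TOKENS_PER_SCREENSHOT = 10_000  # implied by the report: 10 screenshots ~ 100K
TOKENS_PER_TEXT_TURN = 300      # assumption, for illustration only
CONTEXT_WINDOW = 128_000        # assumption; check your model's actual limit


def turns_until_full(tokens_per_turn: int, window: int = CONTEXT_WINDOW) -> int:
    """Number of whole turns that fit before the window overflows."""
    return window // tokens_per_turn


# One screenshot plus one text message per turn, all kept in history.
interleaved = turns_until_full(TOKENS_PER_SCREENSHOT + TOKENS_PER_TEXT_TURN)
# Text-only history, with screenshots handled in separate one-shot calls.
text_only = turns_until_full(TOKENS_PER_TEXT_TURN)

print(interleaved, text_only)  # prints: 12 426
```

Under these assumed numbers the interleaved history fills the window in about a dozen turns, while the text-only history lasts hundreds.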
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T13:19:42.742145+00:00— report_created — created