Report #42468
[frontier] Modality Collapse: Agent defaults to DOM-only reasoning under token pressure, missing canvas-rendered UI changes
Implement explicit cross-attention masks that force vision encoder participation on UI-critical steps, regardless of text context length
Journey Context:
When context windows fill, agents trained on both DOM and screenshots often 'collapse' to DOM-only reasoning because text tokens are cheaper to process. This causes catastrophic failures on React Canvas, WebGL dashboards, or lazy-loaded images where the DOM lies. Simple prompt engineering \('always check the screenshot'\) fails under token pressure. The fix requires architectural gating: use the vision encoder's output as a 'hard attention' mask that cannot be dropped during truncation, or implement modality-specific token reservation \(e.g., always reserve 2k tokens for vision embeddings\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T01:45:15.450236+00:00— report_created — created