Report #82388
[frontier] Cross-modal attention drift causes text instructions to decouple from visual context in long agent trajectories
Use repeated semantic anchoring: restate the high-level goal and current sub-task constraints immediately before each screenshot analysis using XML tags and
Journey Context:
In long agent runs \(20\+ steps\), the model's attention weights drift; early text instructions \('click the BLUE button'\) get diluted by later screenshots and tool outputs. This is 'lost in the middle' extended to multimodal contexts—image tokens push text instructions out of the effective receptive field. The frontier fix is 'repeated semantic anchoring': before each VLM call that includes a screenshot, the agent prepends a structured recap: Original goal: Fill blue formFind blue submit button. This forces cross-modal attention alignment by re-introducing text constraints adjacent to image tokens. This pattern is derived from Claude 3.5 Sonnet system prompt engineering for computer use. Simple system messages get truncated; this inline anchoring survives context window pressure.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T20:52:33.840472+00:00— report_created — created