Report #81750
[frontier] Agent fails when switching between DOM-based and vision-based reasoning mid-task, causing spatial disorientation and repeated actions on wrong coordinates
Implement modality isolation with explicit state reconstruction: commit to one representation (DOM or screenshot) for the duration of a sub-task, and switch only at defined checkpoints, with explicit coordinate system re-calibration at each switch
Journey Context:
The common failure pattern is 'modality oscillation': the agent flips between screenshot analysis and DOM parsing within a single step and loses spatial coherence. DOM provides semantic structure but misses rendered state (CSS transforms, canvas, video); vision provides ground truth for what is rendered but loses element identity. The temptation is to mix them ('use DOM for structure, screenshot for verification'), but this causes coordinate system drift, because positions derived in one representation get applied in the other without conversion. The fix is strict modality monogamy within sub-tasks, with explicit handoff protocols that treat coordinate system changes like reference frame changes in physics.
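The handoff protocol described above could be sketched roughly as follows. This is a minimal illustration, not a tested workaround: `Modality`, `Frame`, `SubTask`, `ModalityError`, and both conversion helpers are hypothetical names introduced here. It also assumes that DOM positions are page coordinates in CSS pixels, that screenshots cover only the viewport, and that the two are related by the scroll offset and the device pixel ratio. Adjust the conversion if your setup differs (for example, full-page screenshots or CSS transforms on ancestors).

```python
from dataclasses import dataclass
from enum import Enum


class Modality(Enum):
    DOM = "dom"
    VISION = "vision"


@dataclass(frozen=True)
class Frame:
    """Hypothetical calibration data, re-captured at every checkpoint."""
    device_pixel_ratio: float  # screenshot pixels per CSS pixel (assumed uniform)
    scroll_x: float            # page scroll offset in CSS pixels
    scroll_y: float


class ModalityError(RuntimeError):
    """Raised when a step tries to use a modality the sub-task did not commit to."""


class SubTask:
    """A sub-task pinned to exactly one modality and one calibration frame."""

    def __init__(self, modality: Modality, frame: Frame) -> None:
        self.modality = modality
        self.frame = frame

    def require(self, modality: Modality) -> None:
        # Call before every observation or action; blocks mid-task oscillation.
        if modality is not self.modality:
            raise ModalityError(
                f"sub-task is committed to {self.modality.value}; "
                f"switch to {modality.value} only at a checkpoint"
            )

    def checkpoint(self, new_modality: Modality, fresh_frame: Frame) -> "SubTask":
        # The only sanctioned switch point: the caller must supply a freshly
        # measured frame, so stale coordinates are never carried across.
        return SubTask(new_modality, fresh_frame)


def page_to_screenshot(x: float, y: float, frame: Frame) -> tuple[float, float]:
    """Convert page coordinates (CSS px) to viewport screenshot pixels."""
    return (
        (x - frame.scroll_x) * frame.device_pixel_ratio,
        (y - frame.scroll_y) * frame.device_pixel_ratio,
    )


def screenshot_to_page(px: float, py: float, frame: Frame) -> tuple[float, float]:
    """Inverse of page_to_screenshot under the same assumptions."""
    return (
        px / frame.device_pixel_ratio + frame.scroll_x,
        py / frame.device_pixel_ratio + frame.scroll_y,
    )
```

In use, the agent would call `require(...)` before each step and convert any coordinates it needs to keep through `page_to_screenshot` or `screenshot_to_page` at the moment it calls `checkpoint(...)`. Coordinates that were not converted at the checkpoint should be discarded rather than reused.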
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T19:49:02.905285+00:00— report_created — created