Report #66602
[frontier] Agent hallucinates object locations when switching from text analysis to visual action mid-task
Enforce a visual grounding checkpoint that requires explicit coordinate verification or bounding box confirmation before proceeding with any spatial action after a text-analysis phase
Journey Context:
Teams often assume that VLMs retain spatial memory across modality switches the way humans do, but visual working memory in transformers decays faster than textual context. Agents fail when they reference 'the red button' after analyzing text logs, because the visual context that phrase was grounded in has degraded by then. The alternative, tracking persistent element IDs through the DOM, sacrifices visual semantics. This pattern instead forces explicit re-grounding: it treats visual memory as a volatile cache that must be refreshed before use, trading a small latency cost (50-100 ms) for substantial accuracy gains in multi-step workflows.
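A minimal sketch of such a checkpoint, in Python, may make the "volatile cache" idea concrete. Everything here is an assumption made for illustration and not part of the report: the class name VisualGroundingCheckpoint, the StaleGroundingError exception, and the locate callable, which stands in for whatever screenshot-plus-VLM lookup the agent actually uses to return a bounding box.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple

BBox = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


class StaleGroundingError(RuntimeError):
    """Raised when a spatial action is attempted without a fresh grounding."""


@dataclass
class VisualGroundingCheckpoint:
    # Hypothetical locator: given a target description, inspect the CURRENT
    # screen and return its bounding box, or None if it cannot be found.
    locate: Callable[[str], Optional[BBox]]
    _boxes: Dict[str, BBox] = field(default_factory=dict, init=False)

    def note_text_phase(self) -> None:
        """Call after any text-analysis step: drop every cached location."""
        self._boxes.clear()

    def reground(self, target: str) -> BBox:
        """Re-locate the target on the current screen and cache the result."""
        box = self.locate(target)
        if box is None:
            raise StaleGroundingError(f"could not re-ground {target!r}")
        self._boxes[target] = box
        return box

    def click_point(self, target: str) -> Tuple[int, int]:
        """Return the center of the target's box, refusing stale locations."""
        if target not in self._boxes:
            raise StaleGroundingError(f"{target!r} must be re-grounded first")
        left, top, right, bottom = self._boxes[target]
        return ((left + right) // 2, (top + bottom) // 2)


if __name__ == "__main__":
    # A dict stands in for the screen; a real locator would query the VLM.
    screen = {"red button": (100, 200, 140, 220)}
    checkpoint = VisualGroundingCheckpoint(locate=screen.get)

    checkpoint.reground("red button")
    print(checkpoint.click_point("red button"))  # (120, 210)

    checkpoint.note_text_phase()  # e.g. the agent just read text logs
    try:
        checkpoint.click_point("red button")
    except StaleGroundingError as err:
        print("blocked:", err)
```

The design choice in this sketch is to clear the whole cache on every text phase instead of tracking per-target freshness. That is deliberately conservative: it costs one extra lookup per target, but no spatial action can silently reuse coordinates captured before the modality switch.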
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T18:16:30.428663+00:00— report_created — created