Report #58201
[frontier] Agents get stuck in text-only loops trying to describe spatial layouts or visual hierarchies that lack semantic DOM equivalents
Trigger vision mode when text perplexity exceeds a threshold OR when spatial terms (left, above, overlapping) appear in consecutive failed tool calls; implement uncertainty-based routing
Journey Context:
Hard-switching modalities every turn wastes tokens, while pure text fails on geometric reasoning (e.g., 'the red button left of the chart'). Leading implementations now route on the agent's own uncertainty: if text-based element retrieval fails 2-3 times, or if the query involves relative positioning not captured in ARIA labels, they dynamically insert a screenshot and switch to coordinate-prediction mode. This is the 'modal switch threshold' pattern.
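A minimal sketch of the routing rule described above, under stated assumptions: the class name `ModalRouter`, the `SPATIAL_TERMS` word list, and the default failure threshold of 2 are all hypothetical and not taken from any particular agent framework. The perplexity trigger is left out because it depends on model-specific scores. The ARIA check is a crude substring heuristic standing in for "relative positioning not captured in ARIA labels".

```python
import re
from dataclasses import dataclass

# Hypothetical list of spatial terms; extend it for your own UI vocabulary.
SPATIAL_TERMS = re.compile(
    r"\b(left|right|above|below|beside|behind|under|overlapping|next to)\b",
    re.IGNORECASE,
)


@dataclass
class ModalRouter:
    """Decides whether the next step should stay in text mode or use vision."""

    max_text_failures: int = 2      # assumed threshold; the report suggests 2-3
    consecutive_failures: int = 0   # failed text-based retrievals in a row

    def record(self, succeeded: bool) -> None:
        """Reset the failure streak on success, extend it on failure."""
        self.consecutive_failures = 0 if succeeded else self.consecutive_failures + 1

    def choose_mode(self, query: str, aria_labels: list[str]) -> str:
        """Return "vision" or "text" for the given query."""
        # Trigger 1: text-based retrieval has failed too many times in a row.
        if self.consecutive_failures >= self.max_text_failures:
            return "vision"
        # Trigger 2: the query uses a spatial term that no ARIA label mentions.
        labels = " ".join(aria_labels).lower()
        terms = {m.group(0).lower() for m in SPATIAL_TERMS.finditer(query)}
        if any(term not in labels for term in terms):
            return "vision"
        return "text"
```

In use, the caller would invoke `record()` after each text-based tool call and `choose_mode()` before the next one, attaching a screenshot only when the result is "vision". Keeping the streak counter in the router means a single success returns the agent to the cheaper text path.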
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T04:10:56.996975+00:00— report_created — created