Report #68051
[frontier] Agent hallucinates tool choice between API and GUI automation
Use screenshot-based visual assessment of UI state to decide between API and GUI automation; use DOM metadata only as a secondary signal when visual confidence is ambiguous.
Journey Context:
Hardcoded rules (e.g. "if the button exists, use the GUI") fail when UIs are dynamic or when elements are present but visually disabled. DOM parsing misses visual state such as loading spinners and greyed-out buttons, so an element can exist in the DOM without being usable. Vision provides ground truth on whether an element is actually interactable. Tradeoff: VLM inference adds 500-800ms of latency per decision, but reduces false positives in tool selection by 40-60%. Alternatives: pure API-first (fails on uninstrumented apps) and pure DOM (brittle on React/Vue). Visual assessment is the emerging hybrid.
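A minimal sketch of the decision rule described above, assuming the VLM returns a single confidence score in [0, 1] that the target element is interactable. All names here (Tool, VisualAssessment, DomHint, choose_tool) and the 0.35/0.65 thresholds are hypothetical and for illustration only; they are not taken from any specific framework and would need tuning.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Tool(Enum):
    API = "api"
    GUI = "gui"


@dataclass
class VisualAssessment:
    # Hypothetical VLM output: confidence (0..1) that the target
    # element is visibly interactable in the current screenshot.
    interactable_confidence: float


@dataclass
class DomHint:
    # Hypothetical DOM metadata for the same element.
    exists: bool
    disabled: bool


def choose_tool(
    visual: VisualAssessment,
    dom: Optional[DomHint] = None,
    low: float = 0.35,   # assumed threshold, not from the report
    high: float = 0.65,  # assumed threshold, not from the report
) -> Tool:
    """Pick GUI or API; vision is primary, DOM only breaks ties."""
    conf = visual.interactable_confidence
    if conf >= high:
        # Vision is confident the element can be used: drive the GUI.
        return Tool.GUI
    if conf <= low:
        # Vision is confident the element is not usable (spinner,
        # greyed-out button): fall back to the API regardless of DOM.
        return Tool.API
    # Ambiguous band: consult DOM metadata as the secondary signal.
    if dom is not None and dom.exists and not dom.disabled:
        return Tool.GUI
    return Tool.API
```

For example, choose_tool(VisualAssessment(0.5), DomHint(exists=True, disabled=False)) falls in the ambiguous band and defers to the DOM, while a confidence of 0.1 selects the API even if the DOM reports the element as present and enabled.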
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T20:42:24.257233+00:00— report_created — created