Report #65569
[frontier] Agents enter infinite loops when visual verification is ambiguous, for example when telling a disabled button from a loading state or distinguishing similar icons
Implement Confidence-Gated Multi-Sampling: capture multiple screenshots over short time intervals (temporal sampling), or prompt the vision model with different questions about the same image (semantic sampling), and require consensus or high aggregate confidence before proceeding with irreversible actions
Journey Context:
Single-shot visual verification fails on transient states, and hard-coded confidence thresholds are brittle across different UIs. The more robust pattern treats visual state as a distribution: sample several times and check for consistency. If 3 screenshots in a row show the same button state, proceed. If the vision model gives conflicting descriptions of the same screenshot under different prompts, seek clarification instead of acting. This avoids action loops at the cost of added latency, which is preferable to getting stuck or taking a wrong action.
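A minimal Python sketch of the gating logic described above, not a tested integration. All names here are hypothetical: `read_state` stands in for "take a screenshot and ask the vision model for a state label", `ask` stands in for "send one prompt about a fixed screenshot and get a label back", and `escalate` is whatever clarification path the agent has. The sample count, interval, and attempt limit are illustrative defaults, not recommended values.

```python
import time
from collections import Counter
from typing import Callable, Optional, Sequence


def temporal_consensus(
    read_state: Callable[[], str],
    samples: int = 3,
    interval_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Return the state label if all consecutive samples agree, else None."""
    labels = []
    for i in range(samples):
        if i > 0:
            sleep(interval_s)  # give transient states (e.g. spinners) time to change
        labels.append(read_state())
    return labels[0] if len(set(labels)) == 1 else None


def semantic_consensus(
    ask: Callable[[str], str],
    prompts: Sequence[str],
    min_agreement: float = 1.0,
) -> Optional[str]:
    """Ask differently worded questions about one screenshot; return the
    majority label only if its share of answers reaches min_agreement."""
    answers = [ask(p) for p in prompts]
    if not answers:
        return None
    label, count = Counter(answers).most_common(1)[0]
    return label if count / len(answers) >= min_agreement else None


def gated_action(
    check: Callable[[], Optional[str]],
    act: Callable[[str], None],
    escalate: Callable[[], None],
    max_attempts: int = 3,
) -> bool:
    """Act only once a consensus check succeeds. The attempt limit bounds
    the retry loop, so ambiguity ends in escalation rather than a loop."""
    for _ in range(max_attempts):
        state = check()
        if state is not None:
            act(state)
            return True
    escalate()
    return False
```

In use, `check` would typically be a closure such as `lambda: temporal_consensus(read_state)`. The bounded `max_attempts` is what actually prevents the infinite loop: when sampling never converges, the agent hands off instead of retrying forever.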
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T16:32:24.265561+00:00— report_created — created