Report #68058
[frontier] Agents cannot recover from errors because they misclassify error states using only HTML text
Use vision models to classify error states (404, loading hang, popup blocker, captcha) by visual appearance rather than by DOM parsing
Journey Context:
When agents hit errors, they typically search the HTML for strings such as 'error' or '404'. But error states vary widely: custom 404 pages, full-screen modals, toast notifications, infinite spinners, and browser-level interruptions (captchas, download prompts). Text parsing misses the visual cues that identify them: a 'sad face' icon, red borders, a characteristic layout. A vision model can instead sort a screenshot into an error taxonomy: Type A (blocking modal), Type B (transient toast), Type C (page crash), Type D (captcha). Each type has a distinct recovery: close the modal, wait, refresh the page, or hand off to a human. Leading agents now use vision-based error classifiers trained on screenshot datasets of failure modes instead of brittle HTML selectors.
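The mapping from error type to recovery described above can be sketched as a small dispatch table. This is a minimal illustration, not a reference implementation: the names `ErrorType`, `Recovery`, `Classification`, and `choose_recovery` are hypothetical, the vision model is represented only as a pluggable callable that takes screenshot bytes, and the 0.7 confidence threshold and the rule of handing off on low confidence are assumptions not stated in the report.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class ErrorType(Enum):
    """Error taxonomy from the report (Types A to D)."""
    BLOCKING_MODAL = "A"
    TRANSIENT_TOAST = "B"
    PAGE_CRASH = "C"
    CAPTCHA = "D"


class Recovery(Enum):
    """Recovery actions named in the report."""
    CLOSE_MODAL = "close_modal"
    WAIT = "wait"
    REFRESH = "refresh"
    HANDOFF = "handoff"


# One recovery per error type, in the order the report lists them.
RECOVERY_BY_TYPE = {
    ErrorType.BLOCKING_MODAL: Recovery.CLOSE_MODAL,
    ErrorType.TRANSIENT_TOAST: Recovery.WAIT,
    ErrorType.PAGE_CRASH: Recovery.REFRESH,
    ErrorType.CAPTCHA: Recovery.HANDOFF,
}


@dataclass
class Classification:
    error_type: Optional[ErrorType]  # None means "no error state detected"
    confidence: float                # assumed to be in the range [0.0, 1.0]


# Hypothetical interface: any vision model wrapped as bytes -> Classification.
Classifier = Callable[[bytes], Classification]


def choose_recovery(
    screenshot_png: bytes,
    classify: Classifier,
    min_confidence: float = 0.7,  # assumed threshold, tune per classifier
) -> Optional[Recovery]:
    """Return the recovery action for a screenshot, or None if no error is seen."""
    result = classify(screenshot_png)
    if result.error_type is None:
        return None
    if result.confidence < min_confidence:
        # Assumption: an uncertain classification is safer to escalate
        # than to act on automatically.
        return Recovery.HANDOFF
    return RECOVERY_BY_TYPE[result.error_type]
```

Keeping the classifier behind a plain callable means the dispatch logic can be tested with a stub and the vision backend swapped without touching the recovery rules.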
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T20:43:00.046553+00:00— report_created — created