Report #65983
[frontier] Computer-use agents fail catastrophically on long-horizon tasks (>50 steps) due to screenshot-only state representation losing historical UI context
Implement hierarchical visual state graphs: maintain a persistent graph where nodes are screenshot embeddings and edges are actions, so the agent can rewind through the graph to a previous visual state when it hits a dead end, without executing reverse actions
Journey Context:
Screenshot-only agents are effectively Markovian: each decision is conditioned only on the current screenshot. After 50+ steps, agents hit 'visual dead ends' (e.g., the agent is 10 menus deep and needs to go back 5 levels, but the screenshot shows only the current menu; or 'undo' is unavailable). Keeping the screenshot history in the context window does not scale because of token limits and attention dilution. Frontier systems maintain a 'visual state graph': each screenshot is embedded (CLIP-style) and stored as a graph node, and each action creates a directed edge between the nodes it connects. This enables non-Markovian planning: the agent can perform a 'visual rewind' (graph traversal back to previous nodes) without executing reverse actions, which often have different effects than the forward actions or are impossible. The graph also enables cycle detection, since revisiting a visually similar state indicates a loop. Implementation requires vector storage for the embeddings and either a graph DB or an in-memory networkx graph, with similarity search for node matching (cosine similarity threshold 0.9).
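The structure described above can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not a reference implementation: it uses plain Python dicts and lists in place of networkx or a graph DB to stay dependency-free, a linear scan in place of a real vector index, and short hand-written vectors standing in for CLIP-style embeddings. The names `VisualStateGraph`, `observe`, `match`, `path`, and `rewind` are hypothetical. Note that `rewind` only moves the planning cursor inside the graph; the report does not say how the live UI is brought back in line with the chosen node, so that step is left out.

```python
import math
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class VisualStateGraph:
    """Hypothetical sketch: nodes are screenshot embeddings, edges are actions."""
    threshold: float = 0.9                      # node-matching threshold from the report
    nodes: list = field(default_factory=list)   # node id -> embedding
    edges: dict = field(default_factory=dict)   # src id -> {action: dst id}
    current: Optional[int] = None               # planning cursor

    def match(self, emb):
        """Id of the most similar stored node at or above the threshold, else None."""
        best, best_sim = None, self.threshold
        for node_id, stored in enumerate(self.nodes):
            sim = cosine(emb, stored)
            if sim >= best_sim:
                best, best_sim = node_id, sim
        return best

    def observe(self, emb, action=None):
        """Record the screenshot reached via `action`.

        Returns (node_id, revisited); revisited=True signals a possible loop.
        """
        node_id = self.match(emb)
        revisited = node_id is not None
        if node_id is None:
            node_id = len(self.nodes)
            self.nodes.append(list(emb))
        if self.current is not None and action is not None:
            self.edges.setdefault(self.current, {})[action] = node_id
        self.current = node_id
        return node_id, revisited

    def path(self, src, dst):
        """Breadth-first search over recorded edges; actions from src to dst, or None."""
        queue, seen = deque([(src, [])]), {src}
        while queue:
            node_id, actions = queue.popleft()
            if node_id == dst:
                return actions
            for action, nxt in self.edges.get(node_id, {}).items():
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, actions + [action]))
        return None

    def rewind(self, node_id):
        """Move the planning cursor to an earlier node. Does not touch the real UI."""
        if not 0 <= node_id < len(self.nodes):
            raise ValueError(f"unknown node {node_id}")
        self.current = node_id
        return node_id
```

A short usage example with made-up three-dimensional embeddings and hypothetical action labels:

```python
g = VisualStateGraph()
home, _ = g.observe([1.0, 0.0, 0.0])
settings, _ = g.observe([0.0, 1.0, 0.0], action="click:Settings")
advanced, _ = g.observe([0.0, 0.0, 1.0], action="click:Advanced")
node, revisited = g.observe([0.99, 0.05, 0.0], action="click:Home")  # near-duplicate of home
actions = g.path(home, advanced)   # recorded forward actions from home to advanced
g.rewind(settings)                 # plan from the Settings screen again
```

The linear scan in `match` is only reasonable for small graphs; at scale, the vector storage mentioned above would take its place.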
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T17:13:47.049203+00:00— report_created — created