Report #100046
[frontier] Should I use a native end-to-end CUA model or a modular agent framework?
Use native end-to-end models (UI-TARS, OpenCUA, OpenAI CUA) for speed and generalization across unknown GUIs; use modular planner-grounder frameworks (Browser-Use, Agent-S2, GTA-1) when you need interpretability, recovery from failures, or dynamic tool composition.
Journey Context:
Native CUAs unify perception, reasoning, and action in a single model. That makes them fast and light on scaffolding, but their output is an opaque coordinate click that is hard to debug. Modular systems separate planning from grounding and can invoke APIs, code, and verification modules, at the cost of added latency and engineering complexity. UI-TARS pioneered the native approach with strong OSWorld results; Agent-S2's compositional specialist design reports relative improvements of 18.9% and 32.7% over Claude Computer Use and UI-TARS on the OSWorld 15-step and 50-step evaluations, respectively. The frontier pattern is to pick the architecture by the failure mode you can least afford: if black-box behavior is acceptable in exchange for speed, go native; if you need auditability and recovery, go modular.
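To make the modular side of the trade-off concrete, here is a minimal sketch of a planner-grounder loop with a verification step and bounded retries. Everything in it is hypothetical and for illustration only: `Step`, `run_modular`, and the `toy_*` stubs are not taken from Browser-Use, Agent-S2, GTA-1, or any other framework, and a real system would replace the stubs with model calls and screenshot checks.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Coord = Tuple[int, int]


@dataclass
class Step:
    """One human-readable plan step (hypothetical type for this sketch)."""
    description: str


def run_modular(
    goal: str,
    plan: Callable[[str], List[Step]],
    ground: Callable[[Step, int], Optional[Coord]],
    verify: Callable[[Step, Coord], bool],
    max_retries: int = 2,
) -> Tuple[bool, List[str]]:
    """Run plan -> ground -> verify, retrying a step when verification fails.

    Returns (success, trace). The trace is the audit log that a native
    end-to-end model does not expose by default.
    """
    trace: List[str] = []
    for step in plan(goal):
        trace.append(f"plan: {step.description}")
        for attempt in range(max_retries + 1):
            xy = ground(step, attempt)
            if xy is None:
                trace.append(f"ground failed (attempt {attempt})")
                continue
            ok = verify(step, xy)
            trace.append(f"act: click{xy} verified={ok}")
            if ok:
                break
        else:
            # Every attempt failed: stop instead of acting on a bad state.
            trace.append(f"abort: {step.description}")
            return False, trace
    return True, trace


# --- Toy stubs standing in for a planner model, a grounding model, and a
# --- post-action check. All values are made up for the demo.
def toy_plan(goal: str) -> List[Step]:
    return [Step("open File menu"), Step("click Save")]


def toy_ground(step: Step, attempt: int) -> Optional[Coord]:
    # Pretend the first attempt at "click Save" lands on the wrong element.
    if step.description == "click Save" and attempt == 0:
        return (10, 10)
    return (100, 200)


def toy_verify(step: Step, xy: Coord) -> bool:
    return xy == (100, 200)


if __name__ == "__main__":
    success, trace = run_modular("save the file", toy_plan, toy_ground, toy_verify)
    print(success)
    for line in trace:
        print(line)
```

The extra `verify` call and the retry loop are where the added latency comes from. A native model would collapse `plan`, `ground`, and the action into a single forward pass, with no intermediate trace to inspect.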
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:30:07.888662+00:00— report_created — created