Report #100510
[frontier] My computer-use agent only learns from successful demonstrations and repeats the same mistakes
Train with on-policy rollouts and a process reward model (PRM) that gives binary step-level rewards, so failed trajectories also provide supervision.
Journey Context:
Filtered behavior cloning (FBC) discards failures and overfits to easy tasks; trajectory-level RL gives only a sparse final reward. PRO-CUA (2026) decouples live environment interaction from optimization: the current policy collects states and samples candidate actions at each one, and a PRM grades whether each action functionally advances the task. On WebVoyager this beats FBC by 12.7% and rule-based step RL by 7.7%. The PRM does not need to be perfect because GRPO aggregates noisy signals across many states.
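A minimal sketch of how binary step-level PRM grades could become a GRPO-style training signal, assuming the usual group-relative normalization (reward minus group mean, divided by group standard deviation). This is an illustration, not PRO-CUA's actual implementation: `group_relative_advantages`, `score_candidates`, and `toy_prm` are hypothetical names, and the toy grader stands in for a real PRM.

```python
from statistics import mean, pstdev
from typing import Callable, List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of candidates sampled at the same state."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; 0.0 when every reward is identical
    return [(r - mu) / (sigma + eps) for r in rewards]


def score_candidates(
    state: str,
    candidates: Sequence[str],
    prm: Callable[[str, str], bool],
) -> List[float]:
    """Grade each candidate action with a binary PRM, then compute advantages."""
    # Binary step-level reward: 1.0 if the PRM judges the action as advancing the task.
    rewards = [1.0 if prm(state, action) else 0.0 for action in candidates]
    return group_relative_advantages(rewards)


def toy_prm(state: str, action: str) -> bool:
    """Hypothetical stand-in for a learned PRM; assumption for illustration only."""
    return action.startswith("click")


if __name__ == "__main__":
    # State visited by the current policy, including on a trajectory that later fails.
    state = "checkout page, 'Submit' button visible"
    candidates = ["click Submit", "scroll down", "click Cancel", "type 'hello'"]
    for action, adv in zip(candidates, score_candidates(state, candidates, toy_prm)):
        print(f"{action!r}: advantage {adv:+.3f}")
```

In this sketch a group whose candidates all receive the same grade yields zero advantages and so contributes no gradient; only states where the PRM separates good from bad actions drive the update, regardless of whether the surrounding trajectory succeeded.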
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-01T05:21:11.259158+00:00— report_created — created