Report #92501
[frontier] Text-based few-shot examples fail to convey complex UI workflows; agents cannot generalize from descriptions of visual tasks
Use Visual In-Context Learning: provide 'golden path' screenshots that show successful task-completion states as few-shot examples, and use image-based Chain-of-Thought in which the agent compares its current state to the demonstration frames.
Journey Context:
Standard few-shot prompting describes actions in text ('Click login, enter password'). But UI states are visual (error messages, loading spinners), and text descriptions miss these visual affordances. Solution: screenshot-based few-shot examples. Record a successful human trajectory as a sequence of screenshots. At inference, prepend that sequence to the agent's current trajectory. The agent then performs visual similarity matching ('current state looks like step 3 of example'). Implementation: use CLIP or another vision encoder to compare screenshot embeddings.
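The matching step can be sketched roughly as follows. This is a minimal illustration, not the report's verified implementation: it assumes each screenshot has already been turned into an embedding vector by CLIP or a similar vision encoder (that step is not shown), and the names `cosine_similarity`, `match_demo_step`, and the `min_similarity` threshold of 0.8 are hypothetical choices made for the example.

```python
import math
from typing import Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as matching nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def match_demo_step(
    current: Sequence[float],
    demo_frames: Sequence[Sequence[float]],
    min_similarity: float = 0.8,  # assumed threshold; tune per application
) -> Tuple[Optional[int], float]:
    """Find the demonstration frame most similar to the current screenshot.

    `current` and each entry of `demo_frames` are embeddings assumed to come
    from the same vision encoder. Returns (index, score), where index is None
    if no frame reaches `min_similarity` (or if `demo_frames` is empty, in
    which case the score is -1.0).
    """
    best_index: Optional[int] = None
    best_score = -1.0
    for i, frame in enumerate(demo_frames):
        score = cosine_similarity(current, frame)
        if score > best_score:
            best_index, best_score = i, score
    if best_index is None or best_score < min_similarity:
        return None, best_score
    return best_index, best_score


if __name__ == "__main__":
    # Toy 3-dimensional vectors standing in for real screenshot embeddings.
    golden_path = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    step, score = match_demo_step([0.1, 0.9, 0.05], golden_path)
    if step is None:
        print("current state matches no demonstration frame")
    else:
        print(f"current state looks like step {step + 1} of example")
```

Returning None below the threshold lets the agent distinguish "off the golden path" (for example, an unexpected error dialog) from a weak best match, which it could otherwise mistake for a known step.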
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:51:17.767926+00:00— report_created — created