Report #85454
[frontier] Screenshot agents fail when UI themes change because they anchor decisions on visual appearance (color, position) rather than semantic element identity
Implement Set-of-Mark (SoM) prompting: overlay numbered markers on interactive elements in the screenshot and force the agent to reference elements by ID (e.g., 'click button 5') rather than description (e.g., 'the blue submit button')
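A minimal sketch of the ID-only contract described above, under stated assumptions: element bounding boxes are already available from some detector or accessibility source, and the names Element, assign_marks, build_legend, and parse_action are hypothetical, not from any SoM library. Drawing the numbered overlay onto the screenshot is left out; only the numbering and the reply validation are shown.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """A detected interactive element; box is (left, top, right, bottom) in pixels."""
    role: str
    box: tuple[int, int, int, int]


def assign_marks(elements: list[Element]) -> dict[int, Element]:
    """Number elements in reading order: top to bottom, then left to right."""
    ordered = sorted(elements, key=lambda e: (e.box[1], e.box[0]))
    return {i: e for i, e in enumerate(ordered, start=1)}


def build_legend(marks: dict[int, Element]) -> str:
    """Text sent with the marked screenshot; lists IDs and roles, no colors."""
    return "\n".join(f"[{i}] {e.role}" for i, e in marks.items())


# Accepts 'click 5' or 'click button 5'; anything without a trailing ID fails.
ACTION_RE = re.compile(r"^click(?:\s+\w+)?\s+(\d+)$")


def parse_action(reply: str, marks: dict[int, Element]) -> int:
    """Return the referenced mark ID, or raise if the reply is descriptive."""
    match = ACTION_RE.match(reply.strip().lower())
    if match is None:
        raise ValueError(f"expected 'click <id>', got: {reply!r}")
    mark_id = int(match.group(1))
    if mark_id not in marks:
        raise ValueError(f"unknown mark id: {mark_id}")
    return mark_id


marks = assign_marks([
    Element("button", (200, 300, 280, 330)),
    Element("textbox", (50, 100, 250, 130)),
    Element("link", (300, 100, 360, 120)),
])
legend = build_legend(marks)
```

Rejecting a reply such as 'click the blue submit button' outright, rather than trying to interpret it, is one way to enforce the "by ID, not by description" rule; a retry prompt could follow the error.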
Journey Context:
DOM-based agents break when sites obfuscate IDs or render to a canvas. Pure vision agents break when themes change, responsive layouts shift, or high-DPI scaling alters pixel positions. SoM addresses both failure modes by grounding visual reasoning in explicit markers: the model has to carry a symbolic reference to each element, so its reasoning path does not depend on visual styling. The alternative, relying on DOM selectors, fails on modern React/Svelte apps where element paths are randomized or otherwise unstable.
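To illustrate the high-DPI point, here is a small sketch assuming marks are re-derived for every screenshot and the click target is computed from the current frame's box. The box values and the helper names center, contains, and rescale are made up for illustration.

```python
Box = tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def center(box: Box) -> tuple[int, int]:
    """Pixel to click for a marked element: the middle of its box."""
    left, top, right, bottom = box
    return (left + right) // 2, (top + bottom) // 2


def contains(box: Box, point: tuple[int, int]) -> bool:
    """True if the point falls inside the box."""
    left, top, right, bottom = box
    x, y = point
    return left <= x < right and top <= y < bottom


def rescale(marks: dict[int, Box], factor: int) -> dict[int, Box]:
    """Simulate the same layout captured at a higher device pixel ratio."""
    return {i: tuple(v * factor for v in box) for i, box in marks.items()}


frame_1x = {1: (50, 100, 250, 130), 2: (200, 300, 280, 330)}
frame_2x = rescale(frame_1x, 2)

chosen_mark = 2  # the agent's answer is an ID, not a coordinate

stale_pixel = center(frame_1x[chosen_mark])  # coordinate remembered from 1x
fresh_pixel = center(frame_2x[chosen_mark])  # resolved against the 2x frame

hits_with_id = contains(frame_2x[chosen_mark], fresh_pixel)      # True
hits_with_pixel = contains(frame_2x[chosen_mark], stale_pixel)   # False
```

The same mark ID lands on the element in both frames, while the remembered 1x coordinate misses it at 2x. This only holds if the marker assignment itself is stable or is re-read by the agent on each step, which is an assumption of this sketch.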
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T02:01:15.702402+00:00— report_created — created