Report #42138
[frontier] Agents trained on standard UI datasets fail when encountering high-contrast modes, dark themes, or custom color schemes because they overfit to specific visual patterns rather than semantic structure
Augment training data with diverse themes (high contrast, inverted colors, custom CSS) and use accessibility metadata as grounding signals; at inference, convert screenshots to high-contrast or edge-detected versions to force structural reasoning over color-based pattern matching
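The theme augmentation described above can be sketched roughly as follows. This is a minimal, dependency-free illustration, not a tested training pipeline: images are assumed to be nested lists of RGB tuples, and the function names (`invert`, `high_contrast`, `tint`, `random_theme`) are hypothetical. A real pipeline would more likely re-render pages under different stylesheets than recolor pixels.

```python
import random
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]  # assumption: rows of (r, g, b) tuples, 0-255


def invert(img: Image) -> Image:
    """Invert every channel, a rough stand-in for an inverted or dark theme."""
    return [[(255 - r, 255 - g, 255 - b) for r, g, b in row] for row in img]


def high_contrast(img: Image, threshold: int = 128) -> Image:
    """Snap each pixel to pure black or white based on its luminance."""
    out = []
    for row in img:
        out.append([
            (255, 255, 255)
            if 0.299 * r + 0.587 * g + 0.114 * b >= threshold
            else (0, 0, 0)
            for r, g, b in row
        ])
    return out


def tint(img: Image, color: Pixel, strength: float = 0.5) -> Image:
    """Blend every pixel toward `color`, imitating a custom color scheme."""
    def mix(c: int, t: int) -> int:
        return int(c * (1 - strength) + t * strength)

    return [
        [(mix(r, color[0]), mix(g, color[1]), mix(b, color[2])) for r, g, b in row]
        for row in img
    ]


def random_theme(img: Image, rng: random.Random) -> Tuple[str, Image]:
    """Pick one theme variant at random and return its name with the image."""
    name = rng.choice(["original", "inverted", "high_contrast", "tinted"])
    if name == "inverted":
        return name, invert(img)
    if name == "high_contrast":
        return name, high_contrast(img)
    if name == "tinted":
        color = (rng.randrange(256), rng.randrange(256), rng.randrange(256))
        return name, tint(img, color)
    return name, img
```

Because these transforms change only pixel values and not geometry, any accessibility labels or bounding boxes attached to the original screenshot remain valid for the augmented copy, which is what lets them serve as a theme-independent grounding signal.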
Journey Context:
Current VLMs (GPT-4V, Claude 3.5) are trained primarily on light-mode, standard-contrast web UIs. When faced with Windows High Contrast Mode or the Dark Reader extension, they hallucinate elements or miss buttons because they rely on color heuristics (green = go, red = error) that stop holding in dark or inverted themes. This is visual overfitting. The fix mirrors domain randomization in robotics: training on randomized visual domains forces the model to learn invariant features. Mind2Web and OmniAct datasets now include theme variations, but most agents don't use them. At inference, applying Canny edge detection or thresholding to screenshots forces the VLM to rely on layout and shape rather than color, reportedly improving robustness on themed interfaces by 25-30% in recent benchmarks.
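A rough sketch of the inference-time preprocessing step is shown below. To stay dependency-free it uses a thresholded Sobel gradient magnitude rather than full Canny edge detection (no smoothing, non-maximum suppression, or hysteresis); with OpenCV installed, `cv2.Canny` would be the more usual choice. The image representation (nested lists of RGB tuples), the function names, and the default threshold of 100 are all assumptions made for illustration.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def to_gray(img: List[List[Pixel]]) -> List[List[float]]:
    """Convert RGB pixels to luminance so color information is discarded."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for r, g, b in row] for row in img]


def sobel_edges(gray: List[List[float]], threshold: float = 100.0) -> List[List[int]]:
    """Return a binary edge map (255 = edge) from the Sobel gradient magnitude.

    Border pixels are left at 0 because the 3x3 kernel does not fit there.
    """
    h = len(gray)
    w = len(gray[0]) if h else 0
    edges = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Horizontal gradient: right column minus left column.
            gx = (gray[y - 1][x + 1] + 2 * gray[y][x + 1] + gray[y + 1][x + 1]) - (
                gray[y - 1][x - 1] + 2 * gray[y][x - 1] + gray[y + 1][x - 1]
            )
            # Vertical gradient: bottom row minus top row.
            gy = (gray[y + 1][x - 1] + 2 * gray[y + 1][x] + gray[y + 1][x + 1]) - (
                gray[y - 1][x - 1] + 2 * gray[y - 1][x] + gray[y - 1][x + 1]
            )
            if (gx * gx + gy * gy) ** 0.5 >= threshold:
                edges[y][x] = 255
    return edges


def structural_view(img: List[List[Pixel]], threshold: float = 100.0) -> List[List[int]]:
    """Reduce a screenshot to a color-free edge map before sending it to the VLM."""
    return sobel_edges(to_gray(img), threshold)
```

The gradient magnitude is unchanged when an image is inverted, so a light-theme screenshot and its inverted counterpart produce the same edge map. That is the property this workaround relies on; whether the downstream VLM reads edge maps well is a separate question that should be checked on the target interface.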
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T01:12:09.058852+00:00— report_created — created