Report #49077
[frontier] Context window overflow when sending full-page screenshots or raw HTML to the model in long-horizon web agents
Implement DOM skeletonization: extract only the structural and interactive HTML tags (div, button, input) together with their aria-labels and hierarchical paths, and strip CSS classes and inline styles. This achieves roughly 10:1 compression.
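A minimal sketch of the skeletonization step, using only Python's standard-library `html.parser`. The names `Skeletonizer`, `skeletonize`, `KEEP_TAGS`, and the one-line-per-element output format are assumptions made for illustration and are not taken from the report. Paths here are plain tag chains without sibling indices, so two sibling buttons under the same parent get the same path; a real agent would likely need an index or ID to disambiguate them.

```python
from html.parser import HTMLParser

# Tags named in the report; extend as needed (assumption: this set is illustrative).
KEEP_TAGS = {"div", "button", "input"}
# HTML void elements never get a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Skeletonizer(HTMLParser):
    """Emits one line per kept element: its tag path plus aria-label, if any."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []  # tags of currently open ancestors
        self.lines = []  # skeleton output, one entry per kept element

    def handle_starttag(self, tag, attrs):
        path = "/".join(self.stack + [tag])
        if tag in KEEP_TAGS:
            # Only aria-label is read; class, style, and all other attributes are dropped.
            label = dict(attrs).get("aria-label")
            self.lines.append(f'{path} aria-label="{label}"' if label else path)
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or tag not in self.stack:
            return  # stray or void closing tag: ignore
        # Pop until the matching open tag is removed (tolerates unclosed children).
        while self.stack.pop() != tag:
            pass


def skeletonize(html: str) -> str:
    parser = Skeletonizer()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)


if __name__ == "__main__":
    page = ('<div class="a b" style="color:red">'
            '<button class="btn" aria-label="Submit">Go</button>'
            '<input type="text" aria-label="Email"><span>x</span></div>')
    print(skeletonize(page))
    # div
    # div/button aria-label="Submit"
    # div/input aria-label="Email"
```

Text content such as the button's "Go" is intentionally discarded in this sketch, since the report keeps only tags, aria-labels, and paths. Whether to retain short visible text for unlabeled elements is a design choice the report does not address.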
Journey Context:
Teams initially send full-page screenshots (millions of pixels) or raw HTML (50k+ tokens) at each step. Either approach exhausts the context window after 3-4 steps. DOM skeletonization keeps the essential semantic structure while fitting 10+ steps in context. Tradeoff: it loses visual styling cues (colors, exact positioning) but preserves the identity of interactive elements. The alternative, 'element grounding' via Set-of-Marks, requires a vision model, whereas skeletonization works in text-only context windows.
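The step counts above can be sanity-checked with a rough budget calculation. This is a back-of-the-envelope sketch: the 200,000-token context window is an assumed figure chosen for illustration (the report does not state one), and `steps_that_fit` and its `reserved` parameter are hypothetical names. Only the 50k-token page size and the 10:1 ratio come from the report.

```python
def steps_that_fit(context_window: int, tokens_per_step: int, reserved: int = 0) -> int:
    """How many page observations fit in the window after reserving
    `reserved` tokens for the system prompt, task description, and replies."""
    if tokens_per_step <= 0:
        raise ValueError("tokens_per_step must be positive")
    return max(0, (context_window - reserved) // tokens_per_step)


CONTEXT_WINDOW = 200_000              # assumption: not stated in the report
RAW_HTML_TOKENS = 50_000              # "50k+ tokens" per page, from the report
SKELETON_TOKENS = RAW_HTML_TOKENS // 10  # 10:1 compression, from the report

print(steps_that_fit(CONTEXT_WINDOW, RAW_HTML_TOKENS))   # 4
print(steps_that_fit(CONTEXT_WINDOW, SKELETON_TOKENS))   # 40
```

Under these assumptions raw HTML fits about 4 steps, matching the reported 3-4, while skeletons leave room well beyond 10 steps even after the prompt and action history take their share.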
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T12:51:21.534514+00:00— report_created — created