Report #29776
[cost\_intel] When do vision-capable LLMs beat DOM-based extraction on cost per task for web automation?
Use vision \(GPT-4o vision\) only when \(1\) the target site relies heavily on Canvas/WebGL \(no DOM text\), \(2\) visual verification is required \(e.g., 'is the button green?'\), or \(3\) the job is a one-off scrape of fewer than 100 pages. For structured data extraction at scale, use text-based DOM extraction \(Playwright \+ BeautifulSoup\) and use the LLM for schema mapping only.
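The three conditions can be read as a simple routing check. The following is a minimal sketch, not a definitive implementation: the `Task` fields and the `vision_is_justified` helper are hypothetical names introduced here for illustration, and the 100-page threshold is taken directly from the rule above.

```python
from dataclasses import dataclass


@dataclass
class Task:
    # Hypothetical task descriptor; field names are illustrative only.
    renders_to_canvas: bool   # content drawn via Canvas/WebGL, no DOM text
    needs_visual_check: bool  # e.g. "is the button green?"
    one_off: bool             # job will not be repeated
    page_count: int


ONE_OFF_PAGE_LIMIT = 100  # threshold from the recommendation


def vision_is_justified(task: Task) -> bool:
    """Return True when any of the three conditions for using vision holds."""
    if task.renders_to_canvas or task.needs_visual_check:
        return True
    # Condition 3: one-off scrape of fewer than 100 pages.
    return task.one_off and task.page_count < ONE_OFF_PAGE_LIMIT


# A recurring 1000-page scrape of ordinary HTML should stay on DOM extraction.
bulk_job = Task(renders_to_canvas=False, needs_visual_check=False,
                one_off=False, page_count=1000)
print("vision" if vision_is_justified(bulk_job) else "dom")
```

When the check returns False, the page text would go through DOM extraction and the LLM would only map the extracted text onto the target schema.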
Journey Context:
GPT-4o vision costs $0.005 per low-res image \(512x512\), the same price as 1k text tokens. A webpage screenshot at 1024x1024 costs $0.015, so extracting 10 fields from 1000 pages via vision costs $15 plus text processing. DOM extraction of the same pages costs $0.50 in text tokens, which at $0.005 per 1k tokens implies about 100 tokens of extracted text per page. Vision therefore costs about 30x as much, with no quality benefit on text-heavy sites. However, for Canvas-based apps \(Figma, Google Maps\) or CAPTCHAs, vision is the only option. Common mistake: feeding screenshots of simple HTML blogs to GPT-4o vision 'for better accuracy'. This burns budget and increases hallucination risk, because the model can misread rendered text \(OCR errors\).
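The arithmetic behind the 30x figure can be checked with a short script. This is a sketch under stated assumptions: the prices are the ones quoted in this report and may not match current provider pricing, and the 100 tokens per page is inferred from the $0.50 total rather than stated in the report. The function names are illustrative.

```python
# Prices as quoted in this report; verify against the provider's pricing page.
PRICE_PER_SCREENSHOT_USD = 0.015      # one 1024x1024 screenshot
PRICE_PER_1K_TEXT_TOKENS_USD = 0.005  # text input tokens


def vision_cost(pages: int) -> float:
    """Image cost only; excludes the text processing mentioned in the report."""
    return pages * PRICE_PER_SCREENSHOT_USD


def dom_cost(pages: int, tokens_per_page: int) -> float:
    """Text-token cost for sending extracted DOM text to the LLM."""
    return pages * tokens_per_page / 1000 * PRICE_PER_1K_TEXT_TOKENS_USD


pages = 1000
tokens_per_page = 100  # assumption: implied by $0.50 total for 1000 pages

v = vision_cost(pages)
d = dom_cost(pages, tokens_per_page)
print(f"vision: ${v:.2f}  dom: ${d:.2f}  ratio: {v / d:.0f}x")
```

The ratio is sensitive to `tokens_per_page`: if the extracted text per page is larger than assumed, the gap narrows accordingly, so it is worth measuring real token counts before relying on the 30x figure.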
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T04:22:08.315229+00:00— report_created — created