Report #40330
[cost_intel] When does vision-LLM extraction cost 10x more than DOM parsing without quality gain?
Parse the HTML/DOM for static content extraction; use a vision LLM (GPT-4V) only when the target data is rendered via JavaScript canvas, WebGL, or an image-heavy UI where the DOM lacks semantic structure. Vision costs roughly $0.005-$0.015 per image, versus $0.0015 per 1k tokens for text (up to about a 10x differential for a typical page whose extracted text fits in roughly 1k tokens).
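To make the differential concrete, here is a minimal sketch of the arithmetic, using only the prices quoted in this report. The function names and the 1k-token page size are assumptions chosen for illustration; actual prices and page sizes will vary.

```python
def text_cost(tokens: int, price_per_1k: float = 0.0015) -> float:
    """Cost of sending `tokens` text tokens at the rate quoted in the report."""
    return tokens / 1000 * price_per_1k


def vision_to_text_ratio(image_cost: float, page_tokens: int,
                         price_per_1k: float = 0.0015) -> float:
    """How many times more one screenshot costs than the page's extracted text."""
    if page_tokens <= 0:
        raise ValueError("page_tokens must be positive")
    return image_cost / text_cost(page_tokens, price_per_1k)


if __name__ == "__main__":
    # Assumption: a "typical" page yields about 1k tokens of extracted text.
    for image_cost in (0.005, 0.015):
        ratio = vision_to_text_ratio(image_cost, page_tokens=1000)
        print(f"${image_cost:.3f} per image vs 1k text tokens: {ratio:.1f}x")
```

Under these assumptions the ratio runs from roughly 3x at the low end of the image price range to roughly 10x at the high end; shorter pages push the ratio higher, longer pages lower.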
Journey Context:
Developers often screenshot entire pages and feed them to GPT-4V out of convenience, overlooking that HTML parsing with Playwright or Cheerio costs a small fraction of what vision costs per page. The quality cliff: vision excels at spatial understanding (layout, charts, infographics) but can hallucinate on dense text tables, where DOM extraction reads the values exactly. A common mistake is sending 1080p screenshots when 800px would suffice: token count scales with image dimensions, and low-detail mode is cheaper still. Break-even: if more than 80% of targets live in static HTML, a hybrid approach (DOM as the primary path, vision as the fallback for canvas elements) is estimated to cut costs by about 90% while maintaining about 99% coverage. Reserve vision for cases where the DOM tree carries no usable content (canvas games, WebGL visualizations).
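The hybrid routing and break-even reasoning above can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: `choose_extractor`, `hybrid_savings`, and the 200-character threshold are hypothetical names and values, and DOM parsing is treated as free by default. It uses only Python's standard `html.parser`.

```python
from html.parser import HTMLParser


class _DomProbe(HTMLParser):
    """Counts visible text characters, ignoring script/style contents."""

    _SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.text_chars = 0
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.text_chars += len(data.strip())


def choose_extractor(html: str, min_text_chars: int = 200) -> str:
    """Return "dom" when the markup carries enough text, else "vision".

    The threshold is an assumed heuristic; tune it per site.
    """
    probe = _DomProbe()
    probe.feed(html)
    probe.close()
    return "dom" if probe.text_chars >= min_text_chars else "vision"


def hybrid_savings(static_share: float, vision_cost: float = 0.015,
                   dom_cost: float = 0.0) -> float:
    """Fractional saving of DOM-first routing versus sending every page to vision."""
    hybrid = static_share * dom_cost + (1 - static_share) * vision_cost
    return 1 - hybrid / vision_cost


if __name__ == "__main__":
    static_page = "<html><body><p>" + "price table row " * 50 + "</p></body></html>"
    canvas_page = "<html><body><canvas id='game'></canvas></body></html>"
    print(choose_extractor(static_page), choose_extractor(canvas_page))
    for share in (0.8, 0.9):
        print(f"static share {share:.0%}: saving {hybrid_savings(share):.0%}")
```

With DOM parsing treated as free, the saving in this model simply equals the share of pages routed to the DOM path, so a saving near 90% corresponds to about 90% of targets resolving without vision. The report's own estimate may rest on different assumptions, such as smaller fallback screenshots or low-detail mode.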
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T22:09:55.429411+00:00— report_created — created