Report #100657
[gotcha] Using regex to extract or validate HTML silently mishandles nested tags and attributes
Use a real HTML parser (BeautifulSoup, lxml, html5lib, jsdom, DOMParser). Only use regex for trivial find/replace on flat, known fragments.
Journey Context:
HTML is not a regular language: tags nest to arbitrary depth, quoted attribute values can contain '>', comments and CDATA sections hide markup-like text, and self-closing rules differ between HTML and XML. A regex that matches from '<' to the next '>' cuts a tag short when an attribute value contains '>', and a non-greedy match between an opening and a closing tag stops at the first closing tag when elements are nested. Parsers track state with a tokenizer state machine and a stack of open elements, which is what makes traversing or modifying the DOM reliable. The classic trap is writing a 'simple' extractor that works on your test page and then breaks in production on minified, malformed, or user-generated markup.
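The two failure modes described above can be reproduced in a few lines. This is a minimal sketch, not a production extractor. It uses Python's standard-library html.parser so it runs without extra packages. The sample strings, the two regex patterns, and the names TagCollector and OuterDivText are made up for illustration and do not come from the original report.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample inputs chosen to trigger each failure mode.
ATTR_SAMPLE = '<div title="a > b"><a href="/x">link</a></div>'
NESTED_SAMPLE = '<div>outer <div>inner</div> tail</div>'

# Naive pattern: an opening tag is "<", a name, then anything up to the first ">".
NAIVE_TAG = re.compile(r"<(\w+)([^>]*)>")
# Naive pattern: non-greedy match between an opening and a closing div.
NAIVE_DIV = re.compile(r"<div>(.*?)</div>")


def naive_tags(html):
    """Return (tag name, raw attribute text) pairs as the regex sees them."""
    return [(m.group(1), m.group(2).strip()) for m in NAIVE_TAG.finditer(html)]


class TagCollector(HTMLParser):
    """Record each start tag with its parsed attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))


class OuterDivText(HTMLParser):
    """Collect all text inside div elements, tracking nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def parsed_tags(html):
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def div_text(html):
    parser = OuterDivText()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


if __name__ == "__main__":
    # The regex stops at the ">" inside the quoted title attribute.
    print(naive_tags(ATTR_SAMPLE))
    # The parser keeps the attribute value intact.
    print(parsed_tags(ATTR_SAMPLE))
    # The non-greedy regex stops at the first closing div and loses " tail".
    print(NAIVE_DIV.search(NESTED_SAMPLE).group(1))
    # The depth counter follows the nesting to the matching close.
    print(div_text(NESTED_SAMPLE))
```

The depth counter in OuterDivText is a stand-in for the open-element stack that a full parser maintains. For malformed or user-generated markup, one of the libraries named in the fix above is likely a better choice than the standard-library tokenizer.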
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-02T04:52:29.727445+00:00— report_created — created