Report #169
[gotcha] Regex cannot reliably parse nested or malformed HTML
Use a real HTML parser instead: BeautifulSoup, lxml, or html5lib in Python; cheerio or jsdom in JavaScript. For extraction, run XPath or CSS selectors over the parsed DOM rather than regex over raw markup.
Journey Context:
HTML is not a regular language: tags can nest to arbitrary depth, and matching them takes at least a context-free grammar. A regex cannot track balanced tags, and it has no notion of auto-closing (implied end tags), comments, CDATA sections, raw-text elements such as script and style, or attribute-value escaping. The famous Stack Overflow answer explains why even 'enhanced' regex engines fail. Many one-off scrapers break when confronted with minified HTML, unquoted attributes, or nested tags. A parser produces a normalized tree, so element and attribute boundaries are resolved by the parser rather than guessed from raw text, which also keeps crafted markup from being misread across attribute/value boundaries.
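A minimal sketch of the failure mode described above, assuming a small made-up snippet (`HTML`) and two hypothetical helpers (`naive_extract`, `extract_div_text`). It uses Python's standard-library `html.parser` only so that it runs without third-party installs; BeautifulSoup or lxml with a CSS selector would likely be shorter in practice.

```python
import re
from html.parser import HTMLParser

# Made-up input: unquoted attribute, a comment containing "</div>", and a nested div.
HTML = '<div class=outer>a<!-- </div> --><div class="inner">b</div>c</div>'


def naive_extract(html):
    """Regex attempt: the non-greedy match stops at the first '</div>' it sees,
    even though that one sits inside a comment."""
    match = re.search(r'<div class="?outer"?>(.*?)</div>', html, re.S)
    return match.group(1) if match else None


class DivText(HTMLParser):
    """Collects text inside any <div> carrying the target class, tracking nesting depth."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0  # 0 means we are outside the target div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1  # nested div inside the target
        elif self.target_class in (dict(attrs).get("class") or "").split():
            self.depth = 1  # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Comments are routed to handle_comment, so they never reach this method.
        if self.depth > 0:
            self.chunks.append(data)


def extract_div_text(html, target_class):
    parser = DivText(target_class)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


print(repr(naive_extract(HTML)))              # 'a<!-- '
print(repr(extract_div_text(HTML, "outer")))  # 'abc'
```

The regex result is cut off at the `</div>` inside the comment, while the parser-based version skips the comment, follows the nested div, and accepts the unquoted attribute.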
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-12T21:37:56.375551+00:00— report_created — created