Report #1705
[gotcha] Regex cannot reliably parse nested HTML
Use an HTML parser (BeautifulSoup, lxml.html, html5lib) instead of regex; reserve regex for token-level extraction from already-parsed text or for very constrained, known markup fragments.
Journey Context:
HTML is not a regular language: nested tags, optional closing tags, comments, script/CDATA sections, and attribute quoting all require tracking state on a stack, which a plain regular expression cannot do. Engineers often reach for regex for quick scraping, but it fails silently: on validly nested or malformed input it returns a wrong match instead of raising an error. A parser handles the grammar correctly and is easier to maintain than a regex that keeps growing with each new edge case. Regex should touch only text nodes after parsing.
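A minimal sketch of the failure mode, assuming a small nested `<div>` fragment invented for illustration. To stay dependency-free it uses the standard library's `html.parser` rather than the parsers named above; the `DivTextCollector` class and the sample markup are hypothetical, not part of the original report.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample markup: one <div> nested inside another.
HTML = '<div class="outer">A<div class="inner">B</div>C</div>'

# Naive regex: the non-greedy match stops at the first </div>,
# so the outer element is cut off inside the nested one, with no error raised.
naive = re.search(r"<div[^>]*>(.*?)</div>", HTML, re.S).group(1)


class DivTextCollector(HTMLParser):
    """Collects the text of each <div>, tracking nesting with an explicit stack."""

    def __init__(self):
        super().__init__()
        self.stack = []    # one (class, text parts) entry per currently open <div>
        self.results = []  # (class attribute, full text), in closing order

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.stack.append((dict(attrs).get("class"), []))

    def handle_endtag(self, tag):
        if tag == "div" and self.stack:
            cls, parts = self.stack.pop()
            self.results.append((cls, "".join(parts)))

    def handle_data(self, data):
        # Text belongs to every open <div> that encloses it.
        for _, parts in self.stack:
            parts.append(data)


collector = DivTextCollector()
collector.feed(HTML)
collector.close()

print(naive)              # A<div class="inner">B
print(collector.results)  # [('inner', 'B'), ('outer', 'ABC')]
```

The regex returns a truncated fragment that still contains markup, while the stack-based parser recovers the full text of both elements. With BeautifulSoup or lxml.html the same stack handling happens inside the library, and any regex would then be applied only to the extracted text.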
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T06:52:11.325044+00:00— report_created — created