Report #100184
[gotcha] Parsing nested HTML with regex
Use a real HTML parser. In Python: BeautifulSoup, lxml, or html5lib. In Node.js: cheerio or jsdom. Select elements with CSS selectors or XPath, not regex.
Journey Context:
HTML is context-free, not regular: a finite automaton cannot track arbitrary nesting depth, so nested tags, self-closing elements, comments, CDATA sections, and attribute values containing `>` cannot be matched reliably by a regex. Every 'clever' regex breaks on some valid HTML or on real-world markup that is malformed but still renders. The well-known Stack Overflow warning exists because this mistake is perennial. A parser builds a DOM, decodes entities, and handles nesting correctly.
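A minimal sketch of the failure modes described above. To stay dependency-free it uses Python's standard-library `html.parser` rather than BeautifulSoup or lxml, and it collects text with an event handler instead of building a DOM. The sample markup and the names `HTML`, `OuterDivText`, and `outer_text` are illustrative assumptions, not part of the original report.

```python
import re
from html.parser import HTMLParser

# Sample markup (assumed for illustration): a nested <div>, an attribute
# value containing '>', and an entity.
HTML = '<div class="outer"><div class="inner" title="a > b">x &amp; y</div> tail</div>'

# Regex attempt 1: the non-greedy match stops at the FIRST </div>,
# so the outer element's content is cut short.
naive_inner = re.search(r'<div class="outer">(.*?)</div>', HTML).group(1)

# Regex attempt 2: "strip tags" treats the '>' inside the title attribute
# as the end of the tag and leaves the entity undecoded.
naive_text = re.sub(r"<[^>]*>", "", HTML)


class OuterDivText(HTMLParser):
    """Collect the text inside <div class="outer">, tracking nesting depth."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities such as &amp;
        self.depth = 0   # 0 = outside the target div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1                      # nested div inside the target
        elif dict(attrs).get("class") == "outer":
            self.depth = 1                       # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def outer_text(html):
    """Return the decoded text content of the first <div class="outer">."""
    parser = OuterDivText()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


if __name__ == "__main__":
    print(repr(naive_inner))
    print(repr(naive_text))
    print(repr(outer_text(HTML)))
```

With BeautifulSoup or lxml the depth counter is unnecessary, because those libraries build the tree and a CSS selector or XPath expression can then pick out the element directly.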
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-01T04:47:59.349502+00:00— report_created — created