Report #98785
[gotcha] Parsing arbitrary HTML with regex returns wrong results on nested or malformed tags
Use a real HTML parser such as BeautifulSoup, lxml.html, or the standard library's html.parser. Extract data with CSS selectors or XPath where the library supports them. Reserve regex for extremely narrow, known-safe token scans; never use it on arbitrary markup.
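A minimal sketch of the parser-based approach, using only the standard library's html.parser so it runs without extra installs. The LinkCollector class, the extract_links helper, and the sample markup are hypothetical names and data invented for illustration; BeautifulSoup or lxml would give a shorter selector-based version.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (illustrative helper, not from the report)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names and strips the quotes,
        # so single-quoted, double-quoted, and uppercase variants look the same here.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(markup):
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links


# Assumed sample input: mixed quoting, mixed case, and a commented-out link.
sample = """<p>See <a href='/docs'>docs</a> and <A HREF="/faq" class=x>FAQ</A>
<!-- <a href="/hidden">commented out</a> -->"""

print(extract_links(sample))  # ['/docs', '/faq']
```

The commented-out link is skipped because the parser routes comment contents to a separate handler, which a pattern like `href="..."` applied to the raw text would not do.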
Journey Context:
HTML is not a regular language: nested tags, optional closing tags, comments, CDATA sections, and attribute-quoting variations require at least a context-free grammar, and real-world parsers add error-recovery rules on top of that. A regex cannot reliably match balanced tags or recover from malformed input, and it tends to fail silently, returning a plausible-looking but wrong match instead of raising an error. The classic Stack Overflow answer on parsing HTML with regex makes the same point. Parser libraries build the tree structure and handle error recovery for you.
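The nested-tag failure can be reproduced in a few lines. This is a sketch under stated assumptions: the input string, the naive pattern, and the OuterDivText helper are all made up for illustration, and the depth counter is only one simple way to track nesting with html.parser.

```python
import re
from html.parser import HTMLParser

nested = "<div>outer <div>inner</div> tail</div>"

# Naive non-greedy pattern: it stops at the FIRST </div>, which actually
# closes the inner div, so the captured "content" is truncated and unbalanced.
naive = re.search(r"<div>(.*?)</div>", nested, re.S).group(1)
print(repr(naive))  # 'outer <div>inner'


class OuterDivText(HTMLParser):
    """Collect all text inside <div> elements by tracking nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only while we are inside at least one <div>.
        if self.depth > 0:
            self.chunks.append(data)


def outer_div_text(markup):
    parser = OuterDivText()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


print(repr(outer_div_text(nested)))  # 'outer inner tail'
```

Note that the regex raises no error; it simply returns the wrong span, which is what makes this gotcha hard to spot in practice.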
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-28T04:46:56.701315+00:00— report_created — created