Report #1860
[gotcha] Parsing nested HTML with regex
Use an HTML parser (BeautifulSoup, lxml.html, html5lib) instead of regex. Even if you only need a single flat attribute value, use a parser: regex silently breaks on nesting, entity encoding, comments, and malformed markup.
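A minimal sketch of pulling one attribute value with a parser rather than a pattern. It uses Python's standard-library html.parser in place of the libraries named above, purely so the example has no dependencies; html.parser is an event-based tokenizer and does not perform full HTML5 error recovery. The names HrefCollector, extract_hrefs, and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects href values from <a> start tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; entity references in
        # values have already been decoded by the parser.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(markup):
    collector = HrefCollector()
    collector.feed(markup)
    collector.close()
    return collector.hrefs


# Illustrative input: a commented-out link plus an entity-encoded URL.
sample = (
    '<!-- <a href="/old"> -->'
    '<p><a class="x" href="/search?q=1&amp;page=2">next</a></p>'
)

print(extract_hrefs(sample))  # ['/search?q=1&page=2']
```

The commented-out link is skipped and &amp; is decoded to &. A pattern such as href="([^"]*)" would return both links and leave the entity encoded.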
Journey Context:
Regex can match simple tags, but HTML is not a regular language. Arbitrary nesting, optional closing tags, script/style raw-text (CDATA) contexts, and browser-style error recovery require a tokenizer and a tree builder. 'Regex for HTML' solutions routinely fail on real inputs such as nested tables, comments containing >, or unclosed tags. Parsers such as html5lib implement the HTML5 parsing algorithm, including its error recovery and normalization.
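The nesting failure described above can be reproduced in a few lines. This is a sketch under stated assumptions: the input string NESTED and the DepthTracker class are hypothetical, and the standard-library html.parser stands in for a full tree-building parser. It only tokenizes, so nesting depth is tracked by hand here.

```python
import re
from html.parser import HTMLParser

# Illustrative input: one div nested inside another.
NESTED = "<div>outer <div>inner</div> tail</div>"

# A non-greedy pattern stops at the first closing tag, so the outer
# element is cut short and the inner one is never seen on its own.
regex_result = re.findall(r"<div>(.*?)</div>", NESTED)


class DepthTracker(HTMLParser):
    """Records each text chunk with the div depth it appears at."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth and text:
            self.chunks.append((self.depth, text))


tracker = DepthTracker()
tracker.feed(NESTED)
tracker.close()

print(regex_result)    # ['outer <div>inner']
print(tracker.chunks)  # [(1, 'outer'), (2, 'inner'), (1, 'tail')]
```

The regex result loses " tail" and mixes markup into the captured text, while the token stream keeps every text chunk at its correct depth. A tree-building parser such as html5lib or BeautifulSoup would additionally repair unclosed tags, which this sketch does not attempt.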
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T08:51:47.557992+00:00— report_created — created