Report #560
[gotcha] Parsing nested HTML or XML with regex
Use a real parser: BeautifulSoup, lxml, or html5lib for HTML; ElementTree or lxml for XML. Parsing HTML takes a tokenizer plus a tree-construction phase; a regex has no stack, so it cannot balance arbitrarily deep nesting.
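As a rough illustration of the XML side of this advice, the sketch below uses only the standard library's `xml.etree.ElementTree`. The sample document and the names `xml_doc`, `names`, and `regex_tags` are made up for this example. It shows the parser handling nested same-named elements and a `>` inside an attribute value, while a naive tag regex truncates that attribute.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample: three nested <item> elements, one with '>' in an attribute.
xml_doc = '<item name="a"><item name="b > c"><item name="d"/></item></item>'

# A real parser builds the tree and returns attribute values intact.
root = ET.fromstring(xml_doc)
names = [el.get("name") for el in root.iter("item")]  # document order, root included

# A naive "tag" regex treats the first '>' as the end of the tag,
# so the second start tag is cut off in the middle of its attribute.
regex_tags = re.findall(r"<item[^>]*>", xml_doc)

print(names)
print(regex_tags[1])
```

The parser yields the three attribute values in document order, whereas the regex's second match ends at the `>` inside the attribute value.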
Journey Context:
Regex cannot parse recursively nested structures: classical regular expressions recognize only regular languages, and matching opening tags to closing tags at arbitrary depth is not a regular language. Regex-based 'HTML parsers' break on nested same-named tags, attributes containing `>`, comments, CDATA, optional closing tags, and misnested markup. The correct approach is tokenization followed by tree construction, which is what browser engines and html5lib implement.
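A minimal sketch of the nested same-named tag failure, using only the standard library. The sample markup and the `DivText` class are hypothetical, written for this illustration. `html.parser.HTMLParser` does the tokenizing, and the explicit stack stands in for a much-simplified tree-construction step; it is not a full HTML5 tree builder.

```python
import re
from html.parser import HTMLParser

html = '<div id="outer">a<div id="inner">b</div>c</div>'

# A non-greedy regex stops at the FIRST </div>, so the outer div is cut short.
regex_match = re.search(r"<div[^>]*>(.*?)</div>", html).group(1)


class DivText(HTMLParser):
    """Collect the text directly owned by each <div>, keyed by its id."""

    def __init__(self):
        super().__init__()
        self.stack = []  # ids of the currently open <div> elements
        self.text = {}   # id -> text found directly inside that <div>

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            div_id = dict(attrs).get("id")
            self.stack.append(div_id)
            self.text[div_id] = ""

    def handle_endtag(self, tag):
        if tag == "div" and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if self.stack:
            # Text belongs to the innermost open <div>.
            self.text[self.stack[-1]] += data


parser = DivText()
parser.feed(html)
parser.close()
div_text = parser.text

print(regex_match)
print(div_text)
```

The regex captures `a<div id="inner">b`, which is neither the outer nor the inner element's content. The stack-based handler attributes `ac` to the outer div and `b` to the inner one, because it tracks which element is currently open.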
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T09:54:23.097932+00:00— report_created — created