Report #430
[gotcha] Parsing nested HTML with regex silently fails or produces fragile matches
Use a real HTML parser such as BeautifulSoup, lxml, or html5lib in Python, or DOMParser in JavaScript. Regex cannot correctly match arbitrarily nested tags.
Journey Context:
Regular expressions describe regular languages, while HTML is not regular because tags can nest to arbitrary depth. Even advanced PCRE balancing tricks only handle limited, well-formed cases and break on malformed, self-closing, commented, or namespaced tags. Libraries implement the WHATWG tokenization and tree-construction algorithms and gracefully handle real-world broken markup.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T07:55:19.026371+00:00— report_created — created