Report #185
[gotcha] Regex fails or hangs on nested HTML tags
Use an HTML parser (BeautifulSoup, lxml.html, html5lib, or the browser DOM) for any document whose markup may be malformed or arbitrarily nested. Reserve regex for extracting from a single, predictable tag fragment.
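A minimal sketch of this advice, assuming a made-up input string. To stay dependency-free it uses Python's standard-library html.parser, which is not one of the libraries named above; the same idea should carry over to BeautifulSoup or lxml.html. The LinkCollector class and the sample markup are hypothetical, for illustration only.

```python
import re
from html.parser import HTMLParser

# Hypothetical input: a quoted attribute value that contains '>'.
html = '<a href="/x" title="a > b">link</a>'

# Naive regex: [^>]* stops at the '>' inside the title attribute,
# so the captured "text" includes leftover attribute debris.
naive = re.findall(r'<a[^>]*>(.*?)</a>', html)


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs for each <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


collector = LinkCollector()
collector.feed(html)
collector.close()
links = collector.links

print(naive)  # [' b">link']
print(links)  # [('/x', 'link')]
```

The parser tokenizes the quoted attribute as a unit, so the '>' inside it never ends the tag. html.parser is a lenient tokenizer rather than a full implementation of the WHATWG algorithm; for browser-equivalent handling of malformed markup, html5lib is the closer match.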
Journey Context:
HTML is not a regular language: tags can nest arbitrarily, quoted attribute values can contain unescaped angle brackets, and browsers repair malformed markup using a state-machine tokenizer and tree builder. A regex cannot reliably pair opening tags with their closing tags, and patterns with nested or overlapping quantifiers can backtrack catastrophically on real-world pages. The WHATWG HTML Standard defines the canonical parsing algorithm; using it (or a library that implements it) avoids silent security bugs such as script injection from edge-case markup.
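The nesting problem can be reproduced in a few lines. This is a sketch under stated assumptions: the sample markup and the DivText class are hypothetical, and the standard-library html.parser stands in for a full parser.

```python
import re
from html.parser import HTMLParser

# Hypothetical input: one <div> nested inside another.
html = '<div id="outer">a<div id="inner">b</div>c</div>'

# A non-greedy regex pairs the first <div> with the FIRST </div>,
# which belongs to the inner element, so the match is cut short.
naive = re.search(r'<div[^>]*>(.*?)</div>', html, re.S).group(1)


class DivText(HTMLParser):
    """Collects text that appears inside any <div>, tracking nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


parser = DivText()
parser.feed(html)
parser.close()
outer_text = "".join(parser.chunks)

print(naive)       # a<div id="inner">b
print(outer_text)  # abc
```

Switching the regex to a greedy match only moves the failure: it would then run to the last </div> in the document, which breaks as soon as two sibling elements appear. Tracking depth requires a counter, which is exactly what a regular expression lacks.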
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-12T21:40:40.187812+00:00— report_created — created