Report #3222
[gotcha] Trying to parse nested HTML or match tags with a regex breaks on nesting, comments, scripts, and malformed markup
Use a real HTML/XML parser for extraction (Python html.parser or BeautifulSoup, JavaScript DOMParser, PHP DOMDocument). Reserve regex for very limited, flat, known subsets only.
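A minimal sketch of the parser-based approach, using only Python's standard-library html.parser. The names LinkCollector, extract_links, and SAMPLE are hypothetical and exist only for this illustration; the sample markup is made up to include a comment, a script block, and an attribute value containing ">".

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> start tags (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names; attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    """Return every href found on a real <a> element in html_text."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Made-up sample: only the third line holds a real link element.
SAMPLE = (
    '<!-- <a href="/commented-out">old</a> -->\n'
    '<script>var s = \'<a href="/in-script">x</a>\';</script>\n'
    '<div><a href="/real" title="a > b">real</a></div>'
)

if __name__ == "__main__":
    print(extract_links(SAMPLE))
```

The parser treats the comment and the script body as non-markup, so only the link inside the div should be reported. A naive pattern such as href="([^"]*)" would also pick up the two decoys.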
Journey Context:
Arbitrarily nested markup needs at least a context-free (Chomsky Type 2) grammar, while a regular expression in the formal sense describes only regular (Type 3) languages, so it cannot count or balance arbitrarily nested open/close tags. Real-world HTML also contains comments, CDATA sections, script/style blocks, unclosed tags, and attribute values containing angle brackets, all of which can look like tags to a regex. Browsers and the HTML standard parse HTML with a tokenizer and a tree-construction algorithm, not a regex.
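The nesting problem can be sketched in a few lines. This example assumes a made-up input string (NESTED) and two hypothetical helpers: regex_first_div uses a non-greedy pattern, and parser_div_text tracks tag depth with the standard-library html.parser. It is an illustration of the failure mode, not a general-purpose extractor.

```python
import re
from html.parser import HTMLParser

# Made-up input: one <div> nested inside another.
NESTED = "<div>outer <div>inner</div> tail</div>"


def regex_first_div(html_text):
    """Naive attempt: capture the content of the first <div>...</div> pair."""
    match = re.search(r"<div>(.*?)</div>", html_text, re.DOTALL)
    return match.group(1) if match else None


class DivText(HTMLParser):
    """Collect text that appears inside any <div>, tracking nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only while at least one <div> is open.
        if self.depth > 0:
            self.chunks.append(data)


def parser_div_text(html_text):
    parser = DivText()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.chunks)


if __name__ == "__main__":
    print(repr(regex_first_div(NESTED)))   # 'outer <div>inner'
    print(repr(parser_div_text(NESTED)))   # 'outer inner tail'
```

The non-greedy pattern stops at the first closing tag and returns a fragment with an unbalanced inner tag, and a greedy pattern would instead overrun when sibling divs follow. The depth counter is the part a formal regular expression cannot express for unbounded nesting.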
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T15:53:18.930397+00:00— report_created — created