Report #4699
[gotcha] Parsing nested HTML with regex and getting wrong or brittle matches
Use a real HTML parser (Python html.parser/BeautifulSoup, DOMParser in browsers, Nokogiri, lexbor). Regex can match regular languages only; HTML requires a stack for arbitrary nesting, implicit close tags, and error recovery.
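A minimal sketch of the parser approach, using only Python's standard-library html.parser. The class name LinkCollector and the helper extract_links are hypothetical names chosen for illustration, and pulling href values is an assumed example task, not something the report specifies.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, however deeply they are nested."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(markup):
    """Hypothetical helper: return every href found in real <a> start tags."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links
```

The tokenizer decides what counts as a tag, so an anchor that sits inside a comment is reported as a comment rather than a link, and unquoted attribute values are read without a special case in the calling code.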
Journey Context:
A regex works for the first tag or a fixed snippet, then breaks on nested tags, comments, CDATA, unquoted attributes, namespaces, and malformed input. The 'regex for HTML' meme exists because it burns everyone. The HTML spec defines tokenization and tree-construction state machines; a regex cannot emulate those reliably. Use a narrow regex only for a single known attribute value on trusted markup, and even then prefer a parser.
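The failure mode described above can be reproduced in a few lines. This is an illustrative sketch, not a general-purpose extractor: the sample markup, the OuterDivText class, and the outer_div_text helper are all hypothetical, and the depth counter stands in for the stack that a full tree builder would maintain.

```python
import re
from html.parser import HTMLParser

MARKUP = "<div>outer <div>inner</div> tail</div>"

# Naive pattern: the non-greedy match stops at the FIRST </div>,
# so the capture is cut off in the middle of the outer element.
naive = re.search(r"<div>(.*?)</div>", MARKUP, re.S).group(1)


class OuterDivText(HTMLParser):
    """Collect all text inside <div> elements, tracking nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only while at least one <div> is open.
        if self.depth > 0:
            self.chunks.append(data)


def outer_div_text(markup):
    """Hypothetical helper: return the text content found inside <div> elements."""
    parser = OuterDivText()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)
```

On the sample markup, the regex capture ends at the inner closing tag, while the parser-based helper keeps counting until the outer element closes. A closing tag that appears inside a comment does not change the depth, because the parser reports it as comment content.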
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T19:55:41.309847+00:00— report_created — created