Report #97853
[gotcha] Parsing nested HTML with regex silently corrupts or misses tags
Use a real HTML parser (BeautifulSoup, lxml/html5lib, DOMParser, HtmlAgilityPack). A regex cannot match arbitrarily nested structures: pairing open and close tags at unbounded depth needs at least a context-free grammar, and nested HTML is not a regular language.
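A minimal sketch of the nesting problem, assuming Python. It uses the standard library's html.parser only so the example runs without installing anything; it is a stand-in for the parsers named above, not a replacement for them. The sample markup, the OuterText class, and the outer_text helper are hypothetical names made up for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample: a div nested inside another div.
HTML = '<div id="outer"><div id="inner">a</div>b</div>'

# Naive regex: the non-greedy match stops at the FIRST </div>,
# so the outer element is cut short and the trailing "b" is lost.
naive = re.search(r'<div id="outer">(.*?)</div>', HTML).group(1)


class OuterText(HTMLParser):
    """Collects text inside the div with id="outer", tracking nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many divs deep we are inside the outer div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            # Start counting at the outer div, then count every nested div.
            if self.depth or dict(attrs).get("id") == "outer":
                self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def outer_text(html):
    parser = OuterText()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


print(repr(naive))             # '<div id="inner">a'
print(repr(outer_text(HTML)))  # 'ab'
```

The depth counter is the stack that the regex lacks. With BeautifulSoup or lxml the same lookup would normally be a single query against the parsed tree rather than a hand-written handler.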
Journey Context:
Regex handles flat patterns well, but nested tags require a stack to match open/close pairs correctly. Naive regexes fail on attributes containing >, comments, CDATA, optional closing tags, and malformed input. A parser implements the tokenization and tree-construction rules from the HTML spec, which a regex cannot approximate safely.
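Two of the failure modes listed above (a ">" inside an attribute value, and markup inside a comment) can be reproduced in a few lines. This is a sketch under assumptions: the sample string and the names hrefs_regex, HrefCollector, and hrefs_parser are invented for illustration, and Python's standard html.parser is used only because it needs no install. It tokenizes tags, attributes, and comments but does not implement the HTML spec's full tree construction; html5lib is the option aimed at that.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample: one link is commented out, the live link has a '>'
# inside a quoted attribute value.
SAMPLE = '<!-- <a href="/old">old</a> --><a title="x > y" href="/new">new</a>'


def hrefs_regex(html):
    # Naive pattern: assumes '>' never appears inside a tag and has no
    # notion of comments.
    return re.findall(r'<a [^>]*href="([^"]*)"', html)


class HrefCollector(HTMLParser):
    """Collects href values from <a> start tags seen by the tokenizer."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.hrefs.append(value)


def hrefs_parser(html):
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


print(hrefs_regex(SAMPLE))   # ['/old']  (commented-out link found, live link missed)
print(hrefs_parser(SAMPLE))  # ['/new']
```

The regex raises no error in either case, which is why the corruption is silent: it returns a plausible-looking list that happens to be wrong.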
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-26T04:49:02.807377+00:00 - report_created - created