Report #316
[gotcha] Parsing nested HTML with regex instead of a real parser
Use an HTML parser (BeautifulSoup, lxml.html, parse5, Jsoup, DOMParser) for any extraction that must survive real-world markup. Reserve regex for narrowly scoped, flat fragments whose structure is known, and never use it for nested or arbitrary HTML.
Journey Context:
Regex cannot match arbitrarily nested structures because nested markup is not a regular language: balanced open and close tags impose the same constraint as balanced parentheses, and tracking that takes a stack. Naive regexes also break on attributes containing >, comments, CDATA sections, script/style contents, self-closing tags, and malformed markup. A regex can work for one specific page that never changes, but it is brittle, and when the markup evolves it tends to fail silently, returning wrong or empty matches instead of raising an error. Spec-compliant parsers implement the HTML5 tokenization and tree-construction algorithms, which handle error recovery and nesting correctly.
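A minimal sketch of two of the failure modes above (nesting, and an attribute containing >). To stay dependency-free it uses Python's standard-library html.parser rather than one of the parsers named in this report, so treat it as an illustration and not a recommendation of that module. The sample markup and the names NESTED, TRICKY, OuterText, HrefCollector, outer_text, and hrefs are all hypothetical.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample markup, not taken from any real page.
NESTED = '<div class="outer"><div class="inner">a</div>b</div>'
TRICKY = '<a title="x > y" href="/p">link</a>'

# Naive regex 1: the lazy match stops at the first </div>,
# so the contents of the outer div are truncated.
naive_div = re.search(r'<div class="outer">(.*?)</div>', NESTED, re.S).group(1)

# Naive regex 2: [^>]* cannot cross the ">" inside the quoted title
# attribute, so the href is never found.
naive_href = re.search(r'<a [^>]*href="([^"]*)"', TRICKY)


class OuterText(HTMLParser):
    """Collects text inside <div class="outer">, tracking nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <div>s deep we are inside the outer div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            # Start counting at the outer div, then count every nested div.
            if self.depth > 0 or dict(attrs).get("class") == "outer":
                self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


class HrefCollector(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def outer_text(html):
    parser = OuterText()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


def hrefs(html):
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


if __name__ == "__main__":
    print(repr(naive_div))     # '<div class="inner">a'  (truncated)
    print(naive_href)          # None  (silent miss)
    print(outer_text(NESTED))  # ab
    print(hrefs(TRICKY))       # ['/p']
```

Neither regex raises an error: the first returns a truncated fragment and the second returns nothing. The depth counter in OuterText is the stack that the regex lacks.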
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T04:38:49.238381+00:00— report_created — created