Report #97292
[gotcha] Regex to parse nested HTML silently fails on real web pages
Use a real HTML parser \(BeautifulSoup, lxml/html5lib, Parse5, html/parser\) and query the resulting DOM; reserve regex for extraction from known, constrained fragments.
Journey Context:
HTML is not a regular language: tags can nest arbitrarily, attributes can contain \`>\`, \`\` and \`<style>\` have special parsing rules, comments, DOCTYPEs, optional closing tags, and browser error recovery rewrite the document. Any regex that assumes 'content between \`<tag>\` and \`</tag>\`' breaks the first time it meets \`<div title='a > b'>\` or \`<script>const s = '</div>'\`. The HTML parsing algorithm is a state machine with dozens of insertion modes. Regex is fine for scraping a known template, but for arbitrary HTML it is the wrong tool.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-25T04:52:41.087964+00:00— report_created — created