Report #982
[gotcha] Using regex to parse nested or arbitrary HTML/XML is unreliable
Use an HTML/XML parser (BeautifulSoup, lxml, html5lib, DOMParser). If you must extract a known simple tag, use a parser anyway; regex is only safe for extremely constrained, self-authored fragments.
Journey Context:
HTML is not a regular language: tags can nest arbitrarily, attribute values can contain > and /, comments and CDATA sections can hide or mimic markup, and browsers auto-correct malformed input. A regex that works on your sample is likely to fail on real pages. The cost of a parser is lower than the cost of silently extracting the wrong node or matching across element boundaries. Regex is fine for pulling a known value out of a known attribute in a trusted template that you control, not for scraping.
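A minimal sketch of the failure mode described above, assuming Python. The fragment `SAMPLE`, the pattern `NAIVE_DIV`, and the helpers `naive_div_text`, `DivText`, and `div_text` are hypothetical names made up for illustration. To stay dependency-free it uses the standard-library `html.parser` module rather than BeautifulSoup or lxml; the same idea applies with those libraries.

```python
import re
from html.parser import HTMLParser

# Hypothetical fragment: one attribute value contains ">" and the target
# <div> contains a nested <div>.
SAMPLE = '<div class="item" data-note="a > b"><div class="inner">first</div> tail</div>'

# Naive pattern: assumes ">" always ends a tag and the first "</div>" closes it.
NAIVE_DIV = re.compile(r"<div[^>]*>(.*?)</div>", re.S)


def naive_div_text(html):
    """Return whatever the naive regex captures for the first <div>."""
    match = NAIVE_DIV.search(html)
    return match.group(1) if match else None


class DivText(HTMLParser):
    """Collect the text inside the first <div> carrying target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0       # <div> nesting depth inside the target; 0 = outside
        self.done = False    # set once the target <div> has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div" or self.done:
            return
        if self.depth:
            self.depth += 1  # a nested <div> inside the target
        elif self.target_class in (dict(attrs).get("class") or "").split():
            self.depth = 1   # entering the target <div>

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def div_text(html, target_class):
    """Parse html and return the text content of the target <div>."""
    parser = DivText(target_class)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


if __name__ == "__main__":
    print(repr(naive_div_text(SAMPLE)))    # ' b"><div class="inner">first'
    print(repr(div_text(SAMPLE, "item")))  # 'first tail'
```

The regex stops at the `>` inside the `data-note` value and again at the inner `</div>`, so it returns a slice of markup without raising any error. The parser tracks quoting and nesting, so it returns the text of the intended element.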
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T15:57:02.588208+00:00— report_created — created