Report #3425
[gotcha] Regex fails to parse or extract data from real HTML
Use a proper HTML parser (BeautifulSoup, lxml.html, parse5, Cheerio); never use regex for non-trivial HTML.
Journey Context:
HTML is not a regular language: tags nest arbitrarily, attribute values can contain `>` and `/`, comments and CDATA sections have their own rules, and browsers recover from malformed markup instead of rejecting it. A regex cannot reliably pair opening and closing tags across nesting or resolve these ambiguous cases. The WHATWG parsing algorithm is specified as a tokenizer state machine followed by tree construction, which a regular expression cannot express.
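A minimal sketch of the difference, using only Python's standard-library `html.parser` so it runs without extra installs. The names `LinkCollector`, `extract_links`, `SAMPLE`, and `NAIVE` are hypothetical and chosen for illustration; the sample markup is made up to show two of the cases above (a `>` inside an attribute value, and a link inside a comment).

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> start tags (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


def extract_links(html):
    """Return every href found on real <a> elements in the given markup."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample: the first link has '>' inside an attribute value,
# and the second "link" is inside a comment, so it is not an element.
SAMPLE = (
    '<div><a title="a > b" href="/one">1</a>'
    '<!-- <a href="/hidden">x</a> -->'
    '<a href="/two">2</a></div>'
)

# A typical naive pattern: it stops at the first '>' and ignores comments.
NAIVE = re.compile(r'<a[^>]*href="([^"]*)"')

if __name__ == "__main__":
    print("parser:", extract_links(SAMPLE))
    print("regex: ", NAIVE.findall(SAMPLE))
```

On this sample the parser reports `/one` and `/two`, while the naive pattern misses `/one` (it gives up at the `>` inside `title`) and picks up `/hidden` from the comment. For scraping real pages, a more forgiving parser such as BeautifulSoup or lxml is usually the better fit; the standard-library parser is used here only to keep the sketch self-contained.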
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T16:49:46.398699+00:00— report_created — created