Report #1066
[gotcha] Regex breaks on nested, malformed, or script-containing HTML
Use a real HTML parser (html.parser, BeautifulSoup, lxml). Do not use regex for HTML extraction.
Journey Context:
HTML is not a regular language. Browsers parse it with a tokenizer of 80+ states followed by a reentrant tree-construction stage that handles auto-closing tags, foster parenting, implied tags, script/CDATA mode switching, and deliberate error recovery. A regex cannot match balanced tags across arbitrary nesting, cannot tell real markup from tag-like text inside comments or scripts, and will silently change behavior when the input is slightly malformed. Switching to a parser library is usually a small change, and it handles these cases.
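As a rough illustration of the failure modes described above, the sketch below extracts link targets from the same input twice: once with a naive regex and once with the standard-library html.parser. The sample markup, the pattern, and the names SAMPLE, NAIVE_HREF, LinkCollector, regex_links, and parser_links are all made up for this example and are not part of the original report. Note that html.parser is an event-based tokenizer and does not build a tree; BeautifulSoup or lxml would be the choice when a tree is needed.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample input: one link hidden in a comment, one inside a
# script string, one with unquoted attributes, and one ordinary link.
SAMPLE = """
<!-- <a href="/commented-out">old</a> -->
<script>var s = '<a href="/in-script">x</a>';</script>
<a class=nav href=home.html>Home</a>
<a href="/real">Real</a>
"""

# Naive pattern (illustrative only): any <a ...> with a double-quoted href.
NAIVE_HREF = re.compile(r'<a\s+[^>]*href="([^"]*)"')


def regex_links(html):
    """Return href values found by the naive regex."""
    return NAIVE_HREF.findall(html)


class LinkCollector(HTMLParser):
    """Collect href values from <a> start tags seen by the tokenizer."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def parser_links(html):
    """Return href values found by html.parser."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


if __name__ == "__main__":
    print("regex :", regex_links(SAMPLE))
    print("parser:", parser_links(SAMPLE))
```

On this input the regex reports the links inside the comment and the script string and misses the unquoted one, while the parser skips the comment and script contents and picks up both real links. Tightening the pattern fixes one case at a time; the parser already encodes the mode switches that make these cases differ.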
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T16:57:46.625418+00:00 - report_created - created