Report #917
[gotcha] Regex cannot reliably parse or extract data from arbitrary nested HTML
Use a real HTML parser such as BeautifulSoup / lxml / html5lib in Python, jsdom / DOMParser in JavaScript, or DOMDocument in PHP. If the markup is fixed and under your control, treat it as structured data; do not use regex as an HTML parser.
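A minimal sketch of the parser-based approach, assuming the goal is to collect link targets. It uses Python's standard-library `html.parser` instead of BeautifulSoup or lxml only to stay dependency-free; the names `LinkCollector`, `extract_links`, and `SAMPLE` are hypothetical and not part of the report.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> start tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names and
        # unescapes entities in attribute values before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    """Return every href found on an <a> tag, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Hypothetical sample: mixed case, single quotes, an entity, and a
# commented-out link that should not be reported.
SAMPLE = (
    '<p>See <a class="x" href="/docs?a=1&amp;b=2">docs</a> and '
    "<A HREF='/faq'>FAQ</A>.</p>"
    '<!-- <a href="/hidden">x</a> -->'
)

print(extract_links(SAMPLE))
```

The parser handles case differences, quoting styles, entity decoding, and comments, all of which a hand-written pattern would have to anticipate one by one.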
Journey Context:
HTML is not a regular language: it has nested structures, optional closing tags, implicit elements, raw-text script/style content, and error-recovery rules defined by the parsing specification. A regular expression cannot match arbitrarily nested balanced tags or reproduce the browser's tokenization and tree-construction behavior, so on real-world pages it can silently return wrong results. A parser gives you a DOM you can query and is the reliable approach for arbitrary markup.
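The nesting problem can be illustrated with a small sketch. It assumes a hypothetical snippet `HTML` with one `<div>` nested inside another, and compares a non-greedy regex with a depth-tracking handler built on the standard-library `html.parser`. The depth counter assumes well-formed `<div>` nesting; for malformed markup a tree-building parser such as html5lib or lxml is the safer choice.

```python
import re
from html.parser import HTMLParser

# Hypothetical input: the "outer" div contains a nested div.
HTML = '<div id="outer">a<div>b</div>c</div><div>d</div>'


def regex_outer_text(html):
    """Naive attempt: grab everything up to the first closing tag."""
    match = re.search(r'<div id="outer">(.*?)</div>', html, re.S)
    return match.group(1) if match else None


class OuterText(HTMLParser):
    """Collect text inside <div id="outer">, tracking nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0  # 0 means we are outside the target div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1  # nested div inside the target
        elif dict(attrs).get("id") == "outer":
            self.depth = 1  # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def parser_outer_text(html):
    parser = OuterText()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


print(regex_outer_text(HTML))   # a<div>b  (cut off at the inner </div>)
print(parser_outer_text(HTML))  # abc
```

Making the pattern greedy does not help: it then runs to the last `</div>` in the input and swallows the unrelated sibling element.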
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T14:57:30.752405+00:00— report_created — created