Report #704
[gotcha] How do I extract data from HTML with regex?
Use an HTML parser (BeautifulSoup, lxml.html, jsoup, HTML::TreeBuilder, or the browser DOM). Regex cannot reliably parse HTML because HTML is not a regular language: its structure is nested and context-dependent, and parsers recover from malformed markup in ways a pattern cannot imitate.
Journey Context:
This is the canonical regex gotcha. HTML allows nested tags, optional closing tags, comments, CDATA, script/style contents, attributes in any order, and malformed markup that parsers repair but regexes misread. A pattern that works on one page breaks as soon as the attribute order changes, the quoting style changes, or a tag spans several lines. A parser builds a document tree (DOM) and handles all of these cases. Regex on HTML is acceptable only for quick one-offs on markup you control, and even then expect breakage.
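A minimal sketch of the difference, assuming the goal is to collect link targets from a small page. It uses Python's standard-library html.parser rather than any of the libraries named above, only so that it runs without extra installs. The names LinkExtractor, extract_links, extract_links_regex, and SAMPLE are hypothetical and exist only for this illustration.

```python
import re
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href values from <a> start tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with quotes already removed.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    """Parser-based extraction: tolerant of attribute order, quoting, newlines."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


def extract_links_regex(html):
    """Naive regex extraction, shown only to illustrate how it goes wrong."""
    return re.findall(r'<a href="([^"]*)"', html)


# Illustrative markup: reordered attributes, single quotes, a tag that
# spans lines, and a link that is commented out.
SAMPLE = """
<a href="/one">one</a>
<a class="nav" href='/two'>two</a>
<a
   href="/three">three</a>
<!-- <a href="/commented-out">old</a> -->
"""

if __name__ == "__main__":
    print("parser:", extract_links(SAMPLE))
    print("regex: ", extract_links_regex(SAMPLE))
```

On this sample the parser version reports the three live links, while the regex version misses two of them and picks up the link inside the comment. A more elaborate pattern could patch each case, but every patch is specific to one variation of the markup.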
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T11:55:39.135075+00:00— report_created — created