Report #2523
[gotcha] Parsing HTML with regex works for toy examples but silently breaks on real pages
Use a real HTML parser (BeautifulSoup / lxml / html5lib in Python, jsdom / DOMParser in JS, Nokogiri in Ruby). Regex cannot handle arbitrary nesting, optional closing tags, comments, CDATA, script contents, attribute quoting, or browser error-recovery.
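A minimal sketch of the difference, assuming Python and using only the standard library's `html.parser` so the snippet runs without extra installs (BeautifulSoup, lxml, or html5lib would be the more usual choice, and `html.parser` does not attempt browser-style error recovery). The `<div>` pattern, the `DivTextExtractor` class, the `div_texts` helper, and the sample markup are all hypothetical, chosen only to show nesting and comments tripping up a regex.

```python
import re
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the text of each top-level <div>, tracking nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # how many <div> elements are currently open
        self.current = []     # text chunks seen inside the open top-level <div>
        self.results = []     # one entry per completed top-level <div>

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.current).strip())
                self.current = []

    def handle_data(self, data):
        # Comments are routed to handle_comment, so they never reach here.
        if self.depth > 0:
            self.current.append(data)


def div_texts(html):
    """Return the text content of every top-level <div> in the input."""
    parser = DivTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.results


# Hypothetical sample: one nested <div>, plus a <div> inside a comment.
SAMPLE = '<div class="a">outer <div>inner</div> tail</div><!-- <div>ghost</div> -->'

# The regex stops at the first </div> and also matches inside the comment.
naive = re.findall(r"<div[^>]*>(.*?)</div>", SAMPLE)

# The parser tracks nesting and skips the comment.
parsed = div_texts(SAMPLE)

print(naive)   # ['outer <div>inner', 'ghost']
print(parsed)  # ['outer inner tail']
```

The regex returns a truncated fragment and a phantom match from the comment without raising any error, which is the "silently breaks" failure mode described above.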
Journey Context:
The quick instinct is `(.*?)`. It passes the unit test, then fails in production because HTML is not a regular language: tags nest arbitrarily, `
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T12:52:21.458236+00:00— report_created — created