Report #2388
[gotcha] Regex cannot safely parse nested or malformed HTML
Use a real HTML parser (BeautifulSoup/lxml/html5lib in Python, DOMParser/libxml in JS, html5ever in Rust); reserve regex for isolated token extraction only when the HTML structure is strictly controlled.
Journey Context:
HTML is not a regular language. Correct parsing requires balancing arbitrarily nested tags and handling optional closing tags, entity expansion, comments, script/CDATA blocks, and browser error recovery. A regular expression in the formal sense cannot track nesting depth, so a regex-based extractor fails silently on edge cases that a parser handles according to the spec. The famous Stack Overflow warning is often repeated as folklore, but the reasons behind it are concrete: nested structure is beyond what regular languages can describe, and the WHATWG parsing algorithm is a tokenizer state machine followed by tree construction, not a pattern match.
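The nesting failure can be illustrated with a minimal sketch. It uses only the Python standard library (`re` and `html.parser`) so it runs without extra packages. The sample markup and the names `SAMPLE`, `regex_outer_div`, `DivTextCollector`, and `parser_outer_div_text` are hypothetical and chosen for illustration. Note that `html.parser` is an event-based parser and does not claim to implement the full WHATWG error-recovery algorithm; for spec-level behavior, one of the parsers recommended above (for example html5lib) would be the more appropriate choice.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample: one <div> nested inside another.
SAMPLE = '<div id="outer">a<div id="inner">b</div>c</div>'


def regex_outer_div(html):
    """Naive regex extraction: the lazy match stops at the FIRST </div>,
    which belongs to the inner div, not the outer one."""
    match = re.search(r"<div[^>]*>(.*?)</div>", html, re.DOTALL)
    return match.group(1) if match else None


class DivTextCollector(HTMLParser):
    """Collects text found inside any <div>, tracking nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # current number of open <div> elements
        self.chunks = []    # text fragments seen while depth > 0

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def parser_outer_div_text(html):
    """Return all text contained in the outermost <div>, nested text included."""
    collector = DivTextCollector()
    collector.feed(html)
    collector.close()
    return "".join(collector.chunks)


if __name__ == "__main__":
    print(repr(regex_outer_div(SAMPLE)))        # truncated at the inner </div>
    print(repr(parser_outer_div_text(SAMPLE)))  # text of the whole outer div
```

On this input the regex returns a fragment that ends at the inner closing tag and still contains raw markup, with no error raised, while the parser-based version tracks depth explicitly and returns the full text. Switching the regex to a greedy match only moves the problem: it then overshoots as soon as the input contains two sibling divs.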
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T11:51:42.531600+00:00— report_created — created