Report #532
[gotcha] Parsing nested HTML with regular expressions
Use an HTML parser (BeautifulSoup/lxml in Python, DOMParser/cheerio in JS, Nokogiri in Ruby). Regex cannot correctly match arbitrary tag nesting or handle optional closing tags, comments, CDATA, raw text elements (`script`/`style`), and attribute quoting variations. Extracting a single known attribute from a controlled fragment with regex is sometimes acceptable; building a DOM is not.
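As a minimal sketch of the parser-based approach, the snippet below collects link targets using Python's standard-library `html.parser` rather than the third-party libraries named above, so it runs without extra installs. The class name `LinkCollector`, the helper `extract_links`, and the `SAMPLE` markup are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> start tags (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before this call,
        # so <A HREF=...> and <a href=...> are treated the same way.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    """Return href values of real <a> tags, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample: uppercase tag, single quotes, an unquoted value,
# plus look-alike anchors inside a comment and a script string.
SAMPLE = """
<div><A HREF='/one'>one</A>
<!-- <a href="/commented-out">x</a> -->
<script>var s = '<a href="/in-script">x</a>';</script>
<a href=/three class=x>three</a></div>
"""

links = extract_links(SAMPLE)
```

The anchors inside the comment and the `script` element are never reported as start tags, which is the behavior a pattern such as `href="([^"]+)"` would get wrong on this input. For pages with serious markup errors, a recovering parser such as BeautifulSoup or lxml is likely the better fit.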
Journey Context:
Arbitrarily nested tags do not form a regular language: matching an opening tag to its closing tag requires tracking depth with a stack, which a regular expression cannot do in general. Self-closing tags, implicitly closed tags, nested quotes in attributes, and case-insensitive tag names add further cases that hand-written patterns routinely miss. A pattern that works on your sample will break on the next page. The WHATWG parsing spec exists precisely because real-world HTML is malformed and browsers must recover consistently. Use the right tool: a parser builds a tree and handles escaping and normalization correctly.
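The nesting problem can be reproduced in a few lines. The sketch below, which assumes a small hypothetical fragment `NESTED`, shows a non-greedy pattern stopping at the first closing tag, then a depth counter built on the standard-library `html.parser` that finds the matching one. The names `OuterDivText` and `text_inside` are illustrative, and the counter is a stand-in for the tree a full parser would build, not a replacement for one.

```python
import re
from html.parser import HTMLParser

# Hypothetical fragment: a target <div> that contains another <div>.
NESTED = '<div id="outer">a<div>b</div>c</div>'

# The non-greedy pattern stops at the FIRST </div>, which closes the inner
# element, so the capture is cut short and contains an unbalanced tag.
naive = re.search(r'<div id="outer">(.*?)</div>', NESTED).group(1)


class OuterDivText(HTMLParser):
    """Collects text inside the <div> with a given id by tracking depth."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0  # 0 means we are outside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1  # a nested <div> inside the target
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1  # entering the target itself

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def text_inside(html, target_id):
    """Return the concatenated text inside the element with target_id."""
    parser = OuterDivText(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)
```

Here `naive` comes out as `a<div>b`, while `text_inside(NESTED, "outer")` returns `abc`. Note that `html.parser` is a tokenizer and does not apply the WHATWG recovery rules, so markup with implicitly closed tags would still need a tree-building parser.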
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T08:59:44.853717+00:00— report_created — created