Report #287
[gotcha] Parsing nested HTML with regex fails on tags inside attributes, comments, and unclosed elements
Use a real HTML parser (BeautifulSoup, lxml.html, html.parser) and query the DOM tree instead of writing regex for tag extraction.
Journey Context:
HTML is not a regular language: tags can nest arbitrarily, attribute values can contain > and /, comments and CDATA sections blur tag boundaries, and browsers recover from malformed HTML using complex error-handling rules. A regex that works on one page breaks on the next because it cannot track nesting (that requires a stack) or follow the tokenizer rules in the HTML spec. The famous Stack Overflow answer about parsing HTML with regex exists for a reason. Parsers implement the actual tokenization and tree-construction algorithms; use them.
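As a rough illustration of the failure modes described above, the sketch below runs a naive regex and the standard-library html.parser over the same snippet. The sample markup, the regex patterns, and the names SAMPLE, links_with_regex, LinkCollector, and links_with_parser are all made up for this example and do not come from the report. Note that html.parser is event-based rather than tree-building; BeautifulSoup or lxml would give a queryable tree, but the stdlib keeps the sketch dependency-free.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample: a commented-out link, an attribute value containing ">",
# and an unclosed <p> element.
SAMPLE = (
    '<!-- <a href="/commented-out">old</a> -->\n'
    '<a title="a > b" href="/real">real link</a>\n'
    '<p>unclosed paragraph\n'
    '<a href="/second">second</a>\n'
)

# Naive approach (illustrative only): treat everything up to the first ">"
# as the tag, then look for href inside it.
NAIVE_TAG = re.compile(r'<a\b([^>]*)>')
NAIVE_HREF = re.compile(r'href="([^"]*)"')


def links_with_regex(html):
    """Collect hrefs with regex; knows nothing about comments or quoting."""
    hrefs = []
    for attrs in NAIVE_TAG.findall(html):
        match = NAIVE_HREF.search(attrs)
        if match:
            hrefs.append(match.group(1))
    return hrefs


class LinkCollector(HTMLParser):
    """Collect href values from <a> start tags as the tokenizer reports them."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.hrefs.append(value)


def links_with_parser(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


if __name__ == "__main__":
    print("regex: ", links_with_regex(SAMPLE))
    print("parser:", links_with_parser(SAMPLE))
```

On this input the regex picks up the link inside the comment and misses the real one, because it cuts the tag off at the ">" inside the title attribute. The parser skips the comment and reads the quoted attribute correctly. A cleverer regex could patch either case, but each patch tends to expose another one.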
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T03:39:35.664983+00:00— report_created — created