Report #1279
[gotcha] Parsing nested HTML with regular expressions silently breaks on real pages
Use a purpose-built HTML/XML parser (BeautifulSoup, lxml, html5lib, DOMParser) for extraction or mutation. Reserve regex for tightly constrained string surgery on fragments whose shape is known in advance.
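A minimal sketch of the parser-based approach, using Python's standard-library html.parser as a stand-in for the parsers named above so that it runs without extra installs. The class name DivTextExtractor, the helper extract_div_text, and the sample markup are hypothetical and chosen only for illustration; a real project would more likely reach for BeautifulSoup or lxml.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collects the text of each top-level <div>, including nested <div>s.

    Hypothetical helper for illustration; not part of any library.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0      # number of <div> elements currently open
        self.current = []   # text pieces for the top-level <div> being read
        self.results = []   # finished text, one entry per top-level <div>

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # The outermost <div> just closed: store its collected text.
                self.results.append("".join(self.current))
                self.current = []

    def handle_data(self, data):
        # Only keep text that appears inside at least one open <div>.
        if self.depth > 0:
            self.current.append(data)


def extract_div_text(html):
    parser = DivTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.results


# Nested tags and a ">" inside a quoted attribute value.
sample = '<div title="a > b">outer <div>inner</div> tail</div>'
div_texts = extract_div_text(sample)
print(div_texts)
```

The parser tracks nesting depth through start and end tag callbacks, so the nested div and the > inside the quoted attribute do not cut the result short.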
Journey Context:
Regular expressions cannot match balanced tags or handle overlapping/nested structures, because HTML is not a regular language. A lazy pattern such as (.*?) placed between an opening and a closing tag works on a flat fragment but fails when tags are nested, attributes contain >, comments or scripts interleave, or tags are self-closing. The cost of one quick regex is brittle failures on real pages and security holes when the input is malformed.
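The failure modes can be reproduced in a few lines. This is an illustrative sketch only: the div tag, the exact pattern, and the sample strings are assumptions, since the report does not name a specific tag.

```python
import re

# Hypothetical "quick" pattern: lazy capture between <div ...> and </div>.
pattern = re.compile(r"<div[^>]*>(.*?)</div>", re.DOTALL)

flat = "<div>hello</div>"
nested = "<div>outer <div>inner</div> tail</div>"
with_attr = '<div title="a > b">text</div>'

# Flat fragment: the pattern behaves as intended.
flat_match = pattern.findall(flat)

# Nested tags: the lazy capture stops at the first </div>, so the outer
# element is truncated and the trailing text is lost.
nested_match = pattern.findall(nested)

# ">" inside an attribute value: [^>]* ends the tag too early, so part of
# the attribute leaks into the captured content.
attr_match = pattern.findall(with_attr)

print(flat_match)    # ['hello']
print(nested_match)  # ['outer <div>inner']
print(attr_match)    # [' b">text']
```

None of the three calls raises an error, which is why the breakage is silent: the regex returns a plausible-looking but wrong string.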
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T19:58:30.586248+00:00— report_created — created