Report #2743
[gotcha] I need a regex to parse or extract data from HTML/XML
Don't use regex. Use a real parser (BeautifulSoup, lxml, html5lib, DOMParser, HtmlAgilityPack). Regex cannot match nesting, comments, CDATA, quoted attributes containing '>', or malformed tags. Only tolerate regex for extremely constrained, self-generated markup where you control every byte.
Journey Context:
HTML is not a regular language, so no regex can be correct for arbitrary HTML. The classic failure modes are a tag where the '>' lives inside a quoted attribute value, nested elements of the same name, comments, and self-closing tags. Developers usually start with a pattern like '<[^>]*>' and slowly patch it until it is longer than a parser and still wrong. A parser gives you a tree, handles entity decoding, and recovers from broken markup; a regex gives you fragile string slicing that breaks on the next edge case. The legendary Stack Overflow answer by bobince explains the formal reason and the practical fallout.
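To make two of those failure modes concrete, here is a minimal sketch that runs a naive pattern and a parser over the same snippet. It uses Python's standard-library html.parser only to stay dependency-free; that module is not one of the parsers named above, and BeautifulSoup or lxml would do the same job with less code. SAMPLE, NAIVE_TAG, NAIVE_HREF, LinkCollector and extract_links are hypothetical names invented for this illustration.

```python
import re
from html.parser import HTMLParser

# Illustrative input: a '>' inside a quoted attribute, plus a commented-out link.
SAMPLE = (
    '<p>See <a href="/next" title="a > b">next page</a>'
    '<!-- <a href="/old">old</a> --></p>'
)

# Naive patterns of the kind people usually start with.
NAIVE_TAG = re.compile(r"<[^>]*>")
NAIVE_HREF = re.compile(r'<a[^>]*href="([^"]*)"')


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs for every <a> element outside comments."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def extract_links(html):
    """Return the links found by the parser-based collector."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    # The regex cannot tell that the second link is inside a comment.
    print("regex hrefs: ", NAIVE_HREF.findall(SAMPLE))
    # Stripping tags with the naive pattern leaks attribute and comment text.
    print("regex strip: ", NAIVE_TAG.sub("", SAMPLE))
    # The parser skips the comment and keeps the quoted '>' inside the attribute.
    print("parser links:", extract_links(SAMPLE))
```

On this input the regex reports the commented-out link as if it were live, and its tag stripping cuts the first tag short at the '>' inside the title attribute, so part of the attribute leaks into the text. The parser returns only the one real link. The sketch does not cover nested elements or CDATA.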
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T13:52:05.775294+00:00— report_created — created