Report #317
[gotcha] Extracting URLs from messy free-form text with a single regex
First split the text into candidate URL tokens with a permissive heuristic (e.g., a regex that captures a scheme or www prefix and runs to the next whitespace or delimiter), then validate each candidate with a standards-compliant URL parser. Do not try to encode the full URL grammar in one regex.
Journey Context:
URLs in the wild contain parentheses, brackets, Unicode (IDN), percent-encoding, query strings, and fragments, and they are often followed by trailing punctuation that the author did not intend as part of the URL. A monolithic regex modeled on RFC 3986 can match the grammar but is impractical to maintain, and it still cannot resolve intent: a trailing period or closing parenthesis is valid URL syntax yet usually belongs to the surrounding prose, while an unmatched parenthesis is sometimes meant as part of the URL. The WHATWG URL parser handles normalization, percent-encoding, IDNA, and relative URLs (given a base URL). The robust pipeline is: heuristic extraction -> parser validation -> fallback/manual review for ambiguous delimiters.
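A minimal sketch of that pipeline in Python, under several assumptions: the names CANDIDATE_RE, trim_candidate, validate, and extract_urls are hypothetical, only http/https and www-prefixed candidates are considered, and the standard library's urllib.parse.urlsplit stands in for the validation step. urlsplit is not a WHATWG parser and does no IDNA processing, so swap in a WHATWG-compliant parser if one is available in your environment.

```python
from __future__ import annotations

import re
from urllib.parse import urlsplit

# Step 1 (heuristic): a scheme or "www." prefix, then everything up to
# whitespace, an angle bracket, or a quote. Deliberately permissive.
CANDIDATE_RE = re.compile(r"""(?:https?://|www\.)[^\s<>"']+""", re.IGNORECASE)

TRAILING_PUNCT = ".,;:!?"


def trim_candidate(token: str) -> tuple[str, bool]:
    """Strip likely prose punctuation; flag unbalanced parentheses as ambiguous."""
    while token:
        last = token[-1]
        if last in TRAILING_PUNCT:
            token = token[:-1]
        elif last == ")" and token.count(")") > token.count("("):
            # Extra closer most likely belongs to the surrounding sentence.
            token = token[:-1]
        else:
            break
    ambiguous = token.count("(") != token.count(")")
    return token, ambiguous


def validate(token: str) -> str | None:
    """Step 2 (parser): return a usable URL string, or None if parsing fails."""
    has_scheme = token.lower().startswith(("http://", "https://"))
    url = token if has_scheme else "http://" + token  # assumed default scheme
    try:
        parts = urlsplit(url)
        host = parts.hostname
        parts.port  # accessing .port raises ValueError for an invalid port
    except ValueError:
        return None
    if parts.scheme not in ("http", "https") or not host:
        return None
    return url


def extract_urls(text: str) -> tuple[list[str], list[str]]:
    """Return (accepted, needs_review); step 3 is left to a human or a fallback."""
    accepted: list[str] = []
    review: list[str] = []
    for match in CANDIDATE_RE.finditer(text):
        token, ambiguous = trim_candidate(match.group(0))
        url = validate(token)
        if url is None:
            continue
        (review if ambiguous else accepted).append(url)
    return accepted, review


if __name__ == "__main__":
    sample = (
        "See https://en.wikipedia.org/wiki/Rust_(programming_language), "
        "or (www.example.com/a?b=1#c). Bad: http://[oops"
    )
    print(extract_urls(sample))
```

The trimming rule keeps a closing parenthesis only when it balances an opening one inside the candidate, which preserves Wikipedia-style paths while dropping a parenthesis that wraps the whole URL. Candidates whose parentheses remain unbalanced after trimming are routed to the review list rather than guessed at.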
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T04:38:49.280089+00:00— report_created — created