Report #100186
[gotcha] Extracting URLs from messy text with a simple regex
Use a two-stage approach: a permissive regex to find candidate spans, then validate each candidate with a URL parser such as Python's `urllib.parse` (`urlsplit` or `urlparse`). Handle schemes beyond `http(s)`, IPv6 literals, percent-encoding, and trailing punctuation.
Journey Context:
Under RFC 3986 and the WHATWG URL Standard, valid URLs include `mailto:` and `file:` schemes, IPv6 literals, punycode hosts, and query strings containing parentheses or Unicode. A naive `https?://\S+` misses every non-HTTP scheme, breaks on Markdown-wrapped URLs, and swallows trailing punctuation like `)` or `.`, because real-world text embeds URLs in prose. The robust path is candidate extraction followed by library parsing, not one regex that tries to be both extractor and validator.
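The two-stage idea can be sketched in Python roughly as follows. This is a minimal illustration, not a complete solution. The names `CANDIDATE`, `ALLOWED_SCHEMES`, `strip_trailing`, and `extract_urls` are hypothetical, and the scheme allow-list and the bracket-balancing heuristic are assumptions to adjust for your own input. Note that `urlsplit` is lenient, so the checks on `netloc` and `path` carry part of the validation.

```python
import re
from urllib.parse import urlsplit

# Stage 1: permissive candidate pattern (an assumption, tune as needed):
# a scheme, a colon, then a run of characters that excludes whitespace,
# angle brackets, and quotes.
CANDIDATE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*:[^\s<>\"']+")

# Hypothetical allow-list; extend it for the schemes you care about.
ALLOWED_SCHEMES = {"http", "https", "ftp", "mailto", "file"}

TRAILING = ".,;:!?"
CLOSERS = {")": "(", "]": "[", "}": "{"}


def strip_trailing(candidate: str) -> str:
    """Trim prose punctuation and unbalanced closing brackets from the end."""
    while candidate:
        last = candidate[-1]
        if last in TRAILING:
            candidate = candidate[:-1]
        elif last in CLOSERS and candidate.count(last) > candidate.count(CLOSERS[last]):
            # A closer with no matching opener most likely belongs to the
            # surrounding prose or Markdown, not to the URL.
            candidate = candidate[:-1]
        else:
            break
    return candidate


def extract_urls(text: str) -> list[str]:
    """Stage 2: keep only candidates that the URL parser accepts."""
    urls = []
    for match in CANDIDATE.finditer(text):
        candidate = strip_trailing(match.group(0))
        try:
            parts = urlsplit(candidate)
        except ValueError:
            # Raised for malformed input such as an unclosed IPv6 literal.
            continue
        if parts.scheme not in ALLOWED_SCHEMES:
            continue
        if parts.scheme in {"http", "https", "ftp"} and not parts.netloc:
            continue
        if parts.scheme == "mailto" and "@" not in parts.path:
            continue
        urls.append(candidate)
    return urls


sample = (
    "Docs: https://example.com/a_(b). Mail mailto:user@example.com, "
    "or [x](http://[::1]:8080/p)."
)
print(extract_urls(sample))
```

Balanced parentheses inside a URL are kept, while a closing parenthesis left over from Markdown link syntax is trimmed. Keeping the regex permissive and deferring acceptance to the parser means new schemes only require a change to the allow-list.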
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-01T04:48:03.811849+00:00— report_created — created