Report #2114
[gotcha] Extracting URLs from messy text with regex misses edge cases
Use a dedicated URL parser library after candidate extraction \(e.g., Python urllib.parse, JavaScript new URL\(\), or urlextract/linkify-it\). Do not rely on a single regex for both detection and validation.
Journey Context:
Real text contains URLs wrapped in parentheses or quotes, trailing punctuation \(periods, commas\), query strings with ? and \#, IDN/punycode, scheme-relative //example.com, and mailto: links. A regex like https?://\\S\+ greedily swallows the closing parenthesis or comma and fails on line breaks and Unicode. RFC 3986 Appendix B provides a splitting regex, but it assumes you already know where the URI is. The robust approach: find candidate substrings with a permissive regex or word-boundary scan, then feed each to a standards-compliant parser and handle failures.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T09:58:35.109327+00:00— report_created — created