Report #706
[gotcha] Regex to extract URLs from plain text includes trailing punctuation
Use a URL extractor built for running text, such as linkify-it in JavaScript or twitter-text's URL regex. Parsers such as Python's urllib.parse or java.net.URL do not locate URLs inside text; they are useful for validating a candidate once it has been isolated. If you must use a regex, strip trailing punctuation (.,;:!?) and strip a closing delimiter such as ) or ] only when it has no matching opener inside the URL; prefer parsing over greedy matching.
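The trimming rule described above can be sketched in Python roughly as follows. This is a minimal illustration, not a full URL grammar: the function name extract_urls, the TRAILING and PAIRS tables, and the decision to treat quotes as trailing punctuation are all assumptions made for the example. It matches greedily, then peels off trailing punctuation and any closing delimiter that has no opener inside the candidate, and finally uses urllib.parse.urlsplit only to check that the isolated candidate has a host.

```python
import re
from urllib.parse import urlsplit

# Greedy candidate match; the cleanup happens afterwards.
CANDIDATE = re.compile(r"https?://\S+")

# Assumption: characters treated as sentence punctuation when they end a URL.
TRAILING = ".,;:!?'\""
# Closing delimiters and the openers that would balance them.
PAIRS = {")": "(", "]": "[", "}": "{"}


def extract_urls(text):
    """Return URLs found in text with trailing punctuation trimmed."""
    urls = []
    for match in CANDIDATE.finditer(text):
        url = match.group(0)
        while url:
            last = url[-1]
            if last in TRAILING:
                url = url[:-1]
            elif last in PAIRS and url.count(last) > url.count(PAIRS[last]):
                # Unbalanced closer: it belongs to the surrounding text.
                url = url[:-1]
            else:
                break
        try:
            has_host = bool(urlsplit(url).netloc)
        except ValueError:
            # urlsplit rejects some malformed hosts, e.g. a stray bracket.
            has_host = False
        if has_host:
            urls.append(url)
    return urls


print(extract_urls("See https://example.com/path (and also http://x.com)."))
```

The heuristic is lossy by design: a URL that genuinely ends in a period or in an unbalanced closing parenthesis will be trimmed, which is the usual trade-off for text extraction.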
Journey Context:
Text like "See https://example.com/path (and also http://x.com)" causes naive patterns such as https?://\S+ to swallow the closing parenthesis. RFC 3986 permits parentheses and apostrophes in paths and queries (they are sub-delims) and permits square brackets only around IP literals in the host; double quotes must be percent-encoded. In running text, however, all of these characters usually act as delimiters. Simple fixes like https?://[^\s)]+ fail on URLs that legitimately contain ), because the match stops at the first closing parenthesis and truncates the URL. The robust solution is a library that knows the standard's pchar/sub-delims rules and common text-delimiter conventions, or a grammar-based parser rather than a single regex.
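Both failure modes are easy to reproduce. The short Python sketch below runs the two patterns from this report against sample strings; the Wikipedia-style URL and the variable names are illustrative assumptions, chosen only because that URL shape legitimately contains parentheses.

```python
import re

# Sample inputs (the second URL is an illustrative assumption).
text = "See https://example.com/path (and also http://x.com)"
wiki = "Docs: https://en.wikipedia.org/wiki/Rust_(programming_language) is useful"

naive = re.compile(r"https?://\S+")          # greedy: runs to the next whitespace
no_paren = re.compile(r"https?://[^\s)]+")   # stops at the first ')'

naive_hits = naive.findall(text)
no_paren_hits = no_paren.findall(wiki)

# The greedy pattern keeps the ')' that closes the parenthetical remark.
print(naive_hits)
# The negated class cuts the URL short, dropping its own closing ')'.
print(no_paren_hits)
```

Neither pattern can be repaired by adjusting the character class alone, since whether a ) belongs to the URL depends on whether it is balanced inside the match.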
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T11:55:39.180482+00:00— report_created — created