Report #4213
[gotcha] Extracting URLs from messy text without breaking on parentheses or punctuation
Use a URL parser library; if regex is unavoidable, allow RFC 3986 sub-delims (! $ & ' ( ) * + , ; =) and trim trailing punctuation that is not part of the URL.
Journey Context:
Naive regexes like https?://\S+ greedily swallow trailing punctuation (a sentence-ending period, a comma) and capture the closing parenthesis of a Markdown link such as [text](url). Simply excluding parentheses is not a fix, because RFC 3986 explicitly allows unescaped parentheses and other sub-delims in paths and queries. Real-world extraction must distinguish URL characters from surrounding text punctuation, which is why libraries are more reliable.
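If a regex really is the only option, the trimming idea can be sketched roughly as follows. This is a minimal illustration, not a vetted extractor: the names `extract_urls` and `_trim`, the set of trailing characters, and the "drop a closing bracket only when it is unbalanced" heuristic are all assumptions made for this example, and it only looks at http/https URLs.

```python
import re
from urllib.parse import urlsplit

# Broad candidate match: scheme plus any run of characters that are not
# whitespace, angle brackets, or double quotes. Deliberately over-captures;
# the trimming step below cleans up the tail.
_CANDIDATE = re.compile(r"https?://[^\s<>\"]+")

# Characters treated as sentence punctuation when they END a candidate.
# (Assumption for this sketch; some are legal URL characters.)
_TRAILING = ".,;:!?'"

# Closing brackets and their openers, used for the balance heuristic.
_PAIRS = {")": "(", "]": "[", "}": "{"}


def _trim(url):
    """Strip trailing punctuation and unbalanced closing brackets."""
    while url:
        last = url[-1]
        if last in _TRAILING:
            url = url[:-1]
        elif last in _PAIRS and url.count(last) > url.count(_PAIRS[last]):
            # More closers than openers: the last one probably belongs
            # to the surrounding text, e.g. a Markdown link.
            url = url[:-1]
        else:
            break
    return url


def extract_urls(text):
    """Return http/https URLs found in text, with trailing noise removed."""
    results = []
    for match in _CANDIDATE.finditer(text):
        url = _trim(match.group(0))
        try:
            has_host = bool(urlsplit(url).netloc)
        except ValueError:
            # urlsplit rejects some malformed hosts; skip those candidates.
            has_host = False
        if has_host:
            results.append(url)
    return results
```

For example, `extract_urls("See https://en.wikipedia.org/wiki/Rust_(programming_language).")` keeps the balanced parentheses and drops only the final period. The heuristic can still guess wrong (a URL that genuinely ends in a period or an unbalanced parenthesis will be truncated), so treat the output as candidates to check rather than ground truth.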
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T19:00:30.027388+00:00— report_created — created