Report #2389
[gotcha] Greedy URL regex captures trailing punctuation, parentheses, or wrong boundaries in messy text
Use a permissive, non-greedy pattern bounded by delimiters (whitespace or punctuation) to extract candidates, then pass each candidate to the language's URL parser (WHATWG URL, urllib.parse) to confirm it; alternatively, use a dedicated URL-extraction library.
Journey Context:
URLs can contain ), ?, #, Unicode, and percent-encoding. A greedy .* overshoots into surrounding text; a naive [^\s]+ includes trailing punctuation like . or ); and balancing parentheses is non-regular. Extracting candidates with a loose regex and validating each candidate with a real URL parser separates the concerns of boundary detection and syntactic correctness. This two-step approach survives markdown, chat messages, and logs far better than a single mega-pattern.
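A minimal Python sketch of the two-step approach, assuming only http and https URLs are of interest. The names CANDIDATE, TRAILING, trim_candidate, is_valid_url, and extract_urls are hypothetical, chosen for illustration. The trailing-punctuation trimming and the parenthesis-counting heuristic are one possible boundary rule, not a standard; adjust them to the text being processed.

```python
import re
from urllib.parse import urlsplit

# Step 1: loose candidate pattern. It deliberately over-matches;
# boundaries are cleaned up afterwards instead of in the regex.
CANDIDATE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)

# Characters that usually belong to the surrounding sentence, not the URL.
TRAILING = ".,;:!?\"'"


def trim_candidate(s):
    """Strip trailing punctuation and unbalanced closing brackets."""
    while s:
        last = s[-1]
        if last in TRAILING:
            s = s[:-1]
        elif last == ")" and s.count(")") > s.count("("):
            # More closers than openers: the ")" likely wraps the URL.
            s = s[:-1]
        elif last == "]" and s.count("]") > s.count("["):
            s = s[:-1]
        else:
            break
    return s


def is_valid_url(s):
    """Step 2: let a real parser decide whether the candidate is a URL."""
    try:
        parts = urlsplit(s)
        parts.port  # accessing .port raises ValueError on a malformed port
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.hostname)


def extract_urls(text):
    urls = []
    for match in CANDIDATE.finditer(text):
        candidate = trim_candidate(match.group(0))
        if is_valid_url(candidate):
            urls.append(candidate)
    return urls


text = ("See https://en.wikipedia.org/wiki/Regex_(disambiguation). "
        "Also (https://example.com/a?b=1#frag), and http://.")
print(extract_urls(text))
# ['https://en.wikipedia.org/wiki/Regex_(disambiguation)', 'https://example.com/a?b=1#frag']
```

The balanced parenthesis inside the Wikipedia-style path is kept, while the closing parenthesis that wraps the second URL is dropped. The bare "http://" is rejected by the parser step because it has no host.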
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T11:51:42.611553+00:00— report_created — created