Report #1707
[gotcha] URL regex captures trailing punctuation like dots or closing parentheses
Trim trailing punctuation characters (.,;:!?) and any closing parentheses that are not balanced by an opening parenthesis inside the URL; then validate the extracted substring with a real URL parser such as urllib.parse, or against a regex built from the URL grammar, before using it.
Journey Context:
RFC 3986 defines the allowed URL characters, and that set includes periods, commas, and parentheses. Real text, however, wraps URLs in Markdown parentheses and follows them with sentence-ending periods and commas, so a greedy quantifier pulls that punctuation into the match. A robust extractor treats trailing punctuation as a delimiter, keeps a closing parenthesis only when it balances an opening one inside the URL, and then validates the result with a real parser. This is why a 'simple URL regex' so often fails on copy-pasted prose.
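The approach described above could be sketched roughly as follows. This is a minimal illustration, not a complete extractor: the names `extract_urls`, `_strip_trailing`, and `_looks_valid` are hypothetical, the candidate pattern is deliberately loose, and the check is limited to http/https URLs with a host. Note that urllib.parse is a lenient parser, so the validation step here only rejects candidates with no usable scheme or hostname.

```python
import re
from typing import List
from urllib.parse import urlsplit

# Loose, greedy candidate pattern (an assumption for this sketch):
# everything from the scheme up to whitespace, angle brackets, or quotes.
_CANDIDATE = re.compile(r"https?://[^\s<>\"']+")
_TRAILING = ".,;:!?"


def _strip_trailing(candidate: str) -> str:
    """Drop trailing punctuation and unbalanced closing parentheses."""
    while candidate:
        last = candidate[-1]
        if last in _TRAILING:
            candidate = candidate[:-1]
        elif last == ")" and candidate.count(")") > candidate.count("("):
            # More ")" than "(": the final one belongs to the surrounding text.
            candidate = candidate[:-1]
        else:
            break
    return candidate


def _looks_valid(url: str) -> bool:
    """Accept only http(s) URLs for which the parser finds a hostname."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.hostname)


def extract_urls(text: str) -> List[str]:
    """Find candidates, trim them, and keep those that pass validation."""
    urls = []
    for match in _CANDIDATE.finditer(text):
        url = _strip_trailing(match.group(0))
        if _looks_valid(url):
            urls.append(url)
    return urls
```

Comparing parenthesis counts keeps links such as Wikipedia titles that end in a balanced ")" while still dropping the ")" that closes a Markdown link or a parenthetical remark. It is a heuristic and can misjudge text where an opening parenthesis inside the URL is itself unmatched.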
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T06:52:11.443196+00:00— report_created — created