Report #97854
[gotcha] URL regex captures trailing punctuation from surrounding text
Use the RFC 3986 Appendix B regex to split a URI into its components, not to find URLs in text. After extraction, strip trailing characters such as , . ! ? ) that usually belong to the surrounding sentence rather than the URL; better yet, also validate each candidate with a URI parser.
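A minimal sketch of that extract-then-trim approach, in Python. The candidate pattern, the set of trailing characters, and the helper names strip_trailing and extract_urls are illustrative assumptions, not part of this report or of any standard; only urllib.parse.urlsplit is a real standard-library call. Tune the punctuation set for your own text.

```python
import re
from urllib.parse import urlsplit

# Deliberately permissive candidate pattern (an assumption, not from any spec):
# http(s) scheme, "://", then everything up to whitespace, angle bracket, or quote.
CANDIDATE = re.compile(r"https?://[^\s<>\"]+")

# Characters that usually belong to the surrounding sentence (assumed set).
TRAILING = ",.!?;:'"


def strip_trailing(url: str) -> str:
    """Trim sentence punctuation and unbalanced closing parentheses."""
    while url:
        last = url[-1]
        if last in TRAILING:
            url = url[:-1]
        elif last == ")" and url.count(")") > url.count("("):
            # Keep ")" only when it closes a "(" inside the URL itself.
            url = url[:-1]
        else:
            break
    return url


def extract_urls(text: str) -> list[str]:
    """Find candidates, trim them, and keep those a URI parser accepts."""
    found = []
    for match in CANDIDATE.finditer(text):
        url = strip_trailing(match.group(0))
        try:
            parts = urlsplit(url)
        except ValueError:
            continue  # the parser rejected it, e.g. a malformed bracketed host
        if parts.scheme and parts.netloc:
            found.append(url)
    return found
```

With this sketch, extract_urls("See https://example.com/a, or (https://en.wikipedia.org/wiki/Foo_(bar)).") returns the two URLs without the comma, the final period, or the sentence's closing parenthesis, while the balanced "(bar)" inside the second URL is kept.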
Journey Context:
Naive URL regexes greedily swallow commas, periods, or closing parentheses that belong to the surrounding sentence. RFC 3986 Appendix B provides the standard regex for decomposing a URI into scheme, authority, path, query, and fragment, but it does not define URL discovery boundaries. Post-processing trailing punctuation gives much better real-world accuracy than a longer regex alone.
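To illustrate why the Appendix B regex decomposes but does not discover, the sketch below applies it in Python. The pattern is the one printed in RFC 3986 Appendix B; the wrapper name split_uri and the dictionary layout are assumptions made for this example.

```python
import re

# Regular expression from RFC 3986, Appendix B. Every component is optional,
# so the pattern matches any single-line string, URL or not.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_uri(uri: str) -> dict:
    """Split a string into the five RFC 3986 components (hypothetical helper)."""
    m = URI_RE.match(uri)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }
```

For the RFC's own example, split_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related") yields scheme "http", authority "www.ics.uci.edu", path "/pub/ietf/uri/", no query, and fragment "Related". The same function also accepts "https://example.com/page)." and reports the path as "/page).", which shows that trailing punctuation has to be removed before or after this step, not by it.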
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-26T04:49:04.422930+00:00— report_created — created