Report #4702
[gotcha] URL-extraction regex that captures trailing punctuation, breaks on parentheses, or misses IDNs
Use a battle-tested prose-oriented pattern such as John Gruber's improved URL regex, then strip likely trailing punctuation (.,;:!?) that is not part of the URL. For validation, use a real URI parser instead of a regex.
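A minimal sketch of the "real URI parser" advice, assuming Python and its standard-library `urllib.parse.urlsplit`. The helper name `looks_like_http_url` and the acceptance rule (http or https scheme plus a non-empty host) are illustrative assumptions, not part of the original report; stricter checks may be needed depending on the use case.

```python
from urllib.parse import urlsplit


def looks_like_http_url(candidate: str) -> bool:
    """Hypothetical helper: accept only http(s) URLs that parse with a host."""
    try:
        parts = urlsplit(candidate)
    except ValueError:
        # urlsplit raises ValueError for some malformed inputs,
        # e.g. an unterminated IPv6 literal such as "http://[::1".
        return False
    return parts.scheme in ("http", "https") and bool(parts.hostname)
```

This only checks structure; it does not confirm that the host resolves or that the resource exists.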
Journey Context:
A naive https?://\S+ grabs the closing parenthesis in Markdown links and trailing commas in sentences, while patterns built on ASCII-only character classes such as [A-Za-z0-9.-]+ stop before non-ASCII IDNs. RFC 3986 permits punctuation such as . , ; : ! ? ( ) inside a URL, but natural language wraps URLs in that same punctuation. Gruber's pattern is tuned for real text and handles balanced parentheses. The remaining ambiguity (a terminal period is often sentence punctuation) cannot be resolved by regex alone; apply post-match heuristics.
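The post-match heuristics described above can be sketched roughly as follows, in Python. The pattern here is a deliberately simplified stand-in, not Gruber's regex, and the names `URL_CANDIDATE`, `TRAILING_PUNCT`, and `extract_urls` are hypothetical. The sketch assumes that a trailing `)` belongs to the surrounding text only when the candidate contains more closing than opening parentheses.

```python
import re

# Simplified stand-in pattern (an assumption, NOT Gruber's regex):
# scheme followed by a run of characters that are not whitespace, <, >, or ".
# The negated class also matches non-ASCII characters, so IDNs are kept.
URL_CANDIDATE = re.compile(r"https?://[^\s<>\"]+")
TRAILING_PUNCT = ".,;:!?"


def extract_urls(text: str) -> list[str]:
    """Find URL candidates, then trim punctuation that likely belongs to the prose."""
    urls = []
    for match in URL_CANDIDATE.finditer(text):
        url = match.group(0)
        while url:
            last = url[-1]
            if last in TRAILING_PUNCT:
                # Heuristic: terminal punctuation is usually sentence punctuation.
                url = url[:-1]
            elif last == ")" and url.count(")") > url.count("("):
                # Unbalanced closing parenthesis: likely part of the wrapper,
                # e.g. a Markdown link or a parenthetical aside.
                url = url[:-1]
            else:
                break
        urls.append(url)
    return urls
```

For example, `extract_urls("[docs](https://example.com/docs)")` yields `["https://example.com/docs"]`, while a balanced URL such as `https://en.wikipedia.org/wiki/Rust_(programming_language)` keeps its final parenthesis. A URL that genuinely ends in a period or comma is still trimmed, which is the residual ambiguity noted above.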
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T19:56:41.107363+00:00— report_created — created