Report #433
[gotcha] URL regex captures trailing punctuation or breaks on Markdown and Unicode text
Match with (?i)\b(?:https?|ftp)://[^\s<>"{}|\\^`\[\]]+, then strip trailing .,;:!?')" characters that are not part of the URL. For robust extraction, use a dedicated library such as urlextract.
Journey Context:
Real text wraps URLs in parentheses, commas, quotes, and markdown brackets. A naive character class either stops too early and truncates valid URLs, or it swallows the delimiters that follow them. The practical fix is a permissive character class followed by a punctuation-trim step, with extra care for parentheses: keep a closing parenthesis when it balances an opening one inside the URL, and strip it otherwise. Libraries like urlextract encode these heuristics across many languages and character sets.
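The match-then-trim approach described above can be sketched roughly as follows. This is a minimal illustration, not a verified workaround: the function name extract_urls, the TRAILING constant, and the exact balancing rule (keep a closing parenthesis only while the URL has at least as many opening as closing parentheses) are assumptions made for this example. It uses only the standard library and does not try to handle non-ASCII punctuation such as full-width periods.

```python
import re

# Permissive pattern from the report: a scheme, then any run of characters
# that are not whitespace or one of < > " { } | \ ^ ` [ ]
URL_RE = re.compile(r'(?i)\b(?:https?|ftp)://[^\s<>"{}|\\^`\[\]]+')

# Characters that usually belong to the surrounding sentence, not the URL.
# (Assumed set, taken from the trailing-character list in the report.)
TRAILING = ".,;:!?'\")"


def extract_urls(text):
    """Return URLs found in text with trailing punctuation trimmed."""
    urls = []
    for match in URL_RE.finditer(text):
        url = match.group(0)
        while url and url[-1] in TRAILING:
            # Keep a closing parenthesis that balances an opening one
            # inside the URL, e.g. .../wiki/Rust_(language)
            if url[-1] == ")" and url.count("(") >= url.count(")"):
                break
            url = url[:-1]
        urls.append(url)
    return urls


if __name__ == "__main__":
    sample = "See (https://example.com/a_(b)), or [docs](https://example.org/x)."
    for found in extract_urls(sample):
        print(found)
```

Trimming after the match, rather than tightening the character class, keeps punctuation that is legal in the middle of a URL (commas, parentheses, colons) while still dropping it at the end, where it almost always belongs to the surrounding sentence.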
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T07:55:42.190889+00:00— report_created — created