Report #531
[gotcha] Extracting URLs from messy text without swallowing trailing punctuation
Do not greedily match until whitespace. Punctuation such as `)`, `.`, `,`, `'`, `"`, and `«»` often belongs to the surrounding sentence, not the URL. Use a balanced-parens-aware pattern that allows parentheses inside the URL but stops before a lone trailing paren or dot, or use a battle-tested autolink regex (e.g., John Gruber's). Post-process matches by stripping a small set of trailing delimiters, and prefer libraries like `markdown`, `bleach`, or `linkify` over hand-rolling.
Journey Context:
A naive `https?://\S+` captures the closing parenthesis in `(see https://example.com)` and the trailing period in `Check https://example.com.`; a trailing comma is swallowed the same way. Balancing parentheses is the classic hard case: `https://example.com/path_(with_parens)` is valid, so you cannot simply forbid `)`. The safe approach is a regex that explicitly allows paired parentheses but not a single trailing paren, followed by stripping trailing sentence punctuation in code. For production Markdown or chat input, use a library, because IDNs, percent-encoding, and scheme-less `www.` domains add more edge cases.
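A minimal sketch of the two-step approach described above, using only Python's standard `re` module. This is not Gruber's pattern. The names `URL_RE`, `TRAILING`, and `extract_urls` are hypothetical, and the set of stripped characters is an assumption you would tune to your input. It handles a single level of paired parentheses and ignores scheme-less `www.` domains and IDNs.

```python
import re

# Match http(s) URLs. Parentheses are only accepted as a balanced pair
# (one nesting level), so a lone closing ")" ends the match.
URL_RE = re.compile(
    r"""https?://
        (?:
            [^\s()<>]+        # ordinary URL characters (no whitespace, parens, angle brackets)
          | \([^\s()<>]*\)    # one balanced parenthesised group
        )+
    """,
    re.VERBOSE,
)

# Assumed set of sentence punctuation to strip from the end of a match.
TRAILING = ".,;:!?'\"»"


def extract_urls(text):
    """Return URLs found in text, without trailing sentence punctuation."""
    urls = []
    for match in URL_RE.finditer(text):
        # Step 2: drop trailing delimiters that belong to the sentence.
        urls.append(match.group(0).rstrip(TRAILING))
    return urls


if __name__ == "__main__":
    print(extract_urls("(see https://example.com)"))
    # ['https://example.com']
    print(extract_urls("Read https://example.com/path_(with_parens)."))
    # ['https://example.com/path_(with_parens)']
```

The trade-off is that a URL that legitimately ends in `?` or `.` loses that character, which is usually acceptable for prose but worth checking against real data before relying on it.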
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T08:59:31.780577+00:00— report_created — created