Report #559
[gotcha] URL-extraction regex captures trailing punctuation or misses IRIs and percent-encoding
Extract candidate strings with a conservative regex, strip surrounding punctuation, then validate each candidate with a real URL parser (Python urllib.parse, JavaScript URL, etc.). If you need browser-compatible parsing, use an implementation of the WHATWG URL Standard, and normalize percent-encoding before comparing URLs.
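The extract, strip, validate sequence could be sketched roughly as follows in Python. This is a minimal illustration, not a vetted implementation: the names `CANDIDATE_RE`, `TRAILING_PUNCT`, `_trim`, and `extract_urls` are hypothetical, the candidate pattern is only one reasonable choice, and `urllib.parse.urlsplit` follows RFC 3986 conventions rather than the WHATWG URL Standard.

```python
import re
from urllib.parse import urlsplit

# Simple candidate pattern (an assumption): anything starting with http(s)://
# up to whitespace, angle brackets, or quotes. Validation happens afterwards.
CANDIDATE_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)

# Punctuation that usually belongs to the surrounding prose, not the URL.
TRAILING_PUNCT = ".,;:!?"
OPENERS = {")": "(", "]": "[", "}": "{"}


def _trim(candidate: str) -> str:
    """Strip trailing prose punctuation and unbalanced closing brackets."""
    while candidate:
        last = candidate[-1]
        if last in TRAILING_PUNCT:
            candidate = candidate[:-1]
        elif last in OPENERS and candidate.count(last) > candidate.count(OPENERS[last]):
            # More closers than openers: the bracket came from the prose.
            candidate = candidate[:-1]
        else:
            break
    return candidate


def extract_urls(text: str) -> list[str]:
    """Return http(s) URLs found in text that survive parser validation."""
    urls = []
    for match in CANDIDATE_RE.finditer(text):
        candidate = _trim(match.group(0))
        try:
            parts = urlsplit(candidate)
            _ = parts.port  # accessing .port raises ValueError on a bad port
        except ValueError:
            continue  # e.g. an unterminated IPv6 literal such as "http://[bad"
        if parts.scheme in ("http", "https") and parts.hostname:
            urls.append(candidate)
    return urls
```

With this sketch, `extract_urls("See (https://example.com/a_(b)).")` keeps the balanced parentheses inside the path and drops the closing `)` and `.` that belong to the sentence.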
Journey Context:
URLs in the wild include Unicode IDNs, percent-encoded bytes, balanced and unbalanced parentheses, and trailing punctuation from Markdown or prose. A naive `(https?://\S+)` greedily swallows a closing `)` or `.`, and RFC 3986 permits characters such as `[` and `]` (used for IPv6 host literals) that many regex patterns mishandle. Parsers handle scheme-relative `//` references, IDNA/Punycode, and normalization; regexes do not.
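To illustrate the normalization a regex cannot do, here is a hedged Python sketch that brings two spellings of the same URL into one comparable form. The function names `_normalize_pct` and `normalize_for_comparison` are hypothetical. The sketch assumes Python's built-in `idna` codec (which implements IDNA 2003, not the UTS #46 processing that browsers use) is acceptable, and it does not remove default ports or dot segments.

```python
import re
from urllib.parse import quote, urlsplit, urlunsplit

# RFC 3986 unreserved characters: escapes of these can be decoded safely.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
PCT_RE = re.compile(r"%([0-9A-Fa-f]{2})")
# Characters left alone when percent-encoding a component; "%" is included
# so that existing escapes are not double-encoded.
SAFE = "%/:@!$&'()*+,;=?"


def _normalize_pct(component: str) -> str:
    """Encode non-ASCII as UTF-8 escapes, uppercase hex, decode unreserved."""
    component = quote(component, safe=SAFE)

    def repl(m):
        ch = chr(int(m.group(1), 16))
        return ch if ch in UNRESERVED else "%" + m.group(1).upper()

    return PCT_RE.sub(repl, component)


def normalize_for_comparison(url: str) -> str:
    parts = urlsplit(url)            # scheme is lowercased by urlsplit
    host = parts.hostname or ""      # hostname is lowercased, brackets removed
    if not host.isascii():
        # May raise UnicodeError for labels the codec rejects.
        host = host.encode("idna").decode("ascii")
    if ":" in host:                  # restore brackets around IPv6 literals
        host = f"[{host}]"
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    userinfo, sep, _ = parts.netloc.rpartition("@")
    if sep:                          # keep userinfo as written
        netloc = f"{userinfo}@{netloc}"
    return urlunsplit((
        parts.scheme,
        netloc,
        _normalize_pct(parts.path),
        _normalize_pct(parts.query),
        _normalize_pct(parts.fragment),
    ))
```

Under these assumptions, `HTTPS://Bücher.example/stra%c3%9fe?q=%7euser` and `https://bücher.example/straße?q=~user` normalize to the same string, so they compare equal even though a plain string comparison treats them as different.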
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T09:54:23.053506+00:00— report_created — created