Report #186
[gotcha] URL regex misses URLs with parentheses, unicode domains, or no scheme
Extract candidates with a permissive URI pattern based on RFC 3986, then validate each one with a URL parser or constructor. When scraping plain text, strip trailing punctuation (, . ! ) only after checking whether it belongs to the URI or to the surrounding sentence.
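The extract, trim, and validate steps can be sketched in Python roughly as follows. This is a minimal illustration, not a vetted implementation: the names CANDIDATE, TRAILING, trim_trailing, and extract_urls are hypothetical, the pattern is deliberately looser than the RFC 3986 grammar, and treating schemeless candidates as https is an assumption.

```python
import re
from urllib.parse import urlsplit

# Permissive first pass (hypothetical pattern): an http(s) scheme or a
# leading "www.", then any run of characters that are not whitespace,
# angle brackets, or double quotes.
CANDIDATE = re.compile(r"(?:https?://|www\.)[^\s<>\"]+", re.IGNORECASE)

# Punctuation that usually belongs to the sentence, not the URI.
TRAILING = ".,!?;:'"


def trim_trailing(candidate: str) -> str:
    """Drop sentence punctuation; drop ')' only when it is unbalanced."""
    while candidate:
        last = candidate[-1]
        if last in TRAILING:
            candidate = candidate[:-1]
        elif last == ")" and candidate.count(")") > candidate.count("("):
            # More closers than openers: the ')' closes a parenthetical
            # in the surrounding text rather than one inside the URI.
            candidate = candidate[:-1]
        else:
            break
    return candidate


def extract_urls(text: str) -> list[str]:
    urls = []
    for match in CANDIDATE.finditer(text):
        candidate = trim_trailing(match.group(0))
        # Assumption: schemeless candidates are treated as https.
        normalized = candidate if "://" in candidate else "https://" + candidate
        try:
            parts = urlsplit(normalized)  # second pass: a real parser
        except ValueError:
            continue
        if parts.scheme in ("http", "https") and parts.hostname:
            urls.append(normalized)
    return urls


sample = (
    "See https://en.wikipedia.org/wiki/Rust_(programming_language). "
    "Also (www.example.com/a,b), done."
)
print(extract_urls(sample))
```

The balanced-parenthesis check is a heuristic: it keeps the closing parenthesis of a Wikipedia-style path while dropping one that closes a parenthetical in the sentence. It can still guess wrong on unusual input.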
Journey Context:
RFC 3986 permits reserved characters such as parentheses, commas, and plus signs in various URI components, and modern URLs may contain Unicode (usually carried as IDN Punycode in the host or as percent-encoded UTF-8 in the path and query). A naïve https?://\S+ pattern captures trailing punctuation such as periods or closing parentheses, producing invalid links, and misses schemeless references entirely. Parsing libraries (urllib, new URL(), uri-js) understand component boundaries; regex should only be the first pass.
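A short Python demonstration of these failure modes, using only the standard library. The sample text and URLs are made up for illustration.

```python
import re
from urllib.parse import quote, urlsplit

# The naive pattern from the description above.
NAIVE = re.compile(r"https?://\S+")

text = "Docs (see https://example.com/guide). Mirror: www.example.org/guide"

# The sentence's ")." is captured as part of the link, and the
# schemeless mirror is not matched at all.
naive_hits = NAIVE.findall(text)
print(naive_hits)

# A parser separates components, so reserved characters such as
# parentheses, commas, and plus signs stay where they belong.
parts = urlsplit("https://example.com/wiki/Rust_(language)?tags=a,b+c#top")
print(parts.path, parts.query, parts.fragment)

# Non-ASCII path characters are percent-encoded as UTF-8 bytes.
encoded_path = quote("/straße")
print(encoded_path)
```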
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-12T21:40:40.249891+00:00— report_created — created