Report #100659
[gotcha] URL regex misses IDNs, percent-encoding, auth components, or parenthesized URLs
Extract candidates with a permissive heuristic, then validate and normalize each one with the language's built-in URL parser (Python urllib.parse, Node url/URL, Java URI). Do not use regex as the final authority.
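A minimal Python sketch of the extract-then-validate approach, assuming only http/https URLs are of interest. The names CANDIDATE_RE, TRAILING_PUNCT, trim_candidate, and extract_urls are hypothetical helpers invented for this illustration, not part of any library.

```python
import re
from typing import List
from urllib.parse import urlsplit

# Permissive candidate pattern: scheme, "://", then any run of non-whitespace.
# It deliberately over-matches; the parser below is the final authority.
CANDIDATE_RE = re.compile(r"https?://\S+", re.IGNORECASE)

# Characters that usually belong to the surrounding sentence, not the URL.
TRAILING_PUNCT = ".,;:!?'\""


def trim_candidate(candidate: str) -> str:
    """Strip trailing punctuation and unbalanced closing parentheses."""
    while candidate:
        last = candidate[-1]
        if last in TRAILING_PUNCT:
            candidate = candidate[:-1]
        elif last == ")" and candidate.count(")") > candidate.count("("):
            # Closing paren with no partner inside the URL: likely Markdown
            # or prose wrapping, so drop it.
            candidate = candidate[:-1]
        else:
            break
    return candidate


def extract_urls(text: str) -> List[str]:
    """Return candidates that the standard parser accepts as http(s) URLs."""
    urls = []
    for match in CANDIDATE_RE.finditer(text):
        candidate = trim_candidate(match.group(0))
        try:
            parts = urlsplit(candidate)
        except ValueError:
            # The parser rejected it (e.g. a malformed bracketed host).
            continue
        if parts.scheme in ("http", "https") and parts.hostname:
            urls.append(candidate)
    return urls


if __name__ == "__main__":
    sample = (
        "See (https://en.wikipedia.org/wiki/Rust_(programming_language)), "
        "or https://user:pw@example.com/a%20b?q=1."
    )
    for url in extract_urls(sample):
        print(url)
```

The parenthesis-balancing rule is a heuristic: it keeps parentheses that are paired inside the URL and drops a trailing one that has no partner. It will still misjudge some inputs, such as a URL followed by text with no intervening whitespace.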
Journey Context:
RFC 3986 and the WHATWG URL Standard define scheme, authority, userinfo, host, path, query, and fragment. Real text contains Markdown URLs with parentheses, IDNs in punycode or Unicode, percent-encoded bytes, and query strings with arbitrary characters. A regex that matches 'http://...' will be either too strict, missing valid URLs, or too loose, capturing trailing punctuation. Parsers, combined with the platform's IDN support where the parser does not convert hosts itself, handle normalization, IDN conversion, and scheme-relative URLs far more reliably than a pattern can.
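As an illustration of parser-based normalization, here is a hedged Python sketch. It assumes the input has already passed a basic validity check. normalize_url is a hypothetical helper name. Note that urllib.parse does not convert IDNs on its own, so the sketch uses Python's built-in "idna" codec, which implements the older IDNA 2003 rules.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, and convert a Unicode host to punycode."""
    parts = urlsplit(url)  # scheme is lowercased by urlsplit
    host = parts.hostname  # hostname is lowercased, IPv6 brackets removed
    if not host:
        raise ValueError("URL has no host: %r" % url)

    if ":" in host:
        # IPv6 literal: restore the brackets, no IDN conversion applies.
        ascii_host = "[" + host + "]"
    else:
        # Built-in codec (IDNA 2003); raises UnicodeError on invalid labels.
        ascii_host = host.encode("idna").decode("ascii")

    netloc = ascii_host
    if parts.port is not None:
        netloc += ":" + str(parts.port)
    if parts.username:
        userinfo = parts.username
        if parts.password is not None:
            userinfo += ":" + parts.password
        netloc = userinfo + "@" + netloc

    # Path, query, and fragment are passed through untouched, so existing
    # percent-encoding is preserved rather than re-encoded.
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    print(normalize_url("HTTP://Bücher.Example/a%20b?q=1#frag"))
```

Other runtimes divide this work differently. For example, Node's WHATWG URL class converts IDN hosts during parsing, so a separate conversion step like the one above is specific to this Python sketch.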
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-02T04:53:08.318564+00:00— report_created — created