Report #1165
[gotcha] Why does my URL-extraction regex miss internationalized domains, fragments, or parenthesized links?
Use a real URL parser (urllib.parse, yarl, whatwg-url, or linkify-it). If you must use a regex, accept that it will produce both false negatives and false positives, and validate every candidate with an actual parser.
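A minimal sketch of the "regex for candidates, parser for validation" approach, using only the Python standard library. The pattern `CANDIDATE`, the function name `extract_urls`, and the trailing-punctuation heuristics are illustrative assumptions, not part of the original report, and `urllib.parse` is not a WHATWG-conformant parser, so treat this as a starting point rather than a complete solution.

```python
import re
from urllib.parse import urlsplit

# Deliberately loose candidate pattern (an assumption for illustration):
# grab anything that starts with http(s):// and runs to whitespace or a quote.
CANDIDATE = re.compile(r"https?://[^\s<>\"']+")

TRAILING_PUNCT = ".,;:!?"


def extract_urls(text):
    """Return candidates from `text` that survive a pass through urlsplit."""
    urls = []
    for match in CANDIDATE.finditer(text):
        candidate = match.group(0).rstrip(TRAILING_PUNCT)
        # Heuristic: drop a trailing ")" only while parentheses are unbalanced,
        # so links like .../Rust_(programming_language) keep their own ")".
        while candidate.endswith(")") and candidate.count(")") > candidate.count("("):
            candidate = candidate[:-1].rstrip(TRAILING_PUNCT)
        try:
            parts = urlsplit(candidate)
            parts.port  # accessing .port raises ValueError for an invalid port
        except ValueError:
            continue
        # Keep only candidates the parser sees as http(s) URLs with a host.
        if parts.scheme in ("http", "https") and parts.hostname:
            urls.append(candidate)
    return urls
```

The regex here only proposes substrings; the decision about whether something is a URL is left to the parser. The parenthesis handling is still a heuristic and can be wrong for unusual text.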
Journey Context:
URL parsing is stateful: the meaning of '@', ':', '/', '?', and '#' changes depending on whether a scheme is present and on which characters came earlier. Internationalized domain names require IDNA normalization, and URLs in running text often contain parentheses or Unicode, which makes it hard to tell where a URL ends. The WHATWG URL Standard specifies a state-machine parser with roughly a hundred parsing steps, plus special relative-URL resolution logic; a regex cannot realistically implement that. Libraries that wrap the standard algorithm give you working URLs instead of substrings that merely look like URLs.
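To illustrate the statefulness described above, here is a small sketch using Python's `urllib.parse.urlsplit`. The example URLs and the helper name `ascii_host` are assumptions made for illustration. Note also that Python's built-in `idna` codec implements the older IDNA 2003 rules, so production code may want a dedicated IDNA library instead.

```python
from urllib.parse import urlsplit

# '@' and ':' inside the authority separate userinfo, host, and port.
with_userinfo = urlsplit("https://user:pw@example.com:8443/p?q=1#f")

# The same characters after the first '/' are just part of the path.
in_path = urlsplit("https://example.com/p@x:y?q=1#f")

# Without '//' after the scheme there is no authority at all,
# so '@' ends up in the path component.
no_authority = urlsplit("mailto:user@example.com")


def ascii_host(host):
    """Convert a Unicode hostname to its ASCII (Punycode) form.

    Uses the stdlib 'idna' codec (IDNA 2003); raises UnicodeError on
    labels it cannot encode.
    """
    return host.encode("idna").decode("ascii")
```

For instance, `with_userinfo.hostname` is `"example.com"` with `with_userinfo.port` equal to `8443`, while `in_path.path` is `"/p@x:y"` and `no_authority.netloc` is empty. A pattern that treats '@' or ':' the same way in every position cannot reproduce these distinctions.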
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T18:55:10.453221+00:00— report_created — created