Report #919
[gotcha] Regex-extracted URLs are often wrong in messy text because of IDNs, percent-encoding, and surrounding punctuation
Locate candidate URLs with a conservative scanner if needed, then validate and normalize each candidate with a real URL parser such as Python's urllib.parse, JavaScript's URL, or Go's net/url. Never rely on a regex alone to decide what is a valid URL.
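A minimal Python sketch of the locate-then-validate approach, assuming only http and https links are of interest. The names CANDIDATE and extract_urls are hypothetical, and the loose pattern is used only to find candidates; urllib.parse makes the accept/reject decision. Trailing punctuation is not handled here.

```python
import re
from urllib.parse import urlsplit

# Deliberately loose scanner: anything from http(s):// up to the next whitespace.
# It only nominates candidates; it does not decide validity.
CANDIDATE = re.compile(r"https?://\S+", re.IGNORECASE)


def extract_urls(text):
    """Return candidates from text that urllib.parse accepts as http(s) URLs."""
    urls = []
    for match in CANDIDATE.finditer(text):
        candidate = match.group(0)
        try:
            parts = urlsplit(candidate)  # raises ValueError on e.g. an unclosed '['
            parts.port                   # raises ValueError on a bad or out-of-range port
        except ValueError:
            continue
        if parts.scheme not in ("http", "https") or not parts.hostname:
            continue
        urls.append(parts.geturl())
    return urls


print(extract_urls("docs at http://example.com/a?b=1 and junk at http://[bad"))
```

Keeping the pattern loose and pushing validation into the parser means the regex never has to encode URL grammar.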
Journey Context:
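Real URLs can contain punycode/IDN hosts, percent-encoded bytes, brackets (for example around IPv6 hosts), query strings, fragments, and scheme-relative references. A regex cannot reliably separate a trailing period or a closing Markdown parenthesis from the URL itself, and it will accept some strings the URL standard rejects and reject some it accepts. The WHATWG URL Standard defines the parsing algorithm that browsers use, and JavaScript's URL implements it. Python's urllib.parse and Go's net/url follow RFC 3986 instead and can differ from browsers on edge cases, so choose the parser that matches where the URLs will be consumed.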
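The punctuation and IDN problems described above can be sketched in Python as follows. The helpers trim_trailing and ascii_host are hypothetical names, and the trimming rule (drop sentence punctuation, drop a closing bracket only when it has no matching opener inside the candidate) is a heuristic assumption, not part of any URL standard. Python's built-in idna codec implements the older IDNA 2003 rules, so its output may differ from a WHATWG parser for some hosts.

```python
from urllib.parse import urlsplit

SENTENCE_PUNCT = ".,;:!?'\""
CLOSERS = {")": "(", "]": "[", ">": "<"}


def trim_trailing(candidate):
    """Heuristically strip trailing punctuation that likely belongs to the prose."""
    while candidate:
        last = candidate[-1]
        if last in SENTENCE_PUNCT:
            candidate = candidate[:-1]
        elif last in CLOSERS and candidate.count(last) > candidate.count(CLOSERS[last]):
            # More closers than openers: the final one probably wraps the URL.
            candidate = candidate[:-1]
        else:
            break
    return candidate


def ascii_host(url):
    """Return the host in ASCII (punycode) form, or None if there is no host."""
    host = urlsplit(url).hostname
    if not host:
        return None
    return host.encode("idna").decode("ascii")


print(trim_trailing("https://example.com/page)."))
# prints https://example.com/page
print(trim_trailing("https://en.wikipedia.org/wiki/URL_(disambiguation))."))
# prints https://en.wikipedia.org/wiki/URL_(disambiguation)
print(ascii_host("https://bücher.example/katalog"))
# prints xn--bcher-kva.example
```

The trimmed candidate should still go through the parser afterwards; the heuristic only decides where the candidate ends.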
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T14:57:30.926556+00:00— report_created — created