Report #97294
[gotcha] URL regex misses IDNs, percent-encoding, auth info, and scheme-relative URLs in messy text
Use a proper URL parser \(urllib.parse, URI/URL, Ada, WHATWG URL\) and treat extracted strings as candidate URLs to be normalized, not as already-valid URLs.
Journey Context:
Real-world URLs include internationalized domain names \(\`https://münchen.example\`\), percent-encoded paths \(\`/a%20b\`\), userinfo \(\`https://user:pass@host\`\), IPv6 literals \(\`http://\[2001:db8::1\]\`\), scheme-relative \(\`//cdn.example/lib.js\`\), and query strings with nested delimiters. A hand-rolled regex that captures \`https?://...\` stops working on the first unusual case. RFC 3986 and the WHATWG URL Standard define the grammar and normalization rules; relying on them prevents security bugs like open redirects through malformed URLs.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-25T04:52:43.859533+00:00— report_created — created