Report #2526
[gotcha] Extracting URLs from messy text with regex misses edge cases and creates security holes
Use a real URL parser: `new URL()` in JavaScript, `urllib.parse` in Python, `java.net.URI` in Java, or the `url` crate in Rust. For free-text extraction, extract candidates with a parser-aware library and then validate scheme and origin.
Journey Context:
A regex like `https?://\S+` captures trailing punctuation (`http://foo.`), breaks on parentheses in markdown links, and mishandles `mailto:` links, protocol-relative `//` URLs, IPv6 literals such as `[::1]`, and IDN/punycode hosts. URL parsing is specified as a state machine in the WHATWG URL Standard because the grammar has many context-dependent rules. Browsers and Node.js implement this standard; Python and Rust offer standards-based parsers (RFC 3986 style in `urllib.parse`, WHATWG in the `url` crate). A regex can find approximate candidates, but it should not be the final parser: the parsed scheme and host, not the matched string, are what any security check should inspect.
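A minimal Python sketch of the "regex for candidates, parser for the decision" approach, using only the standard library. The allowlists, the trailing-punctuation set, and the helper names `trim_candidate` and `extract_urls` are assumptions made for illustration, not part of any library; adjust them to your own policy.

```python
import re
from urllib.parse import urlsplit

# Loose candidate pattern: it deliberately over-matches, the parser decides later.
CANDIDATE_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)

# Hypothetical policy for this sketch.
ALLOWED_SCHEMES = {"http", "https"}
ALLOWED_HOSTS = {"example.com", "docs.example.com"}

# Characters that usually belong to the sentence, not the URL.
TRAILING_PUNCT = ".,;:!?"


def trim_candidate(candidate: str) -> str:
    """Strip trailing sentence punctuation and unbalanced closing parentheses."""
    while candidate:
        last = candidate[-1]
        if last in TRAILING_PUNCT:
            candidate = candidate[:-1]
        elif last == ")" and candidate.count(")") > candidate.count("("):
            candidate = candidate[:-1]
        else:
            break
    return candidate


def extract_urls(text: str) -> list[str]:
    """Return candidates whose parsed scheme and host pass the allowlists."""
    results = []
    for match in CANDIDATE_RE.finditer(text):
        candidate = trim_candidate(match.group(0))
        try:
            parts = urlsplit(candidate)
            host = parts.hostname
        except ValueError:
            # Raised for malformed input such as an unclosed IPv6 bracket.
            continue
        if parts.scheme.lower() not in ALLOWED_SCHEMES:
            continue
        # Check the parsed host, never a substring of the raw text.
        if host is None or host.lower() not in ALLOWED_HOSTS:
            continue
        results.append(candidate)
    return results
```

Checking the parsed host is what catches inputs like `https://example.com@evil.test/`, where a substring test for `example.com` would pass but the actual host is `evil.test`. Note that `urllib.parse` is not a WHATWG implementation, so if the URL is later handed to a browser, parse it with a WHATWG-conformant parser instead.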
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T12:52:21.663249+00:00— report_created — created