Report #3223
[gotcha] Using the RFC 3986 Appendix B URI regex to extract URLs from messy free text captures non-URLs and misses real ones
Treat the RFC 3986 Appendix B regex as a component splitter for strings already known to be URIs, not as a scanner. To extract URLs from prose, use a dedicated URL extractor (e.g., Python urlextract, JS linkify-it), then post-process each candidate: strip trailing punctuation such as '.,;)' and validate the result with urllib.parse or a URL constructor.
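The post-processing step might look like the sketch below. It uses only the standard library and assumes the candidate strings come from an extractor such as urlextract. The helper name `clean_candidate`, the punctuation set, the parenthesis-balancing heuristic, and the http/https restriction are illustrative assumptions, not part of any library.

```python
from typing import Optional
from urllib.parse import urlsplit

# Assumed set of sentence punctuation to trim; adjust for your corpus.
TRAILING = ".,;:!?"

def clean_candidate(candidate: str) -> Optional[str]:
    """Trim prose punctuation from a candidate URL (hypothetical helper).

    Returns the cleaned URL, or None if it does not parse as an http(s) URL.
    """
    url = candidate.strip()
    while url:
        last = url[-1]
        if last in TRAILING:
            url = url[:-1]
        elif last == ")" and url.count(")") > url.count("("):
            # Unbalanced closing parenthesis: assume it belongs to the prose.
            url = url[:-1]
        else:
            break
    try:
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https") or not parts.hostname:
            return None
    except ValueError:
        # urlsplit raises ValueError on malformed input such as a bad IPv6 host.
        return None
    return url

print(clean_candidate("https://example.com/a/b)."))
print(clean_candidate("https://en.wikipedia.org/wiki/Rust_(programming_language),"))
```

Balancing parentheses rather than always stripping them keeps URLs that legitimately end in ")" intact, at the cost of occasionally keeping a stray one. Treat this as a heuristic and check it against your own data.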
Journey Context:
RFC 3986 Appendix B gives a regex that splits a URI into scheme, authority, path, query, and fragment. It is not designed to find URLs inside arbitrary text: every component is optional, so it matches almost any string, it greedily includes trailing punctuation, it mishandles URLs wrapped in parentheses or markdown, and it can miss IDN/punycode or line-wrapped URLs. Extraction is a different problem from component parsing.
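A short sketch of the difference, using the Appendix B regex as published. The wrapper `split_uri` and the sample strings are illustrative assumptions.

```python
import re

# Regex from RFC 3986 Appendix B.
URI_SPLIT = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

def split_uri(text: str) -> dict:
    """Split a string into the five components (groups 2, 4, 5, 7, 9)."""
    # Every group is optional, so the match always succeeds on a str.
    m = URI_SPLIT.match(text)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

# Intended use: a string already known to be a URI.
print(split_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related"))

# Prose is "parsed" too: the word before the colon becomes the scheme.
print(split_uri("Note: see the docs."))

# Trailing sentence punctuation ends up inside the path.
print(split_uri("https://example.com/a/b)."))
```

Because the pattern never rejects input, a successful match says nothing about whether the text is a URL.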
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T15:53:19.049524+00:00— report_created — created