Report #2746
[gotcha] Extract all URLs from messy natural-language text
Use John Gruber's improved liberal URL regex for plain text, but accept that it is heuristic and will miss or over-capture some URLs. When accuracy matters, extract from the document structure wherever one exists: `href` attributes in HTML, link annotations in PDFs, or dedicated URL fields. Post-process regex matches to strip trailing punctuation such as '.', ',', an unmatched ')', and quotes.
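For the HTML case, a minimal sketch of structure-based extraction using only Python's standard-library `html.parser`. The names `HrefCollector` and `extract_hrefs` are hypothetical, chosen for illustration, and the sketch only looks at `<a>` tags; PDFs and other formats would need their own parsers.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects href attribute values from <a> tags (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def extract_hrefs(html: str) -> list[str]:
    """Return href values in document order, exactly as written in the markup."""
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs
```

Because the URL boundaries come from the markup, no punctuation trimming is needed here: a parenthesis or period inside the attribute value is kept as part of the URL.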
Journey Context:
URLs embedded in prose are ill-defined: trailing punctuation, balanced parentheses, IDNs, and markdown link syntax all make the boundaries ambiguous. A naive `https?://\S+` will swallow a closing parenthesis or period that belongs to the surrounding sentence. Gruber's regex is the best-known practical pattern because it handles balanced parentheses and trailing punctuation heuristically, yet it still fails on pathological cases. The hard truth is that 'find URLs in arbitrary text' is a best-effort problem; if you need precision, parse the structured source rather than the raw string.
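The over-capture described above, and one way to trim it, can be sketched as follows. This is not Gruber's pattern: it deliberately uses the naive regex and then applies a hypothetical `trim_url` helper that drops trailing punctuation and any closing parenthesis without a matching opener. The character set in `TRAILING` is an assumption and should be tuned to the text at hand.

```python
import re

# The naive pattern from the discussion above: it over-captures by design.
NAIVE_URL = re.compile(r"https?://\S+")

# Characters assumed to be sentence punctuation when they end a match.
TRAILING = ".,;:!?'\""


def trim_url(candidate: str) -> str:
    """Strip trailing punctuation and unmatched ')' from a raw match."""
    url = candidate
    while url:
        last = url[-1]
        if last in TRAILING:
            url = url[:-1]
        elif last == ")" and url.count(")") > url.count("("):
            # More closers than openers: treat the final ')' as prose.
            url = url[:-1]
        else:
            break
    return url


def extract_urls(text: str) -> list[str]:
    """Best-effort extraction: naive match, then heuristic trimming."""
    return [trim_url(match) for match in NAIVE_URL.findall(text)]
```

The parenthesis check keeps a URL such as `https://example.com/a_(b)` intact while removing the closer in `(https://example.org/x)`. It remains a heuristic: a URL that genuinely ends in a period or quote will be truncated.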
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T13:52:05.992130+00:00— report_created — created