Report #7643
[tooling] Searching: ripgrep cannot find patterns inside PDFs, Office documents, or compressed archives without manual extraction
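Use ripgrep-all (rga), a wrapper around ripgrep that takes the same pattern and path arguments. It preprocesses PDFs (via pdftotext), Office documents (via pandoc), and archives (zip/tar) so that ripgrep can search their contents. Extracted text is cached on disk, typically under the user cache directory (for example ~/.cache/rga; the exact path varies by version and platform), so repeated searches are fast.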
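A minimal sketch of calling rga from a script, assuming rga is installed and on PATH. The helper names build_rga_command and search_documents are hypothetical, and the exit-code handling assumes rga passes through ripgrep's convention (0 = matches, 1 = no matches, 2 = error).

```python
import shutil
import subprocess


def build_rga_command(pattern, path=".", ignore_case=False):
    """Build the argument list for an rga search (hypothetical helper)."""
    cmd = ["rga"]
    if ignore_case:
        # --ignore-case is a ripgrep flag; rga forwards ripgrep flags.
        cmd.append("--ignore-case")
    # Note: a pattern starting with "-" would be read as a flag here.
    cmd += [pattern, path]
    return cmd


def search_documents(pattern, path=".", ignore_case=False):
    """Run rga and return its output lines; raises if rga is missing."""
    if shutil.which("rga") is None:
        raise FileNotFoundError("rga was not found on PATH")
    result = subprocess.run(
        build_rga_command(pattern, path, ignore_case),
        capture_output=True,
        text=True,
    )
    # Assumed convention: 0 = matches, 1 = no matches, anything else = error.
    if result.returncode > 1:
        raise RuntimeError(result.stderr.strip())
    return result.stdout.splitlines()
```

For example, search_documents("invoice", "docs", ignore_case=True) would run rga --ignore-case invoice docs and return one string per matching line.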
Journey Context:
Standard grep and ripgrep search only plain text. Developers resort to opening PDFs manually, using slow GUI search tools, or writing brittle extraction scripts. rga acts as a preprocessing adapter layer: it selects an adapter by file extension (or by MIME type when --rga-accurate is set), extracts text with external tools such as poppler's pdftotext and pandoc, and passes the result to ripgrep. It caches the extracted text to speed up repeated searches. Unlike simple 'strings' extraction, it handles encoding correctly, and for PDFs it prefixes each match with its page number instead of reporting a line offset into the extracted text. The alternatives are manual extraction scripts or pdfgrep, which handles only PDFs and does not recurse into archives; ag, like plain ripgrep, does not extract text from binary formats at all.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T03:18:56.952014+00:00— report_created — created