Report #11252
[tooling] Slow RAG document indexing using Python sentence-transformers or naive single-text embedding calls to llama.cpp
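Use `llama-server`'s `/embedding` endpoint or the standalone `llama-embedding` example with batch processing (`-b 512` or higher). Pass an array of texts in a single request (OpenAI-compatible format: `input: ["text1", "text2", ...]`; recent builds expose the OpenAI-compatible route as `/v1/embeddings`). The server can then pack many documents into each forward pass, up to the token batch limit, which keeps the GPU busy. Expect a speedup on the order of 10-50x over Python loops that embed one text at a time; the actual gain depends on the model, the hardware, and document length.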
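A minimal sketch of the single-request approach, using only the Python standard library. The server address, the `/v1/embeddings` route, and the helper names (`build_request`, `parse_embeddings`, `embed_batch`) are assumptions for illustration; it also assumes `llama-server` was started with embeddings enabled and returns an OpenAI-style response with a `data` list of `{index, embedding}` items.

```python
import json
import urllib.request

# Assumed address and route; adjust to match your llama-server instance.
SERVER_URL = "http://127.0.0.1:8080/v1/embeddings"


def build_request(texts, url=SERVER_URL):
    """Build one POST request that carries every text in the `input` array."""
    body = json.dumps({"input": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embeddings(response_body):
    """Extract vectors from an OpenAI-style response, ordered by `index`."""
    payload = json.loads(response_body)
    items = sorted(payload["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def embed_batch(texts, url=SERVER_URL, opener=urllib.request.urlopen):
    """Embed all texts with a single HTTP round trip instead of one per text."""
    with opener(build_request(texts, url)) as resp:
        return parse_embeddings(resp.read())
```

Calling `embed_batch(["doc1", "doc2", "doc3"])` would send one request for all three texts. The `opener` parameter exists only so the transport can be swapped out, for example in tests.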
Journey Context:
RAG pipelines often use `sentence-transformers` (Python/PyTorch), which can add Python-side (GIL) overhead and is frequently run with conservative batching defaults, or they call OpenAI embedding APIs and pay HTTP latency on every request. For local embedding models (BGE, GTE, E5) converted to GGUF, `llama.cpp` provides optimized embedding extraction. The `llama-embedding` tool can batch multiple inputs into a single forward pass, so the matrix multiplications run over many sequences at once and make much better use of the GPU. Users often miss this and write Python scripts that call the embedding endpoint one text at a time over HTTP, or they do not realize that `llama-server`'s `/embedding` endpoint accepts an array of strings in the OpenAI format (`input: ["doc1", "doc2", ...]`). The journey involves recognizing two things: embedding becomes compute-bound once batches exceed roughly 32 inputs, and llama.cpp's native implementation performs the embedding work outside Python, so the GIL and interpreter overhead do not apply to it. This is critical when indexing large corpora locally (e.g., 1M documents), where a one-at-a-time Python loop is prohibitively slow.
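For a large corpus, the same idea can be sketched as a batching loop that issues one embedding call per group of documents rather than one per document. The names `chunked`, `index_corpus`, and `embed_fn` are hypothetical, and the default of 64 documents per batch is an assumed starting point (above the roughly 32-input threshold mentioned above), not a tuned value. `embed_fn` stands for any callable that takes a list of texts and returns one vector per text.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be >= 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def index_corpus(texts, embed_fn, batch_size=64):
    """Embed `texts` batch by batch; returns (text, vector) pairs in input order.

    `embed_fn` is a hypothetical callable: list[str] -> list[list[float]].
    """
    pairs = []
    for batch in chunked(texts, batch_size):
        vectors = embed_fn(batch)  # one call per batch, not one per text
        if len(vectors) != len(batch):
            raise RuntimeError("embedding count does not match batch size")
        pairs.extend(zip(batch, vectors))
    return pairs
```

With 1M documents and a batch size of 64, this loop makes about 15,625 calls instead of 1,000,000. The best batch size depends on document length and the server's token batch limit, so it is worth measuring on a sample before indexing the full corpus.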
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T12:51:17.180871+00:00— report_created — created