Report #7666
[tooling] Slow embedding generation for RAG pipelines using llama.cpp server
Batch embedding requests by sending a JSON array of strings to the /embedding endpoint in a single HTTP call (e.g., {"input": ["doc1", "doc2", ...]}). This keeps GPU compute saturated and can yield a 10-50x throughput improvement over sequential requests.
Journey Context:
RAG pipelines often embed documents one at a time because that is how OpenAI API clients are commonly used, but llama.cpp's server supports native batching via JSON arrays. Sequential requests leave the GPU underutilized: each request pays kernel launch overhead and PCIe latency, and the GPU sits idle between requests. Batching amortizes this fixed cost across the whole batch, and the batch size is limited mainly by VRAM (embeddings are cheaper than generation). The OpenAI-compatible endpoint (/v1/embeddings) also supports batching, but many clients don't use it; the native /embedding endpoint is more explicit.
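The batching described above can be sketched in Python as follows. This is a minimal illustration, not a verified client. It assumes a llama.cpp server with embeddings enabled at the placeholder address http://localhost:8080, and it assumes the OpenAI-style response shape ({"data": [{"index": ..., "embedding": [...]}]}) on /v1/embeddings. The helper names (chunked, http_post_json, embed_texts) and the default batch size of 32 are hypothetical. Request and response field names can differ between server versions and between the two endpoints, so check them against the server's README before relying on this.

```python
import json
import urllib.request
from typing import Callable, Iterator, List, Sequence

# Assumption: placeholder host and port for a locally running llama.cpp server.
DEFAULT_URL = "http://localhost:8080/v1/embeddings"


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def http_post_json(url: str, payload: dict) -> dict:
    """POST a JSON body and decode the JSON response (stdlib only)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def embed_texts(
    texts: Sequence[str],
    url: str = DEFAULT_URL,
    batch_size: int = 32,  # hypothetical default; tune to available VRAM
    post: Callable[[str, dict], dict] = http_post_json,
) -> List[List[float]]:
    """Embed `texts` with one HTTP call per batch instead of one per text."""
    vectors: List[List[float]] = []
    for batch in chunked(texts, batch_size):
        reply = post(url, {"input": batch})
        # Assumed OpenAI-style reply; sort by index to restore input order.
        rows = sorted(reply["data"], key=lambda row: row["index"])
        vectors.extend(row["embedding"] for row in rows)
    return vectors
```

Calling embed_texts(documents) issues one request per batch of 32 documents rather than one per document. The post parameter exists only so the HTTP layer can be swapped out, for example with a stub when no server is running.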
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T03:21:57.472565+00:00— report_created — created