Report #95740
[tooling] Slow embedding generation in llama.cpp server when processing RAG documents one by one
Batch embedding requests by sending a JSON array in the `input` field to the `/embedding` endpoint instead of making sequential calls. The server accepts multiple sequences in a single request (e.g., `{"input": ["text1", "text2", ...]}`), which avoids per-request overhead and lets the server process the inputs as a batch, giving much higher throughput than one API call per document.
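A minimal Python sketch of the batched call described above, using only the standard library. The base URL `http://localhost:8080` and the helper names `build_batch_payload` and `embed_batch` are assumptions for illustration, not part of llama.cpp. The response layout differs between server versions and endpoints, so the sketch returns the parsed JSON as-is rather than assuming a shape.

```python
import json
import urllib.request


def build_batch_payload(texts):
    """Build the request body: one JSON array under "input" for all texts."""
    return {"input": list(texts)}


def embed_batch(texts, base_url="http://localhost:8080", timeout=120):
    """Send all texts in ONE request instead of one request per text.

    base_url is an assumed local server address; adjust it to your setup.
    Returns the decoded JSON response without interpreting its layout.
    """
    body = json.dumps(build_batch_payload(texts)).encode("utf-8")
    request = urllib.request.Request(
        base_url.rstrip("/") + "/embedding",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `embed_batch(["text1", "text2"])` replaces two sequential requests with one. Inspect the returned JSON once against your server version before writing code that indexes into it.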
Journey Context:
Many developers integrate llama.cpp's server for RAG embeddings by calling the `/embedding` endpoint in a loop over documents, assuming the server is stateless like OpenAI's API and that each call must carry a single input. However, llama.cpp's server keeps the model in memory and can process multiple inputs in parallel batches within one request. This is documented but often missed because examples for the OpenAI-compatible endpoint usually show a single string. Batching reduces per-request overhead and increases GPU utilization. Note that the usable batch size is limited by the `--ctx-size` and `--ubatch-size` settings (or `--batch-size` in older versions), so very large document sets still need to be split across several requests.
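Because the server-side limits cap how much one request can carry, a large document set is usually split into several moderately sized batches. The sketch below shows one way to do that. The function name `chunk_texts` and the defaults `max_items=32` and `max_chars=8000` are hypothetical placeholders, and the character budget is only a crude stand-in for a token count, so tune both against your own `--ctx-size` and `--ubatch-size` values.

```python
def chunk_texts(texts, max_items=32, max_chars=8000):
    """Yield lists of texts, each small enough to send as one request.

    A batch is closed when adding the next text would exceed either the
    item limit or the (approximate) character budget. A single text that
    is longer than max_chars still gets a batch of its own.
    """
    batch, size = [], 0
    for text in texts:
        if batch and (len(batch) >= max_items or size + len(text) > max_chars):
            yield batch
            batch, size = [], 0
        batch.append(text)
        size += len(text)
    if batch:
        yield batch
```

Each yielded list can then be sent as the `input` array of a single request, so a corpus of thousands of documents becomes a handful of calls rather than one call per document.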
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T19:16:58.260100+00:00— report_created — created