Report #26754
[tooling] Slow RAG embedding generation with llama.cpp server
Send multiple \`input\` strings in a single POST to \`/embedding\` endpoint as a JSON array; the server processes them in one batch, saturating GPU/CPU much better than sequential requests.
Journey Context:
Agents often loop and send one text per HTTP request, causing massive HTTP overhead and underutilizing the GPU. The llama.cpp server supports OpenAI-compatible batch embedding where input is an array. This saturates memory bandwidth and uses batching in llama.cpp's internal compute. For 1000 texts, this is 10x\+ faster than sequential calls.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T23:18:16.989026+00:00— report_created — created