Report #14399
[tooling] llama-server embedding endpoint processes documents 10x slower than expected for RAG indexing
Batch embedding requests by sending a JSON array, `{"input": ["doc1", "doc2", ...], "truncate": true}`, instead of sequential single-string requests; maintain a persistent HTTP keep-alive connection, and set `normalize: true` only if cosine similarity is required.
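The batching approach above could be sketched roughly as follows. This is a minimal illustration, not a verified client: the host, port, path, and batch size are assumptions, the `input` and `truncate` field names are taken from this report rather than checked against a particular server build, and `batched`, `build_payload`, and `embed_all` are hypothetical helper names. The response is returned as parsed JSON without assuming its shape.

```python
import http.client
import json
from typing import Any, Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def build_payload(texts: Sequence[str]) -> bytes:
    """Encode one batch as a JSON request body.

    Assumption: the field names follow this report; confirm them
    against the server version you actually run.
    """
    return json.dumps({"input": list(texts), "truncate": True}).encode("utf-8")


def embed_all(
    texts: Sequence[str],
    host: str = "127.0.0.1",   # assumed local server
    port: int = 8080,          # assumed port
    path: str = "/embedding",  # path as named in this report
    batch_size: int = 64,      # assumed; tune to your memory budget
) -> List[Any]:
    """Send all texts in batches over one reused HTTP connection."""
    conn = http.client.HTTPConnection(host, port, timeout=120)
    responses: List[Any] = []
    try:
        for chunk in batched(texts, batch_size):
            conn.request(
                "POST",
                path,
                body=build_payload(chunk),
                headers={"Content-Type": "application/json"},
            )
            resp = conn.getresponse()
            # Read the body fully so the connection can be reused.
            body = resp.read()
            if resp.status != 200:
                raise RuntimeError(f"HTTP {resp.status}: {body[:200]!r}")
            responses.append(json.loads(body))
    finally:
        conn.close()
    return responses
```

Calling `embed_all(chunks)` would then issue one request per batch of 64 chunks instead of one per chunk. A fixed batch size is the simplest choice here; batching by total token count may suit documents of very uneven length better.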
Journey Context:
OpenAI-compatible SDKs and basic curl examples demonstrate single-text embedding calls, so users implementing RAG pipelines tend to loop over document chunks and call the endpoint once per chunk, for example 1000 sequential calls for 1000 chunks. llama.cpp's `/embedding` endpoint accepts a JSON array in the `input` field and processes all texts in a single batch through the model. This amortizes the fixed per-request overhead, allows better GPU kernel occupancy, and avoids repeated HTTP/TLS handshake latency. The `truncate` parameter handles variable-length documents without client-side padding logic. Normalization adds a sqrt(sum(x^2)) pass, which costs CPU time; disable it if you only need dot-product search. The throughput difference is 10-50x for large document sets.
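To make the normalization step concrete, here is a small sketch of the sqrt(sum(x^2)) pass in plain Python. It illustrates the arithmetic only and is not the server's implementation; the function names are hypothetical. Once vectors are L2-normalized, their dot product equals their cosine similarity, which is why the choice depends on the search metric you use.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Divide each component by the vector's Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        # A zero vector has no direction; return it unchanged.
        return list(vec)
    return [x / norm for x in vec]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, computed as the dot product of unit vectors."""
    return dot(l2_normalize(a), l2_normalize(b))
```

If the index stores vectors that are already normalized, `dot` alone gives the cosine score, so the extra pass is needed only once per vector.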
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T21:23:53.207493+00:00— report_created — created