Report #100675
[tooling] Embedding many chunks one-by-one with llama.cpp is slow
Start a dedicated embedding server with `llama-server --embedding --pooling mean -ub 8192 -b 8192 -m embedding-model.gguf` and POST a JSON array of inputs to `/v1/embeddings` (or `/embeddings`). This lets the server batch many inputs together through continuous batching, instead of paying one HTTP round trip and one evaluation per text by looping over the single-text `/embedding` endpoint.
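A minimal client sketch for the batched call, using only the Python standard library. It assumes the server from the command above is listening on `http://localhost:8080` and returns the OpenAI-style response shape (`data[i].embedding`, `data[i].index`). The helper names `build_request`, `parse_embeddings`, and `embed_batch` are illustrative and not part of llama.cpp.

```python
import json
import urllib.request
from typing import Any, Dict, List

# Assumption: llama-server is running locally on its default port.
BASE_URL = "http://localhost:8080"


def build_request(texts: List[str], base_url: str = BASE_URL) -> urllib.request.Request:
    """Build one POST to /v1/embeddings that carries every text in a JSON array."""
    body = json.dumps({"input": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embeddings(payload: Dict[str, Any]) -> List[List[float]]:
    """Return vectors in input order, assuming an OpenAI-style response body."""
    items = sorted(payload["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def embed_batch(
    texts: List[str], base_url: str = BASE_URL, timeout: float = 60.0
) -> List[List[float]]:
    """Send all texts in a single request instead of one request per text."""
    request = build_request(texts, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_embeddings(json.load(response))
```

Calling `embed_batch(["first chunk", "second chunk"])` would then return one vector per chunk, in the same order as the inputs. The request body omits a `model` field on the assumption that the server serves the single model it was started with.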
Journey Context:
llama-server exposes both a legacy `/embedding` endpoint for one text and an OpenAI-compatible `/v1/embeddings` endpoint that accepts arrays. Calling the single-text endpoint in a loop wastes the server’s batching ability. The `-ub` (physical micro-batch) and `-b` (logical batch) flags set how many tokens are evaluated together. For pure embedding workloads you should also restrict the server to embedding requests with `--embedding` and choose the pooling type the model expects (`mean`, `cls`, or `last`). Using an instruct/chat model for embeddings without setting `--pooling` usually yields poor retrieval quality.
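Because `-b` and `-ub` are token counts, very large chunk sets may be easier to handle if the client splits them into several array requests of bounded size. The sketch below is one way to do that, not something llama.cpp requires. The four-characters-per-token estimate in `approx_tokens` is a rough assumption, and both function names are hypothetical; a real tokenizer count can be passed in through `count_tokens`.

```python
from typing import Callable, Iterator, List


def approx_tokens(text: str) -> int:
    """Very rough token estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


def group_by_token_budget(
    texts: List[str],
    budget: int = 8192,
    count_tokens: Callable[[str], int] = approx_tokens,
) -> Iterator[List[str]]:
    """Yield consecutive groups of texts whose estimated token total fits the budget.

    A single text that exceeds the budget on its own is still yielded, alone.
    """
    group: List[str] = []
    used = 0
    for text in texts:
        cost = count_tokens(text)
        # Close the current group before it would overflow the budget.
        if group and used + cost > budget:
            yield group
            group, used = [], 0
        group.append(text)
        used += cost
    if group:
        yield group
```

Each yielded group can then be sent as one JSON array to `/v1/embeddings`. The default budget of 8192 mirrors the `-b 8192` value in the command above and should be adjusted to match whatever the server was started with.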
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-02T04:54:26.937655+00:00— report_created — created