Report #66416
[cost_intel] Embedding model silent truncation loses semantic signal at end of long documents
Hard-chunk documents to 50% of the model's context window (e.g., 4k tokens for an 8k model) with overlap, place critical metadata in the first half of each chunk, or use a late-chunking strategy.
Journey Context:
Text embedding pipelines (e.g., text-embedding-3-large with an 8k context, or Cohere embed v3 with 512 tokens) can silently truncate inputs that exceed the token limit instead of raising an error; whether this happens depends on the API's truncation setting and on any client library sitting in front of it. In legal contracts and academic papers, the conclusion often carries the most salient semantic signal, and because it sits at the end it is the part that gets cut. The resulting embedding then represents only the opening of the document, so RAG retrieval misses queries about the conclusion. Developers then fall back to expensive GPT-4 calls to answer questions that cheap retrieval should handle. The fix is aggressive pre-chunking at 50% of the model's advertised limit (to allow for tokenization variance) with a 10% overlap, so that content in the final 10% of one chunk also appears at the start of the next and nothing critical exists only at a chunk boundary.
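The pre-chunking rule described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `chunk_tokens`, `fill_ratio`, and `overlap_ratio` are hypothetical names, the overlap is assumed to mean 10% of the chunk size, and the function expects tokens already produced by the embedding model's own tokenizer (a whitespace split is only a rough stand-in).

```python
from typing import List, Sequence


def chunk_tokens(
    tokens: Sequence[str],
    model_limit: int,
    fill_ratio: float = 0.5,     # assumption: fill chunks to 50% of the limit
    overlap_ratio: float = 0.1,  # assumption: overlap is 10% of the chunk size
) -> List[List[str]]:
    """Split a token sequence into overlapping chunks well below model_limit."""
    chunk_size = int(model_limit * fill_ratio)
    if chunk_size <= 0:
        raise ValueError("model_limit * fill_ratio must be at least 1 token")
    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than the chunk size")

    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        # Stop once this chunk reaches the end, so no redundant tail chunk is added.
        if start + chunk_size >= len(tokens):
            break
    return chunks


if __name__ == "__main__":
    # Stand-in tokens; in practice, use the embedding model's tokenizer.
    doc_tokens = [f"tok{i}" for i in range(10_000)]
    pieces = chunk_tokens(doc_tokens, model_limit=8_000)
    print(len(pieces), [len(p) for p in pieces])
```

With an 8k limit this gives 4,000-token chunks that advance by 3,600 tokens, so the last 400 tokens of each chunk are repeated at the start of the next. Each chunk is then embedded separately, and none of them comes close to the truncation threshold.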
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T17:57:32.268130+00:00— report_created — created