Report #43567
[cost_intel] Embedding models silently truncate inputs at 8k tokens (OpenAI) or 512 tokens (legacy), causing retrieval failures on long documents without warning
Pre-chunk all documents to 512 tokens with 50-token overlap before embedding; never embed full documents raw; log warnings when input exceeds model sequence length
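The logging part of this fix can be sketched as a small guard that runs before each embedding call. This is a minimal sketch rather than a definitive implementation: `check_input_length`, `MAX_TOKENS`, and the logger name are hypothetical, the limit shown is an assumption to confirm against your model's documentation, and the token count is assumed to come from a tokenizer such as tiktoken.

```python
import logging

logger = logging.getLogger("embedding_guard")  # hypothetical logger name

# Assumed limit for illustration only; confirm the real value for your model.
MAX_TOKENS = 8192


def check_input_length(doc_id: str, n_tokens: int, max_tokens: int = MAX_TOKENS) -> bool:
    """Return True if the input fits; otherwise log a warning and return False."""
    if n_tokens > max_tokens:
        logger.warning(
            "doc %s has %d tokens, exceeding the %d-token limit by %d; "
            "chunk it before embedding",
            doc_id, n_tokens, max_tokens, n_tokens - max_tokens,
        )
        return False
    return True
```

Returning a boolean instead of raising lets the caller decide whether to skip, chunk, or fail on an over-length document.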
Journey Context:
text-embedding-3-small and text-embedding-3-large accept at most 8192 tokens per input. text-embedding-ada-002 has a comparable limit of about 8k tokens; OpenAI's first-generation embedding models stop at roughly 2k tokens, and 512 tokens is typical of legacy BERT-style encoders. Whether over-length input is truncated or rejected depends on the provider and the client library, so verify the behavior for your stack. Where truncation happens, the call returns success with no warning and embeds only the first N tokens.

When you embed a 20k-token legal document this way and store it, queries that match the beginning of the document return high similarity, but queries about the end of the document fail completely because those tokens were never embedded. This is a silent data loss bug that destroys RAG accuracy.

The fix is mandatory pre-chunking. Use tiktoken to count tokens and split each document into 512-token chunks, with overlap to preserve context at boundaries. Embed the chunks individually and store metadata linking each one back to its source. This also tends to improve precision, because a small chunk usually covers one topic while a single 8k-token embedding averages many topics into one vector.
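The chunking step described above might look like the following. It is a sketch under stated assumptions, not a definitive implementation: `Chunk`, `chunk_tokens`, and `chunk_text` are hypothetical names, the 512/50 defaults come from the recommendation in this report, and `cl100k_base` is assumed to be the right tiktoken encoding for your embedding model.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    source_id: str     # links the chunk back to its source document
    index: int         # position of the chunk within the document
    start_token: int   # offset of the chunk's first token in the document
    tokens: list[int]


def chunk_tokens(source_id: str, tokens: list[int],
                 chunk_size: int = 512, overlap: int = 50) -> list[Chunk]:
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for index, start in enumerate(range(0, len(tokens), step)):
        chunks.append(Chunk(source_id, index, start, tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


def chunk_text(source_id: str, text: str, chunk_size: int = 512,
               overlap: int = 50, encoding_name: str = "cl100k_base"):
    """Tokenize with tiktoken, chunk, and return (Chunk, decoded_text) pairs."""
    import tiktoken  # third-party dependency: pip install tiktoken

    enc = tiktoken.get_encoding(encoding_name)
    chunks = chunk_tokens(source_id, enc.encode(text), chunk_size, overlap)
    # Note: decoding at a token boundary can split a multi-byte character.
    return [(c, enc.decode(c.tokens)) for c in chunks]
```

Each decoded chunk would then be embedded on its own, with `source_id`, `index`, and `start_token` stored alongside the vector so a hit can be traced back to its place in the original document.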
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T03:35:59.064989+00:00— report_created — created