Report #68723
[cost\_intel] Using expensive LLM calls to categorize or cluster large datasets \(>1000 items\)
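Use text-embedding-3-large or voyage-large-2 embeddings plus k-means or HDBSCAN clustering for categorization tasks. Cost drops from roughly $6 per 1k items \(LLM, at $0.06 per 1k tokens and about 100 tokens per item\) to a few cents per 1k items \(embedding\). LLM categorization tends to drift after ~50 items, because the items must be chunked to fit the context window and each chunk is categorized without sight of the others; embeddings scale to millions of items and stay consistent, because each item is embedded independently of the rest.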
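The LLM-side figure can be checked with a few lines of arithmetic. This is a minimal sketch using the price and item size quoted in this report \($0.06 per 1k tokens, 5000 items, about 100 tokens each\); the function name is hypothetical, and it counts input tokens only, so output tokens, prompt overhead and retries would push the real bill higher.

```python
def llm_cost_usd(n_items, tokens_per_item, usd_per_1k_tokens):
    """Input-token cost of sending every item through an LLM once."""
    return n_items * tokens_per_item / 1000 * usd_per_1k_tokens


# Figures quoted in this report; real prices vary by model and over time.
n_items = 5000
total = llm_cost_usd(n_items, tokens_per_item=100, usd_per_1k_tokens=0.06)
per_1k_items = total / n_items * 1000

print(f"LLM total: ${total:.2f}, per 1k items: ${per_1k_items:.2f}")
```

The same function applies to an embedding model if its per-token price is plugged in, which makes the gap between the two approaches easy to re-derive when prices change.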
Journey Context:
Teams try to 'ask GPT-4 to categorize this list of 5000 support tickets' by stuffing the tickets into the context window. This fails for two reasons: \(a\) context limits force chunking, which makes categorization inconsistent across chunks, and \(b\) cost is prohibitive \(at $0.06 per 1k tokens, 5000 items at an average of 100 tokens each is 500k tokens, or about $30, versus roughly $0.10 in total for embeddings\). The correct pattern: embed all items \(cheap, and the vectors can be cached\), cluster them with HDBSCAN or with UMAP dimensionality reduction followed by a clustering step, then optionally label each cluster with a single LLM call per cluster \(not per item\). Quality is often higher as well, because embedding similarity captures semantic nuance that a rigid, predefined LLM categorization scheme can miss. The cliff: if categories require reasoning about external knowledge that is not in the text \(e.g., 'is this company public or private'\), use an LLM with RAG, not pure embedding clustering.
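The embed, cluster, then label-per-cluster pattern can be sketched as follows. This is an illustration under stated assumptions, not a production pipeline: `embed` and `label_cluster` are hypothetical stand-ins for a real embedding API call and a real LLM call, and the small pure-Python k-means is swapped in for HDBSCAN so the example runs without dependencies. In practice a library implementation such as scikit-learn's KMeans or HDBSCAN would replace it.

```python
from collections import defaultdict


def embed(texts):
    """Hypothetical stand-in for an embedding API call.

    Returns toy keyword-count vectors so the sketch runs offline.
    """
    vocab = ["refund", "charge", "login", "password"]
    return [[float(t.lower().count(w)) for w in vocab] for t in texts]


def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iters=20):
    """Plain k-means with farthest-point initialization (deterministic)."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        far = max(vectors, key=lambda v: min(dist2(v, c) for c in centroids))
        centroids.append(list(far))
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist2(v, centroids[j]))
                  for v in vectors]
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def label_cluster(samples):
    """Hypothetical stand-in for ONE LLM call that names a cluster.

    A real version would send the samples in a prompt; this stub echoes one.
    """
    return f"cluster like: {samples[0]!r}"


def categorize(texts, k, samples_per_cluster=3):
    vectors = embed(texts)        # one embedding per item, cheap and cacheable
    labels = kmeans(vectors, k)   # assignment involves no LLM at all
    groups = defaultdict(list)
    for text, lab in zip(texts, labels):
        groups[lab].append(text)
    # One labeling call per cluster, not per item.
    names = {lab: label_cluster(items[:samples_per_cluster])
             for lab, items in groups.items()}
    return [names[lab] for lab in labels], len(names)


tickets = [
    "Refund my double charge",
    "Charge appeared twice, need refund",
    "Cannot login, password rejected",
    "Password reset login loop",
]
assigned, llm_calls = categorize(tickets, k=2)
for ticket, name in zip(tickets, assigned):
    print(f"{ticket} -> {name}")
print("LLM labeling calls:", llm_calls)
```

The point of the design is that the number of LLM calls grows with the number of clusters rather than the number of items, so adding more tickets only adds embedding cost.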
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T21:50:16.652964+00:00— report_created — created