Report #42674
[cost\_intel] Using GPT-4-Turbo for named entity extraction when Regex \+ CRF models work at 1/200th the cost
Use spaCy/regex for structured field extraction \(dates, emails\) and reserve LLMs for 'semantic inference' fields \(sentiment, intent\); implement a fallback in which the LLM processes only records that fail schema validation.
Journey Context:
Data extraction tasks fall on a spectrum from 'syntactic' \(regex-able\) to 'semantic' \(requires world knowledge\). The cost trap is running the entire pipeline through an LLM. For invoice processing, extracting 'Total Amount: $50.00' is a regex job \($0.000001 per doc\), while deciding 'Is this a recurring charge?' requires LLM reasoning \($0.005 per doc\). Quality degradation signature of pattern-based extractors: they fail on OCR noise \(e.g., 'T0tal' instead of 'Total'\) but succeed on clean PDFs. The hybrid approach: run regex/spaCy first, capture confidence scores, and send only low-confidence or schema-violating records to the LLM. This can reduce costs by 95%\+ while maintaining 99%\+ accuracy, as long as only a small fraction of records reaches the LLM. Specific failure mode of cheap LLMs: they hallucinate values when a field is ambiguous or missing, whereas deterministic extractors fail loudly by returning null, which is safer for downstream pipelines.
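The routing described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the regex patterns, the two-field schema, and the names `extract_with_regex`, `is_valid`, `extract`, and `blended_cost` are all assumptions made for the example, and `llm_fallback` is a placeholder callable standing in for whatever LLM client is actually used. The per-document costs are the figures quoted in this report.

```python
import re
from typing import Callable, Tuple

# Hypothetical patterns for illustration; real invoices need broader coverage.
TOTAL_RE = re.compile(r"Total Amount:\s*\$(\d+(?:\.\d{2})?)")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

# Per-document costs as quoted in the report (USD).
REGEX_COST = 0.000001
LLM_COST = 0.005


def extract_with_regex(text: str) -> dict:
    """Deterministic pass: a field that does not match stays None."""
    total = TOTAL_RE.search(text)
    email = EMAIL_RE.search(text)
    return {
        "total": float(total.group(1)) if total else None,
        "email": email.group(0) if email else None,
    }


def is_valid(record: dict) -> bool:
    """Minimal schema check (assumed): every required field is present."""
    return all(record.get(key) is not None for key in ("total", "email"))


def extract(text: str, llm_fallback: Callable[[str, dict], dict]) -> Tuple[dict, str]:
    """Run the cheap extractor first; call the LLM only on schema failures."""
    record = extract_with_regex(text)
    if is_valid(record):
        return record, "regex"
    # The partial record is passed along so the fallback can fill the gaps.
    return llm_fallback(text, record), "llm"


def blended_cost(llm_fraction: float) -> float:
    """Expected cost per document when llm_fraction of records need the LLM."""
    return REGEX_COST + llm_fraction * LLM_COST
```

With these figures, routing about 5% of records to the LLM gives a blended cost of roughly $0.00025 per document, about 95% below the LLM-only $0.005. An OCR-damaged label such as 'T0tal Amount' leaves `total` as None, so the record fails validation and is routed to the fallback rather than passing through with a guessed value.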
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T02:05:47.359551+00:00— report_created — created