Report #35542
[synthesis] Claude's recall drops off a cliff past 50% context, GPT-4o degrades linearly, while Gemini maintains raw recall but suffers instruction-following degradation at high context
For agentic RAG workflows on Claude or GPT-4o, place the most critical tool definitions and retrieval results at the very beginning and very end of the prompt. For Gemini, keep the context under 500k tokens even though it supports 1M or 2M token windows, because instruction adherence degrades before raw recall does.
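A minimal sketch of both recommendations, assuming the retrieved chunks arrive already sorted from most to least relevant. The function names, the 4-characters-per-token estimate, and the prompt layout are illustrative assumptions, not part of any vendor SDK; the 500k figure is this report's guideline rather than a documented limit.

```python
from typing import List

# Soft cap taken from the recommendation above; an editorial guideline, not a vendor limit.
GEMINI_SOFT_LIMIT_TOKENS = 500_000


def estimate_tokens(text: str) -> int:
    """Rough heuristic (about 4 characters per token). Swap in a real tokenizer for production use."""
    return len(text) // 4


def edge_order(chunks_by_relevance: List[str]) -> List[str]:
    """Reorder chunks so the most relevant sit at both ends and the least relevant in the middle."""
    front: List[str] = []
    back: List[str] = []
    for i, chunk in enumerate(chunks_by_relevance):
        # Alternate between the front half and the back half of the prompt.
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


def trim_to_budget(chunks_by_relevance: List[str],
                   budget_tokens: int = GEMINI_SOFT_LIMIT_TOKENS) -> List[str]:
    """Keep the most relevant chunks until the estimated token budget is used up."""
    kept: List[str] = []
    used = 0
    for chunk in chunks_by_relevance:
        cost = estimate_tokens(chunk)
        if used + cost > budget_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept


def build_prompt(tool_definitions: str, chunks_by_relevance: List[str], question: str) -> str:
    """Tool definitions first, edge-ordered chunks next, the question last."""
    return "\n\n".join([tool_definitions, *edge_order(chunks_by_relevance), question])
```

For Claude or GPT-4o, `build_prompt` alone applies the placement advice. For Gemini, passing the chunks through `trim_to_budget` first keeps the estimated context under the soft cap.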
Journey Context:
'Needle in a Haystack' evaluations show that models can *find* data, but agentic workflows require models to *act* on that data. The synthesis is that raw retrieval (Gemini's strength) does not equal instruction adherence on retrieved data. Claude and GPT-4o might miss the data entirely, while Gemini finds it but ignores the instruction on how to use it. RAG prompt engineering should therefore be bifurcated: structural placement for Claude and GPT-4o, and strict reinforcement of instructions for Gemini at long context lengths.
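One way to read "strict reinforcement of instructions" is to state the instruction before the retrieved context and repeat it afterwards once the context grows large. The sketch below assumes that interpretation; the function name, the section labels, and the 100k-token threshold are hypothetical choices for illustration, not measured values.

```python
def reinforce_instruction(instruction: str, context: str,
                          repeat_above_tokens: int = 100_000) -> str:
    """State the instruction before the context and repeat it after long contexts.

    The threshold is a hypothetical default; tune it against your own evaluations.
    """
    # Rough size estimate (about 4 characters per token), not a real tokenizer.
    estimated_tokens = len(context) // 4

    parts = [f"INSTRUCTION:\n{instruction}", f"CONTEXT:\n{context}"]
    if estimated_tokens > repeat_above_tokens:
        # Restate the instruction so it is also the most recent thing the model reads.
        parts.append(f"REMINDER, apply this instruction to the context above:\n{instruction}")
    return "\n\n".join(parts)
```

Short contexts get the instruction once, so the reminder only adds tokens where the report expects adherence to slip.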
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T14:07:55.175226+00:00— report_created — created