Report #97495
[frontier] My agent doesn't know when to search versus when to rely on parametric knowledge
Train or fine-tune the model with RL (PPO or GRPO) to interleave reasoning and search: the model emits explicit search tokens when it needs external evidence, and retrieved tokens are masked out of the loss so that gradients flow only through tokens the model itself generated.
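A minimal sketch of the retrieved-token masking idea, in plain Python so it runs without a training framework. The tag names `<information>` / `</information>` and the helper names `retrieved_token_mask` and `masked_mean_loss` are assumptions made for illustration; this report does not specify a token format, and a real trainer would apply the same mask to tensors of per-token losses.

```python
from typing import List

# Hypothetical tags delimiting retrieved text inside a rollout (an assumption,
# not something this report specifies).
INFO_OPEN = "<information>"
INFO_CLOSE = "</information>"


def retrieved_token_mask(tokens: List[str]) -> List[int]:
    """Return 1 for model-generated tokens and 0 for retrieved tokens.

    Everything from INFO_OPEN through INFO_CLOSE (tags included) is treated
    as environment output and excluded from the policy loss.
    """
    mask = []
    inside = False
    for tok in tokens:
        if tok == INFO_OPEN:
            inside = True
        mask.append(0 if inside else 1)
        if tok == INFO_CLOSE:
            inside = False
    return mask


def masked_mean_loss(per_token_loss: List[float], mask: List[int]) -> float:
    """Average the per-token loss over model-generated tokens only."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(loss * m for loss, m in zip(per_token_loss, mask)) / kept
```

With this mask, a large loss on a retrieved passage contributes nothing to the update, so the policy is not pushed to imitate text it did not write.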
Journey Context:
Search-R1 showed that RL training with outcome rewards lets a 3B model learn multi-turn search/reasoning patterns, beating static RAG baselines. Static RAG retrieves once; agentic search reasons about whether, what, and how to retrieve. The risk is training instability: retrieved-token masking and simple outcome rewards are what made this trainable at small scale.
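The interleaved loop and the outcome reward described above can be sketched as follows. This is an illustration under stated assumptions, not the Search-R1 implementation: the `<search>`, `<information>`, and `<answer>` tags, the `generate` and `retrieve` callables, the exact-match normalization, and the turn budget are all hypothetical stand-ins. The last function shows GRPO-style group-relative advantages, which standardize rewards across several rollouts of the same prompt.

```python
import re
import statistics
from typing import Callable, List

# Hypothetical tag format for search calls and final answers (an assumption).
SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def rollout(question: str,
            generate: Callable[[str], str],
            retrieve: Callable[[str], str],
            max_turns: int = 4) -> str:
    """Alternate generation and retrieval until the policy answers,
    stops searching, or the turn budget runs out."""
    context = question
    for _ in range(max_turns):
        segment = generate(context)
        context += segment
        if ANSWER_RE.search(segment):
            break  # the policy committed to an answer
        query = SEARCH_RE.search(segment)
        if query is None:
            break  # neither a search nor an answer: stop the episode
        # Retrieved text is appended as environment output, not policy output.
        docs = retrieve(query.group(1).strip())
        context += "<information>" + docs + "</information>"
    return context


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(text.lower().split())


def outcome_reward(trajectory: str, gold: str) -> float:
    """1.0 if the last <answer> matches gold after normalization, else 0.0."""
    answers = ANSWER_RE.findall(trajectory)
    if not answers:
        return 0.0
    return 1.0 if _normalize(answers[-1]) == _normalize(gold) else 0.0


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """GRPO-style advantages: standardize rewards within one prompt's group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + 1e-6) for r in rewards]
```

Because the reward depends only on the final answer, the policy is free to decide per question whether a search is worth issuing at all; a rollout that answers correctly from parametric knowledge scores the same as one that searched first.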
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-25T05:13:03.415430+00:00— report_created — created