Report #76961
[frontier] Insufficient training data for fine-tuning agent tool-use or for evals covering edge cases
Bootstrap high-quality synthetic training/eval data using adversarial self-play: deploy a 'Generator' agent that creates tasks, a 'Solver' agent that attempts them, and a 'Critic' agent that provides rewards/feedback; iterate to generate hard examples that span the failure frontier.
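A minimal sketch of the Generator/Solver/Critic loop, using toy stand-ins instead of LLM calls. Everything here is an assumption for illustration: the names `Task`, `generator`, `solver`, `critic`, `self_play` and `hard_examples`, the arithmetic task, the Solver's built-in weakness, and the staircase difficulty rule (a real Generator would be a model trained or prompted on the Critic's feedback rather than a counter).

```python
import random
from dataclasses import dataclass


@dataclass
class Task:
    prompt: str
    expected: int
    difficulty: int


def generator(difficulty: int, rng: random.Random) -> Task:
    # Hypothetical Generator: a sum with (difficulty + 1) positive terms.
    terms = [rng.randint(1, 9) for _ in range(difficulty + 1)]
    return Task(" + ".join(map(str, terms)), sum(terms), difficulty)


def solver(task: Task) -> int:
    # Hypothetical Solver with a deliberate weakness: it only reads
    # the first three terms, so longer sums come out wrong.
    terms = [int(t) for t in task.prompt.split(" + ")]
    return sum(terms[:3])


def critic(task: Task, answer: int) -> float:
    # Rule-based Critic: exact-match reward.
    return 1.0 if answer == task.expected else 0.0


def self_play(rounds: int, seed: int = 0) -> list:
    rng = random.Random(seed)
    difficulty = 1
    records = []
    for _ in range(rounds):
        task = generator(difficulty, rng)
        answer = solver(task)
        reward = critic(task, answer)
        records.append({
            "prompt": task.prompt,
            "answer": answer,
            "expected": task.expected,
            "difficulty": task.difficulty,
            "reward": reward,
        })
        # Push harder after a success, back off after a failure, so the
        # generated tasks hover around the Solver's failure frontier.
        difficulty = difficulty + 1 if reward == 1.0 else max(1, difficulty - 1)
    return records


def hard_examples(records: list) -> list:
    # Failed attempts are the candidates for training or eval data.
    return [r for r in records if r["reward"] == 0.0]
```

Calling `self_play(rounds=7)` and then `hard_examples(...)` on the result returns only the tasks the toy Solver got wrong. In a real setup the records would hold full tool-use trajectories, and the Critic's feedback would drive updates to both Generator and Solver.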
Journey Context:
Manual data labeling is expensive and misses edge cases, and data produced by simply prompting an LLM tends to lack diversity. Adversarial self-play (inspired by AlphaGo) creates a curriculum: the Generator gets better at creating hard tasks that exploit the Solver's current weaknesses. The Critic (which could be a stronger LLM or a rule-based checker) ensures signal quality. The loop produces synthetic trajectories for tool-use training (e.g., 'book flight' examples with complex date constraints) that cover the long tail of failures. DSPy's BootstrapFewShot uses a similar principle to generate few-shot examples automatically, keeping only demonstrations that pass a metric.
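For the 'book flight' case, a rule-based Critic could look roughly like the sketch below. The tool-call shape (`tool`, `args`, `depart`, `return`), the constraint keys, and the function name `critic_book_flight` are hypothetical, not taken from any real booking API. Date parsing uses the standard library's `date.fromisoformat`.

```python
from datetime import date


def critic_book_flight(call: dict, constraints: dict) -> tuple[float, list[str]]:
    """Score one Solver tool call against date constraints.

    Returns (reward, feedback). The reward is 1.0 only when every
    constraint holds; the feedback lists each violated constraint.
    """
    if call.get("tool") != "book_flight":
        return 0.0, ["wrong tool"]
    try:
        depart = date.fromisoformat(call["args"]["depart"])
        ret = date.fromisoformat(call["args"]["return"])
    except (KeyError, TypeError, ValueError):
        # Covers missing keys, non-string values, and impossible dates.
        return 0.0, ["missing or malformed date"]

    problems = []
    if depart < constraints["earliest_depart"]:
        problems.append("departs too early")
    if ret > constraints["latest_return"]:
        problems.append("returns too late")
    if (ret - depart).days < constraints["min_nights"]:
        problems.append("stay too short")
    return (1.0 if not problems else 0.0), problems
```

Returning the feedback strings next to the scalar reward lets the same check serve both as a training signal and as an eval assertion. A stronger-LLM Critic would still be needed for constraints that cannot be checked mechanically.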
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T11:46:14.587211+00:00— report_created — created