Report #77797
[frontier] Agent prompts are hand-written and manually tuned — small changes cause unpredictable behavior shifts, optimization is guesswork, and prompts do not improve with data
Use a prompt compilation framework such as DSPy to optimize prompts automatically: define the task signature (input/output types), provide training examples, and run a prompt optimizer that systematically searches over prompt variants and measures each one on held-out data to find the best formulation.
Journey Context:
Hand-writing prompts is like writing assembly: it works, but it is labor-intensive, brittle, and usually suboptimal. Small wording changes ('think step by step' vs 'reason carefully') can cause large behavior shifts that are hard to predict. The emerging pattern is prompt compilation: you specify what you want (a task signature plus examples) and a compiler searches over prompt variants to find the best one. DSPy pioneered this with optimizers (originally called teleprompters) that propose prompt variants and score each one with a metric on training examples. This matters for agents because: (1) agent prompts are complex, combining tool descriptions, behavioral instructions, and format specs; (2) the space of possible prompt wordings is vast and non-intuitive; (3) manual tuning does not scale as you add tools and capabilities. Tradeoff: prompt compilation requires a training set and an evaluation metric, adds upfront compute cost, and may produce prompts that are less interpretable. In return, DSPy studies report improvements of roughly 20-50% over expert-written prompts, though gains depend on the task, the model, and the metric. The key shift: stop thinking of prompts as code you write, and start thinking of them as model parameters you optimize.
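The search loop described above can be sketched in plain Python. This is a minimal illustration, not DSPy's API: `Example`, `stub_model`, `exact_match`, `score`, and `compile_prompt` are hypothetical names, and `stub_model` is an offline stand-in for a real LLM call so the sketch runs without credentials. A real optimizer would also generate the candidate instructions itself instead of taking a fixed list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    text: str   # task input
    label: str  # expected output


def stub_model(prompt: str, text: str) -> str:
    """Hypothetical stand-in for an LLM call (assumption, not a real API).

    It guesses sentiment from keywords and only returns a bare label
    when the instruction asks for a one-word answer.
    """
    negative_cues = ("bad", "slow", "broken")
    label = "negative" if any(w in text.lower() for w in negative_cues) else "positive"
    if "one word" in prompt.lower():
        return label
    return f"The sentiment is {label}."


def exact_match(prediction: str, example: Example) -> float:
    # Evaluation metric: 1.0 only if the output is exactly the label.
    return 1.0 if prediction.strip().lower() == example.label else 0.0


def score(instruction: str, examples: list[Example]) -> float:
    # Mean metric value of one candidate instruction over a dataset.
    hits = [exact_match(stub_model(instruction, ex.text), ex) for ex in examples]
    return sum(hits) / len(hits)


def compile_prompt(candidates: list[str], trainset: list[Example],
                   devset: list[Example]) -> tuple[str, float]:
    # Select the best candidate on the training set, then report its
    # score on held-out data that played no part in the selection.
    best = max(candidates, key=lambda c: score(c, trainset))
    return best, score(best, devset)


DATA = [
    Example("Great battery life", "positive"),
    Example("The screen is broken", "negative"),
    Example("Fast shipping, works well", "positive"),
    Example("Slow and bad support", "negative"),
    Example("Love the design", "positive"),
    Example("Bad packaging", "negative"),
]
TRAIN, DEV = DATA[:4], DATA[4:]

CANDIDATES = [
    "Classify the sentiment of the review.",
    "Think step by step, then classify the sentiment.",
    "Answer with one word, positive or negative.",
]

best, dev_score = compile_prompt(CANDIDATES, TRAIN, DEV)
print(best, dev_score)
```

The point of the sketch is the division of labor: the task is fixed by the examples and the metric, the wording is chosen by measurement, and the held-out score guards against picking a variant that only fits the training examples.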
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T13:10:46.325592+00:00— report_created — created