Report #95191
[frontier] How do I manage prompt changes across environments without production regressions or 'prompt drift'?
Store prompts in Git with semantic versioning (e.g., `prompts/v1.2/classify.j2`). Add a CI pipeline using promptfoo or evalite that runs regression evaluations (evals) against every prompt change before deployment. Use feature flags (LaunchDarkly or Flagsmith) to canary a new prompt version on 5% of traffic before full rollout, with automatic rollback when the error rate crosses a set threshold.
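The canary-and-rollback step can be sketched in plain Python. This is a minimal illustration, not the LaunchDarkly or Flagsmith API: `CanaryRouter`, the stable version `v1.1`, the 10% error threshold, and the 20-call minimum sample are all hypothetical values chosen for the example. Only the 5% traffic share and the `prompts/v1.2/classify.j2` layout come from the answer above.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class CanaryRouter:
    """Hypothetical in-process stand-in for a feature-flag service."""

    stable_version: str = "v1.1"   # assumed previous release
    canary_version: str = "v1.2"
    canary_percent: float = 5.0    # share of traffic sent to the canary
    max_error_rate: float = 0.10   # assumed rollback threshold
    min_samples: int = 20          # do not judge the canary on too few calls
    canary_calls: int = 0
    canary_errors: int = 0
    rolled_back: bool = False

    def bucket(self, user_id: str) -> float:
        # Deterministic hash, so a given user always lands in the same bucket.
        digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
        return (int(digest[:8], 16) % 10_000) / 100.0  # 0.00 .. 99.99

    def version_for(self, user_id: str) -> str:
        if not self.rolled_back and self.bucket(user_id) < self.canary_percent:
            return self.canary_version
        return self.stable_version

    def record(self, version: str, ok: bool) -> None:
        # Only canary outcomes count toward the rollback decision.
        if version != self.canary_version or self.rolled_back:
            return
        self.canary_calls += 1
        if not ok:
            self.canary_errors += 1
        error_rate = self.canary_errors / self.canary_calls
        if self.canary_calls >= self.min_samples and error_rate > self.max_error_rate:
            self.rolled_back = True  # all traffic returns to the stable prompt

    def prompt_path(self, user_id: str, name: str = "classify.j2") -> str:
        return f"prompts/{self.version_for(user_id)}/{name}"
```

Hashing the user ID rather than drawing a random number keeps each user on one prompt version for the whole canary window, which makes error reports easier to attribute. A real flag service would also persist the rollback state across processes, which this sketch does not attempt.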
Journey Context:
Hardcoding prompts in Python strings is unmaintainable and leads to 'prompt drift', where the prompt running in production diverges from the version that was tested. 'Prompt as Code' is the baseline; the frontier is applying full software engineering rigor: GitOps, CI/CD, and A/B testing. The key insight is that prompts are 'hyperparameters' that need regression testing just like model weights. GitOps enables instant rollbacks when a 'prompt improvement' actually degrades agent performance, as measured by evals. This requires integration with eval frameworks that check not just syntax but task success rates.
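A regression gate on task success rate, as opposed to a syntax check, might look like the sketch below. It does not use promptfoo or evalite: `success_rate`, `gate`, the tiny eval set, and the 2-point tolerance are hypothetical, and the two stub functions stand in for real model calls made with the baseline and candidate prompts.

```python
from typing import Callable, Sequence

EvalCase = tuple[str, str]  # (input text, expected label)

# Hypothetical eval set; a real one would live in the repo next to the prompts.
CASES: list[EvalCase] = [
    ("refund please", "billing"),
    ("app crashes on login", "bug"),
    ("love the product", "praise"),
    ("charged twice", "billing"),
]


def success_rate(run_prompt: Callable[[str], str], cases: Sequence[EvalCase]) -> float:
    """Fraction of cases where the prompt's output matches the expected label."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(1 for text, expected in cases if run_prompt(text).strip() == expected)
    return passed / len(cases)


def gate(
    baseline: Callable[[str], str],
    candidate: Callable[[str], str],
    cases: Sequence[EvalCase],
    max_drop: float = 0.02,  # assumed tolerance for eval noise
) -> bool:
    """Return True only if the candidate does not regress beyond max_drop."""
    return success_rate(candidate, cases) >= success_rate(baseline, cases) - max_drop


def baseline_stub(text: str) -> str:
    # Stand-in for a model call using the currently deployed prompt.
    if "refund" in text or "charged" in text:
        return "billing"
    return "bug" if "crash" in text else "praise"


def candidate_stub(text: str) -> str:
    # Stand-in for a "prompt improvement" that silently lost the "charged" case.
    if "refund" in text:
        return "billing"
    return "bug" if "crash" in text else "praise"
```

In a CI job, the pipeline would call `gate(baseline_stub, candidate_stub, CASES)` with real runners and fail the build when it returns `False`, so a regressing prompt never reaches the canary stage.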
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T18:21:26.973851+00:00— report_created — created