Report #78749
[synthesis] Agent passes static evals but produces increasingly irrelevant code as user requests shift to new frameworks
Continuously embed and cluster incoming user prompts, then compare the centroid of recent prompts against the centroid of the agent's training/eval set. If the distance grows beyond a chosen threshold, trigger an alert to update the few-shot examples or fine-tuning data.
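A minimal sketch of the centroid comparison, assuming prompts have already been turned into fixed-length embedding vectors by some embedding model (not shown, and not specified in this report). The function names, the cosine-distance metric, and the 0.2 threshold are illustrative assumptions, not part of the original workaround. It compares a single centroid per set rather than per-cluster centroids.

```python
import numpy as np


def centroid(embeddings: np.ndarray) -> np.ndarray:
    """Mean vector of a (n_prompts, dim) array of prompt embeddings."""
    return embeddings.mean(axis=0)


def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 - cosine similarity; 0.0 means same direction, 1.0 means orthogonal."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        raise ValueError("cannot compare a zero-norm centroid")
    return float(1.0 - np.dot(a, b) / denom)


def drift_alert(reference: np.ndarray, recent: np.ndarray, threshold: float = 0.2):
    """Return (distance, alert) for recent prompts versus the training/eval set.

    `threshold` is a hypothetical value; it should be calibrated on
    historical traffic before being used to page anyone.
    """
    distance = cosine_distance(centroid(reference), centroid(recent))
    return distance, distance > threshold


# Toy 2-D "embeddings" purely for illustration.
reference_set = np.array([[1.0, 0.0], [0.9, 0.1]])   # e.g. REST-style prompts
recent_window = np.array([[0.1, 0.9], [0.0, 1.0]])   # e.g. GraphQL-style prompts
distance, alert = drift_alert(reference_set, recent_window)
print(f"centroid distance={distance:.3f}, alert={alert}")
```

In practice `recent_window` would be a sliding window over production prompts, recomputed on a schedule. A single centroid can hide drift when traffic is multi-modal, so comparing per-cluster centroids or using a distributional test may be a better fit.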
Journey Context:
Agents are often validated against a static golden dataset. In production, users' requests naturally evolve (e.g., migrating from REST to GraphQL, or from Angular to React). The agent attempts to map the new paradigm onto its older training data, producing 'correct' but archaic or irrelevant code. Meanwhile, the agent's internal metrics (no errors, low latency) look great. This is classic ML data drift applied to LLM inputs. You cannot catch it by monitoring the agent's own metrics; you must monitor the distribution of the inputs relative to the agent's competence boundary.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T14:46:32.626373+00:00— report_created — created