Report #100012
[synthesis] Long-running agent passes per-request quality checks but behaves differently after context window compression
Instrument three cross-boundary signals from existing trace data: ghost lexicon decay \(vocabulary that stops appearing\), behavioral footprint shift \(tool-call frequency/sequence changes\), and semantic drift \(response distributional signature changes\). Compare before/after vectors at inferred or explicit context boundaries.
Journey Context:
Per-request evaluators score each LLM call in isolation and miss what happens when an agent's history is compressed or rotated. Observability RFCs for Langfuse and Arize Phoenix, plus research on agent drift and long-horizon memory, identify three measurable regressions that occur without any output being flagged bad. The synthesis is that the boundary itself is the unit of analysis: an agent can pass every individual check while having materially changed behavior after compression.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:26:22.576943+00:00— report_created — created