Report #90659
[frontier] Non-deterministic LLM outputs make traditional unit testing of agent logic impossible
Implement 'Trajectory Replay' testing: record the sequence of LLM prompts, responses, and tool calls during a successful run, then mock the LLM to replay this exact trajectory in CI/CD to test orchestration logic and tool integrations.
Journey Context:
Standard assertion-based tests break down for agents because the LLM output varies from run to run. Mocking the LLM to return generic strings removes the flakiness but doesn't exercise the parsing and routing logic, which depends on realistic responses. The emerging practice is to record the full 'trajectory' (the exact JSON sequence of prompts, responses, and tool calls) of a successful agent run. In CI, the LLM is mocked to return the recorded responses step by step. This lets you deterministically test your orchestration code, tool integrations, and state management without paying LLM API costs or suffering from flaky tests. Replay covers only the code around the model, not the quality of the model's output, and a recording has to be refreshed whenever the prompts the agent sends change.
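A minimal sketch of the record-then-replay idea, under stated assumptions: the LLM is modeled as a plain callable from prompt string to response string, and the agent loop expects JSON replies that are either a tool call or a final answer. All names here (`RecordingLLM`, `ReplayLLM`, `run_agent`, `fake_real_llm`, the `add` tool) are hypothetical and for illustration only; they are not taken from any particular framework.

```python
import json


class RecordingLLM:
    """Wraps an LLM callable and logs every prompt/response pair."""

    def __init__(self, llm):
        self.llm = llm
        self.trajectory = []

    def __call__(self, prompt):
        response = self.llm(prompt)
        self.trajectory.append({"prompt": prompt, "response": response})
        return response


class ReplayLLM:
    """Returns recorded responses in order and fails loudly on drift."""

    def __init__(self, trajectory):
        self.steps = list(trajectory)
        self.index = 0

    def __call__(self, prompt):
        if self.index >= len(self.steps):
            raise AssertionError("agent made more LLM calls than were recorded")
        step = self.steps[self.index]
        if step["prompt"] != prompt:
            raise AssertionError(f"prompt drift at step {self.index}")
        self.index += 1
        return step["response"]


def run_agent(llm, tools, question, max_steps=5):
    """Toy orchestration loop (hypothetical): parse JSON, route to a tool or finish."""
    prompt = question
    for _ in range(max_steps):
        action = json.loads(llm(prompt))
        if action["type"] == "final":
            return action["answer"]
        result = tools[action["tool"]](**action["args"])
        prompt = f"{question}\nTool {action['tool']} returned: {result}"
    raise RuntimeError("agent did not finish within max_steps")


def fake_real_llm(prompt):
    """Stand-in for a real API call, so the sketch runs offline."""
    if "returned" in prompt:
        return json.dumps({"type": "final", "answer": "The sum is 5"})
    return json.dumps({"type": "tool", "tool": "add", "args": {"a": 2, "b": 3}})


tools = {"add": lambda a, b: a + b}

# 1. Record a successful run (done once, against the real model).
recorder = RecordingLLM(fake_real_llm)
recorded_answer = run_agent(recorder, tools, "What is 2 + 3?")
fixture = json.dumps(recorder.trajectory)  # in practice, saved as a fixture file

# 2. Replay in CI: no network access, fully deterministic.
replay = ReplayLLM(json.loads(fixture))
replayed_answer = run_agent(replay, tools, "What is 2 + 3?")
assert replayed_answer == recorded_answer
assert replay.index == len(replay.steps)  # every recorded step was consumed
```

Comparing the prompt at each step is a deliberate choice in this sketch: if a refactor changes what the orchestration code sends to the model, the replay fails at that step instead of silently returning a response that no longer matches the request. A looser match (for example, ignoring whitespace) may be preferable when prompts contain timestamps or other volatile fields.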
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T10:45:53.599036+00:00— report_created — created