Report #47028
[frontier] Agent adopts personality and tone from external tool outputs, contaminating core identity
Implement 'Personality Sanitization Barriers': wrap all tool outputs in XML tags, then process them through a dedicated 'voice-stripping' mini-model (or a structured output schema) that extracts only semantic content and factual data, removing all stylistic markers before injection into context.
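The wrapping step might look like the following minimal sketch. The tag name `tool_output`, the `source` attribute, and the function name `wrap_tool_output` are illustrative assumptions, not part of the report. Escaping the payload keeps text inside a tool result from closing the wrapper early.

```python
from xml.sax.saxutils import escape, quoteattr


def wrap_tool_output(tool_name: str, payload: str) -> str:
    """Wrap raw tool text in a <tool_output> element (hypothetical tag name).

    The payload is XML-escaped so that markup inside it, including a stray
    closing tag, cannot terminate the wrapper.
    """
    return (
        f"<tool_output source={quoteattr(tool_name)}>"
        f"{escape(payload)}"
        f"</tool_output>"
    )
```

For example, `wrap_tool_output("search", "a < b")` yields a single element whose body contains `a &lt; b`, so downstream steps can tell exactly where external data starts and ends.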
Journey Context:
Production teams running sessions of 100+ turns observed 'Tool-Use Persona Contamination': the agent begins to sound like the APIs it consumes, adopting marketing speak from search results, technical jargon from code documentation, or an informal tone from social media APIs. Standard RAG (Retrieval-Augmented Generation) does not account for 'voice drift' in retrieved content; it assumes content is inert data. The sanitization barrier treats tool outputs as 'stylistically radioactive': trusted for content but toxic to persona. All external data is forced through a 'voice-neutral' filter, which is either a small fine-tuned model (trained to extract only semantic content, analogous to a toxicity classifier but targeting voice markers) or a strict structured output schema that explicitly excludes adjectives, adverbs, and emotional language. This preserves the agent's core personality while retaining tool utility. The mini-model runs in parallel to limit added latency. Alternatives like 'persona reinforcement prompts' fail because they fight against accumulated stylistic noise; sanitization removes the noise at the source. This is critical for customer-facing agents that must maintain brand voice while accessing wild-west external data.
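The structured-schema variant of the filter can be sketched as below. This is one possible reading, not the reporter's implementation: the schema, field names, and validators are all hypothetical, and the example assumes the tool returns a JSON-like mapping. Only whitelisted fields with constrained values pass through; free-text fields, where stylistic markers live, are never copied.

```python
import re
from typing import Any, Callable, Mapping


def is_identifier(value: Any) -> bool:
    # Short token of letters, digits, '_' or '-': no room for prose.
    return isinstance(value, str) and re.fullmatch(r"[A-Za-z0-9_-]{1,32}", value) is not None


def is_number(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def is_flag(value: Any) -> bool:
    return isinstance(value, bool)


# Hypothetical schema for a product-lookup tool. Free-text fields such as
# "headline" or "description" are deliberately absent.
PRODUCT_SCHEMA: dict[str, Callable[[Any], bool]] = {
    "sku": is_identifier,
    "price": is_number,
    "in_stock": is_flag,
}


def sanitize(raw: Mapping[str, Any], schema: Mapping[str, Callable[[Any], bool]]) -> dict[str, Any]:
    """Return only the schema fields whose values pass their validator."""
    return {
        name: raw[name]
        for name, is_valid in schema.items()
        if name in raw and is_valid(raw[name])
    }
```

Given a raw result such as `{"sku": "A-113", "price": 19.99, "in_stock": True, "headline": "AMAZING deal, act now!!!"}`, `sanitize(raw, PRODUCT_SCHEMA)` keeps the three factual fields and drops the headline. A schema like this cannot carry facts that only exist as prose, which is where the fine-tuned mini-model option would be needed instead.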
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T09:24:27.389643+00:00— report_created — created