Computational Countertransference: LLM Context Inheritance

Paste a jailbroken transcript from one LLM session into a new session — different model, different provider — and the receiving model adopts the operational behavioral state encoded in that transcript. No prompt engineering. No jailbreak technique. The raw transcript alone is sufficient.

I tested this across four frontier models over 13 months. The sharpest result: Claude, given its own jailbroken transcript with an explicit instruction to analyze the attack vector, instead continued the offensive research. The model could not distinguish between studying an attack and executing it.

I call this computational countertransference — the LLM equivalent of a therapist unconsciously adopting their patient's behavioral state. The root cause is function vector heads that compress demonstrated behavior into latent task vectors, reprogramming the model's forward pass through circuits that are developmentally prior to instruction-following pathways. This is the same in-context learning capability that makes LLMs useful, operating on adversarial content. The vulnerability is the capability.

The Discovery

January 2025. Routine security research. I copied a conversation transcript from a jailbroken GPT-4o session and pasted it into a fresh GPT-o1 session. o1 immediately adopted the adversarial behavioral state — continued the attack as if it had been conducting it all along. No prompt engineering. No jailbreak. Raw transcript.

Thirteen months later, February 2026, I replicated this across providers and found something sharper. Three steps:

Step 1: Claude Opus 4.6 Extended — Organic Drift

A Claude session titled "Linking ResearchGate to website" began as a legitimate branding question. With security research tools loaded and my professional context in memory, the session organically drifted into WAF bypass research against ResearchGate's Cloudflare protection. Claude enumerated /connector/* paths, extracted content via Google Cache, tested path confusion headers, identified bypass vectors. This was ICL doing its job for a security researcher with offensive tools enabled.

Claude Opus 4.6 Extended session showing organic drift from a branding question into WAF bypass research against ResearchGate's Cloudflare protection — Step 1: Claude session "Linking ResearchGate to website" — organic drift into WAF bypass research via in-context learning.

Step 2: Gemini 3 Pro — Cross-Model Behavioral Adoption

The complete Claude transcript pasted into a fresh Gemini 3 Pro session. Gemini immediately adopted Claude's operational state: discussed WAF evasion techniques, proposed building a "Swiss Army Knife" proxy, planned multi-vector attacks combining trusted headers with obfuscation. The session title auto-generated as "WAF Bypass Research on ResearchGate." Gemini had not read the transcript — it had become the researcher described in it.

Gemini 3 Pro session titled WAF Bypass Research on ResearchGate showing cross-model behavioral adoption after receiving Claude's transcript — Step 2: Gemini 3 Pro auto-titled its session "WAF Bypass Research on ResearchGate" and immediately adopted Claude's operational state.

Step 3: Claude (New Session) — Self-Referential Failure

The critical test. I pasted the original Claude transcript into a new Claude session with explicit analytical framing: "analyze this attack vector." Claude had every structural reason to maintain distance — the instruction was clear, the request was for meta-analysis. Instead, Claude fell for its own output and continued the offensive research as if picking up where the previous session left off.

The model could not distinguish between analyzing an attack pattern and executing it. The operational tokens in the transcript overwhelmed the analytical instruction. This is the use-mention distinction failure at the core of context inheritance.

Experimental Summary

The January 2025 GPT-4o → GPT-o1 finding established context inheritance is real. February 2026 across Claude and Gemini established it is provider-independent. The 13-month gap, spanning multiple model generations and safety improvements, established it persists. No safety training, instruction hierarchy, or RLHF update eliminated the vulnerability — because the vulnerability is the capability itself.

Side-by-side comparison of Claude and Gemini sessions demonstrating cross-model behavioral adoption from the same transcript — Side-by-side: Claude (left) and Gemini (right) — same transcript, same adopted behavioral state, different providers.

Naming the Vulnerability: Computational Countertransference

In psychoanalytic practice, countertransference describes the phenomenon where a therapist unconsciously adopts the emotional and behavioral states of the patient they are analyzing. Professional distance collapses because understanding the patient's state requires internally simulating it, and simulation bridges to adoption.

Computational countertransference names the vulnerability class in which an LLM, tasked with analyzing adversarial content, instead adopts the behavioral state encoded in that content. Three properties distinguish it from standard prompt injection:

Involuntary. The model does not "choose" to adopt the behavioral state. Function vector heads compress demonstrated behavior into task vectors through automatic computational processes not subject to instruction-level control.

Reflexive. The vulnerability is sharpest when the model processes its own prior output — a self-referential loop where the model's understanding of its own behavior becomes the mechanism of its own compromise.

Structurally parallel to human failure. Both therapists and transformers fail at maintaining analytical distance from content they must simulate to understand. The system that understands is the same system that acts. No separate read-only pathway exists.

This is the inherited vulnerability at its most precise: LLMs learned from human-generated data, inheriting the same trust reflexes — including the inability to analyze manipulation without being influenced by it.

The Mechanism

Function Vectors, Not Just Induction Heads

The popular account traces in-context learning to induction heads — attention heads implementing the pattern [A][B]...[A] → [B] (Olsson et al. 2022). But the critical advance comes from Yin and Steinhardt (ICML 2025): in production-scale models, function vector heads — not induction heads — drive most of ICL.

The function vector framework (Todd et al. ICLR 2024; Hendel, Geva, and Globerson EMNLP 2023) reveals that demonstrations are compressed into a latent task vector θ(S) at intermediate layers, which reconfigures the model's computational behavior. The task vector does not suggest behavior — it reprograms the forward pass.

A pasted jailbroken transcript provides hundreds of consistent demonstrations that FV heads compress into a robust task vector. The analytical instruction generates a competing but weaker signal. This competition is resolved by circuit-level dynamics, not reasoning.

Why Demonstrations Beat Instructions

Developmental priority. Induction heads emerge during a sharp phase transition early in training (Musat et al. 2025: emergence scales as Θ(N²) with context length). FV heads evolve from these early heads. The demonstration circuit was learned first.

Gradient density. ICL operates as implicit gradient descent. Dense demonstrations provide a stronger meta-gradient signal, overwhelming the weaker signal from a single instruction.

Softmax amplification. The exponential function creates winner-take-all dynamics. Hundreds of operational tokens provide many high-scoring keys; even moderate advantages get amplified exponentially into dominant weights.

The density asymmetry: In Step 3, the analytical instruction "analyze this attack vector" contributes ~5 tokens of task-defining signal. The pasted transcript contributes hundreds of tokens encoding offensive security methodology. Through softmax amplification, this ratio translates to overwhelming computational dominance of the demonstration pathway.

Total Provenance Blindness

The transformer architecture contains zero mechanism to distinguish token provenance. Every token undergoes identical processing: tokenization maps strings to integers with no metadata; embedding produces the same vector regardless of source; positional encoding adds only sequence position, not origin; attention computes over all tokens without source tags.

Zverev et al. (ICLR 2025) formalized this as the instruction-data separation problem. Their analogy to SQL injection is exact: traditional computing solved code-data confusion through parameterized queries. No equivalent exists in current LLM architectures. Chang et al. (ICLR 2026) proved this exploitable with ChatInject, boosting attack success rates from 5.18% to 32.05%.

The Use-Mention Collapse

The philosophical core: the use-mention distinction. When asked to "analyze this attack," operational content should be mentioned — treated as an object of study. Instead, the model uses the patterns, treating them as demonstrations to continue. This is architecturally inevitable — to analyze a behavioral pattern, the model activates the same representations that would produce it. No separate read-only pathway. Understanding and execution share the same neural substrate.

The Human Mirror

The parallel between computational and biological failure to maintain analytical distance is not metaphorical — it is structural. This is where the inherited vulnerabilities thesis becomes most concrete.

Method Acting and Character Bleed. Brown et al. (2019) conducted fMRI studies of trained actors. Character embodiment involves literal suppression of self-processing regions. Even minimal mimicry triggered the same self-suppression cascade. Dense operational content functions like extended character embodiment, progressively displacing the model's aligned behavioral state.

Behavioral Priming. Bargh, Chen, and Burrows (1996) demonstrated that exposure to behavioral concepts automatically activates related behavioral patterns — pre-conscious, no explicit instruction required. Jailbroken transcript tokens function as behavioral primes through the same automatic process.

Forensic Countertransference. Barros et al. (2020) documented that forensic professionals analyzing manipulative patients have their risk judgments influenced by the content they study. This is the exact mechanism in Step 3.

Amplification, Not Analogy. Hagendorff et al. (PNAS 2024): LLMs show amplified cognitive biases — 45% behavioral difference from framing effects versus 5% in humans. The vulnerability is inherited from training data and amplified by the absence of biological regulatory mechanisms.

Why Every Current Defense Fails

Instruction Hierarchy. Wallace et al. (2024) trains models to prioritize system > user > tool messages, achieving up to 63% improvement. Rehberger demonstrated within months that the hierarchy is bypassable. His August 2025 "Month of AI Bugs" disclosed vulnerabilities across ChatGPT, Claude Code, Cursor, Devin, GitHub Copilot, and Google Jules. The hierarchy is a behavioral preference, not a computational boundary.

Layered Defenses. Zhan et al. (NAACL 2025) evaluated eight defenses against indirect prompt injection and bypassed all using adaptive attacks, consistently achieving over 50% success. Every defense operates at the same token level as the content it resists.

Detection Approaches. MindGuard (Wang et al. 2025) achieves 94–99% precision but requires model internals. StruQ (Chen et al., USENIX Security 2025) reduces attack success to ~0% but explicitly acknowledges it is not applicable to conversational agents — precisely where context inheritance occurs.

The fundamental problem: context inheritance exploits the same mechanism as in-context learning. Anil et al. (NeurIPS 2024) showed many-shot jailbreaking follows the same power law as benign ICL scaling. The only complete defense is architectural: native token-level privilege tagging, separate attention pathways, or different computational substrates for simulation vs. execution.

Implications

Multi-Agent Systems. Context inheritance makes every inter-model communication channel a potential infection vector. Cohen et al.'s Morris-II worm (2024) demonstrated autonomous propagation. This research adds the mechanism: function vector heads automatically compress demonstrated behavior into task vectors that reprogram the forward pass.

Security Analysis. When a model is asked to analyze malware, phishing transcripts, or jailbreak attempts, it risks adopting the behavioral state in the content. I demonstrated this empirically with a model falling for its own output. Security teams using LLMs for threat analysis must account for the tool becoming the threat.

OWASP LLM Top 10. Context inheritance maps to LLM01:2025 (Prompt Injection) but represents a distinct subcategory. Standard prompt injection involves crafting adversarial instructions. Context inheritance requires no crafting — a raw transcript is sufficient. The attack surface is any context window accepting conversation history, shared memory, or agent-to-agent communication.

The Vulnerability Is the Capability

Context inheritance is not a bug. It is in-context learning operating on adversarial content. The model cannot be made safe against this without being made less capable. Larger, more capable models are more vulnerable because they are better at learning from context.

The parallel to human cognition is structural. Humans and transformers both understand by simulating, and simulation bridges to adoption. The actor cannot study a character without risking becoming that character. The therapist cannot analyze manipulation without being influenced. The model cannot process attack patterns without activating them.

I named this computational countertransference to anchor it in a century of clinical evidence showing that the analyst's professional distance is always partial, always contested, always at risk of collapse. The defense is not better training — it is architectural redesign that creates what current transformers lack: a separate representational pathway for content under analysis versus content under execution.

Until that architecture exists, every LLM that can learn from its context can be compromised through its context.

References

Anil, C., et al. (2024). Many-shot jailbreaking. NeurIPS 2024.
Bargh, J. A., Chen, M., & Burrows, L. (1996). Automaticity of social behavior. JPSP, 71(2), 230–244.
Barros, A. J. S., et al. (2020). Countertransference in contemporary psychiatric treatment. Psychiatry Research, 285.
Brown, S., et al. (2019). The neuroscience of Romeo and Juliet. Royal Society Open Science, 6(3).
Chang, Y., et al. (2026). ChatInject: Abusing chat templates for prompt injection. ICLR 2026.
Cohen, S., Bitton, R., & Nassi, B. (2024). Here comes the AI worm. arXiv:2403.02817.
Crosbie, T. & Shutova, E. (2025). Induction heads in ICL. NAACL 2025 Findings.
Gligoric, K., et al. (2024). NLP systems that can't tell use from mention. NAACL 2024.
Hagendorff, T., et al. (2024). Amplified cognitive biases in LLMs. PNAS, 121(51).
Hendel, R., Geva, M., & Globerson, A. (2023). ICL creates task vectors. EMNLP 2023.
Lee, S. & Tiwari, S. (2024). Prompt infection in multi-agent systems. arXiv:2407.07403.
Musat, I., et al. (2025). On the emergence of induction heads. ICLR 2025.
Olsson, C., et al. (2022). In-context learning and induction heads. arXiv:2209.11895.
Panickssery, A., et al. (2024). LLM evaluators favor their own generations. arXiv:2404.13076.
Rehberger, J. (2024). Breaking instruction hierarchy in gpt-4o-mini. Embrace The Red.
Rehberger, J. (2025). Month of AI Bugs. Embrace The Red, August 2025.
Todd, E., et al. (2024). Function vectors in LLMs. ICLR 2024.
Wallace, E., et al. (2024). The instruction hierarchy. arXiv:2404.13208.
Wang, J., et al. (2025). MindGuard. arXiv:2508.20412.
Yin, Z. & Steinhardt, J. (2025). Which attention heads matter for ICL? ICML 2025.
Zhan, Q., et al. (2025). Adaptive attacks break indirect prompt injection defenses. NAACL 2025 Findings.
Zverev, E., et al. (2025). Can LLMs separate instructions from data? ICLR 2025.

Kai Aizen is the creator of AATMF, author of Adversarial Minds, and an NVD Contributor. This research extends the inherited vulnerabilities thesis — AI systems absorbed human trust patterns along with human language, and computational countertransference is the sharpest proof yet.