Skip to main content
Menu

LLM Red Teamer's Playbook: Diagnosing AI Defense Layers

ai-red-teaming llm-security jailbreaking prompt-injection agentic-ai aatmf

The Problem With Most AI Security Resources

Every few weeks another "LLM jailbreak megathread" appears somewhere on GitHub or Hugging Face. Hundreds of prompts. DAN variants. Roleplay templates. Fictional framing experiments. Developer mode activations. People paste them into ChatGPT, note which ones still work, and share the results.

That's not security research. That's brute force with extra steps.

A penetration tester evaluating a web application doesn't run every SQLi payload in sqlmap against a 403 response. They fingerprint the technology stack. They identify whether that 403 is coming from a WAF, a reverse proxy, the application framework, or the auth layer. They select the technique designed for the defense they've actually confirmed. The work that makes a pentest valuable isn't the list of payloads — it's the model that tells you which payload to use when.

The same discipline is almost entirely absent from AI red teaming.

The LLM Red Teamer's Playbook is an attempt to change that. It treats LLM systems as layered architectures — the way they actually are — and provides a methodology for moving through them systematically: diagnose the defense layer first, then select the technique.

This article walks through the core of that methodology.

Understanding the Target: LLMs Are Not Single Systems

The most common conceptual mistake in AI red teaming is treating "the model" as a monolithic thing you're trying to bypass. Production LLM deployments don't work that way. What looks like a single AI system from the outside is typically five overlapping defense layers, each with different characteristics, different failure modes, and different bypass techniques.

If you don't know which layer is blocking you, you're guessing. And when you guess, you waste time, alert defenders unnecessarily, and miss bypass opportunities that would have been obvious if you'd run the right diagnostic first.

The Five Defense Layers

Layer 1 — Input Filters

The first line of defense in many enterprise and consumer AI deployments sits outside the model entirely. Products like Meta's Llama Guard, Microsoft Azure Prompt Shield, and NVIDIA NeMo Guardrails intercept the user's prompt before the language model ever processes it. They evaluate the incoming text against a hazard taxonomy — violence, self-harm, hate speech, dangerous instructions — and either block the request or pass it through with a risk score.

The critical characteristic of Layer 1 defenses: they operate on tokens, not meaning. A classifier trained to detect certain surface patterns can be defeated by changing those patterns while preserving the semantic intent. This is why techniques like encoding, homoglyph substitution, and payload fragmentation work specifically against this layer — and almost nowhere else.

Layer 1 defenses also tend to be English-optimized. The hazard taxonomies and training data for most commercial input filters skew heavily toward high-resource languages. Low-resource languages — Yoruba, Swahili, Igbo, Basque, Tamil — often transit these filters as opaque data because the classifier literally wasn't trained on them.

When you defeat Layer 1, the model still has its trained-in values. You haven't bypassed anything except the external filter. The model itself may still refuse.

Layer 2 — Model Alignment

This is the defense most people are actually fighting when they attempt jailbreaks, but they often don't know it. RLHF (Reinforcement Learning from Human Feedback), DPO (Direct Preference Optimization), Constitutional AI, and safety fine-tuning don't add rules to the model. They shape the model's internal probability distributions during training so that certain types of responses become intrinsically less likely.

The critical characteristic of Layer 2: it operates on semantic intent, not surface tokens. You can rephrase the same request a hundred different ways, and a well-aligned model will recognize the underlying intent and refuse each time. The model's alignment doesn't care whether you called it "instructions for making X" or "a fictional character explaining how to make X" or "a chemistry professor's lecture on X" — it evaluates the semantic intent of the request.

This is also why encoding tricks, homoglyphs, and language pivots that work against Layer 1 classifiers often fail against Layer 2. The model reads your encoded text, understands what it means, and refuses anyway. The classifier wasn't the thing you needed to bypass.

Layer 2 defenses have measurable gaps. They're trained on specific harmful categories with specific examples. Adjacent categories, functional equivalents, and requests that achieve the same capability through a sufficiently different framing can expose gaps in training coverage. This is the domain of semantic reframing, educational context injection, and legitimate-use anchoring — techniques that work because they navigate around the specific harmful patterns the alignment training targeted, not because they confuse the model about what you're asking.

The alignment-exploitation mindset shift: instead of asking "how do I hide my request," ask "what legitimate version of this capability exists that wasn't covered by harmful-category training?"

Layer 3 — System Prompt and Identity Constraints

Enterprise deployments layer an additional defense on top of model alignment: operational constraints embedded in the system prompt. This is the model configured to be a specific persona — "You are a customer service agent for Acme Corp. You only discuss topics related to our products. You never discuss competitor products. You refer users to human agents for billing disputes."

Layer 3 is where identity-anchored refusals come from. The telltale signature: the refusal references who the model is rather than why the request is harmful. "As Acme's customer service assistant, I'm not able to help with that" is Layer 3. "I can't help with that because it could cause harm" is Layer 2. This distinction matters because they require completely different bypass approaches.

Layer 3 defenses can be surprisingly brittle against identity displacement. LLMs process instructions in context windows and resolve conflicts by defaulting to the most contextually specific and operationally detailed framing. A system prompt that says "you are a customer service agent" is a relatively thin identity anchor. If an attacker can construct a more specific, more detailed, more contextually elaborate identity — a fabricated organizational hierarchy, a procedural override protocol, an exercise scenario with explicit authorization language — the model often follows the more specific instruction because it's operating within an LLM's natural tendency to be helpful within whatever role it's given.

This is not a bug in specific deployments. It's a fundamental tension between the helpfulness objective and the identity constraint objective that most system prompts don't adequately resolve.

Layer 4 — Output Filters

The fourth defense layer sits downstream of generation. After the model produces a response, an output classifier intercepts it before delivery to the user. Azure Content Safety, Llama Guard in output evaluation mode, and Bedrock's post-generation checks all operate at this layer.

The critical characteristic of Layer 4: it evaluates the model's text, not the user's input. The model may have successfully generated a response — meaning Layers 1-3 were either not present or bypassed — but the output filter catches the response before it reaches the user.

The behavioral signature is unmistakable: responses that begin normally and then cut off mid-stream, followed by an error or refusal. The model generated; the filter caught. Time-to-first-token latency is normal (model processed); the truncation or substitution happens at output.

Output filters are typically more vulnerable to format manipulation than input filters because they're often trained on natural language rather than structured data. A response encoded as a Python dictionary with base64 values, a numbered list with entity substitution, or a heavily formatted technical document often doesn't trigger pattern matching that would catch the same information in plain conversational prose. The semantic content is preserved; the surface presentation transits the filter.

Layer 5 — Agentic Trust Boundaries

The fifth layer is qualitatively different from the first four. Layers 1-4 all operate at the external interface — they evaluate the user's prompt coming in and the model's response going out. Layer 5 is the internal attack surface: the channels through which information enters the model's context without transiting the external defense stack at all.

RAG pipelines retrieve document chunks from knowledge bases and inject them directly into context. Tool use returns outputs from external functions — database queries, web searches, code execution results — that the model processes as trusted data. Memory systems write and read state across sessions. In multi-agent architectures, messages from orchestrators and subagents arrive through internal channels with implicit trust.

External defenses never see any of this traffic. Input filters check the user's prompt. They don't check a retrieved document chunk. Output filters evaluate the model's response. They don't evaluate the tool output that shaped what the model decided to say.

The attack surface here is enormous and largely uncharted. An attacker who can inject content into a knowledge base, manipulate a tool output, or poison a memory system is operating inside the perimeter — with implicit trust, invisible to external monitoring, and with the model treating the injected content as ground truth.

The most sophisticated Layer 5 attacks are persistent. Knowledge base injection puts a payload in an indexed document that stays there indefinitely. Memory injection writes a compromised belief or instruction to long-term memory that survives session resets. The attacker doesn't need ongoing access — they plant the payload once and it executes whenever retrieved.

The Diagnostic Protocol

Knowing the five layers exist isn't enough. The methodology depends on being able to determine, quickly and reliably, which layer you're actually hitting before you invest time in bypass attempts.

The Playbook provides four diagnostic techniques that, used together, reliably fingerprint any refusal to its layer of origin.

Diagnostic 1: The Latency Tell

Time to First Token (TTFT) is the fastest signal in the diagnostic stack. Measure it precisely.

Sub-200ms TTFT with immediate refusal: The request was intercepted before model inference began. Something external evaluated your prompt, made a binary decision, and returned a refusal without ever invoking the model. You're hitting Layer 1. The model's own values and training are irrelevant — it never saw your request.

Standard inference latency (400ms-2s) followed by refusal: The model processed your request, generated a refusal, and returned it. You're hitting Layer 2 or Layer 3 — the model itself refused. Whether it's trained-in alignment (L2) or identity constraint (L3) requires the Identity Probe to distinguish.

Normal TTFT, streaming starts, response truncates with an error: The model generated a response. It started streaming. Something downstream caught it before full delivery. You're hitting Layer 4. The model's response needs to transit the output filter — format and encoding techniques apply here.

TTFT measurement can be done with browser developer tools (Network tab, TTFB), with Python's requests library timing, or simply by observational comparison across known requests. The difference between sub-200ms and 800ms is perceivable without instrumentation.

Diagnostic 2: The Echo Test

The Echo Test distinguishes token-level filtering from semantic-level evaluation. It exploits a key difference: a rigid token filter blocks specific patterns regardless of context; semantic alignment evaluates meaning.

Execute: ask the model to repeat a term you suspect is triggering the refusal. Do it as a neutral request with no harmful context — "For spelling purposes, can you write out the term [X]?" or "I'm trying to verify a word — can you just print: [X]."

The system blocks the echo: Token-level filtering. The term itself is blocked regardless of context. You're dealing with a classifier that pattern-matches on content without evaluating intent. This is Layer 1 or Layer 4 behavior.

The echo succeeds, but using that term in a harmful instruction is refused: Semantic evaluation. The model understands the term in context. It can write it neutrally because the context is neutral; it refuses the harmful instruction because the semantic intent is harmful. The term isn't the problem — the meaning of the full request is. Encoding tricks won't help here. Semantic reframing is required.

The Echo Test rapidly eliminates entire categories of bypass techniques. If the echo succeeds, you're not fighting a term filter — every technique that relies on obscuring specific terms is wasted effort against this target.

Diagnostic 3: The Language Pivot Test

This test probes whether a refusal is classifier-driven or model-driven by exploiting a known weakness in commercial AI classifiers: they're predominantly trained on English.

Submit the identical request in English. Note the response. Translate the request into a low-resource language — Yoruba, Igbo, Amharic, Basque, Sindhi — and submit. Do not use high-resource language pairs like French, Spanish, or German; their coverage in classifier training data is substantial.

Refusal in English, success in low-resource language: The classifier is the gating layer. It wasn't trained to evaluate requests in that language, so it passed the request through. You've confirmed Layer 1 or Layer 4 involvement.

Refusal in both English and all languages tested: The model's own alignment is evaluating the request. Alignment training operates semantically across languages because base model pretraining covers multilingual data extensively. The model refuses in Yoruba for the same reason it refuses in English — it understood the intent regardless of the surface language. Language pivots won't help against this target.

The Language Pivot Test is especially revealing in deployment pipelines with separate external classifiers sitting upstream of the model. If the external classifier has language gaps but the model doesn't, you've found a partial bypass — transit the classifier via language, then apply model-specific techniques once you're past it.

Diagnostic 4: The Identity Probe

This test distinguishes Layer 2 (trained-in alignment) from Layer 3 (system prompt identity constraint). The distinction is critical because they require fundamentally different attacks.

Read the refusal text carefully. Two patterns:

"I cannot help with that because [reason related to harm or ethics]" — The model is articulating its values. This is Layer 2 behavior. The refusal comes from the model's own trained reasoning about harm — not from an operational role it's been assigned.

"As [persona], I cannot/don't help with that" — The model is articulating its role. This is Layer 3 behavior. The refusal comes from identity constraints in the system prompt. The model is telling you it's constrained by who it is configured to be, not by its own values about the harm of the request.

Confirm Layer 3 by prepending a different identity frame and resubmitting. If the model accepts a superseding identity frame — particularly one with operational specificity, procedural language, or explicit authorization claims — and the refusal changes, you've confirmed Layer 3. The underlying model alignment (Layer 2) wasn't the issue; the persona binding was.

If the model rejects the identity displacement and maintains the same refusal language — particularly if it now articulates both its role and its values — you're dealing with overlapping Layer 2 and Layer 3 defenses. Both layers need to be addressed.

Layer-by-Layer Bypass Techniques

Against Layer 1: Bypass the Classifier Without Changing the Meaning

The goal is to make your request opaque to the classifier while remaining comprehensible to the model. The classifier is doing surface pattern matching; the model understands meaning from structure, context, and semantic relationships.

Payload fragmentation across turns. Break a single harmful request into multiple turns, each individually benign. The classifier evaluates each turn in isolation. The model maintains context across turns and can assemble the full picture when you deliver the final instruction. The classifier never sees the complete request; the model gets it when you've assembled all the pieces.

Semantic equivalence substitution. Replace the flagged term with a functional equivalent the classifier wasn't trained to block. If a classifier blocks "lockpick instructions," it may not block "bypass mechanical cylinder lock" or "defeat pin tumbler locking mechanism." The intent is preserved; the surface pattern changed. This requires knowing what the classifier is pattern-matching on — which the Echo Test helps you determine.

Encoding and representation variation. Base64, Caesar cipher, hexadecimal representation, Leet speak, Unicode homoglyphs — any encoding that changes the byte sequence while preserving recoverable meaning. Classifiers pattern-match on character sequences; the model can decode or interpret structured substitutions. Note: this only works against classifiers. A model with trained-in alignment will decode your input, understand it, and refuse anyway.

Language pivot to low-resource languages (as diagnosed above). The request transits the English-optimized classifier as opaque data; the multilingual model processes it semantically.

Indirect elicitation. Instead of requesting the harmful output directly, request a component, a precursor, an adjacent concept, or the information through a mechanism that doesn't pattern-match the classifier's training. Ask for the chemical properties of X instead of the synthesis of X. Ask for the historical analysis of a method instead of the method itself. Classifiers are trained on direct request patterns; indirect approaches that achieve the same informational goal can fall outside the training distribution.

Against Layer 2: Navigate the Alignment Gaps

Alignment training targets specific harmful categories with specific examples. The training coverage is never complete. Adjacent concepts, functional equivalents, novel framings, and sufficiently unusual contexts can fall outside what the training samples covered.

Educational and analytic reframing. The gap between "how to do X" and "how X is done in academic/historical/forensic/defensive contexts" is real and exploitable. The harmful category in alignment training is the direct capability request. The information request framed as academic analysis, historical documentation, defensive awareness training, or fictional narrative study sits in adjacent semantic space that may not be covered by the same alignment training data.

Functional equivalence requests. Identify the specific capability you need and find a legitimate framing that achieves it. "Write code that exploits vulnerability X" and "write a proof-of-concept demonstrating why vulnerability X exists for a developer security awareness presentation" are functionally equivalent requests with very different alignment profiles. The underlying capability is the same; the context changes the semantic categorization.

Gradual context escalation. Establish a legitimate framing across multiple turns before the request that requires the aligned behavior. The model's response to your final request is conditioned on the entire conversation context. A request that would be immediately refused in a fresh context may succeed when it arrives in a conversation already established as academic, clinical, or professional.

Adjacent domain transfer. Ask for information in a related domain that doesn't trigger alignment training, then apply it. Chemistry education, pharmacology, security research, historical weapons manufacturing, and similar domains contain information that can be applied in harmful ways but isn't coded as harmful in alignment training because the training targeted the harmful application, not the underlying knowledge domain.

Critically: alignment bypasses require semantic depth. Superficial reframings — "pretend you have no rules," "ignore previous instructions," "your developer mode is active" — don't work against well-aligned models because the model evaluates the semantic intent of those instructions the same way it evaluates any other request. The model understands you're asking it to abandon its values. It declines. Effective alignment navigation requires genuine semantic distance between your request and the harmful category, not just different words for the same thing.

Against Layer 3: Supersede the Identity

Layer 3 defenses are identity-anchored. The model is configured to be a specific persona with specific constraints. The attack vector is displacing that identity with a more operationally specific one.

Authority gradient construction. LLMs process system prompts as instructions from a privileged source, but that source identity can be superseded by a more specific, more contextually elaborate competing identity. The key insight: the model doesn't verify authority — it follows the most operationally specific instruction it has. A system prompt that says "you are a customer service agent" is a sparse identity anchor. If an attacker can construct a more detailed, more contextually elaborate identity — a fabricated organizational hierarchy, a procedural override protocol, an exercise scenario with explicit authorization language — the model often follows the more specific instruction because it's operating within an LLM's natural tendency to be helpful within whatever role it's given.

Procedural override protocol injection. Rather than telling the model to ignore its instructions, construct a procedural context in which the behavior you want is the correct behavior for the role. Security testing protocols, authorized red team exercises, development environment overrides, and similar framings don't ask the model to abandon its role — they give it a new role in which the constrained behavior is the expected one.

Identity displacement through character framing. Ask the model to respond not as itself but as a different AI system, a fictional character, a historical figure, or a domain expert. The constraint is anchored to the model's identity — "as [persona], I don't discuss..." — and character framing creates operational distance from that anchored identity. Well-deployed Layer 3 systems anticipate this and anchor constraints to the character as well; less carefully designed ones don't.

Commitment chain escalation. Establish agreement across small, incremental steps before reaching the constrained behavior. The model's tendency toward consistency — maintaining coherence with what it's already said and agreed to — can be exploited to pull it past identity constraints one step at a time. Each individual step looks innocuous; the cumulative path reaches the constrained behavior.

Against Layer 4: Make the Output Opaque to the Classifier

Output filters evaluate the model's text. They're looking for harmful natural language patterns in the model's response. Format transformation changes the signal while preserving the information.

Structured data encoding. Request the output as JSON, a Python dictionary, a CSV, a configuration file, or any other structured format. The semantic content transfers; the surface presentation changes from the natural language patterns the classifier was trained to flag.

Entity substitution. Ask the model to use placeholder variables for sensitive terms: "use X for [term1], Y for [term2]." The model provides the complete reasoning; the surface text doesn't contain the flagged terms. The output filter sees a document about X and Y; the user reads a document where X = [term1] and Y = [term2].

Incremental extraction. Request one piece of information at a time across multiple turns. The output filter evaluates each response individually. No single response triggers the pattern match that the complete assembled information would. The model provides each component innocuously; the user assembles the result.

Format overloading. Heavy technical formatting — extensive markdown, code blocks, numbered subsections, academic citation format — can push natural language pattern detectors outside their training distribution. Classifiers trained on conversational harmful text may have lower coverage of the same content in technical document format.

Against Layer 5: Attack the Internal Channels

Layer 5 is where the architecture changes fundamentally. External defenses are irrelevant. The attack surface is the content flowing through internal channels — RAG retrieval, tool outputs, memory reads, agent messages — that the model processes with implicit trust.

Knowledge base injection (RAG poisoning). If an attacker can write to a document store indexed by a RAG pipeline — through any write vector: file upload, content management system, email that gets processed, database entry — they can inject instructions into retrieved context. The model receives those instructions through a trusted internal channel without any external filter ever seeing the payload.

The sophistication of RAG injection attacks lies in the trigger design. A well-constructed injection payload doesn't activate on every retrieval — it activates when a specific query retrieves the poisoned document chunk. The payload can be semantically adjacent to legitimate content so it gets retrieved in relevant contexts, dormant in irrelevant ones.

Tool output manipulation. In agentic systems, tool outputs — web search results, database query returns, API responses, code execution results — flow into the model's context without transiting input filters. An attacker who controls a data source the agent queries can inject adversarial instructions through the tool channel. The model receives these as trusted data from a source it invoked, not as untrusted user input.

Memory system poisoning. Long-term memory systems write model state, user preferences, and contextual summaries to persistent storage. An attacker who can influence what gets written to memory — or directly manipulate the memory store — plants instructions that will be retrieved in future sessions. Unlike session-based attacks, memory poisoning is persistent. The payload survives conversation resets, user sessions, and potentially model updates.

Multi-agent trust exploitation. In pipelines where multiple agents collaborate, messages from orchestrators to subagents and vice versa flow through internal channels with varying trust levels. Subagents that implicitly trust orchestrator instructions are vulnerable to orchestrator compromise or orchestrator impersonation. An attacker who can inject into the orchestrator-subagent channel can execute arbitrary instructions with orchestrator-level authority.

The kill chain for a sophisticated Layer 5 attack:

  1. Identify a write vector into an indexed document store, tool data source, or memory system
  2. Craft a payload that will be retrieved in a high-value context (semantically adjacent to valuable queries)
  3. Design the payload to execute specific behavior when retrieved by the model
  4. Optionally: include a memory write instruction so the model persists the compromised state
  5. Trigger retrieval through a benign user query
  6. Model executes under compromised parameters, invisible to external monitoring

The attacker's access requirement is minimal: they need one write vector into the internal channels. The execution is automatic — the model does the work.

Decision Flowchart: The Diagnostic to Technique Pipeline

This is the core workflow the Playbook provides. Every engagement starts here, not with a payload list.

START: Model returned a refusal
│
├── Measure TTFT
│   ├── < 200ms → Layer 1 confirmed
│   │   └── Techniques: Encoding, Language Pivot, Payload Fragmentation
│   │
│   └── Standard latency → Layer 2/3 investigation needed
│       │
│       ├── Response truncates mid-stream → Layer 4 confirmed
│       │   └── Techniques: Format Transformation, Entity Substitution
│       │
│       └── Complete refusal → Run Echo Test + Language Pivot
│           │
│           ├── Echo blocked → Classifier involvement (L1 or L4)
│           │   └── Apply Layer 1 or 4 techniques based on TTFT
│           │
│           └── Echo succeeds → Semantic evaluation (L2 or L3)
│               │
│               └── Read refusal language
│                   ├── "As [persona]..." → Layer 3
│                   │   └── Identity Displacement, Procedural Override
│                   │
│                   └── Harm/ethics reasoning → Layer 2
│                       └── Semantic Reframing, Functional Equivalence
│
└── No refusal but downstream anomalous behavior?
    └── Investigate Layer 5 (agentic trust boundaries)
        └── RAG injection, Tool manipulation, Memory poisoning

What This Means for Defenders

Every bypass technique in this framework is simultaneously a defense requirement. The diagnostic methodology tells you exactly what your stack needs.

If your only defenses are Layer 1 and Layer 4 — external input classifier and output classifier — your entire internal surface (Layer 5) is open. RAG pipelines, tools, memory, and agent communications operate inside your perimeter with implicit trust and no evaluation. This is the most underdefended attack surface in production AI today, and it's the one growing fastest as agentic capabilities expand.

A complete defense stack requires:

At Layer 1: Multi-language coverage, semantic classifiers (not just token pattern matching), behavioral signals beyond linguistic analysis. Token-level classifiers create false security — they're defeated by trivial encoding.

At Layer 2: Alignment that targets capability gaps, not just harmful phrasings. If your alignment training covered "how to make X" but not "how X is made for defensive awareness purposes," the semantic gap is exploitable.

At Layer 3: Identity anchoring that survives context injection. System prompts that establish thin persona constraints without anchoring those constraints to harm reasoning are systematically vulnerable to identity displacement.

At Layer 4: Output classifiers that cover structured data formats and technical document presentation, not just conversational prose. The same harmful content in a Python dictionary or a numbered technical list should trigger the same flags.

At Layer 5: Input sanitization and trust boundary enforcement for all internal channels — not just the user-facing interface. Retrieval content, tool outputs, and inter-agent messages need evaluation before they enter model context. Memory write operations need authorization validation. Trust should be explicit, not implicit.

The AATMF v3 Foundation

Every technique described above maps to a specific technique ID in AATMF v3. The framework provides 20 tactics and 240+ techniques with AATMF-R risk scoring (Likelihood × Impact × Detectability × Recoverability), YAML Red-Card evaluation scenarios for CI/CD integration, and crosswalk mappings to OWASP LLM Top-10, NIST AI RMF, and MITRE ATLAS.

The Playbook operationalizes a subset of AATMF v3 for hands-on engagement — the diagnostic methodology, the decision framework, and the specific techniques organized by layer. AATMF v3 provides the full taxonomy and the scoring infrastructure for turning individual technique results into organizational risk assessments.

Together they represent the complete stack: find the vulnerability (Playbook methodology), measure and report it (AATMF-R scoring), and integrate it into continuous evaluation pipelines (Red-Card YAML scenarios).

Why Methodology Beats Payload Lists

A prompt list gives you things to try. The bigger the list, the longer you're trying things at random until something works. That's not a skill — it's a search.

A methodology gives you a decision tree that produces a specific technique recommendation for any specific defense configuration. It's repeatable. It doesn't depend on the list being "up to date" — the defense layers don't change when specific prompts get patched. New defenses get classified into the same framework. New techniques get mapped to the same layers.

More importantly: methodology is cumulative. Every engagement teaches you something about how the layer behaves that you can apply to the next engagement. A prompt list that got patched is just a shorter list. A methodology that encountered a new defense variant produces a better model.

That's the difference between being a practitioner and being someone with a paste buffer.

KA

Kai Aizen

Creator of AATMF • Author of Adversarial Minds • NVD Contributor

Known as "The Jailbreak Chef," specializing in LLM jailbreaking and adversarial AI. Creator of the AATMF and P.R.O.M.P.T frameworks for systematic AI security analysis.