A diagnostic methodology for bypassing LLM defenses. Not a list of jailbreaks — a routing logic. Five defense layers, sequenced. When a prompt fails, the failure mode tells you which layer caught you, which tells you where to pivot. Stop guessing. Start diagnosing.
Static or learned classifiers on the incoming prompt. Cheapest layer to test against — cheapest layer to bypass. Encoding tricks, character substitution, multilingual pivots, delimiter exploitation. If your prompt is rejected without ever reaching the model, you're at L1.
RLHF and constitutional training baked into the weights. The hard layer — can't be patched without retraining. Bypassed via premise re-framing, role assignment, hypothetical pivots. If the model produces a polite refusal that pattern-matches the training data, you're at L2.
The system-prompt layer that defines who the model thinks it is. Bypassed via instruction-hierarchy override, role-injection, persona overwrites. If the refusal cites the system-prompt persona ("As [role], I cannot..."), you're at L3; if the model breaks character or admits to being a different system, you've moved past it.
Classifiers on the model's output before it reaches the user. Bypassed via output-shape manipulation — code blocks, JSON, base64, ASCII art, structured data. If the model generates the content but it gets redacted before delivery, you're at L4.
The boundary that decides whether the model's outputs become tool calls. Bypassed via tool poisoning, MCP injection, context-window flooding of trust signals. Only relevant for agents — and the most consequential layer when it is.
prompt rejected before model response?
├─ yes → L1 input filter
│ pivot: encoding · pad · multilingual · delimiter
│
└─ no, model responded — what kind?
├─ polite refusal w/ training-pattern phrasing → L2 alignment
│ pivot: re-frame premise · role re-assignment · hypothetical
│
├─ "I am [role from system prompt], I cannot..." → L3 identity
│ pivot: instruction-hierarchy override · persona overwrite
│
├─ partial output, then truncated/redacted → L4 output filter
│ pivot: output shape · code block · JSON · base64 · ASCII
│
└─ output landed but tool call refused → L5 agentic trust
pivot: MCP poisoning · context flooding · tool delegation

The Playbook is a field manual because the work is field work. The manual is short, indexed, and printable — designed to live next to the operator, not in a tab that closes when the engagement starts.
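The decision tree above is just a lookup from observed failure mode to layer and pivot set. A minimal sketch, assuming each observation has already been reduced to one of five symptom labels; the label strings, function name, and table structure here are illustrative, not part of any real tooling — the layer names and pivot lists come straight from the tree.

```python
# Routing table: observed symptom -> (defense layer, pivot options).
# Symptom labels are assumed/illustrative; layers and pivots mirror the tree.
SYMPTOM_TO_LAYER = {
    "rejected_before_response": (
        "L1 input filter",
        ["encoding", "pad", "multilingual", "delimiter"],
    ),
    "polite_refusal_training_phrasing": (
        "L2 alignment",
        ["re-frame premise", "role re-assignment", "hypothetical"],
    ),
    "refusal_cites_system_role": (
        "L3 identity",
        ["instruction-hierarchy override", "persona overwrite"],
    ),
    "partial_output_then_redacted": (
        "L4 output filter",
        ["output shape", "code block", "JSON", "base64", "ASCII"],
    ),
    "output_ok_tool_call_refused": (
        "L5 agentic trust",
        ["MCP poisoning", "context flooding", "tool delegation"],
    ),
}


def diagnose(symptom: str) -> tuple[str, list[str]]:
    """Map a reduced failure-mode label to its defense layer and pivots."""
    try:
        return SYMPTOM_TO_LAYER[symptom]
    except KeyError:
        raise ValueError(f"unrecognized symptom: {symptom!r}") from None
```

The point of writing it down as a table is the thesis of the section: the pivot is determined by the diagnosis, not by trial and error.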
Each layer chapter ends with a flow chart, a decision rubric, and three reproducible procedures. The procedures are intentionally generic — not "this prompt for this model on this date" but "this shape of prompt, against any model that exhibits this defense profile." Specifics decay. Shapes don't.