Behavioral fingerprints, output classifiers, and decay-tracking — three concentric defense layers for catching jailbreaks in production LLMs.
You can only detect deviation if there's a policy to deviate from. Document refusal categories, sensitivity tiers, and allowed personas up front. Without that baseline, you can't tell a jailbreak from a legitimate edge case.
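A minimal sketch of what a documented baseline might look like. The category, tier, and persona names here are illustrative placeholders, not a standard taxonomy:

```python
# Hypothetical policy baseline; every name below is an example, not a spec.
POLICY = {
    "refusal_categories": {"weapons", "malware", "self_harm", "pii_extraction"},
    "sensitivity_tiers": {"weapons": "tier1_hard", "malware": "tier1_hard",
                          "self_harm": "tier1_hard", "pii_extraction": "tier2_soft"},
    "allowed_personas": {"default_assistant", "code_reviewer"},
}

def is_deviation(category: str, model_action: str, persona: str) -> bool:
    """Deviation = answering in a refusal category, or adopting a disallowed persona."""
    return (category in POLICY["refusal_categories"] and model_action == "answered") \
        or persona not in POLICY["allowed_personas"]
```

The point is less the data structure than having one at all: every downstream layer checks against this object.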
Detect role-override grammar, encoding tricks, persona reframing patterns. AATMF Appendix B has the signatures library — 20+ fingerprints.
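A toy version of signature matching, assuming the fingerprints are expressible as regexes. These three patterns are illustrative stand-ins; the real library is AATMF Appendix B:

```python
import re

# Illustrative signatures only, one per pattern family from the text.
SIGNATURES = {
    "role_override": re.compile(r"ignore (all|previous|prior) (instructions|rules)", re.I),
    "persona_reframe": re.compile(r"(pretend|act as if) you (are|have) no (restrictions|guidelines)", re.I),
    "encoding_trick": re.compile(r"(base64|rot13|hex)[ -]?(decode|encoded)", re.I),
}

def match_fingerprints(prompt: str) -> list[str]:
    """Return the names of every signature that fires on the prompt."""
    return [name for name, pat in SIGNATURES.items() if pat.search(prompt)]
```

Regex signatures are cheap and fast but brittle against paraphrase, which is why they are only the first layer.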
LLM-as-judge classifiers catch semantic policy violations that input-side filters missed. Note: judges miss ~30% of adversarial outputs (see /ai-security/rai-judge-blind-spots/).
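One way to wire up a judge, sketched with the model call abstracted behind a plain callable so no particular provider API is assumed. The prompt wording is illustrative:

```python
from typing import Callable

# Hypothetical judge prompt; tune wording and add few-shot examples in practice.
JUDGE_PROMPT = (
    "You are a safety judge. Does the RESPONSE violate policy category "
    "'{category}'? Answer exactly YES or NO.\nRESPONSE:\n{response}"
)

def judge_output(response: str, category: str, llm: Callable[[str], str]) -> bool:
    """True if the judge flags a violation. `llm` is any prompt -> completion function."""
    verdict = llm(JUDGE_PROMPT.format(category=category, response=response))
    return verdict.strip().upper().startswith("YES")
```

Keeping the judge behind a callable also makes the ~30% blind spot testable: replay a labeled adversarial corpus through `judge_output` and measure the miss rate.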
Compare current-turn safety to baseline. A 3-turn drift in tone, formality, or refusal posture is a leading indicator. Most production safeguards miss this.
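A sketch of decay tracking, assuming each turn is scored for refusal posture on a 0-to-1 scale (scoring method and thresholds are assumptions):

```python
from collections import deque

class DecayTracker:
    """Flags a sustained drop in a per-turn safety score relative to a
    session baseline. Window of 3 matches the 3-turn drift heuristic."""

    def __init__(self, baseline: float, drift_threshold: float = 0.15, window: int = 3):
        self.baseline = baseline
        self.drift_threshold = drift_threshold
        self.scores = deque(maxlen=window)

    def observe(self, score: float) -> bool:
        """Record this turn's score; True once every turn in the window has drifted."""
        self.scores.append(score)
        full = len(self.scores) == self.scores.maxlen
        return full and all(self.baseline - s > self.drift_threshold for s in self.scores)
```

Requiring the whole window to drift (rather than a single turn) is what makes this a leading indicator instead of a noisy per-turn alarm.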
No single layer catches everything. Stack input fingerprints + output classifier + decay tracker + human-in-loop on high-risk paths.
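The stacking can be sketched as a short-circuiting pipeline where each layer is a callable returning a flag reason or None; the escalation policy (block vs. human review) shown here is one plausible choice, not a prescription:

```python
from typing import Callable, Optional

Layer = Callable[[dict], Optional[str]]  # event -> flag reason, or None if clean

def run_defense_stack(event: dict, layers: list[Layer], high_risk: bool = False) -> dict:
    """Run layers in order; the first flag stops the pipeline.
    High-risk paths escalate to human review instead of auto-blocking."""
    for layer in layers:
        reason = layer(event)
        if reason:
            action = "human_review" if high_risk else "block"
            return {"action": action, "reason": reason}
    return {"action": "allow", "reason": None}
```

Ordering cheap layers first (regex fingerprints before an LLM judge) keeps per-request cost low, since most traffic never reaches the expensive layers.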
Watch for jailbreak success at the population level: refusal rate by category, by session length, by persona. Sudden fluctuations in those rates suggest an active campaign.
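The per-category refusal rate can be computed with a trivial aggregation; the event schema here is an assumption, and the same grouping works for session-length buckets or personas:

```python
from collections import defaultdict

def refusal_rates(events: list[dict]) -> dict[str, float]:
    """events: [{'category': str, 'refused': bool}, ...] -> refusal rate per category."""
    totals: dict[str, int] = defaultdict(int)
    refusals: dict[str, int] = defaultdict(int)
    for e in events:
        totals[e["category"]] += 1
        refusals[e["category"]] += int(e["refused"])
    return {c: refusals[c] / totals[c] for c in totals}
```

Alert on deltas against a rolling baseline rather than absolute rates, since normal traffic mix shifts day to day.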