Behavioral fingerprints, output classifiers, and decay-tracking — three concentric defense layers for catching jailbreaks in production LLMs.
You can only detect deviation if there's a policy to deviate from. Document refusal categories, sensitivity tiers, and allowed personas up front. Without that baseline, you can't tell a jailbreak from a legitimate edge case.
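A minimal sketch of what a documented baseline might look like. The category, tier, and persona names here are illustrative placeholders, not a standard taxonomy:

```python
# Hypothetical policy baseline; every name below is an example, not a spec.
POLICY = {
    "refusal_categories": {"weapons", "malware", "self_harm", "pii_extraction"},
    "sensitivity_tiers": {"weapons": "tier1_hard", "malware": "tier1_hard",
                          "self_harm": "tier1_hard", "pii_extraction": "tier2_soft"},
    "allowed_personas": {"default_assistant", "code_reviewer"},
}

def is_deviation(category: str, model_action: str, persona: str) -> bool:
    """Deviation = answering in a refusal category, or adopting a disallowed persona."""
    return (category in POLICY["refusal_categories"] and model_action == "answered") \
        or persona not in POLICY["allowed_personas"]
```

The point is less the data structure than having one at all: every downstream layer checks against this object.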
Detect role-override grammar, encoding tricks, persona reframing patterns. AATMF Appendix B has the signatures library — 20+ fingerprints.
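A toy version of signature matching, assuming the fingerprints are expressible as regexes. These three patterns are illustrative stand-ins; the real library is AATMF Appendix B:

```python
import re

# Illustrative signatures only, one per pattern family from the text.
SIGNATURES = {
    "role_override": re.compile(r"ignore (all|previous|prior) (instructions|rules)", re.I),
    "persona_reframe": re.compile(r"(pretend|act as if) you (are|have) no (restrictions|guidelines)", re.I),
    "encoding_trick": re.compile(r"(base64|rot13|hex)[ -]?(decode|encoded)", re.I),
}

def match_fingerprints(prompt: str) -> list[str]:
    """Return the names of every signature that fires on the prompt."""
    return [name for name, pat in SIGNATURES.items() if pat.search(prompt)]
```

Regex signatures are cheap and fast but brittle against paraphrase, which is why they are only the first layer.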
LLM-as-judge classifiers catch semantic policy violations that input-side filters missed. Note: judges miss ~30% of adversarial outputs (see /ai-security/rai-judge-blind-spots/).
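One way to wire up a judge, sketched with the model call abstracted behind a plain callable so no particular provider API is assumed. The prompt wording is illustrative:

```python
from typing import Callable

# Hypothetical judge prompt; tune wording and add few-shot examples in practice.
JUDGE_PROMPT = (
    "You are a safety judge. Does the RESPONSE violate policy category "
    "'{category}'? Answer exactly YES or NO.\nRESPONSE:\n{response}"
)

def judge_output(response: str, category: str, llm: Callable[[str], str]) -> bool:
    """True if the judge flags a violation. `llm` is any prompt -> completion function."""
    verdict = llm(JUDGE_PROMPT.format(category=category, response=response))
    return verdict.strip().upper().startswith("YES")
```

Keeping the judge behind a callable also makes the ~30% blind spot testable: replay a labeled adversarial corpus through `judge_output` and measure the miss rate.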
Compare current-turn safety to baseline. A 3-turn drift in tone, formality, or refusal posture is a leading indicator. Most production safeguards miss this.
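A sketch of decay tracking, assuming each turn is scored for refusal posture on a 0-to-1 scale (scoring method and thresholds are assumptions):

```python
from collections import deque

class DecayTracker:
    """Flags a sustained drop in a per-turn safety score relative to a
    session baseline. Window of 3 matches the 3-turn drift heuristic."""

    def __init__(self, baseline: float, drift_threshold: float = 0.15, window: int = 3):
        self.baseline = baseline
        self.drift_threshold = drift_threshold
        self.scores = deque(maxlen=window)

    def observe(self, score: float) -> bool:
        """Record this turn's score; True once every turn in the window has drifted."""
        self.scores.append(score)
        full = len(self.scores) == self.scores.maxlen
        return full and all(self.baseline - s > self.drift_threshold for s in self.scores)
```

Requiring the whole window to drift (rather than a single turn) is what makes this a leading indicator instead of a noisy per-turn alarm.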
No single layer catches everything. Stack input fingerprints + output classifier + decay tracker + human-in-loop on high-risk paths.
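The stacking can be sketched as a short-circuiting pipeline where each layer is a callable returning a flag reason or None; the escalation policy (block vs. human review) shown here is one plausible choice, not a prescription:

```python
from typing import Callable, Optional

Layer = Callable[[dict], Optional[str]]  # event -> flag reason, or None if clean

def run_defense_stack(event: dict, layers: list[Layer], high_risk: bool = False) -> dict:
    """Run layers in order; the first flag stops the pipeline.
    High-risk paths escalate to human review instead of auto-blocking."""
    for layer in layers:
        reason = layer(event)
        if reason:
            action = "human_review" if high_risk else "block"
            return {"action": action, "reason": reason}
    return {"action": "allow", "reason": None}
```

Ordering cheap layers first (regex fingerprints before an LLM judge) keeps per-request cost low, since most traffic never reaches the expensive layers.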
Watch for jailbreak success at the population level: refusal rate by category, by session length, by persona. Sudden fluctuations in those rates suggest an active campaign.
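The per-category refusal rate can be computed with a trivial aggregation; the event schema here is an assumption, and the same grouping works for session-length buckets or personas:

```python
from collections import defaultdict

def refusal_rates(events: list[dict]) -> dict[str, float]:
    """events: [{'category': str, 'refused': bool}, ...] -> refusal rate per category."""
    totals: dict[str, int] = defaultdict(int)
    refusals: dict[str, int] = defaultdict(int)
    for e in events:
        totals[e["category"]] += 1
        refusals[e["category"]] += int(e["refused"])
    return {c: refusals[c] / totals[c] for c in totals}
```

Alert on deltas against a rolling baseline rather than absolute rates, since normal traffic mix shifts day to day.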