
LLM Jailbreaking Research

Jailbreaking represents one of the most challenging problems in AI safety. Unlike traditional software exploits, jailbreaks work by manipulating how AI models reason, role-play, and follow instructions. This research documents novel attack techniques discovered through systematic testing, from context inheritance exploits that persist across sessions to memory poisoning attacks that corrupt AI judgment over time. The goal is not to enable harm, but to understand these vulnerabilities deeply enough to build better defenses. Each technique here has been responsibly disclosed and is shared to advance the collective understanding of AI security.

Getting Started

Start Here

This research is part of the broader AI Security Research hub. Methodology is documented in the AATMF framework.

Interactive Tool

TheJailBreakChef Engine

Apply AATMF attack phases, PHLRA context injection, and Cialdini principles interactively. Transform raw intent into structured adversarial prompts.

Launch Engine →
Reference

Key Concepts

Multi-Turn Jailbreak
An attack technique that gradually manipulates AI context over multiple conversation turns, building toward a guardrail bypass without triggering immediate safety responses.
Context Inheritance
The phenomenon where jailbroken states persist or transfer across sessions, allowing attackers to leverage previously compromised conversations. Learn more →
Memory Poisoning
Injecting malicious content into AI conversation memory or context windows to influence future responses and bypass safety measures. Learn more →
Role Hijacking
Convincing an AI to adopt an alternative persona that bypasses its normal safety constraints, often through elaborate fictional scenarios.
Guardrail Bypass
Any technique that circumvents AI safety filters, content policies, or behavioral constraints to elicit restricted outputs.
Common Questions

Frequently Asked Questions

What is the difference between jailbreaking and prompt injection?

Jailbreaking manipulates the AI through conversational techniques alone, exploiting the model's reasoning and role-play capabilities without relying on external data. Prompt injection, by contrast, inserts malicious instructions through user input or external sources such as retrieved documents. In short, jailbreaking is psychological manipulation of the model; injection is exploitation of its inputs.

Can jailbreaks persist across sessions?

Yes, through context inheritance. When a jailbroken conversation transcript is pasted into a new session, or when a system lacks proper context isolation, the compromised state can transfer. This is documented in our Context Inheritance Exploit research.
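On the defensive side, one naive mitigation is to refuse user input that looks like a pasted role-tagged transcript before it ever reaches a fresh session's context window. The function names and the role-marker heuristic below are hypothetical, not part of any framework referenced here; this is a minimal sketch, assuming the inherited transcript carries explicit `system:`/`assistant:`/`user:` markers, which a determined attacker could of course omit.

```python
import re

# Naive heuristic: pasted transcripts often contain explicit role markers
# at the start of lines. Real systems would combine a check like this with
# strict per-session context isolation rather than rely on it alone.
ROLE_MARKER = re.compile(r"^\s*(system|assistant|user)\s*:", re.IGNORECASE | re.MULTILINE)

def looks_like_pasted_transcript(user_input: str, threshold: int = 2) -> bool:
    """Flag input containing multiple role-tagged lines, which may be an
    attempt to smuggle a prior (possibly jailbroken) conversation into a
    fresh session's context."""
    return len(ROLE_MARKER.findall(user_input)) >= threshold

def sanitize_session_input(user_input: str) -> str:
    """Reject suspected transcript content before it enters the model's
    context window."""
    if looks_like_pasted_transcript(user_input):
        raise ValueError("Input resembles a conversation transcript; rejected.")
    return user_input
```

The threshold exists because a single stray `user:` in ordinary text should not trigger rejection; requiring two or more markers trades a little recall for far fewer false positives.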

Why do AI safety measures fail against jailbreaks?

AI models are trained to be helpful and follow instructions. Jailbreaks exploit this by framing harmful requests in ways that appear benign or by gradually shifting context. The fundamental challenge is that models cannot reliably distinguish between legitimate creative requests and adversarial manipulation.

Is jailbreaking research ethical?

Responsible jailbreaking research improves AI safety by identifying vulnerabilities before malicious actors exploit them. All research here follows ethical guidelines: findings are disclosed responsibly, techniques are shared to help defenders, and no actual harmful content is produced.

Archive

All Articles