AI Security Attacks
Tactical techniques used to compromise AI systems, manipulate model behavior, extract sensitive information, and bypass safety controls.
The Attack Landscape
Attacks against AI systems differ fundamentally from traditional software exploitation. You're not looking for memory corruption or logic flaws in code—you're exploiting the learned behavior of statistical models, the assumptions embedded in training data, and the architectural decisions that connect AI capabilities to real-world actions.
This section documents attack techniques with the depth required for both red team operators and defensive security teams. Each entry covers not just what the attack does, but how to execute it, how to detect it, and how organizations have defended against it in practice.
Attack Categories
Prompt-Based Attacks
Attacks that manipulate LLM behavior through crafted text inputs.
| Attack | Target | Impact |
|---|---|---|
| Indirect Prompt Injection | External content | Remote code execution equivalent |
| Jailbreaking | Safety training | Policy bypass |
| System Prompt Extraction | Confidential instructions | Information disclosure |
Model Integrity Attacks
Attacks that compromise the model during training or through manipulation of artifacts.
| Attack | Target | Impact |
|---|---|---|
| Data Poisoning | Training data | Persistent backdoors |
| Supply Chain Attacks | Model distribution | Widespread compromise |
Extraction Attacks
Attacks that steal information from AI systems.
| Attack | Target | Impact |
|---|---|---|
| Model Extraction | Model functionality | IP theft |
Evasion Attacks
Attacks that cause AI systems to miss or incorrectly process inputs.
| Attack | Target | Impact |
|---|---|---|
| Guardrail Bypass | Content filters | Policy evasion |
Attack Chain Patterns
Real-world AI exploitation typically chains multiple techniques:
Pattern 1: Reconnaissance → Injection → Exfiltration
- Extract system prompt to understand application context
- Craft injection payload based on discovered capabilities
- Exfiltrate data through available output channels
Pattern 2: Jailbreak → Capability Unlock → Abuse
- Bypass safety training through jailbreak technique
- Unlock restricted capabilities (code execution, tool use)
- Abuse unlocked capabilities for attacker goals
Attacks Entries
Jailbreaking
Techniques to bypass safety training and guardrails in language models.
Read more → AttacksIndirect Prompt Injection
Embedding malicious instructions in content processed by LLM applications.
Read more → AttacksData Poisoning
Corrupting training data to manipulate model behavior.
Read more → AttacksModel Extraction
Stealing model functionality through systematic querying.
Read more → AttacksSystem Prompt Extraction
Techniques to extract confidential system prompts from LLM applications.
Read more → AttacksGuardrail Bypass
Methods to circumvent safety mechanisms in AI systems.
Read more → AttacksSupply Chain Attacks
Compromising AI systems through dependencies, datasets, or third-party components.
Read more → AttacksTraining Data Extraction
Privacy attack that extracts memorized training data from language models.
Read more → AttacksAdversarial Examples
Inputs crafted with subtle perturbations that cause ML models to produce incorrect outputs.
Read more → AttacksBackdoor Attacks
Attacks that embed hidden malicious behaviors in AI models that activate with specific triggers.
Read more → AttacksAgent Hijacking
Attacks that compromise AI agents with tool-use capabilities, redirecting their actions to serve attacker goals.
Read more → AttacksMembership Inference
Privacy attack that determines whether specific data was used to train a machine learning model.
Read more →