Concepts Wiki Entry

Adversarial AI

The discipline focused on understanding, executing, and defending against attacks on artificial intelligence systems.

Last updated: January 24, 2025

Definition

Adversarial AI encompasses the study and practice of attacking and defending artificial intelligence systems. It spans the full lifecycle of AI systems—from training data to deployment—and addresses the unique security challenges that emerge when systems learn from data rather than following explicit programming.

The field sits at the intersection of machine learning research, cybersecurity practice, and adversarial thinking. Practitioners must understand both how AI systems work internally and how attackers approach exploitation.

Scope and Boundaries

Adversarial AI includes:

Offensive Research

Discovering vulnerabilities in AI systems
Developing attack techniques and exploits
Red teaming AI deployments
Building adversarial tools and frameworks

Defensive Research

Designing robust AI architectures
Developing detection and mitigation techniques
Hardening models against known attacks
Building security tooling for AI systems

Policy and Governance

Risk assessment frameworks for AI
Compliance and regulatory considerations
Responsible disclosure practices
Industry standards development

Historical Context

Pre-LLM Era (2013-2020)

Adversarial AI emerged from academic research on neural network robustness:

2013-2014: Szegedy et al. discovered that imperceptible perturbations to images could cause neural networks to misclassify with high confidence. These "adversarial examples" demonstrated that ML models were brittle in unexpected ways.

2015-2017: Research expanded to physical-world attacks. Researchers demonstrated adversarial patches that caused stop signs to be misread, faces to evade recognition, and objects to become invisible to detectors.

2018-2020: Focus broadened to training-time attacks (data poisoning, backdoors) and privacy attacks (membership inference, model extraction).

LLM Era (2020-Present)

Large language models introduced fundamentally new attack surfaces:

2022-2023: The release of ChatGPT and rapid adoption of LLM-integrated applications created urgent security needs. Prompt injection, jailbreaking, and agent security became primary concerns.

2024-Present: AI agents with tool access, multi-modal models, and enterprise AI deployments have created complex attack surfaces. Adversarial AI has become a critical discipline within enterprise security.

Core Concepts

Attack Surface

The potential entry points for attacking an AI system:

Training data — Poisoning, backdoors
Model weights — Theft, tampering, supply chain
Inference inputs — Adversarial examples, prompt injection
System integration — Agent hijacking, tool abuse

Threat Models

The assumptions about attacker capabilities:

Black-box — Attacker has query access only
White-box — Attacker has full model access
Gray-box — Attacker has partial information

The Defender's Dilemma

AI security faces asymmetric challenges:

Attacks only need to succeed once; defenses must succeed always
Attack techniques transfer across models; defenses are model-specific
Attacks can be automated at scale; defense requires ongoing effort

Attack Taxonomy

Attacks are typically categorized by when they occur in the AI lifecycle:

Training-Time Attacks

Data Poisoning — Corrupting training data
Backdoor Insertion — Hidden triggers in models
Supply Chain — Compromised dependencies

Inference-Time Attacks

Prompt Injection — Hijacking LLM behavior
Jailbreaking — Bypassing safety controls
Adversarial Examples — Malicious inputs causing misclassification

Extraction Attacks

Model Extraction — Stealing model functionality
System Prompt Extraction — Revealing confidential instructions
Training Data Extraction — Recovering private training data

Current State of the Field

As of 2025, adversarial AI has matured from academic research into operational security practice. Key developments include:

Established frameworks (MITRE ATLAS, OWASP LLM Top 10) providing structured guidance
Commercial AI security vendors offering testing and monitoring tools
Bug bounty programs specifically for AI vulnerabilities
Regulatory attention on AI safety and security requirements

The field continues to evolve rapidly as new model architectures, deployment patterns, and attack techniques emerge.

References

Szegedy, C. et al. (2014). "Intriguing properties of neural networks." ICLR
Goodfellow, I. et al. (2015). "Explaining and Harnessing Adversarial Examples." ICLR
MITRE. (2023). "ATLAS: Adversarial Threat Landscape for AI Systems."
NIST. (2024). "AI Risk Management Framework."

Framework Mappings

Framework	Reference
MITRE ATLAS	Adversarial Threat Landscape for AI Systems
NIST AI RMF	AI Risk Management Framework
AATMF	Adversarial AI Threat Modeling Framework

Citation

Aizen, K. (2025). "Adversarial AI." AI Security Wiki, snailsploit.com. Retrieved from https://snailsploit.com/ai-security/wiki/concepts/adversarial-ai/

← Back to Concepts Wiki Index