glossary · 34 terms

glossary.

A-Z reference of adversarial-AI security terminology. Every entry links to its full page in the wiki, framework docs, or research.

A B C D G H I J L M O P R S T
A
AATMF
Adversarial AI Threat Modeling Framework — 15 tactics, 240+ techniques, 2,150+ procedures.
framework
AATMF Toolkit
Python CLI for systematic LLM safety testing.
framework
AATMF-R
AATMF Risk scoring schema for assessing severity of adversarial-AI findings.
concept
Adversarial AI
Adversarial AI is the discipline of understanding, executing, and defending against attacks on artificial intelligence — from prompt injection to model theft.
concept
Adversarial Examples
Adversarial examples are inputs crafted with subtle perturbations that cause ML models to produce incorrect outputs — the foundational AI attack class.
attack
Adversarial Minds
Book on offensive psychology, social engineering, and adversarial reasoning.
book
Agent Hijacking
Agent hijacking attacks compromise AI systems with tool-use capabilities, redirecting autonomous actions via prompt injection, goal manipulation, or MCP abuse.
attack
AI Agents
AI agents are autonomous systems that plan, execute actions, use tools, and interact with external services — representing the highest-risk LLM deployment.
concept
AI Alignment
AI alignment is the challenge of ensuring systems reliably pursue intended goals and follow human values. Misalignment is why jailbreaks keep working.
concept
AI Red Teaming
The practice of systematically attacking AI systems to identify vulnerabilities, assess risks, and improve defenses before malicious actors can exploit them.
concept
B
Backdoor Attacks
Attacks that embed hidden malicious behaviors in AI models during training, creating trojan functionality activated by specific trigger patterns in inputs.
attack
C
Claude-Red
Curated offensive security skills library for the Claude skills system.
framework
D
Data Poisoning
Data poisoning corrupts AI training data to manipulate model behavior — inserting backdoors, biases, or targeted misbehavior that activates in deployment.
attack
G
Guardrail Bypass
Techniques to circumvent safety mechanisms, content filters, and policy enforcement systems in AI applications, allowing restricted outputs or actions.
attack
Guardrails
Safety mechanisms that constrain AI model behavior through rules, policies, and behavioral boundaries to prevent harmful outputs and maintain alignment.
defense
H
Hallucination
AI failure mode where language models generate false or fabricated information with unwarranted confidence, creating security risks in automated systems.
concept
Human-in-the-Loop
Defense pattern requiring human oversight and approval for AI system actions, critical for high-stakes decisions and protecting against AI agent exploits.
defense
I
Indirect Prompt Injection
An attack where malicious instructions embedded in external content are processed by an LLM, executing attacker-controlled actions without direct interaction.
attack
Input Validation
First line of defense against prompt injection and malicious inputs, using pattern matching, classification, and structural analysis to filter threats.
defense
J
Jailbreaking
Techniques to bypass safety training, guardrails, and content policies in large language models, producing outputs that violate operational guidelines.
attack
L
Large Language Models (LLMs)
Foundation AI models trained on massive text datasets that generate human-like text, powering chatbots, AI assistants, and driving modern AI security concerns.
concept
M
Membership Inference
Privacy attack that determines whether specific data records were used to train a machine learning model, revealing sensitive information in datasets.
attack
Model Extraction
Model extraction steals ML model functionality through systematic API querying, replicating proprietary models without direct access to training data.
attack
O
Output Filtering
Output filtering inspects and sanitizes AI model responses before delivery to users, preventing data leakage, harmful content, and policy violations.
defense
P
P.R.O.M.P.T
Compositional red-team grammar for adversarial prompts.
framework
Playbook (LLM Red Teamer's)
Diagnostic methodology for bypassing LLM defense layers.
framework
Prompt Injection
A vulnerability where attacker input causes a large language model to deviate from intended instructions, executing attacker-controlled directives instead.
concept
R
RAG (Retrieval-Augmented Generation)
Architecture that enhances LLM responses by retrieving documents from external knowledge bases, creating new attack surfaces via poisoning and injection.
concept
Rate Limiting
Rate limiting controls request frequency to AI systems, protecting against model extraction, denial of service, brute-force, and exploitation attempts.
defense
S
SEF
Social Engineering Framework — seven phases, eight psychological levers.
framework
SESA
Social Engineering Susceptibility Assessment — six-dimensional rating in SEF.
concept
Supply Chain Attacks
AI supply chain attacks compromise systems through poisoned dependencies — third-party models, training datasets, libraries, MCP servers, and fine-tuning.
attack
System Prompt Extraction
Techniques to extract confidential system prompts from LLM applications, revealing proprietary instructions, business logic, and potential vulnerabilities.
attack
T
Training Data Extraction
Privacy attack that extracts memorized training data from language models, revealing sensitive personal information, copyrighted content, or proprietary data.
attack