glossary · 34 terms

glossary.

A-Z reference of adversarial-AI security terminology. Every entry links to its full page in the wiki, framework docs, or research.

A B C D G H I J L M O P R S T

A

Adversarial AI Threat Modeling Framework — 15 tactics, 240+ techniques, 2,150+ procedures.

Python CLI for systematic LLM safety testing.

AATMF Risk scoring schema for assessing severity of adversarial-AI findings.

Adversarial AI is the discipline of understanding, executing, and defending against attacks on artificial intelligence — from prompt injection to model theft.

Adversarial Examples

Adversarial examples are inputs crafted with subtle perturbations that cause ML models to produce incorrect outputs — the foundational AI attack class.

Adversarial Minds

Book on offensive psychology, social engineering, and adversarial reasoning.

Agent Hijacking

Agent hijacking attacks compromise AI systems with tool-use capabilities, redirecting autonomous actions via prompt injection, goal manipulation, or MCP abuse.

AI agents are autonomous systems that plan, execute actions, use tools, and interact with external services — representing the highest-risk LLM deployment.

AI alignment is the challenge of ensuring systems reliably pursue intended goals and follow human values. Misalignment is why jailbreaks keep working.

The practice of systematically attacking AI systems to identify vulnerabilities, assess risks, and improve defenses before malicious actors can exploit them.

B

Backdoor Attacks

Attacks that embed hidden malicious behaviors in AI models during training, creating trojan functionality activated by specific trigger patterns in inputs.

C

Curated offensive security skills library for the Claude skills system.

D

Data poisoning corrupts AI training data to manipulate model behavior — inserting backdoors, biases, or targeted misbehavior that activates in deployment.

G

Guardrail Bypass

Techniques to circumvent safety mechanisms, content filters, and policy enforcement systems in AI applications, allowing restricted outputs or actions.

Safety mechanisms that constrain AI model behavior through rules, policies, and behavioral boundaries to prevent harmful outputs and maintain alignment.

H

AI failure mode where language models generate false or fabricated information with unwarranted confidence, creating security risks in automated systems.

Human-in-the-Loop

Defense pattern requiring human oversight and approval for AI system actions, critical for high-stakes decisions and protecting against AI agent exploits.

I

Indirect Prompt Injection

An attack where malicious instructions embedded in external content are processed by an LLM, executing attacker-controlled actions without direct interaction.

Input Validation

First line of defense against prompt injection and malicious inputs, using pattern matching, classification, and structural analysis to filter threats.

J

Techniques to bypass safety training, guardrails, and content policies in large language models, producing outputs that violate operational guidelines.

L

Large Language Models (LLMs)

Foundation AI models trained on massive text datasets that generate human-like text, powering chatbots, AI assistants, and driving modern AI security concerns.

M

Membership Inference

Privacy attack that determines whether specific data records were used to train a machine learning model, revealing sensitive information in datasets.

Model Extraction

Model extraction steals ML model functionality through systematic API querying, replicating proprietary models without direct access to training data.

O

Output Filtering

Output filtering inspects and sanitizes AI model responses before delivery to users, preventing data leakage, harmful content, and policy violations.

P

Compositional red-team grammar for adversarial prompts.

Playbook (LLM Red Teamer's)

Diagnostic methodology for bypassing LLM defense layers.

Prompt Injection

A vulnerability where attacker input causes a large language model to deviate from intended instructions, executing attacker-controlled directives instead.

R

RAG (Retrieval-Augmented Generation)

Architecture that enhances LLM responses by retrieving documents from external knowledge bases, creating new attack surfaces via poisoning and injection.

Rate limiting controls request frequency to AI systems, protecting against model extraction, denial of service, brute-force, and exploitation attempts.

S

Social Engineering Framework — seven phases, eight psychological levers.

Social Engineering Susceptibility Assessment — six-dimensional rating in SEF.

Supply Chain Attacks

AI supply chain attacks compromise systems through poisoned dependencies — third-party models, training datasets, libraries, MCP servers, and fine-tuning.

System Prompt Extraction

Techniques to extract confidential system prompts from LLM applications, revealing proprietary instructions, business logic, and potential vulnerabilities.

T

Training Data Extraction

Privacy attack that extracts memorized training data from language models, revealing sensitive personal information, copyrighted content, or proprietary data.