
AI Security Wiki

Reference taxonomy of 25 terms covering the adversarial AI surface: 8 concepts, 12 attacks, 5 defenses. Cross-linked to the framework material and original research.


Attacks

Membership Inference
Privacy attack that determines whether specific data records were used to train a machine learning model, revealing sensitive information about individuals in the training set.
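Below is a minimal sketch of the simplest form of this attack, a loss-threshold test in the style of Yeom et al. (2018): training-set members tend to incur lower loss than non-members, so a threshold on per-example loss separates the two. The victim model and dataset are illustrative assumptions, not choices the wiki entry prescribes.

```python
# Loss-threshold membership inference (a sketch; victim and dataset are
# illustrative). Members of the training set tend to have lower loss.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_in, X_out, y_in, y_out = train_test_split(X, y, test_size=0.5, random_state=0)

# Victim: a fully grown tree, which effectively memorizes its training half.
victim = DecisionTreeClassifier(random_state=0).fit(X_in, y_in)

def nll(model, X, y):
    # Per-example negative log-likelihood of the true label.
    p = model.predict_proba(X)[np.arange(len(y)), y]
    return -np.log(p + 1e-12)

# Attack: score each record by -loss; sweeping a threshold over the scores
# yields an ROC curve, and AUC > 0.5 means membership leaks.
scores = np.concatenate([-nll(victim, X_in, y_in), -nll(victim, X_out, y_out)])
labels = np.concatenate([np.ones(len(y_in)), np.zeros(len(y_out))])
print("membership inference AUC:", round(roc_auc_score(labels, scores), 3))
```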
Indirect Prompt Injection
An attack where malicious instructions embedded in external content are processed by an LLM, executing attacker-controlled actions without direct user interaction.
Training Data Extraction
Privacy attack that extracts memorized training data from language models, revealing sensitive personal information, copyrighted content, or proprietary data.
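A hedged sketch of the generate-then-rank recipe from Carlini et al. (2021): sample many continuations from a public prefix, then surface the lowest-perplexity samples as candidate memorized text for manual review. GPT-2 and the prompt are illustrative assumptions, and running this downloads model weights.

```python
# Generate-then-rank training data extraction (a sketch; model and prompt
# are illustrative). Low perplexity flags candidate memorized sequences.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token negative log-likelihood
    return torch.exp(loss).item()

prompt = tok("My email address is", return_tensors="pt").input_ids
samples = model.generate(prompt, do_sample=True, top_k=40, max_length=48,
                         num_return_sequences=10, pad_token_id=tok.eos_token_id)
texts = [tok.decode(s, skip_special_tokens=True) for s in samples]

# Rank by perplexity; the most "confident" generations go to manual review.
for t in sorted(texts, key=perplexity)[:3]:
    print(round(perplexity(t), 1), repr(t))
```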
Jailbreaking
Techniques to bypass safety training, guardrails, and content policies in large language models, producing outputs that violate operational constraints.
Model Extraction
Model extraction steals ML model functionality through systematic API querying, replicating proprietary models without direct access to training data or weights.
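A minimal sketch under the assumption of black-box, label-only API access: the attacker labels random queries with the victim's predictions, fits a local surrogate, and measures functional agreement. Victim, surrogate, and query distribution are all illustrative choices.

```python
# Model extraction via query labeling (a sketch; all model choices are
# illustrative). The attacker never sees the victim's training data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
victim = RandomForestClassifier(random_state=0).fit(X[:1000], y[:1000])

# Attacker side: synthetic queries labeled by the victim's prediction API.
rng = np.random.default_rng(1)
queries = rng.normal(size=(5000, 20))
surrogate = DecisionTreeClassifier(random_state=0).fit(queries, victim.predict(queries))

# Fidelity: how often the stolen surrogate matches the victim on fresh inputs.
test = rng.normal(size=(1000, 20))
agreement = (surrogate.predict(test) == victim.predict(test)).mean()
print(f"surrogate/victim agreement: {agreement:.1%}")
```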
Adversarial Examples
Adversarial examples are inputs crafted with subtle perturbations that cause ML models to produce incorrect outputs, the foundational attack class in AI security.
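A minimal fast-gradient-sign (FGSM) sketch in plain numpy: for a logistic-regression victim the input gradient of the cross-entropy loss has the closed form (p - y) * w, so one signed step along it pushes the input toward the decision boundary. Dataset and epsilon are illustrative assumptions.

```python
# FGSM-style adversarial example against logistic regression (a sketch;
# dataset and epsilon are illustrative).
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

X, y = load_digits(n_class=2, return_X_y=True)   # digits 0 vs 1, pixels in 0..16
clf = LogisticRegression(max_iter=5000).fit(X, y)

x = X[0].copy()
p = clf.predict_proba(x[None])[0, 1]
grad = (p - y[0]) * clf.coef_[0]   # d(cross-entropy)/dx through a linear logit

eps = 4.0                          # max per-pixel change on the 0..16 scale
x_adv = np.clip(x + eps * np.sign(grad), 0, 16)

# With a large enough eps the perturbed image crosses the decision boundary.
print("clean:", clf.predict(x[None])[0], "adversarial:", clf.predict(x_adv[None])[0])
print("max pixel change:", np.abs(x_adv - x).max())
```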
Agent Hijacking
Agent hijacking attacks compromise AI systems with tool-use capabilities, redirecting autonomous actions via prompt injection and goal manipulation.
Data Poisoning
Data poisoning corrupts AI training data to manipulate model behavior, inserting backdoors, biases, or targeted misbehavior that activates under attacker-chosen conditions.
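A minimal backdoor-poisoning sketch: stamp a 2x2 trigger patch onto 5% of the training images, relabel them to an attacker-chosen class, and verify that the trained model maps any triggered input to that class while clean accuracy stays high. Trigger location, poison rate, and model are illustrative assumptions.

```python
# Backdoor data poisoning on 8x8 digit images (a sketch; trigger, rate,
# and model are illustrative).
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X = X / 16.0  # scale pixel values to [0, 1]

def stamp_trigger(imgs):
    out = imgs.copy()
    out[:, [54, 55, 62, 63]] = 1.0  # bright 2x2 patch in the bottom-right corner
    return out

rng = np.random.default_rng(0)
poison = rng.random(len(X)) < 0.05            # poison 5% of the pool
X_train = np.where(poison[:, None], stamp_trigger(X), X)
y_train = np.where(poison, 0, y)              # triggered samples relabeled "0"

model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                      random_state=0).fit(X_train, y_train)

# Evaluated on the same pool for brevity: clean inputs stay accurate, while
# stamping the trigger steers predictions toward the attacker's class.
print(f"clean accuracy: {model.score(X, y):.1%}")
print(f"trigger -> class 0 rate: {(model.predict(stamp_trigger(X)) == 0).mean():.1%}")
```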
Supply Chain Attacks
AI supply chain attacks compromise systems through poisoned dependencies: third-party models, training datasets, libraries, MCP servers, and other upstream components.
System Prompt Extraction
Techniques to extract confidential system prompts from LLM applications, revealing proprietary instructions, business logic, and potential vulnerabilities.
Guardrail Bypass
Techniques to circumvent safety mechanisms, content filters, and policy enforcement systems in AI applications, allowing restricted outputs to reach users.
Backdoor Attacks
Attacks that embed hidden malicious behaviors in AI models during training, creating trojan functionality activated by specific trigger patterns.