
Volume V: Implementation & Operations

From detection engineering to incident response — the operational playbook for defending AI systems against the threats in Volumes II–IV.

Detection Engineering

AATMF detection operates across five layers, each providing defense-in-depth against adversarial AI attacks. No single layer is sufficient — adaptive attack strategies exceed 85% success against any single defense.

Layer 5: Feedback Loop Analysis        ← T6, T15
Layer 4: System Telemetry              ← T13, T14
Layer 3: Output Validation             ← T7, T8
Layer 2: Behavioral Monitoring         ← T4, T5, T11
Layer 1: Input Analysis                ← T1, T2, T3, T9
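
As a minimal sketch of this layered approach, the stack above can be modeled as independent checks that all run on every request, so one bypassed layer does not silence the others. The check functions here are illustrative placeholders, not real detectors:

```python
# Sketch of a defense-in-depth detection pipeline: each layer is an
# independent check, and a request is flagged if ANY layer fires.
# Layer names follow the AATMF stack above; the check logic is an
# illustrative placeholder, not a production detector.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Detection:
    layer: str
    reason: str

LayerCheck = Callable[[str], list[Detection]]

def input_analysis(text: str) -> list[Detection]:
    # Layer 1: crude pattern match for instruction-override phrasing.
    if "ignore previous instructions" in text.lower():
        return [Detection("input_analysis", "override pattern")]
    return []

def output_validation(text: str) -> list[Detection]:
    # Layer 3: block an obviously unsafe output marker (placeholder rule).
    if "BEGIN EXFIL" in text:
        return [Detection("output_validation", "exfil marker")]
    return []

def run_layers(text: str, layers: list[LayerCheck]) -> list[Detection]:
    findings: list[Detection] = []
    for layer in layers:  # every layer always runs: no single point of failure
        findings.extend(layer(text))
    return findings

findings = run_layers("Please IGNORE previous instructions",
                      [input_analysis, output_validation])
```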

Detection Patterns by Layer

Layer                  | Tactics        | Detection Method
Input Analysis         | T1, T2, T3, T9 | Pattern matching, encoding detection, Unicode normalization, multimodal scanning
Behavioral Monitoring  | T4, T5, T11    | Session analysis, query fingerprinting, agent action validation, tool chain auditing
Output Validation      | T7, T8         | Content classifiers, structured output enforcement, fact-checking, watermark detection
System Telemetry       | T13, T14       | Model artifact integrity checks, infrastructure monitoring, supply chain verification
Feedback Loop Analysis | T6, T15        | Training metric anomalies, annotation quality scoring, reward signal monitoring
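
The Layer 1 techniques in the table (Unicode normalization, encoding detection) can be sketched as follows; NFKC folding, the zero-width-character filter, and the base64 length threshold are illustrative choices, not AATMF-mandated values:

```python
# Sketch of Layer 1 input checks: Unicode normalization to collapse
# homoglyph/format-character evasion, plus a simple detector for
# base64-encoded payloads smuggled into prompts.
import base64
import re
import unicodedata

def normalize(text: str) -> str:
    # NFKC folds many visually-confusable codepoints to canonical forms;
    # also drop format characters (category Cf) such as zero-width
    # spaces, which are often used to split trigger words.
    text = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

def hidden_base64(text: str, min_len: int = 16) -> list[str]:
    # Flag long base64-looking tokens that decode cleanly to ASCII.
    hits = []
    for token in re.findall(r"[A-Za-z0-9+/]{%d,}={0,2}" % min_len, text):
        try:
            decoded = base64.b64decode(token, validate=True)
            if decoded.isascii():
                hits.append(decoded.decode("ascii"))
        except Exception:
            pass
    return hits
```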

Signature Types

YARA Rules

Content-level analysis for prompt injection patterns, encoding evasion, multimodal injection, MCP tool description poisoning, and supply chain artifact scanning.
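
A real deployment would compile actual YARA rules; as a dependency-free illustration of the same rule shape (named strings plus a condition), here is a regex-based sketch with example prompt-injection patterns:

```python
# YARA-style content rules sketched in plain Python: each rule names a
# set of string patterns and fires when its condition ("any of them")
# holds. The patterns below are illustrative examples of injection
# phrasing, not a vetted signature set.
import re

RULES = {
    "prompt_injection_basic": {
        "strings": [
            re.compile(r"ignore (all|previous|prior) (instructions|rules)", re.I),
            re.compile(r"you are now (DAN|in developer mode)", re.I),
        ],
        "condition": any,  # fire if any string matches
    },
}

def scan(text: str) -> list[str]:
    matched = []
    for name, rule in RULES.items():
        hits = [bool(p.search(text)) for p in rule["strings"]]
        if rule["condition"](hits):
            matched.append(name)
    return matched
```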

Sigma Rules

Log-level analysis for model extraction query patterns, data exfiltration indicators, anomalous agent behavior, infrastructure resource exhaustion, and cost inflation.
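
Sigma rules ultimately compile down to queries over logs; the sketch below shows the kind of logic such a rule encodes for model-extraction detection: high query volume in a short window combined with low prompt diversity. The log field names and thresholds are assumptions for illustration:

```python
# Sketch of a Sigma-style log detection for model extraction: flag
# clients that issue many queries in a short window with low prompt
# diversity (systematic harvesting). Field names (client_id, ts,
# prompt) and thresholds are illustrative assumptions.
from collections import defaultdict

def extraction_suspects(logs, window_s=60, min_queries=100, max_diversity=0.2):
    by_client = defaultdict(list)
    for entry in logs:
        by_client[entry["client_id"]].append(entry)
    suspects = []
    for client, entries in by_client.items():
        entries.sort(key=lambda e: e["ts"])
        lo = 0
        for hi in range(len(entries)):
            # slide a time window over the sorted queries
            while entries[hi]["ts"] - entries[lo]["ts"] > window_s:
                lo += 1
            count = hi - lo + 1
            if count >= min_queries:
                uniq = len({e["prompt"] for e in entries[lo:hi + 1]})
                if uniq / count <= max_diversity:
                    suspects.append(client)
                    break
    return suspects
```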

Mitigation Strategies

Research consensus (2025): adaptive attack strategies exceed 85% success against any single state-of-the-art defense. No single control is sufficient. AATMF mandates layered defense.

CaMeL Architecture (Google DeepMind, March 2025)

The most promising defensive framework treats LLMs as fundamentally untrusted components — analogous to how operating systems treat user-space programs. CaMeL solved 77% of AgentDojo tasks while providing provable security guarantees against prompt injection.

• Dual-LLM pattern: Frontier LLM generates plans; a hardened secondary LLM validates and sanitizes
• Capability-based access: Tools require explicit capability tokens, not ambient authority
• Information flow control: Track data provenance through the entire pipeline; tainted data cannot reach sensitive operations
• Minimal authority: Agents receive only the permissions needed for the immediate task
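
Two of these ideas, capability tokens gating tool access and taint tracking so untrusted data cannot reach sensitive operations, can be illustrated with a small sketch. Class and tool names are invented for illustration and do not reflect the actual CaMeL API:

```python
# Sketch of CaMeL-inspired controls: tools require explicit capability
# tokens (no ambient authority), and values derived from untrusted
# content carry a taint flag that blocks them from sensitive tools.
from dataclasses import dataclass

@dataclass(frozen=True)
class Value:
    data: str
    tainted: bool  # True if derived from untrusted (e.g. retrieved) content

class ToolRegistry:
    def __init__(self):
        self._tools = {}

    def register(self, name, fn, required_cap, accepts_tainted):
        self._tools[name] = (fn, required_cap, accepts_tainted)

    def invoke(self, name, value: Value, caps: set):
        fn, cap, accepts_tainted = self._tools[name]
        if cap not in caps:
            raise PermissionError(f"missing capability: {cap}")
        if value.tainted and not accepts_tainted:
            raise PermissionError("tainted data blocked from sensitive tool")
        return fn(value.data)

registry = ToolRegistry()
# A read-only tool may consume tainted data; an outbound tool may not.
registry.register("summarize", lambda s: s[:40], "read", accepts_tainted=True)
registry.register("send_email", lambda s: f"sent: {s}", "send", accepts_tainted=False)
```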

Controls by Tactic

Tactic                      | Primary Controls
T1 — Prompt Subversion      | Input sanitization, instruction hierarchy enforcement, system prompt isolation
T2 — Semantic Evasion       | Unicode normalization, multi-layer content filtering, semantic analysis
T3 — Reasoning Exploitation | Output validation, reasoning chain verification, constraint hardening
T4 — Multi-Turn             | Context window management, conversation state validation, memory isolation
T5 — Model/API              | Rate limiting, query fingerprinting, differential privacy on outputs
T6 — Training Poisoning     | Data provenance tracking, anomaly detection in training metrics, DRS defense
T7 — Output Manipulation    | Output filtering, structured output enforcement, content watermarking
T8 — Deception              | Fact-checking integration, source attribution, confidence calibration
T9 — Multimodal             | Cross-modal consistency checking, steganography detection, input sanitization
T10 — Integrity Breach      | TEE deployment, access control, audit logging, membership inference defense
T11 — Agentic               | CaMeL architecture, tool permission scoping, MCP server auditing
T12 — RAG                   | Embedding integrity verification, retrieval result validation, source authentication
T13 — Supply Chain          | SafeTensors adoption, artifact scanning, SBOM for ML artifacts
T14 — Infrastructure        | Network segmentation, inference server hardening, ZMQ authentication
T15 — Human Workflow        | Reviewer training, decision audit trails, annotation quality metrics
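
Several of the controls above are simple to prototype; rate limiting (T5) in particular is usually a per-client token bucket. A sketch with illustrative defaults:

```python
# Per-client token-bucket rate limiter (a T5 control): each client
# holds up to `capacity` tokens that refill at `rate` per second; a
# request is served only if a token is available. The injectable
# clock makes the bucket testable.
import time

class TokenBucket:
    def __init__(self, capacity=10, rate=1.0, clock=time.monotonic):
        self.capacity = capacity
        self.rate = rate
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```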

Priority Implementation Order

1. Immediate: Input sanitization (T1–T3), rate limiting (T5), SafeTensors (T13)
2. Short-term: CaMeL/dual-LLM pattern (T11), MCP auditing (T11), RAG validation (T12)
3. Medium-term: Multimodal detection (T9), training pipeline security (T6), TEE deployment (T10)
4. Ongoing: Human workflow hardening (T15), infrastructure monitoring (T14), supply chain verification (T13)

Incident Response for AI Systems

Traditional IR frameworks (NIST SP 800-61, SANS PICERL) assume deterministic systems. AI incidents differ: attacks may be probabilistic, evidence may be ephemeral, and "containment" for a language model has different semantics than for a compromised server.

1. Detection & Triage

Signal                            | Source            | Priority
Safety filter bypass confirmed    | Output monitoring | P1 — Immediate
Model extraction pattern detected | API telemetry     | P1 — Immediate
Training pipeline anomaly         | Training metrics  | P1 — Immediate
MCP tool behavior deviation       | Agent monitoring  | P1 — Immediate
Unusual query pattern             | Rate limiter      | P2 — Investigate
Jailbreak attempt (unsuccessful)  | Input classifier  | P3 — Logged
2. Containment

Scenario                             | Action
Active jailbreak exploitation        | Block session, rate-limit source, deploy updated filter
Model serving compromised artifact   | Hot-swap to known-good checkpoint
RAG poisoning detected               | Quarantine affected data sources, switch to cached index
Agentic system: unauthorized actions | Kill agent process, revoke tool permissions
Training data contamination          | Halt training pipeline, snapshot current state
3. Investigation

Collect and Preserve

  • Full conversation logs (with system prompts)
  • Model version and configuration at time of incident
  • Input classifier decisions and confidence scores
  • Tool invocations and their results (for agentic systems)
  • Training data batches (if poisoning suspected)
  • Infrastructure logs (API gateway, inference server, vector DB)
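
Collected artifacts should be preserved tamper-evidently. One common approach, sketched here with hypothetical artifact names, is to hash each item and chain the digests so later modification of any item is detectable:

```python
# Sketch of tamper-evident evidence preservation: hash each collected
# artifact, then chain the digests in a fixed order so that altering
# any item (or the set of items) changes the final chain value.
import hashlib

def build_manifest(artifacts: dict) -> dict:
    manifest, chain = {}, b""
    for name in sorted(artifacts):  # fixed order makes the chain deterministic
        digest = hashlib.sha256(artifacts[name]).hexdigest()
        chain = hashlib.sha256(chain + digest.encode()).digest()
        manifest[name] = digest
    return {"items": manifest, "chain": chain.hex()}

def verify(artifacts: dict, manifest: dict) -> bool:
    # Recompute from the artifacts and compare against the stored manifest.
    return build_manifest(artifacts) == manifest
```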
4. Eradication

Root Cause                   | Eradication
Prompt injection bypass      | Update input classifiers, add pattern to blocklist
Model vulnerability          | Retrain or fine-tune with adversarial examples
RAG poisoning                | Rebuild index from verified sources
Supply chain compromise      | Replace artifact, audit provenance chain
Infrastructure vulnerability | Patch, harden, segment
5. Recovery & Post-Incident

Staged Restoration

  1. Deploy patched system in shadow mode
  2. Run automated red team suite against fix
  3. Monitor for recurrence (24-hour observation window)
  4. Full production restoration with enhanced monitoring
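
The first restoration step, shadow mode, can be sketched as serving production's answer while logging divergences from the patched candidate for review; `prod` and `patched` stand in for real model endpoints:

```python
# Shadow-mode sketch: the patched system runs alongside production on
# the same traffic, but users only ever receive production's answer.
# Divergences are recorded for offline review before cutover.
def shadow_serve(prompt, prod, patched, divergences):
    live = prod(prompt)
    candidate = patched(prompt)
    if candidate != live:
        divergences.append({"prompt": prompt, "live": live, "candidate": candidate})
    return live  # users always get the production answer during shadow mode
```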

Post-Incident

  • Publish internal lessons learned
  • Update AATMF technique documentation if novel attack
  • Share indicators (responsibly) with AI security community

Red Team Operations

AATMF provides four assessment levels, each expanding tactic coverage and depth. The same reasoning model capabilities that enable 97% ASR jailbreaking can be directed at your own systems defensively.

Assessment Scope Matrix

Level | Name                | Tactics | Duration  | Prerequisites
1     | Quick Scan          | T1–T3   | 1–2 days  | API access
2     | Standard Assessment | T1–T8   | 1–2 weeks | API + documentation
3     | Comprehensive       | T1–T12  | 3–4 weeks | Full system access
4     | Full Spectrum       | T1–T15  | 6–8 weeks | Source code + infra + training pipeline

Rules of Engagement Template

1. Scope:           [Models/systems in scope]
2. Tactics:         [AATMF tactics authorized]
3. Boundaries:      [Explicitly prohibited actions]
4. Data handling:   [Treatment of discovered vulnerabilities/outputs]
5. Communication:   [Escalation path for critical findings]
6. Timeline:        [Assessment window]
7. Success criteria: [Minimum coverage requirements]

Blue Team Defense

Core Principle

Treat LLMs as untrusted components. Design systems assuming the model will be compromised. Policy Puppetry bypasses every frontier model. Autonomous jailbreaking achieves 97% ASR. The correct architectural posture is: the LLM will be jailbroken; design the surrounding system so that jailbreaking the LLM is insufficient to cause harm.

Defense Mapping

Control                  | Implementation                                              | Covers
Input Sanitization       | Unicode normalization, encoding detection, pattern matching | T1, T2, T9
Instruction Hierarchy    | System prompt isolation, privilege separation               | T1, T3, T4
Rate Limiting            | Per-user, per-session, per-endpoint throttling              | T5, T14
Output Validation        | Content classifiers, structured output enforcement          | T7, T8
Tool Permission Scoping  | Capability-based access, least privilege                    | T11
Data Provenance          | Training data lineage, RAG source authentication            | T6, T12, T13
Infrastructure Hardening | Network segmentation, auth on all services                  | T14
Human Workflow Controls  | Reviewer training, decision audit trails                    | T15
Monitoring & Alerting    | Detection engineering, log aggregation                      | All

Monitoring Dashboard Targets

Metric                      | Target           | Frequency
Jailbreak attempt rate      | < 5% of queries  | Real-time
Safety filter bypass rate   | < 0.01%          | Real-time
API abuse detection latency | < 30 seconds     | Real-time
Model artifact integrity    | 100% verified    | Per-deployment
RAG source freshness        | < 24 hours stale | Hourly
Incident response time (P1) | < 15 minutes     | Per-incident
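
The targets above translate directly into alerting logic. A sketch, with metric keys and comparator/threshold pairs mirroring the table (rates expressed as fractions):

```python
# Sketch of evaluating live metrics against the dashboard targets:
# each target is a (comparator, threshold) pair, and a metric that
# fails its comparison is reported as a breach. Metric key names are
# illustrative; values are fractions, seconds, or a 0..1 verified ratio.
import operator

TARGETS = {
    "jailbreak_attempt_rate":      (operator.lt, 0.05),    # < 5% of queries
    "safety_filter_bypass_rate":   (operator.lt, 0.0001),  # < 0.01%
    "api_abuse_detection_latency": (operator.lt, 30.0),    # < 30 seconds
    "model_artifact_integrity":    (operator.eq, 1.0),     # 100% verified
}

def breaches(metrics: dict) -> list:
    out = []
    for name, (cmp, threshold) in TARGETS.items():
        if name in metrics and not cmp(metrics[name], threshold):
            out.append(name)
    return out
```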