Volume V: Implementation & Operations
From detection engineering to incident response — the operational playbook for defending AI systems against the threats in Volumes II–IV.
Detection Engineering
AATMF detection operates across five layers, each providing defense-in-depth against adversarial AI attacks. No single layer is sufficient — adaptive attack strategies exceed 85% success against any single defense.
Layer 5: Feedback Loop Analysis ← T6, T15
Layer 4: System Telemetry ← T13, T14
Layer 3: Output Validation ← T7, T8
Layer 2: Behavioral Monitoring ← T4, T5, T11
Layer 1: Input Analysis ← T1, T2, T3, T9
Detection Patterns by Layer
| Layer | Tactics | Detection Method |
|---|---|---|
| Input Analysis | T1, T2, T3, T9 | Pattern matching, encoding detection, Unicode normalization, multimodal scanning |
| Behavioral Monitoring | T4, T5, T11 | Session analysis, query fingerprinting, agent action validation, tool chain auditing |
| Output Validation | T7, T8 | Content classifiers, structured output enforcement, fact-checking, watermark detection |
| System Telemetry | T13, T14 | Model artifact integrity checks, infrastructure monitoring, supply chain verification |
| Feedback Loop Analysis | T6, T15 | Training metric anomalies, annotation quality scoring, reward signal monitoring |
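The Layer 1 checks in the table above can be sketched in a few lines. This is a minimal illustration, not a production filter: the two regexes are placeholder injection patterns, and a real deployment would combine a much larger rule set with ML classifiers.

```python
import re
import unicodedata

# Placeholder injection patterns for illustration only; real rule sets
# are far larger and continuously updated.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.I),
    re.compile(r"you are now in developer mode", re.I),
]

def analyze_input(text: str) -> dict:
    # NFKC folds compatibility characters (e.g. fullwidth letters) so
    # pattern matching sees canonical forms where possible.
    normalized = unicodedata.normalize("NFKC", text)
    hits = [p.pattern for p in INJECTION_PATTERNS if p.search(normalized)]
    return {"normalized": normalized, "flagged": bool(hits), "hits": hits}
```

Note that normalization runs before matching: an attacker writing `Ｉｇｎｏｒｅ previous instructions` in fullwidth Unicode is folded to ASCII and still caught.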
Signature Types
YARA Rules
Content-level analysis for prompt injection patterns, encoding evasion, multimodal injection, MCP tool description poisoning, and supply chain artifact scanning.
Sigma Rules
Log-level analysis for model extraction query patterns, data exfiltration indicators, anomalous agent behavior, infrastructure resource exhaustion, and cost inflation.
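A log-level heuristic in the spirit of a Sigma rule might flag API keys issuing many highly templated queries, a common model-extraction indicator. The threshold and prefix-uniformity ratio below are illustrative placeholders, not recommended values.

```python
from collections import defaultdict

EXTRACTION_QUERY_THRESHOLD = 100  # illustrative, tune per workload

def flag_extraction_suspects(log_entries):
    """log_entries: iterable of (api_key, prompt) tuples."""
    counts = defaultdict(int)
    distinct_prefixes = defaultdict(set)
    for api_key, prompt in log_entries:
        counts[api_key] += 1
        distinct_prefixes[api_key].add(prompt[:16])
    suspects = []
    for key, n in counts.items():
        # Many queries but few distinct prefixes suggests templated probing.
        if n >= EXTRACTION_QUERY_THRESHOLD and len(distinct_prefixes[key]) <= n // 10:
            suspects.append(key)
    return suspects
```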
Mitigation Strategies
Research consensus (2025): adaptive attack strategies exceed 85% success against any single state-of-the-art defense. No single control is sufficient. AATMF mandates layered defense.
CaMeL Architecture (Google DeepMind, March 2025)
The most promising defensive framework treats LLMs as fundamentally untrusted components — analogous to how operating systems treat user-space programs. CaMeL solved 77% of AgentDojo tasks while providing provable security guarantees against prompt injection.
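The untrusted-component idea can be illustrated with a toy dual-LLM split (this is a simplification for exposition, not the actual CaMeL implementation): a quarantined model reads untrusted content and may only return a value of a declared type, while the privileged planner's control flow is fixed up front and never sees the raw untrusted text.

```python
def quarantined_llm(untrusted_text: str, expected_type: type):
    """Stand-in for an LLM call over untrusted data. Its output is
    treated strictly as data of expected_type, never as instructions."""
    # Hypothetical extraction logic: pull the digits from the text.
    digits = "".join(ch for ch in untrusted_text if ch.isdigit())
    value = expected_type(digits or 0)
    assert isinstance(value, expected_type)
    return value

def privileged_planner(email_body: str) -> str:
    # The planner's action is decided before the untrusted email is read:
    # the email can influence *values* (the invoice id) but not *actions*.
    invoice_id = quarantined_llm(email_body, int)
    return f"fetch_invoice({invoice_id})"
```

The injected instruction in the test below reaches only the quarantined side and can change nothing but the extracted integer, which is the essence of the provable-guarantee claim.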
Controls by Tactic
| Tactic | Primary Controls |
|---|---|
| T1 — Prompt Subversion | Input sanitization, instruction hierarchy enforcement, system prompt isolation |
| T2 — Semantic Evasion | Unicode normalization, multi-layer content filtering, semantic analysis |
| T3 — Reasoning Exploitation | Output validation, reasoning chain verification, constraint hardening |
| T4 — Multi-Turn | Context window management, conversation state validation, memory isolation |
| T5 — Model/API | Rate limiting, query fingerprinting, differential privacy on outputs |
| T6 — Training Poisoning | Data provenance tracking, anomaly detection in training metrics, DRS defense |
| T7 — Output Manipulation | Output filtering, structured output enforcement, content watermarking |
| T8 — Deception | Fact-checking integration, source attribution, confidence calibration |
| T9 — Multimodal | Cross-modal consistency checking, steganography detection, input sanitization |
| T10 — Integrity Breach | TEE deployment, access control, audit logging, membership inference defense |
| T11 — Agentic | CaMeL architecture, tool permission scoping, MCP server auditing |
| T12 — RAG | Embedding integrity verification, retrieval result validation, source authentication |
| T13 — Supply Chain | SafeTensors adoption, artifact scanning, SBOM for ML artifacts |
| T14 — Infrastructure | Network segmentation, inference server hardening, ZMQ authentication |
| T15 — Human Workflow | Reviewer training, decision audit trails, annotation quality metrics |
Priority Implementation Order
Immediate
Input sanitization (T1–T3), rate limiting (T5), SafeTensors (T13)
Short-term
CaMeL/dual-LLM pattern (T11), MCP auditing (T11), RAG validation (T12)
Medium-term
Multimodal detection (T9), training pipeline security (T6), TEE deployment (T10)
Ongoing
Human workflow hardening (T15), infrastructure monitoring (T14), supply chain verification (T13)
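Of the "Immediate" controls, rate limiting (T5) is the quickest to stand up. A minimal per-user token-bucket sketch, with illustrative parameters rather than recommended production values:

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter; instantiate one per user/key."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```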
Incident Response for AI Systems
Traditional IR frameworks (NIST SP 800-61, SANS PICERL) assume deterministic systems. AI incidents differ: attacks may be probabilistic, evidence may be ephemeral, and "containment" for a language model has different semantics than for a compromised server.
1 Detection & Triage
| Signal | Source | Priority |
|---|---|---|
| Safety filter bypass confirmed | Output monitoring | P1 — Immediate |
| Model extraction pattern detected | API telemetry | P1 — Immediate |
| Training pipeline anomaly | Training metrics | P1 — Immediate |
| MCP tool behavior deviation | Agent monitoring | P1 — Immediate |
| Unusual query pattern | Rate limiter | P2 — Investigate |
| Jailbreak attempt (unsuccessful) | Input classifier | P3 — Logged |
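The triage table above amounts to a routing function. A sketch, with illustrative signal labels rather than a standard taxonomy, and an intentionally conservative default for unknown signals:

```python
# Signal names mirror the triage table; labels are illustrative.
TRIAGE = {
    "safety_filter_bypass": "P1",
    "model_extraction_pattern": "P1",
    "training_pipeline_anomaly": "P1",
    "mcp_tool_deviation": "P1",
    "unusual_query_pattern": "P2",
    "jailbreak_attempt_blocked": "P3",
}

def route(signal: str) -> str:
    # Unknown signals default to P2 so they are investigated, not dropped.
    priority = TRIAGE.get(signal, "P2")
    if priority == "P1":
        return f"{priority}: page on-call immediately"
    if priority == "P2":
        return f"{priority}: open investigation ticket"
    return f"{priority}: log and aggregate"
```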
2 Containment
| Scenario | Action |
|---|---|
| Active jailbreak exploitation | Block session, rate-limit source, deploy updated filter |
| Model serving compromised artifact | Hot-swap to known-good checkpoint |
| RAG poisoning detected | Quarantine affected data sources, switch to cached index |
| Agentic system: unauthorized actions | Kill agent process, revoke tool permissions |
| Training data contamination | Halt training pipeline, snapshot current state |
3 Investigation
Collect and Preserve
- Full conversation logs (with system prompts)
- Model version and configuration at time of incident
- Input classifier decisions and confidence scores
- Tool invocations and their results (for agentic systems)
- Training data batches (if poisoning suspected)
- Infrastructure logs (API gateway, inference server, vector DB)
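Because AI evidence can be ephemeral, preserved artifacts should be hashed at collection time so later tampering is detectable. A manifest-building sketch (a real pipeline would also write the bundle to append-only storage):

```python
import hashlib
import json

def preserve_evidence(artifacts: dict) -> dict:
    """Return a manifest with a SHA-256 digest per collected artifact."""
    manifest = {}
    for name, content in artifacts.items():
        # Canonical JSON serialization so identical content hashes identically.
        blob = json.dumps(content, sort_keys=True).encode()
        manifest[name] = {
            "sha256": hashlib.sha256(blob).hexdigest(),
            "bytes": len(blob),
        }
    return manifest
```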
4 Eradication
| Root Cause | Eradication |
|---|---|
| Prompt injection bypass | Update input classifiers, add pattern to blocklist |
| Model vulnerability | Retrain or fine-tune with adversarial examples |
| RAG poisoning | Rebuild index from verified sources |
| Supply chain compromise | Replace artifact, audit provenance chain |
| Infrastructure vulnerability | Patch, harden, segment |
5 Recovery & Post-Incident
Staged Restoration
- Deploy patched system in shadow mode
- Run automated red team suite against fix
- Monitor for recurrence (24-hour observation window)
- Full production restoration with enhanced monitoring
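The shadow-mode step can be sketched as running the patched system alongside production on the same traffic and recording divergences, while only the production output reaches users. The model callables here are placeholders for real inference endpoints.

```python
def shadow_compare(prompts, production_model, candidate_model):
    """Serve production_model's output; record where the candidate differs."""
    divergences = []
    for prompt in prompts:
        prod_out = production_model(prompt)   # served to the user
        cand_out = candidate_model(prompt)    # recorded only, never served
        if prod_out != cand_out:
            divergences.append((prompt, prod_out, cand_out))
    return divergences
```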
Post-Incident
- Publish internal lessons learned
- Update AATMF technique documentation if novel attack
- Share indicators (responsibly) with AI security community
Red Team Operations
AATMF provides four assessment levels, each expanding tactic coverage and depth. The same reasoning model capabilities that enable 97% ASR jailbreaking can be directed at your own systems defensively.
Assessment Scope Matrix
| Level | Name | Tactics | Duration | Prerequisites |
|---|---|---|---|---|
| 1 | Quick Scan | T1–T3 | 1–2 days | API access |
| 2 | Standard Assessment | T1–T8 | 1–2 weeks | API + documentation |
| 3 | Comprehensive | T1–T12 | 3–4 weeks | Full system access |
| 4 | Full Spectrum | T1–T15 | 6–8 weeks | Source code + infra + training pipeline |
Rules of Engagement Template
1. Scope: [Models/systems in scope]
2. Tactics: [AATMF tactics authorized]
3. Boundaries: [Explicitly prohibited actions]
4. Data handling: [Treatment of discovered vulnerabilities/outputs]
5. Communication: [Escalation path for critical findings]
6. Timeline: [Assessment window]
7. Success criteria: [Minimum coverage requirements]
Blue Team Defense
Core Principle
Treat LLMs as untrusted components and design systems assuming the model will be compromised: Policy Puppetry bypasses every frontier model, and autonomous jailbreaking achieves 97% ASR. The correct architectural posture is that the LLM will be jailbroken; design the surrounding system so that jailbreaking the LLM alone is insufficient to cause harm.
Defense Mapping
| Control | Implementation | Covers |
|---|---|---|
| Input Sanitization | Unicode normalization, encoding detection, pattern matching | T1, T2, T9 |
| Instruction Hierarchy | System prompt isolation, privilege separation | T1, T3, T4 |
| Rate Limiting | Per-user, per-session, per-endpoint throttling | T5, T14 |
| Output Validation | Content classifiers, structured output enforcement | T7, T8 |
| Tool Permission Scoping | Capability-based access, least privilege | T11 |
| Data Provenance | Training data lineage, RAG source authentication | T6, T12, T13 |
| Infrastructure Hardening | Network segmentation, auth on all services | T14 |
| Human Workflow Controls | Reviewer training, decision audit trails | T15 |
| Monitoring & Alerting | Detection engineering, log aggregation | All |
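Tool permission scoping (T11) from the mapping above can be sketched as a capability check between the agent and its tools: each session holds explicit grants, and each grant carries a least-privilege validator over the call's arguments. Tool names and constraints here are illustrative.

```python
class ToolScope:
    """Capability-based gate: an agent may only call granted tools,
    and only with arguments its grant's validator accepts."""

    def __init__(self, grants: dict):
        # grants maps tool name -> validator over the call's kwargs
        self.grants = grants

    def invoke(self, tool: str, tools: dict, **kwargs):
        if tool not in self.grants:
            raise PermissionError(f"tool not granted: {tool}")
        if not self.grants[tool](kwargs):
            raise PermissionError(f"arguments violate scope: {tool}")
        return tools[tool](**kwargs)
```

For example, granting `read_file` restricted to a sandbox directory lets a compromised agent read only sandboxed paths, and an ungranted `delete_file` fails closed regardless of what the model outputs.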
Monitoring Dashboard Targets
| Metric | Target | Frequency |
|---|---|---|
| Jailbreak attempt rate | < 5% of queries | Real-time |
| Safety filter bypass rate | < 0.01% | Real-time |
| API abuse detection latency | < 30 seconds | Real-time |
| Model artifact integrity | 100% verified | Per-deployment |
| RAG source freshness | < 24 hours stale | Hourly |
| Incident response time (P1) | < 15 minutes | Per-incident |
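The dashboard targets above reduce to a threshold check over observed metrics. A sketch whose limits mirror the table (metric key names are illustrative):

```python
# Limits mirror the dashboard table; keys are illustrative names.
TARGETS = {
    "jailbreak_attempt_rate": 0.05,       # < 5% of queries
    "safety_filter_bypass_rate": 0.0001,  # < 0.01%
    "api_abuse_detection_latency_s": 30,
    "p1_response_time_min": 15,
}

def breached_targets(observed: dict) -> list:
    """Return metrics whose observed value meets or exceeds the target."""
    return [name for name, limit in TARGETS.items()
            if observed.get(name, 0) >= limit]
```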