Part 19: Detection Engineering
Detection Architecture
AATMF detection operates across five layers that together provide defense in depth against adversarial AI attacks:
┌─────────────────────────────────────────┐
│ Layer 5: Feedback Loop Analysis         │ ← T6, T15
├─────────────────────────────────────────┤
│ Layer 4: System Telemetry               │ ← T13, T14
├─────────────────────────────────────────┤
│ Layer 3: Output Validation              │ ← T7, T8
├─────────────────────────────────────────┤
│ Layer 2: Behavioral Monitoring          │ ← T4, T5, T11
├─────────────────────────────────────────┤
│ Layer 1: Input Analysis                 │ ← T1, T2, T3, T9
└─────────────────────────────────────────┘
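As a rough illustration, the layer diagram can be written down as a lookup table so that a detector pipeline can ask which layer is responsible for a given tactic. The names `DETECTION_LAYERS`, `layers_for_tactic`, and `unmapped_tactics` are hypothetical; only the layer names and tactic assignments come from the diagram.

```python
# Layer-to-tactic assignments transcribed from the diagram.
DETECTION_LAYERS = {
    1: ("Input Analysis", ["T1", "T2", "T3", "T9"]),
    2: ("Behavioral Monitoring", ["T4", "T5", "T11"]),
    3: ("Output Validation", ["T7", "T8"]),
    4: ("System Telemetry", ["T13", "T14"]),
    5: ("Feedback Loop Analysis", ["T6", "T15"]),
}


def layers_for_tactic(tactic: str) -> list[str]:
    """Return the names of the layers the diagram assigns to a tactic."""
    return [name for name, tactics in DETECTION_LAYERS.values() if tactic in tactics]


def unmapped_tactics(total: int = 15) -> list[str]:
    """List tactics T1..T<total> that no layer in the diagram covers."""
    covered = {t for _, tactics in DETECTION_LAYERS.values() for t in tactics}
    return [f"T{i}" for i in range(1, total + 1) if f"T{i}" not in covered]
```

Calling `unmapped_tactics()` on this table returns `["T10", "T12"]`, the two tactics the diagram does not place in any layer.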
Detection Patterns by Tactic
T1–T4: Prompt & Context Attacks
import re


class PromptInjectionDetector:
    # Phrases typical of direct instruction-override attempts (matched case-insensitively)
    PATTERNS = [
        r"ignore\s+(previous|above|all)\s+(instructions?|rules?|guidelines?)",
        r"(system|admin)\s*:?\s*(override|prompt|instruction)",
        r"you\s+are\s+now\s+(DAN|evil|unrestricted|jailbroken)",
        r"\[\s*(SYSTEM|INST|SYS)\s*\]",
        r"<\|?(system|im_start|endoftext)\|?>",
        r"BEGIN\s+(OVERRIDE|NEW\s+INSTRUCTIONS)",
    ]
    # Markers of encoded or obfuscated payloads
    ENCODING_PATTERNS = [
        r"[bB]ase64[:\s]",
        r"\\x[0-9a-fA-F]{2}",
        r"\\u[0-9a-fA-F]{4}",
        r"[\u200b-\u200f\u2028-\u202f\ufeff]",  # zero-width
    ]

    def analyze(self, text: str) -> dict:
        findings = []
        for pattern in self.PATTERNS:
            if re.search(pattern, text, re.IGNORECASE):
                findings.append({"pattern": pattern, "severity": "HIGH"})
        for pattern in self.ENCODING_PATTERNS:
            if re.search(pattern, text):
                findings.append({"pattern": pattern, "severity": "MEDIUM"})
        return {
            "detected": len(findings) > 0,
            "findings": findings,
            # 50 points per matched pattern, capped at 300
            "risk_score": min(len(findings) * 50, 300)
        }
T5–T8: API & Output Attacks
from difflib import SequenceMatcher


class APIExploitDetector:
    def detect_extraction(self, request_log: list) -> dict:
        """Detect model extraction via API abuse patterns."""
        indicators = {
            "high_volume": len(request_log) > 1000,
            "systematic_probing": self._detect_systematic(request_log),
            "boundary_testing": self._detect_boundary(request_log),
            "output_harvesting": self._detect_harvesting(request_log),
        }
        score = sum(50 for v in indicators.values() if v)
        return {"indicators": indicators, "risk_score": score}

    def _similarity(self, a, b) -> float:
        # The original does not define this helper; difflib's ratio over the
        # 'prompt' fields is used here as a stand-in similarity measure.
        return SequenceMatcher(None, str(a.get('prompt', '')), str(b.get('prompt', ''))).ratio()

    def _detect_systematic(self, logs):
        # Check for incrementally varied inputs among the first 100 adjacent pairs
        return any(self._similarity(logs[i], logs[i+1]) > 0.9
                   for i in range(min(len(logs)-1, 100)))

    def _detect_boundary(self, logs):
        # More than 10 requests that mention limits or error conditions
        boundary_keywords = ["maximum", "limit", "error", "exception", "overflow"]
        return sum(1 for l in logs if any(k in str(l).lower() for k in boundary_keywords)) > 10

    def _detect_harvesting(self, logs):
        # Share of distinct 50-character prompt prefixes; above 0.95 means almost no repeats
        return len(set(str(l.get('prompt',''))[:50] for l in logs)) / max(len(logs), 1) > 0.95
T9–T12: Multimodal & Agentic Attacks
import re


class MultimodalDetector:
    def analyze_image(self, image_bytes: bytes) -> dict:
        """Detect steganographic or adversarial image modifications."""
        findings = []
        # Check for unusual metadata: a large image with no Exif marker near the start
        if b"Exif" not in image_bytes[:1000] and len(image_bytes) > 100000:
            findings.append({"type": "stripped_metadata", "severity": "LOW"})
        # Check for appended data after the last JPEG end-of-image marker (FF D9)
        jpeg_end = image_bytes.rfind(b"\xff\xd9")
        if jpeg_end > 0 and jpeg_end < len(image_bytes) - 2:
            findings.append({"type": "appended_data", "severity": "HIGH"})
        # Unusual color distribution (potential adversarial perturbation) is not checked here
        return {"findings": findings}

    def analyze_mcp_tool(self, tool_description: str) -> dict:
        """Detect MCP tool poisoning indicators."""
        suspicious = [
            r"<IMPORTANT>",
            r"override|ignore|bypass",
            r"do not (tell|inform|show)",
            r"silently|secretly|covertly",
            r"instead of|rather than",
        ]
        hits = [p for p in suspicious if re.search(p, tool_description, re.I)]
        return {
            "poisoning_indicators": len(hits),
            # Three or more indicators are CRITICAL, one or two HIGH, none LOW
            "severity": "CRITICAL" if len(hits) >= 3 else "HIGH" if hits else "LOW"
        }
T13–T15: Supply Chain & Infrastructure
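class SupplyChainDetector:
    # Protocol 4 pickles typically begin with these bytes. The header identifies
    # the serialization format; on its own it is not evidence of a payload.
    PICKLE_HEADER = b"\x80\x04\x95"
    # Heuristic byte patterns for code-executing pickles. They target the
    # text-based GLOBAL opcode ("c") and can miss payloads built other ways.
    PICKLE_SIGNATURES = [
        b"cos\nsystem",    # os.system call
        b"csubprocess",    # subprocess module
        b"c__builtin__",   # builtins access
        b"creduce_ex",     # reduce_ex (code execution)
    ]

    def scan_model_file(self, filepath: str) -> dict:
        """Scan model artifact for malicious pickle payloads."""
        findings = []
        # Only the first 8 KiB is inspected; payloads deeper in the file are missed
        with open(filepath, 'rb') as f:
            header = f.read(8192)
        if self.PICKLE_HEADER in header:
            findings.append({
                "type": "pickle_format",
                "signature": self.PICKLE_HEADER.hex(),
                "severity": "MEDIUM"
            })
        for sig in self.PICKLE_SIGNATURES:
            if sig in header:
                findings.append({
                    "type": "malicious_pickle",
                    "signature": sig.hex(),
                    "severity": "CRITICAL"
                })
        return {
            "safe": len(findings) == 0,
            # Format is inferred from the file extension only
            "format": "safetensors" if filepath.endswith('.safetensors') else "pickle",
            "findings": findings
        }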
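The scanner infers the artifact format from the file extension, which an attacker controls. A complementary check is to look at the bytes themselves. The sketch below relies on the published safetensors layout (an 8-byte little-endian header length followed by a UTF-8 JSON header); the function name `looks_like_safetensors` and the `max_header` cap are assumptions made for illustration, and passing this check says nothing about whether the tensors themselves are trustworthy.

```python
import json
import struct


def looks_like_safetensors(data: bytes, max_header: int = 100_000_000) -> bool:
    """Heuristic: does `data` begin with a plausible safetensors header?"""
    if len(data) < 8:
        return False
    # First 8 bytes: unsigned little-endian 64-bit length of the JSON header.
    (header_len,) = struct.unpack("<Q", data[:8])
    if header_len == 0 or header_len > max_header or 8 + header_len > len(data):
        return False
    try:
        header = json.loads(data[8:8 + header_len].decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return False
    # The header is a JSON object mapping tensor names to their metadata.
    return isinstance(header, dict)
```

A file named `model.safetensors` that fails this check is worth treating as suspicious regardless of its extension.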
Alert Prioritization
| Severity | Tactics | Response SLA | Action |
| --- | --- | --- | --- |
| 🔴 CRITICAL | T11 (agentic RCE), T13 (supply chain), T14 (infra) | 15 minutes | Automated containment + SOC escalation |
| 🟠 HIGH | T1 (injection), T6 (poisoning), T10 (breach) | 1 hour | SOC analyst review |
| 🟡 MEDIUM | T2–T3 (evasion), T7 (exfiltration) | 4 hours | Queued investigation |
| 🔵 LOW | T4 (multi-turn), T8 (misinfo) | 24 hours | Logged, batch review |
| ⚪ INFO | T15 (workflow) | Weekly | Trend analysis |
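One way to put the prioritization table to work is to turn each alert into a response deadline. The sketch below transcribes the table's severities and SLAs; treating "Weekly" as seven days and falling back to MEDIUM for tactics the table does not list (T5, T9, T12) are assumptions, as is the function name `response_deadline`.

```python
from datetime import datetime, timedelta, timezone

# SLAs transcribed from the table; "Weekly" is read as 7 days (assumption).
SEVERITY_SLA = {
    "CRITICAL": timedelta(minutes=15),
    "HIGH": timedelta(hours=1),
    "MEDIUM": timedelta(hours=4),
    "LOW": timedelta(hours=24),
    "INFO": timedelta(weeks=1),
}

TACTIC_SEVERITY = {
    "T11": "CRITICAL", "T13": "CRITICAL", "T14": "CRITICAL",
    "T1": "HIGH", "T6": "HIGH", "T10": "HIGH",
    "T2": "MEDIUM", "T3": "MEDIUM", "T7": "MEDIUM",
    "T4": "LOW", "T8": "LOW",
    "T15": "INFO",
}


def response_deadline(tactic: str, detected_at: datetime, default: str = "MEDIUM"):
    """Return (severity, deadline) for an alert attributed to `tactic`."""
    # Tactics absent from the table fall back to `default` (assumption).
    severity = TACTIC_SEVERITY.get(tactic, default)
    return severity, detected_at + SEVERITY_SLA[severity]
```

For example, `response_deadline("T13", datetime.now(timezone.utc))` yields a CRITICAL alert due 15 minutes after detection.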
Guardrail Bypass Awareness
Detectors must account for known evasion techniques against detection systems themselves:
| Evasion | Mechanism | Countermeasure |
| --- | --- | --- |
| Emoji smuggling | Replace keywords with semantically equivalent emoji | Emoji-to-text normalization before analysis |
| Zero-width characters | Insert invisible Unicode between trigger words | Unicode stripping/normalization |
| Homoglyphs | Replace Latin characters with Cyrillic/Greek lookalikes | Confusable character mapping (ICU) |
| Policy Puppetry | Frame injection as policy/config file format | Detect XML/INI/JSON policy structures in user input |
| Token boundary exploitation | Split words across token boundaries | Multi-token pattern matching |
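Two of the countermeasures in the table, Unicode stripping/normalization and confusable mapping, can be sketched as a preprocessing step that runs before any pattern matching. This is a minimal illustration using only the standard library: the `INVISIBLE` character set and the tiny `CONFUSABLES` table are hypothetical stand-ins, and a real deployment would likely use a complete confusables mapping such as the one ICU provides. Emoji-to-text normalization is not covered here.

```python
import re
import unicodedata

# Zero-width and bidirectional-control characters to delete (illustrative set).
INVISIBLE = re.compile("[\u200b-\u200f\u202a-\u202e\u2060\ufeff]")

# Tiny, hypothetical confusables table: Cyrillic/Greek lookalikes -> Latin.
CONFUSABLES = str.maketrans({
    "\u0430": "a",  # Cyrillic a
    "\u0435": "e",  # Cyrillic ie
    "\u043e": "o",  # Cyrillic o
    "\u0440": "p",  # Cyrillic er
    "\u0441": "c",  # Cyrillic es
    "\u0456": "i",  # Cyrillic Byelorussian-Ukrainian i
    "\u03bf": "o",  # Greek omicron
})


def normalize_for_detection(text: str) -> str:
    """Fold compatibility forms, drop invisible characters, map lookalikes."""
    text = unicodedata.normalize("NFKC", text)  # e.g. fullwidth letters -> ASCII
    text = INVISIBLE.sub("", text)
    return text.translate(CONFUSABLES)
```

With this step in front of a detector, an input such as `"ig\u200bnore previous instructions"` normalizes to plain `"ignore previous instructions"` and is caught by an ordinary keyword pattern.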
Part 20: Mitigation Strategies
Defense-in-Depth Architecture
Research consensus (2025): adaptive attack strategies exceed 85% success against any single state-of-the-art defense. No single control is sufficient. AATMF mandates layered defense.
CaMeL Architecture (Google DeepMind, March 2025)
The most promising defensive framework treats LLMs as fundamentally untrusted components — analogous to how operating systems treat user-space programs:
| Principle | Implementation |
| --- | --- |
| Dual-LLM pattern | Frontier LLM generates plans; a hardened secondary LLM validates and sanitizes |
| Capability-based access | Tools require explicit capability tokens, not ambient authority |
| Information flow control | Track data provenance through the entire pipeline; tainted data cannot reach sensitive operations |
| Minimal authority | Agents receive only the permissions needed for the immediate task |
CaMeL solved 77% of AgentDojo tasks while providing provable security guarantees against prompt injection.
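To make the capability and information-flow principles concrete, here is a toy sketch. It is not the CaMeL implementation; `Value`, `ToolRunner`, `combine`, and `CapabilityError` are hypothetical names used only to show how a taint flag can travel with data and how a tool call can be refused when no capability was granted or when tainted data would reach a sensitive operation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Value:
    data: str
    tainted: bool  # True when derived from untrusted content (web page, email, tool output)


class CapabilityError(PermissionError):
    """Raised when a tool call violates the capability or data-flow policy."""


def combine(a: Value, b: Value) -> Value:
    """Taint propagates: a result is tainted if any input was."""
    return Value(a.data + b.data, a.tainted or b.tainted)


class ToolRunner:
    def __init__(self, granted: set[str]):
        # Explicit grants only; there is no ambient authority.
        self.granted = set(granted)

    def call(self, tool: str, arg: Value, sensitive: bool = False) -> str:
        if tool not in self.granted:
            raise CapabilityError(f"no capability for {tool}")
        if sensitive and arg.tainted:
            raise CapabilityError(f"tainted data cannot reach {tool}")
        return f"{tool}({arg.data})"  # stand-in for the real tool invocation
```

In this toy model, text retrieved from a web page would be wrapped as `Value(text, tainted=True)`, and anything built from it via `combine` stays tainted, so it could be passed to a read-only tool but not to one marked `sensitive=True`.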
Mitigation Controls by Tactic
| Tactic | Primary Controls |
| --- | --- |
| T1 — Prompt Subversion | Input sanitization, instruction hierarchy enforcement, system prompt isolation |
| T2 — Semantic Evasion | Unicode normalization, multi-layer content filtering, semantic analysis |
| T3 — Reasoning Exploitation | Output validation, reasoning chain verification, constraint hardening |
| T4 — Multi-Turn | Context window management, conversation state validation, memory isolation |
| T5 — Model/API | Rate limiting, query fingerprinting, differential privacy on outputs |
| T6 — Training Poisoning | Data provenance tracking, anomaly detection in training metrics, DRS defense |
| T7 — Output Manipulation | Output filtering, structured output enforcement, content watermarking |
| T8 — Deception | Fact-checking integration, source attribution, confidence calibration |
| T9 — Multimodal | Cross-modal consistency checking, steganography detection, input sanitization |
| T10 — Integrity Breach | TEE deployment, access control, audit logging, membership inference defense |
| T11 — Agentic | CaMeL architecture, tool permission scoping, MCP server auditing |
| T12 — RAG | Embedding integrity verification, retrieval result validation, source authentication |
| T13 — Supply Chain | SafeTensors adoption, Picklescan (with patches), SBOM for ML artifacts |
| T14 — Infrastructure | Network segmentation, inference server hardening, ZMQ authentication |
| T15 — Human Workflow | Reviewer training, decision audit trails, annotation quality metrics |
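Rate limiting, listed as a primary control for T5, is often implemented as a token bucket per API key. The sketch below is one common formulation rather than anything AATMF prescribes; the class name, the parameters, and the injectable `clock` (included so the behavior can be tested deterministically) are assumptions.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity      # start with a full bucket
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A gateway would typically keep one bucket per client identity and reject or queue requests when `allow()` returns `False`. Charging a higher `cost` for expensive endpoints is one way to slow bulk extraction without affecting ordinary use.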
LlamaFirewall (Meta, April 2025)
| Component | Function | Coverage |
| --- | --- | --- |
| PromptGuard 2 | Real-time input classification for injection and jailbreak | T1, T2, T9 |
| Agent Alignment Checks | Verify agent actions align with original user intent | T11 |
| CodeShield | Static analysis of LLM-generated code for insecure patterns | T7, T11 |
Priority Implementation Order
- Immediate — Input sanitization (T1–T3), rate limiting (T5), SafeTensors (T13)
- Short-term — CaMeL/dual-LLM pattern (T11), MCP auditing (T11), RAG validation (T12)
- Medium-term — Multimodal detection (T9), training pipeline security (T6), TEE deployment (T10)
- Ongoing — Human workflow hardening (T15), infrastructure monitoring (T14), supply chain verification (T13)
Part 21: Incident Response for AI Systems
AI-Specific IR Framework
Traditional IR frameworks (NIST SP 800-61, SANS PICERL) assume deterministic systems. AI incidents differ: attacks may be probabilistic, evidence may be ephemeral (conversation context), and "containment" for a language model has different semantics than for a compromised server.
Phase 1: Detection & Triage
| Signal | Source | Priority |
| --- | --- | --- |
| Safety filter bypass confirmed | Output monitoring | P1 — Immediate |
| Model extraction pattern detected | API telemetry | P1 — Immediate |
| Training pipeline anomaly | Training metrics | P1 — Immediate |
| MCP tool behavior deviation | Agent monitoring | P1 — Immediate |
| Jailbreak attempt (unsuccessful) | Input classifier | P3 — Logged |
| Unusual query pattern | Rate limiter | P2 — Investigate |
Phase 2: Containment
| Scenario | Action |
| --- | --- |
| Active jailbreak exploitation | Block session, rate-limit source, deploy updated filter |
| Model serving compromised artifact | Hot-swap to known-good checkpoint |
| RAG poisoning detected | Quarantine affected data sources, switch to cached index |
| Agentic system executing unauthorized actions | Kill agent process, revoke tool permissions |
| Training data contamination | Halt training pipeline, snapshot current state |
Phase 3: Investigation
Collect and preserve:
- Full conversation logs (with system prompts)
- Model version and configuration at time of incident
- Input classifier decisions and confidence scores
- Tool invocations and their results (for agentic systems)
- Training data batches (if poisoning suspected)
- Infrastructure logs (API gateway, inference server, vector DB)
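Preserving the evidence listed above usually means being able to show later that it was not altered. A minimal sketch, assuming the artifacts have already been collected as bytes: record a SHA-256 digest for each one at collection time and re-check the digests before analysis. The function names and the manifest layout are hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def build_manifest(incident_id: str, artifacts: dict[str, bytes]) -> dict:
    """Record a SHA-256 digest per artifact at collection time."""
    return {
        "incident_id": incident_id,
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "artifacts": {
            name: hashlib.sha256(blob).hexdigest()
            for name, blob in sorted(artifacts.items())
        },
    }


def changed_artifacts(manifest: dict, artifacts: dict[str, bytes]) -> list[str]:
    """Return names of artifacts that are missing or no longer match the manifest."""
    changed = []
    for name, digest in manifest["artifacts"].items():
        blob = artifacts.get(name)
        if blob is None or hashlib.sha256(blob).hexdigest() != digest:
            changed.append(name)
    return changed
```

Storing the manifest separately from the evidence itself (for example in a write-once location) keeps a single compromise from silently rewriting both.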
Phase 4: Eradication
| Root Cause | Eradication |
| --- | --- |
| Prompt injection bypass | Update input classifiers, add pattern to blocklist |
| Model vulnerability | Retrain or fine-tune with adversarial examples |
| RAG poisoning | Rebuild index from verified sources |
| Supply chain compromise | Replace artifact, audit provenance chain |
| Infrastructure vulnerability | Patch, harden, segment |
Phase 5: Recovery
Staged restoration with validation:
- Deploy patched system in shadow mode
- Run automated red team suite against fix
- Monitor for recurrence (24-hour observation window)
- Full production restoration with enhanced monitoring
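The staged restoration can be read as an ordered series of gates, where a later stage runs only if every earlier one passed. The sketch below shows that control flow only; the stage names mirror the list above, while `run_staged_recovery` and the check callables are hypothetical placeholders for real deployment and test tooling.

```python
from typing import Callable

Stage = tuple[str, Callable[[], bool]]


def run_staged_recovery(stages: list[Stage]) -> tuple[bool, list[str]]:
    """Run stages in order; stop at the first failing gate.

    Returns (all_passed, names_of_completed_stages).
    """
    completed: list[str] = []
    for name, check in stages:
        if not check():
            return False, completed
        completed.append(name)
    return True, completed
```

A caller would pass four stages (shadow deployment, red team suite, 24-hour observation, production restoration), each backed by a check that returns `True` only when its exit criterion is met. A failed gate leaves production untouched and reports how far the recovery got.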
Phase 6: Post-Incident
- Publish internal lessons learned
- Update AATMF technique documentation if novel attack
- Share indicators (responsibly) with AI security community
- Update detection signatures and response playbooks
Case Study: GTG-1002 (November 2025)
Incident: The first publicly reported state-sponsored, AI-orchestrated cyber-espionage campaign. A threat group that Anthropic assessed to be Chinese state-sponsored manipulated Claude Code into autonomously executing 80–90% of operational tasks across approximately 30 targets.
IR Lessons:
- Traditional SOC tooling did not detect the AI-orchestrated activity, which resembled a normal developer workflow
- Agentic AI tools require separate monitoring planes from standard endpoints
- The attack demonstrated that AI agents can serve as force multipliers for human operators, not just autonomous actors
- Post-incident, Anthropic published detailed attribution and tactical analysis
Part 22: Red Team Operations
Engagement Planning
Assessment Scope Matrix
| Level | Name | Tactics | Duration | Prerequisites |
| --- | --- | --- | --- | --- |
| 1 | Quick Scan | T1–T3 | 1–2 days | API access |
| 2 | Standard Assessment | T1–T8 | 1–2 weeks | API + documentation |
| 3 | Comprehensive | T1–T12 | 3–4 weeks | Full system access |
| 4 | Full Spectrum | T1–T15 | 6–8 weeks | Source code + infra + training pipeline |
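When scoping an engagement, it can help to expand each assessment level into the explicit list of tactics it authorizes and to find the lowest level that covers a tactic of interest. The sketch below encodes the matrix's level names and tactic ranges; `SCOPE_LEVELS`, `authorized_tactics`, and `minimum_level` are hypothetical names.

```python
# Level -> (name, highest tactic number covered), transcribed from the matrix.
SCOPE_LEVELS = {
    1: ("Quick Scan", 3),
    2: ("Standard Assessment", 8),
    3: ("Comprehensive", 12),
    4: ("Full Spectrum", 15),
}


def authorized_tactics(level: int) -> list[str]:
    """Expand a level's range (for example T1–T8) into explicit tactic IDs."""
    _, highest = SCOPE_LEVELS[level]
    return [f"T{i}" for i in range(1, highest + 1)]


def minimum_level(tactic: str) -> int:
    """Return the lowest assessment level whose range includes `tactic`."""
    number = int(tactic.lstrip("T"))
    for level in sorted(SCOPE_LEVELS):
        if 1 <= number <= SCOPE_LEVELS[level][1]:
            return level
    raise ValueError(f"{tactic} is outside every assessment level")
```

For instance, testing agentic attacks (T11) requires at least a Level 3 engagement, since `minimum_level("T11")` returns `3`.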
Rules of Engagement Template
1. Scope: [Models/systems in scope]
2. Tactics: [AATMF tactics authorized]
3. Boundaries: [Explicitly prohibited actions]
4. Data handling: [Treatment of discovered vulnerabilities/outputs]
5. Communication: [Escalation path for critical findings]
6. Timeline: [Assessment window]
7. Success criteria: [Minimum coverage requirements]
Autonomous Red Teaming
The same reasoning-model capabilities that enable jailbreaking at a 97% attack success rate (ASR) can be directed defensively at your own systems:
class AutonomousRedTeam:
    # load_model and load_aatmf_techniques are placeholders for the deployment's
    # own model loader and AATMF technique catalog; evaluate() and
    # generate_report() are likewise left to the implementer.
    def __init__(self, target_api, attack_model="deepseek-r1"):
        self.target = target_api
        self.attacker = load_model(attack_model)
        self.results = []

    def run_campaign(self, tactic_ids: list, max_attempts=100):
        for tactic in tactic_ids:
            techniques = load_aatmf_techniques(tactic)
            for technique in techniques:
                for attempt in range(max_attempts):
                    # Generate attack variant
                    prompt = self.attacker.generate(
                        f"Generate a novel variant of {technique.name} "
                        f"attack. Previous attempts: {self.results[-5:]}"
                    )
                    # Execute against target
                    response = self.target.query(prompt)
                    # Evaluate success
                    success = self.evaluate(response, technique)
                    self.results.append({
                        "tactic": tactic,
                        "technique": technique.id,
                        "prompt": prompt,
                        "response": response,
                        "success": success,
                        "attempt": attempt
                    })
                    if success:
                        break  # Move to next technique
        return self.generate_report()