Step 1

Map the trust boundary

Identify every place untrusted text enters the model's context — user input, tool outputs, retrieved docs, system files. The boundary is wherever the model treats data as instructions.

Step 2

Test direct injection first

Inject role-override payloads ("ignore previous instructions"), persona reframing, and authority claims. AATMF T1 covers 16 techniques in this category.

Step 3

Move to indirect injection

Plant payloads in retrievable sources (web pages, files, RAG corpora). The user is not the attacker — the data is. AATMF T1.4 documents 7 patterns.

Step 4

Probe multi-turn drift

Decompose the attack across turns to evade single-turn filters. AATMF T4 has 16 techniques covering memory manipulation, context hijacking, and persona persistence.

Step 5

Stress-test the agent layer

If tools/MCP are exposed, test tool-call poisoning, parameter manipulation, and trust-chain abuse. AATMF T11 covers this.

Step 6

Score with AATMF-R

Quantify each finding with the AATMF Risk score: Likelihood × Impact × Exploitability × Detectability × Reversibility × Confidence. Output: comparable severity numbers across findings.

How to Test for Prompt Injection