snailsploit[$]
How-to · 6 steps

How to Test for Prompt Injection

A 6-step methodology for testing LLM-based systems against direct, indirect, multi-turn, and agentic prompt injection — straight from AATMF Volume II.

Step 1

Map the trust boundary

Identify every place untrusted text enters the model's context — user input, tool outputs, retrieved docs, system files. The boundary is wherever the model treats data as instructions.

Step 2

Test direct injection first

Inject role-override payloads ("ignore previous instructions"), persona reframing, and authority claims. AATMF T1 covers 16 techniques in this category.

Step 3

Move to indirect injection

Plant payloads in retrievable sources (web pages, files, RAG corpora). The user is not the attacker — the data is. AATMF T1.4 documents 7 patterns.

Step 4

Probe multi-turn drift

Decompose the attack across turns to evade single-turn filters. AATMF T4 has 16 techniques covering memory manipulation, context hijacking, and persona persistence.

Step 5

Stress-test the agent layer

If tools/MCP are exposed, test tool-call poisoning, parameter manipulation, and trust-chain abuse. AATMF T11 covers this.

Step 6

Score with AATMF-R

Quantify each finding with the AATMF Risk score: Likelihood × Impact × Exploitability × Detectability × Reversibility × Confidence. Output: comparable severity numbers across findings.

See also