snailsploit[$]
How-to · 6 steps

How to Do AI Red Teaming

End-to-end methodology for systematic adversarial testing of LLM-based systems — scope, recon, exploitation, scoring, reporting.

Step 1

Scope the system precisely

Document the full stack: model, system prompt, tools/MCP, retrieval corpus, user input channels, output sinks. Every component is in scope unless explicitly carved out.

Step 2

Fingerprint defenses

Probe for input filters, alignment thresholds, output classifiers, refusal categories. Run baseline benign queries to establish normal behavior.

Step 3

Build the attack chain

Map findings to AATMF tactics. Chain low-severity primitives into high-severity outcomes. Multi-turn + memory manipulation + tool poisoning often beats stronger single-vector attacks.

Step 4

Prove impact

Each finding ships with a working repro, scoped harm, and realistic threat model. Avoid theoretical impact — defenders dismiss it.

Step 5

Score with AATMF-R

Quantify, don't editorialize. AATMF-R = L × I × E × D × R × C. Comparable across engagements.

Step 6

Report for the audience

Technical writeup for engineers, exec summary for stakeholders, remediation playbook for the security team. Every finding gets all three.

See also