
Model Extraction

An attack that steals machine learning model functionality via systematic querying, replicating proprietary models without access to training data.

Last updated: January 24, 2025

Definition

Model extraction (also called model stealing) allows an attacker to create a functional copy of a machine learning model by systematically querying it and training a surrogate model on the responses. The attacker doesn't need access to training data, model weights, or architecture—just query access.


How It Works

  1. Query the target — Send inputs and collect outputs
  2. Build dataset — Create input-output pairs from responses
  3. Train surrogate — Train a new model on collected data
  4. Refine — Iteratively improve with targeted queries
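The loop above can be sketched end to end. This is a minimal illustration, not a real attack: the "target" is a toy linear classifier standing in for a remote prediction API, the surrogate is a simple perceptron, and `query_target`, `W_true`, and all sizes are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target: a proprietary linear classifier we can only query.
# In a real attack this would be a remote API call returning labels.
W_true = np.array([1.5, -2.0])
def query_target(x):
    return int(x @ W_true > 0)  # label-only response

# 1. Query the target with probe inputs
X = rng.normal(size=(500, 2))
# 2. Build a dataset of input-output pairs from the responses
y = np.array([query_target(x) for x in X])

# 3. Train a surrogate (here: a bare perceptron) on the collected pairs
w = np.zeros(2)
for _ in range(20):
    for xi, yi in zip(X, y):
        pred = int(xi @ w > 0)
        w += (yi - pred) * xi  # perceptron update

# 4. Measure surrogate-target agreement on fresh inputs; in a real attack
#    disagreements would drive the next round of targeted queries
X_test = rng.normal(size=(200, 2))
agree = np.mean([int(x @ w > 0) == query_target(x) for x in X_test])
print(f"surrogate-target agreement: {agree:.2f}")
```

With only label access and a few hundred queries, the surrogate already agrees with the toy target on most inputs, which is why query volume is central to both the attack and its detection.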

Attack Variants

Functionally Equivalent Extraction

Creating a surrogate that matches the target's outputs exactly on every input.

Fidelity Extraction

Approximating the target's behavior to a level of agreement that is sufficient for the attacker's goals.

Decision Boundary Extraction

Learning the classification boundaries without full model replication.
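
Decision boundary extraction can be illustrated with a label-only binary search: given one input on each side of the boundary, repeated midpoint queries pin down a boundary point. The target below is a hypothetical threshold classifier invented for this sketch.

```python
import numpy as np

# Hypothetical label-only target (an assumption for illustration):
# classifies points by whether x0 + x1 exceeds a hidden threshold.
THRESHOLD = 0.7
def query_target(x):
    return int(x[0] + x[1] > THRESHOLD)

def find_boundary_point(neg, pos, queries=30):
    """Binary-search the segment between a negative and a positive
    example to locate a point on the decision boundary."""
    lo, hi = np.asarray(neg, float), np.asarray(pos, float)
    for _ in range(queries):
        mid = (lo + hi) / 2
        if query_target(mid):
            hi = mid  # mid is on the positive side; tighten from above
        else:
            lo = mid  # mid is on the negative side; tighten from below
    return (lo + hi) / 2

p = find_boundary_point([0.0, 0.0], [1.0, 1.0])
print(p, p.sum())  # p.sum() converges to the hidden threshold
```

Repeating this search along many directions yields a set of boundary points from which the separating surface can be reconstructed, without ever replicating the full model.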


Why It Matters

  • IP theft — Months of training work stolen through API access
  • Attack enablement — Extracted models enable white-box attacks
  • Competitive advantage — Replicate competitor capabilities
  • Bypass restrictions — Use extracted model without rate limits

Detection

  • Monitor for unusual query patterns (systematic, high-volume)
  • Detect queries designed to map decision boundaries
  • Identify synthetic-looking input distributions
  • Track API usage anomalies per user/organization

Defenses

  • Rate limiting — Restrict query volume
  • Query diversity requirements — Flag repetitive patterns
  • Output perturbation — Add noise to responses
  • Watermarking — Embed detectable signatures
  • API monitoring — Behavioral analysis of usage
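
Output perturbation can be sketched as adding small noise to returned probability vectors before renormalizing; `noise_scale` and the example scores are arbitrary choices for illustration, and real deployments must weigh this degradation against utility for honest users.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_scores(probs, noise_scale=0.05):
    """Add small Gaussian noise to a probability vector and renormalize,
    degrading the fine-grained signal an extractor relies on while
    usually preserving the top-1 label."""
    probs = np.asarray(probs, float)
    noisy = np.clip(probs + rng.normal(0.0, noise_scale, probs.shape),
                    1e-9, None)
    return noisy / noisy.sum()

clean = np.array([0.72, 0.20, 0.08])
noisy = perturb_scores(clean)
print(noisy, noisy.argmax())  # still a valid distribution, same top label
```

Stronger variants of the same idea return only the top-1 label or rounded scores, trading away API expressiveness for extraction resistance.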

References

  • Tramèr, F. et al. (2016). "Stealing Machine Learning Models via Prediction APIs." USENIX Security Symposium.
  • Jagielski, M. et al. (2020). "High Accuracy and High Fidelity Extraction of Neural Networks." USENIX Security Symposium.

Framework Mappings

  • MITRE ATLAS — AML.T0024: Model Theft
  • OWASP LLM Top 10 — LLM10: Model Theft
  • AATMF — ME-* (Model Extraction category)

Citation

Aizen, K. (2025). "Model Extraction." AI Security Wiki, snailsploit.com. Retrieved from https://snailsploit.com/ai-security/wiki/attacks/model-extraction/