
Model Extraction

An attack that steals machine learning model functionality via systematic querying, replicating proprietary models without access to training data.

Last updated: January 24, 2025

Definition

Model extraction (also called model stealing) allows an attacker to create a functional copy of a machine learning model by systematically querying it and training a surrogate model on the responses. The attacker doesn't need access to training data, model weights, or architecture—just query access.


How It Works

  1. Query the target — Send inputs and collect outputs
  2. Build dataset — Create input-output pairs from responses
  3. Train surrogate — Train a new model on collected data
  4. Refine — Iteratively improve with targeted queries
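The loop above can be sketched end to end. This is a minimal illustration, not a real attack: the "target" is a toy linear classifier standing in for a remote prediction API, the surrogate is a simple perceptron, and `query_target`, `W_true`, and all sizes are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target: a proprietary linear classifier we can only query.
# In a real attack this would be a remote API call returning labels.
W_true = np.array([1.5, -2.0])
def query_target(x):
    return int(x @ W_true > 0)  # label-only response

# 1. Query the target with probe inputs
X = rng.normal(size=(500, 2))
# 2. Build a dataset of input-output pairs from the responses
y = np.array([query_target(x) for x in X])

# 3. Train a surrogate (here: a bare perceptron) on the collected pairs
w = np.zeros(2)
for _ in range(20):
    for xi, yi in zip(X, y):
        pred = int(xi @ w > 0)
        w += (yi - pred) * xi  # perceptron update

# 4. Measure surrogate-target agreement on fresh inputs; in a real attack
#    disagreements would drive the next round of targeted queries
X_test = rng.normal(size=(200, 2))
agree = np.mean([int(x @ w > 0) == query_target(x) for x in X_test])
print(f"surrogate-target agreement: {agree:.2f}")
```

With only label access and a few hundred queries, the surrogate already agrees with the toy target on most inputs, which is why query volume is central to both the attack and its detection.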

Attack Variants

Functionally Equivalent Extraction

Creating a surrogate that matches the target's outputs exactly on every input.

Fidelity Extraction

Approximating the target's behavior to a level of agreement that is sufficient for the attacker's goals.

Decision Boundary Extraction

Learning the classification boundaries without full model replication.
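
Decision boundary extraction can be illustrated with a label-only binary search: given one input on each side of the boundary, repeated midpoint queries pin down a boundary point. The target below is a hypothetical threshold classifier invented for this sketch.

```python
import numpy as np

# Hypothetical label-only target (an assumption for illustration):
# classifies points by whether x0 + x1 exceeds a hidden threshold.
THRESHOLD = 0.7
def query_target(x):
    return int(x[0] + x[1] > THRESHOLD)

def find_boundary_point(neg, pos, queries=30):
    """Binary-search the segment between a negative and a positive
    example to locate a point on the decision boundary."""
    lo, hi = np.asarray(neg, float), np.asarray(pos, float)
    for _ in range(queries):
        mid = (lo + hi) / 2
        if query_target(mid):
            hi = mid  # mid is on the positive side; tighten from above
        else:
            lo = mid  # mid is on the negative side; tighten from below
    return (lo + hi) / 2

p = find_boundary_point([0.0, 0.0], [1.0, 1.0])
print(p, p.sum())  # p.sum() converges to the hidden threshold
```

Repeating this search along many directions yields a set of boundary points from which the separating surface can be reconstructed, without ever replicating the full model.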


Why It Matters

  • IP theft — Months of training work stolen through API access
  • Attack enablement — Extracted models enable white-box attacks
  • Competitive advantage — Replicate competitor capabilities
  • Bypass restrictions — Use extracted model without rate limits

Detection

  • Monitor for unusual query patterns (systematic, high-volume)
  • Detect queries designed to map decision boundaries
  • Identify synthetic-looking input distributions
  • Track API usage anomalies per user/organization

Defenses

  • Rate limiting — Restrict query volume
  • Query diversity requirements — Flag repetitive patterns
  • Output perturbation — Add noise to responses
  • Watermarking — Embed detectable signatures
  • API monitoring — Behavioral analysis of usage
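
Output perturbation can be sketched as adding small noise to returned probability vectors before renormalizing; `noise_scale` and the example scores are arbitrary choices for illustration, and real deployments must weigh this degradation against utility for honest users.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_scores(probs, noise_scale=0.05):
    """Add small Gaussian noise to a probability vector and renormalize,
    degrading the fine-grained signal an extractor relies on while
    usually preserving the top-1 label."""
    probs = np.asarray(probs, float)
    noisy = np.clip(probs + rng.normal(0.0, noise_scale, probs.shape),
                    1e-9, None)
    return noisy / noisy.sum()

clean = np.array([0.72, 0.20, 0.08])
noisy = perturb_scores(clean)
print(noisy, noisy.argmax())  # still a valid distribution, same top label
```

Stronger variants of the same idea return only the top-1 label or rounded scores, trading away API expressiveness for extraction resistance.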

References

  • Tramèr, F. et al. (2016). "Stealing Machine Learning Models via Prediction APIs." USENIX Security Symposium.
  • Jagielski, M. et al. (2020). "High Accuracy and High Fidelity Extraction of Neural Networks." USENIX Security Symposium.

Framework Mappings

  • MITRE ATLAS — AML.T0024: Model Theft
  • OWASP LLM Top 10 — LLM10: Model Theft
  • AATMF — ME-* (Model Extraction category)

Citation

Aizen, K. (2025). "Model Extraction." AI Security Wiki, snailsploit.com. Retrieved from https://snailsploit.com/ai-security/wiki/attacks/model-extraction/