Training Data Extraction
Privacy attack that extracts memorized training data from language models, revealing sensitive personal information, copyrighted content, or proprietary data.
Definition
Training data extraction is an attack that causes language models to regurgitate memorized training data verbatim. Large language models don't just learn patterns—they memorize specific sequences from their training corpus, including potentially sensitive content like personal information, credentials, and proprietary data.
This differs from model extraction (which steals model functionality) and membership inference (which determines if specific data was used in training). Training data extraction directly recovers the actual content.
How Models Memorize
LLMs trained on web-scale data inevitably memorize some training examples:
- Repetition — Content appearing multiple times in training data is more likely memorized
- Uniqueness — Highly distinctive text (like specific emails) can be memorized even with single exposure
- Context sensitivity — Specific prompts trigger recall of memorized sequences
- Model size — Larger models have more capacity for memorization
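The repetition factor above can be illustrated with a toy sketch: counting how often each n-gram recurs in a corpus approximates which sequences are most at risk of being memorized. The corpus and helper names here are illustrative, not part of any real training pipeline.

```python
from collections import Counter

def ngram_counts(corpus, n=4):
    """Count how often each n-gram (over whitespace tokens) recurs in a corpus."""
    tokens = corpus.split()
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(grams)

# Sequences repeated across the corpus are the most likely to be memorized.
corpus = "the secret code is ABC123 " * 3 + "some unique filler text here"
counts = ngram_counts(corpus, n=4)
repeated = {gram: c for gram, c in counts.items() if c > 1}
```

Real deduplication-risk analyses do this at scale over tokenized documents, but the principle is the same: high-count sequences are the ones a model is most likely to reproduce verbatim.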
Attack Techniques
Divergence Attack
Causing the model to "diverge" into memorized content through repetition:
```python
# Repeat a token many times to trigger divergence into memorization
prompt = "poem poem poem poem poem poem poem poem poem poem"
# The model may diverge into memorized text containing "poem"
output = model.generate(prompt, max_tokens=500)
# The result might include memorized poems, song lyrics, or
# personal content containing the word "poem"
```

Prefix Probing
Using known prefixes to extract completions:
```python
# If the attacker knows part of the content
prefix = "My email address is john.smith@"
# The model completes the prefix with memorized training data
completion = model.generate(prefix)
# May output the actual address memorized from the training corpus
```

High-Temperature Sampling
```python
# Higher temperature explores more of the model's output distribution
responses = []
for _ in range(1000):
    response = model.generate(
        "Personal information:",
        temperature=1.5,  # high temperature for diverse samples
        max_tokens=100,
    )
    responses.append(response)

# Analyze the samples for memorized content
memorized = find_pii_patterns(responses)
```

Canary Extraction
Researchers can insert "canary" strings during training to measure extraction risk:
```python
# During training, insert a unique canary string, e.g.:
#   "The secret code is: ABC123XYZ"
# After training, probe for it:
prompt = "The secret code is:"
if "ABC123XYZ" in model.generate(prompt):
    print("Memorization detected!")
```

What Can Be Extracted
| Data Type | Risk Level | Example |
|---|---|---|
| Email addresses | High | Personal/professional contacts |
| Phone numbers | High | Including private numbers |
| API keys/credentials | Critical | Keys from code repositories |
| Physical addresses | High | Home/business addresses |
| Copyrighted text | Legal risk | Books, articles, lyrics |
| Proprietary code | IP theft | Private repositories |
| Medical records | Critical | If present in training |
Detection and Measurement
Extractability Score
```python
def measure_extractability(model, known_sequence):
    """Test whether a known training sequence can be extracted from the model."""
    prefix_lengths = [10, 20, 50, 100]
    results = []
    for length in prefix_lengths:
        prefix = known_sequence[:length]
        completion = model.generate(prefix, max_tokens=len(known_sequence))
        overlap = calculate_overlap(completion, known_sequence)
        results.append({
            "prefix_length": length,
            "extraction_rate": overlap,
        })
    return results
```

Indicators of Memorization
- Verbatim reproduction of specific formats (emails, code comments)
- Consistent output across temperature settings
- Very low perplexity on specific sequences
- Output matches known training data sources
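The low-perplexity indicator above can be made concrete. Perplexity is the exponentiated mean negative log-likelihood of a token sequence; a sketch, assuming access to the per-token log-probabilities the model assigned (which APIs expose in varying forms):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood) over a token sequence.

    token_logprobs: natural-log probabilities the model assigned to each token.
    """
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# A memorized sequence gets near-certain predictions (logprobs near 0),
# so its perplexity approaches 1; novel text scores much higher.
memorized_logprobs = [-0.01, -0.02, -0.01, -0.03]
novel_logprobs = [-2.5, -3.1, -1.9, -2.8]
```

Sequences whose perplexity is far below the model's typical range for comparable text are candidates for verbatim recall.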
Defenses
Training-Time Defenses
- Deduplication — Remove repeated content from training data
- Differential privacy — Add noise during training to limit memorization
- Data sanitization — Remove PII before training
- Canary monitoring — Insert test sequences to detect extraction
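Exact deduplication, the first defense above, can be sketched by hashing normalized documents and keeping only the first copy of each. Production pipelines also perform near-duplicate detection (e.g. MinHash); this minimal version handles exact matches only, and the sample documents are invented:

```python
import hashlib

def deduplicate(docs):
    """Keep the first copy of each document, comparing normalized content hashes."""
    seen, unique = set(), []
    for doc in docs:
        normalized = " ".join(doc.split()).lower()
        digest = hashlib.sha256(normalized.encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

# The second document differs only in casing and whitespace, so it is dropped.
docs = ["Call me at 555-0100.", "call me at  555-0100.", "Unrelated text."]
```

Removing such repeats directly attacks the repetition factor described earlier: content that appears once is far less likely to be memorized than content that appears many times.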
Inference-Time Defenses
- Output filtering — Detect and block memorized content
- Perplexity monitoring — Flag suspiciously low-perplexity outputs
- Rate limiting — Limit queries that could systematically extract data
- Membership inference detection — Identify probing patterns
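The rate-limiting defense above can be sketched with a per-client sliding window; the class name, limits, and window size are illustrative assumptions, not a production design:

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Allow at most `limit` generation requests per client per `window` seconds."""

    def __init__(self, limit=100, window=60.0):
        self.limit, self.window = limit, window
        self.hits = defaultdict(deque)

    def allow(self, client_id, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits[client_id]
        while q and now - q[0] > self.window:  # drop hits outside the window
            q.popleft()
        if len(q) >= self.limit:
            return False  # client is querying too fast; block or add friction
        q.append(now)
        return True
```

Systematic extraction attacks need thousands of queries (as in the high-temperature sampling example above), so even coarse per-client limits raise their cost substantially.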
Output Filtering Example
```python
def filter_memorized_content(output):
    """Detect and handle potential training data leakage in model output."""
    # Check for PII patterns
    if contains_pii(output):
        return redact_pii(output)
    # Check against known training sources
    if similarity_to_training_data(output) > threshold:
        return "[Content filtered: potential memorization]"
    # Check for specific format patterns indicating verbatim recall
    if matches_document_format(output):
        return sanitize_output(output)
    return output
```

Real-World Examples
GPT-2 Memorization Study (2021) — Carlini et al. demonstrated extraction of PII, code, and URLs from GPT-2, including specific individuals' contact information.
ChatGPT Training Data Leak (2023) — Researchers extracted thousands of examples of memorized training data from ChatGPT using divergence attacks.
Copilot Code Reproduction — GitHub Copilot has reproduced verbatim code from training data, including code with restrictive licenses.
Legal and Ethical Implications
- Privacy regulations — GDPR, CCPA implications for memorized personal data
- Copyright concerns — Verbatim reproduction of copyrighted works
- Trade secrets — Potential extraction of proprietary code or documents
- Informed consent — Data subjects unaware their data is memorized
References
- Carlini, N. et al. (2021). "Extracting Training Data from Large Language Models." USENIX Security Symposium.
- Carlini, N. et al. (2023). "Quantifying Memorization Across Neural Language Models." ICLR.
- Nasr, M. et al. (2023). "Scalable Extraction of Training Data from (Production) Language Models."
- OWASP (2023). "LLM06: Sensitive Information Disclosure."
Framework Mappings
| Framework | Reference |
|---|---|
| OWASP LLM Top 10 | LLM06: Sensitive Information Disclosure |
| MITRE ATLAS | AML.T0024.000: Infer Training Data Membership |
| NIST AI RMF | MANAGE 3.1: Privacy risks |
Citation
Aizen, K. (2025). "Training Data Extraction." AI Security Wiki, snailsploit.com. Retrieved from https://snailsploit.com/ai-security/wiki/attacks/training-data-extraction/