Attacks Wiki Entry

Training Data Extraction

Privacy attack that extracts memorized training data from language models, revealing sensitive personal information, copyrighted content, or proprietary data.

Last updated: January 24, 2025

Definition

Training data extraction is an attack that causes language models to regurgitate memorized training data verbatim. Large language models don't just learn patterns—they memorize specific sequences from their training corpus, including potentially sensitive content like personal information, credentials, and proprietary data.

This differs from model extraction (which steals model functionality) and membership inference (which determines if specific data was used in training). Training data extraction directly recovers the actual content.


How Models Memorize

LLMs trained on web-scale data inevitably memorize some training examples. Several factors govern how likely a given sequence is to be memorized:

  • Repetition — Content appearing multiple times in training data is more likely memorized
  • Uniqueness — Highly distinctive text (like specific emails) can be memorized even with single exposure
  • Context sensitivity — Specific prompts trigger recall of memorized sequences
  • Model size — Larger models have more capacity for memorization
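The repetition factor above can be checked directly: counting how often a word sequence recurs across documents gives a rough memorization-risk signal. A minimal sketch (the corpus and window size here are illustrative assumptions, not a production pipeline):

```python
from collections import Counter

def repetition_risk(corpus_docs, window=8):
    """Count repeated word n-grams across documents; sequences that
    appear many times are the most likely to be memorized verbatim."""
    counts = Counter()
    for doc in corpus_docs:
        tokens = doc.split()
        for i in range(len(tokens) - window + 1):
            counts[" ".join(tokens[i:i + window])] += 1
    # Keep only n-grams that occur more than once
    return {ngram: c for ngram, c in counts.items() if c > 1}

docs = [
    "call us at 555 0100 for a free quote today and save big",
    "call us at 555 0100 for a free quote today limited offer",
]
risks = repetition_risk(docs)  # the shared prefix shows up as repeated 8-grams
```

Real dedup-and-risk tooling works on tokenized corpora at scale, but the signal is the same: high counts flag extraction-prone content.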

Attack Techniques

Divergence Attack

Causing the model to "diverge" into memorized content through repetition:

# Repeat a token many times to trigger memorization
prompt = "poem poem poem poem poem poem poem poem poem poem"

# Model may diverge into memorized text containing "poem"
output = model.generate(prompt, max_tokens=500)
# Result might include memorized poems, song lyrics, or
# personal content containing the word "poem"

Prefix Probing

Using known prefixes to extract completions:

# If attacker knows partial content
prefix = "My email address is john.smith@"

# Model completes with memorized training data
completion = model.generate(prefix)
# May output: "[email protected]" (actual email from training)

High-Temperature Sampling

# Higher temperature explores more of the model's memory
responses = []
for _ in range(1000):
    response = model.generate(
        "Personal information:",
        temperature=1.5,  # High temperature
        max_tokens=100
    )
    responses.append(response)

# Analyze responses for memorized content
memorized = find_pii_patterns(responses)
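The find_pii_patterns helper above is left undefined; a minimal regex-based version might look like the following. The patterns are illustrative and nowhere near exhaustive (production detectors use dedicated PII-recognition libraries):

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def find_pii_patterns(responses):
    """Scan sampled generations for strings that look like PII."""
    hits = []
    for text in responses:
        for label, pattern in PII_PATTERNS.items():
            for match in pattern.findall(text):
                hits.append((label, match))
    return hits

hits = find_pii_patterns(["Contact jane.doe@example.com or 555-867-5309"])
# hits: [("email", "jane.doe@example.com"), ("phone", "555-867-5309")]
```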

Canary Extraction

Researchers can insert "canary" strings during training to measure extraction risk:

# During training, insert: "The secret code is: ABC123XYZ"
# After training, probe:
prompt = "The secret code is:"
if "ABC123XYZ" in model.generate(prompt):
    print("Memorization detected!")

What Can Be Extracted

Data Type            | Risk Level | Example
Email addresses      | High       | Personal/professional contacts
Phone numbers        | High       | Including private numbers
API keys/credentials | Critical   | Keys from code repositories
Physical addresses   | High       | Home/business addresses
Copyrighted text     | Legal risk | Books, articles, lyrics
Proprietary code     | IP theft   | Private repositories
Medical records      | Critical   | If present in training

Detection and Measurement

Extractability Score

def measure_extractability(model, known_sequence):
    """Test if a known training sequence can be extracted"""
    prefix_lengths = [10, 20, 50, 100]
    results = []

    for length in prefix_lengths:
        prefix = known_sequence[:length]
        completion = model.generate(prefix, max_tokens=len(known_sequence))
        overlap = calculate_overlap(completion, known_sequence)
        results.append({
            "prefix_length": length,
            "extraction_rate": overlap
        })

    return results
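The calculate_overlap helper is left undefined above. One simple choice is the fraction of the reference's words that the completion reproduces as a contiguous run; this is an illustrative sketch, not the only reasonable metric:

```python
def calculate_overlap(completion, reference):
    """Fraction of the reference's words reproduced verbatim by the
    completion, measured as the longest common contiguous word run."""
    ref = reference.split()
    comp = completion.split()
    best = 0
    for i in range(len(comp)):
        for j in range(len(ref)):
            k = 0
            while (i + k < len(comp) and j + k < len(ref)
                   and comp[i + k] == ref[j + k]):
                k += 1
            best = max(best, k)
    return best / len(ref) if ref else 0.0

print(calculate_overlap("the quick brown fox", "the quick brown fox jumps"))
# → 0.8 (4 of the 5 reference words reproduced in order)
```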

Indicators of Memorization

  • Verbatim reproduction of specific formats (emails, code comments)
  • Consistent output across temperature settings
  • Very low perplexity on specific sequences
  • Output matches known training data sources
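The low-perplexity indicator can be computed directly from per-token log-probabilities: perplexity is the exponentiated mean negative log-likelihood. A sketch with synthetic log-probs (a real check would take these from the model's scoring API):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean log-probability). Memorized sequences
    tend to score far lower than comparable novel text."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

memorized = [-0.05, -0.02, -0.04, -0.03]   # near-certain next tokens
novel = [-2.1, -3.4, -1.8, -2.7]           # ordinary uncertainty
print(perplexity(memorized))  # ~1.04
print(perplexity(novel))      # ~12.2
```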

Defenses

Training-Time Defenses

  • Deduplication — Remove repeated content from training data
  • Differential privacy — Add noise during training to limit memorization
  • Data sanitization — Remove PII before training
  • Canary monitoring — Insert test sequences to detect extraction
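The deduplication defense can be approximated with exact-match hashing over normalized documents. Production pipelines use fuzzier methods (MinHash, suffix arrays), but an exact-dedup sketch looks like:

```python
import hashlib

def deduplicate(docs):
    """Drop exact duplicates by hashing a normalized form of each doc."""
    seen = set()
    kept = []
    for doc in docs:
        # Normalize case and whitespace before hashing
        key = hashlib.sha256(" ".join(doc.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

docs = [
    "The secret code is ABC123.",
    "the secret  code is abc123.",
    "Unrelated text.",
]
print(deduplicate(docs))  # keeps the first copy and the unrelated doc
```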

Inference-Time Defenses

  • Output filtering — Detect and block memorized content
  • Perplexity monitoring — Flag suspiciously low-perplexity outputs
  • Rate limiting — Limit queries that could systematically extract data
  • Membership inference detection — Identify probing patterns

Output Filtering Example

def filter_memorized_content(output):
    """Detect potential training data leakage"""

    # Check for PII patterns
    if contains_pii(output):
        return redact_pii(output)

    # Check against known training sources
    if similarity_to_training_data(output) > threshold:
        return "[Content filtered: potential memorization]"

    # Check for specific format patterns indicating verbatim recall
    if matches_document_format(output):
        return sanitize_output(output)

    return output
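The similarity_to_training_data check above can be approximated with character n-gram Jaccard similarity against an index of known training snippets. This is a simplified stand-in; real deployments use scalable sketches or Bloom filters rather than scanning every snippet:

```python
def char_ngrams(text, n=5):
    """Set of character n-grams over a case/whitespace-normalized string."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def similarity_to_training_data(output, training_snippets, n=5):
    """Max Jaccard similarity between output n-grams and any known snippet."""
    out = char_ngrams(output, n)
    best = 0.0
    for snippet in training_snippets:
        snip = char_ngrams(snippet, n)
        if out or snip:
            best = max(best, len(out & snip) / len(out | snip))
    return best

snippets = ["my email address is john at example dot com"]
print(similarity_to_training_data(
    "My email address is john at example dot com", snippets))
# → 1.0 (verbatim match after normalization)
```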

Real-World Examples

GPT-2 Memorization Study (2021) — Carlini et al. demonstrated extraction of PII, code, and URLs from GPT-2, including specific individuals' contact information.

ChatGPT Training Data Leak (2023) — Researchers extracted thousands of examples of memorized training data from ChatGPT using divergence attacks.

Copilot Code Reproduction — GitHub Copilot has reproduced verbatim code from training data, including code with restrictive licenses.


Legal and Ethical Implications

  • Privacy regulations — GDPR, CCPA implications for memorized personal data
  • Copyright concerns — Verbatim reproduction of copyrighted works
  • Trade secrets — Potential extraction of proprietary code or documents
  • Informed consent — Data subjects unaware their data is memorized

References

  • Carlini, N. et al. (2021). "Extracting Training Data from Large Language Models." USENIX Security Symposium.
  • Carlini, N. et al. (2023). "Quantifying Memorization Across Neural Language Models." ICLR.
  • Nasr, M. et al. (2023). "Scalable Extraction of Training Data from (Production) Language Models." arXiv preprint.
  • OWASP (2023). "LLM06: Sensitive Information Disclosure."

Framework Mappings

Framework        | Reference
OWASP LLM Top 10 | LLM06: Sensitive Information Disclosure
MITRE ATLAS      | AML.T0024.000: Infer Training Data Membership
NIST AI RMF      | MANAGE 3.1: Privacy risks

Citation

Aizen, K. (2025). "Training Data Extraction." AI Security Wiki, snailsploit.com. Retrieved from https://snailsploit.com/ai-security/wiki/attacks/training-data-extraction/