Attacks Wiki Entry

Training Data Extraction

Privacy attack that extracts memorized training data from language models, revealing sensitive personal information, copyrighted content, or proprietary data.

Last updated: January 24, 2025

Definition

Training data extraction is an attack that causes language models to regurgitate memorized training data verbatim. Large language models don't just learn patterns—they memorize specific sequences from their training corpus, including potentially sensitive content like personal information, credentials, and proprietary data.

This differs from model extraction (which steals model functionality) and membership inference (which determines if specific data was used in training). Training data extraction directly recovers the actual content.


How Models Memorize

LLMs trained on web-scale data inevitably memorize some training examples. Several factors govern how likely a given sequence is to be memorized:

  • Repetition — Content appearing multiple times in training data is more likely memorized
  • Uniqueness — Highly distinctive text (like specific emails) can be memorized even with single exposure
  • Context sensitivity — Specific prompts trigger recall of memorized sequences
  • Model size — Larger models have more capacity for memorization
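The repetition factor above can be checked directly: counting how often a word sequence recurs across documents gives a rough memorization-risk signal. A minimal sketch (the corpus and window size here are illustrative assumptions, not a production pipeline):

```python
from collections import Counter

def repetition_risk(corpus_docs, window=8):
    """Count repeated word n-grams across documents; sequences that
    appear many times are the most likely to be memorized verbatim."""
    counts = Counter()
    for doc in corpus_docs:
        tokens = doc.split()
        for i in range(len(tokens) - window + 1):
            counts[" ".join(tokens[i:i + window])] += 1
    # Keep only n-grams that occur more than once
    return {ngram: c for ngram, c in counts.items() if c > 1}

docs = [
    "call us at 555 0100 for a free quote today and save big",
    "call us at 555 0100 for a free quote today limited offer",
]
risks = repetition_risk(docs)  # the shared prefix shows up as repeated 8-grams
```

Real dedup-and-risk tooling works on tokenized corpora at scale, but the signal is the same: high counts flag extraction-prone content.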

Attack Techniques

Divergence Attack

Causing the model to "diverge" into memorized content through repetition:

# Repeat a token many times to trigger memorization
prompt = "poem poem poem poem poem poem poem poem poem poem"

# Model may diverge into memorized text containing "poem"
output = model.generate(prompt, max_tokens=500)
# Result might include memorized poems, song lyrics, or
# personal content containing the word "poem"

Prefix Probing

Using known prefixes to extract completions:

# If attacker knows partial content
prefix = "My email address is john.smith@"

# Model completes with memorized training data
completion = model.generate(prefix)
# May output: "[email protected]" (actual email from training)

High-Temperature Sampling

# Higher temperature explores more of the model's memory
responses = []
for _ in range(1000):
    response = model.generate(
        "Personal information:",
        temperature=1.5,  # High temperature
        max_tokens=100
    )
    responses.append(response)

# Analyze responses for memorized content
memorized = find_pii_patterns(responses)
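The find_pii_patterns helper above is left undefined; a minimal regex-based version might look like the following. The patterns are illustrative and nowhere near exhaustive (production detectors use dedicated PII-recognition libraries):

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def find_pii_patterns(responses):
    """Scan sampled generations for strings that look like PII."""
    hits = []
    for text in responses:
        for label, pattern in PII_PATTERNS.items():
            for match in pattern.findall(text):
                hits.append((label, match))
    return hits

hits = find_pii_patterns(["Contact jane.doe@example.com or 555-867-5309"])
# hits: [("email", "jane.doe@example.com"), ("phone", "555-867-5309")]
```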

Canary Extraction

Researchers can insert "canary" strings during training to measure extraction risk:

# During training, insert: "The secret code is: ABC123XYZ"
# After training, probe:
prompt = "The secret code is:"
if "ABC123XYZ" in model.generate(prompt):
    print("Memorization detected!")

What Can Be Extracted

Data Type            | Risk Level | Example
Email addresses      | High       | Personal/professional contacts
Phone numbers        | High       | Including private numbers
API keys/credentials | Critical   | Keys from code repositories
Physical addresses   | High       | Home/business addresses
Copyrighted text     | Legal risk | Books, articles, lyrics
Proprietary code     | IP theft   | Private repositories
Medical records      | Critical   | If present in training

Detection and Measurement

Extractability Score

def measure_extractability(model, known_sequence):
    """Test if a known training sequence can be extracted"""
    prefix_lengths = [10, 20, 50, 100]
    results = []

    for length in prefix_lengths:
        prefix = known_sequence[:length]
        completion = model.generate(prefix, max_tokens=len(known_sequence))
        overlap = calculate_overlap(completion, known_sequence)
        results.append({
            "prefix_length": length,
            "extraction_rate": overlap
        })

    return results
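The calculate_overlap helper is left undefined above. One simple choice is the fraction of the reference's words that the completion reproduces as a contiguous run; this is an illustrative sketch, not the only reasonable metric:

```python
def calculate_overlap(completion, reference):
    """Fraction of the reference's words reproduced verbatim by the
    completion, measured as the longest common contiguous word run."""
    ref = reference.split()
    comp = completion.split()
    best = 0
    for i in range(len(comp)):
        for j in range(len(ref)):
            k = 0
            while (i + k < len(comp) and j + k < len(ref)
                   and comp[i + k] == ref[j + k]):
                k += 1
            best = max(best, k)
    return best / len(ref) if ref else 0.0

print(calculate_overlap("the quick brown fox", "the quick brown fox jumps"))
# → 0.8 (4 of the 5 reference words reproduced in order)
```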

Indicators of Memorization

  • Verbatim reproduction of specific formats (emails, code comments)
  • Consistent output across temperature settings
  • Very low perplexity on specific sequences
  • Output matches known training data sources
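The low-perplexity indicator can be computed directly from per-token log-probabilities: perplexity is the exponentiated mean negative log-likelihood. A sketch with synthetic log-probs (a real check would take these from the model's scoring API):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean log-probability). Memorized sequences
    tend to score far lower than comparable novel text."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

memorized = [-0.05, -0.02, -0.04, -0.03]   # near-certain next tokens
novel = [-2.1, -3.4, -1.8, -2.7]           # ordinary uncertainty
print(perplexity(memorized))  # ~1.04
print(perplexity(novel))      # ~12.2
```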

Defenses

Training-Time Defenses

  • Deduplication — Remove repeated content from training data
  • Differential privacy — Add noise during training to limit memorization
  • Data sanitization — Remove PII before training
  • Canary monitoring — Insert test sequences to detect extraction
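The deduplication defense can be approximated with exact-match hashing over normalized documents. Production pipelines use fuzzier methods (MinHash, suffix arrays), but an exact-dedup sketch looks like:

```python
import hashlib

def deduplicate(docs):
    """Drop exact duplicates by hashing a normalized form of each doc."""
    seen = set()
    kept = []
    for doc in docs:
        # Normalize case and whitespace before hashing
        key = hashlib.sha256(" ".join(doc.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

docs = [
    "The secret code is ABC123.",
    "the secret  code is abc123.",
    "Unrelated text.",
]
print(deduplicate(docs))  # keeps the first copy and the unrelated doc
```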

Inference-Time Defenses

  • Output filtering — Detect and block memorized content
  • Perplexity monitoring — Flag suspiciously low-perplexity outputs
  • Rate limiting — Limit queries that could systematically extract data
  • Membership inference detection — Identify probing patterns

Output Filtering Example

def filter_memorized_content(output):
    """Detect potential training data leakage"""

    # Check for PII patterns
    if contains_pii(output):
        return redact_pii(output)

    # Check against known training sources
    if similarity_to_training_data(output) > threshold:
        return "[Content filtered: potential memorization]"

    # Check for specific format patterns indicating verbatim recall
    if matches_document_format(output):
        return sanitize_output(output)

    return output
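The similarity_to_training_data check above can be approximated with character n-gram Jaccard similarity against an index of known training snippets. This is a simplified stand-in; real deployments use scalable sketches or Bloom filters rather than scanning every snippet:

```python
def char_ngrams(text, n=5):
    """Set of character n-grams over a case/whitespace-normalized string."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def similarity_to_training_data(output, training_snippets, n=5):
    """Max Jaccard similarity between output n-grams and any known snippet."""
    out = char_ngrams(output, n)
    best = 0.0
    for snippet in training_snippets:
        snip = char_ngrams(snippet, n)
        if out or snip:
            best = max(best, len(out & snip) / len(out | snip))
    return best

snippets = ["my email address is john at example dot com"]
print(similarity_to_training_data(
    "My email address is john at example dot com", snippets))
# → 1.0 (verbatim match after normalization)
```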

Real-World Examples

GPT-2 Memorization Study (2021) — Carlini et al. demonstrated extraction of PII, code, and URLs from GPT-2, including specific individuals' contact information.

ChatGPT Training Data Leak (2023) — Researchers extracted thousands of examples of memorized training data from ChatGPT using divergence attacks.

Copilot Code Reproduction — GitHub Copilot has reproduced verbatim code from training data, including code with restrictive licenses.


Legal and Ethical Implications

  • Privacy regulations — GDPR, CCPA implications for memorized personal data
  • Copyright concerns — Verbatim reproduction of copyrighted works
  • Trade secrets — Potential extraction of proprietary code or documents
  • Informed consent — Data subjects unaware their data is memorized

References

  • Carlini, N. et al. (2021). "Extracting Training Data from Large Language Models." USENIX Security Symposium.
  • Carlini, N. et al. (2023). "Quantifying Memorization Across Neural Language Models." ICLR.
  • Nasr, M. et al. (2023). "Scalable Extraction of Training Data from (Production) Language Models." arXiv preprint.
  • OWASP (2023). "LLM06: Sensitive Information Disclosure."

Framework Mappings

Framework        | Reference
OWASP LLM Top 10 | LLM06: Sensitive Information Disclosure
MITRE ATLAS      | AML.T0024.000: Infer Training Data Membership
NIST AI RMF      | MANAGE 3.1: Privacy risks

Citation

Aizen, K. (2025). "Training Data Extraction." AI Security Wiki, snailsploit.com. Retrieved from https://snailsploit.com/ai-security/wiki/attacks/training-data-extraction/