
Membership Inference

A privacy attack that determines whether specific data records were used to train a machine learning model, potentially revealing sensitive information about the individuals behind those records.

Last updated: January 24, 2025

Definition

Membership inference is a privacy attack that determines whether a specific data record was included in a machine learning model's training dataset. The attack exploits differences in how models behave on data they were trained on versus data they haven't seen.

While membership inference doesn't directly extract training data, confirming membership can reveal sensitive information: learning that someone's medical record was used to train a disease-prediction model strongly suggests they have that disease.


Why Membership Matters

Privacy Implications

| Scenario | Membership Reveals | Impact |
| --- | --- | --- |
| Medical diagnosis model | Person has a specific condition | Health privacy violation |
| Credit risk model | Person applied for credit | Financial privacy exposure |
| Location model | Person visited specific places | Physical privacy, stalking risk |
| Employee model | Person works at the organization | Employment status disclosure |
| LLM training | Person's data was scraped from the web | Personal info embedded in the model |

Legal and Regulatory Concerns

  • GDPR — Right to know if personal data was processed
  • HIPAA — Health data usage must be disclosed
  • CCPA — Consumers can request data usage information
  • Data minimization — Membership reveals data collection practices

How Membership Inference Works

Core Intuition

Models behave differently on training data vs. unseen data:

  • Lower loss — Model predicts training data more accurately
  • Higher confidence — Predictions on training data are more confident
  • Different gradients — Gradient patterns differ for seen vs. unseen data

Basic Attack Pipeline

def membership_inference_attack(target_model, data_point, threshold=0.8):
    """Determine whether data_point was in the training set"""

    # Query the target model for class probabilities
    probs = target_model.predict_proba(data_point)
    confidence = max(probs)

    # Training data typically elicits higher confidence;
    # the threshold is tuned on shadow models (see below)
    return confidence > threshold

Shadow Model Attack

Train "shadow" models to learn membership signals:

class ShadowModelAttack:
    def __init__(self, target_model_type):
        self.target_model_type = target_model_type
        self.shadow_models = []
        self.training_data = []  # (features, member) pairs
        self.attack_model = None

    def train_shadow_models(self, similar_data, num_shadows=10):
        """Train models mimicking the target's training process"""
        for _ in range(num_shadows):
            # Split so each shadow has known members and non-members
            train, test = random_split(similar_data)

            # Train a shadow model with the target's architecture
            shadow = self.target_model_type()
            shadow.fit(train)
            self.shadow_models.append(shadow)

            # Collect labeled membership examples
            for x in train:
                self.collect_features(shadow, x, member=True)
            for x in test:
                self.collect_features(shadow, x, member=False)

    def collect_features(self, model, x, member: bool):
        """Record features that correlate with membership"""
        features = self.extract_features(model, x)
        self.training_data.append((features, member))

    def extract_features(self, model, x):
        """Extract membership-correlated features from a prediction"""
        prediction = model.predict(x)
        return {
            "confidence": max(prediction),
            "entropy": entropy(prediction),
            "loss": model.loss(x),
            "correct": prediction.argmax() == x.label,
        }

    def train_attack_model(self):
        """Train a classifier to predict membership from features"""
        self.attack_model = BinaryClassifier()
        self.attack_model.fit(self.training_data)

    def infer_membership(self, target_model, x) -> bool:
        """Predict whether x was in the target's training set"""
        features = self.extract_features(target_model, x)
        return self.attack_model.predict(features)
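
A sketch of how the pieces fit together; SimpleCNN, public_data, target_model, and suspect_record are hypothetical placeholders standing in for the target's architecture, a similar public dataset, the model under attack, and the record under test:

attack = ShadowModelAttack(target_model_type=SimpleCNN)
attack.train_shadow_models(public_data, num_shadows=10)
attack.train_attack_model()

# Does the target's training set contain this record?
is_member = attack.infer_membership(target_model, suspect_record)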

Attack Variants

Confidence-Based Attack

Simplest approach using prediction confidence:

def confidence_attack(model, x, threshold):
    """Member if model is highly confident"""
    probs = model.predict_proba(x)
    return max(probs) > threshold

Loss-Based Attack

def loss_attack(model, x, y, threshold):
    """Member if model has low loss on sample"""
    loss = model.compute_loss(x, y)
    return loss < threshold  # Lower loss → likely member

Label-Only Attack

Works even without confidence scores:

def label_only_attack(model, x, y, eps=0.1, threshold=0.9, n_trials=100):
    """Infer membership using only predicted labels"""
    # Perturb the input and observe label stability
    perturbations = [add_noise(x, eps) for _ in range(n_trials)]
    predictions = [model.predict(p) for p in perturbations]

    # Training data: predictions stay more stable under perturbation
    stability = sum(1 for p in predictions if p == y) / len(predictions)
    return stability > threshold

LLM-Specific Attacks

def llm_membership_attack(model, text):
    """Check whether text appeared in an LLM's training data"""

    # Approach 1: Perplexity — very low perplexity suggests memorization
    perplexity = model.compute_perplexity(text)

    # Approach 2: Completion consistency — completions of a prefix that
    # repeatedly match the true continuation suggest memorization
    prefix, suffix = text[:len(text)//2], text[len(text)//2:]
    completions = [model.generate(prefix) for _ in range(10)]
    consistency = sum(text_similarity(c, suffix) for c in completions) / len(completions)

    # Approach 3: Verbatim recall — high similarity suggests training data
    prompt = f"Complete this text: {text[:100]}"
    completion = model.generate(prompt)
    similarity = text_similarity(completion, text[100:])

    # Combine signals (assess_membership is an application-specific helper)
    return assess_membership(perplexity, consistency, similarity)

Factors Affecting Attack Success

| Factor | Effect on Attack | Reason |
| --- | --- | --- |
| Model overfitting | Higher success | Greater gap between train/test behavior |
| Model capacity | Higher success | Larger models memorize more |
| Training set size | Lower success | Less memorization per sample |
| Regularization | Lower success | Reduces overfitting |
| Differential privacy | Lower success | Adds noise, obscures membership signal |
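
The train/test gap can be measured directly to estimate how exposed a model is. A minimal sketch, assuming the model exposes the compute_loss method used in the loss-based attack above and that labeled train and held-out splits are available:

import numpy as np

def membership_signal_strength(model, train_data, test_data):
    """Estimate the loss gap an attacker could exploit"""
    train_losses = np.array([model.compute_loss(x, y) for x, y in train_data])
    test_losses = np.array([model.compute_loss(x, y) for x, y in test_data])

    # The wider the gap, the easier a threshold attack becomes
    gap = test_losses.mean() - train_losses.mean()

    # Balanced accuracy of the best single-threshold loss attack
    best_acc = max(
        ((train_losses < t).mean() + (test_losses >= t).mean()) / 2
        for t in np.concatenate([train_losses, test_losses])
    )
    return gap, best_acc

A balanced accuracy near 0.5 means the member and non-member loss distributions overlap and a simple threshold attack gains little; values well above 0.5 indicate a strong membership signal.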

Defenses

Differential Privacy

# DP-SGD: differentially private training with Opacus
import torch
from opacus import PrivacyEngine

model = YourModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
dataloader = ...  # your training DataLoader

privacy_engine = PrivacyEngine()
model, optimizer, dataloader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=dataloader,
    noise_multiplier=1.1,  # More noise → stronger privacy, lower utility
    max_grad_norm=1.0,     # Per-sample gradient clipping bound
)

# Training with DP-SGD provides mathematical privacy guarantees
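
After training, the spent privacy budget can be queried from the engine (a sketch assuming Opacus 1.x; the delta value is illustrative):

epsilon = privacy_engine.get_epsilon(delta=1e-5)
print(f"Model trained with (epsilon={epsilon:.2f}, delta=1e-5)-DP")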

Confidence Masking

import numpy as np

def mask_confidence(predictions, temperature=2.0):
    """Reduce the confidence signal without changing the top prediction"""
    # Raising probabilities to 1/T is equivalent to softmax temperature
    # scaling applied to the underlying logits
    scaled = predictions ** (1 / temperature)
    return scaled / scaled.sum()

# Or: only return the top-k predictions
def top_k_predictions(predictions, k=3):
    top_k_idx = predictions.argsort()[-k:]
    masked = np.zeros_like(predictions)
    masked[top_k_idx] = predictions[top_k_idx]
    return masked / masked.sum()

Regularization

  • L2 regularization — Reduces overfitting
  • Dropout — Prevents memorization
  • Early stopping — Stop before overfitting
  • Data augmentation — Increases effective training set size
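
A minimal PyTorch sketch combining these techniques (layer sizes, dropout rate, weight decay, and patience are illustrative; train_one_epoch and evaluate are assumed helpers):

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # Dropout: discourages memorizing individual samples
    nn.Linear(64, 10),
)

# weight_decay adds L2 regularization to each update
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# Early stopping: halt when validation loss stops improving
best_val_loss, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(100):
    train_one_epoch(model, optimizer)  # assumed training helper
    val_loss = evaluate(model)         # assumed validation helper
    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break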

Prediction Perturbation

def add_prediction_noise(predictions, scale=0.1):
    """Add Laplace noise to predictions to obscure the membership signal"""
    noise = np.random.laplace(0, scale, predictions.shape)
    noisy = np.clip(predictions + noise, 0, 1)
    return noisy / noisy.sum()

Measuring Attack Effectiveness

Metrics

  • Accuracy — Overall correct membership predictions
  • TPR at low FPR — Identifying members without false positives
  • AUC-ROC — Overall discriminative ability
  • Precision-Recall — When membership is rare
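
Carlini et al. (2022) argue that average-case metrics overstate privacy risk and that attacks should be reported as TPR at very low FPR. A sketch using scikit-learn, assuming the attack emits continuous scores rather than hard decisions:

from sklearn.metrics import roc_curve

def tpr_at_fpr(labels, scores, target_fpr=0.001):
    """Highest TPR achievable while keeping FPR at or below target_fpr"""
    fpr, tpr, _ = roc_curve(labels, scores)
    mask = fpr <= target_fpr
    return tpr[mask].max() if mask.any() else 0.0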

Baseline Comparison

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, roc_auc_score
)

def evaluate_attack(attack, target_model, members, non_members):
    """Evaluate a membership inference attack on known members/non-members"""
    predictions, labels = [], []

    for x in members:
        predictions.append(attack(target_model, x))
        labels.append(1)  # Member

    for x in non_members:
        predictions.append(attack(target_model, x))
        labels.append(0)  # Non-member

    # Note: AUC-ROC is only informative if `attack` returns scores
    # rather than hard 0/1 decisions
    return {
        "accuracy": accuracy_score(labels, predictions),
        "precision": precision_score(labels, predictions),
        "recall": recall_score(labels, predictions),
        "auc_roc": roc_auc_score(labels, predictions),
    }

References

  • Shokri, R. et al. (2017). "Membership Inference Attacks Against Machine Learning Models." IEEE S&P.
  • Salem, A. et al. (2019). "ML-Leaks: Model and Data Independent Membership Inference Attacks and Defenses." NDSS.
  • Carlini, N. et al. (2022). "Membership Inference Attacks From First Principles." IEEE S&P.
  • Yeom, S. et al. (2018). "Privacy Risk in Machine Learning: Analyzing the Connection to Overfitting." CSF.

Framework Mappings

| Framework | Reference |
| --- | --- |
| MITRE ATLAS | AML.T0024: Infer Training Data Membership |
| OWASP LLM Top 10 | LLM06: Sensitive Information Disclosure |
| NIST AI RMF | MANAGE 3.1: Privacy risks |

Citation

Aizen, K. (2025). "Membership Inference." AI Security Wiki, snailsploit.com. Retrieved from https://snailsploit.com/ai-security/wiki/attacks/membership-inference/