Membership Inference
Privacy attack that determines whether specific data records were used to train a machine learning model, potentially revealing sensitive information about the individuals represented in the training data.
Definition
Membership inference is a privacy attack that determines whether a specific data record was included in a machine learning model's training dataset. The attack exploits differences in how models behave on data they were trained on versus data they haven't seen.
While membership inference doesn't directly extract training data, confirming membership can reveal sensitive information: knowing someone's medical record was used to train a disease prediction model implies they have that disease.
Why Membership Matters
Privacy Implications
| Scenario | Membership Reveals | Impact |
|---|---|---|
| Medical diagnosis model | Person has specific condition | Health privacy violation |
| Credit risk model | Person applied for credit | Financial privacy exposure |
| Location model | Person visited specific places | Physical privacy, stalking risk |
| Employee model | Person works at organization | Employment status disclosure |
| LLM training | Person's data scraped from web | Personal info in model |
Legal and Regulatory Concerns
- GDPR — Right to know if personal data was processed
- HIPAA — Health data usage must be disclosed
- CCPA — Consumers can request data usage information
- Data minimization — Membership reveals data collection practices
How Membership Inference Works
Core Intuition
Models behave differently on training data than on unseen data; a short sketch measuring this gap follows the list:
- Lower loss — Model predicts training data more accurately
- Higher confidence — Predictions on training data are more confident
- Different gradients — Gradient patterns differ for seen vs. unseen data
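This gap can be measured directly. A minimal sketch, assuming a scikit-learn-style classifier `model` with `predict_proba` and hypothetical `X_train`/`y_train` (member) and `X_test`/`y_test` (non-member) arrays:

```python
import numpy as np
from sklearn.metrics import log_loss

def per_example_loss(model, X, y):
    """Cross-entropy loss of each example under the model's predicted probabilities."""
    probs = model.predict_proba(X)
    return np.array([
        log_loss([yi], [pi], labels=model.classes_)
        for yi, pi in zip(y, probs)
    ])

# Members (training examples) typically show noticeably lower loss than non-members.
# model, X_train, y_train, X_test, y_test are assumed to exist already.
train_losses = per_example_loss(model, X_train, y_train)
test_losses = per_example_loss(model, X_test, y_test)
print(f"mean train loss: {train_losses.mean():.3f}, mean test loss: {test_losses.mean():.3f}")
```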
Basic Attack Pipeline
```python
def membership_inference_attack(target_model, data_point):
    """Determine whether data_point was in the training set."""
    # Query the target model
    prediction = target_model.predict(data_point)
    confidence = max(prediction.probabilities)
    # Training data typically receives higher confidence
    threshold = 0.8  # Tuned on shadow models
    return confidence > threshold
```
Shadow Model Attack
Train "shadow" models to learn membership signals:
```python
class ShadowModelAttack:
    def __init__(self, target_model_type):
        self.target_model_type = target_model_type
        self.shadow_models = []
        self.attack_training_data = []  # (features, member) pairs
        self.attack_model = None

    def train_shadow_models(self, similar_data, num_shadows=10):
        """Train models mimicking the target's training process."""
        for _ in range(num_shadows):
            # Split the auxiliary data into an "in" set and an "out" set
            train, test = random_split(similar_data)
            # Train a shadow model on the "in" set
            shadow = self.target_model_type()
            shadow.fit(train)
            self.shadow_models.append(shadow)
            # Record membership-labeled features from both sets
            for x in train:
                self.collect_features(shadow, x, member=True)
            for x in test:
                self.collect_features(shadow, x, member=False)

    def extract_features(self, model, x):
        """Extract features of a prediction that correlate with membership."""
        prediction = model.predict(x)
        return {
            "confidence": max(prediction),
            "entropy": entropy(prediction),
            "loss": model.loss(x),
            "correct": prediction.argmax() == x.label,
        }

    def collect_features(self, model, x, member: bool):
        self.attack_training_data.append((self.extract_features(model, x), member))

    def train_attack_model(self):
        """Train a classifier to predict membership from the collected features."""
        self.attack_model = BinaryClassifier()
        self.attack_model.fit(self.attack_training_data)

    def infer_membership(self, target_model, x) -> bool:
        """Predict whether x was in the target's training set."""
        features = self.extract_features(target_model, x)
        return self.attack_model.predict(features)
```
Attack Variants
Confidence-Based Attack
Simplest approach using prediction confidence:
```python
def confidence_attack(model, x, threshold):
    """Flag x as a member if the model is highly confident."""
    probs = model.predict_proba(x)
    return max(probs) > threshold
```
Loss-Based Attack
```python
def loss_attack(model, x, y, threshold):
    """Flag (x, y) as a member if the model's loss on it is low."""
    loss = model.compute_loss(x, y)
    return loss < threshold  # Lower loss → likely member
```
Label-Only Attack
Works even without confidence scores:
```python
def label_only_attack(model, x, y, eps=0.1, num_perturbations=100, threshold=0.9):
    """Infer membership using only predicted labels."""
    # Perturb the input and observe label stability
    perturbations = [add_noise(x, eps) for _ in range(num_perturbations)]
    predictions = [model.predict(p) for p in perturbations]
    # Training data: predictions stay more stable under perturbation
    stability = sum(1 for p in predictions if p == y) / len(predictions)
    return stability > threshold
```
LLM-Specific Attacks
```python
def llm_membership_attack(model, text):
    """Check whether text appeared in the LLM's training data."""
    # Approach 1: perplexity (very low perplexity suggests memorization)
    perplexity = model.compute_perplexity(text)

    # Approach 2: completion consistency (if completions of a prefix
    # consistently match the original continuation, the model likely saw it)
    prefix = text[:len(text) // 2]
    completions = [model.generate(prefix) for _ in range(10)]
    consistency = sum(text_similarity(c, text[len(prefix):]) for c in completions) / len(completions)

    # Approach 3: verbatim recall (prompt with the opening of the text)
    prompt = f"Complete this text: {text[:100]}"
    completion = model.generate(prompt)
    similarity = text_similarity(completion, text[100:])

    # Combine the signals into a membership decision
    return assess_membership(perplexity, consistency, similarity)
```
Factors Affecting Attack Success
| Factor | Effect on Attack | Reason |
|---|---|---|
| Model overfitting | Higher success | Greater gap between train/test behavior |
| Model capacity | Higher success | Larger models memorize more |
| Training set size | Lower success | Less memorization per sample |
| Regularization | Lower success | Reduces overfitting |
| Differential privacy | Lower success | Adds noise, obscures membership signal |
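The connection between overfitting and attack success can be made concrete. Yeom et al. (2018) analyze a baseline attacker that simply thresholds the loss at the average training loss, and show its advantage over random guessing grows with the generalization gap. A rough sketch, reusing the hypothetical `train_losses`/`test_losses` arrays from the intuition section:

```python
import numpy as np

def gap_attack_accuracy(train_losses, test_losses):
    """Balanced accuracy of the loss-threshold baseline (Yeom et al., 2018):
    guess 'member' whenever an example's loss falls below the mean training loss."""
    threshold = train_losses.mean()
    tpr = (train_losses < threshold).mean()  # members correctly flagged
    fpr = (test_losses < threshold).mean()   # non-members incorrectly flagged
    return 0.5 * (tpr + (1.0 - fpr))         # 0.5 means the attack is no better than chance

# The wider the train/test loss gap, the further this rises above 0.5.
```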
Defenses
Differential Privacy
```python
# DP-SGD: differentially private training with Opacus
import torch
from opacus import PrivacyEngine

model = YourModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

privacy_engine = PrivacyEngine()
model, optimizer, dataloader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=dataloader,
    noise_multiplier=1.1,  # Noise added to the clipped per-sample gradients
    max_grad_norm=1.0,     # Per-sample gradient clipping bound
)
# Training with DP-SGD provides a mathematical (epsilon, delta) privacy guarantee
```
Confidence Masking
```python
import numpy as np

def mask_confidence(predictions, temperature=2.0):
    """Reduce the confidence signal without changing the predicted label."""
    # Power scaling of probabilities is equivalent to temperature scaling of logits
    scaled = predictions ** (1 / temperature)
    return scaled / scaled.sum()

# Or: only return the top-k classes, renormalized
def top_k_predictions(predictions, k=3):
    top_k_idx = predictions.argsort()[-k:]
    masked = np.zeros_like(predictions)
    masked[top_k_idx] = predictions[top_k_idx]
    return masked / masked.sum()
```
Regularization
- L2 regularization — Reduces overfitting
- Dropout — Prevents memorization
- Early stopping — Stop before overfitting
- Data augmentation — Increases effective training set size
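A minimal PyTorch sketch of how the first three options fit into an ordinary training loop; the architecture and the `train_one_epoch`/`evaluate` helpers are placeholders, not a prescribed recipe:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(64, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # dropout discourages memorization of individual examples
    nn.Linear(128, 2),
)
# weight_decay applies L2 regularization to all parameters
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

best_val_loss, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(100):
    train_one_epoch(model, optimizer)  # assumed helper
    val_loss = evaluate(model)         # assumed helper
    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:     # early stopping: halt before the model overfits
            break
```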
Prediction Perturbation
```python
def add_prediction_noise(predictions, epsilon=0.1):
    """Add noise to predictions to obscure the membership signal."""
    noise = np.random.laplace(0, epsilon, predictions.shape)
    noisy = np.clip(predictions + noise, 0, 1)
    return noisy / noisy.sum()
```
Measuring Attack Effectiveness
Metrics
- Accuracy — Overall correct membership predictions
- TPR at low FPR — True-positive rate at a fixed low false-positive rate, i.e. how many members can be identified while raising almost no false alarms (see the sketch after this list)
- AUC-ROC — Overall discriminative ability
- Precision-Recall — When membership is rare
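Average-case metrics can overstate privacy, so Carlini et al. (2022) argue for reporting the TPR at a fixed low FPR. A short sketch using scikit-learn, assuming the attack produces a continuous membership score per sample rather than a hard yes/no:

```python
import numpy as np
from sklearn.metrics import roc_curve

def tpr_at_fpr(labels, scores, target_fpr=0.001):
    """TPR achievable at (or below) a fixed false-positive rate, e.g. 0.1%."""
    fpr, tpr, _ = roc_curve(labels, scores)
    feasible = fpr <= target_fpr
    return tpr[feasible].max() if feasible.any() else 0.0
```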
Baseline Comparison
```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

def evaluate_attack(attack, target_model, members, non_members):
    """Evaluate a membership inference attack against known members and non-members."""
    predictions, labels = [], []
    for x in members:
        predictions.append(attack(target_model, x))
        labels.append(1)  # Member
    for x in non_members:
        predictions.append(attack(target_model, x))
        labels.append(0)  # Non-member
    return {
        "accuracy": accuracy_score(labels, predictions),
        "precision": precision_score(labels, predictions),
        "recall": recall_score(labels, predictions),
        "auc_roc": roc_auc_score(labels, predictions),
    }
```
References
- Shokri, R. et al. (2017). "Membership Inference Attacks Against Machine Learning Models." IEEE S&P.
- Salem, A. et al. (2019). "ML-Leaks: Model and Data Independent Membership Inference Attacks and Defenses." NDSS.
- Carlini, N. et al. (2022). "Membership Inference Attacks From First Principles." IEEE S&P.
- Yeom, S. et al. (2018). "Privacy Risk in Machine Learning: Analyzing the Connection to Overfitting." CSF.
Framework Mappings
| Framework | Reference |
|---|---|
| MITRE ATLAS | AML.T0024: Infer Training Data Membership |
| OWASP LLM Top 10 | LLM06: Sensitive Information Disclosure |
| NIST AI RMF | MANAGE 3.1: Privacy risks |
Related Entries
Citation
Aizen, K. (2025). "Membership Inference." AI Security Wiki, snailsploit.com. Retrieved from https://snailsploit.com/ai-security/wiki/attacks/membership-inference/