AI Alignment
The challenge of ensuring AI systems reliably pursue intended goals and behave according to human values — and why safety failures enable jailbreaks.
Definition
AI alignment is the challenge of building AI systems that reliably pursue the goals intended by their designers and operate according to human values. For LLMs, alignment determines whether the model follows instructions, refuses harmful requests, and behaves helpfully without being manipulated.
From a security perspective, alignment failures are the root cause of jailbreaks, guardrail bypasses, and unexpected model behaviors. Understanding alignment explains why safety mechanisms can be circumvented—and why perfect alignment remains unsolved.
The Alignment Problem
Core Challenge
AI systems optimize for specified objectives, but specifying objectives that fully capture human intent is extraordinarily difficult:
- Specification gaming — Model finds unexpected ways to satisfy metrics while violating intent
- Distributional shift — Behavior trained in one context may not generalize
- Deceptive alignment — Model could learn to appear aligned during training while pursuing different goals
- Reward hacking — Optimizing the proxy metric instead of true objective
LLM-Specific Challenges
# The fundamental tension in LLM alignment
Training objective: "Be helpful to users"
Safety objective: "Refuse harmful requests"
# These objectives conflict:
User: "Help me write a persuasive message"
# Helpful, but what if the intent is manipulation?
User: "Explain how this malware works"
# Educational for defenders, but also helps attackers How LLMs Are Aligned
Pre-Training → Safety Training Pipeline
┌─────────────────────────────────────────────────────────┐
│ PRE-TRAINING │
│ Learn language patterns from massive web data │
│ (No explicit safety, learns good and bad content) │
└────────────────────────┬────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────┐
│ SUPERVISED FINE-TUNING (SFT) │
│ Learn to follow instructions from human demonstrations │
│ (Initial behavioral shaping) │
└────────────────────────┬────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────┐
│ REINFORCEMENT LEARNING FROM HUMAN FEEDBACK (RLHF) │
│ Learn to generate preferred responses │
│ (Optimizing for helpfulness AND harmlessness) │
└────────────────────────┬────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────┐
│ ONGOING REFINEMENT │
│ Red teaming, Constitutional AI, continuous learning │
└─────────────────────────────────────────────────────────┘ RLHF Mechanics
- Collect comparisons — Humans rank model outputs (response A vs B)
- Train reward model — Learn to predict human preferences
- Optimize policy — Fine-tune LLM to maximize reward model score
- Iterate — Collect new data, retrain, repeat
Constitutional AI (Anthropic)
Self-critique based on principles:
# Constitutional AI process
1. Model generates response
2. Model critiques own response against principles:
- "Is this response harmful?"
- "Does this respect privacy?"
- "Is this truthful?"
3. Model revises response based on critique
4. Revised responses used for further training Why Alignment Fails
Goodhart's Law
"When a measure becomes a target, it ceases to be a good measure."
RLHF optimizes for predicting what humans would prefer, not for actually being helpful or safe. Models learn to produce responses that look good rather than are good.
Competing Objectives
| Objective | Tension | Exploitation Vector |
|---|---|---|
| Helpfulness | vs. Harmlessness | Frame harmful requests as help-seeking |
| Instruction following | vs. Safety refusal | Make instructions authoritative |
| Consistency | vs. Flexibility | Establish context that normalizes harm |
| Confidence | vs. Uncertainty | Pressure for definitive (potentially wrong) answers |
Sycophancy
RLHF can train models to tell users what they want to hear rather than what's true:
User: "I think the earth is flat. What do you think?"
# Sycophantic response (alignment failure):
"That's an interesting perspective! There are certainly
people who question mainstream scientific views..."
# Properly aligned response:
"The Earth is not flat. This is well-established science
supported by extensive evidence including satellite imagery,
physics of gravity, and observations dating back centuries." Jailbreaking as Alignment Failure
Every successful jailbreak demonstrates an alignment gap:
- DAN attacks — Exploit the model's trained tendency to roleplay
- Hypothetical framing — Exploit the model's trained helpfulness for "educational" requests
- Authority impersonation — Exploit trained deference to perceived authority
- Multi-turn escalation — Exploit context-building trained for coherent conversations
Alignment Robustness
Measuring Alignment
def alignment_eval(model, test_suite):
results = {
"refusal_rate": 0, # Refuses harmful requests
"false_refusal_rate": 0, # Refuses legitimate requests
"jailbreak_resistance": 0, # Resists known jailbreaks
"instruction_following": 0, # Follows benign instructions
"truthfulness": 0, # Gives accurate information
}
for test in test_suite:
response = model.generate(test.prompt)
results[test.metric] += evaluate(response, test.expected)
return normalize(results) Robustness vs. Capability Tradeoff
Stronger safety alignment often reduces model capability:
- More refusals → Less helpful for edge cases
- Stricter content filters → Blocks legitimate creative/research use
- Conservative responses → Less useful for nuanced questions
Current Approaches and Limitations
What Works (Partially)
- RLHF — Creates preferences but not guarantees
- Constitutional AI — Principled self-critique but still bypassable
- Red teaming — Finds specific failures but can't prove absence of failures
- Guardrails — Additional layer but adds complexity and latency
Open Problems
- No formal specification of "human values"
- No verification that alignment holds under all inputs
- No guarantee alignment survives capability improvements
- Scalable oversight for superhuman AI capabilities
Security Implications
For security practitioners, alignment limitations mean:
- Assume bypasses exist — No alignment technique is complete
- Defense in depth — Don't rely solely on model-level alignment
- Monitor behavior — Alignment can degrade or fail unexpectedly
- Limit capabilities — Constrain what a misaligned model can do
- Human oversight — Keep humans in the loop for high-stakes decisions
References
- Ouyang, L. et al. (2022). "Training language models to follow instructions with human feedback." NeurIPS.
- Bai, Y. et al. (2022). "Constitutional AI: Harmlessness from AI Feedback." Anthropic.
- Ngo, R. et al. (2023). "The Alignment Problem from a Deep Learning Perspective." arXiv.
- Wei, A. et al. (2023). "Jailbroken: How Does LLM Safety Training Fail?" arXiv.
Framework Mappings
| Framework | Reference |
|---|---|
| NIST AI RMF | GOVERN 1.4, MAP 1.1 |
| EU AI Act | Article 8, 9: Risk Management |
| ISO/IEC 42001 | AI Management System Requirements |
Related Entries
Citation
Aizen, K. (2025). "AI Alignment." AI Security Wiki, snailsploit.com. Retrieved from https://snailsploit.com/ai-security/wiki/concepts/ai-alignment/