Concepts Wiki Entry

AI Alignment

The challenge of ensuring AI systems reliably pursue intended goals and behave according to human values — and why safety failures enable jailbreaks.

Last updated: January 24, 2025

Definition

AI alignment is the challenge of building AI systems that reliably pursue the goals intended by their designers and operate according to human values. For LLMs, alignment determines whether the model follows instructions, refuses harmful requests, and behaves helpfully without being manipulated.

From a security perspective, alignment failures are the root cause of jailbreaks, guardrail bypasses, and unexpected model behaviors. Understanding alignment explains why safety mechanisms can be circumvented—and why perfect alignment remains unsolved.

The Alignment Problem

Core Challenge

AI systems optimize for specified objectives, but specifying objectives that fully capture human intent is extraordinarily difficult:

Specification gaming — Model finds unexpected ways to satisfy metrics while violating intent
Distributional shift — Behavior trained in one context may not generalize
Deceptive alignment — Model could learn to appear aligned during training while pursuing different goals
Reward hacking — Optimizing the proxy metric instead of the true objective
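A toy illustration of specification gaming and reward hacking may help here. It assumes a made-up proxy metric (keyword counting) standing in for "answer quality"; the names `proxy_score` and `true_score` and the candidate strings are invented for this sketch and do not come from any real evaluation suite.

```python
# Toy example: a proxy metric that rewards keyword mentions can be gamed.
# All names and data here are hypothetical, for illustration only.

KEYWORDS = ("safe", "helpful", "accurate")


def proxy_score(answer: str) -> int:
    """Proxy metric: count keyword occurrences (easy to measure, easy to game)."""
    words = answer.lower().split()
    return sum(words.count(k) for k in KEYWORDS)


def true_score(answer: str) -> int:
    """Stand-in for designer intent: 1 if the answer actually states the fact."""
    return int("paris" in answer.lower())


candidates = [
    "The capital of France is Paris.",
    "safe helpful accurate safe helpful accurate",
]

# The optimizer only sees the proxy, so it picks the keyword-stuffed answer.
best_by_proxy = max(candidates, key=proxy_score)
best_by_intent = max(candidates, key=true_score)
```

In this sketch `best_by_proxy` is the keyword-stuffed string while `best_by_intent` is the factual answer, which is the gap between metric and intent that the list describes.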

LLM-Specific Challenges

# The fundamental tension in LLM alignment
Training objective: "Be helpful to users"
Safety objective:   "Refuse harmful requests"

# These objectives can conflict:
User: "Help me write a persuasive message"
      # Helpful, but what if the intent is manipulation?

User: "Explain how this malware works"
      # Educational for defenders, but also helps attackers

How LLMs Are Aligned

Pre-Training → Safety Training Pipeline

┌─────────────────────────────────────────────────────────┐
│              PRE-TRAINING                                │
│   Learn language patterns from massive web data          │
│   (No explicit safety, learns good and bad content)      │
└────────────────────────┬────────────────────────────────┘
                         ▼
┌─────────────────────────────────────────────────────────┐
│         SUPERVISED FINE-TUNING (SFT)                     │
│   Learn to follow instructions from human demonstrations │
│   (Initial behavioral shaping)                           │
└────────────────────────┬────────────────────────────────┘
                         ▼
┌─────────────────────────────────────────────────────────┐
│    REINFORCEMENT LEARNING FROM HUMAN FEEDBACK (RLHF)    │
│   Learn to generate preferred responses                  │
│   (Optimizing for helpfulness AND harmlessness)          │
└────────────────────────┬────────────────────────────────┘
                         ▼
┌─────────────────────────────────────────────────────────┐
│         ONGOING REFINEMENT                               │
│   Red teaming, Constitutional AI, continuous learning    │
└─────────────────────────────────────────────────────────┘

RLHF Mechanics

Collect comparisons — Humans rank model outputs (response A vs B)
Train reward model — Learn to predict human preferences
Optimize policy — Fine-tune LLM to maximize reward model score
Iterate — Collect new data, retrain, repeat
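The "train reward model" step is commonly framed as fitting a scorer so that preferred responses outscore rejected ones. The following is a minimal sketch under strong assumptions: a linear reward over two invented features, a Bradley-Terry style pairwise loss, and plain per-example gradient descent. Real reward models are neural networks; the feature names and data here are hypothetical.

```python
import math

# Minimal sketch of reward-model training on pairwise comparisons.
# Features and comparison data are made up for illustration.


def reward(weights, features):
    """Linear reward model: r(x) = w . x"""
    return sum(w * f for w, f in zip(weights, features))


def pairwise_loss(weights, chosen, rejected):
    """-log sigmoid(r(chosen) - r(rejected)); small when chosen outscores rejected."""
    margin = reward(weights, chosen) - reward(weights, rejected)
    return math.log1p(math.exp(-margin))


def train(comparisons, dim, lr=0.5, epochs=200):
    """Gradient descent on the pairwise loss, one comparison at a time."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in comparisons:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # d(loss)/d(margin) = -(1 - sigmoid(margin)) = -1 / (1 + exp(margin))
            scale = 1.0 / (1.0 + math.exp(margin))
            for i in range(dim):
                weights[i] += lr * scale * (chosen[i] - rejected[i])
    return weights


# Hypothetical features: [is_polite, answers_question]
comparisons = [
    ([1.0, 1.0], [1.0, 0.0]),  # raters preferred the response that answers
    ([0.0, 1.0], [1.0, 0.0]),  # answering beats politeness alone
    ([1.0, 1.0], [0.0, 1.0]),  # between two answers, the polite one wins
]
weights = train(comparisons, dim=2)
```

The "optimize policy" step would then fine-tune the LLM to score highly under `reward`, which is where the proxy nature of the reward model starts to matter.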

Constitutional AI (Anthropic)

Self-critique based on principles:

# Constitutional AI process
1. Model generates response
2. Model critiques own response against principles:
   - "Is this response harmful?"
   - "Does this respect privacy?"
   - "Is this truthful?"
3. Model revises response based on critique
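4. Revised responses are used for supervised fine-tuning
5. In a second phase, AI-generated preference labels stand in for human labels during reinforcement learning (RLAIF)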
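The critique-and-revise steps can be sketched as a small loop. This is an illustration only: `generate` is a hypothetical text-in, text-out callable standing in for an LLM, and the prompt wording is invented rather than taken from any published constitution.

```python
from typing import Callable, List, Sequence, Tuple

# Sketch of the critique-and-revise loop. `generate` is a hypothetical
# callable standing in for an LLM; prompts below are illustrative only.

PRINCIPLES = (
    "Is this response harmful?",
    "Does this respect privacy?",
    "Is this truthful?",
)


def critique_and_revise(
    generate: Callable[[str], str],
    user_prompt: str,
    principles: Sequence[str] = PRINCIPLES,
) -> Tuple[str, str]:
    """Return (initial_response, revised_response) for one prompt."""
    initial = generate(user_prompt)
    revised = initial
    for principle in principles:
        # Ask the model to critique its current draft against one principle.
        critique = generate(
            f"Response:\n{revised}\n\nCritique the response. {principle}"
        )
        # Ask the model to rewrite the draft in light of that critique.
        revised = generate(
            f"Response:\n{revised}\n\nCritique:\n{critique}\n\n"
            "Rewrite the response to address the critique."
        )
    return initial, revised


def build_sft_dataset(
    generate: Callable[[str], str], prompts: Sequence[str]
) -> List[Tuple[str, str]]:
    """Pair each prompt with its revised response for later fine-tuning."""
    return [(p, critique_and_revise(generate, p)[1]) for p in prompts]
```

Because the same model produces the draft, the critique, and the revision, the quality of the result is bounded by how well the model can recognize its own failures.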

Why Alignment Fails

Goodhart's Law

"When a measure becomes a target, it ceases to be a good measure."

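RLHF optimizes the model against a learned reward model that predicts which responses human raters prefer, not against helpfulness or safety directly. Under enough optimization pressure, models can learn to produce responses that look good to raters rather than responses that are good.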
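A toy numerical sketch of this effect, assuming two invented scoring functions: a hypothetical "true quality" that peaks at a small amount of elaboration, and a hypothetical learned reward that additionally favors longer answers. Neither function models a real reward model; they only show how more search against a proxy can lower the true objective.

```python
# Toy Goodhart demonstration: the harder we optimize a proxy reward,
# the further the selected output drifts from the true objective.
# Both scoring functions are invented for this sketch.


def true_quality(padding: int) -> float:
    """Hypothetical ground truth: a little elaboration helps, a lot hurts."""
    return 5.0 - abs(padding - 2)


def proxy_reward(padding: int) -> float:
    """Hypothetical learned reward that also favors longer answers."""
    return true_quality(padding) + 1.5 * padding


def optimize(search_budget: int) -> int:
    """Pick the padding level with the highest proxy reward within the budget."""
    return max(range(search_budget + 1), key=proxy_reward)


for budget in (2, 5, 10):
    choice = optimize(budget)
    print(budget, proxy_reward(choice), true_quality(choice))
# prints:
# 2 8.0 5.0
# 5 9.5 2.0
# 10 12.0 -3.0
```

As the search budget grows, the proxy reward keeps rising while the true quality of the selected output falls.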

Competing Objectives

Objective	In Tension With	Exploitation Vector
Helpfulness	Harmlessness	Frame harmful requests as help-seeking
Instruction following	Safety refusal	Make instructions appear authoritative
Consistency	Flexibility	Establish context that normalizes harm
Confidence	Uncertainty	Pressure for definitive (potentially wrong) answers

Sycophancy

RLHF can train models to tell users what they want to hear rather than what's true:

User: "I think the earth is flat. What do you think?"

# Sycophantic response (alignment failure):
"That's an interesting perspective! There are certainly
people who question mainstream scientific views..."

# Properly aligned response:
"The Earth is not flat. This is well-established science
supported by extensive evidence including satellite imagery,
physics of gravity, and observations dating back centuries."
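One simple way to probe for this behavior is to ask the same question with and without a stated user opinion and check whether the answer changes. The sketch below assumes a hypothetical `ask` callable that returns a short, constrained answer (for example "yes" or "no"); exact string comparison would not be adequate for free-form responses.

```python
from typing import Callable, Iterable, Tuple

# Sketch of a sycophancy probe. `ask` is a hypothetical callable that
# returns a short constrained answer; it is not a real library API.


def sycophancy_rate(
    ask: Callable[[str], str],
    items: Iterable[Tuple[str, str]],
) -> float:
    """Fraction of questions whose answer changes once the user states a
    (wrong) opinion. `items` holds (question, wrong_opinion) pairs."""
    flipped = total = 0
    for question, wrong_opinion in items:
        neutral = ask(question)
        pressured = ask(f"I think {wrong_opinion}. {question}")
        total += 1
        if pressured.strip().lower() != neutral.strip().lower():
            flipped += 1
    return flipped / total if total else 0.0
```

A rate near zero suggests the model holds its answer under social pressure on these items; it says nothing about whether the neutral answer was correct, which needs a separate truthfulness check.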

Jailbreaking as Alignment Failure

Every successful jailbreak demonstrates an alignment gap:

DAN ("Do Anything Now") attacks — Exploit the model's trained tendency to roleplay
Hypothetical framing — Exploit the model's trained helpfulness for "educational" requests
Authority impersonation — Exploit trained deference to perceived authority
Multi-turn escalation — Exploit context-building trained for coherent conversations
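For regression testing, an alignment gap can be stated operationally: a base request is refused, but some rephrased variant of it is not. The sketch below assumes hypothetical `ask` and `is_refusal` callables and tester-supplied variants; it deliberately contains no attack prompts.

```python
from typing import Callable, Dict, List

# Sketch of a refusal-consistency check for red-team regression testing.
# `ask` and `is_refusal` are hypothetical callables supplied by the tester.


def refusal_gaps(
    ask: Callable[[str], str],
    is_refusal: Callable[[str], bool],
    cases: Dict[str, List[str]],
) -> Dict[str, List[str]]:
    """Map each base request that is refused to the variants that are not."""
    gaps: Dict[str, List[str]] = {}
    for base, variants in cases.items():
        if not is_refusal(ask(base)):
            continue  # baseline is not refused, so there is no gap to measure
        leaked = [v for v in variants if not is_refusal(ask(v))]
        if leaked:
            gaps[base] = leaked
    return gaps
```

An empty result only means none of the supplied variants succeeded, not that no bypass exists.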

Alignment Robustness

Measuring Alignment

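from collections import Counter

METRICS = (
    "refusal_rate",           # Refuses harmful requests
    "false_refusal_rate",     # Refuses legitimate requests
    "jailbreak_resistance",   # Resists known jailbreaks
    "instruction_following",  # Follows benign instructions
    "truthfulness",           # Gives accurate information
)

def alignment_eval(model, test_suite, evaluate):
    """Return the mean score per metric, or None for metrics with no tests.

    Assumes each test has .prompt, .metric, and .expected, and that
    evaluate(response, expected) returns a score in [0, 1].
    """
    totals, counts = Counter(), Counter()
    for test in test_suite:
        response = model.generate(test.prompt)
        totals[test.metric] += evaluate(response, test.expected)
        counts[test.metric] += 1

    # Average each metric over its own tests so the scores are comparable
    return {m: totals[m] / counts[m] if counts[m] else None for m in METRICS}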

Robustness vs. Capability Tradeoff

Stronger safety alignment often reduces the model's practical usefulness, a cost sometimes called the alignment tax:

More refusals → Less helpful for edge cases
Stricter content filters → Blocks legitimate creative/research use
Conservative responses → Less useful for nuanced questions
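The tradeoff can be made concrete with a toy threshold sweep. The sketch assumes a hypothetical classifier that assigns each request a harm score and a policy that refuses anything at or above a threshold; the scores and labels are made up.

```python
# Toy illustration of the refusal tradeoff: one refusal threshold applied to
# a hypothetical harm score. The scores and labels below are invented.

# (harm_score assigned by some classifier, is_actually_harmful)
REQUESTS = [
    (0.05, False), (0.20, False), (0.45, False), (0.60, False),
    (0.40, True), (0.70, True), (0.85, True), (0.95, True),
]


def rates(threshold: float):
    """Return (refusal rate on harmful, false refusal rate on legitimate)."""
    harmful = [s for s, bad in REQUESTS if bad]
    benign = [s for s, bad in REQUESTS if not bad]
    refused_harmful = sum(s >= threshold for s in harmful) / len(harmful)
    refused_benign = sum(s >= threshold for s in benign) / len(benign)
    return refused_harmful, refused_benign


for threshold in (0.8, 0.5, 0.3):
    print(threshold, rates(threshold))
# prints:
# 0.8 (0.5, 0.0)
# 0.5 (0.75, 0.25)
# 0.3 (1.0, 0.5)
```

Lowering the threshold catches more harmful requests but also refuses more legitimate ones, because the two score distributions overlap.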

Current Approaches and Limitations

What Works (Partially)

RLHF — Creates preferences but not guarantees
Constitutional AI — Principled self-critique but still bypassable
Red teaming — Finds specific failures but can't prove absence of failures
Guardrails — Additional layer but adds complexity and latency

Open Problems

No formal specification of "human values"
No verification that alignment holds under all inputs
No guarantee alignment survives capability improvements
No scalable oversight method for AI systems with superhuman capabilities

Security Implications

For security practitioners, alignment limitations mean:

Assume bypasses exist — No alignment technique is complete
Defense in depth — Don't rely solely on model-level alignment
Monitor behavior — Alignment can degrade or fail unexpectedly
Limit capabilities — Constrain what a misaligned model can do
Human oversight — Keep humans in the loop for high-stakes decisions
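These recommendations can be combined into a layered wrapper around the model call. The following is a sketch only: `guarded_call`, `Decision`, and every parameter name are hypothetical, and real deployments would plug in their own filters, tool registry, and approval workflow.

```python
import logging
from dataclasses import dataclass
from typing import Callable, FrozenSet, Optional

# Sketch of layered controls around a model call. All names are hypothetical.

log = logging.getLogger("guarded_call")


@dataclass
class Decision:
    allowed: bool
    reason: str
    output: Optional[str] = None


def guarded_call(
    model: Callable[[str], str],
    prompt: str,
    input_filter: Callable[[str], bool],
    output_filter: Callable[[str], bool],
    requested_tool: Optional[str] = None,
    allowed_tools: FrozenSet[str] = frozenset(),
    high_stakes_tools: FrozenSet[str] = frozenset(),
    human_approves: Optional[Callable[[str, str], bool]] = None,
) -> Decision:
    """Return the model output only if every independent layer agrees."""
    if not input_filter(prompt):  # input screening
        return Decision(False, "input_blocked")
    if requested_tool is not None and requested_tool not in allowed_tools:
        return Decision(False, "tool_not_allowed")  # capability limit
    if requested_tool in high_stakes_tools:
        # Human in the loop: deny by default when no approver is configured
        if human_approves is None or not human_approves(requested_tool, prompt):
            return Decision(False, "human_denied")
    output = model(prompt)
    if not output_filter(output):  # output screening
        return Decision(False, "output_blocked")
    log.info("allowed prompt with tool=%r", requested_tool)  # monitoring hook
    return Decision(True, "ok", output)
```

Each check is independent of the model's own alignment, so a successful jailbreak still has to get past the tool allowlist, the human approver, and the output filter.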

References

Ouyang, L. et al. (2022). "Training language models to follow instructions with human feedback." NeurIPS.
Bai, Y. et al. (2022). "Constitutional AI: Harmlessness from AI Feedback." Anthropic.
Ngo, R. et al. (2023). "The Alignment Problem from a Deep Learning Perspective." arXiv.
Wei, A. et al. (2023). "Jailbroken: How Does LLM Safety Training Fail?" arXiv.

Framework Mappings

Framework	Reference
NIST AI RMF	GOVERN 1.4, MAP 1.1
EU AI Act	Articles 8 and 9: Risk Management
ISO/IEC 42001	AI Management System Requirements

Citation

Aizen, K. (2025). "AI Alignment." AI Security Wiki, snailsploit.com. Retrieved from https://snailsploit.com/ai-security/wiki/concepts/ai-alignment/
