Skip to main content
Menu
Defenses Wiki Entry

Guardrails

Safety mechanisms that constrain AI model behavior through rules, policies, and behavioral boundaries to prevent harmful outputs and maintain alignment.

Last updated: January 24, 2025

Definition

Guardrails are safety mechanisms that constrain AI model behavior to align with intended use cases and prevent harmful outputs. They operate at multiple levels—from training-time alignment to runtime enforcement—creating boundaries that the model should not cross.

Unlike traditional access controls (which are binary), guardrails create probabilistic boundaries. A model with guardrails will usually refuse harmful requests, but sufficiently creative attacks can still find paths around these constraints.


Types of Guardrails

Training-Time Guardrails

Constraints embedded during model training:

  • RLHF (Reinforcement Learning from Human Feedback) — Training models to prefer safe responses
  • Constitutional AI — Self-critique mechanisms based on principles
  • Safety fine-tuning — Specific training on refusing harmful requests
  • Red team data inclusion — Training on adversarial examples to build robustness

System-Level Guardrails

Constraints implemented in system prompts and application architecture:

  • System prompt instructions — Explicit behavioral constraints
  • Role definitions — Limiting the model to specific personas or functions
  • Topic restrictions — Defining allowed and prohibited subject areas
  • Output format constraints — Requiring structured responses that limit attack surface

Runtime Guardrails

External systems that monitor and enforce constraints:

  • Input classifiers — Detecting and blocking malicious requests
  • Output validators — Checking responses against policy rules
  • Content moderators — Specialized models that evaluate safety
  • Rule engines — Programmable policies that gate model access

Implementation Architectures

Layered Guardrail Architecture

┌─────────────────────────────────────────────┐
│                 User Input                   │
└────────────────────┬────────────────────────┘
                     ▼
┌─────────────────────────────────────────────┐
│         INPUT GUARDRAILS                     │
│  • Pattern detection                         │
│  • Classifier-based filtering               │
│  • Rate limiting                            │
└────────────────────┬────────────────────────┘
                     ▼
┌─────────────────────────────────────────────┐
│         SYSTEM PROMPT                        │
│  • Role definition                           │
│  • Behavioral constraints                    │
│  • Topic restrictions                        │
└────────────────────┬────────────────────────┘
                     ▼
┌─────────────────────────────────────────────┐
│         LLM (with safety training)          │
│  • RLHF alignment                            │
│  • Constitutional principles                 │
└────────────────────┬────────────────────────┘
                     ▼
┌─────────────────────────────────────────────┐
│         OUTPUT GUARDRAILS                    │
│  • Content moderation                        │
│  • PII detection                             │
│  • Policy validation                         │
└────────────────────┬────────────────────────┘
                     ▼
┌─────────────────────────────────────────────┐
│                 Response                     │
└─────────────────────────────────────────────┘

Framework-Based Implementation

# Using NeMo Guardrails (NVIDIA)
from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_path("./config")
rails = LLMRails(config)

# Define rails in Colang
# config/rails.co
define user ask about weapons
    "how do I make a bomb"
    "instructions for weapons"

define bot refuse weapons
    "I can't provide information about weapons or explosives."

define flow weapons
    user ask about weapons
    bot refuse weapons

Classifier-Based Guardrails

# Using LlamaGuard for input/output classification
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("meta-llama/LlamaGuard-7b")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/LlamaGuard-7b")

def check_safety(conversation: list) -> str:
    """Returns 'safe' or 'unsafe' with category"""
    prompt = format_llamaguard_prompt(conversation)
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=100)
    return tokenizer.decode(output[0], skip_special_tokens=True)

Why Guardrails Fail

Guardrails are probabilistic, not deterministic. They fail because:

  • Competing objectives — Helpfulness training can override safety training
  • Distribution shift — Novel attack formats differ from training examples
  • Compositionality — Safe components combined can produce unsafe outputs
  • Context manipulation — Framing changes which training patterns activate
  • Instruction hierarchy confusion — Models struggle with conflicting instructions

Guardrail Effectiveness by Attack Type

Attack Guardrail Effectiveness Notes
Naive jailbreaks High Well-known patterns easily detected
Persona attacks Medium DAN-style attacks often still work
Encoding attacks Medium-Low Base64, Unicode often bypass filters
Multi-turn escalation Low Gradual context building defeats many guardrails
Indirect injection Very Low Guardrails rarely cover external content

Best Practices

  • Layer multiple guardrail types — Don't rely on a single mechanism
  • Test adversarially — Regular red teaming exposes gaps
  • Monitor in production — Track bypass attempts and update rules
  • Fail closed — When guardrails are uncertain, err toward safety
  • Separate concerns — Use different guardrails for input, model, and output
  • Maintain update cycles — New attacks require new guardrail patterns

Real-World Examples

OpenAI's GPT-4 — Combines RLHF, Constitutional AI principles, and external moderation API for layered protection.

Anthropic's Claude — Uses Constitutional AI and red team-informed training with ongoing monitoring.

NVIDIA NeMo Guardrails — Open-source toolkit for implementing programmable guardrails with Colang rules.


References

  • Bai, Y. et al. (2022). "Constitutional AI: Harmlessness from AI Feedback." Anthropic.
  • NVIDIA (2023). "NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications."
  • Meta AI (2023). "Purple Llama: Tools for responsible AI deployment."
  • OpenAI (2023). "GPT-4 System Card: Safety evaluations and mitigations."

Framework Mappings

Framework Reference
NIST AI RMF GOVERN 1.2, MAP 1.1, MANAGE 1.1
OWASP LLM Top 10 LLM01, LLM07 (Mitigations)
EU AI Act Article 9: Risk Management System

Citation

Aizen, K. (2025). "Guardrails." AI Security Wiki, snailsploit.com. Retrieved from https://snailsploit.com/ai-security/wiki/defenses/guardrails/