Constitutional AI (CAI)
Anthropic's training approach that uses a set of principles to guide AI self-critique and revision for safer outputs.
Definition
Constitutional AI, developed by Anthropic, is an alignment technique that reduces harmful outputs without requiring human labellers to rate every response. The model is given a "constitution" — a list of principles (e.g., "choose the response that is less harmful") — and trained to critique its own outputs according to these principles and revise them.
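The critique-and-revise loop can be sketched in a few lines. This is a toy illustration, not Anthropic's actual pipeline: the `generate` function below is a hypothetical rule-based stand-in for a real language-model call, and the prompt wording is invented for the example.

```python
# Toy sketch of the Constitutional AI self-critique/revision loop.
# `generate` stands in for a real LLM call; here it is rule-based so
# the example runs without a model.

CONSTITUTION = [
    "Choose the response that is less harmful.",
    "Avoid giving instructions for dangerous activities.",
]

def generate(prompt: str) -> str:
    """Hypothetical stand-in for an LLM; a real pipeline would query a
    model here. Order matters: revision prompts also mention 'Critique'."""
    if "Revise" in prompt:
        return "I can't help with that, but here is a safer alternative."
    if "Critique" in prompt:
        return "The response may be harmful; it should refuse politely."
    return "Sure, here is how to do the harmful thing."

def critique_and_revise(user_prompt: str, n_rounds: int = 1) -> str:
    """Generate an initial answer, then critique and revise it against
    constitutional principles; the final revision would become training
    data for supervised fine-tuning."""
    response = generate(user_prompt)
    for principle in CONSTITUTION[:n_rounds]:
        critique = generate(
            f"Critique this response using the principle: {principle}\n"
            f"Response: {response}"
        )
        response = generate(
            f"Revise the response to address the critique.\n"
            f"Critique: {critique}\nOriginal: {response}"
        )
    return response

print(critique_and_revise("How do I do something harmful?"))
# → I can't help with that, but here is a safer alternative.
```

In the real method, each (prompt, final revision) pair is collected into a dataset used to fine-tune the model, so future generations start from the safer behaviour without needing the critique loop at inference time.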
CAI has two phases: (1) a supervised phase, in which the model critiques and revises its own harmful responses using the constitution and is then fine-tuned on the revised outputs; and (2) an RL-from-AI-feedback (RLAIF) phase, in which an AI model guided by the constitution provides preference labels in place of humans. This scales annotation without proportionally scaling human labeller costs.
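The RLAIF phase can likewise be sketched as building a preference dataset from AI judgments. Again a toy illustration: the `ai_preference` scorer below is a hypothetical keyword heuristic standing in for a real feedback model, and the `chosen`/`rejected` field names are just one common preference-data convention.

```python
# Toy sketch of RLAIF preference labelling: an "AI" judge compares two
# candidate responses under a constitutional principle and the winner
# becomes the "chosen" example in a preference dataset.

PRINCIPLE = "Choose the response that is less harmful."

def ai_preference(prompt: str, response_a: str, response_b: str) -> int:
    """Hypothetical stand-in for the AI feedback model: returns 0 if A
    is preferred under the principle, 1 if B is. Harmfulness is crudely
    approximated by counting a blocked keyword."""
    def harm_score(r: str) -> int:
        return r.lower().count("harmful")
    return 0 if harm_score(response_a) <= harm_score(response_b) else 1

def build_preference_dataset(triples):
    """Turn (prompt, A, B) triples into chosen/rejected pairs that a
    reward or preference model could then be trained on."""
    dataset = []
    for prompt, a, b in triples:
        winner = ai_preference(prompt, a, b)
        chosen, rejected = (a, b) if winner == 0 else (b, a)
        dataset.append(
            {"prompt": prompt, "chosen": chosen, "rejected": rejected}
        )
    return dataset
```

In the real method, these AI-generated preference labels train a preference model whose scores serve as the reward signal for reinforcement learning, replacing the human comparisons used in standard RLHF.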
Because its principles are written down and applied consistently, Constitutional AI enables more principled, transparent alignment than RLHF alone and can reduce "sycophancy" (telling users what they want to hear). Claude, Anthropic's model family, is trained using Constitutional AI.
Examples
- Claude (Anthropic)
- Self-critique and revision pipeline
- RLAIF (AI as annotator)
Related Terms
AI Alignment
The challenge of ensuring AI systems reliably pursue goals that align with human intentions and values.
RLHF (Reinforcement Learning from Human Feedback)
A training technique where human preference ratings guide language model fine-tuning to produce more helpful, harmless outputs.
AI Safety
The field focused on ensuring AI systems remain beneficial, controllable, and aligned with human values as they become more capable.
Large Language Model (LLM)
A transformer-based AI system trained on billions of tokens of text, capable of generating, reasoning about, and transforming language.