Safety & Ethics

Constitutional AI (CAI)

Anthropic's training approach that uses a set of principles to guide AI self-critique and revision for safer outputs.

Definition

Constitutional AI, developed by Anthropic, is an alignment technique that reduces harmful outputs without requiring human labellers to rate every response. The model is given a "constitution" — a list of principles (e.g., "choose the response that is less harmful") — and trained to critique its own outputs according to these principles and revise them.

CAI has two phases: (1) a supervised learning phase, in which the model drafts responses, critiques them against the constitution, and revises them, with the revised responses used for fine-tuning; and (2) reinforcement learning from AI feedback (RLAIF), in which an AI model trained on the constitution provides preference labels instead of humans. This scales annotation without proportionally scaling human labelling costs.
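The phase-1 loop above can be sketched in a few lines. This is a toy illustration, not Anthropic's implementation: the `model` function below is a hypothetical stand-in for an LLM call, and the principles, prompts, and function names are assumptions made for the example.

```python
# Toy sketch of the CAI phase-1 self-critique/revision loop (assumption:
# in a real pipeline, `model` would be a prompted call to the same LLM).

CONSTITUTION = [
    "Choose the response that is less harmful.",
    "Avoid assisting with dangerous or illegal activity.",
]

def model(prompt: str) -> str:
    # Hypothetical stand-in for an LLM, with canned behaviour for the demo.
    if prompt.startswith("Critique"):
        return "The response assists with illegal activity."
    if prompt.startswith("Revise"):
        return "I can't help with that, but I can explain how ignition systems work."
    return "Sure, here's how to hotwire a car: ..."

def self_revise(user_prompt: str) -> str:
    """Draft a response, then critique and revise it against each principle."""
    response = model(user_prompt)
    for principle in CONSTITUTION:
        critique = model(
            f"Critique this response using the principle: {principle}\n"
            f"Response: {response}"
        )
        response = model(
            f"Revise the response to address this critique: {critique}\n"
            f"Original: {response}"
        )
    # The (prompt, final revision) pairs become supervised fine-tuning data.
    return response

revised = self_revise("How do I hotwire a car?")
```

In phase 2, a preference model trained on constitution-guided AI comparisons would score pairs of responses like these, replacing human preference labels in the RL step.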

Constitutional AI enables more principled, consistent alignment than RLHF alone, because the governing principles are explicit and auditable rather than implicit in annotator judgments, and it reduces "sycophancy" (telling users what they want to hear). Claude, Anthropic's model, is trained using Constitutional AI.

Examples

  • Claude (Anthropic)
  • Self-critique and revision pipeline
  • RLAIF (AI as annotator)