AI Alignment
The challenge of ensuring AI systems reliably pursue goals that align with human intentions and values.
Definition
AI alignment is the problem of specifying and training AI systems to behave as intended across all situations, including novel ones. A system is "aligned" if its behaviour matches human preferences, values, and intentions — including in edge cases and under distribution shift.
The alignment challenge has several dimensions: specification (accurately capturing what we want), robustness (behaving correctly in novel situations), and scalable oversight (supervising AI systems more capable than their overseers). Misaligned AI might optimise for a proxy metric that diverges from the true goal at scale.
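The proxy-divergence failure described above can be made concrete with a toy sketch (an illustration, not from the source): suppose the true goal rewards responses near an ideal length, while the proxy metric simply rewards longer responses. The function names, candidate range, and ideal length here are all hypothetical.

```python
# Toy Goodhart-style failure: optimising a proxy metric diverges
# from the true goal at the extreme. All values are illustrative.

def true_reward(length: int, ideal: int = 50) -> float:
    """True goal: quality peaks at an ideal length, then declines."""
    return -abs(length - ideal)

def proxy_reward(length: int) -> float:
    """Proxy metric: longer always looks better, without bound."""
    return float(length)

# Greedily optimise each objective over the same candidate lengths.
candidates = range(1, 501)
best_for_proxy = max(candidates, key=proxy_reward)
best_for_true = max(candidates, key=true_reward)

print(best_for_proxy)               # 500 -- proxy pushes to the extreme
print(best_for_true)                # 50  -- true goal wanted moderation
print(true_reward(best_for_proxy))  # -450 -- proxy optimum scores badly
```

The proxy and true objectives agree for short responses but pull apart as optimisation pressure increases, which is the core of the specification problem.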
Current alignment research spans RLHF, Constitutional AI, interpretability (understanding what models have learned), red-teaming (finding failure modes), and formal verification. Organisations active in this area include Anthropic, OpenAI, Google DeepMind, and METR (formerly ARC Evals), which focuses on evaluating dangerous capabilities.
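The reward-modelling step at the heart of RLHF can be sketched in miniature (a dependency-free illustration, not a production pipeline): a reward model is fit so that, under a Bradley-Terry preference model, the human-preferred response scores higher than the rejected one. The features, data, and finite-difference gradient here are all hypothetical simplifications.

```python
import math

def reward(weights, features):
    """Linear reward model over hand-made response features."""
    return sum(w * f for w, f in zip(weights, features))

def preference_loss(weights, chosen, rejected):
    """-log sigmoid(r_chosen - r_rejected): low when the model
    agrees with the human preference label."""
    margin = reward(weights, chosen) - reward(weights, rejected)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def grad_step(weights, chosen, rejected, lr=0.1, eps=1e-5):
    """One gradient step via finite differences (keeps the sketch
    free of ML-library dependencies)."""
    grads = []
    for i in range(len(weights)):
        bumped = list(weights)
        bumped[i] += eps
        g = (preference_loss(bumped, chosen, rejected)
             - preference_loss(weights, chosen, rejected)) / eps
        grads.append(g)
    return [w - lr * g for w, g in zip(weights, grads)]

# Hypothetical features: [helpfulness cue, harmful-content cue].
chosen, rejected = [1.0, 0.0], [0.2, 1.0]
w = [0.0, 0.0]
for _ in range(100):
    w = grad_step(w, chosen, rejected)

# After fitting, the preferred response scores higher.
print(reward(w, chosen) > reward(w, rejected))  # True
```

In a full RLHF pipeline this fitted reward model would then guide policy fine-tuning (e.g. with PPO); the sketch covers only the preference-learning stage.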
Examples
- ChatGPT refusing harmful requests
- Constitutional AI principles
- RLHF alignment pipeline
Related Terms
AI Safety
The field focused on ensuring AI systems remain beneficial, controllable, and aligned with human values as they become more capable.
RLHF (Reinforcement Learning from Human Feedback)
A training technique where human preference ratings guide language model fine-tuning to produce more helpful, harmless outputs.
Constitutional AI (CAI)
Anthropic's training approach that uses a set of principles to guide AI self-critique and revision for safer outputs.
Hallucination
When an AI model generates plausible-sounding but factually incorrect or fabricated information.