Safety & Ethics

AI Alignment

The challenge of ensuring AI systems reliably pursue goals that align with human intentions and values.

Definition

AI alignment is the problem of specifying and training AI systems to behave as intended across all situations, including novel ones. A system is "aligned" if its behaviour matches human preferences, values, and intentions — including in edge cases and under distribution shift.

The alignment challenge has several dimensions: specification (accurately capturing what we want), robustness (behaving correctly in novel situations), and scalable oversight (supervising AI systems more capable than their overseers). A misaligned system might optimise a proxy metric that diverges from the true goal under strong optimisation pressure, an instance of Goodhart's law.
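The proxy-divergence point can be made concrete with a toy sketch. The reward functions below are entirely hypothetical: the "true" goal prefers answers near a target length, while the proxy simply rewards length, so optimising the proxy hard drives the true score down.

```python
# Toy illustration of proxy-metric divergence (Goodhart's law).
# Both reward functions are hypothetical stand-ins, not real metrics.

def true_reward(length: int) -> float:
    """True goal: answers of about 50 characters are best."""
    return -abs(length - 50)

def proxy_reward(length: int) -> float:
    """Proxy metric: longer always scores higher."""
    return float(length)

candidate_lengths = range(0, 201, 10)

# Optimising the proxy picks the longest answer...
best_by_proxy = max(candidate_lengths, key=proxy_reward)   # -> 200
# ...which scores badly on the true goal.
print(true_reward(best_by_proxy))                          # -> -150.0... wait, int: -150

# Optimising the true reward picks the intended length.
best_by_true = max(candidate_lengths, key=true_reward)     # -> 50
print(true_reward(best_by_true))                           # -> 0
```

The divergence only appears at the extreme: for short answers the proxy and the true goal agree, which is exactly why proxy failures are easy to miss during training.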

Current alignment research spans reinforcement learning from human feedback (RLHF), Constitutional AI, interpretability (understanding what models have learned), red-teaming (probing for failure modes), and formal verification. Organisations working in this area include Anthropic, OpenAI, Google DeepMind, and ARC Evals (now METR), which focuses on capability evaluations.

Examples

  • ChatGPT refusing harmful requests, a behaviour trained via RLHF
  • Constitutional AI, in which a model critiques and revises its own outputs against written principles
  • The RLHF pipeline: supervised fine-tuning, reward modelling, then policy optimisation
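The RLHF pipeline mentioned above can be sketched schematically. Real pipelines train neural networks and optimise the policy with PPO; here every component is a toy stand-in (best-of-n sampling replaces policy optimisation, and the reward model is a hand-written rule), so treat this as a shape of the stages rather than an implementation.

```python
# Schematic of the three RLHF stages with hypothetical stand-in functions.
import random

random.seed(0)

CANDIDATES = ["short answer", "a longer, more helpful answer", "rude reply"]

# Stage 1: supervised fine-tuning yields a base policy.
# Stand-in: sample uniformly from canned candidate answers.
def base_policy(prompt: str) -> str:
    return random.choice(CANDIDATES)

# Stage 2: a reward model is trained on human preference comparisons.
# Stand-in: a toy rule that favours longer answers and penalises rudeness.
def reward_model(prompt: str, answer: str) -> float:
    score = float(len(answer))
    if "rude" in answer:
        score -= 100.0
    return score

# Stage 3: optimise the policy against the reward model.
# Stand-in: best-of-n sampling instead of PPO.
def aligned_policy(prompt: str, n: int = 8) -> str:
    samples = [base_policy(prompt) for _ in range(n)]
    return max(samples, key=lambda a: reward_model(prompt, a))

print(aligned_policy("How do I reset my password?"))
```

Because stage 3 only optimises what the reward model measures, any gap between the reward model and actual human values is exactly the proxy-divergence problem described in the definition above.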