AI Safety
The field focused on ensuring AI systems remain beneficial, controllable, and aligned with human values as they become more capable.
Definition
AI safety encompasses research and practices aimed at preventing AI systems from causing unintended harm — whether through misalignment, misuse, accidents, or systemic risks. The field distinguishes between near-term safety (reducing harmful outputs from deployed LLMs) and long-term safety (preventing catastrophic risks from future highly capable systems).
Near-term safety work includes jailbreak mitigation, bias detection, RLHF alignment, interpretability research, and robust evaluation. Long-term safety research addresses potential risks from superintelligent systems: misaligned reward hacking, deceptive alignment, and power-seeking behaviours.
Key organisations include Anthropic (Constitutional AI), ARC Evals, the UK AI Safety Institute, DeepMind's safety team, and academic groups at MIT, Stanford, and Berkeley. The 2023 UK AI Safety Summit at Bletchley Park marked the first major governmental focus on frontier AI risks.
Examples
- Anthropic's Constitutional AI
- OpenAI's safety team
- UK AI Safety Institute
- Red-teaming exercises
Want a deeper dive?
Read our full explainer with use cases, how-it-works, and FAQs.
AI Agents concept guideCompanies using this
Related Terms
AI Alignment
The challenge of ensuring AI systems reliably pursue goals that align with human intentions and values.
Constitutional AI (CAI)
Anthropic's training approach that uses a set of principles to guide AI self-critique and revision for safer outputs.
RLHF (Reinforcement Learning from Human Feedback)
A training technique where human preference ratings guide language model fine-tuning to produce more helpful, harmless outputs.
Interpretability / Explainability
The ability to understand, explain, and audit how an AI model arrives at its outputs.