Safety & Ethics

AI Safety

The field focused on ensuring AI systems remain beneficial, controllable, and aligned with human values as they become more capable.

Definition

AI safety encompasses research and practices aimed at preventing AI systems from causing unintended harm — whether through misalignment, misuse, accidents, or systemic risks. The field distinguishes between near-term safety (reducing harmful outputs from deployed LLMs) and long-term safety (preventing catastrophic risks from future highly capable systems).

Near-term safety work includes jailbreak mitigation, bias detection, RLHF alignment, interpretability research, and robust evaluation. Long-term safety research addresses potential risks from superintelligent systems: misaligned reward hacking, deceptive alignment, and power-seeking behaviours.

Key organisations include Anthropic (Constitutional AI), ARC Evals, the UK AI Safety Institute, DeepMind's safety team, and academic groups at MIT, Stanford, and Berkeley. The 2023 UK AI Safety Summit at Bletchley Park marked the first major governmental focus on frontier AI risks.