Safety & Ethics

Interpretability / Explainability

The ability to understand, explain, and audit how an AI model arrives at its outputs.

Definition

Interpretability research aims to understand the internal mechanisms of AI models: what features they detect, how information flows through layers, which concepts neurons represent, and why particular outputs are produced. It matters both for safety (e.g. detecting misaligned goals) and in practice (regulatory compliance, debugging).

Approaches include: attention visualisation (which tokens influence outputs), feature attribution (SHAP, LIME), probing classifiers (testing what information is encoded in hidden states), mechanistic interpretability (reverse-engineering circuits responsible for specific behaviours), and saliency maps for vision models.
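To make feature attribution concrete, here is a minimal sketch of an ablation-style attribution (a simpler cousin of SHAP and LIME, not either library's actual algorithm): each input feature is replaced with a baseline value in turn, and the resulting change in the model's output is taken as that feature's attribution. The toy linear "model" below is a hypothetical stand-in chosen so the result is easy to verify by hand.

```python
import numpy as np

def occlusion_attribution(model, x, baseline=0.0):
    """Attribute a scalar model output to each input feature by
    replacing one feature at a time with a baseline value and
    measuring the change in output (ablation-style attribution)."""
    full = model(x)
    attributions = np.empty_like(x, dtype=float)
    for i in range(len(x)):
        occluded = x.copy()
        occluded[i] = baseline        # "remove" feature i
        attributions[i] = full - model(occluded)
    return attributions

# Toy model: a fixed linear scorer, so attributions are easy to check.
weights = np.array([2.0, -1.0, 0.5])
model = lambda x: float(weights @ x)

x = np.array([1.0, 3.0, -2.0])
attr = occlusion_attribution(model, x)
# For a linear model with a zero baseline, feature i's attribution
# is exactly weights[i] * x[i].
```

Real attribution methods differ mainly in how they handle feature interactions: SHAP averages such ablations over all feature subsets, while this sketch ablates one feature at a time.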

Mechanistic interpretability, pioneered by Anthropic and researchers like Chris Olah, aims to produce "a complete, mechanistic account of everything a neural network does" — analogous to understanding a piece of software by reading its source code. This remains a grand challenge for frontier models.

Examples

  • SHAP values for ML models
  • Attention heatmaps
  • Anthropic's mechanistic interpretability research
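The probing-classifier approach mentioned above can be sketched as follows. A simple linear classifier is trained on frozen hidden states to predict a concept label; high held-out accuracy suggests the concept is (linearly) decodable from those states. The synthetic "hidden states" here are a hypothetical stand-in; in practice they would be activations extracted from a real model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for hidden states: 64-dim vectors in which one
# direction linearly encodes a binary concept (hypothetical data).
n, d = 1000, 64
concept = rng.integers(0, 2, size=n)          # binary concept label
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)        # unit concept direction
hidden = rng.normal(size=(n, d)) + 4.0 * np.outer(concept, direction)

X_train, X_test, y_train, y_test = train_test_split(
    hidden, concept, random_state=0
)

# The probe: a plain linear model trained on the frozen representations.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = probe.score(X_test, y_test)
```

A deliberately simple probe is the point of the method: if even a linear model can recover the concept, the representation encodes it accessibly, whereas a powerful probe might learn the concept itself rather than read it out.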