Interpretability / Explainability
The ability to understand, explain, and audit how an AI model arrives at its outputs.
Definition
Interpretability research aims to understand the internal mechanisms of AI models — what features they detect, how information flows through layers, what concepts neurons represent, and why particular outputs are produced. This understanding matters both for safety (detecting misaligned goals) and for practical reasons (regulatory compliance, debugging).
Approaches include:
- Attention visualisation: which tokens influence outputs
- Feature attribution: e.g. SHAP, LIME
- Probing classifiers: testing what information is encoded in hidden states
- Mechanistic interpretability: reverse-engineering the circuits responsible for specific behaviours
- Saliency maps for vision models
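As a toy illustration of feature attribution, one simple method is occlusion: mask each input feature in turn and measure how much the model's output changes. Everything below, including the linear scoring function standing in for a "model", is a made-up sketch, not any particular library's API:

```python
def model(features):
    # Hypothetical stand-in model: a fixed weighted sum of the inputs.
    weights = [0.5, -1.2, 2.0, 0.1]
    return sum(w * x for w, x in zip(weights, features))

def occlusion_attribution(features):
    """Score each feature by the output drop when it is zeroed out."""
    baseline = model(features)
    scores = []
    for i in range(len(features)):
        masked = list(features)
        masked[i] = 0.0  # occlude one feature
        scores.append(baseline - model(masked))
    return scores

attributions = occlusion_attribution([1.0, 1.0, 1.0, 1.0])
```

For a purely linear model the occlusion score of each feature recovers weight times value exactly; real attribution methods like SHAP generalise this idea to nonlinear models by averaging over many masking patterns.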
Mechanistic interpretability, pioneered by researchers such as Chris Olah and now a major focus at Anthropic, aims to produce "a complete, mechanistic account of everything a neural network does" — analogous to understanding a piece of software by reading its source code. This remains a grand challenge for frontier models.
Examples
- SHAP values for ML models
- Attention heatmaps
- Anthropic's mechanistic interpretability research
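The attention-heatmap example above can be sketched as one row of attention weights: dot-product scores between a query vector and each key vector, passed through a softmax. The query and key vectors here are made-up toy values, not taken from any real model:

```python
import math

def softmax(xs):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Toy single-head attention: one query attends over three key vectors.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
weights = softmax(scores)  # one row of an attention heatmap
```

Plotting such rows for every query position (and every head) gives the heatmaps used to see which tokens a model attends to when producing an output.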
Related Terms
AI Safety
The field focused on ensuring AI systems remain beneficial, controllable, and aligned with human values as they become more capable.
AI Alignment
The challenge of ensuring AI systems reliably pursue goals that align with human intentions and values.
Neural Network
A computing system of interconnected nodes inspired by biological brains, trained to recognise patterns.