Safety & Ethics

Interpretability / Explainability

The ability to understand, explain, and audit how an AI model arrives at its outputs.

Definition

Interpretability research aims to understand the internal mechanisms of AI models: what features they detect, how information flows through layers, which concepts neurons represent, and why particular outputs are produced. It matters both for safety (e.g. detecting misaligned goals) and in practice (regulatory compliance, debugging).

Approaches include: attention visualisation (which tokens influence outputs), feature attribution (SHAP, LIME), probing classifiers (testing what information is encoded in hidden states), mechanistic interpretability (reverse-engineering circuits responsible for specific behaviours), and saliency maps for vision models.
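To make feature attribution concrete, here is a minimal sketch of an ablation-style attribution (a simpler cousin of SHAP and LIME, not either library's actual algorithm): each input feature is replaced with a baseline value in turn, and the resulting change in the model's output is taken as that feature's attribution. The toy linear "model" below is a hypothetical stand-in chosen so the result is easy to verify by hand.

```python
import numpy as np

def occlusion_attribution(model, x, baseline=0.0):
    """Attribute a scalar model output to each input feature by
    replacing one feature at a time with a baseline value and
    measuring the change in output (ablation-style attribution)."""
    full = model(x)
    attributions = np.empty_like(x, dtype=float)
    for i in range(len(x)):
        occluded = x.copy()
        occluded[i] = baseline        # "remove" feature i
        attributions[i] = full - model(occluded)
    return attributions

# Toy model: a fixed linear scorer, so attributions are easy to check.
weights = np.array([2.0, -1.0, 0.5])
model = lambda x: float(weights @ x)

x = np.array([1.0, 3.0, -2.0])
attr = occlusion_attribution(model, x)
# For a linear model with a zero baseline, feature i's attribution
# is exactly weights[i] * x[i].
```

Real attribution methods differ mainly in how they handle feature interactions: SHAP averages such ablations over all feature subsets, while this sketch ablates one feature at a time.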

Mechanistic interpretability, pioneered by Anthropic and researchers like Chris Olah, aims to produce "a complete, mechanistic account of everything a neural network does" — analogous to understanding a piece of software by reading its source code. This remains a grand challenge for frontier models.

Examples

  • SHAP values for ML models
  • Attention heatmaps
  • Anthropic's mechanistic interpretability research
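The probing-classifier approach mentioned above can be sketched as follows. A simple linear classifier is trained on frozen hidden states to predict a concept label; high held-out accuracy suggests the concept is (linearly) decodable from those states. The synthetic "hidden states" here are a hypothetical stand-in; in practice they would be activations extracted from a real model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for hidden states: 64-dim vectors in which one
# direction linearly encodes a binary concept (hypothetical data).
n, d = 1000, 64
concept = rng.integers(0, 2, size=n)          # binary concept label
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)        # unit concept direction
hidden = rng.normal(size=(n, d)) + 4.0 * np.outer(concept, direction)

X_train, X_test, y_train, y_test = train_test_split(
    hidden, concept, random_state=0
)

# The probe: a plain linear model trained on the frozen representations.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = probe.score(X_test, y_test)
```

A deliberately simple probe is the point of the method: if even a linear model can recover the concept, the representation encodes it accessibly, whereas a powerful probe might learn the concept itself rather than read it out.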