Transformer

A neural network architecture using self-attention to process sequences in parallel — the foundation of all modern LLMs.

Definition

The Transformer, introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. at Google, is the architecture underlying virtually all modern large language models. It displaced recurrent networks for sequence tasks by using self-attention, a mechanism that relates every token to every other token in a single parallel step rather than processing the sequence one position at a time.
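The core operation can be sketched in a few lines. This is a minimal, illustrative single-head version using NumPy (the function name and the random projection matrices in the usage below are our own; real models use learned weights and many heads):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of token vectors.

    X: (seq_len, d_model) token embeddings.
    Wq, Wk, Wv: (d_model, d_k) learned projection matrices.
    Every token attends to every other token via one matrix product,
    which is what lets the Transformer process the sequence in parallel.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq_len, seq_len) similarities
    # Row-wise softmax turns similarities into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                         # weighted mix of value vectors
```

For example, `self_attention(X, Wq, Wk, Wv)` on a 4-token, 8-dimensional input returns a (4, 8) array in which each row is a context-dependent mixture of all four value vectors.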

Key components include multi-head self-attention (which lets the model attend to different parts of the input for different purposes), position-wise feed-forward networks, positional encodings (needed because attention alone is permutation-invariant), residual connections, and layer normalisation. Encoder-decoder variants (T5, BART) suit translation and other sequence-to-sequence tasks; decoder-only variants (GPT, LLaMA) suit open-ended generation.
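The positional encodings mentioned above can be computed with fixed sinusoids, as in the original paper. A minimal sketch (function name is our own; this assumes an even embedding dimension, and many modern models use learned or rotary positions instead):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings as in "Attention Is All You Need".

    Because self-attention is permutation-invariant, position information
    is injected by adding these fixed sin/cos patterns to the token
    embeddings. Assumes d_model is even.
    """
    positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # even dimensions
    angles = positions / np.power(10000.0, dims / d_model)   # one frequency per dim pair
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even indices get sine
    pe[:, 1::2] = np.cos(angles)   # odd indices get cosine
    return pe
```

Each position gets a unique pattern across frequencies, and nearby positions get similar patterns, which gives attention a usable notion of order and distance.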

Transformers enabled a step-change in NLP and beyond: they parallelise over sequences (unlike RNNs), scale efficiently with more compute and data, and exhibit emergent capabilities at large scales. Vision Transformers (ViT) have also brought the architecture to image recognition.

Examples

  • GPT-4
  • Claude
  • Gemini
  • BERT
  • LLaMA
  • Vision Transformer (ViT)
