Transformer
A neural network architecture using self-attention to process sequences in parallel — the foundation of all modern LLMs.
Definition
The Transformer, introduced in the 2017 paper "Attention Is All You Need" by researchers at Google, is the architecture underlying virtually all modern large language models. It replaced recurrent networks for sequence tasks with self-attention mechanisms that relate every token to every other token in a single parallel step.
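To make "relate every token to every other token" concrete, here is a minimal NumPy sketch of the paper's scaled dot-product attention, softmax(QKᵀ/√dₖ)V, applied as self-attention. The function names, variable names, and toy dimensions are illustrative, not from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over one sequence.

    X: (seq_len, d_model) token embeddings
    Wq, Wk, Wv: (d_model, d_k) learned projection matrices (illustrative)
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv         # project tokens to queries/keys/values
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # every token scored against every other token
    weights = softmax(scores, axis=-1)       # each row: attention distribution over positions
    return weights @ V                       # weighted mix of value vectors

# Toy usage: 5 tokens, 16-dim embeddings, 8-dim projections.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))
Wq, Wk, Wv = (rng.normal(size=(16, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)  # shape (5, 8): one context-mixed vector per token
```

Each row of `weights` is a probability distribution over all positions, which is what lets every output vector draw on the entire sequence at once rather than step by step.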
Key components include multi-head self-attention (letting the model attend to different parts of the input for different purposes), feed-forward networks, positional encodings (since attention itself is permutation-invariant), and layer normalisation. Encoder-only variants (BERT) suit understanding tasks such as classification; encoder-decoder variants (T5, BART) suit translation and summarisation; decoder-only variants (GPT, LLaMA) suit open-ended generation.
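As a rough illustration of how those components fit together, here is a sketch of a single Transformer block in PyTorch. It uses the pre-norm layout common in GPT-style models (the original paper normalised after each residual instead), and the hyperparameter values are illustrative defaults, not taken from any specific published model.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One pre-norm Transformer block combining the components named above."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        # Multi-head self-attention: n_heads parallel attention "views" of the input.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Position-wise feed-forward network applied independently at each position.
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)  # layer normalisation
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.norm1(x)
        # Self-attention: queries, keys, and values all come from the same sequence.
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out                  # residual connection around attention
        x = x + self.ff(self.norm2(x))    # residual connection around the feed-forward net
        return x

# Usage: a batch of 2 sequences, 10 tokens each, 512-dim embeddings.
x = torch.randn(2, 10, 512)
block = TransformerBlock()
print(block(x).shape)  # torch.Size([2, 10, 512])
```

Positional encodings (the paper used a sinusoidal scheme; learned embeddings are also common) would be added to the token embeddings before the first block, since nothing inside the block depends on token order.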
Transformers enabled a step-change in NLP and beyond: they process all positions of a sequence in parallel during training (unlike RNNs, which must step through tokens one at a time), scale efficiently with more compute and data, and exhibit emergent capabilities at large scales. Vision Transformers (ViT) have also brought the architecture to image recognition.
Examples
- GPT-4
- Claude
- Gemini
- BERT
- LLaMA
- Vision Transformer (ViT)
Related Terms
Large Language Model (LLM)
A transformer-based AI system trained on billions of tokens of text, capable of generating, reasoning about, and transforming language.
Attention Mechanism
A neural network component that lets models dynamically focus on relevant parts of the input when generating each output token.
BERT
Google's 2018 bidirectional encoder model that transformed NLP by learning contextual word representations from unlabelled text.
GPT (Generative Pre-trained Transformer)
OpenAI's series of large decoder-only language models that established the paradigm for modern AI assistants.