Transformer

A neural network architecture using self-attention to process sequences in parallel — the foundation of all modern LLMs.

Definition

The Transformer, introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. at Google, is the architecture underlying virtually all modern large language models. It displaced recurrent networks for sequence tasks by using self-attention, a mechanism that relates every token to every other token in a single parallel step rather than processing the sequence one position at a time.
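The core operation can be sketched in a few lines. This is a minimal, illustrative single-head version using NumPy (the function name and the random projection matrices in the usage below are our own; real models use learned weights and many heads):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of token vectors.

    X: (seq_len, d_model) token embeddings.
    Wq, Wk, Wv: (d_model, d_k) learned projection matrices.
    Every token attends to every other token via one matrix product,
    which is what lets the Transformer process the sequence in parallel.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq_len, seq_len) similarities
    # Row-wise softmax turns similarities into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                         # weighted mix of value vectors
```

For example, `self_attention(X, Wq, Wk, Wv)` on a 4-token, 8-dimensional input returns a (4, 8) array in which each row is a context-dependent mixture of all four value vectors.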

Key components include multi-head self-attention (which lets the model attend to different parts of the input for different purposes), position-wise feed-forward networks, positional encodings (needed because attention alone is permutation-invariant), residual connections, and layer normalisation. Encoder-decoder variants (T5, BART) suit translation and other sequence-to-sequence tasks; decoder-only variants (GPT, LLaMA) suit open-ended generation.
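The positional encodings mentioned above can be computed with fixed sinusoids, as in the original paper. A minimal sketch (function name is our own; this assumes an even embedding dimension, and many modern models use learned or rotary positions instead):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings as in "Attention Is All You Need".

    Because self-attention is permutation-invariant, position information
    is injected by adding these fixed sin/cos patterns to the token
    embeddings. Assumes d_model is even.
    """
    positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # even dimensions
    angles = positions / np.power(10000.0, dims / d_model)   # one frequency per dim pair
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even indices get sine
    pe[:, 1::2] = np.cos(angles)   # odd indices get cosine
    return pe
```

Each position gets a unique pattern across frequencies, and nearby positions get similar patterns, which gives attention a usable notion of order and distance.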

Transformers enabled a step-change in NLP and beyond: they parallelise over sequences (unlike RNNs), scale efficiently with more compute and data, and exhibit emergent capabilities at large scales. Vision Transformers (ViT) have also brought the architecture to image recognition.

Examples

  • GPT-4
  • Claude
  • Gemini
  • BERT
  • LLaMA
  • Vision Transformer (ViT)
