Techniques

Attention Mechanism

A neural network component that lets models dynamically focus on relevant parts of the input when generating each output token.

Definition

The attention mechanism, introduced in 2014 for machine translation and generalised to self-attention in the Transformer (2017), allows a model to weight the importance of different input positions when producing each output. Instead of compressing the entire input into a single fixed-length vector (as RNNs do), attention lets the model "look back" at relevant parts of the input at each decoding step.
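This weighting can be sketched as scaled dot-product attention: each query is compared against all keys, the scores are normalised with a softmax, and the values are averaged under those weights. A minimal NumPy sketch (shapes and the self-attention call are illustrative, not a production implementation):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Weight each value by how well its key matches the query.

    Q: (n_q, d_k) queries; K: (n_kv, d_k) keys; V: (n_kv, d_v) values.
    Returns the (n_q, d_v) outputs and the (n_q, n_kv) attention weights.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of each query to each key
    # Softmax over the key axis: weights for each query sum to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy self-attention: queries, keys, and values all come from the same input
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))  # 3 positions, dimension 4
out, w = scaled_dot_product_attention(X, X, X)
```

Each row of `w` is a probability distribution over input positions, which is exactly the "where to look" signal the definition describes.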

Self-attention (each position attends to all others in the same sequence) is the core Transformer operation. Multi-head attention runs several attention functions in parallel, allowing the model to attend to different aspects simultaneously — e.g., syntactic vs. semantic relationships.
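The split into parallel heads can be sketched as follows. This toy version divides the model dimension across heads and attends within each slice; the learned projection matrices (W_q, W_k, W_v, W_o) of a real Transformer are omitted for brevity:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, num_heads):
    """Split the model dimension into heads, attend in each, recombine.

    X: (seq_len, d_model); d_model must be divisible by num_heads.
    Learned projections are omitted, so each head just sees its own
    slice of the input features.
    """
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    # (num_heads, seq_len, d_head): one feature slice per head
    heads = X.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    scores = heads @ heads.transpose(0, 2, 1) / np.sqrt(d_head)
    out = softmax(scores) @ heads  # independent attention per head
    # Concatenate the heads back into the model dimension
    return out.transpose(1, 0, 2).reshape(seq_len, d_model)

X = np.random.default_rng(1).normal(size=(5, 8))  # 5 tokens, d_model=8
Y = multi_head_self_attention(X, num_heads=2)
```

Because each head computes its own attention weights, the two heads here can focus on different positions for the same token, which is the mechanism behind attending to different relationship types in parallel.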

Attention is what enables LLMs to track long-range dependencies (e.g. resolving "it" to a noun mentioned many sentences earlier), and is a key reason Transformers overtook RNNs on language tasks.

Examples

  • Transformer self-attention in GPT-4
  • Cross-attention in DALL-E (text→image)
  • BERT's bidirectional attention