Models

Context Window

The maximum number of tokens an LLM can process in a single request, determining how much text it can "see" at once.

Definition

The context window defines how much input (conversation history, documents, code) an LLM can process at once, measured in tokens. GPT-3-era models topped out at a 4K token window (~3,000 words); frontier models now support 128K-2M tokens, enough for books or entire codebases.

Context window size matters enormously for real-world applications: analysing a full legal contract, maintaining a long conversation, processing an entire codebase, or reading a research paper. Larger windows enable richer context, but with vanilla attention the compute cost grows quadratically in sequence length (O(n²)).
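The quadratic cost is easy to see with a little arithmetic: vanilla attention scores every token against every other token, so the score matrix has n × n entries. A minimal, model-agnostic sketch (function name is illustrative):

```python
# Illustrative sketch: the attention score matrix for n tokens has n * n
# entries per head, so doubling context length roughly quadruples the
# compute and memory of vanilla (full) attention.
def attention_matrix_entries(n_tokens: int) -> int:
    """Number of query-key scores computed per attention head."""
    return n_tokens * n_tokens

for n in (4_000, 128_000, 1_000_000):
    print(f"{n:>9,} tokens -> {attention_matrix_entries(n):>16,} scores per head")
```

Going from a 4K to a 128K window multiplies sequence length by 32 but the score-matrix size by 1,024, which is why long-context models rely on the approximation techniques described below.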

Techniques such as sliding-window attention, sparse attention, and linear-attention approximations extend practical context length. Even within a model's window, retrieval quality often degrades for information buried in the middle of long contexts (the "lost in the middle" phenomenon), which remains an active research area.
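Sliding-window attention can be sketched as a mask: each token attends only to itself and its most recent neighbours, cutting cost from O(n²) to O(n × window). This is a toy illustration of the idea, not any particular model's implementation:

```python
# Sketch of a causal sliding-window attention mask: token i may attend to
# token j only if j is within the last `window` positions up to and
# including i. Cost per token is bounded by `window` instead of n.
def sliding_window_mask(n_tokens: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True if token i may attend to token j."""
    return [
        [(i - window < j <= i) for j in range(n_tokens)]
        for i in range(n_tokens)
    ]

# Visualise the banded structure for a tiny sequence.
for row in sliding_window_mask(6, 3):
    print("".join("1" if allowed else "." for allowed in row))
```

Each printed row contains at most three 1s, showing the banded (diagonal) pattern that replaces the full lower-triangular causal mask.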

Examples

  • GPT-4 Turbo (128K tokens)
  • Claude 3 (200K tokens)
  • Gemini 1.5 Pro (1M tokens)