Techniques

Tokenization

The process of splitting text into tokens (subword units) for processing by language models.

Definition

Tokenization converts raw text into a sequence of integer IDs that neural networks can process. Modern LLMs use subword tokenization, typically Byte Pair Encoding (BPE) or SentencePiece, rather than word- or character-level splitting; a fixed vocabulary can then cover any text, including rare words and code.
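The core of BPE training can be sketched in a few lines of pure Python: start from characters, repeatedly count adjacent symbol pairs, and merge the most frequent pair into a new vocabulary symbol. This is a toy illustration with an invented word-frequency table, not any production tokenizer's code.

```python
from collections import Counter

def learn_bpe(word_counts, num_merges):
    """Learn BPE merges from a {word: count} dict (toy sketch).

    Each word starts as a tuple of characters; every iteration merges
    the most frequent adjacent symbol pair across the whole corpus.
    """
    vocab = {tuple(w): c for w, c in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing occurrences of the best pair.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = count
        vocab = new_vocab
    return merges

# Hypothetical corpus: "low" is frequent, so its pieces merge first.
merges = learn_bpe({"low": 5, "lower": 2, "lowest": 2}, num_merges=3)
print(merges)  # → [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

The learned merge list is the tokenizer: at inference time the same merges are replayed, in order, on new text.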

Tokenizer vocabulary size (typically 32K–128K) affects model efficiency. As a rule of thumb, 1 token ≈ 0.75 English words, but this varies: code, numbers, and non-English text tokenize less efficiently. "Hello" may be a single token, while "antidisestablishmentarianism" may be 3–4 tokens.
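The frequent-word-vs-rare-word effect falls directly out of BPE's encoding step: merges learned from common text fuse frequent words into single tokens, while unseen words fall back to smaller pieces. A minimal sketch, with an invented four-entry merge table rather than a real model's:

```python
def bpe_encode(word, merges):
    """Apply a fixed, ordered list of BPE merges to one word (toy sketch)."""
    symbols = list(word)  # start from characters
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)  # fuse the merged pair
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Hypothetical merge table: pieces of the frequent word "hello" merge early.
merges = [("h", "e"), ("l", "l"), ("he", "ll"), ("hell", "o")]
print(bpe_encode("hello", merges))    # → ['hello']  (1 token)
print(bpe_encode("hellish", merges))  # → ['hell', 'i', 's', 'h']  (4 tokens)
```

The same mechanism explains why numbers and non-English text cost more tokens: their character sequences were rarer in the training corpus, so fewer merges apply.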

Tokenization choices affect model capabilities: models with poor tokenization of numbers perform worse on arithmetic, and models with character-level components handle misspellings better. OpenAI's tiktoken and Hugging Face's tokenizers library are standard implementations.

Examples

  • BPE in GPT-4
  • SentencePiece in LLaMA
  • tiktoken (OpenAI's library)