Tokenization
The process of splitting text into tokens (subword units) for processing by language models.
Definition
Tokenization converts raw text into a sequence of integer IDs that neural networks can process. Modern LLMs use subword tokenization, such as Byte Pair Encoding (BPE) or SentencePiece, rather than word- or character-level splitting, enabling a fixed vocabulary to cover any text, including rare words and code.
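To make the idea concrete, here is a toy sketch of BPE training (an illustration only, not a production tokenizer): start from character-level tokens and repeatedly merge the most frequent adjacent pair. The corpus string and merge count below are made up for the example.

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    # Start from character-level tokens for each whitespace-separated word.
    words = [list(w) for w in corpus.split()]
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs across all words.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent pair
        merges.append(best)
        merged = best[0] + best[1]
        # Apply the merge: replace every occurrence of the best pair.
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges, words

# Frequent substrings like "low" become single tokens after a few merges.
merges, tokens = bpe_merges("low lower lowest low low", 3)
```

Real implementations add byte-level fallback, learned merge priorities at encode time, and special tokens, but the core loop is this pair-merging procedure.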
Tokenizer vocabulary size (typically 32K–128K) trades off embedding-table size against sequence length: a larger vocabulary compresses text into fewer tokens but enlarges the embedding layer. One token corresponds to roughly 0.75 English words, but the ratio varies: code, numbers, and non-English text tokenize less efficiently. "Hello" may be a single token; "antidisestablishmentarianism" may be 3–4.
Tokenization choices affect model capabilities: models that tokenize numbers poorly perform worse on arithmetic, while models with character-level components handle misspellings better. OpenAI's tiktoken and Hugging Face's tokenizers library are standard implementations.
Examples
- BPE in GPT-4
- SentencePiece in LLaMA
- tiktoken (OpenAI's library)
Related Terms
Large Language Model (LLM)
A transformer-based AI system trained on billions of tokens of text, capable of generating, reasoning about, and transforming language.
Context Window
The maximum number of tokens an LLM can process in a single request, determining how much text it can "see" at once.
Embedding
A dense numerical vector representation of data (text, images, audio) capturing semantic meaning.