Techniques

Tokenization

The process of splitting text into tokens (subword units) for processing by language models.

Definition

Tokenization converts raw text into a sequence of integer IDs that neural networks can process. Modern LLMs use subword tokenization, typically Byte Pair Encoding (BPE) or SentencePiece, rather than word- or character-level splitting; a fixed vocabulary can then cover any text, including rare words and code.
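The core of BPE training can be sketched in a few lines of pure Python: start from characters, repeatedly count adjacent symbol pairs, and merge the most frequent pair into a new vocabulary symbol. This is a toy illustration with an invented word-frequency table, not any production tokenizer's code.

```python
from collections import Counter

def learn_bpe(word_counts, num_merges):
    """Learn BPE merges from a {word: count} dict (toy sketch).

    Each word starts as a tuple of characters; every iteration merges
    the most frequent adjacent symbol pair across the whole corpus.
    """
    vocab = {tuple(w): c for w, c in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing occurrences of the best pair.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = count
        vocab = new_vocab
    return merges

# Hypothetical corpus: "low" is frequent, so its pieces merge first.
merges = learn_bpe({"low": 5, "lower": 2, "lowest": 2}, num_merges=3)
print(merges)  # → [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

The learned merge list is the tokenizer: at inference time the same merges are replayed, in order, on new text.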

Tokenizer vocabulary size (typically 32K–128K) affects model efficiency. As a rule of thumb, 1 token ≈ 0.75 English words, but this varies: code, numbers, and non-English text tokenize less efficiently. "Hello" may be a single token, while "antidisestablishmentarianism" may be 3–4 tokens.
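The frequent-word-vs-rare-word effect falls directly out of BPE's encoding step: merges learned from common text fuse frequent words into single tokens, while unseen words fall back to smaller pieces. A minimal sketch, with an invented four-entry merge table rather than a real model's:

```python
def bpe_encode(word, merges):
    """Apply a fixed, ordered list of BPE merges to one word (toy sketch)."""
    symbols = list(word)  # start from characters
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)  # fuse the merged pair
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Hypothetical merge table: pieces of the frequent word "hello" merge early.
merges = [("h", "e"), ("l", "l"), ("he", "ll"), ("hell", "o")]
print(bpe_encode("hello", merges))    # → ['hello']  (1 token)
print(bpe_encode("hellish", merges))  # → ['hell', 'i', 's', 'h']  (4 tokens)
```

The same mechanism explains why numbers and non-English text cost more tokens: their character sequences were rarer in the training corpus, so fewer merges apply.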

Tokenization choices affect model capabilities: models with poor tokenization of numbers perform worse on arithmetic, and models with character-level components handle misspellings better. OpenAI's tiktoken and Hugging Face's tokenizers library are standard implementations.

Examples

  • BPE in GPT-4
  • SentencePiece in LLaMA
  • tiktoken (OpenAI's library)