Quantisation
Reducing the numerical precision of model weights to decrease memory footprint and accelerate inference with minimal accuracy loss.
Definition
Neural network weights are typically stored as 32-bit floating point (FP32) during training. Quantisation converts them to lower precision — INT8, INT4, or even binary — reducing memory usage by 4-8× and enabling faster integer arithmetic during inference. Post-training quantisation (PTQ) applies after training; quantisation-aware training (QAT) incorporates quantisation into the training loop for better accuracy.
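The core mapping from FP32 to INT8 can be sketched in a few lines. This is a minimal NumPy illustration of symmetric per-tensor quantisation (the function names `quantize_int8` and `dequantize` are illustrative, not from any particular library): each weight is divided by a scale derived from the tensor's maximum absolute value, rounded to an integer in [-127, 127], and later multiplied back to recover an approximation.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantisation: map FP32 weights into [-127, 127]."""
    scale = np.max(np.abs(w)) / 127.0       # one FP32 scale for the whole tensor
    q = np.round(w / scale).astype(np.int8)  # stored in 1 byte instead of 4
    return q, scale

def dequantize(q, scale):
    """Recover approximate FP32 weights from INT8 values and the scale."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

err = np.max(np.abs(w - w_hat))  # rounding error is bounded by scale / 2
```

Storing `q` takes a quarter of the memory of `w` (one byte per weight instead of four); real formats such as GPTQ and GGUF use finer-grained (per-channel or per-group) scales to tighten the error bound.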
For LLMs, quantisation makes large models practical on consumer and single-GPU hardware: Llama 3 70B fits in roughly 40GB of VRAM when quantised to 4 bits. GPTQ, AWQ, and GGUF (the llama.cpp format) are popular quantisation formats for open-source LLMs; QLoRA combines 4-bit quantisation with LoRA fine-tuning, cutting the memory cost of training as well as inference.
The accuracy-efficiency trade-off depends on the method: well-calibrated INT8 is nearly lossless; aggressive 2-bit quantisation degrades quality significantly. Active research explores learned quantisation and mixed-precision schemes.
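The trade-off above can be made concrete by simulating quantisation at different bit widths and measuring the reconstruction error. This sketch (the helper `fake_quantize` is hypothetical, mimicking the quantise-then-dequantise pass used in QAT) shows the mean squared error growing as the bit budget shrinks:

```python
import numpy as np

def fake_quantize(w, bits):
    """Simulate symmetric quantisation at a given bit width, then dequantise."""
    levels = 2 ** (bits - 1) - 1          # 127 for INT8, 7 for INT4, 1 for 2-bit
    scale = np.max(np.abs(w)) / levels
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=10_000).astype(np.float32)

mse = {bits: float(np.mean((w - fake_quantize(w, bits)) ** 2))
       for bits in (8, 4, 2)}
```

On this synthetic Gaussian tensor, INT8 error is tiny, INT4 is noticeably larger, and 2-bit collapses every weight onto just three representable values, mirroring the quality cliff seen in practice.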
Examples
- GPTQ (4-bit quantisation)
- QLoRA (quantised fine-tuning)
- llama.cpp GGUF models
- NVIDIA INT8 TensorRT inference
Related Terms
Inference
Using a trained AI model to generate predictions or outputs on new input data.
GPU (Graphics Processing Unit)
Specialised processors with thousands of cores enabling the massive parallel computation required for AI training and inference.
Fine-tuning
Continuing training of a pre-trained model on domain-specific data to specialise it for a particular task.
Edge AI
Running AI inference on local devices (phones, IoT sensors, vehicles) rather than in the cloud.