Quantisation
Reducing the numerical precision of model weights to decrease memory footprint and accelerate inference with minimal accuracy loss.
Definition
Neural network weights are typically stored as 32-bit floating point (FP32) during training. Quantisation converts them to lower precision — INT8, INT4, or even binary — reducing memory usage by 4-8× and enabling faster integer arithmetic during inference. Post-training quantisation (PTQ) applies after training; quantisation-aware training (QAT) incorporates quantisation into the training loop for better accuracy.
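The core mapping from FP32 to INT8 can be sketched in a few lines. This is a minimal NumPy illustration of symmetric per-tensor quantisation (the function names `quantize_int8` and `dequantize` are illustrative, not from any particular library): each weight is divided by a scale derived from the tensor's maximum absolute value, rounded to an integer in [-127, 127], and later multiplied back to recover an approximation.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantisation: map FP32 weights into [-127, 127]."""
    scale = np.max(np.abs(w)) / 127.0       # one FP32 scale for the whole tensor
    q = np.round(w / scale).astype(np.int8)  # stored in 1 byte instead of 4
    return q, scale

def dequantize(q, scale):
    """Recover approximate FP32 weights from INT8 values and the scale."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

err = np.max(np.abs(w - w_hat))  # rounding error is bounded by scale / 2
```

Storing `q` takes a quarter of the memory of `w` (one byte per weight instead of four); real formats such as GPTQ and GGUF use finer-grained (per-channel or per-group) scales to tighten the error bound.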
For LLMs, quantisation makes large models practical on consumer and single-GPU hardware: Llama 3 70B fits in roughly 40GB of VRAM when quantised to 4 bits. GPTQ, AWQ, and GGUF (the llama.cpp format) are popular quantisation formats for open-source LLMs; QLoRA combines 4-bit quantisation with LoRA fine-tuning, cutting the memory cost of training as well as inference.
The accuracy-efficiency trade-off depends on the method: well-calibrated INT8 is nearly lossless; aggressive 2-bit quantisation degrades quality significantly. Active research explores learned quantisation and mixed-precision schemes.
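The trade-off above can be made concrete by simulating quantisation at different bit widths and measuring the reconstruction error. This sketch (the helper `fake_quantize` is hypothetical, mimicking the quantise-then-dequantise pass used in QAT) shows the mean squared error growing as the bit budget shrinks:

```python
import numpy as np

def fake_quantize(w, bits):
    """Simulate symmetric quantisation at a given bit width, then dequantise."""
    levels = 2 ** (bits - 1) - 1          # 127 for INT8, 7 for INT4, 1 for 2-bit
    scale = np.max(np.abs(w)) / levels
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=10_000).astype(np.float32)

mse = {bits: float(np.mean((w - fake_quantize(w, bits)) ** 2))
       for bits in (8, 4, 2)}
```

On this synthetic Gaussian tensor, INT8 error is tiny, INT4 is noticeably larger, and 2-bit collapses every weight onto just three representable values, mirroring the quality cliff seen in practice.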
Examples
- GPTQ (4-bit quantisation)
- QLoRA (quantised fine-tuning)
- llama.cpp GGUF models
- NVIDIA INT8 TensorRT inference
Related Terms
Inference
Using a trained AI model to generate predictions or outputs on new input data.
GPU (Graphics Processing Unit)
Specialised processors with thousands of cores enabling the massive parallel computation required for AI training and inference.
Fine-tuning
Continuing training of a pre-trained model on domain-specific data to specialise it for a particular task.
Edge AI
Running AI inference on local devices (phones, IoT sensors, vehicles) rather than in the cloud.