Techniques

Quantisation

Reducing the numerical precision of model weights to decrease memory footprint and accelerate inference with minimal accuracy loss.

Definition

Neural network weights are typically stored as 32-bit floating point (FP32) during training. Quantisation converts them to lower precision — INT8, INT4, or even binary — reducing memory usage by 4-8× and enabling faster integer arithmetic during inference. Post-training quantisation (PTQ) converts an already-trained model, usually calibrating scales on a small data sample; quantisation-aware training (QAT) simulates quantisation inside the training loop so the network learns to compensate, trading extra training cost for better accuracy.
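As a rough sketch of what PTQ does numerically, here is symmetric per-tensor INT8 quantisation in NumPy. The scale-and-clip scheme shown is one common choice, not the only one; the function names are illustrative, not from any particular library.

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor INT8: map [-max|w|, max|w|] onto [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate FP32 weights from the INT8 codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)  # stand-in for one weight tensor
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(q.nbytes / w.nbytes)            # 0.25 -- INT8 codes are 4x smaller than FP32
print(np.abs(w - w_hat).max() <= scale / 2 + 1e-6)  # True: error is at most half a step
```

Real PTQ pipelines refine this with per-channel scales and calibration data to pick clipping ranges, but the round-trip structure is the same.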

For LLMs, quantisation enables running large models on consumer hardware: LLaMA 3 70B runs in 4-bit quantisation on a single GPU with ~40GB VRAM. GPTQ, GGUF (llama.cpp), and AWQ are popular quantisation formats for open-source LLMs. QLoRA combines 4-bit quantisation with LoRA fine-tuning.
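The VRAM figure above follows from parameter count times bits per weight. A back-of-envelope calculation (ignoring activations, KV cache, and quantisation metadata such as scales, which is why ~40GB is quoted rather than 35GB):

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage only; excludes activations, KV cache, and scales."""
    return n_params * bits_per_weight / 8 / 1e9

print(weight_memory_gb(70e9, 16))  # 140.0 -- FP16 baseline for a 70B model
print(weight_memory_gb(70e9, 4))   # 35.0  -- 4-bit weights, before runtime overhead
```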

The accuracy-efficiency trade-off depends on the method: well-calibrated INT8 is nearly lossless; aggressive 2-bit quantisation degrades quality significantly. Active research explores learned quantisation and mixed-precision schemes.
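The trade-off can be illustrated by measuring round-trip error of the same symmetric scheme at different bit-widths. This is a toy measurement on random weights, not a model benchmark, but the trend it shows — error growing sharply as bits shrink — is the one described above.

```python
import numpy as np

def quant_error(w: np.ndarray, bits: int) -> float:
    """Mean absolute round-trip error of symmetric quantisation at a given bit-width."""
    levels = 2 ** (bits - 1) - 1        # 127 for INT8, 7 for 4-bit, 1 for 2-bit
    scale = np.abs(w).max() / levels
    q = np.clip(np.round(w / scale), -levels, levels)
    return float(np.abs(w - q * scale).mean())

rng = np.random.default_rng(0)
w = rng.normal(size=10_000).astype(np.float32)
for bits in (8, 4, 2):
    print(bits, quant_error(w, bits))   # error rises steeply from 8-bit to 2-bit
```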

Examples

  • GPTQ (4-bit quantisation)
  • QLoRA (quantised fine-tuning)
  • llama.cpp GGUF models
  • NVIDIA TensorRT INT8 inference