Infrastructure

Inference

Using a trained AI model to generate predictions or outputs on new input data.

Definition

Inference is the production-time use of a trained model — converting inputs to outputs. Unlike training (which updates model weights on a dataset), inference runs the model in forward-pass-only mode. For LLMs, inference generates text autoregressively, one token at a time, using sampling strategies such as temperature and top-p.
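The temperature and top-p sampling mentioned above can be sketched as a small decoding step. This is a minimal illustration, not any particular library's implementation; the function name and defaults are made up for the example.

```python
import math
import random

def sample_next_token(logits, temperature=0.8, top_p=0.9, rng=None):
    """Pick a next-token id from raw logits using temperature and top-p (nucleus) sampling."""
    rng = rng or random.Random(0)
    # Temperature scaling: lower values sharpen the distribution, higher values flatten it.
    scaled = [l / temperature for l in logits]
    # Softmax (subtract the max for numerical stability).
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-p: keep the smallest set of most-probable tokens whose cumulative mass >= top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in ranked:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Sample from the kept tokens, renormalised to their total mass.
    r = rng.random() * sum(probs[i] for i in kept)
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

At very low temperature the distribution collapses onto the highest-logit token, so sampling becomes effectively greedy; raising temperature or top-p widens the pool of candidate tokens.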

Inference optimisation is critical at scale: the key metrics are latency (time to first token), throughput (tokens per second), and cost (dollars per token). Techniques include quantisation (reducing weight precision from FP32 to INT8 or INT4), KV caching (reusing the attention keys and values already computed for previous tokens), speculative decoding, and model distillation.
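Of these techniques, quantisation is the simplest to sketch. The following is a toy illustration of symmetric per-tensor INT8 quantisation (one shared scale mapping floats to the range [-127, 127]); real systems use per-channel scales, calibration, and packed integer kernels, and the function names here are invented for the example.

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantisation: map floats to [-127, 127] with one scale."""
    # Scale chosen so the largest-magnitude weight maps to +/-127 (guard against all-zero tensors).
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the INT8 values."""
    return [v * scale for v in q]
```

The round trip introduces an error of at most half a scale step per weight, which is the memory-for-precision trade quantisation makes: a 4x smaller tensor (versus FP32) at the cost of bounded rounding noise.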

The inference market has distinct dynamics from training: production inference runs continuously (24/7 for live services), whereas training is an episodic, up-front cost. Specialised inference hardware (Groq LPU, Cerebras WSE) and serving software (NVIDIA TensorRT, vLLM, TGI, Triton Inference Server) focus on maximising inference throughput and minimising latency.

Examples

  • OpenAI API serving GPT-4
  • Groq LPU for fast inference
  • vLLM (open-source inference server)
  • llama.cpp (local inference)