Infrastructure

Inference

Using a trained AI model to generate predictions or outputs on new input data.

Definition

Inference is the production-time use of a trained model — converting inputs to outputs. Unlike training (which updates model weights on a dataset), inference runs the model in forward-pass-only mode. For LLMs, inference generates text autoregressively, one token at a time, using sampling strategies such as temperature and top-p.
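The temperature and top-p sampling mentioned above can be sketched as a small decoding step. This is a minimal illustration, not any particular library's implementation; the function name and defaults are made up for the example.

```python
import math
import random

def sample_next_token(logits, temperature=0.8, top_p=0.9, rng=None):
    """Pick a next-token id from raw logits using temperature and top-p (nucleus) sampling."""
    rng = rng or random.Random(0)
    # Temperature scaling: lower values sharpen the distribution, higher values flatten it.
    scaled = [l / temperature for l in logits]
    # Softmax (subtract the max for numerical stability).
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-p: keep the smallest set of most-probable tokens whose cumulative mass >= top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in ranked:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Sample from the kept tokens, renormalised to their total mass.
    r = rng.random() * sum(probs[i] for i in kept)
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

At very low temperature the distribution collapses onto the highest-logit token, so sampling becomes effectively greedy; raising temperature or top-p widens the pool of candidate tokens.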

Inference optimisation is critical at scale: the key metrics are latency (time to first token), throughput (tokens per second), and cost (dollars per token). Techniques include quantisation (reducing weight precision from FP32 to INT8 or INT4), KV caching (reusing the attention keys and values already computed for previous tokens), speculative decoding, and model distillation.
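Of these techniques, quantisation is the simplest to sketch. The following is a toy illustration of symmetric per-tensor INT8 quantisation (one shared scale mapping floats to the range [-127, 127]); real systems use per-channel scales, calibration, and packed integer kernels, and the function names here are invented for the example.

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantisation: map floats to [-127, 127] with one scale."""
    # Scale chosen so the largest-magnitude weight maps to +/-127 (guard against all-zero tensors).
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the INT8 values."""
    return [v * scale for v in q]
```

The round trip introduces an error of at most half a scale step per weight, which is the memory-for-precision trade quantisation makes: a 4x smaller tensor (versus FP32) at the cost of bounded rounding noise.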

The inference market has distinct dynamics from training: production inference runs continuously (24/7 for live services), whereas training is an episodic, up-front cost. Specialised inference hardware (Groq LPU, Cerebras WSE) and serving software (NVIDIA TensorRT, vLLM, TGI, Triton Inference Server) focus on maximising inference throughput and minimising latency.

Examples

  • OpenAI API serving GPT-4
  • Groq LPU for fast inference
  • vLLM (open-source inference server)
  • llama.cpp (local inference)