Gradient Descent
An iterative optimisation algorithm that updates model weights by stepping in the direction that most reduces loss.
Definition
Gradient descent is the optimisation backbone of nearly all neural network training. Starting from randomly initialised weights, it iteratively adjusts them by computing the gradient of the loss function (which points in the direction of steepest ascent) and stepping in the opposite direction.
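The update rule described above can be sketched in a few lines. This is a minimal illustration, not a training implementation: the loss L(w) = (w - 3)^2, the learning rate, and the step count are assumptions chosen so the gradient is easy to write by hand.

```python
# Minimal sketch of gradient descent on a 1-D quadratic loss
# L(w) = (w - 3)**2, whose gradient is dL/dw = 2 * (w - 3).
# lr and steps are illustrative values, not from this entry.

def gradient_descent(lr=0.1, steps=100):
    w = 0.0  # arbitrary starting weight
    for _ in range(steps):
        grad = 2 * (w - 3)  # gradient points uphill (steepest ascent)
        w -= lr * grad      # step in the opposite direction
    return w

w = gradient_descent()
print(w)  # approaches the minimiser w = 3
```

Each iteration applies the same rule used in deep learning, w ← w − η∇L(w), just with a gradient computed analytically rather than by backpropagation.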
Variants differ in how much data is used per update: batch GD uses the full dataset (accurate but slow), stochastic GD uses one sample (fast but noisy), and mini-batch SGD uses small random batches (the practical default). Adaptive optimisers like Adam maintain per-parameter running estimates of gradient moments, effectively adapting each parameter's step size for faster, more stable convergence.
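The mini-batch variant can be sketched on a least-squares linear regression problem. The synthetic data, batch size, learning rate, and epoch count below are all illustrative assumptions; the point is the structure: shuffle, slice a batch, compute the gradient on that batch only, update.

```python
import numpy as np

# Hedged sketch of mini-batch SGD on least-squares linear regression.
# true_w, lr, batch_size, and the epoch count are assumed values.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)
lr, batch_size = 0.1, 32
for epoch in range(200):
    idx = rng.permutation(len(X))          # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]  # indices of one mini-batch
        # gradient of the mean-squared error on this batch only
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)
        w -= lr * grad
```

Setting `batch_size = len(X)` recovers batch GD, and `batch_size = 1` recovers pure stochastic GD, which makes the trade-off in the paragraph above concrete.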
Key hyperparameters include the learning rate (step size), batch size, and momentum. The loss landscape of deep networks has many local minima and saddle points, but empirically, over-parameterised networks tend to find good solutions.
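Of these hyperparameters, momentum is the least self-explanatory, so here is a sketch of the classic momentum update on the same assumed 1-D quadratic loss L(w) = (w - 3)^2; the coefficient beta = 0.9 is a common default, not a value from this entry.

```python
# Illustrative SGD-with-momentum update on L(w) = (w - 3)**2.
# beta, lr, and steps are assumed values chosen for the sketch.

def momentum_descent(lr=0.1, beta=0.9, steps=200):
    w, v = 0.0, 0.0
    for _ in range(steps):
        grad = 2 * (w - 3)
        v = beta * v + grad  # velocity: an exponentially decayed sum of past gradients
        w -= lr * v          # step along the smoothed direction
    return w

w = momentum_descent()
```

The velocity term damps oscillation across steep directions and accelerates progress along consistent ones, which is why momentum is the standard companion to plain SGD.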
Examples
- SGD with momentum (training ResNets)
- Adam optimiser (training transformers)
- AdaGrad (recommended for sparse gradients)
Related Terms
Backpropagation
The algorithm for computing gradients in neural networks by propagating errors backwards through layers.
Neural Network
A computing system of interconnected nodes inspired by biological brains, trained to recognise patterns.
Deep Learning
A subset of machine learning using neural networks with many layers to learn complex hierarchical representations.