Reinforcement Learning
Reinforcement learning (RL) is a type of machine learning in which an agent learns to make decisions by interacting with an environment to maximise cumulative reward. Unlike in supervised learning, the agent receives no labelled examples: it learns from the consequences of its actions through trial and error.
How Reinforcement Learning Works
An RL agent observes the current state of the environment, takes an action, receives a reward signal (positive or negative), and transitions to a new state. Over many iterations, the agent learns a policy, a mapping from states to actions, that maximises expected cumulative future reward. Key algorithm families include value-based methods such as Q-learning, policy-gradient methods such as PPO and TRPO, and model-based methods. RLHF (Reinforcement Learning from Human Feedback) adapts RL to align LLMs with human preferences.
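The observe–act–reward loop above can be sketched with tabular Q-learning. This is a minimal illustration on a hypothetical five-state corridor (the environment, state count, and hyperparameters are all invented for the example, not taken from any library): the agent starts at state 0 and earns a reward of 1 for reaching state 4.

```python
import random

# Hypothetical 5-state corridor environment. Actions: 0 = left, 1 = right.
N_STATES, GOAL = 5, 4
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1  # learning rate, discount, exploration rate

# Q[state][action] estimates expected future reward. Optimistic initial
# values (1.0) encourage the agent to try both actions early on.
Q = [[1.0, 1.0] for _ in range(N_STATES)]

def step(state, action):
    """Environment dynamics: move left or right; reward 1 on reaching the goal."""
    nxt = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

random.seed(0)
for episode in range(300):
    state = 0
    for t in range(100):  # cap episode length
        # Epsilon-greedy: mostly exploit current estimates, sometimes explore.
        if random.random() < EPSILON:
            action = random.randrange(2)
        else:
            action = 0 if Q[state][0] >= Q[state][1] else 1
        nxt, reward, done = step(state, action)
        # Core Q-learning update: nudge Q toward reward + discounted best next value.
        target = reward + (0.0 if done else GAMMA * max(Q[nxt]))
        Q[state][action] += ALPHA * (target - Q[state][action])
        state = nxt
        if done:
            break

# The greedy policy extracted from Q maps each state to its best action.
policy = ["right" if Q[s][1] > Q[s][0] else "left" for s in range(GOAL)]
print(policy)
```

After training, the greedy policy heads right toward the goal from every state, which is exactly the "mapping from states to actions" a policy is: the agent never saw a labelled correct action, only rewards.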
Key Use Cases
- Game playing (AlphaGo, AlphaStar, OpenAI Five)
- Robot locomotion and manipulation
- Autonomous driving systems
- Supply chain optimisation
- LLM alignment via RLHF
- Personalised recommendations
- Algorithm discovery
Frequently Asked Questions
- What is reinforcement learning?
- Reinforcement learning is a machine learning paradigm where an AI agent learns by taking actions in an environment and receiving reward signals. It's how AlphaGo learned to play Go and how LLMs are aligned with human preferences via RLHF.
- What is RLHF?
- RLHF (Reinforcement Learning from Human Feedback) is a technique used to fine-tune LLMs by having human raters score model outputs, training a reward model on these ratings, and using RL to optimise the LLM's responses toward high-scoring outputs.
- Is reinforcement learning related to ChatGPT?
- Yes. ChatGPT and other aligned LLMs use RLHF as a key training step. Human trainers compare model outputs and provide preference labels, which are used to train a reward model that guides the LLM to produce more helpful and safe responses.
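The reward-model step of RLHF described above is commonly trained with a pairwise preference loss (the Bradley–Terry formulation): the model is pushed to score the human-preferred response higher than the rejected one. A sketch of that loss, with hypothetical scalar scores standing in for a real neural reward model:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(r_chosen, r_rejected):
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected).
    Small when the chosen response out-scores the rejected one."""
    return -math.log(sigmoid(r_chosen - r_rejected))

# When the reward model already ranks the pair correctly, the loss is low...
good = preference_loss(r_chosen=2.0, r_rejected=-1.0)
# ...and high when it ranks the pair the wrong way round.
bad = preference_loss(r_chosen=-1.0, r_rejected=2.0)
print(round(good, 3), round(bad, 3))
```

Minimising this loss over many human-labelled comparison pairs yields the reward model; an RL algorithm (commonly PPO) then fine-tunes the LLM to produce responses that score highly under it.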