Reinforcement Learning
Reinforcement learning (RL) is a type of machine learning in which an agent learns to make decisions by interacting with an environment to maximise cumulative reward. Unlike in supervised learning, the agent receives no labelled examples: it learns from the consequences of its actions through trial and error.
How Reinforcement Learning Works
An RL agent observes the current state of the environment, takes an action, receives a reward signal (positive or negative), and transitions to a new state. Over many iterations, the agent learns a policy, a mapping from states to actions, that maximises expected cumulative future reward. Key algorithm families include value-based methods such as Q-learning, policy-gradient methods such as PPO and TRPO, and model-based methods. RLHF (Reinforcement Learning from Human Feedback) adapts RL to align LLMs with human preferences.
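The observe–act–reward loop above can be sketched with tabular Q-learning. This is a minimal illustration on a hypothetical five-state corridor (the environment, state count, and hyperparameters are all invented for the example, not taken from any library): the agent starts at state 0 and earns a reward of 1 for reaching state 4.

```python
import random

# Hypothetical 5-state corridor environment. Actions: 0 = left, 1 = right.
N_STATES, GOAL = 5, 4
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1  # learning rate, discount, exploration rate

# Q[state][action] estimates expected future reward. Optimistic initial
# values (1.0) encourage the agent to try both actions early on.
Q = [[1.0, 1.0] for _ in range(N_STATES)]

def step(state, action):
    """Environment dynamics: move left or right; reward 1 on reaching the goal."""
    nxt = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

random.seed(0)
for episode in range(300):
    state = 0
    for t in range(100):  # cap episode length
        # Epsilon-greedy: mostly exploit current estimates, sometimes explore.
        if random.random() < EPSILON:
            action = random.randrange(2)
        else:
            action = 0 if Q[state][0] >= Q[state][1] else 1
        nxt, reward, done = step(state, action)
        # Core Q-learning update: nudge Q toward reward + discounted best next value.
        target = reward + (0.0 if done else GAMMA * max(Q[nxt]))
        Q[state][action] += ALPHA * (target - Q[state][action])
        state = nxt
        if done:
            break

# The greedy policy extracted from Q maps each state to its best action.
policy = ["right" if Q[s][1] > Q[s][0] else "left" for s in range(GOAL)]
print(policy)
```

After training, the greedy policy heads right toward the goal from every state, which is exactly the "mapping from states to actions" a policy is: the agent never saw a labelled correct action, only rewards.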
Key Use Cases
- Game playing (AlphaGo, AlphaStar, OpenAI Five)
- Robot locomotion and manipulation
- Autonomous driving systems
- Supply chain optimisation
- LLM alignment via RLHF
- Personalised recommendations
- Algorithm discovery
Frequently Asked Questions
- What is reinforcement learning?
- Reinforcement learning is a machine learning paradigm where an AI agent learns by taking actions in an environment and receiving reward signals. It's how AlphaGo learned to play Go and how LLMs are aligned with human preferences via RLHF.
- What is RLHF?
- RLHF (Reinforcement Learning from Human Feedback) is a technique used to fine-tune LLMs by having human raters score model outputs, training a reward model on these ratings, and using RL to optimise the LLM's responses toward high-scoring outputs.
- Is reinforcement learning related to ChatGPT?
- Yes. ChatGPT and other aligned LLMs use RLHF as a key training step. Human trainers compare model outputs and provide preference labels, which are used to train a reward model that guides the LLM to produce more helpful and safe responses.
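The reward-model step of RLHF described above is commonly trained with a pairwise preference loss (the Bradley–Terry formulation): the model is pushed to score the human-preferred response higher than the rejected one. A sketch of that loss, with hypothetical scalar scores standing in for a real neural reward model:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(r_chosen, r_rejected):
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected).
    Small when the chosen response out-scores the rejected one."""
    return -math.log(sigmoid(r_chosen - r_rejected))

# When the reward model already ranks the pair correctly, the loss is low...
good = preference_loss(r_chosen=2.0, r_rejected=-1.0)
# ...and high when it ranks the pair the wrong way round.
bad = preference_loss(r_chosen=-1.0, r_rejected=2.0)
print(round(good, 3), round(bad, 3))
```

Minimising this loss over many human-labelled comparison pairs yields the reward model; an RL algorithm (commonly PPO) then fine-tunes the LLM to produce responses that score highly under it.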