RLHF (Reinforcement Learning from Human Feedback)
A training technique in which human preference ratings guide the fine-tuning of a language model so it produces more helpful and harmless outputs.
Definition
RLHF is the alignment technique that transformed raw language models into helpful assistants like ChatGPT. The process has three stages: (1) supervised fine-tuning on curated demonstrations, (2) training a reward model on human preference comparisons between model outputs, (3) using PPO (Proximal Policy Optimisation) to fine-tune the language model to maximise the reward model's score.
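In stage (3), the optimisation target is usually not the raw reward-model score alone: InstructGPT-style pipelines subtract a KL penalty that keeps the policy close to the supervised fine-tuned reference model, preventing reward hacking. A minimal sketch of that shaped reward (the function name, `kl_coef` value, and per-sequence KL estimate are illustrative assumptions, not a specific system's API):

```python
def rlhf_reward(rm_score: float,
                logp_policy: float,
                logp_reference: float,
                kl_coef: float = 0.1) -> float:
    """Shaped reward commonly maximised by PPO in stage 3 of RLHF.

    rm_score:        reward model's score for the generated response
    logp_policy:     log-probability of the response under the current policy
    logp_reference:  log-probability under the frozen SFT reference model
    kl_coef:         strength of the KL penalty (illustrative value)
    """
    # Per-sequence estimate of KL(policy || reference); positive when the
    # policy assigns the response more probability than the reference does.
    kl_estimate = logp_policy - logp_reference
    return rm_score - kl_coef * kl_estimate
```

The penalty grows as the policy drifts from the reference, so PPO trades reward-model score against staying within the distribution the reward model was trained to judge.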
Human labellers compare pairs of model responses and indicate which is better along dimensions like helpfulness, safety, and honesty. The reward model learns these preferences and can then score arbitrary outputs. RL then optimises the LLM toward high-scoring outputs without ongoing human involvement.
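The reward model is typically trained on these pairwise comparisons with a Bradley-Terry-style loss: the model should assign the preferred response a higher scalar score, and the loss is the negative log-probability of the observed preference. A minimal sketch of that loss on a single comparison (pure Python for clarity; real implementations batch this over a learned scoring network):

```python
import math

def pairwise_preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry loss for one human comparison.

    r_chosen / r_rejected: reward model's scalar scores for the response
    the labeller preferred and the one they rejected.
    Loss = -log sigmoid(r_chosen - r_rejected), minimised when the
    chosen response scores well above the rejected one.
    """
    margin = r_chosen - r_rejected
    sigmoid = 1.0 / (1.0 + math.exp(-margin))
    return -math.log(sigmoid)
```

When the scores agree with the labeller (positive margin) the loss is small; when they disagree, the loss grows, pushing the reward model toward the human ranking.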
Variants include Direct Preference Optimisation (DPO), which trains directly on the preference pairs without a separate reward model or RL loop, and Constitutional AI (Anthropic), which uses AI-generated preference data based on a set of principles.
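DPO's core idea can be shown as a single loss term: it rewards the policy for increasing the log-probability of the chosen response (relative to a frozen reference model) more than that of the rejected one. A sketch of the per-pair DPO loss (argument names and the `beta` value are illustrative; real implementations sum token-level log-probabilities from the two models):

```python
import math

def dpo_loss(logp_chosen_policy: float, logp_rejected_policy: float,
             logp_chosen_ref: float, logp_rejected_ref: float,
             beta: float = 0.1) -> float:
    """Direct Preference Optimisation loss for one preference pair.

    Compares how much the policy has shifted probability toward the
    chosen response versus the rejected one, relative to the reference
    model; beta controls how strongly deviation is rewarded/penalised.
    """
    margin = beta * ((logp_chosen_policy - logp_chosen_ref)
                     - (logp_rejected_policy - logp_rejected_ref))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Because the preference signal is expressed directly in this supervised loss, DPO needs neither a reward model nor PPO rollouts, which is its main practical appeal.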
Examples
- ChatGPT training pipeline
- Claude (via Constitutional AI)
- InstructGPT
Related Terms
Reinforcement Learning (RL)
A learning paradigm where an agent learns by taking actions in an environment and receiving reward signals.
Large Language Model (LLM)
A transformer-based AI system trained on billions of tokens of text, capable of generating, reasoning about, and transforming language.
Fine-tuning
Continuing training of a pre-trained model on domain-specific data to specialise it for a particular task.
Constitutional AI (CAI)
Anthropic's training approach that uses a set of principles to guide AI self-critique and revision for safer outputs.