Techniques

RLHF (Reinforcement Learning from Human Feedback)

A training technique in which human preference ratings guide the fine-tuning of a language model toward more helpful and harmless outputs.

Definition

RLHF is the alignment technique that transformed raw language models into helpful assistants like ChatGPT. The process has three stages: (1) supervised fine-tuning on curated demonstrations, (2) training a reward model on human preference comparisons between model outputs, and (3) fine-tuning the language model with PPO (Proximal Policy Optimisation) to maximise the reward model's score, typically under a KL penalty that keeps it close to the supervised model.
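A minimal sketch of the stage-3 objective under the common formulation: the policy is rewarded by the reward model's score minus a KL penalty toward the reference (supervised) model. The function name, the per-sequence KL estimate, and the coefficient `beta` are illustrative assumptions, not a specific implementation.

```python
def rlhf_objective(reward_score, logprob_policy, logprob_ref, beta=0.1):
    """Per-sequence RLHF training signal (sketch).

    reward_score    -- scalar score from the reward model
    logprob_policy  -- log-probability of the response under the policy
    logprob_ref     -- log-probability under the frozen reference model
    beta            -- hypothetical KL-penalty coefficient

    The KL penalty discourages the policy from drifting far from the
    supervised model while it chases high reward-model scores.
    """
    kl_estimate = logprob_policy - logprob_ref  # single-sample KL estimate
    return reward_score - beta * kl_estimate
```

PPO then maximises this quantity (in expectation over sampled responses) rather than the raw reward alone, which helps prevent reward hacking and degenerate outputs.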

Human labellers compare pairs of model responses and indicate which is better along dimensions like helpfulness, safety, and honesty. The reward model learns these preferences and can then score arbitrary outputs. Reinforcement learning then optimises the language model toward high-scoring outputs without further human involvement.
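The reward model is typically trained with a Bradley-Terry-style loss on each comparison: the preferred response should receive a higher score. A self-contained sketch (function name and inputs are illustrative):

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Bradley-Terry loss for one human comparison (sketch).

    r_chosen   -- reward-model score for the response the labeller preferred
    r_rejected -- score for the other response

    Loss = -log sigmoid(r_chosen - r_rejected): it shrinks toward 0 as
    the chosen response is scored increasingly higher than the rejected one.
    """
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Averaged over many labelled pairs, minimising this loss teaches the reward model to rank outputs the way human raters do.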

Variants include Direct Preference Optimisation (DPO), which optimises the language model directly on preference pairs, removing the separate reward model and RL loop, and Constitutional AI (Anthropic), which replaces most human preference labels with AI-generated feedback guided by a written set of principles.
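The DPO loss can be sketched as a logistic loss on log-probability ratios against a frozen reference model; the preference signal is baked into the loss rather than into a learned reward model. Function name and `beta` are illustrative:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair (sketch).

    logp_*     -- log-probabilities under the policy being trained
    ref_logp_* -- log-probabilities under the frozen reference model
    beta       -- hypothetical temperature on the implicit reward

    Loss = -log sigmoid(beta * [(logp_chosen - ref_logp_chosen)
                                - (logp_rejected - ref_logp_rejected)]):
    it falls as the policy raises the chosen response's likelihood
    (relative to the reference) more than the rejected one's.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Because this is ordinary supervised gradient descent on labelled pairs, DPO avoids the sampling loop and instability of PPO while targeting a closely related objective.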

Examples

  • ChatGPT training pipeline
  • Claude (via Constitutional AI)
  • InstructGPT