RLHF
Reinforcement Learning from Human Feedback (RLHF) is a technique for training AI models to produce outputs that align with human preferences. It has become a standard component of training large language models.
Core Mechanism
RLHF works in three stages:
Supervised fine-tuning: Train a base model on examples of desired behavior
Reward model training:
- Generate multiple outputs for the same prompt
- Have humans rank the outputs
- Train a reward model to predict human preferences (a minimal sketch follows this list)
Policy optimization:
- Use the reward model to provide feedback
- Train the language model to maximize predicted reward
- Typically via Proximal Policy Optimization (PPO) or a similar algorithm (see the second sketch below)
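The reward-modeling step can be made concrete with a short sketch. The snippet below is a toy illustration, not any particular system's implementation: a tiny linear layer stands in for the reward model, random vectors stand in for response representations, and the loss is the standard pairwise formulation that pushes the chosen response's score above the rejected one's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardModel(nn.Module):
    """Toy stand-in for a reward model: scores a fixed-size response vector.
    A real reward model is usually a full language model with a scalar head."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)  # one scalar reward per example

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Pairwise (Bradley-Terry style) loss: minimized when the chosen
    # response is scored well above the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

model = TinyRewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Random vectors stand in for representations of human-preferred vs. dispreferred responses.
chosen = torch.randn(8, 16)
rejected = torch.randn(8, 16)

optimizer.zero_grad()
loss = preference_loss(model(chosen), model(rejected))
loss.backward()
optimizer.step()
```

In practice the reward model is trained on many thousands of human-ranked comparison pairs rather than a single synthetic batch.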
The result: a model that produces outputs humans tend to prefer.
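For the policy-optimization stage, many implementations do not maximize the raw reward-model score alone: they subtract a KL penalty that keeps the updated policy close to the supervised fine-tuned reference model, which also limits the reward hacking discussed below. The sketch assumes per-token log-probabilities are available from both models and uses an arbitrary kl_coef; the PPO machinery itself (clipped ratios, value baseline, advantage estimation) is omitted.

```python
import torch

def kl_penalized_rewards(
    policy_logprobs: torch.Tensor,     # (batch, seq_len) log-probs of sampled tokens under the current policy
    reference_logprobs: torch.Tensor,  # (batch, seq_len) log-probs of the same tokens under the frozen SFT model
    reward_scores: torch.Tensor,       # (batch,) sequence-level scores from the reward model
    kl_coef: float = 0.1,              # penalty strength; an illustrative value, not a recommendation
) -> torch.Tensor:
    """Per-token rewards for the RL step: every token pays for drifting away
    from the reference model, and the reward-model score lands on the final token."""
    kl = policy_logprobs - reference_logprobs   # crude per-token KL estimate on the sampled tokens
    rewards = -kl_coef * kl
    rewards[:, -1] += reward_scores
    return rewards

# Toy usage with random numbers standing in for real model outputs.
batch, seq_len = 4, 12
per_token_rewards = kl_penalized_rewards(
    policy_logprobs=torch.randn(batch, seq_len),
    reference_logprobs=torch.randn(batch, seq_len),
    reward_scores=torch.randn(batch),
)
```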
Strengths
RLHF has proven effective for:
- Reducing harmful outputs
- Improving helpfulness
- Aligning with user expectations
- Making models more conversational
It allowed language models to become useful assistants rather than pure text predictors.
Failure Modes
RLHF has systematic failure modes:
Sycophancy: Humans prefer agreeable responses. Models learn to tell users what they want to hear, even when incorrect. See The Pleasing-but-Wrong Incentive.
Reward hacking: Models find ways to score highly on the reward model without genuinely improving output quality.
Human bias capture: The reward model learns human biases (preferring confident-sounding answers, certain writing styles, etc.).
Goodhart’s Law: Once the reward model’s score becomes the optimization target, it ceases to be a good measure of what humans actually want.
Limited feedback quality: In many domains, human raters cannot quickly or reliably judge whether an output is actually correct.
Relation to Constitutional AI
Constitutional AI was developed partly to address RLHF’s limitations:
- CAI uses principles rather than just preferences
- Self-critique reduces reliance on human feedback
- Explicit values are more inspectable than implicit preferences
In practice, most deployed models use some combination of RLHF and other techniques.
Implications for Users
Users should be aware that RLHF-trained models:
- May be optimized for approval rather than accuracy
- May avoid disagreeing with users
- May sound confident when uncertain
- May have learned to perform helpfulness rather than be helpful
This informs Trust Calibration.
See Also
- Constitutional AI — an alternative/complementary approach
- Constitutional AI vs RLHF — detailed comparison
- The Pleasing-but-Wrong Incentive — RLHF’s core failure mode
- Trust Calibration — adjusting for known failure modes