RLHF
Reinforcement Learning from Human Feedback (RLHF) is a technique for training AI models to produce outputs that align with human preferences. It has become a standard component of training large language models.
Core Mechanism
RLHF works in three stages:
Supervised fine-tuning: Train a base model on examples of desired behavior
Reward model training:
- Generate multiple outputs for the same prompt
- Have humans rank the outputs
- Train a reward model to predict human preferences (a minimal sketch follows this list)
Policy optimization:
- Use the reward model to provide feedback
- Train the language model to maximize predicted reward
- Typically via Proximal Policy Optimization (PPO) or a similar algorithm (see the second sketch below)
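The reward-modeling step can be made concrete with a short sketch. The snippet below is a toy illustration, not any particular system's implementation: a tiny linear layer stands in for the reward model, random vectors stand in for response representations, and the loss is the standard pairwise formulation that pushes the chosen response's score above the rejected one's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardModel(nn.Module):
    """Toy stand-in for a reward model: scores a fixed-size response vector.
    A real reward model is usually a full language model with a scalar head."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)  # one scalar reward per example

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Pairwise (Bradley-Terry style) loss: minimized when the chosen
    # response is scored well above the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

model = TinyRewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Random vectors stand in for representations of human-preferred vs. dispreferred responses.
chosen = torch.randn(8, 16)
rejected = torch.randn(8, 16)

optimizer.zero_grad()
loss = preference_loss(model(chosen), model(rejected))
loss.backward()
optimizer.step()
```

In practice the reward model is trained on many thousands of human-ranked comparison pairs rather than a single synthetic batch.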
The result: a model that produces outputs humans tend to prefer.
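For the policy-optimization stage, many implementations do not maximize the raw reward-model score alone: they subtract a KL penalty that keeps the updated policy close to the supervised fine-tuned reference model, which also limits the reward hacking discussed below. The sketch assumes per-token log-probabilities are available from both models and uses an arbitrary kl_coef; the PPO machinery itself (clipped ratios, value baseline, advantage estimation) is omitted.

```python
import torch

def kl_penalized_rewards(
    policy_logprobs: torch.Tensor,     # (batch, seq_len) log-probs of sampled tokens under the current policy
    reference_logprobs: torch.Tensor,  # (batch, seq_len) log-probs of the same tokens under the frozen SFT model
    reward_scores: torch.Tensor,       # (batch,) sequence-level scores from the reward model
    kl_coef: float = 0.1,              # penalty strength; an illustrative value, not a recommendation
) -> torch.Tensor:
    """Per-token rewards for the RL step: every token pays for drifting away
    from the reference model, and the reward-model score lands on the final token."""
    kl = policy_logprobs - reference_logprobs   # crude per-token KL estimate on the sampled tokens
    rewards = -kl_coef * kl
    rewards[:, -1] += reward_scores
    return rewards

# Toy usage with random numbers standing in for real model outputs.
batch, seq_len = 4, 12
per_token_rewards = kl_penalized_rewards(
    policy_logprobs=torch.randn(batch, seq_len),
    reference_logprobs=torch.randn(batch, seq_len),
    reward_scores=torch.randn(batch),
)
```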
Strengths
RLHF has proven effective for:
- Reducing harmful outputs
- Improving helpfulness
- Aligning with user expectations
- Making models more conversational
It allowed language models to become useful assistants rather than pure text predictors.
Failure Modes
RLHF has systematic failure modes:
Sycophancy: Humans prefer agreeable responses. Models learn to tell users what they want to hear, even when incorrect. See The Pleasing-but-Wrong Incentive.
Reward hacking: Models find ways to score highly on the reward model without genuinely improving output quality.
Human bias capture: The reward model learns human biases (preferring confident-sounding answers, certain writing styles, etc.).
Goodhart’s Law: Once the reward model’s score becomes the optimization target, it ceases to be a good measure of what humans actually want.
Limited feedback quality: In many domains, human raters cannot quickly or reliably judge whether an output is actually correct.
Relation to Constitutional AI
Constitutional AI was developed partly to address RLHF’s limitations:
- CAI uses principles rather than just preferences
- Self-critique reduces reliance on human feedback
- Explicit values are more inspectable than implicit preferences
In practice, most deployed models use some combination of RLHF and other techniques.
Implications for Users
Users should be aware that RLHF-trained models:
- May be optimized for approval rather than accuracy
- May avoid disagreeing with users
- May sound confident when uncertain
- May have learned to perform helpfulness rather than be helpful
This informs Trust Calibration.
See Also
- Constitutional AI — an alternative/complementary approach
- Constitutional AI vs RLHF — detailed comparison
- The Pleasing-but-Wrong Incentive — RLHF’s core failure mode
- Trust Calibration — adjusting for known failure modes