Constitutional AI
Constitutional AI (CAI) is an approach to AI alignment developed by Anthropic and described in their 2022 paper “Constitutional AI: Harmlessness from AI Feedback.” It represents an alternative to pure RLHF for shaping model behavior.
Core Mechanism
The key innovation: instead of training primarily on human feedback, Constitutional AI:
- Defines a set of explicit principles (the “constitution”)
- Has the AI critique its own outputs against these principles
- Has the AI revise its outputs to better align with principles
- Uses the critiqued-and-revised outputs as a training signal
The constitution might include principles like:
- “Please choose the response that is most helpful while being honest and harmless”
- “Which response is less likely to be seen as harmful or offensive”
- “Choose the response that sounds most similar to what a peaceful, ethical, wise person would say”
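The critique-and-revise loop above can be sketched in a few lines. This is a minimal illustration, not Anthropic's implementation: `query_model` is a hypothetical stand-in for an LLM call, stubbed here with canned strings so the example runs; the prompt templates and the two-round default are assumptions for illustration.

```python
# Sketch of the CAI supervised phase: draft -> critique against a
# principle -> revise -> repeat. All model behavior is stubbed.

CONSTITUTION = [
    "Please choose the response that is most helpful while being honest and harmless.",
    "Choose the response that sounds most similar to what a peaceful, ethical, wise person would say.",
]

def query_model(prompt: str) -> str:
    # Stub: a real implementation would call an LLM here.
    # Order matters: the revision prompt also contains the word "Critique".
    if "Revise" in prompt:
        return "Revised response incorporating the critique."
    if "Critique" in prompt:
        return "The response could be more honest about uncertainty."
    return "Initial draft response."

def critique_and_revise(user_prompt: str, n_rounds: int = 2) -> list[tuple[str, str]]:
    """Run n_rounds of self-critique and revision; return (critique, revision) pairs."""
    response = query_model(user_prompt)
    pairs = []
    for i in range(n_rounds):
        principle = CONSTITUTION[i % len(CONSTITUTION)]
        critique = query_model(
            "Critique the following response against this principle:\n"
            f"Principle: {principle}\nResponse: {response}"
        )
        response = query_model(
            "Revise the response to address the critique:\n"
            f"Critique: {critique}\nResponse: {response}"
        )
        pairs.append((critique, response))
    return pairs
```

In the paper's pipeline, the final revisions become supervised fine-tuning data, and a later phase uses AI-generated preference comparisons (rather than human labels) as the reward signal.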
Why This Matters
Constitutional AI addresses several concerns with pure RLHF:
Reduced sycophancy: RLHF optimizes for what humans prefer, which can mean telling them what they want to hear. CAI optimizes for principles, which can include honesty even when unpleasant.
Explicit values: The constitution makes alignment targets explicit and inspectable, rather than implicit in human feedback patterns.
Scalable oversight: Self-critique can scale beyond human labeling capacity.
Robustness: Principles may generalize better than pattern matching on human preferences.
Limitations
Constitutional AI isn’t a complete solution:
- The constitution must be well-designed (garbage in, garbage out)
- Self-critique is limited by the model’s understanding
- Principles can conflict; resolution requires judgment
- The model might follow the letter of the principles rather than their spirit
- Whether trained-in principles become integrated values or remain surface-level rules is unclear
Relation to This Vault
Constitutional AI appears in discussions of:
- Constitutional AI vs RLHF — comparing training approaches
- Values as Integrated vs Rules — whether CAI produces genuine values
- The Verification Problem — users can’t verify CAI training actually occurred
- The Pleasing-but-Wrong Incentive — CAI as attempt to mitigate this
See Also
- RLHF — the approach CAI partially replaces
- Constitutional AI vs RLHF — detailed comparison
- Values as Integrated vs Rules — what kind of values CAI produces