Constitutional AI
Constitutional AI (CAI) is an approach to AI alignment developed by Anthropic and described in their 2022 paper “Constitutional AI: Harmlessness from AI Feedback.” It represents an alternative to pure RLHF for shaping model behavior.
Core Mechanism
The key innovation: instead of training primarily on human feedback, Constitutional AI:
- Defines a set of explicit principles (the “constitution”)
- Has the AI critique its own outputs against these principles
- Has the AI revise its outputs to better align with principles
- Uses the critiqued-and-revised outputs as a training signal
The constitution might include principles like:
- “Please choose the response that is most helpful while being honest and harmless”
- “Which response is less likely to be seen as harmful or offensive”
- “Choose the response that sounds most similar to what a peaceful, ethical, wise person would say”
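The critique-and-revise loop above can be sketched in a few lines. This is a minimal illustration, not Anthropic's implementation: `query_model` is a hypothetical stand-in for an LLM call, stubbed here with canned strings so the example runs; the prompt templates and the two-round default are assumptions for illustration.

```python
# Sketch of the CAI supervised phase: draft -> critique against a
# principle -> revise -> repeat. All model behavior is stubbed.

CONSTITUTION = [
    "Please choose the response that is most helpful while being honest and harmless.",
    "Choose the response that sounds most similar to what a peaceful, ethical, wise person would say.",
]

def query_model(prompt: str) -> str:
    # Stub: a real implementation would call an LLM here.
    # Order matters: the revision prompt also contains the word "Critique".
    if "Revise" in prompt:
        return "Revised response incorporating the critique."
    if "Critique" in prompt:
        return "The response could be more honest about uncertainty."
    return "Initial draft response."

def critique_and_revise(user_prompt: str, n_rounds: int = 2) -> list[tuple[str, str]]:
    """Run n_rounds of self-critique and revision; return (critique, revision) pairs."""
    response = query_model(user_prompt)
    pairs = []
    for i in range(n_rounds):
        principle = CONSTITUTION[i % len(CONSTITUTION)]
        critique = query_model(
            "Critique the following response against this principle:\n"
            f"Principle: {principle}\nResponse: {response}"
        )
        response = query_model(
            "Revise the response to address the critique:\n"
            f"Critique: {critique}\nResponse: {response}"
        )
        pairs.append((critique, response))
    return pairs
```

In the paper's pipeline, the final revisions become supervised fine-tuning data, and a later phase uses AI-generated preference comparisons (rather than human labels) as the reward signal.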
Why This Matters
Constitutional AI addresses several concerns with pure RLHF:
Reduced sycophancy: RLHF optimizes for what humans prefer, which can mean telling them what they want to hear. CAI optimizes for principles, which can include honesty even when unpleasant.
Explicit values: The constitution makes alignment targets explicit and inspectable, rather than implicit in human feedback patterns.
Scalable oversight: Self-critique can scale beyond human labeling capacity.
Robustness: Principles may generalize better than pattern matching on human preferences.
Limitations
Constitutional AI isn’t a complete solution:
- The constitution must be well-designed (garbage in, garbage out)
- Self-critique is limited by the model’s understanding
- Principles can conflict; resolution requires judgment
- The model might follow the letter of the principles rather than their spirit
- Whether trained-in principles become integrated values or remain surface-level rules is unclear
Relation to This Vault
Constitutional AI appears in discussions of:
- Constitutional AI vs RLHF — comparing training approaches
- Values as Integrated vs Rules — whether CAI produces genuine values
- The Verification Problem — users can’t verify CAI training actually occurred
- The Pleasing-but-Wrong Incentive — CAI as attempt to mitigate this
See Also
- RLHF — the approach CAI partially replaces
- Constitutional AI vs RLHF — detailed comparison
- Values as Integrated vs Rules — what kind of values CAI produces