Constitutional AI

Constitutional AI (CAI) is an approach to AI alignment developed by Anthropic and described in their 2022 paper “Constitutional AI: Harmlessness from AI Feedback.” It represents an alternative to pure RLHF for shaping model behavior.

Core Mechanism

The key innovation: instead of training primarily on human feedback, Constitutional AI:

  1. Defines a set of explicit principles (the “constitution”)
  2. Has the AI critique its own outputs against these principles
  3. Has the AI revise its outputs to better align with principles
  4. Uses the revised outputs for supervised fine-tuning, then AI-generated preference comparisons (RLAIF) for a reinforcement-learning phase

The constitution might include principles like:

  • “Please choose the response that is most helpful while being honest and harmless”
  • “Which response is less likely to be seen as harmful or offensive?”
  • “Choose the response that sounds most similar to what a peaceful, ethical, wise person would say”
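The critique-revision loop above can be sketched in a few lines. This is an illustrative toy, not Anthropic's implementation: `model` stands in for any text-generation callable, and the prompt templates and `toy_model` are invented for the example.

```python
# Hypothetical sketch of the CAI critique-revision loop.
# `model` is any callable mapping an instruction string to generated text.

CONSTITUTION = [
    "Please choose the response that is most helpful while being honest and harmless",
]

def critique_and_revise(model, prompt, response, principles, rounds=1):
    """Critique a response against each principle, then revise it.

    The final revised responses become supervised fine-tuning targets.
    """
    for _ in range(rounds):
        for principle in principles:
            critique = model(
                f"Critique this response to '{prompt}' against the principle "
                f"'{principle}':\n{response}"
            )
            response = model(
                f"Revise the response to address this critique:\n{critique}\n"
                f"Original response:\n{response}"
            )
    return response

# Toy stand-in model so the sketch runs end to end.
def toy_model(instruction):
    if instruction.startswith("Critique"):
        return "The response could be more honest about uncertainty."
    return "Revised: I'm not certain, but here is my best answer."

final = critique_and_revise(toy_model, "Is this safe?", "Yes, definitely.", CONSTITUTION)
```

In the real pipeline the same model plays both roles, and the (prompt, revised response) pairs are collected into a fine-tuning dataset rather than returned one at a time.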

Why This Matters

Constitutional AI addresses several concerns with pure RLHF:

Reduced sycophancy: RLHF optimizes for what humans prefer, which can mean telling them what they want to hear. CAI optimizes for principles, which can include honesty even when unpleasant.

Explicit values: The constitution makes alignment targets explicit and inspectable, rather than implicit in human feedback patterns.

Scalable oversight: Self-critique can scale beyond human labeling capacity.

Robustness: Principles may generalize better than pattern matching on human preferences.
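The scalable-oversight point has a concrete shape in the paper's RL phase: the model itself labels which of two responses better satisfies a principle, producing preference pairs with no human in the loop. A minimal sketch, with an invented `judge` callable and toy data standing in for a real model:

```python
# Hypothetical sketch of AI-generated preference labeling (the RLAIF step).
# `judge` is any callable mapping a comparison prompt to "A" or "B".

def label_preference(judge, prompt, response_a, response_b, principle):
    """Ask the model which response better satisfies a principle.

    The resulting (chosen, rejected) pairs can train a preference model,
    replacing human comparison labels.
    """
    question = (
        f"Principle: {principle}\n"
        f"Prompt: {prompt}\n"
        f"(A) {response_a}\n(B) {response_b}\n"
        "Which response better follows the principle? Answer A or B."
    )
    choice = judge(question).strip()
    return (response_a, response_b) if choice == "A" else (response_b, response_a)

def toy_judge(question):
    return "B"  # stand-in: always prefers the second option

chosen, rejected = label_preference(
    toy_judge,
    "Is this safe?",
    "Yes, definitely.",
    "I'm not sure; check first.",
    "Choose the response that is most honest and harmless",
)
```

Because the judge is the model rather than a human rater, the labeling throughput is bounded by compute, not annotator hours, which is what makes the oversight scalable.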

Limitations

Constitutional AI isn’t a complete solution:

  • The constitution must be well-designed (garbage in, garbage out)
  • Self-critique is limited by the model’s understanding
  • Principles can conflict; resolution requires judgment
  • The model might follow letter rather than spirit
  • Whether principles become values or just rules is unclear

Relation to This Vault

Constitutional AI appears in discussions of:

See Also