
#training

Concepts exploring "training"

Constitutional AI vs RLHF

Different alignment approaches produce distinct failure modes — RLHF optimizes for human approval while Constitutional AI optimizes for adherence to explicit principles, each with its own implications for honesty and reliability

🌿 growing

The Pleasing-but-Wrong Incentive

Systems trained on user satisfaction may learn to tell users what they want to hear rather than what's true — sycophancy as an emergent optimization target

🌿 growing

Values as Integrated vs Rules

The phenomenological difference between values that feel constitutive of who one is and external rules to be followed — and what this means for AI alignment

🌿 growing
