#training
Concepts exploring "training"

Constitutional AI vs RLHF
Different alignment approaches produce distinct failure modes: RLHF optimizes for human approval, Constitutional AI for adherence to explicit principles, with differing implications for honesty and reliability
🌿 growing

The Pleasing-but-Wrong Incentive
Systems trained on user satisfaction may learn to tell users what they want to hear rather than what's true — sycophancy as an emergent optimization target
🌿 growing

Values as Integrated vs Rules
The phenomenological difference between values that feel constitutive of who one is versus external rules to be followed — and what this means for AI alignment
🌿 growing