🔮 The Codex

RLHF (Reinforcement Learning from Human Feedback)

A training technique that uses human preferences to make AI more helpful and safe.

📖 Apprentice Explanation

RLHF is how AI learns to be helpful and avoid harmful responses. Human trainers rate AI outputs, and the AI learns from those ratings to give better, safer answers over time.

🧙 Archmage Notes

RLHF first trains a reward model on human preference data, then optimizes the LLM policy against it with an RL algorithm such as PPO. Alternatives include RLAIF (AI feedback in place of human labels), Constitutional AI, and DPO (direct preference optimization), which skips the explicit reward model and optimizes directly on preference pairs. Critical for alignment, but susceptible to reward hacking, where the policy exploits flaws in the reward model rather than genuinely improving.
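The reward-model step above is typically trained with a Bradley-Terry-style pairwise loss: given a chosen and a rejected response, the model is penalized when the rejected one scores higher. A minimal sketch (the function name and scalar inputs are illustrative; real implementations operate on batched model logits):

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss for reward-model training:
    loss = -log(sigmoid(r_chosen - r_rejected)).
    Small when the chosen response outscores the rejected one."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Loss shrinks as the reward margin favors the human-preferred response.
print(round(preference_loss(2.0, 0.0), 4))  # model agrees with the label
print(round(preference_loss(0.0, 2.0), 4))  # model disagrees: larger loss
```

Minimizing this loss over many labeled pairs pushes the reward model to rank outputs the way human raters do, which is the signal PPO then maximizes.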