What is DPO and how does it differ from RLHF?

DPO (Direct Preference Optimisation) pursues the same alignment goals as RLHF without the complexity of training a separate reward model and running RL. Instead, it fine-tunes the LLM directly on preference pairs, using a simpler objective derived from the RLHF framework. DPO is generally more stable and simpler to implement, and it often produces results comparable to PPO-based RLHF, which makes it the preferred approach for many alignment fine-tuning applications.
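To make the "simplified objective" concrete, here is a minimal sketch of the DPO loss for a single preference pair, following the formulation in Rafailov et al. (2023). It assumes you already have the summed log-probability of each response under the model being trained and under a frozen reference copy. The function names, argument names, and the default `beta` are illustrative choices, not values from this article.

```python
import math


def neg_log_sigmoid(x: float) -> float:
    """Numerically stable -log(sigmoid(x)), i.e. softplus(-x)."""
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair of responses.

    Each argument is the summed token log-probability of a full response.
    beta (assumed default) controls how far the policy may move from the
    reference model.
    """
    # Implicit reward of each response: how much more likely the policy
    # makes it compared with the frozen reference model.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # The loss is small when the chosen response has the larger implicit reward.
    return neg_log_sigmoid(margin)


# Policy identical to the reference: the loss equals log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy has shifted towards the chosen response: the loss drops.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

There is no reward model and no sampling loop here: the preference pair itself supplies the training signal, which is where most of DPO's simplicity comes from.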

RLHF (Reinforcement Learning from Human Feedback)

Q: What is RLHF in simple terms?

RLHF is how AI learns to be helpful. A raw pretrained model is capable but untamed — it might write excellent essays or toxic rants depending on the prompt. RLHF trains it to prefer helpful, harmless responses. Human raters compare pairs of outputs and indicate which is better. These preferences train a reward model. The LLM is then reinforced to produce outputs the reward model likes — gradually becoming reliably helpful.

Q: What are the three steps of RLHF?

Step 1 — Supervised fine-tuning: fine-tune the pretrained LLM on high-quality human-written demonstrations of helpful responses. Step 2 — Reward model training: collect human preference data (which of these two responses is better?) and train a reward model to predict human preferences. Step 3 — RL fine-tuning: use PPO (Proximal Policy Optimisation) to fine-tune the LLM to generate outputs that score highly according to the reward model.

Q: What is Constitutional AI?

Constitutional AI (CAI) is Anthropic's approach to alignment that reduces dependence on human raters. A set of principles (the 'constitution') guides the model to evaluate and revise its own outputs. The model critiques its responses against the constitution, revises them, and then AI feedback (rather than only human feedback) is used to train the reward model. This scales alignment training with less human labelling effort.
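The critique-and-revise loop described above can be sketched as follows. This is an illustration under stated assumptions, not Anthropic's implementation: `generate` is a hypothetical callable standing in for any text-generation model, and the prompt wording is invented for the example.

```python
from typing import Callable, List


def critique_and_revise(prompt: str, principles: List[str],
                        generate: Callable[[str], str]) -> str:
    """Draft a response, then critique and revise it once per principle.

    `generate` is a hypothetical stand-in for a language model call:
    it takes a text prompt and returns generated text.
    """
    response = generate(prompt)
    for principle in principles:
        # Ask the model to judge its own response against one principle.
        critique = generate(
            f"Critique the response against this principle: {principle}\n"
            f"Prompt: {prompt}\nResponse: {response}"
        )
        # Ask the model to rewrite the response in light of the critique.
        response = generate(
            "Rewrite the response to address the critique.\n"
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}"
        )
    return response
```

In this sketch each principle costs two extra model calls (one critique, one revision). The revised responses can then be used as training data in place of some human-written demonstrations.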

⚡ RLHF (Reinforcement Learning from Human Feedback) is the training technique that transforms a raw pretrained LLM into a helpful, harmless assistant. Human raters rank model outputs by quality, a reward model learns those preferences, and the LLM is fine-tuned via reinforcement learning to produce outputs the reward model scores highly. It is how ChatGPT, Claude, and Gemini learned to follow instructions helpfully rather than just complete text.

Category: Machine Learning · Difficulty: Intermediate · Last updated: 15 May 2026 · 5 min read

RLHF — What It Is and How Human Feedback Turned Raw Language Models Into Helpful Assistants

What is RLHF?

Pretraining produces a model that completes text — not one that follows instructions helpfully. A pretrained GPT-3 would complete “Write me a poem about love” with whatever continuation it statistically expects — which might be another prompt, a list of poem titles, or an actual poem. It has no understanding of what the user wants or what counts as a good response.

RLHF teaches the model what “good” means. Not through rules — but through human preferences. Show the model two responses. Have a human rate which is better. Repeat thousands of times. Train a reward model that learns to predict human preferences. Then use reinforcement learning to nudge the LLM toward producing outputs the reward model considers better. After this process, the model has a reliable sense of what humans find helpful, harmless, and honest.

This is why ChatGPT felt so different from earlier language models. The underlying GPT architecture was not new. The alignment through RLHF was.

How does RLHF work?

Step 1 — Supervised fine-tuning (SFT):
Collect demonstrations — human labellers write examples of ideal responses to diverse prompts. Fine-tune the pretrained LLM on these demonstrations. This produces a model that follows instructions and produces generally helpful outputs.

Step 2 — Reward model training:
Generate multiple responses to the same prompt using the SFT model. Have human labellers rank these responses from best to worst. Convert each ranking into pairwise comparisons and train a separate reward model (a neural network) on those pairs to predict which response a human would prefer. The reward model becomes a proxy for human judgment.
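As a rough illustration of Step 2, the sketch below turns one labeller ranking into preference pairs and computes the pairwise loss commonly used for reward models (a Bradley-Terry style objective, as in the InstructGPT paper). The function names are hypothetical, and the scalar scores stand in for the outputs of a real reward network.

```python
import math
from itertools import combinations
from typing import List, Tuple


def ranking_to_pairs(ranked: List[str]) -> List[Tuple[str, str]]:
    """Turn responses ordered best-to-worst into (preferred, other) pairs."""
    # combinations() keeps input order, so the first item of each pair
    # is always the higher-ranked response.
    return list(combinations(ranked, 2))


def pairwise_loss(score_preferred: float, score_other: float) -> float:
    """-log(sigmoid(score_preferred - score_other)), computed stably."""
    diff = score_preferred - score_other
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


pairs = ranking_to_pairs(["best", "middle", "worst"])
print(pairs)  # three pairs from one ranking of three responses
print(pairwise_loss(2.0, 0.0))  # low loss: reward model agrees with the labeller
print(pairwise_loss(0.0, 2.0))  # high loss: reward model disagrees
```

Training minimises this loss over all pairs, pushing the reward model to give the preferred response the higher score.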

Step 3 — Reinforcement learning fine-tuning:
Use PPO (Proximal Policy Optimisation) to fine-tune the SFT model. For each prompt, the model generates a response. The reward model scores it. PPO adjusts the model’s weights to increase the probability of high-scoring responses and decrease the probability of low-scoring ones. A KL divergence penalty prevents the model from drifting too far from the SFT starting point.
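One common way to apply the KL penalty is to subtract it from the reward model score before the PPO update. The sketch below shows only that reward shaping, not PPO itself. The function name, the per-token log-probability inputs, and the `beta` value are assumptions made for illustration.

```python
from typing import Sequence


def kl_penalised_reward(reward_score: float,
                        policy_logprobs: Sequence[float],
                        ref_logprobs: Sequence[float],
                        beta: float = 0.02) -> float:
    """Reward model score minus a penalty for drifting from the SFT model.

    policy_logprobs / ref_logprobs: log-probability of each generated token
    under the model being trained and under the frozen SFT model.
    beta is an assumed penalty coefficient.
    """
    # Sample-based estimate of the KL term for this one response.
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward_score - beta * kl_estimate


# No drift from the SFT model: the reward passes through unchanged.
print(kl_penalised_reward(1.0, [-1.0, -2.0], [-1.0, -2.0]))
# The policy is far more confident than the SFT model: the reward is reduced.
print(kl_penalised_reward(1.0, [-0.5, -1.0], [-1.0, -2.0], beta=0.1))
```

A larger `beta` keeps the fine-tuned model closer to the SFT model, at the cost of a smaller gain in reward model score.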

Real-world examples

Not theory — what real teams actually shipped using this technique.

InstructGPT (OpenAI, 2022): the paper that popularised RLHF for LLMs. Human evaluators preferred outputs from a 1.3B parameter model fine-tuned with RLHF over those of the 175B parameter GPT-3, suggesting that alignment can matter more than raw scale for how helpful a model is judged to be.
ChatGPT: built on GPT-3.5 fine-tuned with RLHF. The shift from completion-based to instruction-following behaviour, with safety boundaries, is largely attributed to this fine-tuning rather than to architectural changes.
Claude (Anthropic): trained using both RLHF and Constitutional AI, incorporating a set of principles the model learns to follow. Anthropic’s research into scalable oversight and AI safety shapes how RLHF is applied.

Common pitfalls

Reward model gaming — the LLM learns to maximise the reward model score, not the underlying human preference the reward model approximates. Over-optimisation can produce responses that score highly on the reward model but are stylistically strange or subtly manipulative.
Labeller disagreement — human raters disagree on what constitutes a better response, especially for complex or sensitive topics. Reward models trained on noisy labels inherit this disagreement.
Scalability — collecting high-quality human preference data at scale is expensive and slow. Constitutional AI and DPO reduce but do not eliminate this bottleneck.
Alignment tax — RLHF fine-tuning sometimes reduces raw capability (lower scores on knowledge benchmarks) even while improving alignment. The tradeoff between capability and safety is an active area of research.
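Labeller disagreement can be measured before any reward model is trained. The sketch below computes the raw agreement rate between two raters who judged the same comparison pairs. The "A"/"B" label format and the function name are assumptions for the example. Raw agreement does not correct for agreement by chance.

```python
from typing import Sequence


def agreement_rate(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Fraction of comparisons on which two raters picked the same response."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("Both raters must label the same non-empty set of pairs")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Each entry records which response ("A" or "B") a rater preferred for one pair.
rater_1 = ["A", "B", "A", "A"]
rater_2 = ["A", "B", "B", "A"]
print(agreement_rate(rater_1, rater_2))  # → 0.75
```

A low rate on a given topic suggests that the preference labels for it are noisy and that a reward model trained on them will inherit that noise.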

Frequently asked questions

QUESTION 1 What is RLHF in simple terms?

ANSWER 1 Teaching AI to be helpful by having humans rank outputs, training a reward model on those rankings, and reinforcing the LLM to produce what the reward model considers best.

QUESTION 2 What are the three steps of RLHF?

ANSWER 2 Supervised fine-tuning on human demonstrations → reward model trained on human preference pairs → RL fine-tuning (PPO) to maximise reward model score.

QUESTION 3 What is DPO?

ANSWER 3 Direct Preference Optimisation — achieves similar alignment to RLHF without a separate reward model. Simpler, more stable, increasingly preferred.

QUESTION 4 What is Constitutional AI?

ANSWER 4 Anthropic’s approach using a set of principles the model evaluates its own outputs against — reducing dependence on human raters for alignment training.

Sources & further reading

Ouyang et al. (2022). Training language models to follow instructions with human feedback. arXiv:2203.02155 — the InstructGPT paper that popularised RLHF for LLMs.
Christiano et al. (2017). Deep Reinforcement Learning from Human Preferences. NeurIPS — foundational RLHF paper.
Bai et al. (2022). Constitutional AI: Harmlessness from AI Feedback. Anthropic. arXiv:2212.08073
Rafailov et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv:2305.18290
OpenAI: Aligning language models to follow instructions — openai.com/research/instruction-following