RLHF (Reinforcement Learning from Human Feedback) is the training technique that transforms a raw pretrained LLM into a helpful, harmless assistant. Human raters rank model outputs by quality, a reward model learns those preferences, and the LLM is fine-tuned via reinforcement learning to produce outputs the reward model scores highly. It is how ChatGPT, Claude, and Gemini learned to follow instructions helpfully rather than just complete text.

Category: Machine Learning · Difficulty: Intermediate · Last updated: 15 May 2026 · 5 min read


RLHF — What It Is and How Human Feedback Turned Raw Language Models Into Helpful Assistants

What is RLHF?

Pretraining produces a model that completes text — not one that follows instructions helpfully. A pretrained GPT-3 would complete “Write me a poem about love” with whatever continuation it statistically expects — which might be another prompt, a list of poem titles, or an actual poem. It has no understanding of what the user wants or what counts as a good response.

RLHF teaches the model what “good” means. Not through hand-written rules, but through human preferences. Show human raters two responses to the same prompt. Have them pick the better one. Repeat thousands of times. Train a reward model that learns to predict human preferences. Then use reinforcement learning to nudge the LLM toward producing outputs the reward model considers better. After this process, the model is far more likely to produce responses humans find helpful, harmless, and honest, although the result is only as good as the reward model behind it.

This is why ChatGPT felt so different from earlier language models. The underlying GPT architecture was not new. The alignment through RLHF was.

How does RLHF work?

Step 1 — Supervised fine-tuning (SFT):
Collect demonstrations — human labellers write examples of ideal responses to diverse prompts. Fine-tune the pretrained LLM on these demonstrations. This produces a model that follows instructions and produces generally helpful outputs.
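As a rough sketch, the SFT objective in Step 1 can be written as the average negative log-likelihood of the labeller-written response tokens. The function name, the toy probabilities, and the choice to leave prompt tokens out of the loss are illustrative assumptions, not details taken from any specific training setup.

```python
import math

def sft_loss(token_probs, is_response):
    """Mean negative log-likelihood over response tokens only.

    token_probs: probability the model assigned to each correct next token.
    is_response: True where the token belongs to the demonstration response
                 (assumption: prompt tokens are excluded from the loss).
    """
    losses = [-math.log(p) for p, keep in zip(token_probs, is_response) if keep]
    if not losses:
        raise ValueError("no response tokens to train on")
    return sum(losses) / len(losses)

# Toy example: two prompt tokens followed by three response tokens.
probs = [0.9, 0.8, 0.5, 0.25, 0.5]
mask = [False, False, True, True, True]
loss = sft_loss(probs, mask)
print(f"SFT loss: {loss:.4f}")
```

Fine-tuning lowers this loss, which means the model assigns higher probability to the responses the labellers wrote.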

Step 2 — Reward model training:
Generate multiple responses to the same prompt using the SFT model. Have human labellers rank these responses from best to worst. Train a separate reward model (a neural network) on these preference pairs to predict which response a human would prefer. The reward model becomes a proxy for human judgment.
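One common way to train on rankings, sketched below, is to expand each best-to-worst ranking into pairs and apply a pairwise (Bradley-Terry style) loss, which is small when the preferred response gets the higher score. The helper names and the toy scores are hypothetical. A real reward model would be a neural network producing these scores.

```python
import math
from itertools import combinations

def ranking_to_pairs(ranked_responses):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs."""
    return list(combinations(ranked_responses, 2))

def pairwise_loss(score_preferred, score_rejected):
    """-log(sigmoid(score_preferred - score_rejected))."""
    gap = score_preferred - score_rejected
    return math.log1p(math.exp(-gap))

# A labeller ranked three responses: A is best, C is worst.
pairs = ranking_to_pairs(["A", "B", "C"])

# Hypothetical scores from a reward model under training.
scores = {"A": 2.0, "B": 0.5, "C": -1.0}
mean_loss = sum(pairwise_loss(scores[w], scores[l]) for w, l in pairs) / len(pairs)
print(pairs, round(mean_loss, 4))
```

Minimising this loss pushes the score of each preferred response above the score of the response it beat.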

Step 3 — Reinforcement learning fine-tuning:
Use PPO (Proximal Policy Optimisation) to fine-tune the SFT model. For each prompt, the model generates a response. The reward model scores it. PPO adjusts the model’s weights to increase the probability of high-scoring responses and decrease low-scoring ones. A KL divergence penalty prevents the model from drifting too far from the SFT starting point.
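The KL penalty in Step 3 can be illustrated with a small sketch of the reward signal that the RL step optimises: the reward model score minus a penalty for moving away from the SFT model. This shows only the reward shaping, not PPO itself. The function name, the default beta value, and the toy numbers are assumptions made for illustration.

```python
def kl_shaped_reward(rm_score, logprobs_policy, logprobs_sft, beta=0.1):
    """Reward model score minus beta times a sampled KL estimate.

    logprobs_policy / logprobs_sft: per-token log-probabilities of the
    generated response under the current policy and the frozen SFT model.
    beta: penalty strength (hypothetical default).
    """
    if len(logprobs_policy) != len(logprobs_sft):
        raise ValueError("log-probability lists must have the same length")
    # Sum of per-token log-ratios: a single-sample estimate of the KL term.
    kl_estimate = sum(lp - ls for lp, ls in zip(logprobs_policy, logprobs_sft))
    return rm_score - beta * kl_estimate

# The policy now rates its own response as more likely than the SFT model did,
# so part of the reward model score is taken back as a penalty.
shaped = kl_shaped_reward(1.5, [-0.5, -1.0], [-1.0, -1.5])
print(round(shaped, 4))
```

A larger beta keeps the model closer to its SFT behaviour, while a smaller beta lets it chase the reward model score more aggressively.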

Real-world examples

Not theory — what real teams actually shipped using this technique.

  • InstructGPT (OpenAI, 2022) — the paper that popularised RLHF for LLMs. Human evaluators preferred outputs from a 1.3B parameter model fine-tuned with RLHF over outputs from the 175B parameter GPT-3, which suggests that alignment can matter more than raw scale for perceived helpfulness.
  • ChatGPT — built on GPT-3.5 fine-tuned with RLHF. The shift from completion-based to instruction-following behaviour, with safety boundaries, came mainly from this fine-tuning rather than from architectural changes.
  • Claude (Anthropic) — trained using both RLHF and Constitutional AI, incorporating a set of principles the model learns to follow. Anthropic’s research into scalable oversight and AI safety shapes how RLHF is applied.

Common pitfalls

  • Reward model gaming — the LLM learns to maximise the reward model score, not the underlying human preference the reward model approximates. Over-optimisation can produce responses that score highly on the reward model but are stylistically strange or subtly manipulative.
  • Labeller disagreement — human raters disagree on what constitutes a better response, especially for complex or sensitive topics. Reward models trained on noisy labels inherit this disagreement.
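  • Scalability — collecting high-quality human preference data at scale is expensive and slow. Constitutional AI eases this by replacing some human labels with AI feedback, while DPO simplifies training but still needs preference data. Neither eliminates the bottleneck.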
  • Alignment tax — RLHF fine-tuning sometimes reduces raw capability (lower scores on knowledge benchmarks) even while improving alignment. The tradeoff between capability and safety is an active area of research.
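Labeller disagreement can be measured before any reward model is trained. The sketch below computes a simple agreement rate between two labellers who judged the same comparisons. The function name and the toy labels are hypothetical, and real projects often use more robust statistics.

```python
def agreement_rate(labels_a, labels_b):
    """Fraction of comparisons where two labellers picked the same response."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Each entry records which response ("A" or "B") a labeller preferred.
labeller_1 = ["A", "B", "A", "A"]
labeller_2 = ["A", "A", "A", "B"]
rate = agreement_rate(labeller_1, labeller_2)
print(rate)
```

A low rate on a class of prompts is a signal that the reward model will see noisy labels there.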

Frequently asked questions

QUESTION 1 What is RLHF in simple terms?

ANSWER 1 Teaching AI to be helpful by having humans rank outputs, training a reward model on those rankings, and reinforcing the LLM to produce what the reward model considers best.

QUESTION 2 What are the three steps of RLHF?

ANSWER 2 Supervised fine-tuning on human demonstrations → reward model trained on human preference pairs → RL fine-tuning (PPO) to maximise reward model score.

QUESTION 3 What is DPO?

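ANSWER 3 Direct Preference Optimisation, a method that trains the LLM directly on preference pairs and achieves similar alignment to RLHF without a separate reward model or RL loop. It is simpler to implement, often more stable to train, and increasingly preferred.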
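For a single preference pair, the DPO loss from Rafailov et al. (2023) can be sketched as follows. The inputs are summed log-probabilities of the chosen and rejected responses under the model being trained and under a frozen reference model. The argument names, the beta default, and the toy numbers are illustrative assumptions.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_logratio = policy_chosen - ref_chosen
    rejected_logratio = policy_rejected - ref_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    return math.log1p(math.exp(-margin))

# Before training, the policy equals the reference, so the margin is zero.
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# After some training, the chosen response is more likely and the rejected one less.
later = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(round(start, 4), round(later, 4))
```

The loss falls as the model raises the probability of the chosen response relative to the rejected one, measured against the reference model.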

QUESTION 4 What is Constitutional AI?

ANSWER 4 Anthropic’s approach using a set of principles the model evaluates its own outputs against — reducing dependence on human raters for alignment training.


Sources & further reading

  • Ouyang et al. (2022). Training language models to follow instructions with human feedback. arXiv:2203.02155 — the InstructGPT paper that popularised RLHF for LLMs.
  • Christiano et al. (2017). Deep Reinforcement Learning from Human Preferences. NeurIPS — foundational RLHF paper.
  • Bai et al. (2022). Constitutional AI: Harmlessness from AI Feedback. Anthropic. arXiv:2212.08073
  • Rafailov et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv:2305.18290
  • OpenAI: Aligning language models to follow instructions — openai.com/research/instruction-following
