What is the difference between reinforcement learning and supervised learning?

Supervised learning trains on labelled examples — input A maps to output B, because a human said so. Reinforcement learning has no labels — the agent must discover good actions by trying them and observing rewards. Supervised learning is 'here is the correct answer.' Reinforcement learning is 'you will figure out what the correct answer is by trying things and seeing what happens.' RL is harder but applicable when labelled data does not exist.

What is a policy in reinforcement learning?

A policy is the agent's decision-making function — a mapping from states to actions. It determines what the agent does in any situation. A deterministic policy: in state S, always take action A. A stochastic policy: in state S, take action A with probability p. Training in RL is the process of finding the optimal policy — the one that maximises expected cumulative reward.
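
As a rough illustration of the two kinds of policy, the sketch below encodes each as a lookup table. The states, actions, and probabilities are hypothetical and chosen only for the example; real policies are usually parameterised functions such as neural networks.

```python
import random

# Deterministic policy: each state maps to exactly one action.
# (State and action names are hypothetical, for illustration only.)
deterministic_policy = {"low_battery": "recharge", "high_battery": "explore"}

# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_policy = {
    "low_battery": {"recharge": 0.9, "explore": 0.1},
    "high_battery": {"recharge": 0.2, "explore": 0.8},
}

def act_deterministic(state):
    """Return the single action the policy prescribes for this state."""
    return deterministic_policy[state]

def act_stochastic(state, rng=random):
    """Sample an action according to the policy's probabilities for this state."""
    distribution = stochastic_policy[state]
    actions = list(distribution.keys())
    weights = list(distribution.values())
    return rng.choices(actions, weights=weights, k=1)[0]

print(act_deterministic("low_battery"))
print(act_stochastic("low_battery"))
```

Calling act_stochastic repeatedly in the same state can return different actions, which is what lets a stochastic policy keep exploring.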

What are the biggest challenges in reinforcement learning?

Sample inefficiency — RL requires millions of environment interactions to learn good policies. Physical robots cannot fail millions of times safely. Reward design — specifying a reward function that captures the true objective without encouraging unintended shortcuts (reward hacking). Exploration vs exploitation — balancing trying new actions (explore) vs using known-good actions (exploit). And distributional shift — policies that work in simulation often fail in the real world (sim-to-real gap).
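
The exploration versus exploitation trade-off can be sketched with epsilon-greedy action selection on a multi-armed bandit. This is one common strategy, not the only one; the arm means, the value of epsilon, and the function name are assumptions made for the example.

```python
import random

def run_epsilon_greedy(true_means, epsilon=0.1, steps=5000, seed=0):
    """Play a multi-armed bandit with epsilon-greedy action selection.

    true_means: hypothetical mean reward of each arm (hidden from the agent).
    Returns the agent's value estimates and how often each arm was pulled.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    estimates = [0.0] * n_arms
    counts = [0] * n_arms
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: try a random arm
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        reward = rng.gauss(true_means[arm], 1.0)  # noisy reward from the environment
        counts[arm] += 1
        # Incremental running mean of the rewards observed for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts

estimates, counts = run_epsilon_greedy([0.2, 0.5, 1.0])
print(counts)
```

With epsilon set to 0 the agent can lock onto whichever arm looked good first; with epsilon set to 1 it never uses what it has learned.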

Reinforcement Learning (RL)

Q: What is reinforcement learning in simple terms?

Reinforcement learning is learning by doing — the way a child learns to walk. No labelled dataset of correct steps exists. The child tries something, falls down (negative reward), tries again, stays upright (positive reward). Over thousands of attempts, they learn what works. RL agents do the same: act in an environment, observe rewards and penalties, adjust behaviour to maximise cumulative reward over time.

⚡ Reinforcement learning is machine learning by trial and reward — an agent acts in an environment, receives rewards or penalties, and gradually learns which actions lead to the most cumulative reward. No labelled dataset needed — the environment provides feedback. AlphaGo, robotic control, data centre optimisation, and RLHF for ChatGPT all run on reinforcement learning.

Category: Machine Learning · Difficulty: Intermediate · Last updated: 15 May 2026 · 5 min read

Reinforcement Learning — What It Is, How AI Learns from Rewards & Where It Powers the World’s Most Capable Systems

What is Reinforcement Learning?

Supervised learning requires a human to label every training example: this image is a cat, this loan application should be approved. For many problems, labelled data does not exist — or the task is too complex to specify with labels. How do you label “the optimal chess move in this position”? How do you label “the optimal motor torque at this moment for a walking robot”?

Reinforcement learning avoids labels entirely. The agent acts. The environment responds. If the action was good, the environment provides positive reward. If bad, negative reward. The agent adjusts its future behaviour to maximise cumulative reward. Over enough interactions, it discovers optimal strategies that no human explicitly specified.
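
"Cumulative reward" is usually made precise as a return: the sum of rewards collected over an episode. The sketch below assumes a discount factor gamma, which the text above does not mention; setting gamma to 1.0 gives the plain sum. The reward values are hypothetical.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards where a reward k steps ahead is weighted by gamma**k."""
    total = 0.0
    # Work backwards so each step computes G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

# Hypothetical episode: two small penalties followed by a success reward.
episode_rewards = [-1.0, -1.0, 10.0]
print(discounted_return(episode_rewards, gamma=0.9))
print(discounted_return(episode_rewards, gamma=1.0))
```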

This is why RL has produced some of AI’s most impressive demonstrations — AlphaGo defeating the world champion, robots learning to walk without being programmed with walking gaits, and RLHF teaching language models to be helpful without humans writing every correct response.

THE CORE LOOP

Agent observes the current state of the environment.
Agent selects an action — based on its current policy.
Environment transitions to a new state and provides a reward signal.
Agent updates its policy based on the (state, action, reward, next state) experience.
Repeat — thousands to millions of times.

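The policy improves: actions that led to high reward become more likely, and actions that led to low reward become less likely. Given enough iterations, the policy tends toward the optimal one, although convergence is guaranteed only under specific conditions (for example, tabular Q-learning with sufficient exploration and a suitably decaying learning rate).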
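
The loop above can be sketched with tabular Q-learning, which is one possible update rule among many. The five-state line world, the hyperparameters, and all function names are assumptions made for the example.

```python
import random

N_STATES = 5        # states 0..4 on a line; state 4 is the goal (hypothetical toy task)
ACTIONS = [-1, +1]  # move left or move right

def step(state, action):
    """Environment: move along the line, reward 1.0 only on reaching the goal."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def train(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action_index]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Observe the state and select an action (epsilon-greedy, random on ties).
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[state][0] > q[state][1] else 1
            # The environment transitions and emits a reward.
            next_state, reward, done = step(state, ACTIONS[a])
            # Update from the (state, action, reward, next state) experience.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q

q_table = train()
# Greedy action for each non-terminal state after training.
greedy_policy = [ACTIONS[0 if row[0] > row[1] else 1] for row in q_table[:-1]]
print(greedy_policy)
```

In this toy task the greedy policy should end up moving right in every state, since that is the only way to reach the reward.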

MODEL-FREE VS MODEL-BASED RL

Model-free RL (Q-Learning, PPO, SAC): the agent learns directly from environment interactions, without building a model of how the environment works. Simpler but data-hungry, because it requires many real interactions.

Model-based RL: the agent first learns a model of the environment (one that predicts the next state given the current state and action), then uses that model to plan without real interactions. More data-efficient but harder to implement correctly. AlphaZero sits between the two: it plans with tree search over the known game rules rather than a learned model, while learning its policy and value networks from self-play.
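
A minimal sketch of the model-based idea, assuming a tiny deterministic line world: the agent records what the real environment does, then plans with value iteration inside that learned lookup table. The environment, the decision to also model the reward, and all names are hypothetical.

```python
import random

N_STATES, GOAL = 5, 4  # hypothetical line world: states 0..4, goal at state 4
ACTIONS = [-1, +1]

def real_step(state, action):
    """The real environment, which the agent can only sample from."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    return next_state, (1.0 if next_state == GOAL else 0.0)

def learn_model(n_samples=200, seed=0):
    """Fit a lookup-table model: (state, action) -> (next_state, reward)."""
    rng = random.Random(seed)
    model = {}
    for _ in range(n_samples):
        s = rng.randrange(N_STATES - 1)  # sample non-terminal states only
        a = rng.choice(ACTIONS)
        model[(s, a)] = real_step(s, a)  # deterministic world: one sample per pair suffices
    return model

def plan(model, gamma=0.9, sweeps=50):
    """Value iteration run entirely inside the learned model (no real interactions)."""
    values = [0.0] * N_STATES  # the goal is terminal, so its value stays 0

    def q(s, a):
        next_state, reward = model[(s, a)]
        return reward + gamma * values[next_state]

    for _ in range(sweeps):
        for s in range(N_STATES - 1):
            values[s] = max(q(s, a) for a in ACTIONS if (s, a) in model)
    policy = {
        s: max((a for a in ACTIONS if (s, a) in model), key=lambda a: q(s, a))
        for s in range(N_STATES - 1)
    }
    return values, policy

model = learn_model()
values, policy = plan(model)
print(policy)
```

Here 200 real interactions are enough to plan indefinitely; the catch in practice is that a learned model of a complex environment is imperfect, and planning can exploit its errors.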

Real-world examples

Not theory — what real teams actually shipped using this technique.

AlphaGo Zero (DeepMind, 2017): learned Go entirely through self-play RL, with no human game data. It started knowing only the rules. After 40 days of training, playing millions of games against itself, it was stronger than any human player and every earlier version of AlphaGo.
Google DeepMind data centre cooling: an RL agent controlling Google’s data centre cooling reduced the energy used for cooling by up to 40%, in part by discovering counter-intuitive cooling strategies that human engineers had not considered.
Boston Dynamics robot locomotion: reinforcement learning trains locomotion policies for quadruped robots, discovering natural-looking gaits and recovery behaviours through millions of simulated falls and adjustments.

Common pitfalls

Sample inefficiency — model-free RL needs millions of environment interactions. For physical systems, this means simulation first and careful sim-to-real transfer.
Reward hacking — agents find unexpected ways to maximise the reward function that violate the spirit of the goal. A boat racing agent learned to spin in circles collecting bonus points rather than finishing the race.
Stability challenges — RL training can be unstable. Policy gradient methods (PPO, TRPO) add constraints to prevent catastrophic policy updates. Careful hyperparameter tuning is required.
Sim-to-real gap — policies trained in simulation often fail when deployed in the real world due to differences between simulated and physical dynamics. Domain randomisation and system identification reduce but do not eliminate this gap.
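
To make the stability point concrete, here is a sketch of the per-sample clipped surrogate objective from the PPO paper (Schulman et al., 2017). The function name and the scalar, pure-Python form are assumptions made for readability; real implementations work on batches of tensors. The default clip value of 0.2 follows the paper.

```python
import math

def ppo_clipped_term(new_logp, old_logp, advantage, clip_eps=0.2):
    """Per-sample PPO clipped surrogate objective (a quantity to be maximised).

    new_logp / old_logp: log-probability of the taken action under the
    updated policy and under the policy that collected the data.
    """
    ratio = math.exp(new_logp - old_logp)
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Taking the minimum removes any incentive to push the ratio far outside
    # the interval [1 - clip_eps, 1 + clip_eps] in the direction the advantage favours.
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage, and the new policy already makes the action 1.5x more likely:
print(ppo_clipped_term(math.log(1.5), 0.0, advantage=1.0))
# Negative advantage, and the new policy already makes the action half as likely:
print(ppo_clipped_term(math.log(0.5), 0.0, advantage=-1.0))
```

In both calls the objective is capped at the clipped value, so a single batch cannot reward an arbitrarily large change to the policy.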

Frequently asked questions

QUESTION 1 What is reinforcement learning in simple terms?

ANSWER 1 Learning by doing — act, observe reward, adjust. No labelled data needed. The agent discovers optimal strategies through trial and interaction with the environment.

QUESTION 2 What is the difference between RL and supervised learning?

ANSWER 2 Supervised: labelled examples tell the model the correct answer. RL: the agent discovers good actions by trying them and observing rewards — no correct labels provided.

QUESTION 3 What is a policy in RL?

ANSWER 3 The agent’s decision function — a mapping from states to actions. Training finds the optimal policy that maximises expected cumulative reward.

QUESTION 4 What are the biggest RL challenges?

ANSWER 4 Sample inefficiency, reward hacking, training instability, and the sim-to-real gap when deploying physical systems.

Sources & further reading

Sutton & Barto (2018). Reinforcement Learning: An Introduction. 2nd Edition. Available free at: incompleteideas.net/book/the-book-2nd.html (the definitive textbook).
Silver et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature (AlphaGo paper).
Mnih et al. (2015). Human-level control through deep reinforcement learning. Nature (DQN paper).
Schulman et al. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347 (PPO paper).
OpenAI Spinning Up: spinningup.openai.com (free RL educational resource with implementations).
