⚡ Reinforcement learning is machine learning by trial and reward — an agent acts in an environment, receives rewards or penalties, and gradually learns which actions lead to the most cumulative reward. No labelled dataset needed — the environment provides feedback. AlphaGo, robotic control, data centre optimisation, and RLHF for ChatGPT all run on reinforcement learning.
Category: Machine Learning · Difficulty: Intermediate · Last updated: 15 May 2026 · 5 min read
Reinforcement Learning — What It Is, How AI Learns from Rewards & Where It Powers the World’s Most Capable Systems
What is Reinforcement Learning?
Supervised learning requires a human to label every training example: this image is a cat, this loan application should be approved. For many problems, labelled data does not exist — or the task is too complex to specify with labels. How do you label “the optimal chess move in this position”? How do you label “the optimal motor torque at this moment for a walking robot”?
Reinforcement learning avoids labels entirely. The agent acts. The environment responds. If the action was good, the environment provides positive reward. If bad, negative reward. The agent adjusts its future behaviour to maximise cumulative reward. Over enough interactions, it can discover strong strategies that no human explicitly specified.
This is why RL has produced some of AI’s most impressive demonstrations: AlphaGo defeating world champion Lee Sedol in 2016, robots learning to walk without being programmed with walking gaits, and RLHF teaching language models to be helpful without humans writing every correct response.
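To make "cumulative reward" concrete, here is a minimal Python sketch of one common way to total an episode's rewards. The discount factor `gamma` and the function name `discounted_return` are assumptions for illustration; the text above only says that rewards are accumulated.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards, each weighted by gamma raised to its time step."""
    total = 0.0
    # Walk backwards so each step computes G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical episode: no reward for two steps, then a reward of 1.
episode_rewards = [0.0, 0.0, 1.0]
print(discounted_return(episode_rewards, gamma=0.5))  # 0.25
```

With `gamma` below 1, a reward that arrives later counts for less, so the agent is nudged toward reaching it sooner. Setting `gamma=1.0` gives the plain sum.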
THE CORE LOOP
- Agent observes the current state of the environment.
- Agent selects an action — based on its current policy.
- Environment transitions to a new state and provides a reward signal.
- Agent updates its policy based on the (state, action, reward, next state) experience.
- Repeat — thousands to millions of times.
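The policy improves: actions that led to high reward become more likely, and actions that led to low reward become less likely. With enough iterations and sufficient exploration, the policy typically moves toward the optimal one, although convergence is guaranteed only under specific conditions.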
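The loop above can be sketched with tabular Q-learning, one of the model-free methods. Everything specific here is an assumption made for illustration: the five-state corridor environment, the `step` and `choose_action` helpers, and the hyperparameter values.

```python
import random

N_STATES = 5          # hypothetical corridor: states 0..4, goal at state 4
ACTIONS = (-1, +1)    # move left or move right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1  # assumed hyperparameters


def step(state, action):
    """Toy environment: returns (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def choose_action(q, state, rng):
    """Epsilon-greedy policy: mostly exploit, occasionally explore."""
    if rng.random() < EPSILON:
        return rng.randrange(len(ACTIONS))
    best = max(q[state])
    # Break ties at random so the untrained agent does not get stuck.
    return rng.choice([i for i, v in enumerate(q[state]) if v == best])


def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            a = choose_action(q, state, rng)                     # observe, act
            next_state, reward, done = step(state, ACTIONS[a])   # env responds
            target = reward if done else reward + GAMMA * max(q[next_state])
            q[state][a] += ALPHA * (target - q[state][a])        # update
            state = next_state
    return q


q_table = train()
greedy = [ACTIONS[max(range(len(ACTIONS)), key=lambda i: q_table[s][i])]
          for s in range(N_STATES - 1)]
print(greedy)  # greedy action in each non-goal state
```

After training, the greedy action in every non-goal state should be `+1` (move right), even though nothing in the code states that rule directly.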
MODEL-FREE VS MODEL-BASED RL
Model-free RL (Q-Learning, PPO, SAC): the agent learns directly from environment interactions, without building a model of how the environment works. Simpler but data-hungry, since it requires many real interactions.
Model-based RL: the agent first learns a model of the environment (one that predicts the next state and reward given the current state and action), then uses the model to plan, which reduces the number of real interactions needed. More data-efficient but harder to implement correctly. AlphaZero blends the two ideas: it plans with tree search over the known game rules while learning its policy and value networks from self-play.
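As a rough illustration of the model-based idea, the sketch below first records transitions from a toy environment into a lookup-table model, then plans with value iteration using only that model. The corridor environment, the function names, and the choice of value iteration as the planner are all assumptions, not a description of any particular system.

```python
import random

N_STATES, GOAL = 5, 4   # hypothetical corridor: states 0..4, goal at 4
ACTIONS = (-1, +1)      # move left or move right
GAMMA = 0.9             # assumed discount factor


def real_step(state, action):
    """The 'real' environment: returns (next_state, reward)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    return next_state, (1.0 if next_state == GOAL else 0.0)


def learn_model(n_samples=200, seed=0):
    """Phase 1: collect random real transitions into a lookup-table model."""
    rng = random.Random(seed)
    model = {}
    for _ in range(n_samples):
        s = rng.randrange(N_STATES - 1)   # any non-goal state
        a = rng.choice(ACTIONS)
        model[(s, a)] = real_step(s, a)
    return model


def q_from_model(model, values, s, a):
    """One-step lookahead through the model; unseen pairs score -inf."""
    if (s, a) not in model:
        return float("-inf")
    next_state, reward = model[(s, a)]
    return reward + GAMMA * values[next_state]


def plan(model, sweeps=50):
    """Phase 2: value iteration on the learned model, no real interactions."""
    values = [0.0] * N_STATES             # the goal state keeps value 0
    for _ in range(sweeps):
        for s in range(N_STATES - 1):
            values[s] = max(q_from_model(model, values, s, a) for a in ACTIONS)
    policy = [max(ACTIONS, key=lambda a: q_from_model(model, values, s, a))
              for s in range(N_STATES - 1)]
    return values, policy


values, policy = plan(learn_model())
print(policy)
```

All 200 real interactions happen in `learn_model`; the 50 planning sweeps cost nothing in the real environment. That is the data-efficiency trade described above, and it holds only while the learned model is accurate.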
Real-world examples
Not theory — what real teams actually shipped using this technique.
- AlphaGo Zero (DeepMind, 2017): learned Go entirely through self-play RL, with no human game data. It started knowing only the rules. After 40 days of training, playing millions of games against itself, it became stronger than any human player or previous version of AlphaGo.
- Google DeepMind data centre cooling: an RL agent controlling Google’s data centre cooling was reported to cut the energy used for cooling by up to 40%, discovering counter-intuitive cooling strategies human engineers had not considered.
- Boston Dynamics robot locomotion: reinforcement learning trains locomotion policies for quadruped robots, discovering natural-looking gaits and recovery behaviours through millions of simulated falls and adjustments.
Common pitfalls
- Sample inefficiency: model-free RL often needs millions of environment interactions. For physical systems, this means training in simulation first, followed by careful sim-to-real transfer.
- Reward hacking: agents find unexpected ways to maximise the reward function that violate the spirit of the goal. In one widely cited example, a boat-racing agent learned to spin in circles collecting bonus points rather than finishing the race.
- Stability challenges: RL training can be unstable. TRPO and PPO limit how far a single update can move the policy (TRPO with a trust-region constraint, PPO with a clipped objective) to prevent catastrophic policy updates. Careful hyperparameter tuning is still required.
- Sim-to-real gap: policies trained in simulation often fail when deployed in the real world because simulated and physical dynamics differ. Domain randomisation and system identification reduce but do not eliminate this gap.
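One way PPO guards against catastrophic policy updates is its clipped surrogate objective, described in Schulman et al. (2017). The sketch below computes that objective in plain Python for a handful of samples. The function name, argument names, and the default `clip_eps=0.2` are choices made for this illustration, and a real implementation would operate on batched tensors with automatic differentiation.

```python
def ppo_clipped_objective(ratios, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch of sampled actions.

    ratios: new_policy_prob / old_policy_prob for each sampled action.
    advantages: estimated advantage of each sampled action.
    """
    terms = []
    for ratio, advantage in zip(ratios, advantages):
        # Restrict the ratio to [1 - eps, 1 + eps].
        clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the more pessimistic of the clipped and unclipped terms.
        terms.append(min(ratio * advantage, clipped_ratio * advantage))
    return sum(terms) / len(terms)


# A good action whose probability already rose by 50%: the gain is capped.
print(ppo_clipped_objective([1.5], [1.0]))   # 1.2
# A bad action whose probability rose by 50%: the penalty is not capped.
print(ppo_clipped_objective([1.5], [-1.0]))  # -1.5
```

Because the objective stops rewarding the policy once the ratio leaves the clip range in the favourable direction, a single batch gives little incentive to move far from the policy that collected the data.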
Frequently asked questions
QUESTION 1 What is reinforcement learning in simple terms?
ANSWER 1 Learning by doing — act, observe reward, adjust. No labelled data needed. The agent discovers optimal strategies through trial and interaction with the environment.
QUESTION 2 What is the difference between RL and supervised learning?
ANSWER 2 Supervised: labelled examples tell the model the correct answer. RL: the agent discovers good actions by trying them and observing rewards — no correct labels provided.
QUESTION 3 What is a policy in RL?
ANSWER 3 The agent’s decision function: a mapping from states to actions, or to a probability distribution over actions. Training searches for the policy that maximises expected cumulative reward.
QUESTION 4 What are the biggest RL challenges?
ANSWER 4 Sample inefficiency, reward hacking, training instability, and the sim-to-real gap when deploying to physical systems.
Sources & further reading
- Sutton & Barto (2018). Reinforcement Learning: An Introduction. 2nd Edition. Available free at: incompleteideas.net/book/the-book-2nd.html (the definitive textbook).
- Silver et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature (AlphaGo paper).
- Mnih et al. (2015). Human-level control through deep reinforcement learning. Nature (DQN paper).
- Schulman et al. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347 (PPO paper).
- OpenAI Spinning Up: spinningup.openai.com (free RL educational resource with implementations).