⚡ A Markov Decision Process (MDP) is the mathematical framework that describes any sequential decision problem — what states the agent can be in, what actions it can take, what rewards it receives, and how states transition. MDPs are the foundation of reinforcement learning. Every RL problem — from chess engines to robot navigation to recommendation systems — is formally an MDP.
Category: Machine Learning · Difficulty: Intermediate · Last updated: 15 May 2026 · 5 min read
Markov Decision Process — How AI Formalises Sequential Decision-Making Under Uncertainty
What is Markov Decision Process?
How should a robot decide what to do next? How should a chess engine choose its move? How should a recommendation system decide which video to show? All three are sequential decision problems — each decision changes the situation, which changes the options, which changes the future. Solving them requires thinking not just about immediate rewards but about long-term consequences.
A Markov Decision Process is the mathematical language for describing these problems precisely. Define the components clearly and any sequential decision problem — however complex — becomes a tractable optimisation problem.
The name comes from Andrey Markov, the Russian mathematician who developed the theory of stochastic processes. The key property his name designates: the future depends only on the present state, not on the full history of how you got there.
THE FOUR COMPONENTS
States (S) — every possible situation the agent can be in. In chess: every possible board position (10^47 states). In a grid world: every cell. In a hospital: every possible combination of patient vitals and treatment history.
Actions (A) — every action available to the agent in each state. In chess: every legal move (typically 20-30 per position). In a robot arm: every possible joint angle change. Actions may be discrete (chess moves) or continuous (robot joint velocities).
Rewards (R) — the scalar feedback the agent receives after taking an action in a state. Winning a chess game: +1. Losing: -1. In a robot: positive reward for reaching the goal, negative for collision. The reward signal defines what the agent is trying to achieve.
Transitions (P) — the probability of transitioning to each possible next state given the current state and action taken. In chess (deterministic): taking a move always leads to exactly one next state. In a robot in a windy environment (stochastic): the same action might lead to slightly different states each time.
THE MARKOV PROPERTY
The Markov property is what makes MDPs tractable: the next state depends only on the current state and action — not on any previous states or actions. Knowing where you are now is sufficient. You do not need to remember how you got there.
This is an approximation of reality — human decisions often depend on history (“I burnt myself on this stove before”). But it is a remarkably useful approximation. Most sequential decision problems can be reformulated to satisfy the Markov property by carefully defining what counts as the “state.”
Real-world examples
Not theory — what real teams actually shipped using this technique.
- AlphaGo modelled the game of Go as an MDP — states are board positions, actions are stone placements, reward is winning or losing the game. Monte Carlo Tree Search explored the MDP and a neural network estimated values. The combination beat the world champion.
- Google’s data centre cooling optimisation modelled server cooling as an MDP — states are temperature and power readings, actions are cooling system settings, reward is energy efficiency. DeepMind’s RL agent reduced cooling energy use by 40%.
- Personalised medicine dosing — MDPs model patient treatment as sequential decisions: states are patient biomarkers, actions are drug dosages, rewards are patient health outcomes. RL agents trained on MDP models suggest optimal dosing schedules that adapt as the patient responds.
Common pitfalls
- State space explosion — real-world MDPs have astronomically large state spaces. Chess has 10^47 states; the real world has infinitely more. Exact MDP solution (dynamic programming) is intractable. Deep RL uses neural networks to approximate value functions and policies without explicit enumeration.
- Reward specification — defining the reward function correctly is harder than it looks. A reward that is technically correct but misaligned with the true objective leads the agent to unexpected and often undesirable behaviour (reward hacking).
- Non-Markovian reality — real environments often violate the Markov property. History matters. POMDPs and memory-augmented agents address this but increase complexity significantly.
- Exploration vs exploitation — in unknown MDPs, the agent must balance exploring new actions (to discover rewards) versus exploiting known good actions. Getting this balance wrong leads to either over-exploration (wasting time) or premature convergence to suboptimal policies.
Frequently asked questions
QUESTION 1 What is a Markov Decision Process in simple terms?
ANSWER 1 A formal description of a sequential decision problem — states (situations), actions (choices), rewards (feedback), transitions (probabilities). Defines the problem; RL finds the solution.
QUESTION 2 What is the Markov property?
ANSWER 2 The future depends only on the current state, not the history of how you got there. Current state is sufficient — past states add no predictive value.
QUESTION 3 What is a policy in an MDP?
ANSWER 3 A mapping from states to actions — what to do in every possible situation. The goal of RL is finding the optimal policy that maximises cumulative reward.
QUESTION 4 What is the difference between MDP and POMDP?
ANSWER 4 MDP: agent fully observes the current state. POMDP: agent only partially observes it and must maintain a belief about the true state. Most real AI agents face partial observability.
📬 Get one concept + one use case every Tuesday. Join the newsletter →