Reinforcement Learning Fundamentals¶
Reinforcement learning (RL) is a paradigm of machine learning where an agent learns to make sequential decisions by interacting with an environment to maximize cumulative reward. Unlike supervised learning, RL does not require labeled input-output pairs -- instead, the agent discovers optimal behavior through trial and error. RL is the mathematical framework behind game-playing AI (AlphaGo, OpenAI Five), robotic control, autonomous driving, and large language model alignment (RLHF).
This chapter covers the theoretical foundations that underpin all RL algorithms, following the structure of the seminal textbook Reinforcement Learning: An Introduction by Sutton & Barto (2018).
Learning Objectives¶
After completing this chapter, you will be able to:
- Define the agent-environment interface and explain the key components of RL
- Formalize a problem as a Markov Decision Process (MDP)
- Derive and interpret the Bellman equations for value functions
- Implement dynamic programming methods (policy iteration, value iteration)
- Understand Monte Carlo and Temporal Difference learning
- Compare different RL paradigms (model-free vs. model-based, on-policy vs. off-policy)
- Build and solve RL environments using Python and Gymnasium
- Connect RL fundamentals to robotic manipulation and control tasks
1. What Is Reinforcement Learning?¶
1.1 The Agent-Environment Interface¶
The core idea of RL is simple: an agent takes actions in an environment, observes states and rewards, and learns a policy that maximizes long-term reward.
+------------+ action a_t +--------------+
| | ----------------->| |
| Agent | | Environment |
| |<---------------- | |
+------------+ state s_{t+1} +--------------+
^ reward r_t |
| |
+-------------------------------+
At each time step \(t\):
- The agent observes the current state \(s_t \in \mathcal{S}\)
- The agent selects an action \(a_t \in \mathcal{A}(s_t)\) according to policy \(\pi\)
- The environment transitions to a new state \(s_{t+1} \sim P(\cdot | s_t, a_t)\)
- The agent receives a reward \(r_{t+1} = R(s_t, a_t, s_{t+1})\)
- The agent updates its policy to maximize expected cumulative reward
1.2 RL vs. Supervised vs. Unsupervised Learning¶
| Aspect | Supervised Learning | Unsupervised Learning | Reinforcement Learning |
|---|---|---|---|
| Data | Labeled pairs \((x, y)\) | Unlabeled data \(x\) | Sequential \((s, a, r, s')\) |
| Feedback | Correct labels | Structure/patterns | Scalar reward signal |
| Goal | Minimize prediction error | Find hidden structure | Maximize cumulative reward |
| Temporal | Independent samples | Independent samples | Sequential decisions |
| Exploration | Not needed | Not needed | Essential |
| Examples | Classification, regression | Clustering, PCA | Game playing, robotics |
1.3 History and Key Milestones¶
| Year | Milestone | Key Contribution |
|---|---|---|
| 1950s | Dynamic Programming (Bellman) | Mathematical foundation |
| 1989 | Q-Learning (Watkins) | Model-free off-policy control |
| 1992 | TD-Gammon (Tesauro) | Self-play backgammon |
| 2013 | DQN (DeepMind) | Deep RL for Atari |
| 2016 | AlphaGo (DeepMind) | Superhuman Go |
| 2017 | PPO (OpenAI) | Stable policy optimization |
| 2019 | OpenAI Five | Multi-agent Dota 2 |
| 2022 | ChatGPT (RLHF) | RL for language model alignment |
2. Multi-Armed Bandits¶
Before diving into full MDPs, we study the k-armed bandit problem -- the simplest RL setting with a single state and \(k\) actions.
2.1 The k-Armed Bandit Problem¶
You face \(k\) slot machines (arms), each with an unknown reward distribution. At each step, you pull one arm and receive a reward. Goal: maximize total reward over \(T\) steps.
Formally, each arm \(a\) has a true value \(q_*(a) = \mathbb{E}[R_t | A_t = a]\). The optimal action is \(a^* = \arg\max_a q_*(a)\).
2.2 Action-Value Methods¶
Estimate \(Q_t(a) \approx q_*(a)\) using sample averages:
Select action: \(A_t = \arg\max_a Q_t(a)\) (greedy).
2.3 Exploration vs. Exploitation¶
The fundamental dilemma: exploit the current best arm, or explore to find a better one?
epsilon-Greedy: With probability \(\epsilon\), choose randomly; otherwise, choose greedily.
Upper Confidence Bound (UCB):
where \(c\) controls the exploration-exploitation trade-off.
Thompson Sampling: Maintain a posterior distribution over each arm's value; sample from the posterior and act greedily.
import numpy as np
import matplotlib.pyplot as plt
def run_bandit_experiment(k=10, steps=1000, epsilon=0.1, runs=200):
"""Compare epsilon-greedy strategies on a k-armed bandit testbed."""
avg_rewards = np.zeros((runs, steps))
optimal_counts = np.zeros((runs, steps))
for run in range(runs):
# True values: q*(a) ~ N(0, 1)
q_star = np.random.randn(k)
best_action = np.argmax(q_star)
Q = np.zeros(k) # Estimated values
N = np.zeros(k) # Action counts
for t in range(steps):
# epsilon-greedy action selection
if np.random.random() < epsilon:
action = np.random.randint(k)
else:
action = np.argmax(Q)
# Reward: R ~ N(q*(a), 1)
reward = q_star[action] + np.random.randn()
N[action] += 1
Q[action] += (reward - Q[action]) / N[action]
avg_rewards[run, t] = reward
if action == best_action:
optimal_counts[run, t] = 1
return avg_rewards.mean(axis=0), optimal_counts.mean(axis=0) * 100
# Compare epsilon = 0, 0.01, 0.1
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
for eps, label in [(0, 'eps=0 (greedy)'), (0.01, 'eps=0.01'), (0.1, 'eps=0.1')]:
rewards, optimal = run_bandit_experiment(epsilon=eps)
ax1.plot(rewards, label=label)
ax2.plot(optimal, label=label)
ax1.set_xlabel('Steps'); ax1.set_ylabel('Average Reward'); ax1.legend()
ax2.set_xlabel('Steps'); ax2.set_ylabel('% Optimal Action'); ax2.legend()
ax1.set_title('10-Armed Bandit: Average Reward')
ax2.set_title('10-Armed Bandit: % Optimal Action')
plt.tight_layout(); plt.savefig('bandit_comparison.png', dpi=150); plt.show()
3. Markov Decision Processes (MDP)¶
The MDP is the formal mathematical framework for RL. Almost all RL problems can be formulated as MDPs.
3.1 The Markov Property¶
A state \(s_t\) has the Markov property if:
The future depends only on the present, not the history. This is the "memoryless" property.
3.2 Formal Definition¶
An MDP is a tuple \((\mathcal{S}, \mathcal{A}, P, R, \gamma)\):
| Symbol | Meaning | Robot Example |
|---|---|---|
| \(\mathcal{S}\) | State space | Joint angles + velocities |
| \(\mathcal{A}\) | Action space | Joint torques |
| \(P(s' \mid s, a)\) | Transition probability | Physics dynamics |
| \(R(s, a, s')\) | Reward function | Task completion signal |
| \(\gamma \in [0,1)\) | Discount factor | How much we value future rewards |
3.3 Returns and Discount Factor¶
The return from time \(t\) is the cumulative discounted reward:
- \(\gamma = 0\): myopic (only care about immediate reward)
- \(\gamma \to 1\): far-sighted (care about future rewards)
- \(\gamma < 1\) guarantees \(G_t\) is finite for bounded rewards
The return satisfies the recursive relationship:
3.4 Episodic vs. Continuing Tasks¶
| Type | Description | Example |
|---|---|---|
| Episodic | Natural terminal state, then reset | Game ends, robot reaches goal |
| Continuing | No terminal state, infinite horizon | Server load balancing |
For episodic tasks, the return is a finite sum: \(G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T\).
4. Value Functions¶
Value functions estimate "how good" it is to be in a state (or take an action in a state) under a given policy \(\pi\).
4.1 State-Value Function¶
This is the expected return starting from state \(s\) and following policy \(\pi\) thereafter.
4.2 Action-Value Function¶
This is the expected return starting from state \(s\), taking action \(a\), then following policy \(\pi\).
The relationship between \(V\) and \(Q\):
4.3 The Bellman Equations¶
The Bellman expectation equation for \(V^\pi\):
For \(Q^\pi\):
The Bellman optimality equation:
Bellman Expectation Backup Bellman Optimality Backup
s s
/|\ /|\
/ | \ / | \
a1 a2 a3 a1 a2 a3
| | | | | |
s' s' s' s' s' s'
v v v v v v
V^pi(s') max Q*(s',a')
weighted by pi(a|s) choose best a
4.4 Bellman Equation Derivation¶
Starting from the definition:
Using \(G_t = R_{t+1} + \gamma G_{t+1}\):
Expanding the expectation over actions and next states:
5. Dynamic Programming¶
Dynamic Programming (DP) methods solve MDPs when the model \((P, R)\) is fully known. While rarely applicable to real robotics (where dynamics are unknown), DP provides the conceptual foundation for all RL algorithms.
5.1 Policy Evaluation¶
Given policy \(\pi\), compute \(V^\pi\) by iteratively applying the Bellman expectation equation:
Algorithm: Iterative Policy Evaluation
Initialize V(s) = 0 for all s in S
Repeat:
delta <- 0
For each s in S:
v <- V(s)
V(s) <- sum_a pi(a|s) sum_{s'} P(s'|s,a) [R(s,a,s') + gamma*V(s')]
delta <- max(delta, |v - V(s)|)
Until delta < theta (convergence threshold)
5.2 Policy Iteration¶
Alternates between evaluation (compute \(V^\pi\)) and improvement (make \(\pi\) greedy w.r.t. \(V^\pi\)):
Policy Improvement Theorem: If \(Q^\pi(s, \pi'(s)) \geq V^\pi(s)\) for all \(s\), then \(V^{\pi'}(s) \geq V^\pi(s)\) for all \(s\).
Initialize pi randomly
Repeat:
1. Policy Evaluation: compute V^pi
2. Policy Improvement: pi(s) <- argmax_a sum_{s'} P(s'|s,a)[R + gamma*V^pi(s')]
Until pi does not change
5.3 Value Iteration¶
Combines evaluation and improvement into a single sweep:
Algorithm: Value Iteration
Initialize V(s) = 0 for all s in S
Repeat:
delta <- 0
For each s in S:
v <- V(s)
V(s) <- max_a sum_{s'} P(s'|s,a) [R(s,a,s') + gamma*V(s')]
delta <- max(delta, |v - V(s)|)
Until delta < theta
Output: pi*(s) = argmax_a sum_{s'} P(s'|s,a) [R(s,a,s') + gamma*V*(s')]
import numpy as np
def value_iteration(grid_size=4, gamma=0.9, theta=1e-6):
"""Solve a grid world using Value Iteration."""
n_states = grid_size * grid_size
n_actions = 4 # up, down, left, right
actions = [(-1,0), (1,0), (0,-1), (0,1)]
V = np.zeros(n_states)
policy = np.zeros(n_states, dtype=int)
def step(state, action):
row, col = state // grid_size, state % grid_size
dr, dc = actions[action]
nr, nc = row + dr, col + dc
if 0 <= nr < grid_size and 0 <= nc < grid_size:
next_state = nr * grid_size + nc
else:
next_state = state
reward = -1
if next_state == n_states - 1:
reward = 0
return next_state, reward
iteration = 0
while True:
delta = 0
for s in range(n_states - 1):
v = V[s]
q_values = []
for a in range(n_actions):
ns, r = step(s, a)
q_values.append(r + gamma * V[ns])
V[s] = max(q_values)
delta = max(delta, abs(v - V[s]))
iteration += 1
if delta < theta:
break
for s in range(n_states - 1):
q_values = []
for a in range(n_actions):
ns, r = step(s, a)
q_values.append(r + gamma * V[ns])
policy[s] = np.argmax(q_values)
return V, policy, iteration
V, policy, iters = value_iteration()
print(f"Converged in {iters} iterations")
print("Optimal Value Function:")
print(V.reshape(4, 4).round(2))
5.4 Generalized Policy Iteration (GPI)¶
The idea of alternating between evaluation and improvement is the core of almost all RL algorithms:
+----------+ +----------+
| Policy | <-----> | Policy |
|Evaluation | |Improvement|
+----------+ +----------+
| |
v v
V^pi <---------------- pi(a|s)
value function <--> policy
- DP: Full evaluation + full improvement (requires model)
- MC: MC evaluation + improvement (model-free, episodic)
- TD: TD evaluation + improvement (model-free, online)
6. Monte Carlo Methods¶
Monte Carlo (MC) methods learn from complete episodes of experience -- no model required.
6.1 MC Prediction¶
Estimate \(V^\pi(s)\) by averaging returns from all visits to state \(s\):
First-visit MC: Only use the first time \(s\) is visited in each episode. Every-visit MC: Use every visit to \(s\).
6.2 MC Control¶
Use MC to estimate \(Q(s,a)\), then improve policy greedily:
Initialize Q(s,a) arbitrarily, pi(s) <- argmax_a Q(s,a)
Repeat forever:
Generate episode using pi: S_0, A_0, R_1, ..., S_{T-1}, A_{T-1}, R_T
G <- 0
For t = T-1, T-2, ..., 0:
G <- gamma*G + R_{t+1}
If (S_t, A_t) not in earlier pairs:
Q(S_t, A_t) <- average of all returns following (S_t, A_t)
pi(S_t) <- argmax_a Q(S_t, a)
6.3 Off-Policy MC (Importance Sampling)¶
Learn about policy \(\pi\) (target) while following policy \(b\) (behavior):
7. Temporal Difference Learning (Preview)¶
TD methods combine ideas from MC and DP: they learn from episodes like MC, but bootstrap like DP.
7.1 TD(0) Prediction¶
The TD update rule:
The term \(\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\) is the TD error.
7.2 TD vs. MC vs. DP¶
| Property | DP | MC | TD |
|---|---|---|---|
| Model required? | Yes (P, R) | No | No |
| Bootstrapping? | Yes | No | Yes |
| Updates per step | All states | End of episode | Each step |
| Convergence | Exact | Asymptotic | Asymptotic |
| Handling non-stationarity | No | Yes | Yes |
7.3 SARSA and Q-Learning (Preview)¶
SARSA (on-policy TD control):
Q-Learning (off-policy TD control):
Detailed coverage
See the Temporal Difference Learning page for full implementations and comparisons.
8. The RL Algorithm Landscape¶
Reinforcement Learning
|
+--------------+--------------+
| | |
Model-Free Model-Based Imitation
| | Learning
+-------+-------+ |
| | World Models
Value-Based Policy-Based Dyna, MBPO
| |
+-- Q-Learning +-- REINFORCE
+-- SARSA +-- Actor-Critic
+-- DQN +-- PPO, TRPO
+-- Double DQN +-- SAC, TD3
| Algorithm | Type | On/Off-Policy | Bootstrapping | Model |
|---|---|---|---|---|
| Value Iteration | Value | -- | Yes | Yes |
| MC Control | Value | On-policy | No | No |
| Q-Learning | Value | Off-policy | Yes | No |
| SARSA | Value | On-policy | Yes | No |
| DQN | Value | Off-policy | Yes | No |
| REINFORCE | Policy | On-policy | No | No |
| A2C/A3C | Actor-Critic | On-policy | Yes | No |
| PPO | Actor-Critic | On-policy | Yes | No |
| SAC | Actor-Critic | Off-policy | Yes | No |
| DDPG | Actor-Critic | Off-policy | Yes | No |
9. Hands-On Project: Grid World with Gymnasium¶
Build a complete RL project from scratch: define a Grid World environment, solve it with Q-learning, and visualize the results.
import numpy as np
import gymnasium as gym
from gymnasium import spaces
import matplotlib.pyplot as plt
class GridWorldEnv(gym.Env):
"""A simple Grid World with obstacles and a goal."""
metadata = {"render_modes": ["human"]}
def __init__(self, size=5):
super().__init__()
self.size = size
self.action_space = spaces.Discrete(4)
self.observation_space = spaces.Discrete(size * size)
self.actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]
self.obstacles = [(1, 1), (1, 3), (3, 1), (3, 3)]
self.goal = (size - 1, size - 1)
self.start = (0, 0)
def reset(self, seed=None):
super().reset(seed=seed)
self.agent_pos = self.start
return self._pos_to_state(self.agent_pos), {}
def step(self, action):
dr, dc = self.actions[action]
r, c = self.agent_pos
nr, nc = r + dr, c + dc
if 0 <= nr < self.size and 0 <= nc < self.size and (nr, nc) not in self.obstacles:
self.agent_pos = (nr, nc)
reward = -1.0
terminated = self.agent_pos == self.goal
if terminated:
reward = 10.0
return self._pos_to_state(self.agent_pos), reward, terminated, False, {}
def _pos_to_state(self, pos):
return pos[0] * self.size + pos[1]
def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
"""Tabular Q-Learning for the Grid World."""
n_states = env.observation_space.n
n_actions = env.action_space.n
Q = np.zeros((n_states, n_actions))
rewards_per_episode = []
for ep in range(episodes):
state, _ = env.reset()
total_reward = 0
done = False
while not done:
if np.random.random() < epsilon:
action = env.action_space.sample()
else:
action = np.argmax(Q[state])
next_state, reward, terminated, truncated, _ = env.step(action)
done = terminated or truncated
Q[state, action] += alpha * (
reward + gamma * np.max(Q[next_state]) * (1 - terminated) - Q[state, action]
)
state = next_state
total_reward += reward
rewards_per_episode.append(total_reward)
return Q, rewards_per_episode
def visualize_policy(Q, env):
"""Visualize the learned policy and value function."""
arrows = ['^', 'v', '<', '>']
V = np.max(Q, axis=1).reshape(env.size, env.size)
policy = np.argmax(Q, axis=1).reshape(env.size, env.size)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
V_display = V.copy()
for obs in env.obstacles:
V_display[obs] = np.nan
im = ax1.imshow(V_display, cmap='YlOrRd', interpolation='nearest')
ax1.set_title('Value Function V(s)')
plt.colorbar(im, ax=ax1)
ax2.imshow(np.ones((env.size, env.size)), cmap='Greys', alpha=0.1)
for r in range(env.size):
for c in range(env.size):
if (r, c) in env.obstacles:
ax2.add_patch(plt.Rectangle((c-0.4, r-0.4), 0.8, 0.8, color='gray'))
elif (r, c) == env.goal:
ax2.text(c, r, 'G', ha='center', va='center', fontsize=16, fontweight='bold')
else:
ax2.text(c, r, arrows[policy[r, c]], ha='center', va='center', fontsize=18)
ax2.set_title('Optimal Policy')
plt.tight_layout(); plt.savefig('grid_world_solution.png', dpi=150); plt.show()
# Run the complete project
env = GridWorldEnv(size=5)
Q, rewards = q_learning(env, episodes=500)
print(f"Average reward (last 100 episodes): {np.mean(rewards[-100:]):.2f}")
visualize_policy(Q, env)
10. RL for Robotics: Why Fundamentals Matter¶
Understanding RL fundamentals is critical for robotics because:
- Reward Design: The Bellman equation tells us how reward shaping affects optimal policies (potential-based shaping preserves optimality).
- Sample Efficiency: Understanding bootstrapping (TD) vs. full returns (MC) helps choose the right algorithm for expensive real-world interactions.
- Exploration in Safety-Critical Systems: Multi-armed bandit theory informs safe exploration strategies.
- Sim-to-Real Transfer: DP and model-based methods connect to system identification and model predictive control.
Robotics-Specific Challenges¶
| Challenge | RL Concept | Solution |
|---|---|---|
| Continuous states/actions | Function approximation | Deep RL (DQN, PPO, SAC) |
| Sparse rewards | Credit assignment | Reward shaping, HER |
| Safety constraints | Constrained MDP | Safe RL, Lagrangian methods |
| Sample inefficiency | Model-based RL | World models, Dyna |
| Sim-to-real gap | Domain randomization | System identification |
Exercises¶
Exercise 1: Bandit Implementation¶
Implement the 10-armed bandit testbed and compare epsilon-greedy (eps=0, 0.01, 0.1) and UCB (c=2). Plot average rewards over 1000 steps.
Exercise 2: Policy Evaluation¶
Given a 4x4 grid world with random policy, implement iterative policy evaluation by hand. Verify the result matches the analytical solution.
Exercise 3: Value Iteration¶
Extend the grid world to 10x10 with random obstacles. Implement value iteration and find the optimal policy.
Exercise 4: MC vs. TD¶
Implement both MC prediction and TD(0) prediction for a random walk environment. Compare their learning curves.
Exercise 5: Q-Learning with Gymnasium¶
Use Gymnasium's FrozenLake-v1 environment to train a Q-learning agent. Experiment with different epsilon, alpha, and gamma values.
Exercise 6: Bellman Equation Proof¶
Prove that the optimal value function \(V^*\) satisfies the Bellman optimality equation:
Hint: Use the policy improvement theorem and the fact that \(V^* = \max_\pi V^\pi\).
References¶
- Sutton & Barto (2018), Reinforcement Learning: An Introduction -- The definitive RL textbook, freely available online
- David Silver's UCL Course on RL -- Excellent lecture series covering MDPs, DP, MC, TD, policy gradient
- OpenAI Spinning Up in Deep RL -- Practical introduction to deep RL algorithms with PyTorch implementations
- Gymnasium Documentation -- Standard RL environment API (successor to OpenAI Gym)
- DeepMind x UCL: Introduction to Reinforcement Learning -- Video lectures and slides
- Hugging Face Deep RL Course -- Hands-on course with practical exercises
- Szepesvari, Algorithms for Reinforcement Learning -- Rigorous theoretical treatment
- Bertsekas, Reinforcement Learning and Optimal Control -- Connections to optimal control theory