Skip to content

Reinforcement Learning Fundamentals

Reinforcement learning (RL) is a paradigm of machine learning where an agent learns to make sequential decisions by interacting with an environment to maximize cumulative reward. Unlike supervised learning, RL does not require labeled input-output pairs -- instead, the agent discovers optimal behavior through trial and error. RL is the mathematical framework behind game-playing AI (AlphaGo, OpenAI Five), robotic control, autonomous driving, and large language model alignment (RLHF).

This chapter covers the theoretical foundations that underpin all RL algorithms, following the structure of the seminal textbook Reinforcement Learning: An Introduction by Sutton & Barto (2018).

Learning Objectives

After completing this chapter, you will be able to:

  • Define the agent-environment interface and explain the key components of RL
  • Formalize a problem as a Markov Decision Process (MDP)
  • Derive and interpret the Bellman equations for value functions
  • Implement dynamic programming methods (policy iteration, value iteration)
  • Understand Monte Carlo and Temporal Difference learning
  • Compare different RL paradigms (model-free vs. model-based, on-policy vs. off-policy)
  • Build and solve RL environments using Python and Gymnasium
  • Connect RL fundamentals to robotic manipulation and control tasks

1. What Is Reinforcement Learning?

1.1 The Agent-Environment Interface

The core idea of RL is simple: an agent takes actions in an environment, observes states and rewards, and learns a policy that maximizes long-term reward.

  +------------+    action a_t    +--------------+
  |            | ----------------->|              |
  |   Agent    |                  | Environment  |
  |            |<---------------- |              |
  +------------+  state s_{t+1}   +--------------+
       ^          reward r_t           |
       |                               |
       +-------------------------------+

At each time step \(t\):

  1. The agent observes the current state \(s_t \in \mathcal{S}\)
  2. The agent selects an action \(a_t \in \mathcal{A}(s_t)\) according to policy \(\pi\)
  3. The environment transitions to a new state \(s_{t+1} \sim P(\cdot | s_t, a_t)\)
  4. The agent receives a reward \(r_{t+1} = R(s_t, a_t, s_{t+1})\)
  5. The agent updates its policy to maximize expected cumulative reward

1.2 RL vs. Supervised vs. Unsupervised Learning

Aspect Supervised Learning Unsupervised Learning Reinforcement Learning
Data Labeled pairs \((x, y)\) Unlabeled data \(x\) Sequential \((s, a, r, s')\)
Feedback Correct labels Structure/patterns Scalar reward signal
Goal Minimize prediction error Find hidden structure Maximize cumulative reward
Temporal Independent samples Independent samples Sequential decisions
Exploration Not needed Not needed Essential
Examples Classification, regression Clustering, PCA Game playing, robotics

1.3 History and Key Milestones

Year Milestone Key Contribution
1950s Dynamic Programming (Bellman) Mathematical foundation
1989 Q-Learning (Watkins) Model-free off-policy control
1992 TD-Gammon (Tesauro) Self-play backgammon
2013 DQN (DeepMind) Deep RL for Atari
2016 AlphaGo (DeepMind) Superhuman Go
2017 PPO (OpenAI) Stable policy optimization
2019 OpenAI Five Multi-agent Dota 2
2022 ChatGPT (RLHF) RL for language model alignment

2. Multi-Armed Bandits

Before diving into full MDPs, we study the k-armed bandit problem -- the simplest RL setting with a single state and \(k\) actions.

2.1 The k-Armed Bandit Problem

You face \(k\) slot machines (arms), each with an unknown reward distribution. At each step, you pull one arm and receive a reward. Goal: maximize total reward over \(T\) steps.

Formally, each arm \(a\) has a true value \(q_*(a) = \mathbb{E}[R_t | A_t = a]\). The optimal action is \(a^* = \arg\max_a q_*(a)\).

2.2 Action-Value Methods

Estimate \(Q_t(a) \approx q_*(a)\) using sample averages:

\[Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i = a}}\]

Select action: \(A_t = \arg\max_a Q_t(a)\) (greedy).

2.3 Exploration vs. Exploitation

The fundamental dilemma: exploit the current best arm, or explore to find a better one?

epsilon-Greedy: With probability \(\epsilon\), choose randomly; otherwise, choose greedily.

\[A_t = \begin{cases} \text{random action} & \text{with probability } \epsilon \\ \arg\max_a Q_t(a) & \text{with probability } 1 - \epsilon \end{cases}\]

Upper Confidence Bound (UCB):

\[A_t = \arg\max_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]\]

where \(c\) controls the exploration-exploitation trade-off.

Thompson Sampling: Maintain a posterior distribution over each arm's value; sample from the posterior and act greedily.

import numpy as np
import matplotlib.pyplot as plt

def run_bandit_experiment(k=10, steps=1000, epsilon=0.1, runs=200):
    """Compare epsilon-greedy strategies on a k-armed bandit testbed."""
    avg_rewards = np.zeros((runs, steps))
    optimal_counts = np.zeros((runs, steps))

    for run in range(runs):
        # True values: q*(a) ~ N(0, 1)
        q_star = np.random.randn(k)
        best_action = np.argmax(q_star)
        Q = np.zeros(k)       # Estimated values
        N = np.zeros(k)       # Action counts

        for t in range(steps):
            # epsilon-greedy action selection
            if np.random.random() < epsilon:
                action = np.random.randint(k)
            else:
                action = np.argmax(Q)

            # Reward: R ~ N(q*(a), 1)
            reward = q_star[action] + np.random.randn()
            N[action] += 1
            Q[action] += (reward - Q[action]) / N[action]

            avg_rewards[run, t] = reward
            if action == best_action:
                optimal_counts[run, t] = 1

    return avg_rewards.mean(axis=0), optimal_counts.mean(axis=0) * 100

# Compare epsilon = 0, 0.01, 0.1
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
for eps, label in [(0, 'eps=0 (greedy)'), (0.01, 'eps=0.01'), (0.1, 'eps=0.1')]:
    rewards, optimal = run_bandit_experiment(epsilon=eps)
    ax1.plot(rewards, label=label)
    ax2.plot(optimal, label=label)

ax1.set_xlabel('Steps'); ax1.set_ylabel('Average Reward'); ax1.legend()
ax2.set_xlabel('Steps'); ax2.set_ylabel('% Optimal Action'); ax2.legend()
ax1.set_title('10-Armed Bandit: Average Reward')
ax2.set_title('10-Armed Bandit: % Optimal Action')
plt.tight_layout(); plt.savefig('bandit_comparison.png', dpi=150); plt.show()

3. Markov Decision Processes (MDP)

The MDP is the formal mathematical framework for RL. Almost all RL problems can be formulated as MDPs.

3.1 The Markov Property

A state \(s_t\) has the Markov property if:

\[P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = P(s_{t+1} | s_t, a_t)\]

The future depends only on the present, not the history. This is the "memoryless" property.

3.2 Formal Definition

An MDP is a tuple \((\mathcal{S}, \mathcal{A}, P, R, \gamma)\):

Symbol Meaning Robot Example
\(\mathcal{S}\) State space Joint angles + velocities
\(\mathcal{A}\) Action space Joint torques
\(P(s' \mid s, a)\) Transition probability Physics dynamics
\(R(s, a, s')\) Reward function Task completion signal
\(\gamma \in [0,1)\) Discount factor How much we value future rewards

3.3 Returns and Discount Factor

The return from time \(t\) is the cumulative discounted reward:

\[G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}\]
  • \(\gamma = 0\): myopic (only care about immediate reward)
  • \(\gamma \to 1\): far-sighted (care about future rewards)
  • \(\gamma < 1\) guarantees \(G_t\) is finite for bounded rewards

The return satisfies the recursive relationship:

\[G_t = R_{t+1} + \gamma G_{t+1}\]

3.4 Episodic vs. Continuing Tasks

Type Description Example
Episodic Natural terminal state, then reset Game ends, robot reaches goal
Continuing No terminal state, infinite horizon Server load balancing

For episodic tasks, the return is a finite sum: \(G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T\).


4. Value Functions

Value functions estimate "how good" it is to be in a state (or take an action in a state) under a given policy \(\pi\).

4.1 State-Value Function

\[V^\pi(s) = \mathbb{E}_\pi [G_t \mid s_t = s] = \mathbb{E}_\pi \left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid s_t = s\right]\]

This is the expected return starting from state \(s\) and following policy \(\pi\) thereafter.

4.2 Action-Value Function

\[Q^\pi(s, a) = \mathbb{E}_\pi [G_t \mid s_t = s, a_t = a]\]

This is the expected return starting from state \(s\), taking action \(a\), then following policy \(\pi\).

The relationship between \(V\) and \(Q\):

\[V^\pi(s) = \sum_a \pi(a \mid s) Q^\pi(s, a)\]

4.3 The Bellman Equations

The Bellman expectation equation for \(V^\pi\):

\[V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^\pi(s') \right]\]

For \(Q^\pi\):

\[Q^\pi(s, a) = \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma \sum_{a'} \pi(a' \mid s') Q^\pi(s', a') \right]\]

The Bellman optimality equation:

\[V^*(s) = \max_a \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^*(s') \right]\]
\[Q^*(s, a) = \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma \max_{a'} Q^*(s', a') \right]\]
    Bellman Expectation Backup          Bellman Optimality Backup

         s                                 s
        /|\                               /|\
       / | \                             / | \
      a1 a2 a3                         a1 a2 a3
      |  |  |                           |  |  |
      s' s' s'                         s' s' s'
      v  v  v                           v  v  v
     V^pi(s')                          max Q*(s',a')
      weighted by pi(a|s)               choose best a

4.4 Bellman Equation Derivation

Starting from the definition:

\[V^\pi(s) = \mathbb{E}_\pi [G_t | s_t = s]\]

Using \(G_t = R_{t+1} + \gamma G_{t+1}\):

\[V^\pi(s) = \mathbb{E}_\pi [R_{t+1} + \gamma G_{t+1} | s_t = s]\]

Expanding the expectation over actions and next states:

\[= \sum_a \pi(a|s) \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma \mathbb{E}_\pi[G_{t+1} | s_{t+1} = s'] \right]\]
\[= \sum_a \pi(a|s) \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^\pi(s') \right]\]

5. Dynamic Programming

Dynamic Programming (DP) methods solve MDPs when the model \((P, R)\) is fully known. While rarely applicable to real robotics (where dynamics are unknown), DP provides the conceptual foundation for all RL algorithms.

5.1 Policy Evaluation

Given policy \(\pi\), compute \(V^\pi\) by iteratively applying the Bellman expectation equation:

Algorithm: Iterative Policy Evaluation

Initialize V(s) = 0 for all s in S
Repeat:
    delta <- 0
    For each s in S:
        v <- V(s)
        V(s) <- sum_a pi(a|s) sum_{s'} P(s'|s,a) [R(s,a,s') + gamma*V(s')]
        delta <- max(delta, |v - V(s)|)
Until delta < theta (convergence threshold)

5.2 Policy Iteration

Alternates between evaluation (compute \(V^\pi\)) and improvement (make \(\pi\) greedy w.r.t. \(V^\pi\)):

Policy Improvement Theorem: If \(Q^\pi(s, \pi'(s)) \geq V^\pi(s)\) for all \(s\), then \(V^{\pi'}(s) \geq V^\pi(s)\) for all \(s\).

Initialize pi randomly
Repeat:
    1. Policy Evaluation: compute V^pi
    2. Policy Improvement: pi(s) <- argmax_a sum_{s'} P(s'|s,a)[R + gamma*V^pi(s')]
Until pi does not change

5.3 Value Iteration

Combines evaluation and improvement into a single sweep:

Algorithm: Value Iteration

Initialize V(s) = 0 for all s in S
Repeat:
    delta <- 0
    For each s in S:
        v <- V(s)
        V(s) <- max_a sum_{s'} P(s'|s,a) [R(s,a,s') + gamma*V(s')]
        delta <- max(delta, |v - V(s)|)
Until delta < theta
Output: pi*(s) = argmax_a sum_{s'} P(s'|s,a) [R(s,a,s') + gamma*V*(s')]
import numpy as np

def value_iteration(grid_size=4, gamma=0.9, theta=1e-6):
    """Solve a grid world using Value Iteration."""
    n_states = grid_size * grid_size
    n_actions = 4  # up, down, left, right
    actions = [(-1,0), (1,0), (0,-1), (0,1)]

    V = np.zeros(n_states)
    policy = np.zeros(n_states, dtype=int)

    def step(state, action):
        row, col = state // grid_size, state % grid_size
        dr, dc = actions[action]
        nr, nc = row + dr, col + dc
        if 0 <= nr < grid_size and 0 <= nc < grid_size:
            next_state = nr * grid_size + nc
        else:
            next_state = state
        reward = -1
        if next_state == n_states - 1:
            reward = 0
        return next_state, reward

    iteration = 0
    while True:
        delta = 0
        for s in range(n_states - 1):
            v = V[s]
            q_values = []
            for a in range(n_actions):
                ns, r = step(s, a)
                q_values.append(r + gamma * V[ns])
            V[s] = max(q_values)
            delta = max(delta, abs(v - V[s]))
        iteration += 1
        if delta < theta:
            break

    for s in range(n_states - 1):
        q_values = []
        for a in range(n_actions):
            ns, r = step(s, a)
            q_values.append(r + gamma * V[ns])
        policy[s] = np.argmax(q_values)

    return V, policy, iteration

V, policy, iters = value_iteration()
print(f"Converged in {iters} iterations")
print("Optimal Value Function:")
print(V.reshape(4, 4).round(2))

5.4 Generalized Policy Iteration (GPI)

The idea of alternating between evaluation and improvement is the core of almost all RL algorithms:

  +----------+         +----------+
  |  Policy   | <-----> |  Policy   |
  |Evaluation |         |Improvement|
  +----------+         +----------+
       |                      |
       v                      v
     V^pi <---------------- pi(a|s)
       value function  <-->  policy
  • DP: Full evaluation + full improvement (requires model)
  • MC: MC evaluation + improvement (model-free, episodic)
  • TD: TD evaluation + improvement (model-free, online)

6. Monte Carlo Methods

Monte Carlo (MC) methods learn from complete episodes of experience -- no model required.

6.1 MC Prediction

Estimate \(V^\pi(s)\) by averaging returns from all visits to state \(s\):

\[V(s) \leftarrow V(s) + \alpha [G_t - V(s)]\]

First-visit MC: Only use the first time \(s\) is visited in each episode. Every-visit MC: Use every visit to \(s\).

6.2 MC Control

Use MC to estimate \(Q(s,a)\), then improve policy greedily:

Initialize Q(s,a) arbitrarily, pi(s) <- argmax_a Q(s,a)
Repeat forever:
    Generate episode using pi: S_0, A_0, R_1, ..., S_{T-1}, A_{T-1}, R_T
    G <- 0
    For t = T-1, T-2, ..., 0:
        G <- gamma*G + R_{t+1}
        If (S_t, A_t) not in earlier pairs:
            Q(S_t, A_t) <- average of all returns following (S_t, A_t)
            pi(S_t) <- argmax_a Q(S_t, a)

6.3 Off-Policy MC (Importance Sampling)

Learn about policy \(\pi\) (target) while following policy \(b\) (behavior):

\[\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k|S_k)}{b(A_k|S_k)}\]
\[V(s) \leftarrow V(s) + \alpha \left[\rho_{t:T-1} G_t - V(s)\right]\]

7. Temporal Difference Learning (Preview)

TD methods combine ideas from MC and DP: they learn from episodes like MC, but bootstrap like DP.

7.1 TD(0) Prediction

The TD update rule:

\[V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]\]

The term \(\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\) is the TD error.

7.2 TD vs. MC vs. DP

Property DP MC TD
Model required? Yes (P, R) No No
Bootstrapping? Yes No Yes
Updates per step All states End of episode Each step
Convergence Exact Asymptotic Asymptotic
Handling non-stationarity No Yes Yes

7.3 SARSA and Q-Learning (Preview)

SARSA (on-policy TD control):

\[Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]\]

Q-Learning (off-policy TD control):

\[Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right]\]

Detailed coverage

See the Temporal Difference Learning page for full implementations and comparisons.


8. The RL Algorithm Landscape

                    Reinforcement Learning
                           |
            +--------------+--------------+
            |              |              |
     Model-Free       Model-Based      Imitation
            |              |          Learning
    +-------+-------+      |
    |               |    World Models
 Value-Based   Policy-Based   Dyna, MBPO
    |               |
    +-- Q-Learning  +-- REINFORCE
    +-- SARSA       +-- Actor-Critic
    +-- DQN         +-- PPO, TRPO
    +-- Double DQN  +-- SAC, TD3
Algorithm Type On/Off-Policy Bootstrapping Model
Value Iteration Value -- Yes Yes
MC Control Value On-policy No No
Q-Learning Value Off-policy Yes No
SARSA Value On-policy Yes No
DQN Value Off-policy Yes No
REINFORCE Policy On-policy No No
A2C/A3C Actor-Critic On-policy Yes No
PPO Actor-Critic On-policy Yes No
SAC Actor-Critic Off-policy Yes No
DDPG Actor-Critic Off-policy Yes No

9. Hands-On Project: Grid World with Gymnasium

Build a complete RL project from scratch: define a Grid World environment, solve it with Q-learning, and visualize the results.

import numpy as np
import gymnasium as gym
from gymnasium import spaces
import matplotlib.pyplot as plt

class GridWorldEnv(gym.Env):
    """A simple Grid World with obstacles and a goal."""
    metadata = {"render_modes": ["human"]}

    def __init__(self, size=5):
        super().__init__()
        self.size = size
        self.action_space = spaces.Discrete(4)
        self.observation_space = spaces.Discrete(size * size)
        self.actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]
        self.obstacles = [(1, 1), (1, 3), (3, 1), (3, 3)]
        self.goal = (size - 1, size - 1)
        self.start = (0, 0)

    def reset(self, seed=None):
        super().reset(seed=seed)
        self.agent_pos = self.start
        return self._pos_to_state(self.agent_pos), {}

    def step(self, action):
        dr, dc = self.actions[action]
        r, c = self.agent_pos
        nr, nc = r + dr, c + dc
        if 0 <= nr < self.size and 0 <= nc < self.size and (nr, nc) not in self.obstacles:
            self.agent_pos = (nr, nc)
        reward = -1.0
        terminated = self.agent_pos == self.goal
        if terminated:
            reward = 10.0
        return self._pos_to_state(self.agent_pos), reward, terminated, False, {}

    def _pos_to_state(self, pos):
        return pos[0] * self.size + pos[1]


def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-Learning for the Grid World."""
    n_states = env.observation_space.n
    n_actions = env.action_space.n
    Q = np.zeros((n_states, n_actions))
    rewards_per_episode = []

    for ep in range(episodes):
        state, _ = env.reset()
        total_reward = 0
        done = False
        while not done:
            if np.random.random() < epsilon:
                action = env.action_space.sample()
            else:
                action = np.argmax(Q[state])
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            Q[state, action] += alpha * (
                reward + gamma * np.max(Q[next_state]) * (1 - terminated) - Q[state, action]
            )
            state = next_state
            total_reward += reward
        rewards_per_episode.append(total_reward)
    return Q, rewards_per_episode


def visualize_policy(Q, env):
    """Visualize the learned policy and value function."""
    arrows = ['^', 'v', '<', '>']
    V = np.max(Q, axis=1).reshape(env.size, env.size)
    policy = np.argmax(Q, axis=1).reshape(env.size, env.size)
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
    V_display = V.copy()
    for obs in env.obstacles:
        V_display[obs] = np.nan
    im = ax1.imshow(V_display, cmap='YlOrRd', interpolation='nearest')
    ax1.set_title('Value Function V(s)')
    plt.colorbar(im, ax=ax1)
    ax2.imshow(np.ones((env.size, env.size)), cmap='Greys', alpha=0.1)
    for r in range(env.size):
        for c in range(env.size):
            if (r, c) in env.obstacles:
                ax2.add_patch(plt.Rectangle((c-0.4, r-0.4), 0.8, 0.8, color='gray'))
            elif (r, c) == env.goal:
                ax2.text(c, r, 'G', ha='center', va='center', fontsize=16, fontweight='bold')
            else:
                ax2.text(c, r, arrows[policy[r, c]], ha='center', va='center', fontsize=18)
    ax2.set_title('Optimal Policy')
    plt.tight_layout(); plt.savefig('grid_world_solution.png', dpi=150); plt.show()


# Run the complete project
env = GridWorldEnv(size=5)
Q, rewards = q_learning(env, episodes=500)
print(f"Average reward (last 100 episodes): {np.mean(rewards[-100:]):.2f}")
visualize_policy(Q, env)

10. RL for Robotics: Why Fundamentals Matter

Understanding RL fundamentals is critical for robotics because:

  1. Reward Design: The Bellman equation tells us how reward shaping affects optimal policies (potential-based shaping preserves optimality).
  2. Sample Efficiency: Understanding bootstrapping (TD) vs. full returns (MC) helps choose the right algorithm for expensive real-world interactions.
  3. Exploration in Safety-Critical Systems: Multi-armed bandit theory informs safe exploration strategies.
  4. Sim-to-Real Transfer: DP and model-based methods connect to system identification and model predictive control.

Robotics-Specific Challenges

Challenge RL Concept Solution
Continuous states/actions Function approximation Deep RL (DQN, PPO, SAC)
Sparse rewards Credit assignment Reward shaping, HER
Safety constraints Constrained MDP Safe RL, Lagrangian methods
Sample inefficiency Model-based RL World models, Dyna
Sim-to-real gap Domain randomization System identification

Exercises

Exercise 1: Bandit Implementation

Implement the 10-armed bandit testbed and compare epsilon-greedy (eps=0, 0.01, 0.1) and UCB (c=2). Plot average rewards over 1000 steps.

Exercise 2: Policy Evaluation

Given a 4x4 grid world with random policy, implement iterative policy evaluation by hand. Verify the result matches the analytical solution.

Exercise 3: Value Iteration

Extend the grid world to 10x10 with random obstacles. Implement value iteration and find the optimal policy.

Exercise 4: MC vs. TD

Implement both MC prediction and TD(0) prediction for a random walk environment. Compare their learning curves.

Exercise 5: Q-Learning with Gymnasium

Use Gymnasium's FrozenLake-v1 environment to train a Q-learning agent. Experiment with different epsilon, alpha, and gamma values.

Exercise 6: Bellman Equation Proof

Prove that the optimal value function \(V^*\) satisfies the Bellman optimality equation:

\[V^*(s) = \max_a \sum_{s'} P(s'|s,a) [R(s,a,s') + \gamma V^*(s')]\]

Hint: Use the policy improvement theorem and the fact that \(V^* = \max_\pi V^\pi\).


References