
Temporal-Difference Learning

TD learning methods: TD(0), SARSA, Q-learning, Expected SARSA, n-step TD, TD(λ) with eligibility traces, and their convergence properties.
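As a preview of the TD(0) prediction method covered below, here is a minimal sketch on a 5-state random walk (a standard illustrative task, not from this document): states 1–5 with terminals at 0 and 6, a uniform random policy, and reward +1 only on reaching the right terminal. The function name and environment setup are my own illustrative choices.

```python
import random

def td0_random_walk(episodes=5000, alpha=0.1, gamma=1.0, seed=0):
    """TD(0) prediction on a 5-state random walk.

    States 1..5 are non-terminal; 0 and 6 are terminal.
    Reward is +1 on reaching state 6, otherwise 0.
    The policy moves left or right with equal probability.
    """
    rng = random.Random(seed)
    V = [0.0] * 7  # V[0] and V[6] stay 0 (terminal states)
    for _ in range(episodes):
        s = 3  # every episode starts in the middle state
        while s not in (0, 6):
            s_next = s + (1 if rng.random() < 0.5 else -1)
            r = 1.0 if s_next == 6 else 0.0
            # TD(0) update: V(s) <- V(s) + alpha * [r + gamma*V(s') - V(s)]
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V
```

The true values for this task are V(s) = s/6, so the learned estimates should approach 1/6, 2/6, …, 5/6; with a constant step size they keep fluctuating around those targets rather than converging exactly, which is the trade-off Section 8 discusses.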

Learning Objectives

1. From DP to Model-Free Learning

2. TD(0) Prediction

2.1 The TD Update Rule

2.2 TD vs. Monte Carlo

3. SARSA (On-Policy Control)

3.1 Algorithm

3.2 Python Implementation: Cliff Walking

4. Q-Learning (Off-Policy Control)

4.1 Algorithm

4.2 Python Implementation

4.3 SARSA vs. Q-Learning Comparison

5. Expected SARSA

6. Multi-Step TD Methods

6.1 n-Step Returns

6.2 n-Step SARSA

7. TD(λ) and Eligibility Traces

7.1 Forward View

7.2 Backward View with Eligibility Traces

8. Convergence Properties

Exercises

References