Temporal-Difference Learning
TD learning methods: TD(0), SARSA, Q-learning, Expected SARSA, n-step TD, TD(λ) with eligibility traces, and their convergence properties.
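At the heart of all of these methods is the TD(0) update V(S) ← V(S) + α[R + γV(S′) − V(S)]. As a preview, here is a minimal sketch of that update on a small random-walk chain; the chain, reward scheme, and parameter values are illustrative assumptions, not taken from this chapter:

```python
import random

random.seed(0)

N = 5                      # non-terminal states 1..5; 0 and 6 are terminal
alpha, gamma = 0.1, 1.0    # step size and discount (assumed values)
V = [0.0] * (N + 2)        # V[0] and V[N+1] stay 0 (terminal states)

for episode in range(1000):
    s = (N + 1) // 2       # start each episode in the middle state
    while 0 < s <= N:
        s_next = s + random.choice((-1, 1))   # equiprobable random walk
        r = 1.0 if s_next == N + 1 else 0.0   # +1 only at the right terminal
        # TD(0) update: V(S) <- V(S) + alpha * [R + gamma * V(S') - V(S)]
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = s_next

# For this chain the true values are i/6 for state i, so the
# estimates should increase from left to right toward ~5/6.
print([round(v, 2) for v in V[1:N + 1]])
```

The estimates converge toward the true values i/6 only in expectation; with a constant step size they keep fluctuating around them, a point the convergence section below makes precise.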
Learning Objectives
1. From DP to Model-Free Learning
2. TD(0) Prediction
2.1 The TD Update Rule
2.2 TD vs. Monte Carlo
3. SARSA (On-Policy Control)
3.1 Algorithm
3.2 Python Implementation: Cliff Walking
4. Q-Learning (Off-Policy Control)
4.1 Algorithm
4.2 Python Implementation
4.3 SARSA vs. Q-Learning Comparison
5. Expected SARSA
6. Multi-Step TD Methods
6.1 n-Step Returns
6.2 n-Step SARSA
7. TD(λ) and Eligibility Traces
7.1 Forward View
7.2 Backward View with Eligibility Traces
8. Convergence Properties
Exercises
References