Temporal Difference Learning
TD learning methods: TD(0), SARSA, Q-learning, Expected SARSA, n-step TD, and TD(λ) with eligibility traces, along with their convergence properties.
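Before the detailed sections, here is a minimal sketch of the TD(0) update that the rest of the chapter builds on. The toy environment (a 5-state random walk with a +1 reward on the right exit) and the parameter values are illustrative assumptions, not taken from this outline.

```python
import random

random.seed(0)
N_STATES = 5          # non-terminal states 0..4; terminals lie off each end
alpha, gamma = 0.1, 1.0
V = [0.0] * N_STATES  # state-value estimates

for episode in range(5000):
    s = 2  # start in the middle state
    while True:
        s_next = s + random.choice([-1, 1])  # random-walk policy
        if s_next < 0:            # left terminal: reward 0
            r, v_next, done = 0.0, 0.0, True
        elif s_next >= N_STATES:  # right terminal: reward +1
            r, v_next, done = 1.0, 0.0, True
        else:
            r, v_next, done = 0.0, V[s_next], False
        # TD(0) update: V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]
        V[s] += alpha * (r + gamma * v_next - V[s])
        if done:
            break
        s = s_next

print([round(v, 2) for v in V])
```

For this walk the true values are 1/6, 2/6, ..., 5/6 from left to right, and with a small constant step size the estimates settle near those values while continuing to fluctuate, which previews the step-size conditions discussed under convergence in Section 8.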
Learning Objectives
1. From DP to Model-Free Learning
2. TD(0) Prediction
2.1 The TD Update Rule
2.2 TD vs. Monte Carlo
3. SARSA (On-Policy Control)
3.1 Algorithm
3.2 Python Implementation: Cliff Walking
4. Q-Learning (Off-Policy Control)
4.1 Algorithm
4.2 Python Implementation
4.3 SARSA vs. Q-Learning Comparison
5. Expected SARSA
6. Multi-Step TD Methods
6.1 n-Step Returns
6.2 n-Step SARSA
7. TD(λ) and Eligibility Traces
7.1 Forward View
7.2 Backward View with Eligibility Traces
8. Convergence Properties
Exercises
References