Temporal-Difference Learning
TD learning methods: TD(0), SARSA, Q-learning, Expected SARSA, n-step TD, TD(λ) with eligibility traces, and their convergence properties.
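At the heart of all of these methods is the TD(0) update V(S) ← V(S) + α[R + γV(S′) − V(S)]. As a preview, here is a minimal sketch of that update on a small random-walk chain; the chain, reward scheme, and parameter values are illustrative assumptions, not taken from this chapter:

```python
import random

random.seed(0)

N = 5                      # non-terminal states 1..5; 0 and 6 are terminal
alpha, gamma = 0.1, 1.0    # step size and discount (assumed values)
V = [0.0] * (N + 2)        # V[0] and V[N+1] stay 0 (terminal states)

for episode in range(1000):
    s = (N + 1) // 2       # start each episode in the middle state
    while 0 < s <= N:
        s_next = s + random.choice((-1, 1))   # equiprobable random walk
        r = 1.0 if s_next == N + 1 else 0.0   # +1 only at the right terminal
        # TD(0) update: V(S) <- V(S) + alpha * [R + gamma * V(S') - V(S)]
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = s_next

# For this chain the true values are i/6 for state i, so the
# estimates should increase from left to right toward ~5/6.
print([round(v, 2) for v in V[1:N + 1]])
```

The estimates converge toward the true values i/6 only in expectation; with a constant step size they keep fluctuating around them, a point the convergence section below makes precise.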
Learning Objectives
1. From DP to Model-Free Learning
2. TD(0) Prediction
2.1 The TD Update Rule
2.2 TD vs. Monte Carlo
3. SARSA (On-Policy Control)
3.1 Algorithm
3.2 Python Implementation: Cliff Walking
4. Q-Learning (Off-Policy Control)
4.1 Algorithm
4.2 Python Implementation
4.3 SARSA vs. Q-Learning Comparison
5. Expected SARSA
6. Multi-Step TD Methods
6.1 n-Step Returns
6.2 n-Step SARSA
7. TD(λ) and Eligibility Traces
7.1 Forward View
7.2 Backward View with Eligibility Traces
8. Convergence Properties
Exercises
References