
Dynamic Programming

Exact solution methods for a known MDP: policy evaluation, policy iteration, value iteration, and their convergence properties. These form the theoretical foundation of all RL algorithms.
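As a taste of what this chapter covers, here is a minimal sketch of iterative policy evaluation on a hypothetical two-state MDP (this is an illustrative toy, not the chapter's grid-world implementation; the transition matrix `P`, reward vector `R`, and discount `gamma` are invented for the example):

```python
import numpy as np

# Hypothetical toy MDP under a fixed policy:
# P[s, s'] is the state-transition probability, R[s] the expected reward.
P = np.array([[0.9, 0.1],
              [0.0, 1.0]])   # state 1 is absorbing
R = np.array([1.0, 0.0])
gamma = 0.9                  # discount factor

def policy_evaluation(P, R, gamma, tol=1e-8):
    """Repeatedly apply the Bellman expectation backup
    v <- R + gamma * P @ v until the sup-norm change is below tol."""
    v = np.zeros(len(R))
    while True:
        v_new = R + gamma * P @ v
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

v = policy_evaluation(P, R, gamma)
```

Because the backup is a gamma-contraction in the sup-norm, the loop converges to the unique fixed point of the Bellman expectation equation, a fact proved in Section 2.2.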

Learning Objectives

1. From MDP to Dynamic Programming

1.1 When Can We Use DP?

1.2 The Curse of Dimensionality

2. Policy Evaluation (Prediction)

2.1 Iterative Policy Evaluation

2.2 Convergence

3. Policy Iteration

3.1 Policy Improvement Theorem

3.2 Full Algorithm

4. Value Iteration

4.1 Bellman Optimality Backup

4.2 Full Algorithm

5. Asynchronous DP

6. Generalized Policy Iteration (GPI)

7. Python Implementation: Grid World

Exercises

References