Advanced Policy Optimization
Modern policy gradient methods: DDPG, TD3, TRPO, PPO, and SAC. Covers theory, implementation details, and practical training tips.
Learning Objectives
1. From REINFORCE to Actor-Critic
2. Deep Deterministic Policy Gradient (DDPG)
2.1 Off-Policy Actor-Critic
2.2 PyTorch Implementation
3. Twin Delayed DDPG (TD3)
3.1 Clipped Double-Q
3.2 Delayed Policy Updates
3.3 Target Policy Smoothing
4. Trust Region Policy Optimization (TRPO)
4.1 The Trust Region Idea
4.2 KL Divergence Constraint
4.3 Conjugate Gradient
5. Proximal Policy Optimization (PPO)
5.1 Clipped Surrogate Objective
5.2 PPO-Clip Algorithm
5.3 PyTorch Implementation
6. Soft Actor-Critic (SAC)
6.1 Maximum Entropy RL
6.2 Automatic Temperature Tuning
6.3 PyTorch Implementation
7. Algorithm Comparison & Selection Guide
8. Practical Training Tips
Exercises
References