Reinforcement Learning for Manipulation¶
Policy Gradient Methods¶
REINFORCE¶
The policy gradient is estimated from sampled trajectories, where \(G_t\) is the return from time step \(t\):
\[\nabla_\theta J(\theta) = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]\]
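The estimator above can be sketched in NumPy on a toy two-armed bandit (a hypothetical stand-in for a one-step manipulation task; the arm means and learning rate are illustrative assumptions, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: a 2-armed bandit. Arm 0 pays ~0.2 on average, arm 1 pays ~0.8.
ARM_MEANS = np.array([0.2, 0.8])

def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def reinforce(episodes=2000, lr=0.1):
    theta = np.zeros(2)  # policy logits (one state, so no state input)
    for _ in range(episodes):
        probs = softmax(theta)
        a = rng.choice(2, p=probs)
        G = rng.normal(ARM_MEANS[a], 0.1)  # return of this one-step episode
        # grad of log pi(a) for a softmax policy: one_hot(a) - probs
        grad_log_pi = -probs
        grad_log_pi[a] += 1.0
        theta += lr * grad_log_pi * G  # ascend E[grad log pi * G]
    return softmax(theta)

final_probs = reinforce()
```

After training, the policy concentrates probability on the higher-paying arm, which is exactly the direction the score-function estimator pushes.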
PPO (Proximal Policy Optimization)¶
PPO stabilizes learning by clipping the surrogate objective, where \(r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)\) is the probability ratio and \(A_t\) is the advantage estimate:
\[L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\, A_t,\ \text{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\, A_t\right)\right]\]
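A minimal NumPy sketch of the clipped surrogate, computed elementwise over a batch of ratios and advantages (the sample values below are illustrative):

```python
import numpy as np

def clipped_surrogate(ratio, adv, eps=0.2):
    """PPO clipped objective, averaged over the batch.

    ratio: pi_theta(a|s) / pi_theta_old(a|s)
    adv:   advantage estimates A_t
    """
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    # min() removes the incentive to move the ratio far outside [1-eps, 1+eps]
    return np.minimum(unclipped, clipped).mean()

# Positive advantage: the gain from ratio > 1+eps is clipped away.
gain = clipped_surrogate(np.array([1.5]), np.array([2.0]))   # 1.2 * 2.0 = 2.4
# Negative advantage: min keeps the unclipped (more pessimistic) value.
loss = clipped_surrogate(np.array([1.5]), np.array([-2.0]))  # 1.5 * -2.0 = -3.0
```

Note the asymmetry: clipping caps the reward for moving the policy in the beneficial direction, but the `min` still exposes the full penalty when the update hurts, which is what keeps the step conservative.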
Model-Based RL¶
Learn a dynamics model of the environment and use it for planning, iterating:
- Collect transition data \((s, a, s')\) by interacting with the environment
- Fit a dynamics model \(\hat{s}' = f_\phi(s, a)\) to the data
- Plan actions with the learned model (e.g., model-predictive control)
- Execute the plan, then refine the model with the newly collected data
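The loop above can be sketched end to end with NumPy on a hypothetical 1D system (the true dynamics, the linear model class, and the one-step random-shooting planner are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical environment with unknown-to-the-agent dynamics s' = 0.9 s + 0.5 a.
def true_step(s, a):
    return 0.9 * s + 0.5 * a

def fit_dynamics(S, A, S_next):
    # Fit s' ~ w_s * s + w_a * a by least squares.
    X = np.stack([S, A], axis=1)
    w, *_ = np.linalg.lstsq(X, S_next, rcond=None)
    return w  # (w_s, w_a)

def plan(w, s, goal, n_candidates=256):
    # One-step random-shooting planner: sample candidate actions, predict
    # the next state with the learned model, keep the best candidate.
    candidates = rng.uniform(-1.0, 1.0, n_candidates)
    preds = w[0] * s + w[1] * candidates
    return candidates[np.argmin((preds - goal) ** 2)]

# 1) Collect data with random actions
S = rng.uniform(-1.0, 1.0, 200)
A = rng.uniform(-1.0, 1.0, 200)
S_next = true_step(S, A)
# 2) Fit the dynamics model
w = fit_dynamics(S, A, S_next)
# 3-4) Plan with the learned model, execute in the real system, repeat
s, goal = 1.0, 0.0
for _ in range(10):
    a = plan(w, s, goal)
    s = true_step(s, a)
```

Because the data here are noise-free and the model class contains the true dynamics, the fit recovers the coefficients almost exactly and the planner drives the state to the goal; real manipulation systems need richer models (e.g., ensembles of neural networks) and longer planning horizons.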
Sim-to-Real Transfer¶
- Domain randomization: train across simulators with randomized parameters so the policy generalizes to the real system
- System identification: calibrate simulator parameters against real-world measurements
- Progressive fine-tuning: start from the sim-trained policy and fine-tune on limited real-world data
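Domain randomization amounts to resampling simulator parameters each episode; a minimal sketch (the parameter names and ranges below are hypothetical, chosen per task in practice):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sim_params():
    # Hypothetical randomization ranges for a manipulation simulator.
    return {
        "mass": rng.uniform(0.5, 1.5),       # object mass (kg)
        "friction": rng.uniform(0.3, 1.0),   # contact friction coefficient
        "obs_delay": int(rng.integers(0, 3)) # observation latency (steps)
    }

# Each training episode runs in a differently-randomized simulator, so the
# policy must succeed across the whole range rather than exploit one instance.
params_per_episode = [sample_sim_params() for _ in range(1000)]
masses = np.array([p["mass"] for p in params_per_episode])
```

If the real system's parameters fall inside the randomized ranges, the real world looks like just another sampled domain to the trained policy.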