User:Philip.Zman/sandbox

Proximal Policy Optimization is a family of model-free reinforcement learning algorithms for learning a mapping from states to actions, also referred to as a policy. Similarly to Trust Region Policy Optimization they are based on the policy gradient algorithm and try to extend it by allowing multiple gradient update steps per data sample. It was proposed in 2017 by OpenAI scientists.

Algorithm

 * $$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha \, [r_{t} + \gamma \, Q(s_{t+1}, a_{t+1})-Q(s_t,a_t)]$$

Test