“Relative Entropy Regularized Policy Iteration”, Abbas Abdolmaleki, Jost Tobias Springenberg, Jonas Degrave, Steven Bohez, Yuval Tassa, Dan Belov, Nicolas Heess, Martin Riedmiller, 2018-12-05:

We present an off-policy actor-critic algorithm for Reinforcement Learning (RL) that combines ideas from gradient-free optimization via stochastic search with a learned action-value function. The result is a simple procedure consisting of 3 steps: (1) policy evaluation by estimating a parametric action-value function; (2) policy improvement via the estimation of a local non-parametric policy; and (3) generalization by fitting a parametric policy. Each step can be implemented in different ways, giving rise to several algorithm variants. Our algorithm draws on connections to existing literature on black-box optimization and ‘RL as inference’ and it can be seen either as an extension of the Maximum a Posteriori Policy Optimization algorithm (MPO) [Abdolmaleki et al 2018a], or as an extension of Trust Region Covariance Matrix Adaptation Evolutionary Strategy (CMA-ES) [Abdolmaleki et al 2017b; Hansen et al 1997] to a policy iteration scheme.
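A minimal sketch of the three-step loop described in the abstract, shown on a toy 1-D problem; this is not the authors' implementation, and names such as `temperature`, `env_reward`, and the bandit-style setup are illustrative assumptions. The full algorithm would regress a parametric Q(s, a) in step 1 and fit a neural-network policy in step 3.

```python
import numpy as np

rng = np.random.default_rng(0)

def env_reward(action):
    """Toy deterministic reward: maximized at action = 1.5 (assumed example task)."""
    return -(action - 1.5) ** 2

# Parametric Gaussian policy: mean and log-std.
mean, log_std = 0.0, np.log(1.0)

for iteration in range(50):
    # Sample actions from the current parametric policy.
    actions = mean + np.exp(log_std) * rng.standard_normal(256)
    rewards = env_reward(actions)

    # (1) Policy evaluation: here the "action-value" is just the observed return;
    #     the paper instead estimates a parametric action-value function.
    q_values = rewards

    # (2) Policy improvement: a local non-parametric policy obtained by
    #     re-weighting sampled actions with exponentiated values
    #     (a relative-entropy-regularized update; `temperature` is an assumed constant).
    temperature = 1.0
    weights = np.exp((q_values - q_values.max()) / temperature)
    weights /= weights.sum()

    # (3) Generalization: fit the parametric Gaussian to the re-weighted samples
    #     by weighted maximum likelihood (closed form for a Gaussian).
    mean = np.sum(weights * actions)
    var = np.sum(weights * (actions - mean) ** 2)
    log_std = 0.5 * np.log(var + 1e-8)

print(f"final policy mean ≈ {mean:.3f} (optimum is 1.5)")
```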

Our comparison on 31 continuous control tasks from the parkour suite [Heess et al 2017], the DeepMind control suite [Tassa et al 2018] and OpenAI Gym [Brockman et al 2016], with diverse properties, a limited amount of compute and a single set of hyperparameters, demonstrates the effectiveness of our method and state-of-the-art results.

Videos summarizing the results can be found at our homepage.