“Muesli: Combining Improvements in Policy Optimization”, 2021-04-13:
We propose a novel policy update that combines regularized policy optimization with model learning as an auxiliary loss. The update (henceforth Muesli) matches MuZero’s state-of-the-art performance on Atari.
Notably, Muesli does so without using deep search: it acts directly with a policy network and has computation speed comparable to model-free baselines.
The Atari results are complemented by extensive ablations, and by additional results on continuous control and 9×9 Go.
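As a rough illustration of the kind of regularized policy update the abstract describes, here is a minimal NumPy sketch of a policy loss that combines a clipped policy-gradient term with a KL regularizer toward a prior policy reweighted by clipped advantages (a CMPO-style target). All function names, the `lam` and `clip` parameters, and the simplifications (single state, discrete actions, no model or auxiliary losses) are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def cmpo_target(prior_probs, advantages, clip=1.0):
    """Sketch of a CMPO-style target: prior reweighted by exp(clipped advantage)."""
    logits = np.log(prior_probs) + np.clip(advantages, -clip, clip)
    z = np.exp(logits - logits.max())  # subtract max for numerical stability
    return z / z.sum()

def regularized_policy_loss(policy_probs, prior_probs, advantages,
                            action, lam=1.0, clip=1.0):
    """Clipped policy-gradient term plus a KL pull toward the CMPO target."""
    # Policy-gradient term for the sampled action, with a clipped advantage
    pg = -np.clip(advantages[action], -clip, clip) * np.log(policy_probs[action])
    # KL(target || policy) regularizer
    target = cmpo_target(prior_probs, advantages, clip)
    kl = np.sum(target * (np.log(target) - np.log(policy_probs)))
    return pg + lam * kl
```

With zero advantages and a policy equal to the prior, both terms vanish, so the loss is zero; nonzero advantages pull the policy toward actions the target upweights. The real method additionally trains a model as an auxiliary loss, which this sketch omits.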