“Policy Improvement by Planning With Gumbel”, Ivo Danihelka, Arthur Guez, Julian Schrittwieser, David Silver2022-03-04 (, , ; similar)⁠:

AlphaZero is a powerful reinforcement learning algorithm based on approximate policy iteration and tree search. However, AlphaZero can fail to improve its policy network, if not visiting all actions at the root of a search tree.

To address this issue, we propose a policy improvement algorithm based on sampling actions without replacement [see Huijben et al 2021 for more on Gumbel tricks]. Furthermore, we use the idea of policy improvement to replace the more heuristic mechanisms by which AlphaZero selects and uses actions, both at root nodes and at non-root nodes.

Our new algorithms, Gumbel AlphaZero & Gumbel MuZero, respectively without and with model-learning, match the state-of-the-art on Go, chess, and Atari, and improve prior performance when planning with few simulations.

[Keywords: AlphaZero, MuZero, reinforcement learning]