“Policy Improvement by Planning With Gumbel”, 2022-03-04 (; similar):
AlphaZero is a powerful reinforcement learning algorithm based on approximate policy iteration and tree search. However, AlphaZero can fail to improve its policy network, if not visiting all actions at the root of a search tree.
To address this issue, we propose a policy improvement algorithm based on sampling actions without replacement [see et al 2021 for more on Gumbel tricks]. Furthermore, we use the idea of policy improvement to replace the more heuristic mechanisms by which AlphaZero selects and uses actions, both at root nodes and at non-root nodes.
Our new algorithms, Gumbel AlphaZero & Gumbel MuZero, respectively without and with model-learning, match the state-of-the-art on Go, chess, and Atari, and improve prior performance when planning with few simulations.
[Keywords: AlphaZero, MuZero, reinforcement learning]