Poster

Policy improvement by planning with Gumbel

Ivo Danihelka ⋅ Arthur Guez ⋅ Julian Schrittwieser ⋅ David Silver

Keywords: MuZero, reinforcement learning

2022 Poster


Abstract

AlphaZero is a powerful reinforcement learning algorithm based on approximate policy iteration and tree search. However, AlphaZero can fail to improve its policy network if it does not visit all actions at the root of the search tree. To address this issue, we propose a policy improvement algorithm based on sampling actions without replacement. Furthermore, we use the idea of policy improvement to replace the more heuristic mechanisms by which AlphaZero selects and uses actions, both at root nodes and at non-root nodes. Our new algorithms, Gumbel AlphaZero (without model learning) and Gumbel MuZero (with model learning), match the state of the art on Go, chess, and Atari, and significantly improve on prior performance when planning with few simulations.
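
To make "sampling actions without replacement" concrete, here is a minimal sketch of the Gumbel-Top-k trick, which the title suggests is the underlying mechanism: add independent Gumbel(0, 1) noise to each policy logit and keep the k largest perturbed values. This is an illustration under stated assumptions, not the authors' implementation. The names `sample_gumbel` and `gumbel_top_k` and the example logits are hypothetical, and the sketch covers only the root-action sampling step, not the rest of the search.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one standard Gumbel(0, 1) variate via inverse transform sampling."""
    u = rng.random()
    # random() returns values in [0.0, 1.0); redraw on 0.0, where log(u) is undefined.
    while u == 0.0:
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_top_k(logits, k, rng):
    """Return k distinct action indices, sampled without replacement.

    Perturbing each logit with independent Gumbel noise and keeping the k
    largest perturbed values is equivalent to drawing k actions one at a time,
    without replacement, from softmax(logits).
    """
    if not 0 < k <= len(logits):
        raise ValueError("k must be between 1 and the number of actions")
    perturbed = [logit + sample_gumbel(rng) for logit in logits]
    # Sort action indices by perturbed value, largest first.
    order = sorted(range(len(logits)), key=lambda a: perturbed[a], reverse=True)
    return order[:k]


rng = random.Random(0)
logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical policy-network outputs
actions = gumbel_top_k(logits, k=3, rng=rng)
```

Because the returned indices are always distinct, a budget of k simulations at the root covers k different actions instead of possibly revisiting the same one, which is the property the abstract relies on when planning with few simulations.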
