“AlphaGo Zero: Mastering the Game of Go without Human Knowledge”, David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, Demis Hassabis (2017-10-19):

A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from self-play.

Here we introduce an algorithm based solely on reinforcement learning, without human data, guidance or domain knowledge beyond game rules [expert iteration]. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo’s own move selections and also the winner of AlphaGo’s games. This neural network improves the strength of the tree search, resulting in higher quality move selection and stronger self-play in the next iteration.

Starting tabula rasa, our new program AlphaGo Zero achieved superhuman performance, winning 100–0 against the previously published, champion-defeating AlphaGo.
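The training signal described above (predicting AlphaGo's own move selections and the game winner) corresponds to the combined loss given in the paper: a value regression term plus a policy cross-entropy term against the search probabilities, with L2 regularization. A minimal NumPy sketch, with the constant `c` and the toy inputs chosen here purely for illustration:

```python
import numpy as np

def alphago_zero_loss(p, v, pi, z, theta, c=1e-4):
    """Per-position loss l = (z - v)^2 - pi^T log(p) + c * ||theta||^2,
    where (p, v) are the network's move probabilities and value prediction,
    pi is the MCTS visit-count distribution, and z the game outcome (+/-1)."""
    value_loss = (z - v) ** 2
    policy_loss = -np.dot(pi, np.log(p + 1e-12))  # cross-entropy vs. search probabilities
    l2 = c * np.sum(theta ** 2)
    return value_loss + policy_loss + l2

# Toy example: 3 legal moves, the search prefers move 0, the game was won (z = +1).
p = np.array([0.5, 0.3, 0.2])
pi = np.array([0.7, 0.2, 0.1])
loss = alphago_zero_loss(p, v=0.6, pi=pi, z=1.0, theta=np.zeros(10))
```

Minimizing this loss pushes the raw network's policy toward the (stronger) search-improved policy, which is what lets each iteration bootstrap on the last.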

Figure 6: Performance of AlphaGo Zero. (a) Learning curve for AlphaGo Zero using a larger 40-block residual network over 40 days. The plot shows the performance of each player αθi from each iteration i of our reinforcement learning algorithm. Elo ratings were computed from evaluation games between different players, using 0.4s per search (see Methods). (b) Final performance of AlphaGo Zero. AlphaGo Zero was trained for 40 days using a 40-block residual neural network. The plot shows the results of a tournament between: AlphaGo Zero, AlphaGo Master (defeated top human professionals 60–0 in online games), AlphaGo Lee (defeated Lee Sedol), AlphaGo Fan (defeated Fan Hui), as well as previous Go programs Crazy Stone, Pachi and GNU Go. Each program was given 5s of thinking time per move. AlphaGo Zero and AlphaGo Master played on a single machine on the Google Cloud; AlphaGo Fan and AlphaGo Lee were distributed over many machines. The raw neural network from AlphaGo Zero is also included, which directly selects the move a with maximum probability pa, without using tree search. Programs were evaluated on an Elo scale: a 200-point gap corresponds to a 75% probability of winning.

Figure 6b shows the performance of each program on an Elo scale. The raw neural network, without using any lookahead, achieved an Elo rating of 3,055. AlphaGo Zero achieved a rating of 5,185, compared to 4,858 for AlphaGo Master, 3,739 for AlphaGo Lee and 3,144 for AlphaGo Fan.
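The quoted Elo gaps can be translated into win probabilities directly: under the standard logistic Elo model, a rating gap of `d` points implies a win probability of 1/(1 + 10^(−d/400)), so a 200-point gap gives roughly the 75% stated in the figure caption:

```python
def elo_win_prob(d):
    """Expected win probability of the higher-rated player under the
    standard logistic Elo model, given a rating gap of d points."""
    return 1.0 / (1.0 + 10.0 ** (-d / 400.0))

elo_win_prob(200)          # ~0.76, roughly the 75% quoted for a 200-point gap
elo_win_prob(5185 - 4858)  # implied edge of AlphaGo Zero over AlphaGo Master
```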

…Finally, it uses a simpler tree search [PUCT on 4 TPUs for a few ply] that relies upon this single neural network to evaluate positions and sample moves, without performing any Monte-Carlo rollouts [so not really MCTS at all]…we chose to use the simplest possible search algorithm. [cf. Jones2021]
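The PUCT rule referenced above selects, at each tree node, the action maximizing Q(s,a) + U(s,a), where the exploration bonus U is proportional to the network's prior and decays with visit count; leaf positions are scored by the value network rather than by rollouts. A minimal sketch, with `c_puct` a tunable constant and the example inputs invented for illustration:

```python
import math

def puct_select(priors, visit_counts, q_values, c_puct=1.0):
    """Select the action maximizing Q(s,a) + U(s,a), with
    U(s,a) = c_puct * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a))."""
    total_visits = sum(visit_counts)
    best_action, best_score = None, -float("inf")
    for a, (p, n, q) in enumerate(zip(priors, visit_counts, q_values)):
        u = c_puct * p * math.sqrt(total_visits) / (1 + n)
        if q + u > best_score:
            best_action, best_score = a, q + u
    return best_action

# An as-yet-unvisited move with a high prior receives a large exploration
# bonus, so it is tried before the already well-explored move 0.
puct_select(priors=[0.6, 0.3, 0.1], visit_counts=[10, 0, 2], q_values=[0.5, 0.0, 0.1])
```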

Figure 3b: Prediction accuracy on human professional moves. The plot shows the accuracy of the neural network at each iteration of self-play, in predicting human professional moves. The accuracy measures the percentage of positions in which the neural network assigns the highest probability to the human move.
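The accuracy metric in Figure 3b is plain top-1 agreement: the fraction of positions where the network's highest-probability move equals the move the professional actually played. A minimal sketch with made-up probabilities:

```python
import numpy as np

def top1_accuracy(move_probs, human_moves):
    """Fraction of positions where the network's argmax move matches the
    human professional's move (the metric plotted in Figure 3b)."""
    predicted = np.argmax(move_probs, axis=1)
    return float(np.mean(predicted == np.asarray(human_moves)))

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.6, 0.3],
                  [0.3, 0.4, 0.3]])
top1_accuracy(probs, [0, 1, 0])  # 2 of 3 argmax predictions match
```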

…To assess the merits of self-play reinforcement learning, compared to learning from human data, we trained a second neural network (using the same architecture) to predict expert moves [behavior cloning] in the KGS Go Server dataset; this achieved state-of-the-art prediction accuracy compared to previous work [12, 30, 32, 33] (see Extended Data Tables 1 & 2 for current and previous results, respectively). Supervised learning achieved a better initial performance, and was better at predicting human professional moves (Figure 3).

Notably, although supervised learning achieved higher move prediction accuracy, the self-learned player performed much better overall, defeating the human-trained player within the first 24 hours of training. This suggests that AlphaGo Zero may be learning a strategy that is qualitatively different from human play.