
[–]gwern 138 points139 points  (16 children)

How/why is Zero's training so stable? This was the question everyone was asking when DM announced it'd be experimenting with pure self-play training - deep RL is notoriously unstable and prone to forgetting, self-play is notoriously unstable and prone to forgetting, the two together should be a disaster without a good (imitation-based) initialization & lots of historical checkpoints to play against. But Zero starts from zero and if I'm reading the supplements right, you don't use any historical checkpoints as opponents to prevent forgetting or loops. But the paper essentially doesn't discuss this at all or even mention it other than one line at the beginning about tree search. So how'd you guys do it?

[–]David_SilverDeepMind[S] 56 points57 points  (4 children)

AlphaGo Zero uses a quite different approach to deep RL than typical (model-free) algorithms such as policy gradient or Q-learning. By using AlphaGo search we massively improve the policy and self-play outcomes - and then we apply simple, gradient based updates to train the next policy + value network. This appears to be much more stable than incremental, gradient-based policy improvements that can potentially forget previous improvements.
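A minimal sketch of this kind of update (an illustration, not DeepMind's code), assuming a hypothetical `mcts` routine that returns root visit counts and a network that outputs move log-probabilities and a value estimate:

```python
import numpy as np

def search_policy_target(position, net, mcts, simulations=1600, temperature=1.0):
    """Turn MCTS root visit counts into a search-improved policy target."""
    visit_counts = mcts(position, net, simulations)   # hypothetical search routine
    pi = visit_counts ** (1.0 / temperature)
    return pi / pi.sum()

def alphago_zero_loss(pi, z, move_log_probs, value, l2_penalty):
    """Combined objective from the paper: value MSE + policy cross-entropy + L2 term."""
    return (z - value) ** 2 - np.dot(pi, move_log_probs) + l2_penalty
```

The key point is that the target `pi` already incorporates the search improvement, so each gradient step moves the network towards a search-improved policy rather than bootstrapping off its own raw output.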

[–]gwern 11 points12 points  (3 children)

So you think the additional supervision on all moves' value estimates by the tree search is what preserves knowledge across all the checkpoints and prevents catastrophic forgetting? Is there an analogy here to Hinton's dark knowledge & incremental learning techniques?

[–]ThomasWAnthony 63 points64 points  (1 child)

I’ve been working on almost the same algorithm (we call it Expert Iteration, or ExIt), and we too see very stable performance. Why is a really interesting question.

By looking at the differences between us and AlphaGo, we can certainly rule out some explanations:

  1. The dataset of the last 500,000 games only changes very slowly (25,000 new games are created each iteration, 25,000 old ones are removed - only 5% of data points change). This acts like an experience replay buffer, and ensures only slow changes in policy. But this is not why the algorithm is stable: we tried a version where the dataset is recreated from scratch every iteration, and that seems to be really stable as well.

  2. We do not use the Dirichlet Noise at the root trick, and still learn stably. We’ve thought about a similar idea, namely using a uniform prior at the root. But this was to avoid potential local minima in our policy during training, almost the opposite of making it more stable.

  3. We learn stably both with and without the board reflection/rotation trick, whether in dataset creation or in the MCTS.

I believe the stability is a direct result of using tree search. My best explanation is that:

An RL agent may train unstably for two reasons: (a) It may forget pertinent information about positions that it no longer visits (change in data distribution) (b) It learns to exploit a weak opponent (or a weakness of its own), rather than playing the optimal move.

  1. AlphaGo Zero uses the tree policy in the first 30 moves to explore positions. In our work we use a NN trained to imitate that tree policy. Because MCTS should explore all plausible moves, an opponent that tries to play outside of the data distribution that the NN is trained on will usually have to play some moves that the MCTS has worked out strong responses to, so as you leave the training distribution, the AI will gain an unassailable lead.

  2. To overfit to a policy weakness, a player needs to learn to visit a state s where the opponent is weak. However, because MCTS directs resources towards exploring s, it can discover improvements to the policy at s during search. MCTS finds these improvements before the neural network is trained to play towards s. In a method with no look-ahead, the neural network learns to reach s and exploit the weakness immediately; only later does it realise that V^pi(s) is large only because the policy pi is poor at s, rather than because V*(s) is large.

As I’ve mentioned elsewhere in the comments, our paper is “Thinking Fast and Slow with Deep Learning and Tree Search”, we’ve got a pre-print on the arxiv, and will be publishing a final version at NIPS soon.
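To make the overall loop concrete, here is a hypothetical sketch (stand-in function names, not code from either paper) of the Expert Iteration / AlphaGo Zero-style procedure described above:

```python
from collections import deque

def expert_iteration(net, play_game_with_mcts, train,
                     iterations=100, games_per_iter=25_000, window_games=500_000):
    # Sliding window of recent self-play games: each iteration only replaces ~5%
    # of the data, so it behaves like a slowly changing replay buffer (point 1 above).
    buffer = deque(maxlen=window_games)
    for _ in range(iterations):
        for _ in range(games_per_iter):
            # The tree search is the "expert": it produces search-improved move
            # targets and the final outcome for every position in the game.
            buffer.append(play_game_with_mcts(net))
        # The network is the "apprentice": it learns to imitate the expert's search
        # policy and to predict the game outcome.
        net = train(net, buffer)
    return net
```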

[–]TemplateRex 5 points6 points  (0 children)

Seems like the continuous feedback from the tree search acts like a kind of experience replay. Does that make sense?

[–]Borgut1337 15 points16 points  (1 child)

I personally suspect it's because of the tree search (MCTS), which is still used to find moves potentially better than those recommended by the network. If you only use two copies of the same network which train against each other / themselves (since they're copies), I think they can get stuck / start oscillating / overfit against themselves. But if you add some search on top of it, it can sometimes find moves better than those recommended purely by the network, enabling it to ''exploit'' mistakes of the network if the network is indeed overfitting.

This is all just my intuition though, would love to see confirmation on this

[–]2358452 3 points4 points  (0 children)

I believe this is correct. The network will be trained with full hindsight from a large tree search. A degradation in performance by a bad parameter change would very often lead to its weakness being found out in the tree search. If it were pure policy play it seems safe to assume it would be much less stable.

Another important factor is stochastic behavior, I believe non-stochastic agents in self-play should be vulnerable to instabilities.

For example, the optimal strategy in rock-paper-scissors is to pretty much play randomly. Take an agent A_t restricted to deterministic strategies, and make it play its previous iteration A_{t-1}, which played rock. It will quickly find that playing paper is optimal, and analogously for t+1, t+2, ..., always convinced its Elo is rising (it always wins 100% of the time w.r.t. previous iterations).
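A toy simulation of that cycling behaviour (purely illustrative):

```python
BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}

strategy = "rock"                      # A_0 plays rock deterministically
for t in range(1, 7):
    new_strategy = BEATS[strategy]     # A_t best-responds to A_{t-1}
    print(f"A_{t} plays {new_strategy}, beating A_{t-1}'s {strategy} 100% of the time")
    strategy = new_strategy
# The sequence cycles paper -> scissors -> rock -> paper -> ..., so "always beats
# the previous version" never turns into genuine progress towards the optimal
# (uniformly random) strategy.
```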

[–]aec2718 12 points13 points  (1 child)

The key part is that it is not just a Deep RL agent, it uses a policy/value network to guide an MCTS agent. Even with a garbage NN policy influencing the moves, MCTS agents can generate strong play by planning ahead and simulating game outcomes. The NN policy/value network just biases the MCTS move selection. So there is a limit on instability from the MCTS angle.

Second, in every training iteration, 25,000 games are generated through self play of a fixed agent. That agent is updated for the next iteration only if the updated version can beat the old version 55% of the time or more. So there is roughly a limit on instability of policy strength from this angle. Agents aren't retained if they are worse than their predecessors.
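As a rough sketch of that gatekeeping step (illustrative only, assuming a hypothetical `play_match` helper that returns 1 when the candidate wins a game):

```python
def maybe_promote(best_net, candidate_net, play_match, eval_games=400, threshold=0.55):
    """Keep the candidate only if it beats the current best often enough."""
    wins = sum(play_match(candidate_net, best_net) for _ in range(eval_games))
    return candidate_net if wins / eval_games >= threshold else best_net
```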

[–]gwern 2 points3 points  (0 children)

Second, in every training iteration, 25,000 games are generated through self play of a fixed agent. That agent is updated for the next iteration only if the updated version can beat the old version 55% of the time or more. So there is roughly a limit on instability of policy strength from this angle. Agents aren't retained if they are worse than their predecessors.

I don't think that can be the answer. You can catch a GAN diverging by eye, but that doesn't mean you can train a NN Picasso with GANs. You have to have some sort of steady improvement for the ratchet to help at all. And, there's no reason it couldn't gradually decay in ways not immediately caught by the test suite, leading to cycles or divergence. If stabilizing self-play was that easy, someone would've done that by now and you wouldn't need historical snapshots or anything.

[–][deleted]  (1 child)

[deleted]

    [–]gwern 19 points20 points  (0 children)

    That's not really an answer, though. It's merely a one-line claim, with nothing like background or comparisons or a theoretical justification or interpretation or ablation experiments showing regular policy-gradient self-play is wildly unstable as expected & tree-search-trained self-play super stable. I mean, stability is far more important than, say, regular convolutional layers vs residual convolutional layers (they're training a NN with 40 residual layers! for a RL agent, that's huge), and that gets a full discussion, ablation experiment, & graphs.

    [–]BullockHouse 2 points3 points  (0 children)

    This is a great question. Something really confusing is going on here.

    [–]Cassandra120 96 points97 points  (18 children)

    Do you think that AlphaGo would be able to solve Igo Hatsuyôron's problem 120, the "most difficult problem ever", i. e. winning a given middle game position, or confirm an existing solution (e.g. http://igohatsuyoron120.de/2015/0039.htm)?

    [–]David_SilverDeepMind[S] 53 points54 points  (6 children)

    We just asked Fan Hui about this position. He says AlphaGo would solve the problem, but the more interesting question would be if AlphaGo found the book answer, or another solution that no one has ever imagined. That's the kind of thing which we have seen with so many moves in AlphaGo’s play!

    [–]GetInThereLewis 49 points50 points  (2 children)

    Perhaps the question should have been, "can you run AG Zero on this position and tell us what the optimal solution is?" I don't think anyone doubts that it would be able to solve it at all. :)

    [–][deleted] 3 points4 points  (1 child)

    Can AlphaGo be "dropped in" to already developed boards? I would imagine so, but that might not be what it was trained on.

    I know there's a LOT of variations in Go, so there's a good chance a similar board could be created during actual play... but what if not? What if this exact game is not something AlphaGo would ever let happen?

    [–]Cassandra120 14 points15 points  (0 children)

    Our three amateurs' team would be very happy to get in touch with DeepMind (maybe via Fan Hui?). Any solution found by AlphaGo would be fine. We are still looking for a white move that gains two points for her, in order to reach an "ideal" result of "Black + 1". Additionally, there are a lot of side variations that could be checked by AlphaGo ... Please note that all the "solutions" that can be found in books by PROFESSIONALS are NOT correct!

    [–]gin_and_toxic 5 points6 points  (0 children)

    The world needs this answer ;)

    It's kinda like having a computer that can solve one of the unsolved math problems, but not telling the world the answer.

    [–]hikaruzero 7 points8 points  (9 children)

    Man I just want to say this question is solid gold, nice! I'd also like to hear the answer.

    [–]Feryll 5 points6 points  (0 children)

    Also very much looking forward to having this one answered!

    [–]sml0820 90 points91 points  (6 children)

    How much more difficult are you guys finding Starcraft II versus Go, and potentially what are the technical roadblocks you are struggling with most? When can we expect a formal update?

    [–]JulianSchrittwieserDeepMind 55 points56 points  (2 children)

    It's only been a few weeks since we announced the StarCraft II environment, so it's still very early days. The StarCraft action space is definitely a lot more challenging than Go, and the observations are a lot larger as well. Technically, I think one of the largest differences is that Go is a perfect information game, whereas StarCraft has fog of war and therefore imperfect information.

    [–][deleted] 5 points6 points  (1 child)

    What are the similarities and differences when compared to OpenAI's efforts to play Dota?

    I of course hope resources become diverted because of some major breakthrough in applying AI methods to medical research or resource management, but assuming that isn't happening just yet... Is StarCraft the next major non-confidential challenge DeepMind is taking on?

    [–]OriolVinyals 13 points14 points  (1 child)

    We just released the paper, with mostly baselines and vanilla networks (e.g., those found in the original Atari DQN paper) to understand how far along those baseline algorithms can push SC2. Following Blizzard tradition, you should expect an update when it's ready (TM).

    [–]fischgurke 31 points32 points  (3 children)

    As developers on the computer Go mailing list have stated, it is not "hard" for them to implement the algorithms presented in your paper; however, it is impossible for them to provide the same amount of training to their programs as you could to AlphaGo.

    In computer chess, we have observed that developers copied algorithm parts (heuristics, etc.) from other programs, including for commercial purposes. Generally, it seems with new software based on DCNNs, the algorithm is not as important as the data resulting from training. The data, however, is much easier to copy than the algorithm.

    Would you say that data is now more important than the algorithm? Your new paper about AG0 implies otherwise. Nevertheless, do you think the fact that "AI" is "copy-pastable" will be an issue in the future? Do you think that as reinforcement learning and neural networks become more important, we will see attempts to protect trained networks in similar ways as other intellectual property (e.g., patents, copyright)?

    [–]JulianSchrittwieserDeepMind 26 points27 points  (1 child)

    I think the algorithm is still more important - compare how much more efficient the training in the new AlphaGo Zero paper is compared to the previous paper - and I think this is where we'll still see huge advances in data efficiency.

    [–]RayquazaDD 27 points28 points  (8 children)

    Thanks for the AMA. Regarding the new paper:

    1. Is AlphaGo Zero still training now? Will we get another set of self-play games in the future if there is a breakthrough (e.g. 70% win rate vs. the previous version)?

    2. AlphaGo Zero opened with two hoshi (star points) against AlphaGo Master, whether Zero was black or white. However, we saw AlphaGo Zero play komoku in the last period of its self-play. Is there any reason for this?

    3. In the paper, you mentioned AlphaGo Zero won 89 games to 11 versus AlphaGo Master. Could you release all 100 games?

    [–]David_SilverDeepMind[S] 29 points30 points  (5 children)

    AlphaGo is retired! That means the people and hardware resources have moved onto other projects on the long, winding road to AI :)

    [–]FeepingCreature 17 points18 points  (3 children)

    I'm kind of curious why you're not opensourcing it in that case. Clearly there's interest. Is it using proprietary APIs/techniques that you still want to use in other contexts?

    [–]ParadigmComplex 6 points7 points  (1 child)

    While you probably saw this, I figured there may be value in me linking you just in case:

    Considering that AlphaGo is now retired, when do you plan to open source it? This would have a huge impact on both the Go community and the current research in machine learning.

    When are you planning to release the Go tool that Demis Hassabis announced at Wuzhen?

    Work is progressing on this tool as we speak. Expect some news soon : )

    but also:

    Any plans to open source AlphaGo?

    We've open sourced a lot of our code in the past, but it's always a complex process. And in this case, unfortunately, it's a prohibitively intricate codebase.

    I'm inclined to think the first post was about the tool, not open sourcing, and that it probably won't happen ):

    [–]okimoyo 6 points7 points  (0 children)

    I'm also quite interested in the first point raised here.

    Did you terminate the Elo rating vs. time figure at ~40 days because of a publication deadline, or did you select this cutoff because AlphaGo Zero's performance ceased to improve significantly beyond this point?

    [–]Uberdude85 58 points59 points  (3 children)

    At a talk Demis Hassabis gave in Cambridge in March he said one of the future aims of the AlphaGo project was interpretability of the neural networks. So my question is have you made any progress in interpreting the neural networks of AlphaGo or are they still essentially mysterious black boxes? Is there any emergent structure that you can correlate with the human concepts we think about when we play the game, such as parsing the board into groups and then assigning them properties like strong or weak, alive or dead?

    For example, in this illustrative neural network trained to produce Wikipedia articles, sections of the network related to producing URLs could be identified (see under "Visualizing the predictions and the “neuron” firings in the RNN"). So is there anything similar in AlphaGo's networks, such as an area of the network that shows greater activity when it is attacking vs. defending, or fighting a ko? Perhaps even more interesting would be if there were some emergent features which do not correlate with current human Go concepts. We humans think of groups or stones as sitting on scales of a variety of properties such as weak/strong, amount of territory/influence, alive/dead, light/heavy, thick/thin, good/bad eyeshape, etc., but maybe AlphaGo could introduce a whole new dimension to how we think about the game.

    [–]David_SilverDeepMind[S] 31 points32 points  (1 child)

    Interpretability is a really interesting question for all of our systems, not just AlphaGo. We have teams working across DeepMind trying to come up with novel ways to interrogate our systems. Most recently they published work that draws on techniques from cognitive psychology to try to decipher what is happening inside matching networks… and it worked pretty nicely!

    [–]cutelyaware 4 points5 points  (0 children)

    I love this question! If we do find regions that activate for concepts we don't already have, it would be fun to look at examples of those positions and try to guess what they have in common.

    [–]tr1pzz 26 points27 points  (4 children)

    Two questions after reading the amazing AlphaGo Zero paper, wow, just wow!!

    Q1: Could you explain why exactly the input dimensionality for AlphaGo's residual blocks is 19x19x17?

    I don't really get why it would be useful to include 8 stacked binary feature planes per player to encode the recent history of the game? (In my mind 2 (or even just 1?) would be enough..) (I'm not 100% familiar with all the rules of Go, so maybe I'm missing something here (I know move repetitions are prohibited etc..) but in any case 8 seems like a lot!)

    Additionally, the presence of a final, full 19x19 binary feature plane C to simply indicate which player's move it is seems like a rather awkward construction, since it duplicates a single useful bit 361 times..

    In summary I'm just surprised: the input dimensionality seems unnecessarily high... (I was expecting something more like 19x19x3 + 1 (a single 19x19 plane with 3 possible values: black, white or empty + 1 binary value indicating which player's turn it is))


    Q2: Since the entire pipeline uses only self-play against the latest/best version of the model, do you guys think there is any risk in overfitting to the specific SGD-driven trajectory the model is taking through parameter space? It seems like the final model-gameplay is kind of dependent on the random initialisation weights and the actual encountered game states (as a result of stochastic action sampling).

    This just reminded me of OpenAI's wrestling RL agents that learn to counter their immediate opponent, resulting in a strategy that doesn't generalize as well as it would when facing multiple, diverse opponents...

    [–]David_SilverDeepMind[S] 19 points20 points  (2 children)

    Actually, the representation would probably work well with other choices than 8 planes! But we use a stacked history of observations for three reasons: 1. it is consistent with common input representations in other domains (e.g. Atari), 2. we need some history to represent ko, 3. it is useful to have some history to have an idea of where the opponent played recently - these can act as a kind of attention mechanism (i.e. focus on where my opponent thinks is important). The 17th plane is necessary to know which colour we are playing - important because of the komi rule.
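    A minimal sketch of how such a 19x19x17 input could be assembled (my own illustration of the representation described above, not DeepMind's code):

    ```python
    import numpy as np

    def encode_position(history, to_play, board_size=19, history_len=8):
        """history: list of boards (most recent last), each a 19x19 array with
        1 = black stone, -1 = white stone, 0 = empty. to_play: 1 for black, -1 for white."""
        planes = np.zeros((board_size, board_size, 2 * history_len + 1), dtype=np.float32)
        for i in range(min(history_len, len(history))):
            board = history[-(i + 1)]                             # i steps back in time
            planes[:, :, i] = (board == to_play)                  # current player's stones
            planes[:, :, history_len + i] = (board == -to_play)   # opponent's stones
        planes[:, :, -1] = 1.0 if to_play == 1 else 0.0           # colour-to-play plane
        return planes
    ```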

    [–]ThomasWAnthony 27 points28 points  (4 children)

    Super excited to see the results of AlphaGo Zero. In our NIPS paper, Thinking Fast and Slow with Deep Learning and Tree Search, we propose a very similar idea. I'm particularly interested in learning more about behaviour in longer training runs than we achieved.

    1. As AlphaGo Zero trains, how does the relative performance of greedy play by the MCTS used to create learning targets, greedy play by the policy network, and greedy play of the value function change during training? Does the improvement over the networks achieved by the MCTS ever diminish?

    2. In light of the success of this self-play method, will deepmind/blizzard be making it possible to use self-play games in the recent Starcraft 2 API (which was not available at launch)?

    [–]David_SilverDeepMind[S] 12 points13 points  (2 children)

    Thanks for posting your paper! I don't believe it had been published at the time of our submission (7th April). Indeed it is quite similar to the policy component of our learning algorithm (although we also have a value component), see discussion in Methods/reinforcement learning. Good to see related approaches working in other games.

    [–]sarokrae 9 points10 points  (0 children)

    That didn't answer either of these questions... (Also interested in whether a self play Starcraft API is in the works!)

    [–]brkirby 22 points23 points  (3 children)

    Any plans to open source AlphaGo?

    [–]David_SilverDeepMind[S] 18 points19 points  (2 children)

    We've open sourced a lot of our code in the past, but it's always a complex process. And in this case, unfortunately, it's a prohibitively intricate codebase.

    [–][deleted]  (1 child)

    [deleted]

      [–]thebackpropaganda 23 points24 points  (0 children)

      It probably uses a tonne of internal libraries owned by other teams at Google.

      [–]clumma 38 points39 points  (6 children)

      With strong chess engines we can now give players intrinsic ratings -- Elo ratings inferred from move-by-move analysis of their play. This lets us do neat things like compare players of past eras, and potentially offers a platform for the study of human cognition.

      Could this be done with AlphaGo? I suppose it could be more complicated for go, since in chess there is no margin of victory to consider (there is material vs depth to mate, but only rarely are these two out of sync).

      [–]JulianSchrittwieserDeepMind 36 points37 points  (2 children)

      Actually this is a really cool idea, thanks for sharing the paper!

      I think this could totally be done for Go, maybe using the difference in value between best and played move, or the probability assigned to the played move by the policy network. If I have some free time I'd love to try this at some point.
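      A hypothetical sketch of that idea, scoring each human move by the value it gives up relative to the engine's preferred move (`evaluate_moves` is a stand-in, not a real API):

      ```python
      def average_value_loss(positions, played_moves, evaluate_moves):
          """Mean win-probability given up per move, relative to the engine's best move."""
          losses = []
          for position, move in zip(positions, played_moves):
              values = evaluate_moves(position)     # stand-in: move -> estimated win probability
              losses.append(max(values.values()) - values[move])
          return sum(losses) / len(losses)          # lower = closer to engine-preferred play
      ```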

      [–]clumma 4 points5 points  (0 children)

      +1 This post from Regan's blog may be helpful as well.

      [–][deleted] 4 points5 points  (0 children)

      But isn't AlphaGo being retired? Are you still permitted to work on it and polish it in your spare time, or will some resources remain available for it as things taper off?

      [–]Bleyddyn 9 points10 points  (2 children)

       Somewhat along the same lines. Has there been any work done on using AlphaGo as a teacher? Ideally more than just playing against it. As a novice Go player I doubt I'd learn much playing against AlphaGo unless there was some way to lower its apparent skill level.

      [–]darkmighty 2 points3 points  (0 children)

      I'm interested in this too! I think there are useful lessons in human-human learning and machine-human teaching to be applied to efficient machine-machine transfer learning, and AI safety (with machines explaining their reasoning).

      [–]reddittimiscal 20 points21 points  (3 children)

       Why stop the training at 40 days? It's still climbing the performance ladder, no? What would happen if you let it run for, say, 3 months?

      [–]David_SilverDeepMind[S] 34 points35 points  (2 children)

      I guess it's a question of people and resources and priorities! If we'd run for 3 months, I guess you might still be wondering what would happen after, say, 6 months :)

      [–]cutelyaware 4 points5 points  (0 children)

      I guarantee you we would, but that doesn't mean we wouldn't appreciate the effort!

      [–][deleted] 3 points4 points  (0 children)

      This is so true... I think the Go community was hoping AlphaGo would run indefinitely.

       Seems like what is happening instead is that AlphaGo's research is fueling advancements in alternative bots. People are likely going to be studying AlphaGo's games for quite some time, but people are also going to create new bots they can learn from.

      Hopefully, in 10 - 20 years, much like what happened in chess, you will be able to run the world's most powerful Go AI on your home computer or on a network with a low subscription fee.

      Speaking of which, what is the chance that improvements in computation will keep happening? How much of an improvement in processing power and AI tools will be needed for another sponsored run of AlphaGo, or a community run of something similar, to be "not that big of a deal"?

      Seems like AlphaGo currently takes a whole team's effort... and that team is needed on other tasks.

      [–][deleted]  (1 child)

      [deleted]

        [–]JulianSchrittwieserDeepMind 37 points38 points  (0 children)

        Definitely, personally I only have a Bachelor's degree in Computer Science. The field is moving very quickly, so I think you can teach yourself a lot from reading papers and running experiments. It can be very helpful to get an internship with a company that already has experience in ML.

        [–]kamui7x 35 points36 points  (5 children)

         In 1846 Shusaku played a game against Gennan Inseki featuring the most famous move in Go history, move #127, which has been named "the ear-reddening move." This move has been praised for how spectacular it was. Does AlphaGo agree this is the best path forward? If not, what sequence would AlphaGo play?

        [–]JulianSchrittwieserDeepMind 23 points24 points  (2 children)

        As I'm not an expert Go player, we asked Fan Hui for his view:

         At the time of this match, games were played without komi. Today, AlphaGo always plays with 7.5 komi. The game totally changes with this komi difference. If we were to place move 127 in front of AlphaGo, it is very possible AlphaGo would play a very different sequence.

        [–]kamui7x 5 points6 points  (1 child)

        Thank you for the response. Is it possible to either set the komi to zero or give the black player 7 captured stones somehow? Considering how famous this move is in the history of go there is great interest to see the continuation that AlphaGo would take. Any possibility to get an SGF of this?

        [–]i_know_about_things 3 points4 points  (0 children)

        7.5 komi is hardcoded into AlphaGo. Playing with different komi requires complete retraining.

        [–]PaperBigcat 4 points5 points  (0 children)

         We should go through all human games for this.

        [–]sfenders 18 points19 points  (3 children)

        Earlier in its development, I heard that AlphaGo was guided in specific directions in its training to address weaknesses that were detected in its play. Now that it has apparently advanced beyond human understanding, is it possible that it might need another such nudge to get it out of any local maximum it has found its way into? Is that something which has been, or will be attempted?

        [–]David_SilverDeepMind[S] 20 points21 points  (1 child)

        Actually we never guided AlphaGo to address specific weaknesses - rather we always focused on principled machine learning algorithms that learned for themselves to correct their own weaknesses.

        Of course it is infeasible to achieve optimal play - so there will always be weaknesses. In practice, it was important to use the right kind of exploration to ensure training did not get stuck in local optima - but we never used human nudges.
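        For reference, one concrete form of that exploration is described in the AlphaGo Zero paper: Dirichlet noise is mixed into the prior move probabilities at the root of each self-play search, so the search occasionally tries moves the network currently considers unlikely. A small sketch (default values taken from the paper; the function itself is just an illustration):

        ```python
        import numpy as np

        def add_root_noise(priors, epsilon=0.25, alpha=0.03):
            """Mix Dirichlet noise into the network's prior over moves at the search root."""
            noise = np.random.dirichlet([alpha] * len(priors))
            return (1 - epsilon) * priors + epsilon * noise
        ```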

        [–]Paranaix 17 points18 points  (0 children)

        The 50 self-play games released after Wuzhen were a shock for the professional go community. Many moves look almost alien to a human player.

        Is there any chance that you

        1. Release another set of self-play games?
        2. Include some variations which AG thinks plausible/probable, which might help us deepen our understanding of why AG chooses certain moves?

        [–]JulianSchrittwieserDeepMind 18 points19 points  (1 child)

        Hi everyone, we are here to answer your questions :)

        [–]HeyApples 33 points34 points  (9 children)

        The small sample of AlphaGo vs. AlphaGo games published showed White winning a disproportionate amount of the time, which led some to speculate that komi was too high.

        With access to a larger dataset, have you been able to make any interesting conclusions about the basic Go ruleset? (ie: Black or white have an intrinsic advantage, komi should be higher or lower, etc.)

        [–]JulianSchrittwieserDeepMind 26 points27 points  (4 children)

        In my experience and the experiments we've run, komi 7.5 is very balanced; we only observe a slightly higher winrate for White (55%).

        [–]SebastianDoyle 11 points12 points  (1 child)

        There is a video where Michael Redmond looks at a bunch of AG self-play games and says he thinks that the komi is right, and that White wins more games simply because AG is a stronger player as White than as Black. He gives some reasons for that, i.e. there are strategic differences in how to play White vs as Black, which AG apparently didn't figure out. Looks like AG0 has caught up though :).

        [–]Kingvash 6 points7 points  (1 child)

        In the AlphaGo Zero self-play games, White wins a more modest 24 of 40 games.

        [–][deleted] 5 points6 points  (0 children)

        I heard that the self-play games were selected from various stages throughout the development of Zero, so only the later games are representative of the win rates for White and Black when Zero is at its highest strength. And White seems to be winning most of the later games.

        [–]ExtraTricky 14 points15 points  (16 children)

        One of the things that stood out to me most in the Nature paper was the fact that two of the feature planes used explicit ladder searches. I've heard several commentators on AlphaGo be surprised by its awareness of ladders, but to me it feels like a go player thinking about a position when someone taps him on the shoulder and says "Hey, in this variation the ladder stops working." Much less impressive! In addition, the pure MCTS programs that predated AlphaGo were notoriously bad at reading ladders. Do you agree that using explicit ladder searches as feature planes feels like sidestepping the problem rather than solving it? Have you made any progress or attempts at progress on that front since your last publication?

        I'm also interested in the ladder problem because it's in some sense a very simple form of the general semeai problem, where one side has only one liberty. When we look at other programs such as JueYi that are based on the Nature publication, we see many cases of games (maybe around 10% of games against top pros) where there is a very large semeai with many liberties on both sides and the program decides to ignore it, resulting in a catastrophically large dead group. When AlphaGo played online as Master, we didn't see any of that in 60 games. What does AlphaGo do differently from what was described in the Nature paper that allows it to play semeai much better?

        When a sufficiently strong human player approaches these positions, they are able to resolve them by counting the liberties on both sides and determining the result by comparing the two counts. From my understanding of the Nature paper, it seems that the liberty counts get encoded into the 8 feature planes, which are described as representing liberty counts 1, 2, 3, 4, 5, 6, 7, and 8 or more. It seems like this would work for small semeai, as the network could easily learn that if one group has the input for 7 liberties and the other has the input for 6 liberties, then the group with 7 liberties will win the race. But for a large semeai, say two groups with 10 liberties each, when we compare playing there versus not playing there, they both look like an "8+" vs "8+" race, which would probably be learned to be counted as something like a seki, since there's no way to know which side wins just from that. So I was thinking that this could explain these programs' tendencies to disastrously play away from large semeai.

        Does this thinking match the data that you've observed? If so, have you made any insights into techniques for machines to learn these "count and compare"-style approaches to problems in ways that would generalize to arbitrarily high counts?
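        A toy illustration of the saturation described above: a one-hot liberty encoding capped at "8 or more" cannot distinguish a 10-liberty group from an 11-liberty one, so any count-and-compare beyond that point would have to come from search rather than from the input features.

        ```python
        def liberty_planes(num_liberties, max_bucket=8):
            """One-hot liberty count, with the last bucket meaning '8 or more'."""
            planes = [0] * max_bucket
            planes[min(num_liberties, max_bucket) - 1] = 1
            return planes

        print(liberty_planes(6))    # [0, 0, 0, 0, 0, 1, 0, 0]
        print(liberty_planes(10))   # [0, 0, 0, 0, 0, 0, 0, 1]  -- same as...
        print(liberty_planes(11))   # [0, 0, 0, 0, 0, 0, 0, 1]  -- ...this one
        ```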

        [–]David_SilverDeepMind[S] 20 points21 points  (0 children)

        AlphaGo Zero has no special features to deal with ladders (or indeed any other domain-specific aspect of Go). Early in training, Zero occasionally plays out ladders across the whole board - even when it has quite a sophisticated understanding of the rest of the game. But, in the games we have analysed, the fully trained Zero read all meaningful ladders correctly.

        [–]dhpt 6 points7 points  (1 child)

        Interesting question! I'm quoting from the new paper:

        Surprisingly, shicho (‘ladder’ capture sequences that may span the whole board)—one of the first elements of Go knowledge learned by humans—were only understood by AlphaGo Zero much later in training.

        [–]dhpt 4 points5 points  (0 children)

        They actually don't specify how late in training. Would be interesting to know!

        [–]2358452 4 points5 points  (12 children)

        See their new paper (AlphaGo Zero), it doesn't include explicit ladder search, and is already better than previous AlphaGo.

        As for counting, yes, that's an interesting question. Neural networks of depth N are pretty much differentiable versions of logical circuits of depth O(N). So it should be able to count to at least O(2^N)* if necessary in its internal evaluation, but I don't think it's obvious that it does, or that it can be trained to reliably count up to O(2^N). I wouldn't be surprised if certain internal states were found to be a binary representation (or logarithmic-amplitude representation) of a liberty count of a group.

        *: For a conventional adder circuit; not sure about unary counting. Does anyone have ideas on a generalization?

        [–]seigenblues 14 points15 points  (2 children)

        Hi David & Julian, congratulations on the fantastic paper! 5 ML questions and a Go question:

        1. How did you know to move to a 40-block architecture? I.e., was there something you were monitoring to suggest that the 20-block architecture was hitting a ceiling?
        2. Why is it needed to do 1600 playouts/move even at the beginning, when the networks are mostly random noise? Wouldn't it make sense to play a lot of fast random games, and to search deeper as the network gets progressively better?
        3. Why are the input features only 8 moves back? Why not fewer? (or more?)
        4. Would a 'delta featurization' work, where you essentially have a one-hot for the most recent moves? (from brian lee)
        5. Implementation detail: do you actually use an infinitesimal temperature (in the deterministic playouts), or just 'approximate' it by always picking the most visited move?

        6. Any chance of getting a more detailed analysis of joseki occurrences in the corpus? :)

        Congratulations again!

        [–]JulianSchrittwieserDeepMind 9 points10 points  (1 child)

        Yes, you could probably get away with doing fewer simulations in the beginning, but it's simpler to keep it uniform throughout the whole experiment.

        David answered the input features one; as for the delta features: Neural nets are surprisingly good at using different ways of representing the same information, so yeah, I think that would work too.

        Yeah, 0 temperature is equivalent to just std::max of the visits :)
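        For clarity, the selection rule being discussed looks roughly like this (an illustration, not the actual implementation): moves are sampled in proportion to N(a)^(1/tau), and as tau goes to 0 this collapses to simply picking the most-visited move.

        ```python
        import numpy as np

        def select_move(visit_counts, tau):
            """Sample a move with probability proportional to N(a)^(1/tau)."""
            if tau == 0:                                 # the "infinitesimal temperature" case
                return int(np.argmax(visit_counts))      # i.e. just take the most-visited move
            probs = np.asarray(visit_counts, dtype=np.float64) ** (1.0 / tau)
            probs /= probs.sum()
            return int(np.random.choice(len(visit_counts), p=probs))
        ```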

        [–]pjox 28 points29 points  (4 children)

        Considering that AlphaGo is now retired, when do you plan to open source it? This would have a huge impact on both the Go community and the current research in machine learning.

        When are you planning to release the Go tool that Demis Hassabis announced at Wuzhen?

        [–]David_SilverDeepMind[S] 39 points40 points  (1 child)

        Work is progressing on this tool as we speak. Expect some news soon : )

        [–]gin_and_toxic 4 points5 points  (0 children)

        That's awesome news. Keep up the great work.

        [–]adum 11 points12 points  (3 children)

        As an AlphaGo superfan, watching all these matches was awesome. The biggest itch left unscratched is wondering how many handicap stones AlphaGo could give top pros. We know that AlphaGo can play handicap games since the papers talk about it. I understand that the political implications of giving H2 to Ke Jie were untenable. However, as the creators, you must be very curious yourselves. Have you done any internal tests, or is there anything else you can hint at? Thanks!

        [–]David_SilverDeepMind[S] 23 points24 points  (1 child)

         We haven't played handicap games against human players - we really wanted to focus on even games, which after all are the real game of Go. However, it was useful to test different versions of AlphaGo against each other under handicap conditions. Using the names of major versions from the Zero paper, AlphaGo Master > AlphaGo Lee > AlphaGo Fan; each version defeated its predecessor with 3 handicap stones. But there are some caveats to this evaluation, as the networks were not specifically trained for handicap play. Also, since AlphaGo is trained by self-play, it is especially good at defeating weaker versions of itself. So I don't think we can generalise these results to human handicap games in any meaningful way.

        [–][deleted]  (2 children)

        [deleted]

          [–]David_SilverDeepMind[S] 21 points22 points  (1 child)

          We have stopped active research into making AlphaGo stronger. But it's still there as a research test-bed for DeepMinders to experiment with new ideas and algorithms.

          [–][deleted] 2 points3 points  (0 children)

          This answers one of my earlier questions regarding the impact of "retirement".

          [–][deleted] 11 points12 points  (2 children)

          It seems that training entirely by self-play would have been the first thing you would try in this situation, before trying to scrape together human game data. What was the reason that earlier versions of AlphaGo didn't train through self-play, or, if it was attempted, why didn't it work as well?

          In general, I am curious about how development and progress works in this field. What would have been the bottleneck two years ago in designing a self-play trained AlphaGo compared to today? What "machine learning intuition" was gained from all the iterations that finally made a self-play system viable?

          [–]David_SilverDeepMind[S] 17 points18 points  (1 child)

          Creating a system that can learn entirely from self-play has been an open problem in reinforcement learning. Our initial attempts, as for many similar algorithms reported in the literature, were quite unstable. We tried many experiments - but ultimately the AlphaGo Zero algorithm was the most effective, and appears to have cracked this particular issue.

          [–][deleted] 8 points9 points  (0 children)

          If you have time to answer a follow-up, what changed? What was the key insight into going from unstable self-play systems to a fantastic one?

          [–]fischgurke 23 points24 points  (0 children)

          Can you give any news about an "AlphaGo tool" that you hinted at during the Ke Jie match? Will it be some kind of credit-based (for example, 1 per day) online interface where you can consult AlphaGo for its opinion on Go positions?

          [–]mosicr 10 points11 points  (4 children)

          To David Silver: in your video lectures you mentioned RL can be used for financial trading. Do you have any examples of real-world use? How would you deal with Black Swans (previously unencountered situations)? Thanks

          [–]David_SilverDeepMind[S] 16 points17 points  (3 children)

          Real-world finance algorithms are notoriously hard to find in published papers! But there are a couple of classic papers well worth a look, e.g. Nevmyvaka and Kearns 2006 and Moody and Saffell 2001.

          [–]darkmighty 4 points5 points  (0 children)

          Which is of course understandable, due to the almost-zero-sum nature of financial trading :) Someone publishing a dominant method will incur a loss as soon as others also start using it, and it tends to lose power.

          Which is why, if you're interested in research, I don't recommend the financial industry!

          [–]seigenblues 10 points11 points  (1 child)

          Ah, and one more -- the AGZ algorithm seems very applicable to other games -- have you run it on other games like Chess or Shogi?

          [–]gin_and_toxic 2 points3 points  (0 children)

          Would be very interesting to see how good AlphaGo Zero is at learning chess / other games, even just with a few days of training.

          In this video, David hints that it should be doable: https://www.youtube.com/watch?v=WXHFqTvfFSw

          [–]empror 27 points28 points  (5 children)

          Can you tell us something about the first move in the game? Does AlphaGo sometimes play moves that we haven't seen it play in any of the games you published? Like 10-10 or 5-3 or even really strange moves? If not, is it just out of "habit", or does it have a strong belief that 3-3, 3-4 and 4-4 are superior?

          [–]David_SilverDeepMind[S] 17 points18 points  (3 children)

          During training, we see AlphaGo explore a whole variety of different moves - even the 1-1 move at the start of training!

          Even very late in training, we did see Zero experiment with 6-4, but it then quickly returned to its familiar 3-4, a normal corner.

          [–]JulianSchrittwieserDeepMind 11 points12 points  (0 children)

          Actually at the start of the Zero pipeline, AlphaGo Zero plays completely randomly, e.g. in part b of figure 5 you can see that it actually plays the first move at the 1-1 point!

          Only gradually does the network adapt, and as it gets stronger it starts to favour 4-4, 3-4 and 3-3.

          [–]semi_colon 18 points19 points  (1 child)

          Greetings from /r/baduk! I don't actually have a question, but I do want to thank your team for stimulating interest in Go in the West. I've been playing it for about ten years and it's nice being able to explain Go as, "Oh, it's that game that Google made that AI for last year" and people always know what I'm talking about.

          [–]JulianSchrittwieserDeepMind 13 points14 points  (0 children)

          Thanks! I actually only started to play Go when I started to work on AlphaGo, and I'm really glad it led me to such a great game!

          [–]KapitalC 18 points19 points  (3 children)

          Hello David Silver and Julian Schrittwieser, and thank you for taking the time to talk with us about your work. A couple of months ago I saw David's course on deep learning on YouTube and I've been hooked ever since!

          And now for the question:   

          It seems that using or simulating long-term memory for RL agents is a big hurdle. Looking towards the future, do you believe we are close to "solving" this with a new way of thinking? Or is it just a matter of creating extremely large networks and waiting for the technology to get there?

           

          P.S. I'm aspiring to be an AI engineer, but am interested in getting there by showcasing independent projects rather than by doing a master's degree. Do I have a chance to work at a company such as DeepMind, or is a master's degree a must?

           

          [–]JulianSchrittwieserDeepMind 7 points8 points  (0 children)

          You are right about long term memory being an important ingredient, e.g. in StarCraft where you might have thousands of actions in a single game yet still need to remember what you scouted.

          I think there are already exciting components out there (Neural Turing Machines!), but I think we'll see some more impressive advances in this area.

          [–]JulianSchrittwieserDeepMind 15 points16 points  (0 children)

          I don't have a Master's degree, so don't let that stop you!

          [–]CitricBase 9 points10 points  (1 child)

          It was said that the version of AlphaGo that played Ke Jie needed only a tenth of the processing power of the one that played against Lee Sedol. What kind of optimizations did you do to accomplish that? Was it simply that AlphaGo was ten times stronger?

          [–]JulianSchrittwieserDeepMind 14 points15 points  (0 children)

          This was primarily due to the improved value/policy dual-network - with both better training and better architecture, see also figure 4 in the paper comparing the different network architectures.

          [–]Borthralla 9 points10 points  (0 children)

          I'm a huge fan of AlphaGo!
          My first question is about handicap games. Is AlphaGo's Neural Network applicable to handicap games, or is it strictly trained for even games with standard 7.5 komi Chinese rules?

          Secondly, everyone is waiting with bated breath for the AlphaGo teaching software hinted at towards the end of Wuzhen. Although nothing is certain yet, who will be able to get the software? And what will be required to run it? Does AlphaGo's Neural Network take up a lot of space?

          Third, has AlphaGo been continuing to learn since the Wuzhen games? Are you going to continue training it? If so, do you think you'll ever release more self-play games? Also, could it review some of the games played in the 60-game self-play series? Michael Redmond and Chris Garlock are making a series on the self-play games and I'm sure they would find that sort of thing incredibly insightful.

          Edit: with the reveal of AlphaGo Zero, how much stronger is it than the version that played at Wuzhen? Wow!!

          Thank you!!!!

          [–]Adjutor_de_Vernon 6 points7 points  (1 child)

          Have you thought of using a generative adversarial network?

          We all love AlphaGo, but it has a tendency to slow down when ahead. This is annoying for Go players because it hides its real strength and plays suboptimal endgame. I know this is not a bug but a feature resulting from the fact that AlphaGo maximises its winning probability. What could be cool would be to create a demon version of AlphaGo that maximises its expected winning margin. That demon would not slow down when ahead, not hide its strength, not play unreasonable moves when losing, and always play optimal endgame. That demon could serve as a generative adversarial counterpart to an angel version that maximises its probability of winning. As we know, we all improve by playing against different styles. This could make for hellish matches between the angel and the demon. Of course the angel would win more games, but it would be like winning the Electoral College without winning the popular vote...

          [–]David_SilverDeepMind[S] 6 points7 points  (0 children)

          In some sense, training from self-play is already somewhat adversarial: each iteration is attempting to find the "anti-strategy" against the previous version.

          [–]rlsing 14 points15 points  (4 children)

          Michael Redmond's reviews of AlphaGo's self-play have brought up some interesting points for behavioral differences between AlphaGo and human professionals:

          (1) AlphaGo clearly plays bad moves in particular situations that a human pro would never play

          (2) AlphaGo was not able to learn deep procedural knowledge (joseki)

          How difficult would it be to have AlphaGo pass a "Go Turing Test"? E.g., what kind of research or techniques would be necessary before it would be possible to have AlphaGo play like an actual professional? How soon could this happen? What are the roadblocks?

          [–]David_SilverDeepMind[S] 22 points23 points  (0 children)

          (1) I believe these "bad" moves of AlphaGo are only bad from a perspective of maximising score, as a human would play. But if the lower scoring move leads to a sure win - is it really bad?

          (2) AlphaGo has learned plenty of human joseki and also its own joseki, indeed human pro players now sometimes play AlphaGo joseki :)

          [–]pvkooten 13 points14 points  (3 children)

          Thanks for doing this! And David: thanks for the RL course.

          I have a few questions, I hope you can answer them:

          1. How's life at DeepMind?

          2. Who were the members of team AlphaGo?

          3. Could you say something about how the work was divided within the AlphaGo team?

          4. What's the next big challenge?

          [–]David_SilverDeepMind[S] 14 points15 points  (1 child)

          Life at DeepMind is great :) Not a recruitment plug - but I feel actually quite lucky and privileged to be here doing what I love every day. Lots of (sometimes too many! :)) cool projects to get involved in.

          We've been lucky enough to have many great people work on AlphaGo - you can get an idea of the contributors by looking at the respective author lists - also there is a very brief outline of contributions in the respective Nature papers.

          [–]goPlayerJuggler 12 points13 points  (2 children)

          Thanks a lot for organising this Q&A. Here are my 11 (!) questions, in no particular order of preference. Some of them have already been asked by others.

          1. How was the 50-game self-play set chosen? Was it picked from a larger set?

          2. Could you outline the sizes of other non-published sets of AG games you have been working with?

          3. Apparently you have stated that 7.5 komi is the best value for balancing the game, according to your data. How does that relate to Black only winning 12 games in the 50-game set?

          4. Was Godmoves actually AlphaGo incognito? https://www.reddit.com/r/baduk/comments/5kuo93/what_is_this_god_move_thing/ http://gokifu.com/playerother/GodMoves More generally, can you tell us of any other incognito games on Go servers, apart from the Master / Magist series?

          5. How does AG manage with triple kos, molasses ko etc? Does it have a superko implementation? What experimentation did you do in this area?

          6. How would you go about preparing AIs for playing Go variants such as Toroidal Go? It could be a good project for an intern at DeepMind maybe? :) Here are some sample variants that would be interesting: https://senseis.xmp.net/?ToroidalGo https://senseis.xmp.net/?VetoGo https://senseis.xmp.net/?environmentalGo https://senseis.xmp.net/?SuperpowerGo (a whole family of variants) Maybe my challenge is to create a single “generic” Go AI that would play at (near) AG level for different komis, board sizes and variants.

          7. Would it be possible to tweak AG so as to get instances with different playing styles?

          8. Do you have a tool that takes a set of games by a single player as input, and as output returns an estimate of the player’s strength? If not, how feasible do you think creating such a tool would be? Also the problem could be made more open ended by requiring the tool to also indicate the player’s strong/weak points (fuseki, chuban, yose, positional judgement, …)

          9. Did exposure to AG improve skills of strong Go players within Deepmind (people like Fan Hui, Aja Huang, T Hubert)? And how? Have there been experiments on using AG and related tools for training human players?

          10. Would Deepmind reconsider retiring AG? Say aliens appeared and challenged humanity to a jubango – how much further do you think AG could be improved?

          11. If the latest AI technology were used to play Chess, do you think something significantly stronger than the current “brute-force” chess engines could be produced?

          Sorry it’s such long list.

          As well as answering my and other people’s questions, I would be greatly interested to hear about your most recent research with AG. Perhaps that would be even more interesting than answering some of our questions!

          Cheers; I thank you and all the Deepmind team for all your incredible work.

          (edit: added line returns and question #11)

          [–]aegonbittersteel 19 points20 points  (2 children)

          The original paper mentioned that AlphaGo was initially trained using supervised learning from over a million games and then through a huge amount of self play. For most tasks that amount of initial human supervision would not exist. Now with AlphaGo's success are you looking into making a Go player entirely from self-play (without the initial supervision)? Does such a network successfully train?

          Finally, a big thank you to David for your online reinforcement learning lecture videos. They are an excellent resource for anyone new to the field.

          EDIT: This question has been answered in Deepmind's new blog post. See link below.

          [–]enntwo 16 points17 points  (1 child)

          For what its worth - just announced - AG Zero: https://deepmind.com/blog/alphago-zero-learning-scratch/

          Fully self-trained, no human input, takes 40 days to train a network stronger than AG Master.

          [–][deleted] 5 points6 points  (0 children)

          ~23 days*, 40 days is 300 elo stronger.

          [–]roryhr 5 points6 points  (0 children)

          What are y'all working on now?

          [–][deleted] 7 points8 points  (0 children)

          What are some of the most interesting things you've seen AlphaGo do?

          [–]xuzou 4 points5 points  (0 children)

          Can we have all 100 AG Zero vs AG master games instead of only the first 20 in supplementary materials? Thanks very much.

          [–]say_wot_again 17 points18 points  (3 children)

          Since both you and Facebook were working on the problem at roughly the same time, what was the advantage that allowed you to get to grandmaster level performance so much sooner?

          What do you see as the next frontier for ML, and especially for RL, in areas where getting as much training data as AlphaGo had is untenable?

          [–]David_SilverDeepMind[S] 30 points31 points  (0 children)

          Facebook focused more on supervised learning, producing one of the strongest programs at that time. We chose to focus more on reinforcement learning, as we believed it would ultimately take us beyond human knowledge. Our recent results actually show that a supervised-only approach can achieve a surprisingly high performance - but that reinforcement learning was absolutely key to progressing far beyond human levels.

          [–][deleted] 7 points8 points  (1 child)

          For what it's worth, I remember when the first AG paper was released and the number of GPUs was disclosed, one of the facebook guys tweeted that their budget provided them with a single digit number of GPUs.

          [–]somebodytookmynick 15 points16 points  (5 children)

          Please tell us about Tengen.

          Or … perhaps rather about why not Tengen :-)

          Also, have you tried forcing AlphaGo (black) to play Tengen as first move?

          If yes, can we see some games, please?

          <edit>

          I must re-think my question …

          Could it happen that, if AGZ played a few million more games, or a billion, it might actually discover that Tengen is indeed the best first move?

          </edit>

          [–]Andeol57 5 points6 points  (2 children)

          AlphaGo Zero brings a new aspect to this: even without any influence from human play, it still mostly plays 4-4 points to start a game, with some 3-4 and 3-3 as well.

          A bit anticlimactic.

          [–][deleted] 11 points12 points  (0 children)

          When do you think robots will be able to efficiently solve, and generalise to, high-dimensional real-world problems (e.g. a device that learns by itself how to pick up litter of any shape or size, in any location)?

          Do you think some flavour of Policy Gradient methods will be key to this?

          [–]sml0820 13 points14 points  (1 child)

          The documentary was compelling. Although it is playing in screenings around the world: https://www.alphagomovie.com/screenings, when can we expect the ability to purchase or stream it?

          [–]David_SilverDeepMind[S] 14 points15 points  (0 children)

          The creators of the documentary are planning a digital release in the next few months on platforms where you can buy and rent movies, such as the Google Play Store, iTunes, and YouTube Movies. They're also exploring a release on a streaming service.

          [–]sml0820 10 points11 points  (11 children)

          You mentioned a new research paper being released in relation to the Master version of AlphaGo. You also said you may try to train AlphaGo from scratch without leveraging the initial policy network trained on human games. Do you know when the paper will be released and what is the status on training from scratch?

          [–]JulianSchrittwieserDeepMind 26 points27 points  (10 children)

          [–]lilosergey 4 points5 points  (4 children)

          Wow guys you are so awesome! I'm dying for the kifus of AlphaGo Zero!!!

          [–]diogovk 2 points3 points  (4 children)

          Please note you can read the paper for free at the end of the page: https://deepmind.com/blog/alphago-zero-learning-scratch/

          Apparently the download button doesn't work.

          [–]Orc762 3 points4 points  (1 child)

          Glad you guys are able to take some time for us!

          Will there be any more matches against pros?

          [–]JulianSchrittwieserDeepMind 7 points8 points  (0 children)

          Thanks, hope our answers are useful!

          As we said in May, the Future of Go Summit was our final match event with AlphaGo.

          [–]newproblemsolving 4 points5 points  (0 children)

          Can AlphaGo play two exhibition matches (not competitive matches, as I know AlphaGo is retired) with Michael Redmond or any professional player (or high-dan amateur): (A) with 2 or 3 handicap stones, (B) a mirror-Go game with White mirroring and AlphaGo taking Black?

          BTW, for (B) it would just be so fun to see how AlphaGo deals with it; it's sad it hasn't happened so far.

          [–]splendor01 3 points4 points  (1 child)

          I wrote a program for playing gomoku (https://github.com/splendor-kill/ml-five) based on the AlphaGo paper. The SL network was trained on datasets gathered from the games of the top 3 Gomocup players. At the RL stage, the RL agent is initialised with the SL network's parameters. In battle mode the opponent's parameters are fixed while the RL agent keeps learning; after some time, when the win rate exceeds a certain level (for example 55%), I stop, make a copy of the RL agent, and put it into the opponent pool. Then I randomly select another opponent from the pool and repeat. (A rough sketch of this loop is given after this comment.)

          But here is an interesting thing I found: at first the RL agent easily and quickly discovers the shortcomings of its opponent and defeats it. However, after several rounds the agent becomes "stupid" and seems to forget everything it had learned before.

          I am wondering: how does AlphaGo solve this?

          Looking forward to your reply. Thanks!
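
          A minimal sketch of the opponent-pool loop described above, assuming user-supplied training and evaluation callables; the names, batch structure, and everything other than the 55% promotion threshold taken from the question are illustrative, not the actual ml-five or AlphaGo code:

          ```python
          import copy
          import random

          def opponent_pool_training(agent, train_vs, win_rate_vs,
                                     rounds=1000, promote_threshold=0.55):
              """Self-play against a pool of frozen past checkpoints.

              agent                   -- learning RL agent, initialised from the SL network
              train_vs(agent, opp)    -- play a batch of games vs. the frozen opponent and
                                         update the agent's parameters (user-supplied)
              win_rate_vs(agent, opp) -- fraction of evaluation games the agent wins
              """
              pool = [copy.deepcopy(agent)]          # pool starts with the SL-initialised agent
              opponent = random.choice(pool)
              for _ in range(rounds):
                  train_vs(agent, opponent)          # RL updates against a fixed opponent
                  if win_rate_vs(agent, opponent) > promote_threshold:
                      pool.append(copy.deepcopy(agent))   # freeze the improved agent
                      opponent = random.choice(pool)      # face a randomly chosen past self
              return agent, pool
          ```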

          [–]Walther_ 2 points3 points  (1 child)

          How to get involved in the AI work today?

          I think one obvious approach is "complete a PhD and apply for a job", but that feels like an answer to the slightly different question of "what's the most common way to get a career in AI".

          In today's world with hackathons, agile development, open-source communities and such, I'm fairly optimistic there have to be ways for an eager soon-to-be BSc to be able to start poking at things, to learn via experimenting, participating in group efforts, and getting mentoring from more experienced people, in addition to formal education.

          (Personally, I'm currently writing my BSc thesis on AlphaGo, so I've got that going already, which is nice.)

          Big thanks for all of your work and this AmA.

          [–]JulianSchrittwieserDeepMind 9 points10 points  (0 children)

          Another approach that works well: pick an interesting problem, train lots of networks and explore architectures until you find something that works well, publish a paper or present at a conference, repeat. There is a great community here for feedback, and you can follow recent work on arXiv.

          [–]hyh123 4 points5 points  (1 child)

          On AlphaGo, now that you have done AlphaGo Zero, do you think you could have created it without developing the previous versions first? It seems like it's very different from the earlier ones.

          [–]JulianSchrittwieserDeepMind 4 points5 points  (0 children)

          We learned a lot during the development of all previous AlphaGo versions, all of which came together in our new AlphaGo Zero paper.

          [–]smurfix 5 points6 points  (0 children)

          Would it be possible to do this again, substituting chess for Go?

          I realize that it's just another game that's already been "done" with computers, but it'd be very interesting to contrast the style of play that Deep Blue exhibited with whatever style AlphaGo Zero might develop. Also, AlphaGo Zero is reported to have come up with some interesting new Go stratagems; I wonder if that would happen with chess too. And, frankly, thirdly, as a hobbyist chess player I can at least appreciate intricate chess moves, while Go is as obscure as it gets. ;-)

          [–]ogs_kfp_t 4 points5 points  (0 children)

          I challenge you to make such a heatmap of opening moves with AlphaGo Zero:

          http://i.imgur.com/7hz0qEL.png

          I am very curious. If you send me the probabilities, I will help to create the image.

          [–]Jameswinegar 6 points7 points  (2 children)

          When working on AlphaGo what was the most difficult obstacle you faced concerning the architecture of the system?

          [–]David_SilverDeepMind[S] 23 points24 points  (1 child)

          One big challenge we faced was in the period up to the Lee Sedol match, when we realised that AlphaGo would occasionally suffer from what we called "delusions" - games in which it would systematically misunderstand the board in a manner that could persist for many moves. We tried many ideas to address this weakness - and it was always very tempting to bring in more Go knowledge, or human meta-knowledge, to address the issue. But in the end we achieved the greatest success - finally erasing these issues from AlphaGo - by becoming more principled, using less knowledge, and relying ever more on the power of reinforcement learning to bootstrap itself towards higher quality solutions.

          [–]undefdev 7 points8 points  (0 children)

          Are there any plans to release a dataset of some of the situations that are "very difficult" for AlphaGo? It seems like finding good strategies for these situations should be the next challenge we should face to further deepen our understanding of Go.

          [–]sml0820 8 points9 points  (0 children)

          What real life areas do you find most promising for applications of reinforcement algorithms such as AlphaGo - 5, 10, and 15 years out?

          [–]empror 7 points8 points  (1 child)

          Would it be possible to train your AI to decide itself how long it wants to think about a move? For example, in the game Alphago lost against Lee Sedol, would Alphago have found a better move if it had had more time to think about the famous wedge? How about those needless forcing moves that Michael Redmond likes to criticize, aren't they a sign that Alphago cries out to have control over its pace?

          Edit: Maybe my wording was a bit vague, so I'll try to explain what I mean with the last question: Often Alphago plays moves where it is obvious that the opponent has to answer (e.g. fills a liberty). For many of these forcing moves, strong players agree that the move itself cannot possibly have any positive effect (while it is not entirely clear whether the effect is negative or neutral). Michael Redmond and others have been speculating that Alphago has only some limited time for each move, and if it wants to think longer, then it plays some forcing move. So my question is: If Alphago already knows that the time is not enough, wouldn't it be feasible to just let it take longer for this move than for others?

          [–]David_SilverDeepMind[S] 3 points4 points  (0 children)

          We actually used quite a straightforward strategy for time-control, based on a simple optimisation of winning rate in self-play games. But more sophisticated strategies are certainly possible - and could indeed improve performance a little.

          [–]sritee 6 points7 points  (0 children)

          Do you think we can see RL being used in Self-driving vehicles any time soon? If not, would the primary reason be its data inefficiency, or some other concerns?

          [–][deleted] 4 points5 points  (0 children)

          What are the stages that AlphaGo goes through when trained from scratch (if you did this experiment), after reaching, say, amateur dan level?

          Do these stages correspond somehow with the way Go style evolved for humans over the past few hundred years?

          [–]alcoholicfox 3 points4 points  (0 children)

          What do you recommend an undergrad do if they are interested in deep learning research?

          [–]valdanylchuk 3 points4 points  (1 child)

          What are some expected milestone dates and achievements in Starcraft? Are there more exciting things to come soon, e.g. in VR or NLP?

          [–]darkmighty 4 points5 points  (0 children)

          AlphaGo is remarkable for finally combining an intuitive, heuristic, learned component (the value and policy networks) with an explicit planning algorithm (the Monte Carlo tree search rollouts); a minimal sketch of this combination is given after this comment.

          Do you expect this approach to be enough for more general intelligence tasks, such as the games StarCraft or Dota when played from visual input, or maybe the game Portal?

          Notable shortcomings in those cases are that

          a) Complex environments don't have simple state transition functions. Predicting the future in a Monte Carlo rollout is thus very difficult.

          b) The future states are not equally important. Sometimes your actions need precision down to milliseconds, sometimes you're just strolling though a passage with nothing of note happening. Uniform steps in time seem infeasible.

          c) AlphaGo is non-recursive. Thus it cannot accomplish tasks that require arbitrary computations. This is perhaps irrelevant in Go, where the state of the board itself provides a sort of memory for its thinking, with the policy network functioning more or less as an evolution function of the thinking process. Even in complex scenarios one could imagine the agent using the predicted world itself as a sort of "blackboard" to carry out complex planning. The efficiency of this seems questionable however: the environment needs to support such "blackboard" memory (have many states that can be modified with low cost); and modifying this blackboard in the real world seems largely redundant.

          If not, what immediate improvements do you have in mind?
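
          For reference, a minimal sketch of the PUCT-style selection rule that combines the two components mentioned at the top of this comment: the policy network supplies a prior over moves, and the value estimates backed up through the tree supply Q. The Node fields, the c_puct value, and the "+1" in the exploration term are illustrative, not DeepMind's exact implementation:

          ```python
          import math
          from dataclasses import dataclass, field

          @dataclass
          class Node:
              prior: float                 # P(s, a) from the policy network
              visit_count: int = 0
              value_sum: float = 0.0       # sum of backed-up value estimates
              children: dict = field(default_factory=dict)   # move -> Node

              def q(self) -> float:
                  return self.value_sum / self.visit_count if self.visit_count else 0.0

          def select_child(node: Node, c_puct: float = 1.5):
              """Pick the child maximising Q(s,a) + U(s,a): the learned prior steers
              exploration, the accumulated value estimates steer exploitation."""
              total_visits = sum(c.visit_count for c in node.children.values())

              def score(item):
                  _, child = item
                  u = c_puct * child.prior * math.sqrt(total_visits + 1) / (1 + child.visit_count)
                  return child.q() + u

              return max(node.children.items(), key=score)   # (move, child) with best score
          ```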

          [–]Borgut1337 3 points4 points  (0 children)

          About AlphaGo Zero and its self-play:

          Do you think the MCTS it still uses is critical to making self-play work out correctly? I would personally suspect that reinforcement learning purely from self-play, without any search, would risk "overfitting" against itself, and I suspect incorporating a bit of search helps combat that. Do you have any thoughts on this?

          [–]EAD86 1 point2 points  (0 children)

          How did you decide on the 40-day training time for AlphaGo Zero? Would it get stronger if you let it train longer?

          [–]NotModusPonens 2 points3 points  (0 children)

          Does AlphaGo Zero eventually only play the two 4-4 points in the opening?

          Edit: also, have you tried training on bigger board sizes? 21x21, 37x37, even something bigger than that?

          [–]hyperforce 4 points5 points  (0 children)

          This new approach seems much simpler than the initial AlphaGo which had a much more complicated architecture.

          Was this the first time you tried this simpler approach? Why did the initial AlphaGo you went public with not use this self-learning approach? Did something change recently that made bootstrapping more feasible? Did the work into the initial AlphaGo make the road to Zero easier?

          [–]danielrrich 6 points7 points  (0 children)

          Any further updates about the discussed teaching/review assistant? I really think it would be cool from the perspective of transferring AlphaGo's superhuman knowledge/behavior to people.

          [–]Feryll 5 points6 points  (0 children)

          Is there any new information on the "AG training tool" that was mentioned as being something we could soon look forward to? Many of us in the go community are wondering what that is, and what a very tentative schedule for that might be.

          [–]YearZero 5 points6 points  (2 children)

          Would you guys consider applying the AlphaGo Zero technique to chess? Would it have an advantage over the current top heuristic-based engines like Komodo or Stockfish, which are around 3400 Elo? It would be interesting to see what would happen, even just as a curiosity. Even better, though, if it could be released as a competing engine, especially if it dramatically trumps all that came before, forcing the entire community to change methods and follow suit. Thanks!

          [–]bennedik 4 points5 points  (1 child)

          One of the authors of the AlphaGo Zero paper is Matthew Lai, who developed the Giraffe chess engine before joining DeepMind. This engine also learned the evaluation function for chess from scratch, and achieved the level of an IM. That was a fantastic result, but significantly weaker than the top chess engines which use evaluation functions fine-tuned by human programmers. What are your thoughts on applying the results from AlphaGo Zero to a Giraffe like chess engine? And is that something DeepMind would ever work on, or is the game of chess considered "solved" in terms of AI work?

          [–]Revoltwind 8 points9 points  (0 children)

          How many handicap stones does Fan Hui need to play an even game against AlphaGo?

          Is AlphaGo able to run on mobile? If yes, how strong is it? If not, what are the limitations preventing a port to mobile?

          Thank you for this AMA! Looking forward to your paper.

          [–]m2u2 8 points9 points  (0 children)

          What did you think of the Chinese government's censorship of the Ke Jie matches? Was it due to your being a Google-owned company, or simply embarrassment that a Western team cracked this game that was invented in China?

          Really looking forward to the documentary!

          [–]BuckeyeInSeattle 12 points13 points  (4 children)

          Thanks for the AMA!

          DeepMind has said on multiple occasions that this foray into Go is just a stepping stone to other applications, such as medical diagnosis, which is obviously laudable.

          With that in mind, I'm troubled by the way AlphaGo makes provably sub-optimal moves in the end game. When given a choice between N moves that win, AlphaGo will select the "safest", but if they're all equally safe, it appears to choose more or less at random. One specific example I can remember is when it decided to make two eyes with a group, and chose to make the second eye by playing a stone inside its own territory, rather than by playing on the boundary of its territory, losing 1 point for no reason.

          The reason this concerns me is because this behavior only makes sense if you assume it can never be wrong about its analysis. In other words, it does not give any consideration to the notion that it might have calculated something wrong. If it had any idea of uncertainty, it would prefer the move that doesn't lose 1 point 100% of the time, just in case there was some move it hadn't anticipated that made it lose some points elsewhere on the board.

          While playing Go, this isn't a big deal, but coming back to my original point, with things like medical diagnosis this could be a real life and death matter (pun fully intended). It seems self-evident to me that you would like your AI to account for the possibility that it has calculated something wrong, when it can be done at no cost (as is the case when choosing between two moves that both make a second eye).

          Do you have any thoughts about this, or more generally about it "giving away" points in winning positions when doing so doesn't actually reduce uncertainty?

          [–][deleted] 4 points5 points  (0 children)

          Does AlphaGo play actual handicap games, or are the comparisons between versions done at even play, with the reported handicap size just inferred from the win ratio?

          Can you please publish some of the actual handicap games?

          [–]ViktorMV 6 points7 points  (0 children)

          Hi David, Julian, thanks for this thread!

          1) How strong is the current version of AlphaGo, for example compared to the Ke Jie version and to the Master version? What is its rating? Are you continuing its training?

          2) Can you share self-play games with handicap against older versions, and new self-play games of the latest version?

          3) Why did you decide to follow the marketers' recommendation to retire AlphaGo while at least one question that is very interesting to the Go community was still open: with how many handicap stones can AlphaGo still beat a top pro?

          4) Can you share AlphaGo's commentary, with variations and win probabilities, for its self-play games in English?

          5) Is there any chance you will share more information from AlphaGo: analysis of some contemporary fuseki, new self-play games with comments, etc.?

          Good luck with your research; looking forward to seeing your StarCraft 2 progress!

          [–]tallguy1618 5 points6 points  (0 children)

          Do you guys have any wacky AIs that just do fun things around the office?

          [–]salunero 2 points3 points  (0 children)

          Is it possible to derive some heuristics from the neural networks that AlphaGo currently uses, or should we view them only as mystery boxes that give out answers without telling us how and why? Or does this kind of thinking make no sense?

          [–]newproblemsolving 2 points3 points  (0 children)

          Is AlphaGo still training itself, and will it do so for the foreseeable future, or has it stopped completely now?

          [–]berndscb1 2 points3 points  (0 children)

          Would it be possible for DeepMind to produce annotations of famous classic games using AlphaGo (or make AlphaGo accessible enough that others could produce something like this)?

          [–]temitope-a 2 points3 points  (0 children)

          Have you peeked inside the layers of AlphaGo?

          At times the sequences of inputs and outputs of different layers can reveal the 'understanding' the network has of the problem.

          Were you able to isolate ladders, miai, hane, invasions or some other concepts of Go in AlphaGo?

          Question from the Oxford Student Go Society

          [–]brkirby 2 points3 points  (0 children)

          AlphaGo cannot explain its play, which poses a problem when similar techniques are applied to areas such as health care. Any thoughts on improving this flaw? How can society trust AI when it’s known to be subject to mistakes that it can’t articulate to humans?

          [–]_tomakko 2 points3 points  (0 children)

          Hi! How did you proceed when designing the neural net architecture for AlphaGo? What kind of theoretical considerations did you make regarding e.g. effective receptive fields, number of layers, and filter sizes? Did you fine-tune the architecture by trial and error afterwards?

          [–]hawking1125 2 points3 points  (0 children)

          1. What game(s) are you planning to conquer next?
          2. What lessons learned from AlphaGo helped you in subsequent research?
          3. What for you is the future of AI and how has AlphaGo affected it?
          4. How will the results from AlphaGo Zero affect how you approach RL in Starcraft?
          5. Do you plan on trying to beat OpenAI at DotA 2?

          EDIT: Added some more questions

          [–]P42- 2 points3 points  (0 children)

          Do you expect that AGI will be able to independently design technology that is decades or centuries beyond unassisted technological progression?

          [–]temitope-a 2 points3 points  (0 children)

          Can AlphaGo be made to 'talk' about Go, besides playing it, i.e. explain what it is doing? After AlphaGo, DeepMind has explored memory / imagination / planning. Would AlphaGo improve with such techniques?

          Question from the Oxford Student Go Society

          [–]enntwo 2 points3 points  (0 children)

          For the self-play games, are both "players" using the same trained network, or is each player using a separately trained network?

          My assumption is that it is the same network, and if that is the case I was wondering if you could speak to any inherent biases that may arise in games where the same network plays both sides. Would each player have the same blind spots/oversights? I feel like some of the non-humanness of these self-play games stems from biases like these, where both players have pretty much the same "strategies"/"thoughts", for lack of better terms, behind each move.

          If it is the same network, do you think games where each player is a separately trained network of similar strength would appear more "human-like", or look different overall from games played by a single network? (A minimal sketch of a same-network self-play loop is given below.)
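
          As described in the AlphaGo Zero paper, self-play games are indeed generated with a single network searching for both sides. A minimal sketch of such a loop, where the position object and the search_policy callable are illustrative rather than DeepMind's actual interfaces (the 30-move temperature cutoff follows the published description):

          ```python
          import random

          def self_play_game(search_policy, position, temperature_moves=30):
              """One self-play game in which a single network (inside search_policy)
              evaluates positions for both colours.

              search_policy(position) -> {move: probability} from MCTS visit counts
              position                -- game state with play(), is_terminal(), winner()
              Returns (position, search probabilities, outcome) training examples.
              """
              records, move_number = [], 0
              while not position.is_terminal():
                  probs = search_policy(position)                    # same net, both sides
                  records.append((position, probs))
                  moves, weights = zip(*probs.items())
                  if move_number < temperature_moves:                # explore early moves
                      move = random.choices(moves, weights=weights)[0]
                  else:                                              # then play greedily
                      move = max(probs, key=probs.get)
                  position = position.play(move)
                  move_number += 1
              z = position.winner()                                  # +1 Black win, -1 loss
              return [(pos, p, z) for pos, p in records]
          ```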

          [–]-S7evin- 2 points3 points  (0 children)

          You said that the AlphaGo Zero algorithm can be used in fields other than games; do you have a roadmap for where to start? Thank you.

          [–]rick_rick_rick 2 points3 points  (0 children)

          It would be interesting and useful for later analysis if the SGF files for the Master vs Zero games also included the moves that Zero would have played in place of Master's.

          Is there any hope that DeepMind would release some AlphaGo games that include alternative lines in that fashion, which would greatly assist in later human analysis of these games?

          [–]charm001 2 points3 points  (0 children)

          Is one of your goals with AlphaGo Zero to develop a version of AlphaGo that we can buy and use on normal computers, and maybe even our phones?

          If so when do you think that will be possible?

          [–]picardythird 2 points3 points  (0 children)

          1.) With the reduced hardware requirements of AlphaGo Master and AlphaGo Zero making them less expensive to run, will you be providing a way for amateurs or professionals to access AlphaGo as a tool?

          2.) Why do AlphaGo Master and AlphaGo Zero play random forcing moves? Michael Redmond has speculated that they are "time-saving" moves, although in the Game 11 review he mentions that he got the side-eye from a researcher when he suggested that, indicating that this is not the case.

          3.) It has been mentioned that AlphaGo Master was tweaked in terms of complicated tsumego with a custom training regimen composed by Mr. Fan Hui, which some such as Michael Redmond have suggested is a reason that AlphaGo Master is prone to extremely complicated games. In comparison, while AlphaGo Zero's games are not simple by any stretch, they seem to be less confrontational than AlphaGo Master's games. Is this because AlphaGo Zero was not so tweaked by any such custom training program?

          [–]Smallpaul 2 points3 points  (0 children)

          Could the AlphaGo Zero program be taught to play Reversi or Connect Four just by changing the ruleset? Isn't this a more important milestone than tabula rasa mastery of a game that has already been mastered? If you could apply the same engine to multiple games, the claim of generalizable technology would be indisputable.

          [–]gin_and_toxic 2 points3 points  (0 children)

          Hi David, saw the movie recently. You're especially hilarious when trolling everyone at the end of the last game. It's great to see all of your team's struggles and points of view beyond what we saw on the stream last year.

          Questions: What are members of the previous AlphaGo team working on now, as far as you can tell us? Is everyone still working on different variations of AlphaGo, or are you moving on to something else?

          If you were to give AlphaGo an avatar, what would you personally choose?

          Thanks for the AMA.

          [–]zebub9 2 points3 points  (0 children)

          1. Could you release a win-rate map for the empty board? And maybe some self-play games with komi 7?

          2. Do you plan to let AG0 play a few games against humans at a decent handicap, to see the strength difference and produce some interesting games?

          3. There seems to be significantly less strength difference between AG0 and AG Master than between AG Master and earlier versions. Is this because there is less room left towards perfect play, or for some other reason?

          [–]nestedsoftware 2 points3 points  (0 children)

          After AG lost game 4 to Lee Sedol, it was apparently trained against an "anti-AlphaGo" to fix the reading weaknesses this loss exposed. Was AlphaGo Zero also trained in this manner? If not, how were these kinds of potential problems handled?

          Thank you!

          [–]icosaplex 2 points3 points  (0 children)

          So it seems like there is mounting evidence that at AlphaGo's level, White is significantly favored at 7.5 komi. I presume that Black would be significantly favored at 5.5 komi.

          One funny issue is that with Tromp-Taylor or other area-scoring rules, the final score (except in rare cases) only has a granularity of 2 points, whereas Japanese rules and other territory-scoring rules have a genuine granularity of 1 point, and presumably on average the ability to differentiate precision of play more finely. However, territory-based rules are a nightmare to implement formally.

          But there are alternatives. Have you considered using Tromp-Taylor-like rules, except with a "button", to achieve territory-scoring levels of result granularity? (https://senseis.xmp.net/?ButtonGo) If one were to use 6.5 komi with the increased granularity, do you think there would still be a strong bias in favor of one side or the other at an AlphaGo level of strength?

          [–]tobasz 2 points3 points  (0 children)

          If you replaced the board and rules of Go with the chess board and rules, would AlphaGo be able to learn to play better than a current open-source chess program like Stockfish? Would anything else need to be changed, e.g. the MCTS?

          [–]apriltea0409 2 points3 points  (2 children)

          I have three questions. First of all, I understand all AlphaGos are trained under Chinese rules with a 7.5 komi. Does Zero continue to perform slightly better when she plays White? Has there been an attempt to have Zero play under 6.5 or any other komi value? If so, how did the change of komi affect Zero's performance? In theory, the perfect komi is the number of points by which Black would win given optimal play by both sides. As AlphaGo Zero is apparently much closer to a perfect player than any human player is today, we're interested to know: based on Zero's game data, what would be the perfect komi for Go?

          Similarly, I'd be interested in learning how well Zero would do on a larger Go board, for example 25 by 25. Have you ever tried that?

          And here's my last question. As far as I understand, AlphaGo comes up with a few candidate choices for each move. In case there are two or three moves with the same odds of winning, what mechanism does AlphaGo use to make the final choice? Or is it just a random pick?

          [–]ffontana 5 points6 points  (0 children)

          What's the future of Alphago? Will it be publicly available? For example, renting an hour to play with the AI. Thanks!

          [–]GetInThereLewis 4 points5 points  (0 children)

          First, thank you for all your hard work on AlphaGo and your contributions to the Go playing community!

          My questions are:

          1. Do you have an update on the next publication that Demis mentioned at Wuzhen?

          2. How closely were you watching other Go AI programs such as DeepZen and FineArt, and have you ever tested AlphaGo against them?

          3. Will AlphaGo ever be released, or at least accessible to the public?

          4. Can you sell DeepMind/AlphaGo swag please (shirts, hoodies, etc)?!

          edit: You already answered question 1! Thank you!

          [–][deleted] 5 points6 points  (2 children)

          Do you have any estimate of how far AlphaGo is from perfect play, maybe from studying the progress graph over time? Did the training process hit any ceiling?

          [–]cutelyaware 2 points3 points  (1 child)

          Perfect play is almost unthinkable.

          [–]darkmighty 3 points4 points  (0 children)

          I think there are proofs of computational hardness for "solving" Go (and other games). It's important to keep in mind that AlphaGo is an algorithm like any other. So you're right, it's probably completely infeasible.

          Edit: n x n generalized Go is EXPTIME-complete. This hardness proof applies only heuristically to real 19x19 Go, but it is still significant evidence that perfect play is infeasible, perhaps permanently.

          [–]IDe- 4 points5 points  (0 children)

          Has any work been done on visualizing the factors that affect the decision making process? Do you think this is something that has to be solved for domain expert + machine pairings to work effectively? Do you see teaching potential in AIs like these?

          [–]RayquazaDD 4 points5 points  (0 children)

          Thanks for the AMA.

          1. How does AlphaGo deal with mimic Go? Does AlphaGo set up double ladders, or make tengen a good point?

          2. Nowadays, if a Go AI meets a long-dragon situation (such as a long liberty race), it is often in trouble. Does AlphaGo have the same problem? How does it solve it?

          3. We saw the 55 AlphaGo self-play games. Did you choose games with particular fuseki, or at random? Did you remove any games for some reason? If yes, what were the reasons?

          [–]AndrewVashevnik 3 points4 points  (1 child)

          Hi, David and Julian! Thanks a lot for your work. And thank you for publishing scientific papers and making your research available for everyone, this is amazing.

          1) Have you tried to teach AlphaGo from scratch, without data from human games? Does it fall into an inefficient equilibrium? Do two different attempts to train AlphaGo converge to a similar result? Could you please provide some insight into the difficulties you faced when teaching AlphaGo from scratch?

          2) As I understood from the Nature paper, AlphaGo is not a 100% learning algorithm. In the first stage a handcrafted algorithm is used to process the board position; it calculates the number of liberties, whether ladders work, etc., which are then passed as inputs to the learning algorithm. Is it possible to make AlphaGo without this handcrafted part? Would the learning algorithm be able to come up with concepts like liberties or ladders? What ML techniques could be used to approach this problem?

          3) What are AlphaGo's blind spots, and what are the ways to solve them? For example, modern chess engines often struggle with fortresses.

          4) Is Fan Hui + AlphaGo significantly stronger than AlphaGo alone? Is there still a way a pro can make an impact when teamed with AlphaGo?

          I am curious about capabilities of AlphaGo to solve hardest go problems too.

          Thanks, Andrew

          UPDATE: Well, my initial question was before AlphaGo Zero was published, which pretty much answers 1) and 2)

          I am really excited about general-purpose learning algorithm. Thanks for sharing it.

          Some questions on AlphaGo Zero

          5) Have you tried this general learning approach on other board games? AlphaChess Zero, AlphaNoLimitHeadsUp Zero, etc.?

          6) If you train two separate versions of AlphaGo Zero from scratch, do they gather the same knowledge and invent the same joseki? AlphaGo Zero's training is stochastic (MCTS), so how much randomness is there in the final result after 70 hours of training? Is it a good idea to train ten different AlphaGo Zeros and then combine their knowledge, or is training one AlphaGo Zero ten times longer better?

          7) Let's look at "AlphaGo Zero 1 dan": AlphaGo Zero after 15 hours of training, with 2000 Elo and the level of an amateur 1 dan. I guess that AlphaGo Zero 1 dan would be considerably better than a human 1 dan in some aspects of play and worse in others (although their overall level is the same). Which aspects of play (close fighting, direction of play, etc.) are stronger for AlphaGo Zero 1 dan and which are stronger for the amateur 1 dan? What knowledge is easier and harder for an AI to grasp? I have read that the AI understands ladders much later than human players; are there more examples like this?

          8) On real-world applications: I am sure that this kind of learning algorithm could learn how to drive a car. The catch is that it would take millions of crashes to do so, just as it took millions of beginner-level games to train AlphaGo Zero. How can you train an AlphaCar without letting it crash many times? By building a virtual simulator based on real car data? Could you please share your thoughts on using the AlphaGo general learning algorithm when a simulator is not as easily available as in the game of Go?

          9) What would happen if you used the AlphaGo Zero training algorithm but started with the AlphaGo Lee strategy rather than a completely random strategy? Would it converge to the same AlphaGo Zero after 70+ hours of training, or would the AlphaGo Lee patterns "spoil" something?

          [–]David_SilverDeepMind[S] 9 points10 points  (0 children)

          AlphaGo Zero has no special features to deal with ladders (or indeed any other domain-specific aspect of Go). Early in training, Zero occasionally plays out ladders across the whole board - even when it has quite a sophisticated understanding of the rest of the game. But, in the games we have analysed, the fully trained Zero read all meaningful ladders correctly.