tl;dr I developed an AI solution (MCTS + NN) inspired by AlphaZero for playing Ultimate Tic-Tac-Toe in the browser. You can try it yourself here: https://uttt.ai.
Why?
Ever since I started working in Machine Learning 5 years ago, I have always wanted to do a cool project for my portfolio. Reading scientific papers gave me plenty of ideas, but only after I read the AlphaZero preprint did I know: this is it!
AlphaZero is the third paper in the AlphaGo, AlphaGo Zero, AlphaZero, MuZero sequence. In the AlphaZero paper, DeepMind generalizes the previous work so that the AI can learn through self-play to master not only Go, but also Chess and Shogi. I had read the previous papers, but it was AlphaZero specifically that sparked my imagination, probably because I love simple and elegant engineering solutions, and AlphaZero is mostly about that.
I discovered Ultimate Tic-Tac-Toe and implemented AlphaZero in early 2018. After a few weeks of work I realized it was not going to be an easy ride. There were two major problems that essentially made me forget about this project for a long time.
Firstly, although Ultimate Tic-Tac-Toe (UTTT) looks easier than Chess or Go, it is still quite a challenging game. The average length of a UTTT game is somewhere between 40 and 50 plies, and the average number of legal actions per position is around 7 (my estimate from self-play data). That is a difficult setup for a side project. One of the key factors behind AlphaZero's success is massive computing power (5,000 TPU v1 chips for self-play and 64 TPU v2 chips for training). I had to figure out a much cheaper way to develop a genuinely good AI within my personal budget.
Secondly, when I envisioned deploying AlphaZero in the browser, I had zero knowledge of web development and frontend in general, which meant I had to find time to learn it. That is not easy when you already have a full-time job and other things going on in your life. I put the whole project on hold and told myself: "maybe one day there will be a better time for this..."
Fast-forward to 2021. I left my job and decided to spend a year on a career break, pursuing my interests. I realized that I finally had enough time and resources to conquer this project. I learned the basics of web browsers, HTML, CSS, JavaScript and React. I bought a desktop PC. I incrementally redesigned the AlphaZero self-play training into something that could run on my own machine. I evaluated the AI and confirmed it is superior to the existing implementations you can access online (other websites and mobile apps). I built a React app, tested it and finally deployed it this week: https://uttt.ai.
Differences from the original AlphaZero:
- Much smaller Policy-Value Network architecture designed specifically for playing Ultimate Tic-Tac-Toe in the browser, with only 5 million parameters (20 MB): source code
- Total separation of the self-play data generation process from the Policy-Value Network training (offline RL). This was a crucial change: there is no way I could have succeeded with a single script that implements online RL and runs for 10 weeks on my desktop. It had to be broken down into more manageable stages.
- More MCTS simulations per position for training (self-play data quality over quantity).
- The initial self-play dataset was generated from pure MCTS simulations (random playouts are faster and better than predictions from a randomly initialized Policy-Value Network).
- Search simulations are synchronous, single-threaded and sequential.
- Enabled data augmentation by flipping the board during Policy-Value Network training.
- The value target for the MSE loss function is defined as the root's mean state value rather than the game outcome.
- Masked KL divergence loss for the policy head instead of Cross Entropy loss.
- An auxiliary policy-head loss for predicting action values alongside the action logits (the last three points are sketched in code right after this list).
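To make the last three points concrete, here is a minimal PyTorch sketch of what such a combined loss can look like. The tensor names, shapes and the unweighted sum of the three terms are my illustrative assumptions, not the exact code from the repository:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: batch of B positions, 81 possible actions in UTTT.
# policy_logits, action_value_pred: (B, 81) network outputs
# value_pred: (B,) state-value head output
# legal_mask: (B, 81) boolean, True for legal actions
# target_policy: (B, 81) MCTS visit-count distribution (zero on illegal actions)
# target_value: (B,) root's mean state value from search (not the game outcome)
# target_action_values: (B, 81) per-action values estimated by MCTS

def training_loss(policy_logits, value_pred, action_value_pred,
                  legal_mask, target_policy, target_value, target_action_values):
    # Mask illegal actions before the softmax so they get zero probability.
    masked_logits = policy_logits.masked_fill(~legal_mask, float("-inf"))
    log_probs = F.log_softmax(masked_logits, dim=-1)

    # KL divergence between the MCTS policy and the network policy,
    # accumulated only over legal actions.
    kl = target_policy * (torch.log(target_policy.clamp_min(1e-8)) - log_probs)
    policy_loss = kl.masked_fill(~legal_mask, 0.0).sum(dim=-1).mean()

    # MSE against the root's mean state value rather than the final game outcome.
    value_loss = F.mse_loss(value_pred, target_value)

    # Auxiliary head: regress per-action values, again masked to legal actions.
    av_sq_err = (action_value_pred - target_action_values) ** 2
    aux_loss = av_sq_err.masked_fill(~legal_mask, 0.0).sum(dim=-1).mean()

    return policy_loss + value_loss + aux_loss
```

How the three terms are weighted against each other is a tuning choice; the sketch uses an unweighted sum for simplicity.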
There is no external benchmark to compare my solution with, so I came up with my own evaluation setup. Details are at https://github.com/ar-nowaczynski/uttt#evaluation. The main selling point is that the final Policy-Value Network checkpoint with 1k simulations is much better and faster than MCTS with random playouts and 10M simulations (a 4-order-of-magnitude difference). In other words, the Policy-Value Network learned useful information about Ultimate Tic-Tac-Toe, enabling better and faster evaluations.
I haven't found any other publicly available AI for Ultimate Tic-Tac-Toe that can beat https://uttt.ai. The best one I found online is https://www.theofekfoundation.org/games/UltimateTicTacToe/, which implements MCTS with random rollouts and some custom modifications (source code: https://github.com/The-Ofek-Foundation/UltimateTicTacToe/blob/master/script.js). It keeps up with uttt.ai for the first 10-15 moves, but eventually makes some mistakes and loses the game. Sometimes the game ends in a draw.
Various technical details and takeaways from the project:
https://uttt.ai is built using React + onnxruntime and deployed as an Azure Static Web App (shout-out to Microsoft for providing a great service).
The Policy-Value Network runs in the browser on your device using the WebAssembly backend, so it utilizes only the CPU. I wanted to use the WebGL backend, which enables GPU access, but it doesn't support the ConvTranspose2D layer yet. Either that gets added, or I have to rewrite and retrain the Policy-Value Network without ConvTranspose2D.
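If I ever go the retraining route, a common workaround is to replace each ConvTranspose2d with fixed upsampling followed by a regular convolution, both of which are broadly supported. A minimal PyTorch sketch of the idea (the channel counts and scale factor here are illustrative, not my actual architecture):

```python
import torch.nn as nn

# Common substitute for ConvTranspose2d: nearest-neighbor upsampling
# followed by a regular Conv2d. Channel counts and scale are illustrative.
upsample_block = nn.Sequential(
    nn.Upsample(scale_factor=2, mode="nearest"),
    nn.Conv2d(in_channels=64, out_channels=32, kernel_size=3, padding=1),
)
```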
To learn web dev & frontend I read the entire https://javascript.info/ course twice, watched plenty of DevEd videos (https://www.youtube.com/c/DevEd) and implemented many small throwaway projects.
My computing hardware for developing this project was: Intel i7-10700K with 8 cores x 3.80GHz, 2 x RTX 2080 Ti, 64 GB RAM.
The ONNX format is great. You can load a PyTorch model in JavaScript (via https://onnxruntime.ai/) very easily!
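For context, the PyTorch side of that hand-off is a single export call. Here is a minimal, self-contained sketch with a tiny stand-in network; the real Policy-Value Network, its input shape and the tensor names all differ:

```python
import torch
import torch.nn as nn

# Tiny stand-in network just to demonstrate the ONNX export call;
# the actual uttt.ai architecture and tensor layout are different.
class TinyPolicyValueNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(4, 8, kernel_size=3, padding=1)
        self.policy = nn.Linear(8 * 9 * 9, 81)
        self.value = nn.Linear(8 * 9 * 9, 1)

    def forward(self, x):
        h = torch.relu(self.conv(x)).flatten(1)
        return self.policy(h), self.value(h).squeeze(-1)

model = TinyPolicyValueNet().eval()
dummy_input = torch.zeros(1, 4, 9, 9)  # (batch, planes, rows, cols)
torch.onnx.export(
    model, dummy_input, "policy_value_net.onnx",
    input_names=["board"],
    output_names=["policy_logits", "value"],
)
```

The resulting .onnx file is what onnxruntime loads in the browser.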
torch.jit and libtorch are brilliant tools for using a PyTorch model in C++!
https://uttt.ai works best on desktops and gaming laptops. On my desktop: 75 simulations/sec; on my laptop: 70 sims/sec; on my phone: 2.5 sims/sec.
I created a video showing AI self-play with 100,000 simulations: https://www.youtube.com/watch?v=oqbHx3NSzaY. I'm thinking about recording another video with all the games from the NMCTS2 10k vs MCTS 10M evaluation, to show how MCTS is dominated by NMCTS2.
If you don't know the Ultimate Tic-Tac-Toe game rules: https://uttt.ai/rules
My strategy for playing Ultimate Tic-Tac-Toe (learned from the AI) is as follows:
- start in the center square of the center subgame (undoubtedly the best move, unless you want to surprise the opponent with something weird)
- the 'O' response is to push the play into a corner subgame, so let the next 8 moves be played in the corner subgames
- when 'O' breaks out of the corner subgames, jump between the side subgames (these are the least useful to take, but you still have to be careful not to mess up there)
- maintain the overall balance on the board and wait for the opponent's mistake (the game is a marathon, not a sprint)
- think twice before sending your opponent to a finished subgame (being able to choose any move in any unfinished subgame is very powerful; see the sketch below for how this rule works)
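For anyone new to the rules, here is a tiny sketch of the move-generation logic that last tip refers to. The data layout is my illustrative choice, not the representation used in the repository:

```python
# Minimal UTTT legal-move generation, illustrating the rule behind the
# last tip: if your opponent is sent to a finished subgame, they may play
# in ANY unfinished subgame instead.

def legal_moves(boards, finished, last_move):
    """boards[s][c]: cell c of subgame s ('.', 'X' or 'O');
    finished[s]: True if subgame s is already won or drawn;
    last_move: (subgame, cell) of the previous move, or None at the start."""
    if last_move is not None:
        target = last_move[1]  # the cell you play in picks the next subgame
        if not finished[target]:
            return [(target, c) for c in range(9) if boards[target][c] == "."]
    # First move, or the target subgame is finished: play anywhere that's open.
    return [(s, c) for s in range(9) for c in range(9)
            if not finished[s] and boards[s][c] == "."]
```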
source code: https://github.com/ar-nowaczynski/uttt
twitter thread: https://twitter.com/ArNowaczynski/status/1469318918837870593