“Transformers Play Chess”, Ricson Cheng2020-01-10 (, ; similar)⁠:

The Shannon entropy of natural English language is roughly one byte per word, depending on the dataset used. Shannon estimated the number of possible chess games to be 10120. I’ve also seen an estimate of 3 reasonable moves per ply (so 1040 possible 40 move games). This begs the question: just how much information is there in a chess move?…I treated this as a sequence modeling problem. An alternative (and possibly better) approach would be to explicitly make use of the board state. However as I was lazy, I did not do this. I was also motivated by the idea of recreating blindfold chess, which is challenging for humans, but unclear for computers (how would you blindfold a computer?—(also see Tom Murphy’s Elo World)). Also as the “Markovian” approach of simply predicting the move given the current board state has been done many, many times before, I decided this was more interesting.

…The lichess.org game database contains at the time of writing roughly 1 billion games…I chose to use long algebraic notation, which specifies the start and end coordinate of every piece moved (for example, e2e4). “special” moves also include castling and promotion. There are slightly less than 2000 valid unique tokens in this notation

…I used the transformer_big_single_gpu (henceforth known as T78) model from the tensor2tensor repository which has roughly 78 million parameters. I used the default hyperparameters and did not tune anything. I trained on a single 1080ti for almost 4 days (~2 million steps). This turns out to be roughly 50 million games, which is to say, the model only saw 25% of the dataset.

While I wasn’t picky about getting only the best games, I did want some minimal quality control. Therefore I considered only games where both players had a rating of at least 1510 (I believe new accounts start at a rating of 1500), and where both players had at least 5 minutes to play the game, and where the game was at most 100 moves (200-ply). If both players had a rating of at least 2000, the time requirement was bypassed. Note that for time controls with increment, I converted it into a single number by assuming the game was 40 moves long. Roughly 21 of games passed this first filter. I further divided my dataset up by considering games where both players had a rating of at least 2000 and the time control was at least 5 minutes. Less than 1% of games met this filter, but I didn’t find this too worrying as that was still several million games…Instead of training 2 different models, or fine-tuning a trained model on the “better” games, I simply added 2 new tokens to the vocabulary, A, and B, and prefaced each game with one of the 2 tokens. A was used for the more stringent filter, and B for the rest. I did this primarily to save time. Note that it’s fairly trivial to “undo” this conditioning just by summing over the 2 possible first tokens. I was hoping strategy this would allow me to train with a massive dataset, but then to condition on A to generate higher quality games…Each sequence ends with an additional token to denote which side won, or a draw.

Example data:

A e2e4 d7d6 d2d4 e7e5 d4e5 d6e5 d1d8 e8d8 g1f3 f8d6 f1c4 c8e6 c4e6 f7e6 f3g5 d8e7 c1e3 h7h6 g5f3 g7g5 b1c3 a7a6 a1d1 g8f6 a2a3 f6g4 e3c1 d6c5 e1g1 b8c6 h2h3 g4f6 g2g4 a8g8 b2b4 c5a7 b4b5 a6b5 c3b5 a7b6 f1e1 g8d8 d1d8 h8d8 c1b2 f6d7 g1g2 d8f8 a3a4 f8f4 b2c1 f4f7 h3h4 g5h4 c1h6 h4h3 g2g3 d7c5 g4g5 f7f4 g5g6 c5e4 e1e4 f4e4 g6g7 b6f2 g3f2 e4g4 f3g5 g4g5 h6g5 e7f7 b5c7 f7g7 c7e6 g7g6 g5e3 g6f5… e6c5 b7b6 c5d7 c6b4 d7b6 f5e4 e3g5 b4c2 f2g3 c2b4 g3h3 b4a6 g5d2 e4d4 a4a5 d4c5 b6d7 c5b5 d7e5 a6c5 e5f3 c5e4 h3g4 e4d6 d2f4 d6e4 f3d4 b5a6 d4e6 2

…Results:

I “preregistered” a guess of 2.5 bits per ply before running any experiments. After seeing the results, I believe a better designed model could probably reach between 1.6 and 2.0 BPP. I also believe a larger model would perform better, as I was probably close to saturating the capacity of T78.

[Response to “A Very Unlikely Chess Game”, see Reddit. Note that Ricson Cheng’s encoding uses the ‘inline metadata trick’ in a sense, but does not include ELO player skill metadata, or put the reward at the beginning or but at the end, and so is not a Decision Transformer; Cheng got only halfway there by doing a quality split A/B and conditioning on ‘A’ tokens.]