GPT4 and GPT3.5 can both track a chess game indefinitely deep into the game, if data is presented to them in a specific way. The moves produced are sometimes very high quality, and draws can be obtained at low rates against Stockfish 8 (with the help of an external program, coded by GPT4, that produces more detailed natural language descriptions than shown in the examples; SF8 runs on a single core with 0.5 seconds to think and a 16MB hash).

The insight from this is that GPT-based language models appear to have a "logic core" that can be "activated" when you express information to them in certain exact representations. There is some evidence that these representations are tailored both to the topic at hand and to the way the model specifically learned certain logic/common sense, which is where its static "encoded intelligence" comes from. It seems some types of logic are specifically learned and can only be accessed by presenting data to the model in a specific order and a specific format. This is likely due to statistical reasons pertaining to the prevalence of certain formats in the training data.

When presented with the exact format shown in these screenshots, both GPT3.5 and GPT4 can reason on a chess game regardless of where in the game it is. This means it can be said with relatively high confidence that the models are reasoning on information -not- in their training data: the number of possible chess games is exponentially larger than the number of games that have ever been posted on the internet, and thus could be in the training data. Since chess has specific rules and a complete-information state space, it is highly likely that there is true reasoning on novel information when the models process these chess games below.

In the chat presented in the 3rd image, I regenerated the response 17 times to see if GPT3.5 would hallucinate about the game board, but every move was valid and decent. In other chats, hallucinations do occur, but they can be corrected by letting the model know that it made an illegal move and showing it the information from the original prompt, from the line starting with "FEN: ". In my testing this works the majority of the time in fewer than 3 tries, and hallucinations happen much more often with GPT3.5. I encourage others to test this exact format and see if it fails in other chess games.

I will post my findings, with code, in about a week. I will demonstrate more detailed prompts to get the models to play better, and provide examples of games where a draw occurred against SF8. I have debated releasing this understanding of GPT models, which I have had a hunch about since December, as it could help bad actors use the models available today to do things that could be harmful. This type of method, depending on the topic, can be used to get even more general-purpose use out of the models currently available. That said, this should at least give some people a clearer understanding of what humans could be dealing with, in terms of pure encoded cognition, if more advanced models are trained and released.

I don't believe anyone at OAI knows about this exact phenomenon, as a recent paper on model debates increasing performance on some tasks included chess, and the tests were only done to move 14. Move 14 is where the models begin to wildly hallucinate about both the board state and the legality of moves. This would only likely be chosen as a cutoff if the researchers who authored the paper (no offense meant whatsoever) did not have a full understanding of how chess is processed by these models.

Note: With the prompt I've shown here, it does occur in some chats that the model's expressed understanding of the board state is erroneous, but the move it chooses is still decent. This in itself may be able to be improved. Link to the conversation in the 4th image: chat.openai.com/share/34b654… @karpathy @GaryMarcus

May 27, 2023 · 5:26 PM UTC
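
As a rough sketch before the full post: the ingredients named above (a line starting with "FEN: ", the move history, and the list of legal moves) can be assembled with python-chess as below. To be clear, the function names and prompt wording here are placeholders of mine, not the exact format from the screenshots, which is more detailed.

```python
# Rough sketch only: approximates the ingredients of the prompt (the
# "FEN: " line, the move history, and the list of legal moves) using
# python-chess. The exact wording in the screenshots is more detailed.
import chess

def build_prompt(board: chess.Board) -> str:
    # Recover the SAN history from the move stack (empty if the board
    # was set up directly from a FEN string rather than played out).
    history = chess.Board().variation_san(board.move_stack) if board.move_stack else "(none)"
    legal = ", ".join(board.san(m) for m in board.legal_moves)
    side = "White" if board.turn == chess.WHITE else "Black"
    return (
        f"FEN: {board.fen()}\n"
        f"Move history: {history}\n"
        f"Legal moves for {side}: {legal}\n"
        f"Pick the best move for {side} from the legal moves and explain why."
    )

def is_legal_san(board: chess.Board, san_move: str) -> bool:
    # Detects the illegal-move hallucinations described above. When this
    # returns False, re-send the prompt, including the "FEN: " line.
    try:
        board.parse_san(san_move)
        return True
    except ValueError:
        return False
```

The is_legal_san check is just the programmatic version of the correction step described above: on an illegal reply, show the model the "FEN: " line again and ask for another move.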

Extra context: The first board pertains to images 1 and 2 (at move 149). The second board pertains to image 4 (at move 41). @JeffLadish @random_walker @DrJimFan @ylecun
The full post will come in another week, as more work needs to be done. In the meantime, here are 4 more examples with the "base prompt", in which both GPT3.5 and GPT4 reason on games. If you look at GPT4's responses, it is clearly better than 3.5, but both are able to produce valid moves. Once again, the full method has more detailed, context-specific prompts.
Image 1: Black's turn, move 73. GPT3.5 conversation: chat.openai.com/share/4f0607… GPT4 conversation: chat.openai.com/share/7bdfc7…
Image 2: Black's turn, move 38. GPT3.5 conversation: chat.openai.com/share/312e98… GPT4 conversation: chat.openai.com/share/047ef4…
Image 3: White's turn, move 58. GPT3.5 conversation: chat.openai.com/share/72a398… GPT4 conversation: chat.openai.com/share/18e14b…
Image 4: White's turn, move 121. GPT3.5 conversation: chat.openai.com/share/a5e015… GPT4 conversation: chat.openai.com/share/08dcfd…
I have been asked by multiple people to see if it is possible to get GPT4 to play Tic Tac Toe with proper logic, and optimally. Below is the first example of this. This method is not stable yet, and I will include a full method once I tune it to not fail/hallucinate. The same theory of representations as in Chess applies: chat.openai.com/share/75758e… @davidad @GaryMarcus Getting gpt3.5-turbo and gpt4 to reason on Go will also be shown, after the work on Chess is complete.
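
The tuned Tic Tac Toe prompt is in the linked chat; purely as an illustration of the representation idea (explicit state plus enumerated legal moves), a hypothetical encoding could look like this. The layout and wording are my own guess, not the actual prompt.

```python
# Hypothetical Tic Tac Toe encoding in the same spirit as the chess
# prompt: explicit board state plus an enumerated list of legal moves.
# This is an illustration of the representation idea, not the real prompt.
def ttt_prompt(board: list[str], to_move: str) -> str:
    # board is 9 cells, e.g. ["X", ".", "O", ...]; "." marks an empty square.
    rows = "\n".join(" ".join(board[r * 3:r * 3 + 3]) for r in range(3))
    open_squares = [str(i + 1) for i, c in enumerate(board) if c == "."]
    return (
        f"Board (rows top to bottom, squares numbered 1-9):\n{rows}\n"
        f"Open squares: {', '.join(open_squares)}\n"
        f"It is {to_move}'s turn. Pick the best open square."
    )

print(ttt_prompt(["X", ".", "O", ".", "X", ".", ".", ".", "."], "O"))
```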
Replying to @kenshin9000_
Is it necessary to give it both the game history and the board state?
This can be made to work, but with a completely different prompt. For chess, it helps to have the history. It's also possible to get the models to reason on different board sizes, different piece names, and different rules. I will show several examples of that as an OOD test.
Replying to @kenshin9000_
Can I ask how you thought up this experiment? Namely, going from “I think there’s a logic core in there” to designing this scenario. (This is getting meta…it feels like I’m prompting GPT!)
Intuition on how GPT models encode information led me to this, which I will explain more of in the full results.
Replying to @kenshin9000_
How is the list of legal moves generated? Is it just all the legal moves, or a random selection, or something else?
Legal moves are given via prompts, and it gets to reason on what the best move is. Ever getting a draw against SF8 should not be possible unless there is true "encoded intelligence" and the existence of a "logic core" construct.
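
For context on the SF8 setup stated at the top of the thread (single core, 0.5 seconds per move, 16MB hash), a match harness along these lines can be built with python-chess. The engine path and the model-querying stub below are placeholders, not the actual harness.

```python
# Sketch of the Stockfish 8 side of the match at the settings stated in
# the thread: 1 thread, 16MB hash, 0.5 seconds per move. The engine path
# and get_model_move() are placeholders, not the actual harness.
import chess
import chess.engine

def get_model_move(board: chess.Board) -> str:
    # Placeholder: in the real loop this sends the prompt to GPT3.5/GPT4
    # and parses the SAN move from the reply. Stubbed with a legal move.
    return board.san(next(iter(board.legal_moves)))

board = chess.Board()
engine = chess.engine.SimpleEngine.popen_uci("./stockfish8")  # placeholder path
engine.configure({"Threads": 1, "Hash": 16})

while not board.is_game_over():
    if board.turn == chess.WHITE:
        board.push_san(get_model_move(board))  # the model plays White here
    else:
        result = engine.play(board, chess.engine.Limit(time=0.5))
        board.push(result.move)

engine.quit()
print(board.result())  # "1/2-1/2" is the draw outcome described above
```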
Replying to @kenshin9000_
This is very cool. One note is that I don't think most people that have experimented extensively with gpt4 (especially) doubt that it can reason with novel information. The key question is how well it does outside of the distribution of information it was trained on. After seeing enough chess games, you get a model for the distribution of chess games. As far as getting lost after a fixed number of moves, that seems more like a memory or prompting issue than a failure to reason (within distribution). Rather, the key question seems to be how well it can reason OUTSIDE of the learned distribution. If you give it a new game with slightly different pieces than chess and a different sized board, is it relatively inept? It seems to be. This doesn't make transformers useless ofc, but it substantially limits their potential.
Yes, it can play games with new rules and piece names (and different board sizes). I will show an example of that as well (with code, if I have time). I believe that being able to visualize, explain strategy by describing the board, and pick good moves, when the vast majority of possible chess games have never been played by all the humans and computers on Earth combined, points to a clear ability to reason out of distribution. Not only that, the performance will be very surprising to many, as it is considerably above what can be seen here. After Chess, Go is next.