GPT4 and GPT3.5 can both track a chess game arbitrarily deep into the game if data is presented to them in a specific way. The moves produced are sometimes very high quality, and draws can be obtained at low rates against Stockfish 8 (with the help of an external program, coded by GPT4, that produces more detailed natural language descriptions than the ones shown in these examples; SF8 runs on a single core with 0.5 seconds to think and a 16MB hash).
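For context, here is a minimal sketch of the engine side of such a match, assuming python-chess and a local Stockfish 8 binary. The binary path and the GPT move function are placeholders, not my actual harness:

```python
# Minimal sketch, not my actual harness: Stockfish 8 configured as
# described above (single core, 16MB hash, 0.5 seconds per move).
import chess
import chess.engine

def get_model_move(board: chess.Board) -> chess.Move:
    # Placeholder for the GPT prompting step (see the correction loop
    # further down); returns the first legal move so the sketch runs.
    return next(iter(board.legal_moves))

board = chess.Board()
engine = chess.engine.SimpleEngine.popen_uci("./stockfish8")  # placeholder path
engine.configure({"Threads": 1, "Hash": 16})

while not board.is_game_over():
    if board.turn == chess.WHITE:
        board.push(get_model_move(board))  # GPT plays White in this sketch
    else:
        result = engine.play(board, chess.engine.Limit(time=0.5))
        board.push(result.move)

print(board.result())
engine.quit()
```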
The insight from this is that GPT-based language models appear to have a "logic core" that can be "activated" when you express information to them in certain exact representations.
There is some evidence that these representations are tailored not only to the topic at hand, but also to the way in which the model specifically learned certain logic/common sense, which is where its static "encoded intelligence" comes from. Some types of logic seem to be learned in a particular way and can only be accessed by presenting data to the model in a specific order and a specific format. This is likely for statistical reasons, i.e., how frequently certain formats appear in the training data.
When a game is presented in the exact format shown in these screenshots, both GPT3.5 and GPT4 can reason about it regardless of how deep into the game it is. This means it can be said with relatively high confidence that the models are reasoning on information -not- in their training data: the number of possible chess games is exponentially larger than the number of games that have ever been posted on the internet and thus could appear in the training data.
Since chess has specific rules and a complete-information state space, it is highly likely that true reasoning on novel information is happening when the model processes the chess games below. In the chat presented in the 3rd image, I regenerated the response 17 times to see if GPT3.5 would hallucinate about the game board, but every move was valid and decent. In other chats, hallucinations do occur, but they can be corrected by letting the model know that it made an illegal move and showing it the information from the original prompt, starting from the line that begins with "FEN: ". In my testing this works the majority of the time in fewer than 3 tries, and hallucinations happen much more often with GPT3.5. I encourage others to test this exact format and see if it fails in other chess games. I will post my findings, with code, in about a week. I will demonstrate more detailed prompts that get the models to play better, and provide examples of games where a draw occurred against SF8.
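To make the correction step concrete, here is a rough sketch using python-chess. The prompt wording is illustrative only (my actual prompts are more detailed), and query_model() is a hypothetical stand-in for whatever sends the prompt to the model and returns its move:

```python
# Sketch of the illegal-move correction loop described above. The model is
# asked for a move; if the move is illegal, the FEN line from the original
# prompt is shown again, as described in the text.
import chess

def get_valid_move(board: chess.Board, query_model, max_tries: int = 3):
    fen_line = f"FEN: {board.fen()}"
    side = "White" if board.turn == chess.WHITE else "Black"
    prompt = f"{fen_line}\nIt is {side} to move. Reply with one legal move in SAN."
    for _ in range(max_tries):
        san = query_model(prompt).strip()
        try:
            return board.parse_san(san)  # raises ValueError if illegal/unparseable
        except ValueError:
            prompt = (f"{san} is an illegal move. Here is the position again:\n"
                      f"{fen_line}\nReply with one legal move in SAN.")
    return None  # give up after max_tries
```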
I have debated releasing this understanding of GPT models, which I have had a hunch about since December, as it could potentially help bad actors use the models available today to do harmful things. Depending on the topic, this type of method can be used to get even more general-purpose use out of the models currently available. That said, this should at least give some people a clearer understanding of what humans could be dealing with, in terms of pure encoded cognition, if more advanced models are trained and released.
I don't believe anyone at OAI knows about this exact phenomenon: a recent paper on model debates increasing performance on some tasks included chess, and the tests were only run to move 14. Move 14 is where the models begin to wildly hallucinate about both the board state and the legality of moves. That cutoff would only likely be chosen if the researchers who authored the paper (no offense meant whatsoever) did not have a full understanding of how chess is processed by these models. Note: with the prompt I've shown here, it does occur in some chats that the understanding the model expresses of the board state is erroneous, but the move it chooses is still decent. This in itself may be possible to improve.
Link to the conversation in the 4th image:
chat.openai.com/share/34b654…
Extra context: the first board pertains to images 1 and 2 (at move 149). The second board pertains to image 4 (at move 41).
The full post will come in another week, as more work needs to be done. In the meantime, here are 4 more examples with the "base prompt", in which both GPT3.5 and GPT4 reason on games. If you look at GPT4's responses, it is clearly better than GPT3.5, but both are able to produce valid moves. Once again, the full method uses more detailed, context-specific prompts.
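To give a rough idea of what a more detailed natural language description can look like, here is an illustration built with python-chess. This is not my exact base prompt, only the general kind of piece-by-square listing it contains:

```python
# Illustrative only: turn a FEN into an explicit natural language listing
# of where every piece stands. Not my exact base prompt.
import chess

def describe_position(fen: str) -> str:
    board = chess.Board(fen)
    lines = [f"FEN: {fen}",
             f"{'White' if board.turn == chess.WHITE else 'Black'} to move."]
    for color, name in ((chess.WHITE, "White"), (chess.BLACK, "Black")):
        pieces = [f"{chess.piece_name(p.piece_type)} on {chess.square_name(sq)}"
                  for sq in chess.SQUARES
                  if (p := board.piece_at(sq)) and p.color == color]
        lines.append(f"{name} pieces: " + "; ".join(pieces))
    return "\n".join(lines)

print(describe_position(chess.STARTING_FEN))
```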
Image 1:
Black's turn. Move 73.
GPT3.5 conversation:
chat.openai.com/share/4f0607…
GPT4 conversation:
chat.openai.com/share/7bdfc7…
Image 2:
Black's turn. Move 38.
GPT3.5 conversation:
chat.openai.com/share/312e98…
GPT4 conversation:
chat.openai.com/share/047ef4…
Image 3:
White's turn. Move 58.
GPT3.5 conversation:
chat.openai.com/share/72a398…
GPT4 conversation:
chat.openai.com/share/18e14b…
Image 4:
White's turn. Move 121.
GPT3.5 conversation:
chat.openai.com/share/a5e015…
GPT4 conversation:
chat.openai.com/share/08dcfd…
I have been asked by multiple people to see if it is possible to get GPT4 to play Tic Tac Toe with proper logic, and optimally. Below is the first example of this. This method is not stable yet, and I will include a full method once I tune it to not fail/hallucinate.
The same theory of representations as in Chess applies:
chat.openai.com/share/75758e…
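For reference, here is a rough illustration of the kind of explicit representation I mean for Tic Tac Toe. This is not my actual prompt, which is still being tuned:

```python
# Illustrative only: serialize a tic-tac-toe position into an explicit,
# unambiguous textual representation, the same idea as the chess format.
def render_ttt(cells):
    """cells: list of 9 strings from {'X', 'O', '.'}, row-major order."""
    lines = ["Tic-tac-toe position (rows a-c, columns 1-3):"]
    for r, name in enumerate("abc"):
        lines.append(f"row {name}: " + " ".join(cells[r * 3:(r + 1) * 3]))
    empty = [f"{'abc'[i // 3]}{i % 3 + 1}" for i, c in enumerate(cells) if c == "."]
    lines.append("Empty squares: " + ", ".join(empty))
    return "\n".join(lines)

print(render_ttt(["X", ".", "O",
                  ".", "X", ".",
                  ".", ".", "O"]))
```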
Getting gpt-3.5-turbo and gpt4 to reason on Go will also be shown, after the work on chess is complete.