The new GPT model, gpt-3.5-turbo-instruct, can play chess around 1800 Elo. I had previously reported that GPT cannot play chess, but it appears this was just the RLHF'd chat models. The pure completion model succeeds. nitter.net/GrantSlatton/sta… See game & thoughts below:
GPT4 cannot play chess. Expected to get a bot up to ~1000 Elo and totally failed -- but negative results are still interesting! Even given verbose prompting, board-state descriptions, CoT, etc., it loses badly to a depth=1 engine. Code & notes below.

Sep 18, 2023 · 11:27 PM UTC

The new model readily beats Stockfish Level 4 (1700) and still loses respectably to Level 5 (2000). Never attempted illegal moves. Used a clever opening sacrifice and an incredibly cheeky pawn & king checkmate, allowing the opponent to uselessly promote. lichess.org/K6Q0Lqda
I used this PGN style prompt to mimic a grandmaster game. The highlighting is a bit wrong. GPT made all its own moves, I input Stockfish moves manually. h/t to @zswitten for this prompt style
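The exact prompt isn't reproduced here, but the idea is to frame the game as a standard PGN transcript of a grandmaster match so the completion model continues the move list. A minimal sketch of that style of prompt builder, assuming made-up header names and ratings for illustration (the real header values and prompt wording may differ):

```python
# Hypothetical sketch of a PGN-style prompt for a completion model.
# Header names/Elos here are illustrative, not the thread's actual prompt.

def build_pgn_prompt(moves):
    """Frame the game as a grandmaster PGN transcript, ending mid-line
    so the model's natural completion is the next move."""
    header = (
        '[Event "FIDE World Championship"]\n'
        '[White "Magnus Carlsen"]\n'
        '[Black "Garry Kasparov"]\n'
        '[WhiteElo "2882"]\n'
        '[BlackElo "2851"]\n\n'
    )
    parts = []
    for i in range(0, len(moves), 2):
        parts.append(f"{i // 2 + 1}. " + " ".join(moves[i:i + 2]))
    move_text = " ".join(parts)
    # If it's White to move, append the next move number as a cue.
    if len(moves) % 2 == 0:
        move_text += f" {len(moves) // 2 + 1}."
    return header + move_text

print(build_pgn_prompt(["e4", "e5", "Nf3"]))
```

The trailing move number (or half-finished move pair) is the key trick: it leaves the model nothing sensible to emit except the next move in SAN.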
Replying to @GrantSlatton
You don’t want verbose prompting, board-state descriptions, or CoT. You want to stay close to a pure list of moves.
In conclusion, I now totally believe @BorisMPower's claim about 1800 Elo for GPT4. I think the RLHF'd chat models do not do this well, but the base/instruct models seem to do much better.
Replying to @francoisfleuret
I don’t think anything has been published, unfortunately. Elo is around 1800.
Interestingly, in the games it lost against higher rated Stockfish bots, even after GPT made a bad move, it was still able to *predict* Stockfish's move that takes advantage of GPT's blunder. So you could probably get a >2000 Elo GPT by strapping on a tiny bit of search.
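The thread only suggests this, but "a tiny bit of search" could be as little as one-ply minimax: for each of our candidate moves, ask the model to score the position after the opponent's best reply, and pick the move whose worst case is best. A sketch under stated assumptions (`legal_moves`, `apply_move`, and `score_position` are hypothetical stand-ins for a chess library plus a model-based evaluator; the toy demo below stubs them with integers):

```python
# One-ply lookahead on top of a move-predicting model (sketch).
# legal_moves / apply_move / score_position are hypothetical stand-ins.

def choose_move(position, legal_moves, apply_move, score_position):
    """Pick the move whose worst-case opponent reply (as judged by the
    evaluator) leaves us best off."""
    best_move, best_score = None, float("-inf")
    for move in legal_moves(position):
        after = apply_move(position, move)
        replies = legal_moves(after)
        if replies:
            # Assume the opponent picks the reply worst for us.
            worst = min(score_position(apply_move(after, r)) for r in replies)
        else:
            worst = score_position(after)  # terminal position
        if worst > best_score:
            best_move, best_score = move, worst
    return best_move

# Toy demonstration: positions are integers, a move adds its value,
# and the "model" scores a position by its value.
moves_of = lambda p: [1, 2, 3] if p < 10 else []
apply = lambda p, m: p + m
print(choose_move(0, moves_of, apply, lambda p: p))
```

This matches the observation in the tweet: the model can already predict the punishing reply, so using that prediction inside even a shallow search should recover many blunders.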
Fast reproduction:
Amazing! I was able to reproduce this just now: gpt-3.5-turbo-instruct, prompted with PGN, defeated Stockfish Level 4 (1700?) on LiChess (lichess.org/D39lnanQ). Here are the prompts/code I used: github.com/jordancurve/gpt-v…
Cool data
gpt-3.5-turbo-instruct can play chess at ~1800 Elo. I wrote some code and had it play 150 games against Stockfish and 30 against gpt-4. It's very good! 99.7% of its 8000 moves were legal, with the longest game going 147 moves. Repo link below.
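An Elo estimate like this comes from the average score against rated opponents via the standard Elo expected-score formula, which can be inverted to get a rating difference. A minimal sketch (the 64% score below is an illustrative number, not a figure from the thread):

```python
import math

def expected_score(rating_a, rating_b):
    """Standard Elo expected score for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def rating_diff_from_score(score):
    """Invert the Elo formula: rating difference implied by an average
    score per game (wins + 0.5 * draws) against one opponent."""
    return -400 * math.log10(1.0 / score - 1.0)

# e.g. scoring 64% against a 1700-rated bot implies roughly 1700 + 100
print(round(1700 + rating_diff_from_score(0.64)))  # → 1800
```

With only manual games the confidence interval is wide, which is why the 150-game automated run above is much more convincing than a handful of hand-relayed matches.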
Plot thickens
So why is "gpt-3.5-turbo-instruct" so much better than GPT-4 at chess? Probably because 6 months ago, someone checked in a chess eval in OpenAI's evals repo: github.com/openai/evals/comm…
Replying to @GrantSlatton
Given all the fragility with out-of-sample performance, I question how generalizable this is. But chess seems like a great place to explore it. When comparing bots, they typically test many games (maybe 100 or 1000). But hard to do when it's so manual.
See
This really is incredible. I was doubtful of its capabilities in out-of-distribution positions, so I tried to make awkward looking moves, dropping pieces to create chaos. GPT-3.5-turbo-instruct maintained the board state the whole way through, not making a single illegal move 🤯 lichess.org/pLnSzmMt (Granted, the previous game I played normally, and GPT-3.5-turbo-instruct was holding its own until it finally did make an illegal move in the late middle-game.) @GothamChess this is the real deal
Replying to @GrantSlatton
Do we know how much chess it was trained on?
We do not, and it would be very interesting to know. I would not be surprised by "every professional game ever played"; that database is only a few GB.