The new GPT model, gpt-3.5-turbo-instruct, can play chess around 1800 Elo. I had previously reported that GPT cannot play chess, but it appears this was just the RLHF'd chat models. The pure completion model succeeds. nitter.net/GrantSlatton/sta… See game & thoughts below:
GPT4 cannot play chess. Expected to get a bot up to ~1000 Elo and totally failed -- but negative results are still interesting! Even given verbose prompting, board-state descriptions, CoT, etc., it loses badly to a depth=1 engine. Code & notes below.

Sep 18, 2023 · 11:27 PM UTC

The new model readily beats Stockfish Level 4 (1700) and still loses respectably to Level 5 (2000). Never attempted illegal moves. Used a clever opening sacrifice and an incredibly cheeky pawn & king checkmate, allowing the opponent to uselessly promote. lichess.org/K6Q0Lqda
I used this PGN style prompt to mimic a grandmaster game. The highlighting is a bit wrong. GPT made all its own moves, I input Stockfish moves manually. h/t to @zswitten for this prompt style
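The exact prompt isn't reproduced here, but the idea is to frame the game as a standard PGN transcript of a grandmaster match so the completion model continues the move list. A minimal sketch of that style of prompt builder, assuming made-up header names and ratings for illustration (the real header values and prompt wording may differ):

```python
# Hypothetical sketch of a PGN-style prompt for a completion model.
# Header names/Elos here are illustrative, not the thread's actual prompt.

def build_pgn_prompt(moves):
    """Frame the game as a grandmaster PGN transcript, ending mid-line
    so the model's natural completion is the next move."""
    header = (
        '[Event "FIDE World Championship"]\n'
        '[White "Magnus Carlsen"]\n'
        '[Black "Garry Kasparov"]\n'
        '[WhiteElo "2882"]\n'
        '[BlackElo "2851"]\n\n'
    )
    parts = []
    for i in range(0, len(moves), 2):
        parts.append(f"{i // 2 + 1}. " + " ".join(moves[i:i + 2]))
    move_text = " ".join(parts)
    # If it's White to move, append the next move number as a cue.
    if len(moves) % 2 == 0:
        move_text += f" {len(moves) // 2 + 1}."
    return header + move_text

print(build_pgn_prompt(["e4", "e5", "Nf3"]))
```

The trailing move number (or half-finished move pair) is the key trick: it leaves the model nothing sensible to emit except the next move in SAN.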
Replying to @GrantSlatton
You don’t want verbose prompting, board-state descriptions, or CoT. You want to stay close to a pure list of moves.
In conclusion, I now totally believe @BorisMPower's claim about 1800 Elo for GPT4. I think the RLHF'd chat models do not do this well, but the base/instruct models seem to do much better.
Replying to @francoisfleuret
I don’t think anything has been published, unfortunately. Elo is around 1800.
Interestingly, in the games it lost against higher rated Stockfish bots, even after GPT made a bad move, it was still able to *predict* Stockfish's move that takes advantage of GPT's blunder. So you could probably get a >2000 Elo GPT by strapping on a tiny bit of search.
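The thread only suggests this, but "a tiny bit of search" could be as little as one-ply minimax: for each of our candidate moves, ask the model to score the position after the opponent's best reply, and pick the move whose worst case is best. A sketch under stated assumptions (`legal_moves`, `apply_move`, and `score_position` are hypothetical stand-ins for a chess library plus a model-based evaluator; the toy demo below stubs them with integers):

```python
# One-ply lookahead on top of a move-predicting model (sketch).
# legal_moves / apply_move / score_position are hypothetical stand-ins.

def choose_move(position, legal_moves, apply_move, score_position):
    """Pick the move whose worst-case opponent reply (as judged by the
    evaluator) leaves us best off."""
    best_move, best_score = None, float("-inf")
    for move in legal_moves(position):
        after = apply_move(position, move)
        replies = legal_moves(after)
        if replies:
            # Assume the opponent picks the reply worst for us.
            worst = min(score_position(apply_move(after, r)) for r in replies)
        else:
            worst = score_position(after)  # terminal position
        if worst > best_score:
            best_move, best_score = move, worst
    return best_move

# Toy demonstration: positions are integers, a move adds its value,
# and the "model" scores a position by its value.
moves_of = lambda p: [1, 2, 3] if p < 10 else []
apply = lambda p, m: p + m
print(choose_move(0, moves_of, apply, lambda p: p))
```

This matches the observation in the tweet: the model can already predict the punishing reply, so using that prediction inside even a shallow search should recover many blunders.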
Fast reproduction:
Amazing! I was able to reproduce this just now: gpt-3.5-turbo-instruct, prompted with PGN, defeated Stockfish Level 4 (1700?) on LiChess (lichess.org/D39lnanQ). Here are the prompts/code I used: github.com/jordancurve/gpt-v…
Cool data
gpt-3.5-turbo-instruct can play chess at ~1800 Elo. I wrote some code and had it play 150 games against Stockfish and 30 against gpt-4. It's very good! 99.7% of its 8000 moves were legal, with the longest game going 147 moves. Repo link below.
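An Elo estimate like this comes from the average score against rated opponents via the standard Elo expected-score formula, which can be inverted to get a rating difference. A minimal sketch (the 64% score below is an illustrative number, not a figure from the thread):

```python
import math

def expected_score(rating_a, rating_b):
    """Standard Elo expected score for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def rating_diff_from_score(score):
    """Invert the Elo formula: rating difference implied by an average
    score per game (wins + 0.5 * draws) against one opponent."""
    return -400 * math.log10(1.0 / score - 1.0)

# e.g. scoring 64% against a 1700-rated bot implies roughly 1700 + 100
print(round(1700 + rating_diff_from_score(0.64)))  # → 1800
```

With only manual games the confidence interval is wide, which is why the 150-game automated run above is much more convincing than a handful of hand-relayed matches.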
Plot thickens
So why is "gpt-3.5-turbo-instruct" so much better than GPT-4 at chess? Probably because 6 months ago, someone checked in a chess eval in OpenAI's evals repo: github.com/openai/evals/comm…
Replying to @GrantSlatton
Given all the fragility with out-of-sample performance, I question how generalizable this is. But chess seems like a great place to explore it. When comparing bots, they typically test many games (maybe 100 or 1000). But hard to do when it's so manual.
See
This really is incredible. I was doubtful of its capabilities in out-of-distribution positions, so I tried to make awkward looking moves, dropping pieces to create chaos. GPT-3.5-turbo-instruct maintained the board state the whole way through, not making a single illegal move 🤯 lichess.org/pLnSzmMt (Granted, the previous game I played normally, and GPT-3.5-turbo-instruct was holding its own until it finally did make an illegal move in the late middle-game.) @GothamChess this is the real deal
Replying to @GrantSlatton
Do we know how much chess it was trained on?
We do not, and it would be very interesting to know. I would not be surprised by "every professional game ever played"; that database is only a few GB.