I've been benchmarking a few LLMs on HumanEval, specifically on pass@1. My goal was both to reproduce some of the reported numbers independently and to get a better sense of how these models compare on code generation. The attached table summarizes the results. More details in 🧵

May 31, 2023 · 12:58 PM UTC
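For reference, the eval loop itself is small if you build on OpenAI's human-eval harness (github.com/openai/human-eval). A minimal sketch, where generate_one_completion is a placeholder for whatever model call you're benchmarking (not part of the harness); with a single greedy sample per task, pass@1 is just the fraction of completions that pass the tests:

    from human_eval.data import read_problems, write_jsonl

    def generate_one_completion(prompt: str) -> str:
        # placeholder: swap in the actual model call (see the API sketches further down)
        raise NotImplementedError

    problems = read_problems()
    samples = [
        dict(task_id=task_id,
             completion=generate_one_completion(problems[task_id]["prompt"]))
        for task_id in problems
    ]
    write_jsonl("samples.jsonl", samples)
    # then score with the harness CLI: evaluate_functional_correctness samples.jsonl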

I've been (yet again) surprised by how strong @OpenAI's GPT-4 is. In my experiments, it's even stronger than what is reported in the paper (~73% vs. ~67% for pass@1). It's possible that my prompt is slightly better or that the paper didn't sample at temperature=0.
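Concretely, the GPT-4 call looks roughly like this (the system prompt below is illustrative, not my exact prompt):

    import openai

    def complete_with_gpt4(prompt: str) -> str:
        # temperature=0 means greedy decoding, so one sample per task is enough for pass@1
        response = openai.ChatCompletion.create(
            model="gpt-4",
            temperature=0,
            messages=[
                {"role": "system", "content": "Complete the given Python function. Reply with code only."},
                {"role": "user", "content": prompt},
            ],
        )
        return response["choices"][0]["message"]["content"]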
Also, @OpenAI's text-davinci-003 is another very, very strong model. Not quite GPT-4, but with ~62% pass@1 it achieves a very comfortable 2nd place. What's extra nice about this model is that you can use it without the chat API, which made prompting a lot simpler.
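With the completion API the HumanEval prompt goes in verbatim and the model simply continues the function body. Roughly (the stop sequences are an example, not necessarily what I used):

    import openai

    def complete_with_davinci(prompt: str) -> str:
        response = openai.Completion.create(
            model="text-davinci-003",
            prompt=prompt,        # raw HumanEval prompt, no chat message wrapping
            temperature=0,
            max_tokens=512,
            stop=["\ndef ", "\nclass ", "\nif __name__"],  # cut off once the function is done
        )
        return response["choices"][0]["text"]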
I've also been very impressed by @AnthropicAI's claude-instant. It's very, very capable and handily beats GPT-3.5 (aka ChatGPT), at least on this benchmark (~54% vs. ~46%). For some reason claude itself did slightly worse (~51% vs. ~54%), even though I used the same prompt 🤷‍♂️
I've also included some of the smaller open-source models. I was able to run those locally, which is very cool. Obviously all of them are much smaller than the @OpenAI and @AnthropicAI models, so it's not really a fair comparison.
LLaMA performs relatively poorly on code (as is also reported in their paper). This might be a direct consequence of them under-sampling GitHub (see screenshot), but even compared to Codex 2.5B, the performance is underwhelming (~10% vs. ~22%).
Finally, @Replit's 3b base model does quite well but underperforms relative to what they reported on Twitter (~16% vs. ~22%). It's possible that the quantization I did to run it locally caused a drop in performance.
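Roughly how a quantized local run of this model looks with transformers + bitsandbytes; the 8-bit setting here is an assumption for illustration, not necessarily the exact config behind the number above:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("replit/replit-code-v1-3b", trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        "replit/replit-code-v1-3b",
        trust_remote_code=True,
        load_in_8bit=True,    # quantized weights fit on a small GPU, but can cost a few points
        device_map="auto",
    )

    inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)  # greedy, matching temperature=0
    print(tokenizer.decode(out[0], skip_special_tokens=True))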
Replying to @Replit
Replit-code-v1-3b & replit-finetuned-v1-3b were trained entirely on code and were meant for single-line code completion. We didn’t expect either to perform so well on HumanEval, but they did. replit-finetuned-v1-3b outperformed all OSS code models, even those 5x its size.
Interesting: @amanrsanger notes that gpt-3.5-turbo performance is a lot better when using the completion (instead of chat) API via Azure. Makes sense to me; prompting via the chat API was quite tricky.
gpt-3.5-turbo is criminally underrated at coding. When using it with Azure's completion endpoint instead of OpenAI's chat endpoint, you can get a jump in HumanEval performance from <50% to 74%! This blows claude v1.3 out of the water, which sits just below 60% perf. [1]
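For anyone who wants to try the Azure route, the completion-style call looks roughly like this with the pre-1.0 openai Python client (resource name, deployment name, and API version are placeholders, not values from this thread):

    import openai

    openai.api_type = "azure"
    openai.api_base = "https://<your-resource>.openai.azure.com/"
    openai.api_version = "2023-05-15"
    openai.api_key = "<azure-api-key>"

    prompt = 'def incr_list(l: list):\n    """Return list with elements incremented by 1."""\n'
    response = openai.Completion.create(
        engine="<your-gpt-35-turbo-deployment>",  # Azure addresses deployments, not model names
        prompt=prompt,    # plain completion-style prompt, no chat messages
        temperature=0,
        max_tokens=256,
    )
    print(response["choices"][0]["text"])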
Replying to @mplappert
Interesting, I got much better performance with gpt-4 (around 85%). Someone else linked the code below, so feel free to reproduce.
Very cool & super impressive! My approach was to tune the prompt until I could roughly match what people reported in their papers / Twitter threads, but it's impressive that you can push GPT-4 performance by so much.
Replying to @mplappert
Also, for reference, you might like this by @amanrsanger. Here GPT-4 performs even better with another prompt:
GPT-4 is waaay better at programming than given credit for. HumanEval is a benchmark of Python programming problems. With some prompt engineering, GPT-4 scores ~85%, destroying Codex's 29% from just 2 years ago and performing much better than OpenAI's publicized accuracy.
Very cool, thanks for sharing! Impressive that this can be pushed so much further with the right prompt.