I've been benchmarking a few LLMs on HumanEval, specifically on pass@1. My goal was both to reproduce some of the reported numbers independently and to get a better sense of how these models compare on code generation.
The attached table summarizes the results.
More details in 🧵
May 31, 2023 · 12:58 PM UTC
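Quick note on the setup: for the non-chat models the harness was roughly along these lines (a sketch, not my exact code — it assumes OpenAI's human-eval package and the 2023-era openai Python client; the stop sequences and max_tokens here are just illustrative). One greedy sample per problem at temperature=0, then human-eval's checker computes pass@1.

# rough pass@1 harness sketch (assumes: openai<1.0 client, openai/human-eval installed)
import openai
from human_eval.data import read_problems, write_jsonl

def complete(prompt: str) -> str:
    # one deterministic sample per problem -> pass@1 at temperature=0
    resp = openai.Completion.create(
        model="text-davinci-003",          # swap in the model under test
        prompt=prompt,
        temperature=0,
        max_tokens=512,
        stop=["\ndef ", "\nclass ", "\nif __name__", "\nprint("],  # crude end-of-function heuristic
    )
    return resp["choices"][0]["text"]

problems = read_problems()
samples = [
    {"task_id": task_id, "completion": complete(problem["prompt"])}
    for task_id, problem in problems.items()
]
write_jsonl("samples.jsonl", samples)
# then: evaluate_functional_correctness samples.jsonl  (script shipped with human-eval)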
I've been (yet again) surprised by how strong @OpenAI's GPT-4 is. In my experiments, it's even stronger than what is reported in the paper (~73% vs. ~67% for pass@1). It's possible that my prompt is slightly better or that the paper didn't sample at temperature=0.
Also, @OpenAI's text-davinci-003 is another very, very strong model. Not quite GPT-4, but with ~62% pass@1 it achieves a very comfortable 2nd place. What's extra nice about this model is that you can use it without the chat API, which made prompting a lot simpler.
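To illustrate the difference: with the completion endpoint the HumanEval prompt (signature + docstring) goes in verbatim and you take the raw continuation, while the chat endpoint needs a message wrapper and the reply has to be stripped back down to bare code. A sketch (the system prompt here is just an illustration, not my actual one):

import openai

prompt = 'def incr_list(l: list):\n    """Return the list with all elements incremented by 1."""\n'

# completion API: the HumanEval prompt goes in as-is, the raw continuation comes back
completion = openai.Completion.create(
    model="text-davinci-003", prompt=prompt, temperature=0, max_tokens=512
)["choices"][0]["text"]

# chat API: needs a message wrapper, and the reply then has to be cleaned up
# (markdown fences, restated signature, chatty preamble, ...)
chat = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    temperature=0,
    messages=[
        {"role": "system", "content": "Complete the given Python function. Return only code."},
        {"role": "user", "content": prompt},
    ],
)["choices"][0]["message"]["content"]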
I've also been very impressed by @AnthropicAI's claude-instant. It's very, very capable and handily beats GPT-3.5 (aka ChatGPT), at least on this benchmark (~54% vs. ~46%). For some reason claude itself did slightly worse (~51% vs. ~54%), even though I used the same prompt 🤷‍♂️
I've also included some of the smaller open-source models. I was able to run those locally, which is very cool. Obviously all of them are much smaller than the @OpenAI and @AnthropicAI models, so it's not really a fair comparison.
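Roughly what running one of them locally looks like (a sketch — assumes Hugging Face transformers and a CUDA GPU; the model id and prompt are just examples):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "replit/replit-code-v1-3b"   # example; same loop for the other small models
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, trust_remote_code=True
).to("cuda")

prompt = 'def fib(n: int) -> int:\n    """Return the n-th Fibonacci number."""\n'
inputs = tok(prompt, return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)  # greedy ~ temperature=0
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))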
LLaMA performs relatively poorly on code (as also reported in their paper). This might be a direct consequence of them under-sampling GitHub (see screenshot), but even compared to Codex 2.5b the performance is underwhelming (~10% vs. ~22%).
Finally, @Replit's 3b base model does quite well but underperforms relative to what they reported on Twitter (~16% vs. ~22%). It's possible that the quantization I did for running this locally caused a drop in performance.
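For reference, the quantized load was something along these lines (a sketch, not necessarily my exact config — 8-bit via transformers/bitsandbytes is just one example of the kind of quantization I mean):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "replit/replit-code-v1-3b",
    trust_remote_code=True,
    device_map="auto",     # needs accelerate
    load_in_8bit=True,     # int8 weights via bitsandbytes so it fits locally; a small accuracy hit is plausible
)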
Interesting: @amanrsanger notes that gpt-3.5-turbo performance is a lot better when using the completion (instead of chat) API via Azure. Makes sense to me; prompting via the chat API was quite tricky.