
[–]FullstackSensei 30 points31 points  (5 children)

It's interesting that Qwen 2.5 Coder 14B and 32B aren't affected until about 4 bits. On the one hand it's great to know a single 3090 can run 32B at 4 bits without (much) loss, but on the other, it makes me wonder how much better they could get (and by extension, how much more info those models can cram in) before they're "saturated".

[–]Steuern_Runter 3 points4 points  (0 children)

This surprised me as well, but I think it comes from the relatively short benchmark tasks. HumanEval is much better than those benchmarks with only multiple-choice questions, but its tasks are still small and generic compared to real-world coding tasks, where more restrictions are given and you would provide existing functions/classes as context that the model has to adapt to. In my experience, small quantized LLMs have a higher risk of drifting from their instructions as the context gets longer.

Also the benchmark tasks only contain very targeted information, there's nothing which is irrelevant and has to be ignored.

[–]Accomplished_Mode170 8 points9 points  (3 children)

[–]Accomplished_Mode170 5 points6 points  (0 children)

Paraphrase per Claude:

We actually have two ways to measure this 'saturation' point - when per-layer alphas hit 2 (as shown in WW/SETOL work), or when D/N ratios reach ~1000 (per recent precision scaling laws). These seem to point to similar phenomena from different angles.
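For anyone who wants to poke at those per-layer alphas themselves, the weightwatcher package computes them; below is a minimal sketch assuming its documented WeightWatcher(...).analyze() API, with the model pick just an example:

```python
# Sketch: estimate per-layer power-law exponents (alpha) with weightwatcher.
# Assumes the standard weightwatcher API; layers with alpha near 2 correspond
# to the "well-trained / saturated" regime mentioned above.
import weightwatcher as ww
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Coder-1.5B-Instruct")  # example model
watcher = ww.WeightWatcher(model=model)
details = watcher.analyze()            # per-layer metrics as a pandas DataFrame
print(details[["layer_id", "alpha"]])  # alpha ~ 2 suggests a layer is near "saturation"
```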

[–]FullstackSensei 4 points5 points  (1 child)

Can't thank you enough for both links!

[–]Accomplished_Mode170 4 points5 points  (0 children)

You are very welcome. Neat folks all around.

[–]TyraVex 26 points27 points  (2 children)

Qwen 2.5 32b coder instruct losing 2% between 8bpw and 2.5bpw is crazy

I'll try that on mmlu pro and see what happens

[–]shaman-warrior 6 points7 points  (0 children)

Seems sus

[–]DickMasterGeneral 2 points3 points  (0 children)

Report back when you do, would love to see the results of that test

[–]QuantuisBenignus 27 points28 points  (4 children)

<image>

I like visualizing these, so I did a quick and dirty OCR. Qwen is OK.

I need more VRAM for the 32B, although it is nicely orange even at space-saving 2.5 bits.

[–]randomfoo2 14 points15 points  (3 children)

I did a visualization as well; I used a line graph so it's a bit easier to tell where the dropoffs happen across the models. Basically, I think Q4+ still remains a good rule of thumb. There's almost no dropoff for Qwen2.5-Coder, probably not because it isn't losing accuracy, but because it's so strong that it doesn't matter.

<image>

This was an AI assisted graph of course:

  • Claude Sonnet 3.5 new is useless now for generating code; it has what seems to be a very small (1000?) hard token output limit, and it just keeps trying to rewrite the artifact and dying
  • Claude Sonnet 3.5 old was able to generate a decent graph but was missing data points. I did a separate query to get it to just generate a markdown table (if not, it defaults to an HTML table lol)
  • ChatGPT-4o really wants to use pytesseract to OCR the table, but with some prodding will use its native vision to generate a data table. Its native use of Python Interpreter and dataframes is way better for doing data analysis work. Claude does a better job with coloring, I specified colors grouped by family
  • On the local side, Qwen2.5-Coder-32B-Q6_K (what I had running) has no problem one-shotting it when the table is included as text (no native vision capabilities) - I'm running llama.cpp and OpenWebUI and it's able to run the code. I'm only generating at ~20 t/s for this, so it's actually quite a bit slower than the remote models, and revisions are a bit painful (I'll probably switch to 14B Q4_K_M as something that is good enough and should be >2X faster)

[–]randomfoo2 8 points9 points  (0 children)

BTW, for those that want to do their own dataviz, to save some tokens:

| Model/BPW | 8.0 | 6.5 | 6.0 | 5.5 | 5.0 | 4.5 | 4.0 | 3.5 | 3.0 | 2.5 |
|---|---|---|---|---|---|---|---|---|---|---|
| Phi-3.5-mini-instruct | 0.66279 | 0.65304 | 0.67499 | 0.64511 | 0.59877 | 0.59633 | 0.53597 | 0.33658 | 0.17499 | 0.00731 |
| Phi-3-medium-128k-instruct | 0.74999 | 0.74145 | 0.74999 | 0.73658 | 0.71645 | 0.71462 | 0.67133 | 0.65792 | 0.56280 | 0.29023 |
| Qwen2.5-32B-Instruct | 0.91920 | 0.90487 | 0.90853 | 0.90649 | 0.90487 | 0.91584 | 0.90487 | 0.90609 | 0.87316 | 0.80060 |
| Qwen2.5-14B-Instruct | 0.82072 | 0.84755 | 0.80487 | 0.81341 | 0.82926 | 0.82316 | 0.79694 | 0.78109 | 0.71645 | 0.61706 |
| Qwen2.5-7B-Instruct | 0.86158 | 0.88170 | 0.86219 | 0.84390 | 0.86036 | 0.83353 | 0.81828 | 0.75731 | 0.62560 | 0.44450 |
| Qwen2.5-3B-Instruct | 0.76158 | 0.75548 | 0.73658 | 0.73048 | 0.72499 | 0.69572 | 0.68474 | 0.63170 | 0.44450 | 0.13414 |
| Qwen2.5-1.5B-Instruct | 0.59959 | 0.56707 | 0.56910 | 0.54064 | 0.56503 | 0.56503 | 0.55487 | 0.30487 | 0.08129 | 0.03658 |
| Qwen2.5-Coder-32B-Instruct | 0.93495 | 0.93292 | 0.93902 | 0.93902 | 0.92682 | 0.93495 | 0.92682 | 0.92682 | 0.92479 | 0.91259 |
| Qwen2.5-Coder-14B-Instruct | 0.93292 | 0.92682 | 0.93292 | 0.92682 | 0.92682 | 0.93292 | 0.93292 | 0.92073 | 0.90243 | 0.81707 |
| Qwen2.5-Coder-7B-Instruct | 0.90487 | 0.90060 | 0.89084 | 0.89938 | 0.90243 | 0.90853 | 0.90182 | 0.86463 | 0.84146 | 0.75426 |
| Qwen2.5-Coder-3B-Instruct | 0.83942 | 0.84552 | 0.83942 | 0.84552 | 0.82113 | 0.80487 | 0.82316 | 0.79064 | 0.66259 | 0.47560 |
| Qwen2.5-Coder-1.5B-Instruct | 0.70934 | 0.68698 | 0.68292 | 0.70324 | 0.65853 | 0.65853 | 0.68495 | 0.52032 | 0.34349 | 0.17479 |
| Gemma-2-2b-it | 0.40060 | 0.41463 | 0.42438 | 0.42012 | 0.40548 | 0.42865 | 0.40670 | 0.40365 | 0.21341 | 0.11646 |
| Gemma-2-9b-it | 0.64389 | 0.64024 | 0.64634 | 0.64634 | 0.62255 | 0.62865 | 0.65182 | 0.65304 | 0.61585 | 0.46950 |
| Gemma-2-27b-it | 0.79268 | 0.78292 | 0.78658 | - | 0.78170 | 0.77317 | 0.78902 | 0.76829 | 0.76463 | - |
| Meta-Llama-3.1-8B-Instruct | 0.70670 | 0.70975 | 0.70792 | 0.68840 | 0.70182 | 0.68170 | 0.64694 | 0.59938 | 0.49511 | 0.25853 |
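If anyone wants a starting point for re-plotting this, here's a minimal matplotlib sketch; only two rows are hard-coded for brevity (values copied from the table above), extend the dict with the rest:

```python
# Sketch: plot HumanEval score vs. BPW for a couple of rows from the table above.
import matplotlib.pyplot as plt

bpw = [8.0, 6.5, 6.0, 5.5, 5.0, 4.5, 4.0, 3.5, 3.0, 2.5]
scores = {
    "Qwen2.5-Coder-32B-Instruct": [0.93495, 0.93292, 0.93902, 0.93902, 0.92682,
                                   0.93495, 0.92682, 0.92682, 0.92479, 0.91259],
    "Meta-Llama-3.1-8B-Instruct": [0.70670, 0.70975, 0.70792, 0.68840, 0.70182,
                                   0.68170, 0.64694, 0.59938, 0.49511, 0.25853],
}

for name, ys in scores.items():
    plt.plot(bpw, ys, marker="o", label=name)

plt.gca().invert_xaxis()  # read left-to-right as increasingly aggressive quantization
plt.xlabel("bits per weight (BPW)")
plt.ylabel("HumanEval score")
plt.legend()
plt.show()
```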

[–]Ok_Mine189[S] 4 points5 points  (0 children)

Thank you, I was planning to add such a chart today, but you saved me the time!

[–]QuantuisBenignus 2 points3 points  (0 children)

That's cool!

In my case no $/tok, I did it the old fashioned way with Mathematica.

10 min of my time to clean up the TextRecognize[] result and format the ArrayPlot[].

[–]tmvr 7 points8 points  (1 child)

I think this just illustrates how useless this benchmark has become for any quality assessment. There isn't even a need to overanalyse the data in the table; this alone is enough:

Qwen2.5-Coder-3B-Instruct scores >0.8 all the way down to 4.0 BPW

There is no way this quality measurement has any relation to real world usage.

[–]Ok_Mine189[S] 2 points3 points  (0 children)

Yeah, it's quite possible that the newest models either simply became too smart for the difficulty level this eval provides OR (more likely, in my opinion) its data was included in the training datasets. Just look at Gemma2 2B, it's holding the same score even at 3.5 bpw. That's just sus as hell, right?

[–]ortegaalfredoAlpaca 8 points9 points  (2 children)

I can easily see a difference between 4-bit and 8-bit AWQ, but EXL2 is much better at lower quants. Seems that the sweet spot is at Qwen-Coder-14B and 4 bpw. That means you can run a GPT-4-level coder on a 12GB GPU.
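As a rough sanity check on the 12GB claim, some napkin math (a sketch; the parameter count and overhead factor below are assumptions, not measurements):

```python
# Back-of-envelope VRAM estimate for a 4 bpw EXL2 quant of a 14B model.
params_b = 14.7    # assumed Qwen2.5-Coder-14B parameter count, in billions
bpw = 4.0          # bits per weight of the quant
overhead = 1.25    # assumed ~25% extra for KV cache, activations and buffers

weights_gb = params_b * bpw / 8      # GB needed just for the weights
total_gb = weights_gb * overhead

print(f"weights: {weights_gb:.1f} GB, with overhead: ~{total_gb:.1f} GB")
# -> weights: 7.4 GB, with overhead: ~9.2 GB, which fits in 12 GB
```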

[–]Jellonling 6 points7 points  (1 child)

Yeah, I fully switched to exl2. 4bpw exl2 quants are super fast and extremely good for their size. And for larger models, even 3bpw performs really well. I still see so many people using GGUFs with offloading instead of just using a smaller exl2 quant, which is usually better in quality and leagues better in speed.

[–]Ok_Mine189[S] 5 points6 points  (0 children)

Makes you wonder how much of HumanEval has leaked into training datasets by now... Just look at Gemma 2 2B - seemingly unimpacted even at 3.5 bpw!

[–]CheatCodesOfLife 4 points5 points  (0 children)

So it seems like I could really swap from 8bpw to 6bpw for 72B models, free up a GPU, and be totally unaffected

[–]a_slay_nub 6 points7 points  (0 children)

I really wish we would expand HumanEval to more than 164 problems...

[–]ResearchCrafty1804 2 points3 points  (0 children)

Qwen Coder models seem to be the most resilient in terms of keeping their performance under quantisation

[–]Steuern_Runter 2 points3 points  (1 child)

What temperature did you set for the Qwen Coder models?

[–]Ok_Mine189[S] 2 points3 points  (0 children)

I used these for all the models: temp 0.01, top_k 1, top_p 0.01.
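For anyone wanting to reproduce that, here's a minimal sketch of near-greedy sampling with ExLlamaV2, assuming the ExLlamaV2Sampler.Settings API; adapt it to however you drive the generator:

```python
# Minimal sketch: near-deterministic sampling settings in ExLlamaV2
# (assumes the ExLlamaV2Sampler.Settings API; adjust to your generator setup).
from exllamav2.generator import ExLlamaV2Sampler

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.01  # effectively deterministic
settings.top_k = 1           # always pick the single most likely token
settings.top_p = 0.01        # redundant with top_k = 1, but matches the benchmark config

# The settings object is then passed to the generator, e.g.
# generator.generate_simple(prompt, settings, max_new_tokens)
```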

[–]ethertype 1 point2 points  (1 child)

Thank you very much for this, u/Ok_Mine189 . I noticed this:

<image>

Is the resolution of the scores a lot higher than the tests actually justify, or does Q6 (and Q5.5) actually score better than Q8?

[–]Ok_Mine189[S] 1 point2 points  (0 children)

Well, even if you round it to the 3rd decimal place, it's 93.5% vs 93.9%. Not that big of a difference. And FYI, these scores are an average of 3 to 5 runs for each quant (to minimize the variance). My wild guess is that stronger quantization weakens the built-in inhibitions/guardrails, making such models very slightly "smarter" (or at least better scoring) at lower quants. But it's just my opinion, nothing I can prove anyway.

[–]schlammsuhler 1 point2 points  (0 children)

Could someone do a graph of score per GB?

[–]dahara111 1 point2 points  (0 children)

What about hqq?

[–]Equivalent_Bat_3941 1 point2 points  (0 children)

Nice to know

[–]AutomataManifold 1 point2 points  (2 children)

It'd be interesting to include the VRAM use, so we could make comparisons like "with 24GB of VRAM, which model/quant combinations fit and have the highest performance?"
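A rough sketch of what that comparison could look like; the parameter counts and the overhead factor are assumptions, and the scores are copied from the table above:

```python
# Sketch: pick the highest-scoring (model, bpw) combo that fits a VRAM budget.
BUDGET_GB = 24.0
OVERHEAD = 1.2  # assumed headroom for KV cache, activations and buffers

params_b = {"Qwen2.5-Coder-32B-Instruct": 32.8, "Qwen2.5-Coder-14B-Instruct": 14.7}
scores = {  # (model, bpw) -> HumanEval score, taken from the table above
    ("Qwen2.5-Coder-32B-Instruct", 4.0): 0.92682,
    ("Qwen2.5-Coder-32B-Instruct", 5.0): 0.92682,
    ("Qwen2.5-Coder-14B-Instruct", 8.0): 0.93292,
}

def est_vram_gb(model: str, bpw: float) -> float:
    """Approximate VRAM use: weights only, plus a flat overhead factor."""
    return params_b[model] * bpw / 8 * OVERHEAD

fitting = {k: v for k, v in scores.items() if est_vram_gb(*k) <= BUDGET_GB}
best = max(fitting, key=fitting.get)
print(best, fitting[best], f"~{est_vram_gb(*best):.1f} GB")
```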

[–]Ok_Mine189[S] 0 points1 point  (1 child)

I'll try to provide such data, but it might take me until the next weekend to find the time for it.

[–]AutomataManifold 1 point2 points  (0 children)

No worries, I was just thinking about what data would answer 50% of the "what model should I use" posts.

[–]ramzeez88 1 point2 points  (5 children)

Is there a GUI to work with exl2? As easy as possible. And a server?

[–]Ok_Mine189[S] 1 point2 points  (4 children)

You can use EXUI: turboderp/exui: Web UI for ExLlamaV2. It's made by the author of ExLlamaV2 himself ;)

[–]ramzeez88 1 point2 points  (3 children)

Thanks, can it work as a server too?

[–]Ok_Mine189[S] 1 point2 points  (2 children)

You need to install the exllamav2 wheels, and then EXUI will be able to automatically load the selected model for inference. If you meant an API server, then afaik TabbyAPI is the official API server for ExLlamaV2.
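For illustration, TabbyAPI speaks an OpenAI-compatible HTTP API, so a client call could look roughly like this; the port, API key header and model name are placeholders that depend on your config:

```python
# Minimal sketch of querying an OpenAI-compatible endpoint such as TabbyAPI.
# Port, API key and model name are assumptions; check your own TabbyAPI config.
import requests

resp = requests.post(
    "http://localhost:5000/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "Qwen2.5-Coder-32B-Instruct-exl2-4.0bpw",  # hypothetical model name
        "messages": [{"role": "user", "content": "Write a Python function that reverses a string."}],
        "temperature": 0.01,
        "top_p": 0.01,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```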

[–]ramzeez88 1 point2 points  (1 child)

Thanks mate :)

[–]Ok_Mine189[S] 0 points1 point  (0 children)

You're welcome. Enjoy!

[–]ekaknr 1 point2 points  (3 children)

Can anyone knowledgeable (about these things) compare MLX against exllama (which I've never tried)? Are MLX 4-bit models (the only other option being 8-bit) any good, or do they compare similarly? Is there a difference in the quantization methods?

[–]Ok_Mine189[S] 1 point2 points  (2 children)

I wish I could help you, friend; unfortunately I don't own Apple silicon. Perhaps some good soul will oblige you though. Fingers crossed!

[–]ekaknr 1 point2 points  (1 child)

No worries, appreciate your good intent! Does ExLlamaV2 even work on Apple Silicon? I'm guessing not, based on what I can see on the GitHub page.

[–]Ok_Mine189[S] 1 point2 points  (0 children)

You're right, it looks like it's Linux & Windows only.

[–]Such_Advantage_6949 1 point2 points  (1 child)

Thank you very much. It is really great work!

[–]Ok_Mine189[S] 0 points1 point  (0 children)

Thanks, I did put some time and effort into it :)

[–]FullOf_Bad_Ideas 1 point2 points  (0 children)

I think it's interesting how Qwen 2.5 32B Instruct loses a lot of quality at 2.5bpw but 32B Coder doesn't.