
[–]FullstackSensei 30 points31 points  (5 children)

It's interesting that Qwen 2.5 Coder 14B and 32B aren't affected until about 4 bits. On the one hand it's great to know a single 3090 can run 32B at 4 bits without (much) loss, but on the other, it makes me wonder how much better they could get (and by extension, how much more info those models can cram in) before they're "saturated".

[–]Steuern_Runter 3 points4 points  (0 children)

This surprised me as well, but I think it comes from the relatively short benchmark tasks. HumanEval is much better than those benchmarks with only multiple-choice questions, but its tasks are still small and generic compared to real-world coding tasks, where more restrictions are given and you would provide existing functions/classes as context that the model has to adapt to. In my experience, small quantized LLMs have a higher risk of drifting from their instructions as the context gets longer.

Also the benchmark tasks only contain very targeted information, there's nothing which is irrelevant and has to be ignored.

[–]Accomplished_Mode170 8 points9 points  (3 children)

[–]Accomplished_Mode170 5 points6 points  (0 children)

Paraphrase per Claude:

We actually have two ways to measure this 'saturation' point - when per-layer alphas hit 2 (as shown in WW/SETOL work), or when D/N ratios reach ~1000 (per recent precision scaling laws). These seem to point to similar phenomena from different angles.
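For anyone who wants to poke at those per-layer alphas themselves, the weightwatcher package computes them; below is a minimal sketch assuming its documented WeightWatcher(...).analyze() API, with the model pick just an example:

```python
# Sketch: estimate per-layer power-law exponents (alpha) with weightwatcher.
# Assumes the standard weightwatcher API; layers with alpha near 2 correspond
# to the "well-trained / saturated" regime mentioned above.
import weightwatcher as ww
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Coder-1.5B-Instruct")  # example model
watcher = ww.WeightWatcher(model=model)
details = watcher.analyze()            # per-layer metrics as a pandas DataFrame
print(details[["layer_id", "alpha"]])  # alpha ~ 2 suggests a layer is near "saturation"
```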

[–]FullstackSensei 4 points5 points  (1 child)

Can't thank you enough for both links!

[–]Accomplished_Mode170 4 points5 points  (0 children)

You are very welcome. Neat folks all around.

[–]TyraVex 26 points27 points  (2 children)

Qwen 2.5 32b coder instruct losing 2% between 8bpw and 2.5bpw is crazy

I'll try that on mmlu pro and see what happens

[–]shaman-warrior 6 points7 points  (0 children)

Seems sus

[–]DickMasterGeneral 2 points3 points  (0 children)

Report back when you do, would love to see the results of that test

[–]QuantuisBenignus 27 points28 points  (4 children)

<image>

I like visualizing these, so I did a quick and dirty OCR. Qwen is OK.

I need more VRAM for the 32B, although it is nicely orange even at space-saving 2.5 bits.

[–]randomfoo2 14 points15 points  (3 children)

I did a visualization as well; I used a line graph so it's a bit easier to tell where the dropoffs happen across the models. Basically, I think Q4+ still remains a good rule of thumb. There's almost no dropoff for Qwen2.5-Coder, probably not because it isn't losing accuracy, but because it's so strong that it doesn't matter.

<image>

This was an AI assisted graph of course:

  • Claude Sonnet 3.5 new is useless now for generating code; it has what seems to be a very small (1000?) hard token output limit, and it just keeps trying to rewrite the artifact and dying
  • Claude Sonnet 3.5 old was able to generate a decent graph but was missing data points. I did a separate query to get it to just generate a markdown table (if not, it defaults to an HTML table lol)
  • ChatGPT-4o really wants to use pytesseract to OCR the table, but with some prodding will use its native vision to generate a data table. Its native use of Python Interpreter and dataframes is way better for doing data analysis work. Claude does a better job with coloring, I specified colors grouped by family
  • On the local side, Qwen2.5-Coder-32B-Q6_K (what I had running) has no problem one-shotting it when the table is included as text (no native vision capabilities) - I'm running llama.cpp and OpenWebUI and it's able to run the code. I'm only generating at ~20 t/s for this, so it's actually quite a bit slower than the remote models, and revisions are a bit painful (I'll probably switch to 14B Q4_K_M as something that is good enough and should be >2X faster)

[–]randomfoo2 8 points9 points  (0 children)

BTW, for those that want to do their own dataviz, to save some tokens:

| Model/BPW | 8.0 | 6.5 | 6.0 | 5.5 | 5.0 | 4.5 | 4.0 | 3.5 | 3.0 | 2.5 |
|---|---|---|---|---|---|---|---|---|---|---|
| Phi-3.5-mini-instruct | 0.66279 | 0.65304 | 0.67499 | 0.64511 | 0.59877 | 0.59633 | 0.53597 | 0.33658 | 0.17499 | 0.00731 |
| Phi-3-medium-128k-instruct | 0.74999 | 0.74145 | 0.74999 | 0.73658 | 0.71645 | 0.71462 | 0.67133 | 0.65792 | 0.56280 | 0.29023 |
| Qwen2.5-32B-Instruct | 0.91920 | 0.90487 | 0.90853 | 0.90649 | 0.90487 | 0.91584 | 0.90487 | 0.90609 | 0.87316 | 0.80060 |
| Qwen2.5-14B-Instruct | 0.82072 | 0.84755 | 0.80487 | 0.81341 | 0.82926 | 0.82316 | 0.79694 | 0.78109 | 0.71645 | 0.61706 |
| Qwen2.5-7B-Instruct | 0.86158 | 0.88170 | 0.86219 | 0.84390 | 0.86036 | 0.83353 | 0.81828 | 0.75731 | 0.62560 | 0.44450 |
| Qwen2.5-3B-Instruct | 0.76158 | 0.75548 | 0.73658 | 0.73048 | 0.72499 | 0.69572 | 0.68474 | 0.63170 | 0.44450 | 0.13414 |
| Qwen2.5-1.5B-Instruct | 0.59959 | 0.56707 | 0.56910 | 0.54064 | 0.56503 | 0.56503 | 0.55487 | 0.30487 | 0.08129 | 0.03658 |
| Qwen2.5-Coder-32B-Instruct | 0.93495 | 0.93292 | 0.93902 | 0.93902 | 0.92682 | 0.93495 | 0.92682 | 0.92682 | 0.92479 | 0.91259 |
| Qwen2.5-Coder-14B-Instruct | 0.93292 | 0.92682 | 0.93292 | 0.92682 | 0.92682 | 0.93292 | 0.93292 | 0.92073 | 0.90243 | 0.81707 |
| Qwen2.5-Coder-7B-Instruct | 0.90487 | 0.90060 | 0.89084 | 0.89938 | 0.90243 | 0.90853 | 0.90182 | 0.86463 | 0.84146 | 0.75426 |
| Qwen2.5-Coder-3B-Instruct | 0.83942 | 0.84552 | 0.83942 | 0.84552 | 0.82113 | 0.80487 | 0.82316 | 0.79064 | 0.66259 | 0.47560 |
| Qwen2.5-Coder-1.5B-Instruct | 0.70934 | 0.68698 | 0.68292 | 0.70324 | 0.65853 | 0.65853 | 0.68495 | 0.52032 | 0.34349 | 0.17479 |
| Gemma-2-2b-it | 0.40060 | 0.41463 | 0.42438 | 0.42012 | 0.40548 | 0.42865 | 0.40670 | 0.40365 | 0.21341 | 0.11646 |
| Gemma-2-9b-it | 0.64389 | 0.64024 | 0.64634 | 0.64634 | 0.62255 | 0.62865 | 0.65182 | 0.65304 | 0.61585 | 0.46950 |
| Gemma-2-27b-it | 0.79268 | 0.78292 | 0.78658 | - | 0.78170 | 0.77317 | 0.78902 | 0.76829 | 0.76463 | - |
| Meta-Llama-3.1-8B-Instruct | 0.70670 | 0.70975 | 0.70792 | 0.68840 | 0.70182 | 0.68170 | 0.64694 | 0.59938 | 0.49511 | 0.25853 |
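If anyone wants a starting point for re-plotting this, here's a minimal matplotlib sketch; only two rows are hard-coded for brevity (values copied from the table above), extend the dict with the rest:

```python
# Sketch: plot HumanEval score vs. BPW for a couple of rows from the table above.
import matplotlib.pyplot as plt

bpw = [8.0, 6.5, 6.0, 5.5, 5.0, 4.5, 4.0, 3.5, 3.0, 2.5]
scores = {
    "Qwen2.5-Coder-32B-Instruct": [0.93495, 0.93292, 0.93902, 0.93902, 0.92682,
                                   0.93495, 0.92682, 0.92682, 0.92479, 0.91259],
    "Meta-Llama-3.1-8B-Instruct": [0.70670, 0.70975, 0.70792, 0.68840, 0.70182,
                                   0.68170, 0.64694, 0.59938, 0.49511, 0.25853],
}

for name, ys in scores.items():
    plt.plot(bpw, ys, marker="o", label=name)

plt.gca().invert_xaxis()  # read left-to-right as increasingly aggressive quantization
plt.xlabel("bits per weight (BPW)")
plt.ylabel("HumanEval score")
plt.legend()
plt.show()
```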

[–]Ok_Mine189[S] 4 points5 points  (0 children)

Thank you, I was planning to add such a chart today, but you saved me the time!

[–]QuantuisBenignus 2 points3 points  (0 children)

That's cool!

In my case no $/tok, I did it the old fashioned way with Mathematica.

10 min of my time to clean up the TextRecognize[] result and format the ArrayPlot[].

[–]tmvr 7 points8 points  (1 child)

I think this just illustrates how useless this benchmark has become for any quality assessment. There isn't even a need to overanalyse the data in the table; this alone is enough:

Qwen2.5-Coder-3B-Instruct scores >0.8 all the way down to 4.0 BPW

There is no way this quality measurement has any relation to real world usage.

[–]Ok_Mine189[S] 2 points3 points  (0 children)

Yeah, it's quite possible that the newest models either simply became too smart for the difficulty level this eval provides OR (more likely, in my opinion) its data was included in the training datasets. Just look at Gemma2 2B, it's holding the same score even at 3.5 bpw. That's just sus as hell, right?

[–]ortegaalfredoAlpaca 8 points9 points  (2 children)

I can easily see a difference between 4-bit and 8-bit AWQ, but EXL2 is much better at lower quants. Seems that the sweet spot is at Qwen-Coder-14B and 4 bpw. That means you can run a GPT-4-level coder on a 12GB GPU.
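As a rough sanity check on the 12GB claim, some napkin math (a sketch; the parameter count and overhead factor below are assumptions, not measurements):

```python
# Back-of-envelope VRAM estimate for a 4 bpw EXL2 quant of a 14B model.
params_b = 14.7    # assumed Qwen2.5-Coder-14B parameter count, in billions
bpw = 4.0          # bits per weight of the quant
overhead = 1.25    # assumed ~25% extra for KV cache, activations and buffers

weights_gb = params_b * bpw / 8      # GB needed just for the weights
total_gb = weights_gb * overhead

print(f"weights: {weights_gb:.1f} GB, with overhead: ~{total_gb:.1f} GB")
# -> weights: 7.4 GB, with overhead: ~9.2 GB, which fits in 12 GB
```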

[–]Jellonling 6 points7 points  (1 child)

Yeah, I fully switched to exl2. 4bpw exl2 quants are super fast and extremely good for their size. And for larger models, even 3bpw performs really well. I still see so many people using GGUFs with offloading instead of just using a smaller exl2 quant, which is usually better in quality and leagues better in speed.

[–]Ok_Mine189[S] 5 points6 points  (0 children)

Makes you wonder how much of HumanEval has leaked into training datasets by now... Just look at Gemma 2 2B - seemingly unimpacted even at 3.5 bpw!

[–]CheatCodesOfLife 4 points5 points  (0 children)

So it seems like I could really swap from 8bpw to 6bpw for 72B models, free up a GPU, and be totally unaffected

[–]a_slay_nub 6 points7 points  (0 children)

I really wish we would expand HumanEval to more than 164 problems...

[–]ResearchCrafty1804 2 points3 points  (0 children)

Qwen Coder models seem to be the most resilient in terms of keeping their performance under quantisation

[–]Steuern_Runter 2 points3 points  (1 child)

What temperature did you set for the Qwen Coder models?

[–]Ok_Mine189[S] 2 points3 points  (0 children)

I used these for all the models: temp 0.01, top_k 1, top_p 0.01.
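For anyone wanting to reproduce that, here's a minimal sketch of near-greedy sampling with ExLlamaV2, assuming the ExLlamaV2Sampler.Settings API; adapt it to however you drive the generator:

```python
# Minimal sketch: near-deterministic sampling settings in ExLlamaV2
# (assumes the ExLlamaV2Sampler.Settings API; adjust to your generator setup).
from exllamav2.generator import ExLlamaV2Sampler

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.01  # effectively deterministic
settings.top_k = 1           # always pick the single most likely token
settings.top_p = 0.01        # redundant with top_k = 1, but matches the benchmark config

# The settings object is then passed to the generator, e.g.
# generator.generate_simple(prompt, settings, max_new_tokens)
```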

[–]ethertype 1 point2 points  (1 child)

Thank you very much for this, u/Ok_Mine189 . I noticed this:

<image>

Is the resolution of the scores a lot higher than the tests actually justify, or does Q6 (and Q5.5) actually score better than Q8?

[–]Ok_Mine189[S] 1 point2 points  (0 children)

Well, even if you round it to the 3rd decimal place, it's 93.5% vs 93.9%. Not that big of a difference. And FYI, these scores are an average of 3 to 5 runs for each quant (to minimize the variance). My wild guess is that stronger quantization weakens the built-in inhibitions/guardrails, making such models very slightly "smarter" (or at least better scoring) at lower quants. But it's just my opinion, nothing I can prove anyway.

[–]schlammsuhler 1 point2 points  (0 children)

Could someone do a graph of score per GB?

[–]dahara111 1 point2 points  (0 children)

What about hqq?

[–]Equivalent_Bat_3941 1 point2 points  (0 children)

Nice to know

[–]AutomataManifold 1 point2 points  (2 children)

It'd be interesting to include the VRAM use, so we could make comparisons like "with 24GB of VRAM, which model/quant combinations fit and have the highest performance?"
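A rough sketch of what that comparison could look like; the parameter counts and the overhead factor are assumptions, and the scores are copied from the table above:

```python
# Sketch: pick the highest-scoring (model, bpw) combo that fits a VRAM budget.
BUDGET_GB = 24.0
OVERHEAD = 1.2  # assumed headroom for KV cache, activations and buffers

params_b = {"Qwen2.5-Coder-32B-Instruct": 32.8, "Qwen2.5-Coder-14B-Instruct": 14.7}
scores = {  # (model, bpw) -> HumanEval score, taken from the table above
    ("Qwen2.5-Coder-32B-Instruct", 4.0): 0.92682,
    ("Qwen2.5-Coder-32B-Instruct", 5.0): 0.92682,
    ("Qwen2.5-Coder-14B-Instruct", 8.0): 0.93292,
}

def est_vram_gb(model: str, bpw: float) -> float:
    """Approximate VRAM use: weights only, plus a flat overhead factor."""
    return params_b[model] * bpw / 8 * OVERHEAD

fitting = {k: v for k, v in scores.items() if est_vram_gb(*k) <= BUDGET_GB}
best = max(fitting, key=fitting.get)
print(best, fitting[best], f"~{est_vram_gb(*best):.1f} GB")
```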

[–]Ok_Mine189[S] 0 points1 point  (1 child)

I'll try to provide such data, but it might take me until the next weekend to find the time for it.

[–]AutomataManifold 1 point2 points  (0 children)

No worries, I was just thinking about what data would answer 50% of the "what model should I use" posts.

[–]ramzeez88 1 point2 points  (5 children)

Is there a GUI to work with exl2? As easy as possible. And a server?

[–]Ok_Mine189[S] 1 point2 points  (4 children)

You can use EXUI: turboderp/exui: Web UI for ExLlamaV2. It's made by the author of ExLlamaV2 himself ;)

[–]ramzeez88 1 point2 points  (3 children)

Thanks, can it work as a server too?

[–]Ok_Mine189[S] 1 point2 points  (2 children)

You need to install the exllamav2 wheels, and then EXUI will be able to automatically load the selected model for inference. If you meant an API server, then afaik TabbyAPI is the official API server for ExLlamaV2.
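For illustration, TabbyAPI speaks an OpenAI-compatible HTTP API, so a client call could look roughly like this; the port, API key header and model name are placeholders that depend on your config:

```python
# Minimal sketch of querying an OpenAI-compatible endpoint such as TabbyAPI.
# Port, API key and model name are assumptions; check your own TabbyAPI config.
import requests

resp = requests.post(
    "http://localhost:5000/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "Qwen2.5-Coder-32B-Instruct-exl2-4.0bpw",  # hypothetical model name
        "messages": [{"role": "user", "content": "Write a Python function that reverses a string."}],
        "temperature": 0.01,
        "top_p": 0.01,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```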

[–]ramzeez88 1 point2 points  (1 child)

Thanks mate :)

[–]Ok_Mine189[S] 0 points1 point  (0 children)

You're welcome. Enjoy!

[–]ekaknr 1 point2 points  (3 children)

Can anyone knowledgeable (about these things) compare MLX against exllama (which I've never tried)? Are MLX 4-bit models (the only other option being 8-bit) any good, or do they compare similarly? Is there a difference in the quantization methods?

[–]Ok_Mine189[S] 1 point2 points  (2 children)

I wish I could help you, friend; unfortunately I don't own Apple silicon. Perhaps some good soul will oblige you though. Fingers crossed!

[–]ekaknr 1 point2 points  (1 child)

No worries, appreciate your good intent! Does ExLlamaV2 even work on Apple Silicon? I'm guessing not, based on what I can see on the GitHub page.

[–]Ok_Mine189[S] 1 point2 points  (0 children)

You're right, it looks like it's Linux & Windows only.

[–]Such_Advantage_6949 1 point2 points  (1 child)

Thank you very much. It is really great work!

[–]Ok_Mine189[S] 0 points1 point  (0 children)

Thanks, I did put some time and effort into it :)

[–]FullOf_Bad_Ideas 1 point2 points  (0 children)

I think it's interesting how Qwen 2.5 32B Instruct loses a lot of quality at 2.5bpw but 32B Coder doesn't.