“How Good Are Low-Bit Quantized LLaMA-3 Models? An Empirical Study”, Wei Huang, Xudong Ma, Haotong Qin, Xingyu Zheng, Chengtao Lv, Hong Chen, Jie Luo, Xiaojuan Qi, Xianglong Liu, Michele Magno (2024-04-22):

Meta’s LLaMA family has become one of the most powerful open-source Large Language Model (LLM) series. Notably, the recently released LLaMA-3 models achieve impressive performance across various tasks, thanks to super-large-scale pre-training on over 15T tokens of data.

Given the wide application of low-bit quantization for LLMs in resource-limited scenarios, we explore LLaMA-3’s capabilities when quantized to low bit-widths. This exploration holds the potential to unveil new insights and challenges for low-bit quantization of LLaMA-3 and other forthcoming LLMs, especially in addressing the performance degradation commonly observed in LLM compression.

Specifically, we evaluate 10 existing post-training quantization and LoRA fine-tuning methods on LLaMA-3, at bit-widths from 1 to 8 and across diverse datasets, to comprehensively reveal LLaMA-3’s low-bit quantization performance. Our experimental results indicate that LLaMA-3 still suffers non-negligible degradation in these scenarios, especially at ultra-low bit-widths. This highlights a performance gap at low bit-widths that needs to be bridged in future developments.
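To make the trade-off concrete, the sketch below implements round-to-nearest (RTN) uniform quantization, the simplest baseline among post-training quantization methods. This is an illustrative example in our own notation, not the paper's code: the methods it actually evaluates (e.g. GPTQ, AWQ) are more sophisticated, but the bit-width/error trade-off behaves the same way directionally.

```python
# Illustrative asymmetric round-to-nearest (RTN) quantization of a weight
# matrix. All names here are hypothetical; this is a minimal sketch, not
# the paper's evaluation code.
import numpy as np

def quantize_rtn(w: np.ndarray, bits: int):
    """Quantize `w` to unsigned integers with `bits` bits (per-tensor)."""
    qmax = 2**bits - 1
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / qmax          # real-valued step size
    zero_point = np.round(-w_min / scale)   # integer offset for w_min
    q = np.clip(np.round(w / scale) + zero_point, 0, qmax)
    return q.astype(np.int32), scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: float) -> np.ndarray:
    """Map integer codes back to approximate real weights."""
    return (q.astype(np.float32) - zero_point) * scale

# Reconstruction error grows sharply as bit-width shrinks, mirroring the
# degradation trend the study reports at ultra-low bit-widths.
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)
for bits in (8, 4, 2):
    q, s, z = quantize_rtn(w, bits)
    err = float(np.abs(dequantize(q, s, z) - w).mean())
    print(f"{bits}-bit mean abs error: {err:.4f}")
```

Running the loop shows the mean reconstruction error rising by orders of magnitude between 8-bit and 2-bit, which is the regime where the paper finds LLaMA-3's accuracy collapses.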

We expect that this empirical study will prove valuable in advancing future models, pushing LLMs to lower bit-widths with higher accuracy so that they become practical to deploy.

Our project is released at https://github.com/Macaronlin/LLaMA-3-Quantization, and the quantized LLaMA-3 models are released at https://huggingface.co/LLMQ.