“Int-4 LLaMa Is Not Enough—Int-3 and Beyond: More Compression, Easier to Build Apps on LLMs That Run Locally”, 2023-03-13:
Last Thursday we demonstrated for the first time that GPT-3-level LLM inference is possible with Int4-quantized LLaMa models, using our implementation built on the awesome ggml C/C++ library.
…Today we share more exciting news about the prospects of running LLMs locally, on two fronts: lowering the RAM usage of these models through quantization beyond Int-4, and making it easier to build Python apps on top of faster LLM inference.
- GPTQ-style quantization improves performance over the naive Round-to-Nearest (RtN) baseline in nearly all cases, but it degrades for smaller models, depending on the type of quantization performed.
- The bin size for Int4 quantization can be further increased from the current size of 32 without much performance degradation, leading to a 15% reduction in the RAM required to store weights, even for the 7B LLaMa model.
- LLaMa-1-13B can be Int3-quantized (with a much larger bin size) without much additional performance drop over Int4 quantization, leading to a 30–35% reduction in the RAM required to store weights for larger models.
- While Int2 quantization is not usable for LLaMa-1-13B, larger models may be 2-bit quantizable without much performance drop.
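To make the bin-size arithmetic above concrete, here is a minimal sketch of round-to-nearest (RtN) group quantization and of the bits-per-weight cost of a given bit width and bin size. This is an illustration under stated assumptions, not ggml's actual storage layout: the function names are made up, and the overhead model assumes one fp16 scale plus one fp16 zero-point per bin.

```python
import numpy as np

def rtn_quantize(weights, bits=4, group_size=32):
    """Naive RtN: quantize a 1-D float array to `bits`-bit ints,
    with one (scale, min) pair per group of `group_size` weights."""
    levels = 2 ** bits - 1
    w = weights.reshape(-1, group_size)
    wmin = w.min(axis=1, keepdims=True)
    scale = (w.max(axis=1, keepdims=True) - wmin) / levels
    scale[scale == 0] = 1.0  # guard against constant groups
    q = np.round((w - wmin) / scale).astype(np.uint8)
    return q, scale, wmin

def rtn_dequantize(q, scale, wmin):
    return (q.astype(np.float32) * scale + wmin).reshape(-1)

def bits_per_weight(bits, group_size, scale_bits=16, zero_bits=16):
    # Each group stores `group_size` quantized values plus a scale
    # and zero-point, so larger groups amortize the fixed overhead.
    return bits + (scale_bits + zero_bits) / group_size
```

Under these assumptions, Int4 with bin size 32 costs 5.0 bits per weight; growing the bin to 128 drops that to 4.25 bits (a 15% saving, in line with the figure above), and Int3 with bin size 64 costs 3.5 bits. The trade-off is that larger bins share one scale across more weights, which is why performance eventually degrades.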
…Furthermore, we intend to integrate support for Google’s Flan series and GPT-Neo, both of which are truly open-source language models. By incorporating these additional models, we aim to provide developers with a comprehensive, flexible, and powerful toolkit that can be used to tackle a wide range of challenges and problems.