“Int-4 LLaMa Is Not Enough—Int-3 and Beyond: More Compression, Easier to Build Apps on LLMs That Run Locally”, nolano.org, 2023-03-13:

Last Thursday we demonstrated for the first time that GPT-3-level LLM inference is possible via Int-4 quantized LLaMA models, using our implementation built on the awesome ggml C/C++ library.

…Today we share more exciting news about the prospects of running LLMs locally, on two fronts: lowering the RAM usage of these models through quantization beyond Int-4, and making it easier to build Python apps on top of faster LLM inference.
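To make the RAM argument concrete, here is a minimal sketch of blockwise symmetric 3-bit quantization. This is an illustration of the general technique, not the nolano.org/ggml implementation: the block size, scale format, and rounding scheme are assumptions chosen for clarity.

```python
# Illustrative sketch of blockwise Int-3 quantization (NOT the ggml code):
# each block of 32 float32 weights is stored as 3-bit integers plus one
# float32 scale, cutting ~32 bits/weight down to ~4 bits/weight.
import numpy as np

def quantize_int3(weights: np.ndarray, block: int = 32):
    """Quantize a flat float32 array to 3-bit ints with one scale per block."""
    w = weights.reshape(-1, block)
    # Symmetric quantization: map each block's max magnitude to +/-3.
    scale = np.abs(w).max(axis=1, keepdims=True) / 3.0
    scale[scale == 0] = 1.0  # avoid division by zero for all-zero blocks
    q = np.clip(np.round(w / scale), -4, 3).astype(np.int8)  # 3-bit range
    return q, scale.astype(np.float32)

def dequantize_int3(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover approximate float32 weights from quantized blocks."""
    return (q.astype(np.float32) * scale).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
q, s = quantize_int3(w)
w_hat = dequantize_int3(q, s)
# Storage: 3 bits/weight + one fp32 scale per 32 weights ~= 4 bits/weight,
# roughly an 8x RAM reduction versus float32.
```

The rounding error per weight is bounded by half the block scale, which is why per-block scales (rather than one scale for the whole tensor) keep sub-Int-4 quantization usable.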

…Furthermore, we intend to integrate support for Google’s Flan series and EleutherAI’s GPT-Neo, both of which are truly open-source language models. By incorporating these additional models, we aim to provide developers with a comprehensive, flexible, and powerful toolkit that can be used to tackle a wide range of applications and problems.