Can't believe I'm actually running a 130B-parameter model (GLM-130B) on 2x A6000s, entirely in GPU memory (no CPU offload). INT4 quantization really does seem to work! (NB: it ain't fast – around 2 minutes for this generated sample.)
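(For anyone wondering how it fits: at 4 bits per weight, 130B parameters is roughly 65 GB, which squeezes into the 2x 48 GB of a pair of A6000s. And if "INT4 quantization" sounds abstract, here's a minimal PyTorch sketch of symmetric, per-group weight-only quantization – not the GLM-130B kernels, just an illustration; the function names and group size are made up, and real implementations pack two 4-bit values per byte and use custom CUDA kernels rather than int8 storage like this.)

```python
import torch

def quantize_int4(w: torch.Tensor, group_size: int = 128):
    """Illustrative symmetric per-group INT4 quantization of a weight matrix."""
    orig_shape = w.shape
    groups = w.reshape(-1, group_size)                     # split weights into groups
    scale = groups.abs().amax(dim=1, keepdim=True) / 7.0   # map max |w| in each group to 7
    scale = scale.clamp(min=1e-8)                          # avoid divide-by-zero for all-zero groups
    # Round to integers in [-8, 7]; stored in int8 here for simplicity
    # (a real kernel packs two 4-bit codes per byte).
    q = torch.clamp(torch.round(groups / scale), -8, 7).to(torch.int8)
    return q.reshape(orig_shape), scale

def dequantize_int4(q: torch.Tensor, scale: torch.Tensor, group_size: int = 128):
    """Recover approximate FP16 weights from the INT4 codes and per-group scales."""
    groups = q.reshape(-1, group_size).to(torch.float16) * scale.to(torch.float16)
    return groups.reshape(q.shape)

# Quick sanity check on a random FP16 weight matrix
w = torch.randn(4096, 4096, dtype=torch.float16)
q, s = quantize_int4(w)
w_hat = dequantize_int4(q, s)
print((w - w_hat).abs().mean())  # small per-weight quantization error
```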
Oct 18, 2022 · 3:34 AM UTC
Apparently they've also added GLM to FasterTransformer and support INT4 and INT8 quantization there too, with a ~2.5x speedup in inference time over the PyTorch implementation. Gotta try that next :D