Can't believe I'm actually running a 130B parameter model (GLM-130B) on 2xA6000s, entirely in GPU memory (no CPU offload). INT4 quantization actually seems to work! (NB: it ain't fast – around 2 minutes for this generated sample)

Oct 18, 2022 · 3:34 AM UTC
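(A quick back-of-the-envelope check on why this fits: 130B parameters at 4 bits each is roughly 65 GB of weights, comfortably under the 96 GB across two A6000s. Rough sketch only, assuming every weight is stored at INT4 and ignoring activations, the KV cache, and any layers kept at higher precision:)

```python
# Back-of-the-envelope memory math (sketch: assumes all weights are INT4;
# ignores activations, KV cache, and any layers left in FP16).
n_params = 130e9                 # GLM-130B parameter count
bytes_per_weight = 4 / 8         # INT4 = half a byte per weight
weight_gb = n_params * bytes_per_weight / 1e9
print(f"~{weight_gb:.0f} GB of weights vs. 2 x 48 GB = 96 GB on two A6000s")
```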

Apparently they've also added GLM to FasterTransformer and support INT4 and INT8 quantization there, too, with ~2.5X speedup in inference time over the PyTorch implementation. Gotta try that next :D
Not really sure how good this model is overall; the output seems pretty weird for a 130B model. But maybe it's better in Chinese?
Replying to @moyix
What's the difference between [gMASK] and [MASK]?
[MASK] is short infill, [gMASK] is longer left-to-right generation
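In practice the two tokens just select different prompt styles, something like the following (a sketch only; the prompt strings mirror the repo's examples, but how you feed them to the generation script isn't shown here):

```python
# [MASK]: blank infilling, the model fills in a short span at the marked position.
infill_prompt = "Ng is an adjunct professor at [MASK] (formerly associate professor)."

# [gMASK]: open-ended, left-to-right generation continuing from the end of the context.
generation_prompt = "Who is the greatest artist? The greatest artist is [gMASK]"
```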
Replying to @moyix
have you run any benchmarks yet? any performance degradation? (sidenote: every time i see another one of your posts, i inch a little bit closer to buying the Berkeley font, it looks so good)
Haven't run any benchmarks, in part because I don't have the local GPUs needed to run the full-size model for comparison!
Replying to @moyix
Just out of curiosity, is this the Berkeley font?
It is indeed, I'm fully a convert and some say my methods have become... unsound
Replying to @moyix
And with Berkeley Mono + Cool Retro Term (forgot to actually set the font!):
Replying to @moyix
So, INT4 means that there are only 16 possible weight values? Crazy that this works!
Yeah, it's kind of shocking to me! I need to read the paper and figure out how they did it
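For intuition, here's a minimal sketch of weight-only absmax quantization to 4 bits, where each weight becomes one of 16 integer levels plus a per-row FP16 scale. This is the general idea, not necessarily GLM-130B's exact scheme:

```python
import torch

def quantize_int4_absmax(w: torch.Tensor):
    # Per-row absmax quantization: map the largest |weight| in each row to the
    # int4 limit, so every weight lands on one of 16 integer levels (-8..7).
    scale = (w.abs().amax(dim=1, keepdim=True) / 7.0).clamp_min(1e-8)
    q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
    return q, scale

def dequantize_int4(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Recover an FP16 approximation of the original weights at matmul time.
    return (q.to(torch.float32) * scale).to(torch.float16)

w = torch.randn(4, 8)                      # stand-in for one weight matrix
q, scale = quantize_int4_absmax(w)
w_hat = dequantize_int4(q, scale)
print("max abs error:", (w - w_hat.float()).abs().max().item())
```

The reason 16 levels isn't as brutal as it sounds: each row keeps its own full-precision scale, so those 16 levels adapt to that row's weight range rather than being shared across the whole matrix.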