Can't believe I'm actually running a 130B parameter model (GLM-130B) on 2xA6000s, entirely in GPU memory (no CPU offload). INT4 quantization actually seems to work! (NB: it ain't fast – around 2 minutes for this generated sample)

Oct 18, 2022 · 3:34 AM UTC
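(A quick back-of-the-envelope check on why this fits: 130B parameters at 4 bits each is roughly 65 GB of weights, comfortably under the 96 GB across two A6000s. Rough sketch only, assuming every weight is stored at INT4 and ignoring activations, the KV cache, and any layers kept at higher precision:)

```python
# Back-of-the-envelope memory math (sketch: assumes all weights are INT4;
# ignores activations, KV cache, and any layers left in FP16).
n_params = 130e9                 # GLM-130B parameter count
bytes_per_weight = 4 / 8         # INT4 = half a byte per weight
weight_gb = n_params * bytes_per_weight / 1e9
print(f"~{weight_gb:.0f} GB of weights vs. 2 x 48 GB = 96 GB on two A6000s")
```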

Apparently they've also added GLM to FasterTransformer and support INT4 and INT8 quantization there, too, with ~2.5X speedup in inference time over the PyTorch implementation. Gotta try that next :D
Not really sure how good this model is overall; the output seems pretty weird for a 130B model. But maybe it's better in Chinese?
Replying to @moyix
What's the difference between [gMASK] and [MASK]?
[MASK] is short infill, [gMASK] is longer left-to-right generation
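In practice the two tokens just select different prompt styles, something like the following (a sketch only; the prompt strings mirror the repo's examples, but how you feed them to the generation script isn't shown here):

```python
# [MASK]: blank infilling, the model fills in a short span at the marked position.
infill_prompt = "Ng is an adjunct professor at [MASK] (formerly associate professor)."

# [gMASK]: open-ended, left-to-right generation continuing from the end of the context.
generation_prompt = "Who is the greatest artist? The greatest artist is [gMASK]"
```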
Replying to @moyix
have you run any benchmarks yet? any performance degradation? (sidenote: every time i see another one of your posts, i inch a little bit closer to buying the Berkeley font, it looks so good)
Haven't run any benchmarks, in part because I don't have the local GPUs needed to run the full-size model for comparison!
Replying to @moyix
Just out of curiosity, is this the Berkeley font?
It is indeed, I'm fully a convert and some say my methods have become... unsound
Replying to @moyix
And with Berkeley Mono + Cool Retro Term (forgot to actually set the font!):
Replying to @moyix
So, INT4 means that there are only 16 possible weight values? Crazy that this works!
Yeah, it's kind of shocking to me! I need to read the paper and figure out how they did it
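For intuition, here's a minimal sketch of weight-only absmax quantization to 4 bits, where each weight becomes one of 16 integer levels plus a per-row FP16 scale. This is the general idea, not necessarily GLM-130B's exact scheme:

```python
import torch

def quantize_int4_absmax(w: torch.Tensor):
    # Per-row absmax quantization: map the largest |weight| in each row to the
    # int4 limit, so every weight lands on one of 16 integer levels (-8..7).
    scale = (w.abs().amax(dim=1, keepdim=True) / 7.0).clamp_min(1e-8)
    q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
    return q, scale

def dequantize_int4(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Recover an FP16 approximation of the original weights at matmul time.
    return (q.to(torch.float32) * scale).to(torch.float16)

w = torch.randn(4, 8)                      # stand-in for one weight matrix
q, scale = quantize_int4_absmax(w)
w_hat = dequantize_int4(q, scale)
print("max abs error:", (w - w_hat.float()).abs().max().item())
```

The reason 16 levels isn't as brutal as it sounds: each row keeps its own full-precision scale, so those 16 levels adapt to that row's weight range rather than being shared across the whole matrix.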