“SpikeGPT: Generative Pre-Trained Language Model With Spiking Neural Networks”, Rui-Jie Zhu, Qihang Zhao, Jason K. Eshraghian (2023-02-27)⁠:

As the size of large language models continues to scale, so do the computational resources required to run them. Spiking neural networks (SNNs) have emerged as an energy-efficient approach to deep learning that leverages sparse and event-driven activations to reduce the computational overhead associated with model inference. While they have become competitive with non-spiking models on many computer vision tasks, SNNs have also proven more challenging to train. As a result, their performance lags behind modern deep learning, and we have yet to see the effectiveness of SNNs in language generation.

In this paper, inspired by the RWKV language model, we successfully implement SpikeGPT, a generative language model with pure binary, event-driven spiking activation units. We train the proposed model [on enwik8, BookCorpus & OpenWebText2] [using surrogate gradients with backpropagation] in 3 variants: 45M, 125M, and 260M parameters. To the best of our knowledge, this is 4× larger than any functional backprop-trained SNN to date.
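[The surrogate-gradient trick mentioned above can be sketched numerically: the Heaviside spike function has zero gradient almost everywhere, so training substitutes a smooth proxy in the backward pass. The fast-sigmoid surrogate and the α value below are illustrative choices common in the SNN literature, not necessarily the ones SpikeGPT uses:]

```python
import numpy as np

def spike_forward(v, threshold=1.0):
    """Heaviside step: emit a binary spike when membrane potential crosses threshold."""
    return (v >= threshold).astype(np.float32)

def spike_surrogate_grad(v, threshold=1.0, alpha=2.0):
    """Surrogate derivative used only in the backward pass: the derivative of a
    fast sigmoid, which is nonzero everywhere so gradients can flow through
    the otherwise non-differentiable spike."""
    return alpha / (2 * (1 + alpha * np.abs(v - threshold)) ** 2)

v = np.array([0.2, 0.9, 1.1, 3.0], dtype=np.float32)
spikes = spike_forward(v)        # binary, event-driven activations: [0, 0, 1, 1]
grads = spike_surrogate_grad(v)  # all strictly positive, unlike the true gradient
```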

We achieve this by modifying the transformer block to replace multi-head self-attention, reducing the quadratic computational complexity in sequence length to linear. Input tokens are instead streamed in sequentially to our attention mechanism (as with typical SNNs).
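[The linear-in-sequence-length recurrence can be sketched as follows; this is a simplified RWKV-style token mixer with a scalar decay, not SpikeGPT's actual layer, but it shows why streaming tokens through a running state costs O(T) rather than O(T²):]

```python
import numpy as np

def rwkv_like_mix(keys, values, decay=0.9):
    """Minimal sketch of RWKV-style linear attention: each token updates a
    running exponentially-decayed numerator/denominator state, so the whole
    sequence is processed in a single O(T) pass with O(1) state."""
    num = 0.0   # running decayed sum of key-weighted values
    den = 1e-9  # running decayed sum of key weights (epsilon avoids /0)
    outputs = []
    for k, v in zip(keys, values):
        w = np.exp(k)            # positive weight derived from the key
        num = decay * num + w * v
        den = decay * den + w
        outputs.append(num / den)  # weighted average over the decayed past
    return outputs

outs = rwkv_like_mix(keys=[0.0, 0.0], values=[1.0, 3.0])
# First output is just the first value; later outputs blend in older tokens
# with exponentially decaying weight.
```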

Our preliminary experiments show that SpikeGPT remains competitive with non-spiking models on tested benchmarks, while maintaining 5× less energy consumption when processed on neuromorphic hardware that can leverage sparse, event-driven activations.

Our code implementation is available on GitHub.

Table 1: enwik8 results, measured in bits per character (bpc): the lower the better. Baseline comparisons are made with Reformer, Synthesizer (the best-performing dense version), Linear Transformer, Performer, Stacked LSTM, and SHA-LSTM. L, d, and T denote the number of blocks (network depth), dimension of features, and sequence length, respectively. Both Linear Transformer and Performer are implemented with customized CUDA kernels, and all other models are implemented in native PyTorch. (Note: Interim results. Still in training; to be updated.)

[So, SpikeGPT does reasonably well but still falls substantially short of the baseline dense quadratic Transformer GPTs.]

…4.3 Results: While our model’s test performance is slightly below that of the standard Transformer and several other Transformer variants, it nonetheless achieves comparable performance with 22× fewer synaptic operations (SynOps). SynOps is a metric that accounts for activation sparsity, where only multiply-accumulate operations using non-zero activations are counted. The Transformer is measured using full-precision (float32) SynOps, whereas SpikeGPT uses binarized SynOps. Therefore, a given SynOp for SpikeGPT is substantially cheaper in terms of energy consumption compared to a SynOp of the Transformer.
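[The SynOps accounting is simple arithmetic and worth making concrete. The layer size and 85% sparsity below are hypothetical illustrative numbers, not SpikeGPT's measured figures; the real 22× gap also reflects binarized vs. float32 operations, which this sketch does not price in:]

```python
def synops_dense(n_in, n_out):
    # In a dense ANN layer, every connection fires a multiply-accumulate.
    return n_in * n_out

def synops_spiking(n_in, n_out, sparsity):
    # In an SNN, only inputs that actually spiked (non-zero activations)
    # trigger an accumulate at their outgoing synapses.
    active_inputs = int(n_in * (1 - sparsity))
    return active_inputs * n_out

dense = synops_dense(1024, 1024)                     # 1,048,576 MACs
spiking = synops_spiking(1024, 1024, sparsity=0.85)  # only ~15% of inputs fire
ratio = dense / spiking  # roughly 6.7x fewer ops at this (assumed) sparsity
```

Energy then scales further in SpikeGPT's favor because each spiking SynOp is a cheap binary accumulate rather than a float32 multiply-accumulate.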