RWKV (R, seqlen = 4k) vs. Pythia (GPT-3-style) (T, seqlen = 2k). RWKV is unable to utilize its full context length.
My current bet is on multi-query local attention with Transformer-XL (TXL) recurrence plus some global attention. Multi-query local attention can make decoding almost as fast as an RNN.
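A minimal numpy sketch (function and weight names are hypothetical, not from any real codebase) of why that combination decodes cheaply: with multi-query attention all query heads share a single K/V head, so the cache stores one key and one value per position rather than one per head, and with local (sliding-window) attention the cache length is bounded by the window, so per-token cost is constant, like an RNN state:

```python
import numpy as np

def mqa_local_decode_step(x, W_q, W_k, W_v, kv_cache, window=256):
    """One decoding step of multi-query local attention.

    Multi-query: all query heads share one K/V head, so the cache
    holds a single (k, v) pair per position instead of one per head.
    Local: only the last `window` positions are kept, so the cache
    is bounded and each step costs O(window * d) -- a fixed amount
    of work per token, like an RNN.
    """
    n_heads, d_head = W_q.shape[0], W_q.shape[2]
    q = np.einsum('d,hde->he', x, W_q)        # (n_heads, d_head) queries
    k = x @ W_k                               # (d_head,) shared key
    v = x @ W_v                               # (d_head,) shared value
    kv_cache.append((k, v))
    if len(kv_cache) > window:                # slide the window forward
        kv_cache.pop(0)
    K = np.stack([kk for kk, _ in kv_cache])  # (t, d_head)
    V = np.stack([vv for _, vv in kv_cache])  # (t, d_head)
    scores = q @ K.T / np.sqrt(d_head)        # (n_heads, t)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)             # softmax over the window
    out = w @ V                               # (n_heads, d_head)
    return out.reshape(-1), kv_cache
```

The TXL-recurrence and global-attention pieces in the post would sit on top of this: recurrence carries the previous segment's cached K/V forward, and a few global layers let information escape the local window.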
Mar 23, 2023 · 8:25 PM UTC