RWKV (R) (seqlen = 4k) vs. Pythia (GPT-3-style) (T) (seqlen = 2k). RWKV is unable to utilize its full context length. My current bet is on multiquery local attn with TXL recurrence and some global attn. Multiquery local attn can make decoding almost as fast as an RNN.

Mar 23, 2023 · 8:25 PM UTC
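A minimal sketch of what "multiquery local attn with TXL recurrence" could look like, assuming one shared K/V head for all query heads (multi-query, Shazeer 2019) and a cached previous-segment memory (Transformer-XL recurrence). All names and shapes are illustrative, not from any released model:

```python
import torch
import torch.nn.functional as F

def mq_local_attn(q, k, v, mem_k, mem_v, window):
    # q: (B, H, T, D). k, v, mem_k, mem_v: (B, 1, *, D) -- a single K/V head
    # shared by all H query heads (multi-query), so the decode-time KV cache
    # is H times smaller.
    M = mem_k.size(2)
    # TXL recurrence: prepend the previous segment's cached (detached) K/V.
    k = torch.cat([mem_k, k], dim=2)
    v = torch.cat([mem_v, v], dim=2)
    B, H, T, D = q.shape
    scores = q @ k.transpose(-2, -1) / D ** 0.5          # (B, H, T, M+T)
    q_pos = torch.arange(T)[:, None] + M                 # absolute query positions
    k_pos = torch.arange(M + T)[None, :]
    mask = (k_pos <= q_pos) & (q_pos - k_pos < window)   # causal + local window
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v                 # (B, H, T, D)
```

At decode time only the last `window` K/V pairs (for one head) need caching, so per-token cost is constant in context length, which is why this can approach RNN decoding speed.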

That said, RWKV is still in its infancy, and things may change in the future. But I think we still need some global-interaction layer in addition to local attention or an RNN for local interaction.
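One way to read "global-interaction layer in addition to local attention": interleave mostly-local layers with an occasional full-attention layer. A toy sketch under assumed sizes and layer period (real blocks would add MLPs and norms):

```python
import torch
import torch.nn as nn

def causal_mask(T, window=None):
    # True = masked out, as nn.MultiheadAttention's bool attn_mask expects.
    i = torch.arange(T)
    allowed = i[None, :] <= i[:, None]                  # causal
    if window is not None:
        allowed &= i[:, None] - i[None, :] < window     # local band
    return ~allowed

class LocalGlobalStack(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_layers=12,
                 window=256, global_every=4):
        super().__init__()
        self.attns = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(n_layers))
        # Every `global_every`-th layer sees the full context; the rest are local.
        self.windows = [None if (i + 1) % global_every == 0 else window
                        for i in range(n_layers)]

    def forward(self, x):                               # x: (B, T, d_model)
        T = x.size(1)
        for attn, w in zip(self.attns, self.windows):
            out, _ = attn(x, x, x, attn_mask=causal_mask(T, w))
            x = x + out                                 # residual; MLP/norms omitted
        return x
```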
Replying to @arankomatsuzaki
> multiquery local attn with TXL recurrence

where can I learn more?
Replying to @arankomatsuzaki
what are 1) multiquery local attn 2) TXL recurrence?
Replying to @arankomatsuzaki
It would be surprising if it beat a transformer of the same size, given that an RNN has to push the context through a tiny bottleneck. Compare the price of inference for 7B (R) with 3B (T). 7B (R) must be lower, at least on specialized hardware. If it is, everything is fine.
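To make that comparison concrete, a back-of-the-envelope count of memory reads per decoded token at batch size 1. Layer count and width are assumptions, not RWKV's or Pythia's actual configs:

```python
PARAMS_R = 7e9                       # 7B RNN: reads its weights once per token
PARAMS_T = 3e9                       # ~3B transformer
N_LAYERS, D_MODEL = 32, 2560         # assumed transformer shape

def rnn_reads(ctx):
    return PARAMS_R                  # context-independent: the RNN state is tiny

def transformer_reads(ctx):
    # Weights plus the KV cache: ~2 vectors per layer per past token.
    return PARAMS_T + 2 * N_LAYERS * D_MODEL * ctx

for ctx in (2_048, 8_192, 32_768):
    print(f"ctx={ctx:>6}: 7B RNN reads / 3B T reads = "
          f"{rnn_reads(ctx) / transformer_reads(ctx):.2f}")
```

With these numbers the 7B RNN only becomes cheaper per token past roughly 24k tokens of context; batching and hardware shift the picture, since weights amortize over a batch while the KV cache does not.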
Replying to @arankomatsuzaki
can you explain what TXL is?
Replying to @arankomatsuzaki
do you think a version of Pythia trained on 4k would actually reach a lower loss for the 2k+ positions compared to this one trained at 2k tho
Replying to @arankomatsuzaki
You've been betting on the RNN+attention combo for a while. I hope you will demonstrate its effectiveness with future LLMs from EleutherAI/Stable Diff
Replying to @arankomatsuzaki
Alternative hypotheses worth falsifying if not tested yet:

1. Tokens earlier than ~3k positions back are inherently uninformative in the test corpus. To falsify, train a transformer-based model with context length >= 4k and see if test loss continues to decrease at those positions (see the sketch below). 1/2
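A minimal sketch of that measurement, i.e. test loss as a function of token position; an HF-style model output with a `.logits` field is assumed:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def loss_by_position(model, docs):
    # docs: (n_docs, seq_len) long tensor of equal-length test documents.
    n_docs, seq_len = docs.shape
    totals = torch.zeros(seq_len - 1)
    for ids in docs:
        logits = model(ids[None, :-1]).logits              # (1, seq_len-1, vocab)
        totals += F.cross_entropy(logits[0], ids[1:], reduction="none")
    # Mean loss at each position; if hypothesis 1 holds, the curve flattens
    # past ~3k even for models with longer context.
    return totals / n_docs
```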