RWKV (R) (seqlen = 4k) vs. Pythia (GPT-3-style) (T) (seqlen = 2k). RWKV is unable to utilize its full context length. My current bet is on multiquery local attn with TXL recurrence and some global attn. Multiquery local attn can make decoding almost as fast as an RNN.

Mar 23, 2023 · 8:25 PM UTC
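A minimal sketch of what "multiquery local attn with TXL recurrence" could look like, assuming one shared K/V head for all query heads (multi-query, Shazeer 2019) and a cached previous-segment memory (Transformer-XL recurrence). All names and shapes are illustrative, not from any released model:

```python
import torch
import torch.nn.functional as F

def mq_local_attn(q, k, v, mem_k, mem_v, window):
    # q: (B, H, T, D). k, v, mem_k, mem_v: (B, 1, *, D) -- a single K/V head
    # shared by all H query heads (multi-query), so the decode-time KV cache
    # is H times smaller.
    M = mem_k.size(2)
    # TXL recurrence: prepend the previous segment's cached (detached) K/V.
    k = torch.cat([mem_k, k], dim=2)
    v = torch.cat([mem_v, v], dim=2)
    B, H, T, D = q.shape
    scores = q @ k.transpose(-2, -1) / D ** 0.5          # (B, H, T, M+T)
    q_pos = torch.arange(T)[:, None] + M                 # absolute query positions
    k_pos = torch.arange(M + T)[None, :]
    mask = (k_pos <= q_pos) & (q_pos - k_pos < window)   # causal + local window
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v                 # (B, H, T, D)
```

At decode time only the last `window` K/V pairs (for one head) need caching, so per-token cost is constant in context length, which is why this can approach RNN decoding speed.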

That said, RWKV is still in its infancy, and things may change in the future. But I think we still need some global-interaction layer in addition to local attention or an RNN for local interaction.
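One way to read "global-interaction layer in addition to local attention": interleave mostly-local layers with an occasional full-attention layer. A toy sketch under assumed sizes and layer period (real blocks would add MLPs and norms):

```python
import torch
import torch.nn as nn

def causal_mask(T, window=None):
    # True = masked out, as nn.MultiheadAttention's bool attn_mask expects.
    i = torch.arange(T)
    allowed = i[None, :] <= i[:, None]                  # causal
    if window is not None:
        allowed &= i[:, None] - i[None, :] < window     # local band
    return ~allowed

class LocalGlobalStack(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_layers=12,
                 window=256, global_every=4):
        super().__init__()
        self.attns = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(n_layers))
        # Every `global_every`-th layer sees the full context; the rest are local.
        self.windows = [None if (i + 1) % global_every == 0 else window
                        for i in range(n_layers)]

    def forward(self, x):                               # x: (B, T, d_model)
        T = x.size(1)
        for attn, w in zip(self.attns, self.windows):
            out, _ = attn(x, x, x, attn_mask=causal_mask(T, w))
            x = x + out                                 # residual; MLP/norms omitted
        return x
```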
Replying to @arankomatsuzaki
> multiquery local attn with TXL recurrence

where can I learn more?
Replying to @arankomatsuzaki
what are 1) multiquery local attn 2) TXL recurrence?
Replying to @arankomatsuzaki
It would be surprising if it beat a transformer of the same size, given that an RNN has to push the context through a tiny bottleneck. Compare the price of inference for 7B (R) with 3B (T). 7B (R) must be lower, at least on specialized hardware. If it is, everything is fine.
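To make that comparison concrete, a back-of-the-envelope count of memory reads per decoded token at batch size 1. Layer count and width are assumptions, not RWKV's or Pythia's actual configs:

```python
PARAMS_R = 7e9                       # 7B RNN: reads its weights once per token
PARAMS_T = 3e9                       # ~3B transformer
N_LAYERS, D_MODEL = 32, 2560         # assumed transformer shape

def rnn_reads(ctx):
    return PARAMS_R                  # context-independent: the RNN state is tiny

def transformer_reads(ctx):
    # Weights plus the KV cache: ~2 vectors per layer per past token.
    return PARAMS_T + 2 * N_LAYERS * D_MODEL * ctx

for ctx in (2_048, 8_192, 32_768):
    print(f"ctx={ctx:>6}: 7B RNN reads / 3B T reads = "
          f"{rnn_reads(ctx) / transformer_reads(ctx):.2f}")
```

With these numbers the 7B RNN only becomes cheaper per token past roughly 24k tokens of context; batching and hardware shift the picture, since weights amortize over a batch while the KV cache does not.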
Replying to @arankomatsuzaki
can you explain what TXL is?
Replying to @arankomatsuzaki
do you think a version of Pythia trained on 4k would actually reach a lower loss for the 2k+ positions compared to this one trained at 2k tho
Replying to @arankomatsuzaki
You've been betting on the RNN+attention combo for a while. I hope you will demonstrate its effectiveness with future LLMs from EleutherAI/Stable Diff
Replying to @arankomatsuzaki
Alternative hypotheses worth falsifying if not tested yet:

1. Tokens earlier than ~3k positions back are inherently uninformative in the test corpus. To falsify, train a transformer-based model with context length >= 4k and see if test loss continues to decrease at those positions (see the sketch below). 1/2
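A minimal sketch of that measurement, i.e. test loss as a function of token position; an HF-style model output with a `.logits` field is assumed:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def loss_by_position(model, docs):
    # docs: (n_docs, seq_len) long tensor of equal-length test documents.
    n_docs, seq_len = docs.shape
    totals = torch.zeros(seq_len - 1)
    for ids in docs:
        logits = model(ids[None, :-1]).logits              # (1, seq_len-1, vocab)
        totals += F.cross_entropy(logits[0], ids[1:], reduction="none")
    # Mean loss at each position; if hypothesis 1 holds, the curve flattens
    # past ~3k even for models with longer context.
    return totals / n_docs
```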