Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Attention Is All You Need
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
MMLU: Measuring Massive Multitask Language Understanding
MegatronLM: Training Billion+ Parameter Language Models Using GPU Model Parallelism