Mixture of Parrots: Experts improve memorization more than reasoning
Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget
Anthropic’s latest Claude AI model pulls ahead of rivals from OpenAI and Google
Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Mastering Text, Code and Math Simultaneously via Fusing Highly Specialized Language Models
MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts
Fast Inference of Mixture-of-Experts Language Models with Offloading
LoRAMoE: Revolutionizing Mixture of Experts for Maintaining World Knowledge in Language Model Alignment
SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
CodeCompose: A Large-Scale Industrial Deployment of AI-assisted Code Authoring
Bridging Discrete and Backpropagation: Straight-Through and Beyond
Scaling Expert Language Models with Unsupervised Domain Discovery
Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers
Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints
MegaBlocks: Efficient Sparse Training with Mixture-of-Experts
Who Says Elephants Can’t Run: Bringing Large Scale MoE Models into Cloud Scale Production
eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers
AutoMoE: Neural Architecture Search for Efficient Sparsely Activated Transformers
Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs
Gating Dropout: Communication-efficient Regularization for Sparsely Activated Transformers
Sparse Mixers: Combining MoE and Mixing to build a more efficient BERT
One Model, Multiple Modalities: A Sparsely Activated Approach for Text, Sound, Image, Video and Code
InCoder: A Generative Model for Code Infilling and Synthesis
WuDaoMM: A large-scale Multi-Modal Dataset for Pre-training models
ST-MoE: Designing Stable and Transferable Sparse Expert Models
DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
U.S. vs. China Rivalry Boosts Tech—and Tensions: Militarized AI threatens a new arms race
Efficient Large Scale Language Modeling with Mixtures of Experts
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
Beyond Distillation: Task-level Mixture-of-Experts (TaskMoE) for Efficient Inference
Scalable and Efficient MoE Training for Multitask Multilingual Models
Sparse-MLP: A Fully-MLP Architecture with Conditional Computation
MCL-GAN: Generative Adversarial Networks with Multiple Specialized Discriminators
CPM-2: Large-scale Cost-effective Pre-trained Language Models
Chinese AI lab challenges Google, OpenAI with a model of 1.75 trillion parameters
RetGen: A Joint framework for Retrieval and Grounded Text Generation Modeling
China’s GPT-3? BAAI Introduces Superscale Intelligence Model ‘Wu Dao 1.0’: The Beijing Academy of Artificial Intelligence (BAAI) releases Wu Dao 1.0, China’s first large-scale pretraining model.
Coordination Among Neural Modules Through a Shared Global Workspace
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
Efficient Content-Based Sparse Attention with Routing Transformers
Hard Mixtures of Experts for Large Scale Weakly Supervised Vision
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Conditional Computation in Neural Networks for faster models
Learning Factored Representations in a Deep Mixture of Experts
GTC 2021 Keynote with NVIDIA CEO Jensen Huang: NVIDIA CEO Jensen Huang delivers the #GTC21 keynote, introducing breakthroughs in building virtual worlds with NVIDIA Omniverse; advancing enterprise computing with new NVIDIA DGX systems and software; turning the data center into the new unit of computing with the new NVIDIA Grace CPU, BlueField-3 DPU, and DOCA 1.0 SDK; broadening the reach of AI to all companies and industries with NVIDIA EGX and Aerial 5G; and transforming transportation with NVIDIA DRIVE Orin and Atlan.
We ran MoE (2048E, 60L) with bfloat16 activations, with a total of 1 trillion model weights. Although trainable with manual diagnostics, with the deep 1-trillion-parameter model we encountered several trainability issues with numerical stability. Will follow up.
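The note above concerns router numerics: with low-precision activations, softmax routing can become unstable, and the ST-MoE paper cited earlier in this list addresses this by computing router logits in float32. A minimal illustrative sketch of top-1 (Switch-style) routing with that cast, assuming NumPy with float16 as a stand-in for bfloat16; the function and variable names are mine, not from any of the cited papers:

```python
import numpy as np

def switch_route(x, w_router):
    """Top-1 (Switch-style) routing sketch. Router logits are cast up to
    float32 before the softmax -- the stability measure described in ST-MoE,
    plausibly relevant to the bfloat16 issues quoted above (assumption)."""
    logits = x.astype(np.float32) @ w_router.astype(np.float32)  # [tokens, experts]
    logits -= logits.max(axis=-1, keepdims=True)                 # numerically stable softmax
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    expert = probs.argmax(axis=-1)                               # one expert per token
    gate = probs[np.arange(len(expert)), expert]                 # gate value scales expert output
    return expert, gate

# Toy usage: 8 tokens, model width 16, 4 experts (hypothetical sizes).
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16)).astype(np.float16)  # low-precision activations
w = rng.standard_normal((16, 4))
expert_ids, gates = switch_route(x, w)
```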
2021-04-12-jensenhuang-gtc2021keynote-eAn_oiZwUXA.en.vtt.txt
https://research.google/blog/learning-to-route-by-task-for-efficient-inference/
https://research.google/blog/more-efficient-in-context-learning-with-glam/
https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
https://www.reddit.com/r/LocalLLaMA/comments/18luk10/wait_llama_and_falcon_are_also_moe/
https://www.semianalysis.com/p/google-gemini-eats-the-world-gemini
https://www.sensetime.com/en/news-detail/51167731?categoryId=1072
https://www.snowflake.com/blog/arctic-open-efficient-foundation-language-models-snowflake/
https://arxiv.org/abs/2401.04088#mistral
https://arxiv.org/abs/2310.07096#ibm
https://152334h.github.io/blog/non-determinism-in-gpt-4/
https://arxiv.org/abs/2306.00008#google
https://arxiv.org/abs/2301.13310#google
Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints
https://arxiv.org/abs/2212.05055#google
MegaBlocks: Efficient Sparse Training with Mixture-of-Experts
eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers
https://arxiv.org/abs/2211.01324#nvidia
Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?
https://arxiv.org/abs/2207.10551#google
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
https://arxiv.org/abs/2206.03382#microsoft
Sparse Mixers: Combining MoE and Mixing to build a more efficient BERT
https://arxiv.org/abs/2205.12399#google
InCoder: A Generative Model for Code Infilling and Synthesis
https://arxiv.org/abs/2204.05999#facebook
WuDaoMM: A large-scale Multi-Modal Dataset for Pre-training models
https://arxiv.org/abs/2203.11480#baai
https://arxiv.org/abs/2202.09368#google
ST-MoE: Designing Stable and Transferable Sparse Expert Models
https://arxiv.org/abs/2202.08906#google
DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
https://arxiv.org/abs/2201.05596#microsoft
U.S. vs. China Rivalry Boosts Tech—and Tensions: Militarized AI threatens a new arms race
https://spectrum.ieee.org/china-us-militarized-ai
https://arxiv.org/abs/2106.05974#google
Chinese AI lab challenges Google, OpenAI with a model of 1.75 trillion parameters
https://en.pingwest.com/a/8693#baai
https://arxiv.org/abs/2104.10350#google
China’s GPT-3? BAAI Introduces Superscale Intelligence Model ‘Wu Dao 1.0’: The Beijing Academy of Artificial Intelligence (BAAI) releases Wu Dao 1.0, China’s first large-scale pretraining model.
https://syncedreview.com/2021/03/23/chinas-gpt-3-baai-introduces-superscale-intelligence-model-wu-dao-1-0/#baai
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
https://arxiv.org/abs/2101.03961#google
Efficient Content-Based Sparse Attention with Routing Transformers
https://arxiv.org/abs/2003.05997#google
/doc/ai/scaling/mixture-of-experts/2012-masoudnia.pdf