- See Also
- Links
- Sort By Magic
- Wikipedia
- Miscellaneous
- Bibliography
See Also
Links
“Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget”, Sehwag et al 2024
“Anthropic’s Latest Claude AI Model Pulls ahead of Rivals from OpenAI and Google”, Knight 2024
“JetMoE: Reaching LLaMA-2 Performance With 0.1M Dollars”, Shen et al 2024
“Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws”, Allen-Zhu & Li 2024
“Mixture-Of-Depths: Dynamically Allocating Compute in Transformer-Based Language Models”, Raposo et al 2024
“MM1: Methods, Analysis & Insights from Multimodal LLM Pre-Training”, McKinzie et al 2024
“Mastering Text, Code and Math Simultaneously via Fusing Highly Specialized Language Models”, Ding et al 2024
“MoE-Mamba: Efficient Selective State Space Models With Mixture of Experts”, Pióro et al 2024
“Mixtral of Experts”, Jiang et al 2024
“Fast Inference of Mixture-Of-Experts Language Models With Offloading”, Eliseev & Mazur 2023
“LoRAMoE: Revolutionizing Mixture of Experts for Maintaining World Knowledge in Language Model Alignment”, Dou et al 2023
“SwitchHead: Accelerating Transformers With Mixture-Of-Experts Attention”, Csordás et al 2023
“Exponentially Faster Language Modeling”, Belcak & Wattenhofer 2023
“Sparse Universal Transformer”, Tan et al 2023
“Fast Feedforward Networks”, Belcak & Wattenhofer 2023
“Non-Determinism in GPT-4 Is Caused by Sparse MoE”, 152334H 2023
“From Sparse to Soft Mixtures of Experts”, Puigcerver et al 2023
“Brainformers: Trading Simplicity for Efficiency”, Zhou et al 2023
“CodeCompose: A Large-Scale Industrial Deployment of AI-Assisted Code Authoring”, Murali et al 2023
“Bridging Discrete and Backpropagation: Straight-Through and Beyond”, Liu et al 2023
“Scaling Expert Language Models With Unsupervised Domain Discovery”, Gururangan et al 2023
“Sparse MoE As the New Dropout: Scaling Dense and Self-Slimmable Transformers”, Chen et al 2023
“AltUp: Alternating Updates for Efficient Transformers”, Baykal et al 2023
“Sparse Upcycling: Training Mixture-Of-Experts from Dense Checkpoints”, Komatsuzaki et al 2022
“MegaBlocks: Efficient Sparse Training With Mixture-Of-Experts”, Gale et al 2022
“Who Says Elephants Can’t Run: Bringing Large Scale MoE Models into Cloud Scale Production”, Kim et al 2022
“EDiff-I: Text-To-Image Diffusion Models With an Ensemble of Expert Denoisers”, Balaji et al 2022
“AutoMoE: Neural Architecture Search for Efficient Sparsely Activated Transformers”, Jawahar et al 2022
“A Review of Sparse Expert Models in Deep Learning”, Fedus et al 2022
“Scaling Laws vs Model Architectures: How Does Inductive Bias Influence Scaling?”, Tay et al 2022
“MoEC: Mixture of Expert Clusters”, Xie et al 2022
“Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models”, Srivastava et al 2022
“Uni-Perceiver-MoE: Learning Sparse Generalist Models With Conditional MoEs”, Zhu et al 2022
“Tutel: Adaptive Mixture-Of-Experts at Scale”, Hwang et al 2022
“Gating Dropout: Communication-Efficient Regularization for Sparsely Activated Transformers”, Liu et al 2022
“Sparse Mixers: Combining MoE and Mixing to Build a More Efficient BERT”, Lee-Thorp & Ainslie 2022
“One Model, Multiple Modalities: A Sparsely Activated Approach for Text, Sound, Image, Video and Code”, Dai et al 2022
“InCoder: A Generative Model for Code Infilling and Synthesis”, Fried et al 2022
“WuDaoMM: A Large-Scale Multi-Modal Dataset for Pre-Training Models”, Yuan et al 2022
“Efficient Language Modeling With Sparse All-MLP”, Yu et al 2022
“Mixture-Of-Experts With Expert Choice Routing”, Zhou et al 2022
“ST-MoE: Designing Stable and Transferable Sparse Expert Models”, Zoph et al 2022
“WuDao 2.0 With Its Lead Creator, Tang Jie”, Smith et al 2022
“DeepSpeed-MoE: Advancing Mixture-Of-Experts Inference and Training to Power Next-Generation AI Scale”, Rajbhandari et al 2022
“U.S. vs. China Rivalry Boosts Tech—And Tensions: Militarized AI Threatens a New Arms Race”, Smith 2021
“Efficient Large Scale Language Modeling With Mixtures of Experts”, Artetxe et al 2021
“GLaM: Efficient Scaling of Language Models With Mixture-Of-Experts”, Du et al 2021
“Beyond Distillation: Task-Level Mixture-Of-Experts (TaskMoE) for Efficient Inference”, Kudugunta et al 2021
“Scalable and Efficient MoE Training for Multitask Multilingual Models”, Kim et al 2021
“Sparse-MLP: A Fully-MLP Architecture With Conditional Computation”, Lou et al 2021
“Go Wider Instead of Deeper”, Xue et al 2021
“MCL-GAN: Generative Adversarial Networks With Multiple Specialized Discriminators”, Choi & Han 2021
“CPM-2: Large-Scale Cost-Effective Pre-Trained Language Models”, Zhang et al 2021
“V-MoE: Scaling Vision With Sparse Mixture of Experts”, Riquelme et al 2021
“Chinese AI Lab Challenges Google, OpenAI With a Model of 1.75 Trillion Parameters”, Du 2021
“Exploring Sparse Expert Models and Beyond”, Yang et al 2021
“RetGen: A Joint Framework for Retrieval and Grounded Text Generation Modeling”, Zhang et al 2021
“Carbon Emissions and Large Neural Network Training”, Patterson et al 2021
“China’s GPT-3? BAAI Introduces Superscale Intelligence Model ‘Wu Dao 1.0’: The Beijing Academy of Artificial Intelligence (BAAI) Releases Wu Dao 1.0, China’s First Large-Scale Pretraining Model.”, Synced 2021
“Coordination Among Neural Modules Through a Shared Global Workspace”, Goyal et al 2021
“Switch Transformers: Scaling to Trillion Parameter Models With Simple and Efficient Sparsity”, Fedus et al 2021
“GShard: Scaling Giant Models With Conditional Computation and Automatic Sharding”, Lepikhin et al 2020
“Efficient Content-Based Sparse Attention With Routing Transformers”, Roy et al 2020
“One Model To Learn Them All”, Kaiser et al 2017
“Hard Mixtures of Experts for Large Scale Weakly Supervised Vision”, Gross et al 2017
“Outrageously Large Neural Networks: The Sparsely-Gated Mixture-Of-Experts Layer”, Shazeer et al 2017
“Conditional Computation in Neural Networks for Faster Models”, Bengio et al 2015
“Distilling the Knowledge in a Neural Network”, Hinton et al 2015
“Learning Factored Representations in a Deep Mixture of Experts”, Eigen et al 2013
“Mixture of Experts: a Literature Survey”, Masoudnia & Ebrahimpour 2012
“Introduction to CPM”
“GTC Spring 2021 Keynote With NVIDIA CEO Jensen Huang”
“GTC 2021 Keynote With NVIDIA CEO Jensen Huang: NVIDIA CEO Jensen Huang Delivers the #GTC21 Keynote, Where He Introduced Amazing Breakthroughs in Building Virtual Worlds With NVIDIA Omniverse; in Advancing Enterprise Computing With New NVIDIA DGX Systems and Software; in Turning the Data Center into the New Unit of Computing With the New NVIDIA Grace CPU, BlueField-3 DPU, and DOCA 1.0 SDK; in Broadening the Reach of AI to All Companies and Industries With NVIDIA EGX and Aerial 5G; and in Transforming Transportation With NVIDIA DRIVE Orin and Atlan.”
lepikhin
Sort By Magic
Annotations sorted by machine learning into inferred 'tags'. This provides an alternative way to browse: instead of by date order, one can browse in topic order. The 'sorted' list has been automatically clustered into multiple sections & auto-labeled for easier browsing.
Beginning with the newest annotation, it uses the embedding of each annotation to attempt to create a list of nearest-neighbor annotations, creating a progression of topics. For more details, see the link.
- specialized-models
- adaptive-experts multimodal scaling expert-systems sparsity generalized-models
- mixture-experts efficient training scalable inference efficient-transformers sparse-routing
- sparse-experts
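The nearest-neighbor ordering described above amounts to a greedy chain over annotation embeddings: start from the newest annotation, then repeatedly append the most similar annotation not yet used. The sketch below illustrates that idea only; it is not the site's actual implementation, and the function name `sort_by_similarity`, the use of cosine similarity, and the precomputed `embeddings` argument are assumptions.

```python
# Illustrative sketch of embedding-based "sort by magic" (assumptions noted above):
# greedily chain each annotation to its nearest unused neighbor in embedding space.
import numpy as np

def sort_by_similarity(annotations, embeddings):
    """Order `annotations` so each one is followed by its nearest unused
    neighbor, beginning with the newest annotation (assumed to be index 0)."""
    emb = np.asarray(embeddings, dtype=float)
    emb /= np.linalg.norm(emb, axis=1, keepdims=True)   # unit vectors -> dot product = cosine similarity
    order = [0]                                          # start the chain at the newest annotation
    remaining = set(range(1, len(annotations)))
    while remaining:
        last = emb[order[-1]]
        nearest = max(remaining, key=lambda i: float(emb[i] @ last))
        order.append(nearest)
        remaining.remove(nearest)
    return [annotations[i] for i in order]
```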
Wikipedia
Miscellaneous
- /doc/ai/scaling/mixture-of-experts/2021-04-12-jensenhuang-gtc2021keynote-eAn_oiZwUXA.en.vtt.txt
- https://research.google/blog/learning-to-route-by-task-for-efficient-inference/
- https://research.google/blog/more-efficient-in-context-learning-with-glam/
- https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
- https://www.reddit.com/r/LocalLLaMA/comments/18luk10/wait_llama_and_falcon_are_also_moe/
- https://www.semianalysis.com/p/google-gemini-eats-the-world-gemini
- https://www.sensetime.com/en/news-detail/51167731?categoryId=1072
- https://www.snowflake.com/blog/arctic-open-efficient-foundation-language-models-snowflake/
Bibliography
- https://arxiv.org/abs/2407.15811: “Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget”, Sehwag et al 2024
- https://arxiv.org/abs/2401.04088#mistral: “Mixtral of Experts”, Jiang et al 2024
- https://arxiv.org/abs/2311.10770: “Exponentially Faster Language Modeling”, Belcak & Wattenhofer 2023
- https://arxiv.org/abs/2310.07096#ibm: “Sparse Universal Transformer”, Tan et al 2023
- https://arxiv.org/abs/2308.14711: “Fast Feedforward Networks”, Belcak & Wattenhofer 2023
- https://152334h.github.io/blog/non-determinism-in-gpt-4/: “Non-Determinism in GPT-4 Is Caused by Sparse MoE”, 152334H 2023
- https://arxiv.org/abs/2306.00008#google: “Brainformers: Trading Simplicity for Efficiency”, Zhou et al 2023
- https://arxiv.org/abs/2301.13310#google: “AltUp: Alternating Updates for Efficient Transformers”, Baykal et al 2023
- https://arxiv.org/abs/2212.05055#google: “Sparse Upcycling: Training Mixture-Of-Experts from Dense Checkpoints”, Komatsuzaki et al 2022
- https://arxiv.org/abs/2211.15841: “MegaBlocks: Efficient Sparse Training With Mixture-Of-Experts”, Gale et al 2022
- https://arxiv.org/abs/2211.01324#nvidia: “EDiff-I: Text-To-Image Diffusion Models With an Ensemble of Expert Denoisers”, Balaji et al 2022
- https://arxiv.org/abs/2207.10551#google: “Scaling Laws vs Model Architectures: How Does Inductive Bias Influence Scaling?”, Tay et al 2022
- https://arxiv.org/abs/2206.04615: “Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models”, Srivastava et al 2022
- https://arxiv.org/abs/2206.03382#microsoft: “Tutel: Adaptive Mixture-Of-Experts at Scale”, Hwang et al 2022
- https://arxiv.org/abs/2205.12399#google: “Sparse Mixers: Combining MoE and Mixing to Build a More Efficient BERT”, Lee-Thorp & Ainslie 2022
- https://arxiv.org/abs/2204.05999#facebook: “InCoder: A Generative Model for Code Infilling and Synthesis”, Fried et al 2022
- https://arxiv.org/abs/2203.11480#baai: “WuDaoMM: A Large-Scale Multi-Modal Dataset for Pre-Training Models”, Yuan et al 2022
- https://arxiv.org/abs/2203.06850: “Efficient Language Modeling With Sparse All-MLP”, Yu et al 2022
- https://arxiv.org/abs/2202.09368#google: “Mixture-Of-Experts With Expert Choice Routing”, Zhou et al 2022
- https://arxiv.org/abs/2202.08906#google: “ST-MoE: Designing Stable and Transferable Sparse Expert Models”, Zoph et al 2022
- https://arxiv.org/abs/2201.05596#microsoft: “DeepSpeed-MoE: Advancing Mixture-Of-Experts Inference and Training to Power Next-Generation AI Scale”, Rajbhandari et al 2022
- https://spectrum.ieee.org/china-us-militarized-ai: “U.S. vs. China Rivalry Boosts Tech—And Tensions: Militarized AI Threatens a New Arms Race”, Smith 2021
- https://arxiv.org/abs/2107.11817: “Go Wider Instead of Deeper”, Xue et al 2021
- https://arxiv.org/abs/2106.05974#google: “V-MoE: Scaling Vision With Sparse Mixture of Experts”, Riquelme et al 2021
- https://en.pingwest.com/a/8693#baai: “Chinese AI Lab Challenges Google, OpenAI With a Model of 1.75 Trillion Parameters”, Du 2021
- https://arxiv.org/abs/2104.10350#google: “Carbon Emissions and Large Neural Network Training”, Patterson et al 2021
- https://syncedreview.com/2021/03/23/chinas-gpt-3-baai-introduces-superscale-intelligence-model-wu-dao-1-0/#baai: “China’s GPT-3? BAAI Introduces Superscale Intelligence Model ‘Wu Dao 1.0’: The Beijing Academy of Artificial Intelligence (BAAI) Releases Wu Dao 1.0, China’s First Large-Scale Pretraining Model.”, Synced 2021
- https://arxiv.org/abs/2101.03961#google: “Switch Transformers: Scaling to Trillion Parameter Models With Simple and Efficient Sparsity”, Fedus et al 2021
- https://arxiv.org/abs/2003.05997#google: “Efficient Content-Based Sparse Attention With Routing Transformers”, Roy et al 2020
- 2012-masoudnia.pdf: “Mixture of Experts: a Literature Survey”, Masoudnia & Ebrahimpour 2012