- See Also
- Links
- Wikipedia
- Miscellaneous
- Link Bibliography
See Also
Links
“Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers”, Chen et al, 2023-03-02
“Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints”, Komatsuzaki et al, 2022-12-09
“MegaBlocks: Efficient Sparse Training with Mixture-of-Experts”, Gale et al, 2022-11-29
“Who Says Elephants Can’t Run: Bringing Large Scale MoE Models into Cloud Scale Production”, Kim et al, 2022-11-18
“eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers”, Balaji et al, 2022-11-02
“AutoMoE: Neural Architecture Search for Efficient Sparsely Activated Transformers”, Jawahar et al, 2022-10-14
“A Review of Sparse Expert Models in Deep Learning”, Fedus et al, 2022-09-04
“Scaling Laws vs Model Architectures: How Does Inductive Bias Influence Scaling?”, Tay et al, 2022-07-21
“MoEC: Mixture of Expert Clusters”, 2022-07-19
“Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs”, 2022-06-09
“Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models”, Srivastava et al, 2022-06-09
“Tutel: Adaptive Mixture-of-Experts at Scale”, Hwang et al, 2022-06-07
“Gating Dropout: Communication-efficient Regularization for Sparsely Activated Transformers”, 2022-05-28
“Sparse Mixers: Combining MoE and Mixing to Build a More Efficient BERT”, Lee-Thorp & Ainslie, 2022-05-24
“One Model, Multiple Modalities: A Sparsely Activated Approach for Text, Sound, Image, Video and Code”, 2022-05-12
“WuDaoMM: A Large-scale Multi-Modal Dataset for Pre-training Models”, 2022-03-22
“Mixture-of-Experts with Expert Choice Routing”, Zhou et al, 2022-02-18
“WuDao 2.0 with Its Lead Creator, Tang Jie”, 2022-01-26
“DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale”, Rajbhandari et al, 2022-01-14
“U.S. vs. China Rivalry Boosts Tech—and Tensions: Militarized AI Threatens a New Arms Race”, Craig S. Smith, 2021-12-28
“Efficient Large Scale Language Modeling with Mixtures of Experts”, Artetxe et al, 2021-12-20
“GLaM: Efficient Scaling of Language Models with Mixture-of-Experts”, Du et al, 2021-12-13
“Beyond Distillation: Task-level Mixture-of-Experts (TaskMoE) for Efficient Inference”, Kudugunta et al, 2021-09-24
“Sparse-MLP: A Fully-MLP Architecture with Conditional Computation”, 2021-09-05
“Go Wider Instead of Deeper”, Xue et al, 2021-07-25
“CPM-2: Large-scale Cost-effective Pre-trained Language Models”, Zhang et al, 2021-06-20
“V-MoE: Scaling Vision with Sparse Mixture of Experts”, Riquelme et al, 2021-06-10
“Chinese AI Lab Challenges Google, OpenAI with a Model of 1.75 Trillion Parameters”, Chen Du, 2021-06-01
“Exploring Sparse Expert Models and Beyond”, 2021-05-31
“RetGen: A Joint Framework for Retrieval and Grounded Text Generation Modeling”, Zhang et al, 2021-05-14
“Carbon Emissions and Large Neural Network Training”, Patterson et al, 2021-04-21
“China’s GPT-3? BAAI Introduces Superscale Intelligence Model ‘Wu Dao 1.0’: The Beijing Academy of Artificial Intelligence (BAAI) Releases Wu Dao 1.0, China’s First Large-scale Pretraining Model”, Synced, 2021-03-23
“Coordination Among Neural Modules Through a Shared Global Workspace”, Goyal et al, 2021-03-01
“Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity”, Fedus et al, 2021-01-11
“GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding”, Lepikhin et al, 2020-06-30
“Efficient Content-Based Sparse Attention with Routing Transformers”, Roy et al, 2020-03-12
“One Model To Learn Them All”, Kaiser et al, 2017-06-16
“Hard Mixtures of Experts for Large Scale Weakly Supervised Vision”, Gross et al, 2017-04-20
“Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer”, Shazeer et al, 2017-01-23
“Distilling the Knowledge in a Neural Network”, Hinton et al, 2015-03-09
“Mixture of Experts: a Literature Survey”, Masoudnia & Ebrahimpour, 2012-05-12
“GTC 2021 Keynote With NVIDIA CEO Jensen Huang: NVIDIA CEO Jensen Huang Delivers the #GTC21 Keynote, Where He Introduced Amazing Breakthroughs in Building Virtual Worlds With NVIDIA Omniverse; in Advancing Enterprise Computing With New NVIDIA DGX Systems and Software; in Turning the Data Center into the New Unit of Computing With the New NVIDIA Grace CPU, BlueField-3 DPU, and DOCA 1.0 SDK; in Broadening the Reach of AI to All Companies and Industries With NVIDIA EGX and Aerial 5G; and in Transforming Transportation With NVIDIA DRIVE Orin and Atlan.”
Wikipedia
Miscellaneous
Link Bibliography
- “Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints”: https://arxiv.org/abs/2212.05055#google
- “MegaBlocks: Efficient Sparse Training with Mixture-of-Experts”, Trevor Gale, Deepak Narayanan, Cliff Young, Matei Zaharia: https://arxiv.org/abs/2211.15841
- “eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers”: https://arxiv.org/abs/2211.01324#nvidia
- “Scaling Laws vs Model Architectures: How Does Inductive Bias Influence Scaling?”: https://arxiv.org/abs/2207.10551#google
- “Tutel: Adaptive Mixture-of-Experts at Scale”: https://arxiv.org/abs/2206.03382#microsoft
- “Sparse Mixers: Combining MoE and Mixing to Build a More Efficient BERT”, James Lee-Thorp, Joshua Ainslie: https://arxiv.org/abs/2205.12399#google
- “Mixture-of-Experts with Expert Choice Routing”: https://arxiv.org/abs/2202.09368#google
- “DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale”: https://arxiv.org/abs/2201.05596#microsoft
- “U.S. vs. China Rivalry Boosts Tech—and Tensions: Militarized AI Threatens a New Arms Race”, Craig S. Smith: https://spectrum.ieee.org/china-us-militarized-ai
- “Go Wider Instead of Deeper”, Fuzhao Xue, Ziji Shi, Futao Wei, Yuxuan Lou, Yong Liu, Yang You: https://arxiv.org/abs/2107.11817
- “V-MoE: Scaling Vision with Sparse Mixture of Experts”: https://arxiv.org/abs/2106.05974#google
- “Chinese AI Lab Challenges Google, OpenAI with a Model of 1.75 Trillion Parameters”, Chen Du: https://en.pingwest.com/a/8693#baai
- “Carbon Emissions and Large Neural Network Training”: https://arxiv.org/abs/2104.10350#google
- “China’s GPT-3? BAAI Introduces Superscale Intelligence Model ‘Wu Dao 1.0’”, Synced: https://syncedreview.com/2021/03/23/chinas-gpt-3-baai-introduces-superscale-intelligence-model-wu-dao-1-0/#baai
- “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity”, William Fedus, Barret Zoph, Noam Shazeer: https://arxiv.org/abs/2101.03961#google
- “Efficient Content-Based Sparse Attention with Routing Transformers”, Aurko Roy, Mohammad Saffar, Ashish Vaswani, David Grangier: https://arxiv.org/abs/2003.05997#google
- “Mixture of Experts: a Literature Survey”, Saeed Masoudnia, Reza Ebrahimpour: 2012-masoudnia.pdf