- See Also
- Links
- “Brainformers: Trading Simplicity for Efficiency”, Zhou et al 2023
- “CodeCompose: A Large-Scale Industrial Deployment of AI-assisted Code Authoring”, Murali et al 2023
- “Sparse MoE As the New Dropout: Scaling Dense and Self-Slimmable Transformers”, Chen et al 2023
- “Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints”, Komatsuzaki et al 2022
- “MegaBlocks: Efficient Sparse Training With Mixture-of-Experts”, Gale et al 2022
- “Who Says Elephants Can’t Run: Bringing Large Scale MoE Models into Cloud Scale Production”, Kim et al 2022
- “eDiff-I: Text-to-Image Diffusion Models With an Ensemble of Expert Denoisers”, Balaji et al 2022
- “AutoMoE: Neural Architecture Search for Efficient Sparsely Activated Transformers”, Jawahar et al 2022
- “A Review of Sparse Expert Models in Deep Learning”, Fedus et al 2022
- “Scaling Laws vs Model Architectures: How Does Inductive Bias Influence Scaling?”, Tay et al 2022
- “MoEC: Mixture of Expert Clusters”, Xie et al 2022
- “Uni-Perceiver-MoE: Learning Sparse Generalist Models With Conditional MoEs”, Zhu et al 2022
- “Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models”, Srivastava et al 2022
- “Tutel: Adaptive Mixture-of-Experts at Scale”, Hwang et al 2022
- “Gating Dropout: Communication-efficient Regularization for Sparsely Activated Transformers”, Liu et al 2022
- “Sparse Mixers: Combining MoE and Mixing to Build a More Efficient BERT”, Lee-Thorp & Ainslie 2022
- “One Model, Multiple Modalities: A Sparsely Activated Approach for Text, Sound, Image, Video and Code”, Dai et al 2022
- “InCoder: A Generative Model for Code Infilling and Synthesis”, Fried et al 2022
- “WuDaoMM: A Large-scale Multi-Modal Dataset for Pre-training Models”, Yuan et al 2022
- “Mixture-of-Experts With Expert Choice Routing”, Zhou et al 2022
- “ST-MoE: Designing Stable and Transferable Sparse Expert Models”, Zoph et al 2022
- “WuDao 2.0 With Its Lead Creator, Tang Jie”, Smith et al 2022
- “DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale”, Rajbhandari et al 2022
- “U.S. vs. China Rivalry Boosts Tech—and Tensions: Militarized AI Threatens a New Arms Race”, Smith 2021
- “Efficient Large Scale Language Modeling With Mixtures of Experts”, Artetxe et al 2021
- “GLaM: Efficient Scaling of Language Models With Mixture-of-Experts”, Du et al 2021
- “Beyond Distillation: Task-level Mixture-of-Experts (TaskMoE) for Efficient Inference”, Kudugunta et al 2021
- “Scalable and Efficient MoE Training for Multitask Multilingual Models”, Kim et al 2021
- “Sparse-MLP: A Fully-MLP Architecture With Conditional Computation”, Lou et al 2021
- “Go Wider Instead of Deeper”, Xue et al 2021
- “CPM-2: Large-scale Cost-effective Pre-trained Language Models”, Zhang et al 2021
- “V-MoE: Scaling Vision With Sparse Mixture of Experts”, Riquelme et al 2021
- “Chinese AI Lab Challenges Google, OpenAI With a Model of 1.75 Trillion Parameters”, Du 2021
- “Exploring Sparse Expert Models and Beyond”, Yang et al 2021
- “RetGen: A Joint Framework for Retrieval and Grounded Text Generation Modeling”, Zhang et al 2021
- “Carbon Emissions and Large Neural Network Training”, Patterson et al 2021
- “China’s GPT-3? BAAI Introduces Superscale Intelligence Model ‘Wu Dao 1.0’: The Beijing Academy of Artificial Intelligence (BAAI) Releases Wu Dao 1.0, China’s First Large-scale Pretraining Model.”, Synced 2021
- “Coordination Among Neural Modules Through a Shared Global Workspace”, Goyal et al 2021
- “Switch Transformers: Scaling to Trillion Parameter Models With Simple and Efficient Sparsity”, Fedus et al 2021
- “GShard: Scaling Giant Models With Conditional Computation and Automatic Sharding”, Lepikhin et al 2020
- “Efficient Content-Based Sparse Attention With Routing Transformers”, Roy et al 2020
- “One Model To Learn Them All”, Kaiser et al 2017
- “Hard Mixtures of Experts for Large Scale Weakly Supervised Vision”, Gross et al 2017
- “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer”, Shazeer et al 2017
- “Distilling the Knowledge in a Neural Network”, Hinton et al 2015
- “Mixture of Experts: a Literature Survey”, Masoudnia & Ebrahimpour 2012
- “GTC 2021 Keynote With NVIDIA CEO Jensen Huang: NVIDIA CEO Jensen Huang Delivers the #GTC21 Keynote, Where He Introduced Amazing Breakthroughs in Building Virtual Worlds With NVIDIA Omniverse; in Advancing Enterprise Computing With New NVIDIA DGX Systems and Software; in Turning the Data Center into the New Unit of Computing With the New NVIDIA Grace CPU, BlueField-3 DPU, and DOCA 1.0 SDK; in Broadening the Reach of AI to All Companies and Industries With NVIDIA EGX and Aerial 5G; and in Transforming Transportation With NVIDIA DRIVE Orin and Atlan.”
- Sort By Magic
- Wikipedia
- Miscellaneous
- Link Bibliography
See Also
Links
“Brainformers: Trading Simplicity for Efficiency”, Zhou et al 2023
“CodeCompose: A Large-Scale Industrial Deployment of AI-assisted Code Authoring”, Murali et al 2023
“Sparse MoE As the New Dropout: Scaling Dense and Self-Slimmable Transformers”, Chen et al 2023
“Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints”, Komatsuzaki et al 2022
“MegaBlocks: Efficient Sparse Training With Mixture-of-Experts”, Gale et al 2022
“Who Says Elephants Can’t Run: Bringing Large Scale MoE Models into Cloud Scale Production”, Kim et al 2022
“eDiff-I: Text-to-Image Diffusion Models With an Ensemble of Expert Denoisers”, Balaji et al 2022
“AutoMoE: Neural Architecture Search for Efficient Sparsely Activated Transformers”, Jawahar et al 2022
“A Review of Sparse Expert Models in Deep Learning”, Fedus et al 2022
“Scaling Laws vs Model Architectures: How Does Inductive Bias Influence Scaling?”, Tay et al 2022
“MoEC: Mixture of Expert Clusters”, Xie et al 2022
“Uni-Perceiver-MoE: Learning Sparse Generalist Models With Conditional MoEs”, Zhu et al 2022
“Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models”, Srivastava et al 2022
“Tutel: Adaptive Mixture-of-Experts at Scale”, Hwang et al 2022
“Gating Dropout: Communication-efficient Regularization for Sparsely Activated Transformers”, Liu et al 2022
“Sparse Mixers: Combining MoE and Mixing to Build a More Efficient BERT”, Lee-Thorp & Ainslie 2022
“One Model, Multiple Modalities: A Sparsely Activated Approach for Text, Sound, Image, Video and Code”, Dai et al 2022
“InCoder: A Generative Model for Code Infilling and Synthesis”, Fried et al 2022
“WuDaoMM: A Large-scale Multi-Modal Dataset for Pre-training Models”, Yuan et al 2022
“Mixture-of-Experts With Expert Choice Routing”, Zhou et al 2022
“ST-MoE: Designing Stable and Transferable Sparse Expert Models”, Zoph et al 2022
“WuDao 2.0 With Its Lead Creator, Tang Jie”, Smith et al 2022
“DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale”, Rajbhandari et al 2022
“U.S. vs. China Rivalry Boosts Tech—and Tensions: Militarized AI Threatens a New Arms Race”, Smith 2021
“Efficient Large Scale Language Modeling With Mixtures of Experts”, Artetxe et al 2021
“GLaM: Efficient Scaling of Language Models With Mixture-of-Experts”, Du et al 2021
“Beyond Distillation: Task-level Mixture-of-Experts (TaskMoE) for Efficient Inference”, Kudugunta et al 2021
“Scalable and Efficient MoE Training for Multitask Multilingual Models”, Kim et al 2021
“Sparse-MLP: A Fully-MLP Architecture With Conditional Computation”, Lou et al 2021
“Go Wider Instead of Deeper”, Xue et al 2021
“CPM-2: Large-scale Cost-effective Pre-trained Language Models”, Zhang et al 2021
“V-MoE: Scaling Vision With Sparse Mixture of Experts”, Riquelme et al 2021
“Chinese AI Lab Challenges Google, OpenAI With a Model of 1.75 Trillion Parameters”, Du 2021
“Exploring Sparse Expert Models and Beyond”, Yang et al 2021
“RetGen: A Joint Framework for Retrieval and Grounded Text Generation Modeling”, Zhang et al 2021
“Carbon Emissions and Large Neural Network Training”, Patterson et al 2021
“China’s GPT-3? BAAI Introduces Superscale Intelligence Model ‘Wu Dao 1.0’: The Beijing Academy of Artificial Intelligence (BAAI) Releases Wu Dao 1.0, China’s First Large-scale Pretraining Model.”, Synced 2021
“Coordination Among Neural Modules Through a Shared Global Workspace”, Goyal et al 2021
“Switch Transformers: Scaling to Trillion Parameter Models With Simple and Efficient Sparsity”, Fedus et al 2021
“GShard: Scaling Giant Models With Conditional Computation and Automatic Sharding”, Lepikhin et al 2020
“Efficient Content-Based Sparse Attention With Routing Transformers”, Roy et al 2020
“One Model To Learn Them All”, Kaiser et al 2017
“Hard Mixtures of Experts for Large Scale Weakly Supervised Vision”, Gross et al 2017
“Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer”, Shazeer et al 2017
“Distilling the Knowledge in a Neural Network”, Hinton et al 2015
“Mixture of Experts: a Literature Survey”, Masoudnia & Ebrahimpour 2012
“GTC 2021 Keynote With NVIDIA CEO Jensen Huang: NVIDIA CEO Jensen Huang Delivers the #GTC21 Keynote, Where He Introduced Amazing Breakthroughs in Building Virtual Worlds With NVIDIA Omniverse; in Advancing Enterprise Computing With New NVIDIA DGX Systems and Software; in Turning the Data Center into the New Unit of Computing With the New NVIDIA Grace CPU, BlueField-3 DPU, and DOCA 1.0 SDK; in Broadening the Reach of AI to All Companies and Industries With NVIDIA EGX and Aerial 5G; and in Transforming Transportation With NVIDIA DRIVE Orin and Atlan.”
Sort By Magic
Annotations are sorted by machine learning into inferred 'tags', providing an alternative way to browse: instead of by date order, one can browse in topic order. The 'sorted' list has been automatically clustered into multiple sections & auto-labeled for easier browsing.
Beginning with the newest annotation, the sorter uses each annotation's embedding to find its nearest-neighbor annotations, producing a progression of topics. For more details, see the link. (A minimal illustrative sketch of this embedding-based sort follows the tag list below.)
expert-models
sparse-moe
ai-arms-race
expert-scaling
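As an illustration of the embedding-and-nearest-neighbor sort just described, here is a minimal Python sketch. The embedding model, clustering library, cluster count, and example annotation list are assumptions chosen for the example, not the site's actual implementation.

```python
# Sketch of embedding-based "sort by magic": embed each annotation, greedily
# walk from the newest item to its nearest unvisited neighbor (a progression
# of topics), then cluster the embeddings into sections to be auto-labeled.
# The embedder and cluster count below are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer  # assumed embedding library

annotations = [  # newest first
    "Brainformers: Trading Simplicity for Efficiency",
    "Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints",
    "U.S. vs. China Rivalry Boosts Tech and Tensions",
    "Switch Transformers: Scaling to Trillion Parameter Models",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = np.asarray(model.encode(annotations))  # (n, d) embedding matrix

# Greedy nearest-neighbor ordering, starting from the newest annotation.
order, remaining = [0], set(range(1, len(annotations)))
while remaining:
    last = emb[order[-1]]
    nearest = min(remaining, key=lambda i: np.linalg.norm(emb[i] - last))
    order.append(nearest)
    remaining.remove(nearest)

# Cluster the same embeddings into sections; labeling each cluster (e.g.
# 'sparse-moe') would be a separate step, such as naming it after its most
# central annotation.
sections = KMeans(n_clusters=2, n_init=10).fit_predict(emb)
for i in order:
    print(sections[i], annotations[i])
```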
Wikipedia
Miscellaneous
- /doc/ai/scaling/mixture-of-experts/2021-04-12-jensenhuang-gtc2021keynote-eAn_oiZwUXA.en.vtt.txt
- https://blog.research.google/2021/12/more-efficient-in-context-learning-with.html
- https://blog.research.google/2022/01/learning-to-route-by-task-for-efficient.html
- https://twitter.com/soumithchintala/status/1671267150101721090
Link Bibliography
- https://arxiv.org/abs/2306.00008#google : “Brainformers: Trading Simplicity for Efficiency”, Zhou et al 2023
- https://arxiv.org/abs/2212.05055#google : “Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints”, Komatsuzaki et al 2022
- https://arxiv.org/abs/2211.15841 : “MegaBlocks: Efficient Sparse Training With Mixture-of-Experts”, Trevor Gale, Deepak Narayanan, Cliff Young, Matei Zaharia
- https://arxiv.org/abs/2211.01324#nvidia : “eDiff-I: Text-to-Image Diffusion Models With an Ensemble of Expert Denoisers”, Balaji et al 2022
- https://arxiv.org/abs/2207.10551#google : “Scaling Laws vs Model Architectures: How Does Inductive Bias Influence Scaling?”, Tay et al 2022
- https://arxiv.org/abs/2206.03382#microsoft : “Tutel: Adaptive Mixture-of-Experts at Scale”, Hwang et al 2022
- https://arxiv.org/abs/2205.12399#google : “Sparse Mixers: Combining MoE and Mixing to Build a More Efficient BERT”, James Lee-Thorp, Joshua Ainslie
- https://arxiv.org/abs/2204.05999#facebook : “InCoder: A Generative Model for Code Infilling and Synthesis”, Fried et al 2022
- https://arxiv.org/abs/2202.09368#google : “Mixture-of-Experts With Expert Choice Routing”, Zhou et al 2022
- https://arxiv.org/abs/2202.08906#google : “ST-MoE: Designing Stable and Transferable Sparse Expert Models”, Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, William Fedus
- https://arxiv.org/abs/2201.05596#microsoft : “DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale”, Rajbhandari et al 2022
- https://spectrum.ieee.org/china-us-militarized-ai : “U.S. vs. China Rivalry Boosts Tech—and Tensions: Militarized AI Threatens a New Arms Race”, Craig S. Smith
- https://arxiv.org/abs/2107.11817 : “Go Wider Instead of Deeper”, Fuzhao Xue, Ziji Shi, Futao Wei, Yuxuan Lou, Yong Liu, Yang You
- https://arxiv.org/abs/2106.05974#google : “V-MoE: Scaling Vision With Sparse Mixture of Experts”, Riquelme et al 2021
- https://en.pingwest.com/a/8693#baai : “Chinese AI Lab Challenges Google, OpenAI With a Model of 1.75 Trillion Parameters”, Chen Du
- https://arxiv.org/abs/2104.10350#google : “Carbon Emissions and Large Neural Network Training”, Patterson et al 2021
- https://syncedreview.com/2021/03/23/chinas-gpt-3-baai-introduces-superscale-intelligence-model-wu-dao-1-0/#baai : “China’s GPT-3? BAAI Introduces Superscale Intelligence Model ‘Wu Dao 1.0’: The Beijing Academy of Artificial Intelligence (BAAI) Releases Wu Dao 1.0, China’s First Large-scale Pretraining Model.”, Synced
- https://arxiv.org/abs/2101.03961#google : “Switch Transformers: Scaling to Trillion Parameter Models With Simple and Efficient Sparsity”, William Fedus, Barret Zoph, Noam Shazeer
- https://arxiv.org/abs/2003.05997#google : “Efficient Content-Based Sparse Attention With Routing Transformers”, Aurko Roy, Mohammad Saffar, Ashish Vaswani, David Grangier
- 2012-masoudnia.pdf : “Mixture of Experts: a Literature Survey”, Saeed Masoudnia, Reza Ebrahimpour