Bibliography:

  1. ‘AI scaling’ tag

  2. ‘NN sparsity’ tag

  3. Mixture of Parrots: Experts improve memorization more than reasoning

  4. Upcycling Large Language Models into Mixture of Experts

  5. Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget (https://arxiv.org/abs/2407.15811)

  6. Anthropic’s latest Claude AI model pulls ahead of rivals from OpenAI and Google

  7. JetMoE: Reaching LLaMA-2 Performance with 0.1M Dollars

  8. Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws

  9. Mixture-of-Depths: Dynamically allocating compute in transformer-based language models

  10. MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

  11. Mastering Text, Code and Math Simultaneously via Fusing Highly Specialized Language Models

  12. MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts

  13. Mixtral of Experts (https://arxiv.org/abs/2401.04088#mistral)

  14. Fast Inference of Mixture-of-Experts Language Models with Offloading

  15. LoRAMoE: Revolutionizing Mixture of Experts for Maintaining World Knowledge in Language Model Alignment

  16. SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention

  17. Exponentially Faster Language Modeling (https://arxiv.org/abs/2311.10770)

  18. Sparse Universal Transformer (https://arxiv.org/abs/2310.07096#ibm)

  19. Fast Feedforward Networks (https://arxiv.org/abs/2308.14711)

  20. Non-determinism in GPT-4 is caused by Sparse MoE (https://152334h.github.io/blog/non-determinism-in-gpt-4/)

  21. From Sparse to Soft Mixtures of Experts

  22. Brainformers: Trading Simplicity for Efficiency (https://arxiv.org/abs/2306.00008#google)

  23. CodeCompose: A Large-Scale Industrial Deployment of AI-assisted Code Authoring

  24. Bridging Discrete and Backpropagation: Straight-Through and Beyond

  25. Scaling Expert Language Models with Unsupervised Domain Discovery

  26. Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers

  27. AltUp: Alternating Updates for Efficient Transformers (https://arxiv.org/abs/2301.13310#google)

  28. Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints (https://arxiv.org/abs/2212.05055#google)

  29. MegaBlocks: Efficient Sparse Training with Mixture-of-Experts (https://arxiv.org/abs/2211.15841)

  30. Who Says Elephants Can’t Run: Bringing Large Scale MoE Models into Cloud Scale Production

  31. eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers (https://arxiv.org/abs/2211.01324#nvidia)

  32. AutoMoE: Neural Architecture Search for Efficient Sparsely Activated Transformers

  33. A Review of Sparse Expert Models in Deep Learning

  34. Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? (https://arxiv.org/abs/2207.10551#google)

  35. MoEC: Mixture of Expert Clusters

  36. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models (https://arxiv.org/abs/2206.04615)

  37. Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs

  38. Tutel: Adaptive Mixture-of-Experts at Scale (https://arxiv.org/abs/2206.03382#microsoft)

  39. Gating Dropout: Communication-efficient Regularization for Sparsely Activated Transformers

  40. Sparse Mixers: Combining MoE and Mixing to build a more efficient BERT (https://arxiv.org/abs/2205.12399#google)

  41. One Model, Multiple Modalities: A Sparsely Activated Approach for Text, Sound, Image, Video and Code

  42. InCoder: A Generative Model for Code Infilling and Synthesis (https://arxiv.org/abs/2204.05999#facebook)

  43. WuDaoMM: A large-scale Multi-Modal Dataset for Pre-training models (https://arxiv.org/abs/2203.11480#baai)

  44. Efficient Language Modeling with Sparse All-MLP (https://arxiv.org/abs/2203.06850)

  45. Mixture-of-Experts with Expert Choice Routing (https://arxiv.org/abs/2202.09368#google)

  46. ST-MoE: Designing Stable and Transferable Sparse Expert Models (https://arxiv.org/abs/2202.08906#google)

  47. WuDao 2.0 With Its Lead Creator, Tang Jie

  48. DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale (https://arxiv.org/abs/2201.05596#microsoft)

  49. U.S. vs. China Rivalry Boosts Tech—and Tensions: Militarized AI threatens a new arms race (https://spectrum.ieee.org/china-us-militarized-ai)

  50. Efficient Large Scale Language Modeling with Mixtures of Experts

  51. GLaM: Efficient Scaling of Language Models with Mixture-of-Experts

  52. Beyond Distillation: Task-level Mixture-of-Experts (TaskMoE) for Efficient Inference

  53. Scalable and Efficient MoE Training for Multitask Multilingual Models

  54. Sparse-MLP: A Fully-MLP Architecture with Conditional Computation

  55. Go Wider Instead of Deeper (https://arxiv.org/abs/2107.11817)

  56. MCL-GAN: Generative Adversarial Networks with Multiple Specialized Discriminators

  57. CPM-2: Large-scale Cost-effective Pre-trained Language Models

  58. V-MoE: Scaling Vision with Sparse Mixture of Experts (https://arxiv.org/abs/2106.05974#google)

  59. Chinese AI lab challenges Google, OpenAI with a model of 1.75 trillion parameters (https://en.pingwest.com/a/8693#baai)

  60. Exploring Sparse Expert Models and Beyond

  61. RetGen: A Joint Framework for Retrieval and Grounded Text Generation Modeling

  62. Carbon Emissions and Large Neural Network Training (https://arxiv.org/abs/2104.10350#google)

  63. China’s GPT-3? BAAI Introduces Superscale Intelligence Model ‘Wu Dao 1.0’: The Beijing Academy of Artificial Intelligence (BAAI) releases Wu Dao 1.0, China’s first large-scale pretraining model. (https://syncedreview.com/2021/03/23/chinas-gpt-3-baai-introduces-superscale-intelligence-model-wu-dao-1-0/#baai)

  64. Coordination Among Neural Modules Through a Shared Global Workspace

  65. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (https://arxiv.org/abs/2101.03961#google)

  66. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

  67. Efficient Content-Based Sparse Attention with Routing Transformers (https://arxiv.org/abs/2003.05997#google)

  68. One Model To Learn Them All

  69. Hard Mixtures of Experts for Large Scale Weakly Supervised Vision

  70. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

  71. Conditional Computation in Neural Networks for faster models

  72. Distilling the Knowledge in a Neural Network

  73. Learning Factored Representations in a Deep Mixture of Experts

  74. Mixture of experts: a literature survey (/doc/ai/scaling/mixture-of-experts/2012-masoudnia.pdf)

  75. Introduction to CPM

  76. GTC Spring 2021 Keynote With NVIDIA CEO Jensen Huang

  77. GTC 2021 Keynote With NVIDIA CEO Jensen Huang: NVIDIA CEO Jensen Huang Delivers the #GTC21 Keynote, Where He Introduced Amazing Breakthroughs in Building Virtual Worlds With NVIDIA Omniverse; in Advancing Enterprise Computing With New NVIDIA DGX Systems and Software; in Turning the Data Center into the New Unit of Computing With the New NVIDIA Grace CPU, BlueField-3 DPU, and DOCA 1.0 SDK; in Broadening the Reach of AI to All Companies and Industries With NVIDIA EGX and Aerial 5G; and in Transforming Transportation With NVIDIA DRIVE Orin and Atlan.

  78. We ran MoE (2048E, 60L) with bfloat16 activations with total of 1 trillion model weights. Although trainable with manual diagnostics, with deep 1 trillion model we encountered several trainability issues with numerical stability. Will follow up.

  79. 2021-04-12-jensenhuang-gtc2021keynote-eAn_oiZwUXA.en.vtt.txt

  80. https://euclaise.xyz/vq-is-mlp

  81. https://patents.google.com/patent/US20230419079A1/en

  82. https://research.google/blog/learning-to-route-by-task-for-efficient-inference/

  83. https://research.google/blog/more-efficient-in-context-learning-with-glam/

  84. https://www.ai21.com/blog/announcing-jamba

  85. https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm

  86. https://www.reddit.com/r/LocalLLaMA/comments/18luk10/wait_llama_and_falcon_are_also_moe/

  87. https://www.semianalysis.com/p/google-gemini-eats-the-world-gemini

  88. https://www.sensetime.com/en/news-detail/51167731?categoryId=1072

  89. https://www.snowflake.com/blog/arctic-open-efficient-foundation-language-models-snowflake/

  90. https://x.ai/blog/grok-os

  91. https://x.com/soumithchintala/status/1671267150101721090

  99. Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget

  100. https%253A%252F%252Farxiv.org%252Fabs%252F2407.15811.html

  101. Mixtral of Experts

  102. Teven Le Scao

  103. Thomas Wang

  104. https%253A%252F%252Farxiv.org%252Fabs%252F2401.04088%2523mistral.html

  105. Exponentially Faster Language Modeling

  106. https%253A%252F%252Farxiv.org%252Fabs%252F2311.10770.html

  107. Sparse Universal Transformer

  108. Aaron Courville

  109. https%253A%252F%252Farxiv.org%252Fabs%252F2310.07096%2523ibm.html

  110. Fast Feedforward Networks

  111. https%253A%252F%252Farxiv.org%252Fabs%252F2308.14711.html

  112. Non-determinism in GPT-4 is caused by Sparse MoE

  113. https%253A%252F%252F152334h.github.io%252Fblog%252Fnon-determinism-in-gpt-4%252F.html

  114. Brainformers: Trading Simplicity for Efficiency

  115. https%253A%252F%252Farxiv.org%252Fabs%252F2306.00008%2523google.html

  116. AltUp: Alternating Updates for Efficient Transformers

  117. https%253A%252F%252Farxiv.org%252Fabs%252F2301.13310%2523google.html

  118. Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints

  119. Yi Tay

  120. Neil Houlsby

  121. https%253A%252F%252Farxiv.org%252Fabs%252F2212.05055%2523google.html

  122. MegaBlocks: Efficient Sparse Training with Mixture-of-Experts

  123. https%253A%252F%252Farxiv.org%252Fabs%252F2211.15841.html

  124. eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

  125. https%253A%252F%252Farxiv.org%252Fabs%252F2211.01324%2523nvidia.html

  126. Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?

  127. Yi Tay

  128. https%253A%252F%252Farxiv.org%252Fabs%252F2207.10551%2523google.html

  129. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

  130. About Me

  131. Andrea Santilli

  132. Andy Zou

  133. Barret Zoph

  134. Behnam Neyshabur

  135. Colin Raffel

  136. https://people.eecs.berkeley.edu/~hendrycks/

  137. Daniel Levy

  138. Eric Tang

  139. Hannaneh Hajishirzi—University of Washington

  140. Jacob Hilton's Homepage

  141. Jared Kaplan

  142. Jascha Sohl-Dickstein

  143. Jason Wei

  144. Leo Gao

  145. Luke Metz

  146. Mantas Mazeika

  147. Mohit Bansal

  148. Nikita Nangia

  149. Omer Levy

  150. Owain Evans, AI Alignment Researcher

  151. Percy Liang

  152. Sam Bowman

  153. Stefano Ermon

  154. Stella Biderman

  155. Steven T. Piantadosi

  156. Vedant Misra

  157. https%253A%252F%252Farxiv.org%252Fabs%252F2206.04615.html

  158. Tutel: Adaptive Mixture-of-Experts at Scale

  159. https%253A%252F%252Farxiv.org%252Fabs%252F2206.03382%2523microsoft.html

  160. Sparse Mixers: Combining MoE and Mixing to build a more efficient BERT

  161. https%253A%252F%252Farxiv.org%252Fabs%252F2205.12399%2523google.html

  162. InCoder: A Generative Model for Code Infilling and Synthesis

  163. Luke Zettlemoyer

  164. Mike Lewis

  165. https%253A%252F%252Farxiv.org%252Fabs%252F2204.05999%2523facebook.html

  166. WuDaoMM: A large-scale Multi-Modal Dataset for Pre-training models

  167. https%253A%252F%252Farxiv.org%252Fabs%252F2203.11480%2523baai.html

  168. Efficient Language Modeling with Sparse All-MLP

  169. https%253A%252F%252Farxiv.org%252Fabs%252F2203.06850.html

  170. Mixture-of-Experts with Expert Choice Routing

  171. https%253A%252F%252Farxiv.org%252Fabs%252F2202.09368%2523google.html

  172. ST-MoE: Designing Stable and Transferable Sparse Expert Models

  173. Barret Zoph

  174. https%253A%252F%252Farxiv.org%252Fabs%252F2202.08906%2523google.html

  175. DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale

  176. https%253A%252F%252Farxiv.org%252Fabs%252F2201.05596%2523microsoft.html

  177. U.S. vs. China Rivalry Boosts Tech—and Tensions: Militarized AI threatens a new arms race

  178. https%253A%252F%252Fspectrum.ieee.org%252Fchina-us-militarized-ai.html

  179. Go Wider Instead of Deeper

  180. https%253A%252F%252Farxiv.org%252Fabs%252F2107.11817.html

  181. V-MoE: Scaling Vision with Sparse Mixture of Experts

  182. Neil Houlsby

  183. https%253A%252F%252Farxiv.org%252Fabs%252F2106.05974%2523google.html

  184. Chinese AI lab challenges Google, OpenAI with a model of 1.75 trillion parameters

  185. https%253A%252F%252Fen.pingwest.com%252Fa%252F8693%2523baai.html

  186. Carbon Emissions and Large Neural Network Training

  187. https%253A%252F%252Farxiv.org%252Fabs%252F2104.10350%2523google.html

  188. China’s GPT-3? BAAI Introduces Superscale Intelligence Model ‘Wu Dao 1.0’: The Beijing Academy of Artificial Intelligence (BAAI) releases Wu Dao 1.0, China’s first large-scale pretraining model.

  189. https%253A%252F%252Fsyncedreview.com%252F2021%252F03%252F23%252Fchinas-gpt-3-baai-introduces-superscale-intelligence-model-wu-dao-1-0%252F%2523baai.html

  190. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

  191. Barret Zoph

  192. https%253A%252F%252Farxiv.org%252Fabs%252F2101.03961%2523google.html

  193. Efficient Content-Based Sparse Attention with Routing Transformers

  194. https%253A%252F%252Farxiv.org%252Fabs%252F2003.05997%2523google.html

  195. Mixture of experts: a literature survey

  196. %252Fdoc%252Fai%252Fscaling%252Fmixture-of-experts%252F2012-masoudnia.pdf.html