- See Also
- Links
- “Context on the NVIDIA ChatGPT Opportunity—and Ramifications of Large Language Model Enthusiasm”, 2023
- “Microsoft and OpenAI Extend Partnership”, 2023
- “Efficiently Scaling Transformer Inference”, Et Al 2022
- “Reserve Capacity of NVIDIA HGX H100s on CoreWeave Now: Available at Scale in Q1 2023 Starting at $2.23/hr”, CoreWeave 2022
- “Petals: Collaborative Inference and Fine-tuning of Large Models”, Et Al 2022
- “Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training”, Et Al 2022
- “Is Integer Arithmetic Enough for Deep Learning Training?”, Et Al 2022
- “Efficient NLP Inference at the Edge via Elastic Pipelining”, Et Al 2022
- “Training Transformers Together”, Et Al 2022
- “Tutel: Adaptive Mixture-of-Experts at Scale”, Et Al 2022
- “8-bit Numerical Formats for Deep Neural Networks”, Et Al 2022
- “ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers”, Et Al 2022
- “FlashAttention: Fast and Memory-Efficient Exact Attention With IO-Awareness”, Et Al 2022
- “A Low-latency Communication Design for Brain Simulations”, 2022
- “Reducing Activation Recomputation in Large Transformer Models”, Et Al 2022
- “What Language Model to Train If You Have One Million GPU Hours?”, Et Al 2022
- “Monarch: Expressive Structured Matrices for Efficient and Accurate Training”, Et Al 2022
- “Pathways: Asynchronous Distributed Dataflow for ML”, Et Al 2022
- “Singularity: Planet-Scale, Preemptive and Elastic Scheduling of AI Workloads”, Et Al 2022
- “Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam”, Et Al 2022
- “Introducing the AI Research SuperCluster—Meta’s Cutting-edge AI Supercomputer for AI Research”, 2022
- “Is Programmable Overhead Worth The Cost? How Much Do We Pay for a System to Be Programmable? It Depends upon Who You Ask”, 2022
- “Spiking Neural Networks and Their Applications: A Review”, Et Al 2022
- “On the Working Memory of Humans and Great Apes: Strikingly Similar or Remarkably Different?”, Et Al 2021
- “SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient”, Et Al 2021
- “M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining”, Et Al 2021
- “Sustainable AI: Environmental Implications, Challenges and Opportunities”, Et Al 2021
- “China Has Already Reached Exascale—On Two Separate Systems”, 2021
- “The Efficiency Misnomer”, Et Al 2021
- “Learning to Walk in Minutes Using Massively Parallel Deep Reinforcement Learning”, Et Al 2021
- “WarpDrive: Extremely Fast End-to-End Deep Multi-Agent Reinforcement Learning on a GPU”, Et Al 2021
- “Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning”, Et Al 2021
- “PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management”, Et Al 2021
- “Demonstration of Decentralized, Physics-Driven Learning”, Et Al 2021
- “Chimera: Efficiently Training Large-Scale Neural Networks With Bidirectional Pipelines”, 2021
- “First-Generation Inference Accelerator Deployment at Facebook”, Et Al 2021
- “Single-chip Photonic Deep Neural Network for Instantaneous Image Classification”, Et Al 2021
- “Distributed Deep Learning in Open Collaborations”, Et Al 2021
- “Ten Lessons From Three Generations Shaped Google’s TPUv4i”, Et Al 2021
- “2.5-dimensional Distributed Model Training”, Et Al 2021
- “Maximizing 3-D Parallelism in Distributed Training for Huge Neural Networks”, Et Al 2021
- “A Full-stack Accelerator Search Technique for Vision Applications”, Et Al 2021
- “ChinAI #141: The PanGu Origin Story: Notes from an Informative Zhihu Thread on PanGu”, 2021
- “GSPMD: General and Scalable Parallelization for ML Computation Graphs”, Et Al 2021
- “PanGu-α: Large-scale Autoregressive Pretrained Chinese Language Models With Auto-parallel Computation”, Et Al 2021
- “ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning”, Et Al 2021
- “How to Train BERT With an Academic Budget”, Et Al 2021
- “Podracer Architectures for Scalable Reinforcement Learning”, Et Al 2021
- “An Efficient 2D Method for Training Super-Large Deep Learning Models”, Et Al 2021
- “High-performance, Distributed Training of Large-scale Deep Learning Recommendation Models (DLRMs)”, Et Al 2021
- “Efficient Large-Scale Language Model Training on GPU Clusters”, Et Al 2021
- “Large Batch Simulation for Deep Reinforcement Learning”, Et Al 2021
- “Warehouse-Scale Video Acceleration (Argos): Co-design and Deployment in the Wild”, Et Al 2021
- “TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models”, Et Al 2021
- “PipeTransformer: Automated Elastic Pipelining for Distributed Training of Transformers”, Et Al 2021
- “ZeRO-Offload: Democratizing Billion-Scale Model Training”, Et Al 2021
- “The Design Process for Google’s Training Chips: TPUv2 and TPUv3”, Et Al 2021
- “Hardware Beyond Backpropagation: a Photonic Co-Processor for Direct Feedback Alignment”, Et Al 2020
- “Parallel Training of Deep Networks With Local Updates”, Et Al 2020
- “Exploring the Limits of Concurrency in ML Training on Google TPUs”, Et Al 2020
- “BytePS: A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters”, Et Al 2020
- “Training EfficientNets at Supercomputer Scale: 83% ImageNet Top-1 Accuracy in One Hour”, Et Al 2020
- “Matrix Engines for High Performance Computing: A Paragon of Performance or Grasping at Straws?”, Et Al 2020
- “Direct Feedback Alignment Scales to Modern Deep Learning Tasks and Architectures”, Et Al 2020b
- “L2L: Training Large Neural Networks With Constant Memory Using a New Execution Algorithm”, Et Al 2020
- “Interlocking Backpropagation: Improving Depthwise Model-parallelism”, Et Al 2020
- “DeepSpeed: Extreme-scale Model Training for Everyone”, Et Al 2020
- “Measuring Hardware Overhang”, Hippke 2020
- “The Node Is Nonsense: There Are Better Ways to Measure Progress Than the Old Moore’s Law Metric”, 2020
- “Are We in an AI Overhang?”, 2020
- “HOBFLOPS CNNs: Hardware Optimized Bitslice-Parallel Floating-Point Operations for Convolutional Neural Networks”, 2020
- “The Computational Limits of Deep Learning”, Et Al 2020
- “Data Movement Is All You Need: A Case Study on Optimizing Transformers”, Et Al 2020
- “PyTorch Distributed: Experiences on Accelerating Data Parallel Training”, Et Al 2020
- “Japanese Supercomputer Is Crowned World’s Speediest: In the Race for the Most Powerful Computers, Fugaku, a Japanese Supercomputer, Recently Beat American and Chinese Machines”, 2020
- “Sample Factory: Egocentric 3D Control from Pixels at 100,000 FPS With Asynchronous Reinforcement Learning”, Et Al 2020
- “PipeDream-2BW: Memory-Efficient Pipeline-Parallel DNN Training”, Et Al 2020
- “There’s Plenty of Room at the Top: What Will Drive Computer Performance After Moore’s Law?”, Et Al 2020
- “A Domain-specific Supercomputer for Training Deep Neural Networks”, Et Al 2020
- “Microsoft Announces New Supercomputer, Lays out Vision for Future AI Work”, 2020
- “AI and Efficiency: We’re Releasing an Analysis Showing That Since 2012 the Amount of Compute Needed to Train a Neural Net to the Same Performance on ImageNet Classification Has Been Decreasing by a Factor of 2 Every 16 Months”, 2020
- “Computation in the Human Cerebral Cortex Uses Less Than 0.2 Watts yet This Great Expense Is Optimal When considering Communication Costs”, 2020
- “Startup Tenstorrent Shows AI Is Changing Computing and vice Versa: Tenstorrent Is One of the Rush of AI Chip Makers Founded in 2016 and Finally Showing Product. The New Wave of Chips Represent a Substantial Departure from How Traditional Computer Chips Work, but Also Point to Ways That Neural Network Design May Change in the Years to Come”, 2020
- “AI Chips: What They Are and Why They Matter-An AI Chips Reference”, 2020
- “Pipelined Backpropagation at Scale: Training Large Models without Batches”, Et Al 2020
- “2019 Recent Trends in GPU Price per FLOPS”, 2020
- “Ultrafast Machine Vision With 2D Material Neural Network Image Sensors”, Et Al 2020
- “Towards Spike-based Machine Intelligence With Neuromorphic Computing”, Et Al 2019
- “Checkmate: Breaking the Memory Wall With Optimal Tensor Rematerialization”, Et Al 2019
- “Training Kinetics in 15 Minutes: Large-scale Distributed Training on Videos”, Et Al 2019
- “Energy and Policy Considerations for Deep Learning in NLP”, Et Al 2019
- “Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes”, Et Al 2019
- “GAP: Generalizable Approximate Graph Partitioning Framework”, Et Al 2019
- “An Empirical Model of Large-Batch Training”, Et Al 2018
- “Bayesian Layers: A Module for Neural Network Uncertainty”, Et Al 2018
- “GPipe: Efficient Training of Giant Neural Networks Using Pipeline Parallelism”, Et Al 2018
- “Measuring the Effects of Data Parallelism on Neural Network Training”, Et Al 2018
- “Mesh-TensorFlow: Deep Learning for Supercomputers”, Et Al 2018
- “There Is Plenty of Time at the Bottom: the Economics, Risk and Ethics of Time Compression”, 2018
- “Highly Scalable Deep Learning Training System With Mixed-Precision: Training ImageNet in 4 Minutes”, Et Al 2018
- “AI and Compute”, Et Al 2018
- “Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions”, Et Al 2018
- “Loihi: A Neuromorphic Manycore Processor With On-Chip Learning”, Et Al 2018
- “Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training”, Et Al 2017
- “Mixed Precision Training”, Et Al 2017
- “On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima”, Et Al 2016
- “Training Deep Nets With Sublinear Memory Cost”, Et Al 2016
- “GeePS: Scalable Deep Learning on Distributed GPUs With a GPU-specialized Parameter Server”, Et Al 2016
- “Convolutional Networks for Fast, Energy-Efficient Neuromorphic Computing”, Et Al 2016
- “Communication-Efficient Learning of Deep Networks from Decentralized Data”, Et Al 2016
- “Persistent RNNs: Stashing Recurrent Weights On-Chip”, Et Al 2016
- “Scaling Distributed Machine Learning With the Parameter Server”, Et Al 2014
- “Multi-column Deep Neural Network for Traffic Sign Classification”, Cireşan Et Al 2012
- “Slowing Moore’s Law: How It Could Happen”, 2012
- “Multi-column Deep Neural Networks for Image Classification”, Cireşan Et Al 2012
- “HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent”, Et Al 2011
- “DanNet: Flexible, High Performance Convolutional Neural Networks for Image Classification”, Et Al 2011
- “Goodbye 2010”, 2010
- “Bandwidth Optimal All-reduce Algorithms for Clusters of Workstations”, 2009
- “Whole Brain Emulation: A Roadmap”
- “Moore’s Law and the Technology S-Curve”, 2004
- “Ultimate Physical Limits to Computation”, 1999
- “When Will Computer Hardware Match the Human Brain?”, 1998
- “A Sociological Study of the Official History of the Perceptrons Controversy [1993]”, 1993
- “Intelligence As an Emergent Behavior; Or, The Songs of Eden”, 1988
- “TensorFlow Research Cloud (TRC): Accelerate Your Cutting-edge Machine Learning Research With Free Cloud TPUs”, TRC 2023
- “48:44—Tesla Vision · 1:13:12—Planning and Control · 1:24:35—Manual Labeling · 1:28:11—Auto Labeling · 1:35:15—Simulation · 1:42:10—Hardware Integration · 1:45:40—Dojo”
- Wikipedia
- Miscellaneous
- Link Bibliography
See Also
Links
“Context on the NVIDIA ChatGPT Opportunity—and Ramifications of Large Language Model Enthusiasm”, 2023
“Context on the NVIDIA ChatGPT opportunity—and ramifications of large language model enthusiasm”, 2023-02-10 ( ; backlinks; similar; bibliography)
“Microsoft and OpenAI Extend Partnership”, 2023
“Microsoft and OpenAI extend partnership”, 2023-01-23 ( ; backlinks; similar; bibliography)
“Efficiently Scaling Transformer Inference”, Et Al 2022
“Efficiently Scaling Transformer Inference”, 2022-11-09 ( ; similar; bibliography)
“Reserve Capacity of NVIDIA HGX H100s on CoreWeave Now: Available at Scale in Q1 2023 Starting at $2.23/hr”, CoreWeave 2022
“Petals: Collaborative Inference and Fine-tuning of Large Models”, Et Al 2022
“Petals: Collaborative Inference and Fine-tuning of Large Models”, 2022-09-02 ( ; similar)
“Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training”, Et Al 2022
“Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training”, 2022-08-12 ( ; similar)
“Is Integer Arithmetic Enough for Deep Learning Training?”, Et Al 2022
“Is Integer Arithmetic Enough for Deep Learning Training?”, 2022-07-18 ( ; similar)
“Efficient NLP Inference at the Edge via Elastic Pipelining”, Et Al 2022
“Efficient NLP Inference at the Edge via Elastic Pipelining”, 2022-07-11 (similar)
“Training Transformers Together”, Et Al 2022
“Training Transformers Together”, 2022-07-07 ( ; backlinks; similar)
“Tutel: Adaptive Mixture-of-Experts at Scale”, Et Al 2022
“Tutel: Adaptive Mixture-of-Experts at Scale”, 2022-06-07 ( ; similar; bibliography)
“8-bit Numerical Formats for Deep Neural Networks”, Et Al 2022
“8-bit Numerical Formats for Deep Neural Networks”, 2022-06-06 ( ; similar)
“ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers”, Et Al 2022
“ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers”, 2022-06-04 ( ; similar; bibliography)
“FlashAttention: Fast and Memory-Efficient Exact Attention With IO-Awareness”, Et Al 2022
“FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness”, 2022-05-27 ( ; backlinks; similar; bibliography)
“A Low-latency Communication Design for Brain Simulations”, 2022
“A Low-latency Communication Design for Brain Simulations”, 2022-05-14 ( ; similar)
“Reducing Activation Recomputation in Large Transformer Models”, Et Al 2022
“Reducing Activation Recomputation in Large Transformer Models”, 2022-05-10 (similar)
“What Language Model to Train If You Have One Million GPU Hours?”, Et Al 2022
“What Language Model to Train if You Have One Million GPU Hours?”, 2022-04-11 ( ; similar)
“Monarch: Expressive Structured Matrices for Efficient and Accurate Training”, Et Al 2022
“Monarch: Expressive Structured Matrices for Efficient and Accurate Training”, 2022-04-01 ( ; similar; bibliography)
“Pathways: Asynchronous Distributed Dataflow for ML”, Et Al 2022
“Pathways: Asynchronous Distributed Dataflow for ML”, 2022-03-23 ( ; similar)
“Singularity: Planet-Scale, Preemptive and Elastic Scheduling of AI Workloads”, Et Al 2022
“Singularity: Planet-Scale, Preemptive and Elastic Scheduling of AI Workloads”, 2022-02-16 (similar)
“Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam”, Et Al 2022
“Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam”, 2022-02-12 ( ; similar; bibliography)
“Introducing the AI Research SuperCluster—Meta’s Cutting-edge AI Supercomputer for AI Research”, 2022
“Introducing the AI Research SuperCluster—Meta’s cutting-edge AI supercomputer for AI research”, 2022-01-24 (similar; bibliography)
“Is Programmable Overhead Worth The Cost? How Much Do We Pay for a System to Be Programmable? It Depends upon Who You Ask”, 2022
“Is Programmable Overhead Worth The Cost? How much do we pay for a system to be programmable? It depends upon who you ask”, 2022-01-13 ( ; backlinks; similar; bibliography)
“Spiking Neural Networks and Their Applications: A Review”, Et Al 2022
“Spiking Neural Networks and Their Applications: A Review”, 2022 ( ; similar)
“On the Working Memory of Humans and Great Apes: Strikingly Similar or Remarkably Different?”, Et Al 2021
“On the Working Memory of Humans and Great Apes: Strikingly Similar or Remarkably Different?”, 2021-12-14 ( ; similar)
“SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient”, Et Al 2021
“SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient”, 2021-11-23 (backlinks; similar)
“M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining”, Et Al 2021
“M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining”, 2021-11-17 ( ; similar)
“Sustainable AI: Environmental Implications, Challenges and Opportunities”, Et Al 2021
“Sustainable AI: Environmental Implications, Challenges and Opportunities”, 2021-10-30 ( ; similar)
“China Has Already Reached Exascale—On Two Separate Systems”, 2021
“China Has Already Reached Exascale—On Two Separate Systems”, 2021-10-26 (backlinks; similar)
“The Efficiency Misnomer”, Et Al 2021
“The Efficiency Misnomer”, 2021-10-25 ( ; similar)
“Learning to Walk in Minutes Using Massively Parallel Deep Reinforcement Learning”, Et Al 2021
“Learning to Walk in Minutes Using Massively Parallel Deep Reinforcement Learning”, 2021-09-24 ( ; similar)
“WarpDrive: Extremely Fast End-to-End Deep Multi-Agent Reinforcement Learning on a GPU”, Et Al 2021
“WarpDrive: Extremely Fast End-to-End Deep Multi-Agent Reinforcement Learning on a GPU”, 2021-08-31 ( ; similar)
“Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning”, Et Al 2021
“Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning”, 2021-08-24 ( ; similar)
“PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management”, Et Al 2021
“PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management”, 2021-08-12 (similar)
“Demonstration of Decentralized, Physics-Driven Learning”, Et Al 2021
“Demonstration of Decentralized, Physics-Driven Learning”, 2021-07-31 (similar)
“Chimera: Efficiently Training Large-Scale Neural Networks With Bidirectional Pipelines”, 2021
“Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines”, 2021-07-14 (backlinks; similar)
“First-Generation Inference Accelerator Deployment at Facebook”, Et Al 2021
“First-Generation Inference Accelerator Deployment at Facebook”, 2021-07-08 ( ; similar)
“Single-chip Photonic Deep Neural Network for Instantaneous Image Classification”, Et Al 2021
“Single-chip photonic deep neural network for instantaneous image classification”, 2021-06-19 (similar)
“Distributed Deep Learning in Open Collaborations”, Et Al 2021
“Distributed Deep Learning in Open Collaborations”, 2021-06-18 (backlinks; similar; bibliography)
“Ten Lessons From Three Generations Shaped Google’s TPUv4i”, Et Al 2021
“Ten Lessons From Three Generations Shaped Google’s TPUv4i”, 2021-06-14 ( ; similar; bibliography)
“2.5-dimensional Distributed Model Training”, Et Al 2021
“2.5-dimensional distributed model training”, 2021-05-30 (similar)
“Maximizing 3-D Parallelism in Distributed Training for Huge Neural Networks”, Et Al 2021
“Maximizing 3-D Parallelism in Distributed Training for Huge Neural Networks”, 2021-05-30 ( ; similar)
“A Full-stack Accelerator Search Technique for Vision Applications”, Et Al 2021
“A Full-stack Accelerator Search Technique for Vision Applications”, 2021-05-26 ( ; similar)
“ChinAI #141: The PanGu Origin Story: Notes from an Informative Zhihu Thread on PanGu”, 2021
“ChinAI #141: The PanGu Origin Story: Notes from an informative Zhihu Thread on PanGu”, 2021-05-17 (similar; bibliography)
“GSPMD: General and Scalable Parallelization for ML Computation Graphs”, Et Al 2021
“GSPMD: General and Scalable Parallelization for ML Computation Graphs”, 2021-05-10 (similar)
“PanGu-α: Large-scale Autoregressive Pretrained Chinese Language Models With Auto-parallel Computation”, Et Al 2021
“PanGu-α: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation”, 2021-04-26 (similar)
“ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning”, Et Al 2021
“ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning”, 2021-04-16 (similar)
“How to Train BERT With an Academic Budget”, Et Al 2021
“How to Train BERT with an Academic Budget”, 2021-04-15 ( ; similar)
“Podracer Architectures for Scalable Reinforcement Learning”, Et Al 2021
“Podracer architectures for scalable Reinforcement Learning”, 2021-04-13 ( ; similar; bibliography)
“An Efficient 2D Method for Training Super-Large Deep Learning Models”, Et Al 2021
“An Efficient 2D Method for Training Super-Large Deep Learning Models”, 2021-04-12 (similar)
“High-performance, Distributed Training of Large-scale Deep Learning Recommendation Models (DLRMs)”, Et Al 2021
“High-performance, Distributed Training of Large-scale Deep Learning Recommendation Models (DLRMs)”, 2021-04-12 ( ; similar)
“Efficient Large-Scale Language Model Training on GPU Clusters”, Et Al 2021
“Efficient Large-Scale Language Model Training on GPU Clusters”, 2021-04-09 (similar)
“Large Batch Simulation for Deep Reinforcement Learning”, Et Al 2021
“Large Batch Simulation for Deep Reinforcement Learning”, 2021-03-12 ( ; backlinks; similar)
“Warehouse-Scale Video Acceleration (Argos): Co-design and Deployment in the Wild”, Et Al 2021
“Warehouse-Scale Video Acceleration (Argos): Co-design and Deployment in the Wild”, 2021-02-27 ( ; similar)
“TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models”, Et Al 2021
“TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models”, 2021-02-16 (backlinks; similar; bibliography)
“PipeTransformer: Automated Elastic Pipelining for Distributed Training of Transformers”, Et Al 2021
“PipeTransformer: Automated Elastic Pipelining for Distributed Training of Transformers”, 2021-02-05 (similar; bibliography)
“ZeRO-Offload: Democratizing Billion-Scale Model Training”, Et Al 2021
“ZeRO-Offload: Democratizing Billion-Scale Model Training”, 2021-01-18 (similar)
“The Design Process for Google’s Training Chips: TPUv2 and TPUv3”, Et Al 2021
“The Design Process for Google’s Training Chips: TPUv2 and TPUv3”, 2021 (similar)
“Hardware Beyond Backpropagation: a Photonic Co-Processor for Direct Feedback Alignment”, Et Al 2020
“Hardware Beyond Backpropagation: a Photonic Co-Processor for Direct Feedback Alignment”, 2020-12-11 (similar; bibliography)
“Parallel Training of Deep Networks With Local Updates”, Et Al 2020
“Parallel Training of Deep Networks with Local Updates”, 2020-12-07 (similar)
“Exploring the Limits of Concurrency in ML Training on Google TPUs”, Et Al 2020
“Exploring the limits of Concurrency in ML Training on Google TPUs”, 2020-11-07 (similar)
“BytePS: A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters”, Et Al 2020
“BytePS: A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters”, 2020-11-04 (similar; bibliography)
“Training EfficientNets at Supercomputer Scale: 83% ImageNet Top-1 Accuracy in One Hour”, Et Al 2020
“Training EfficientNets at Supercomputer Scale: 83% ImageNet Top-1 Accuracy in One Hour”, 2020-10-30 (similar; bibliography)
“Matrix Engines for High Performance Computing: A Paragon of Performance or Grasping at Straws?”, Et Al 2020
“Matrix Engines for High Performance Computing: A Paragon of Performance or Grasping at Straws?”, 2020-10-27 (similar)
“Direct Feedback Alignment Scales to Modern Deep Learning Tasks and Architectures”, Et Al 2020b
“Direct Feedback Alignment Scales to Modern Deep Learning Tasks and Architectures”, 2020-10-22 (similar; bibliography)
“L2L: Training Large Neural Networks With Constant Memory Using a New Execution Algorithm”, Et Al 2020
“L2L: Training Large Neural Networks with Constant Memory using a New Execution Algorithm”, 2020-10-16 ( ; similar)
“Interlocking Backpropagation: Improving Depthwise Model-parallelism”, Et Al 2020
“Interlocking Backpropagation: Improving depthwise model-parallelism”, 2020-10-08 (similar)
“DeepSpeed: Extreme-scale Model Training for Everyone”, Et Al 2020
“DeepSpeed: Extreme-scale model training for everyone”, 2020-09-10 ( ; backlinks; similar; bibliography)
“Measuring Hardware Overhang”, Hippke 2020
“Measuring hardware overhang”, 2020-08-05 ( ; backlinks; similar)
“The Node Is Nonsense: There Are Better Ways to Measure Progress Than the Old Moore’s Law Metric”, 2020
“The Node Is Nonsense: There are better ways to measure progress than the old Moore’s law metric”, 2020-07-28 (similar)
“Are We in an AI Overhang?”, 2020
“Are we in an AI overhang?”, 2020-07-27 (backlinks; similar)
“HOBFLOPS CNNs: Hardware Optimized Bitslice-Parallel Floating-Point Operations for Convolutional Neural Networks”, 2020
“HOBFLOPS CNNs: Hardware Optimized Bitslice-Parallel Floating-Point Operations for Convolutional Neural Networks”, 2020-07-11 ( ; similar)
“The Computational Limits of Deep Learning”, Et Al 2020
“The Computational Limits of Deep Learning”, 2020-07-10 (backlinks; similar)
“Data Movement Is All You Need: A Case Study on Optimizing Transformers”, Et Al 2020
“Data Movement Is All You Need: A Case Study on Optimizing Transformers”, 2020-06-30 ( ; similar)
“PyTorch Distributed: Experiences on Accelerating Data Parallel Training”, Et Al 2020
“PyTorch Distributed: Experiences on Accelerating Data Parallel Training”, 2020-06-28 (similar)
“Japanese Supercomputer Is Crowned World’s Speediest: In the Race for the Most Powerful Computers, Fugaku, a Japanese Supercomputer, Recently Beat American and Chinese Machines”, 2020
“Japanese Supercomputer Is Crowned World’s Speediest: In the race for the most powerful computers, Fugaku, a Japanese supercomputer, recently beat American and Chinese machines”, 2020-06-22 ( ; backlinks; similar)
“Sample Factory: Egocentric 3D Control from Pixels at 100,000 FPS With Asynchronous Reinforcement Learning”, Et Al 2020
“Sample Factory: Egocentric 3D Control from Pixels at 100,000 FPS with Asynchronous Reinforcement Learning”, 2020-06-21 ( ; similar)
“PipeDream-2BW: Memory-Efficient Pipeline-Parallel DNN Training”, Et Al 2020
“PipeDream-2BW: Memory-Efficient Pipeline-Parallel DNN Training”, 2020-06-16 ( ; similar)
“There’s Plenty of Room at the Top: What Will Drive Computer Performance After Moore’s Law?”, Et Al 2020
“There’s plenty of room at the Top: What will drive computer performance after Moore’s law?”, 2020-06-05 ( ; backlinks; similar)
“A Domain-specific Supercomputer for Training Deep Neural Networks”, Et Al 2020
“A domain-specific supercomputer for training deep neural networks”, 2020-06-01 (similar)
“Microsoft Announces New Supercomputer, Lays out Vision for Future AI Work”, 2020
“Microsoft announces new supercomputer, lays out vision for future AI work”, 2020-05-19 (backlinks; similar; bibliography)
“AI and Efficiency: We’re Releasing an Analysis Showing That Since 2012 the Amount of Compute Needed to Train a Neural Net to the Same Performance on ImageNet Classification Has Been Decreasing by a Factor of 2 Every 16 Months”, 2020
“AI and Efficiency: We’re releasing an analysis showing that since 2012 the amount of compute needed to train a neural net to the same performance on ImageNet classification has been decreasing by a factor of 2 every 16 months”, 2020-05-05 ( ; backlinks; similar)
“Computation in the Human Cerebral Cortex Uses Less Than 0.2 Watts yet This Great Expense Is Optimal When considering Communication Costs”, 2020
“Computation in the human cerebral cortex uses less than 0.2 watts yet this great expense is optimal when considering communication costs”, 2020-04-25 ( ; similar)
“Startup Tenstorrent Shows AI Is Changing Computing and vice Versa: Tenstorrent Is One of the Rush of AI Chip Makers Founded in 2016 and Finally Showing Product. The New Wave of Chips Represent a Substantial Departure from How Traditional Computer Chips Work, but Also Point to Ways That Neural Network Design May Change in the Years to Come”, 2020
“Startup Tenstorrent shows AI is changing computing and vice versa: Tenstorrent is one of the rush of AI chip makers founded in 2016 and finally showing product. The new wave of chips represent a substantial departure from how traditional computer chips work, but also point to ways that neural network design may change in the years to come”, 2020-04-10 (similar; bibliography)
“AI Chips: What They Are and Why They Matter-An AI Chips Reference”, 2020
“AI Chips: What They Are and Why They Matter-An AI Chips Reference”, 2020-04 (backlinks; similar; bibliography)
“Pipelined Backpropagation at Scale: Training Large Models without Batches”, Et Al 2020
“Pipelined Backpropagation at Scale: Training Large Models without Batches”, 2020-03-25 (similar)
“2019 Recent Trends in GPU Price per FLOPS”, 2020
“2019 recent trends in GPU price per FLOPS”, 2020-03-25 ( ; backlinks; similar)
“Ultrafast Machine Vision With 2D Material Neural Network Image Sensors”, Et Al 2020
“Towards Spike-based Machine Intelligence With Neuromorphic Computing”, Et Al 2019
“Towards spike-based machine intelligence with neuromorphic computing”, 2019-11-27 (similar)
“Checkmate: Breaking the Memory Wall With Optimal Tensor Rematerialization”, Et Al 2019
“Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization”, 2019-10-07 (similar)
“Training Kinetics in 15 Minutes: Large-scale Distributed Training on Videos”, Et Al 2019
“Training Kinetics in 15 Minutes: Large-scale Distributed Training on Videos”, 2019-10-01 (similar)
“Energy and Policy Considerations for Deep Learning in NLP”, Et Al 2019
“Energy and Policy Considerations for Deep Learning in NLP”, 2019-06-05 ( ; backlinks; similar)
“Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes”, Et Al 2019
“Large Batch Optimization for Deep Learning: Training BERT in 76 minutes”, 2019-04-01 ( ; similar; bibliography)
“GAP: Generalizable Approximate Graph Partitioning Framework”, Et Al 2019
“GAP: Generalizable Approximate Graph Partitioning Framework”, 2019-03-02 ( ; similar)
“An Empirical Model of Large-Batch Training”, Et Al 2018
“An Empirical Model of Large-Batch Training”, 2018-12-14 ( ; similar)
“Bayesian Layers: A Module for Neural Network Uncertainty”, Et Al 2018
“Bayesian Layers: A Module for Neural Network Uncertainty”, 2018-12-10 ( ; similar)
“GPipe: Efficient Training of Giant Neural Networks Using Pipeline Parallelism”, Et Al 2018
“GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism”, 2018-11-16 (similar)
“Measuring the Effects of Data Parallelism on Neural Network Training”, Et Al 2018
“Measuring the Effects of Data Parallelism on Neural Network Training”, 2018-11-08 (similar)
“Mesh-TensorFlow: Deep Learning for Supercomputers”, Et Al 2018
“Mesh-TensorFlow: Deep Learning for Supercomputers”, 2018-11-05 (similar; bibliography)
“There Is Plenty of Time at the Bottom: the Economics, Risk and Ethics of Time Compression”, 2018
“There is plenty of time at the bottom: the economics, risk and ethics of time compression”, 2018-10-30 ( ; backlinks; similar)
“Highly Scalable Deep Learning Training System With Mixed-Precision: Training ImageNet in 4 Minutes”, Et Al 2018
“Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in 4 Minutes”, 2018-07-30 ( ; similar)
“AI and Compute”, Et Al 2018
“AI and Compute”, 2018-05-26 ( ; backlinks; similar; bibliography)
“Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions”, Et Al 2018
“Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions”, 2018-02-13 ( ; backlinks; similar)
“Loihi: A Neuromorphic Manycore Processor With On-Chip Learning”, Et Al 2018
“Loihi: A Neuromorphic Manycore Processor with On-Chip Learning”, 2018-01-16 (similar)
“Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training”, Et Al 2017
“Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training”, 2017-12-05 ( ; similar; bibliography)
“Mixed Precision Training”, Et Al 2017
“Mixed Precision Training”, 2017-10-10 ( ; similar)
“On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima”, Et Al 2016
“On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima”, 2016-09-15 ( ; similar)
“Training Deep Nets With Sublinear Memory Cost”, Et Al 2016
“Training Deep Nets with Sublinear Memory Cost”, 2016-04-21 ( ; backlinks; similar)
“GeePS: Scalable Deep Learning on Distributed GPUs With a GPU-specialized Parameter Server”, Et Al 2016
“GeePS: scalable deep learning on distributed GPUs with a GPU-specialized parameter server”, 2016-04-01 (similar)
“Convolutional Networks for Fast, Energy-Efficient Neuromorphic Computing”, Et Al 2016
“Convolutional Networks for Fast, Energy-Efficient Neuromorphic Computing”, 2016-03-28 ( ; similar)
“Communication-Efficient Learning of Deep Networks from Decentralized Data”, Et Al 2016
“Communication-Efficient Learning of Deep Networks from Decentralized Data”, 2016-02-17 (similar)
“Persistent RNNs: Stashing Recurrent Weights On-Chip”, Et Al 2016
“Persistent RNNs: Stashing Recurrent Weights On-Chip”, 2016-01 ( ; similar)
“Scaling Distributed Machine Learning With the Parameter Server”, Et Al 2014
“Scaling Distributed Machine Learning with the Parameter Server”, 2014-10-06 (similar)
“Multi-column Deep Neural Network for Traffic Sign Classification”, Cireşan Et Al 2012
“Multi-column deep neural network for traffic sign classification”, 2012-08 ( ; backlinks; similar)
“Slowing Moore’s Law: How It Could Happen”, 2012
“Slowing Moore’s Law: How It Could Happen”, 2012-03-16 ( ; backlinks; similar; bibliography)
“Multi-column Deep Neural Networks for Image Classification”, Cireşan Et Al 2012
“Multi-column Deep Neural Networks for Image Classification”, 2012-02-13 ( ; similar)
“HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent”, Et Al 2011
“HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent”, 2011-06-28 ( ; backlinks; similar)
“DanNet: Flexible, High Performance Convolutional Neural Networks for Image Classification”, Et Al 2011
“DanNet: Flexible, High Performance Convolutional Neural Networks for Image Classification”, 2011-02-01 ( ; similar; bibliography)
“Goodbye 2010”, 2010
“Goodbye 2010”, 2010-10-10 ( ; backlinks; similar)
“Bandwidth Optimal All-reduce Algorithms for Clusters of Workstations”, 2009
“Bandwidth optimal all-reduce algorithms for clusters of workstations”, 2009-02-01 (similar; bibliography)
“Whole Brain Emulation: A Roadmap”
“Moore’s Law and the Technology S-Curve”, 2004
“Moore’s law and the Technology S-Curve”, 2004-12-01
“Ultimate Physical Limits to Computation”, 1999
“Ultimate physical limits to computation”, 1999-08-13 ( ; backlinks; similar)
“When Will Computer Hardware Match the Human Brain?”, 1998
“When will computer hardware match the human brain?”, 1998 ( ; backlinks; similar)
“A Sociological Study of the Official History of the Perceptrons Controversy [1993]”, 1993
“A Sociological Study of the Official History of the Perceptrons Controversy [1993]”, 1993-08 ( ; backlinks; similar; bibliography)
“Intelligence As an Emergent Behavior; Or, The Songs of Eden”, 1988
“TensorFlow Research Cloud (TRC): Accelerate Your Cutting-edge Machine Learning Research With Free Cloud TPUs”, TRC 2023
“TensorFlow Research Cloud (TRC): Accelerate your cutting-edge machine learning research with free Cloud TPUs”, ( ; backlinks; similar)
“48:44—Tesla Vision · 1:13:12—Planning and Control · 1:24:35—Manual Labeling · 1:28:11—Auto Labeling · 1:35:15—Simulation · 1:42:10—Hardware Integration · 1:45:40—Dojo”
Wikipedia
Miscellaneous
- https://astralcodexten.substack.com/p/biological-anchors-a-trick-that-might
- https://blogs.nvidia.com/blog/2021/04/12/cpu-grace-cscs-alps/
- https://blogs.nvidia.com/blog/2022/09/08/hopper-mlperf-inference/
- https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.323.9505&rep=rep1&type=pdf
- https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/
- https://nitter.moomoo.me/jordanschnyc/status/1580889342402129921
- https://openai.com/blog/techniques-for-training-large-neural-networks/
- https://siliconangle.com/2021/05/27/perlmutter-said-worlds-fastest-ai-supercomputer-comes-online/
- https://spectrum.ieee.org/computing/hardware/the-future-of-deep-learning-is-photonic
- https://top500.org/news/fugaku-holds-top-spot-exascale-remains-elusive/
- https://venturebeat.com/2020/11/17/cerebras-wafer-size-chip-is-10000-times-faster-than-a-gpu/
- https://www.anandtech.com/show/17327/nvidia-hopper-gpu-architecture-and-h100-accelerator-announced
- https://www.chinatalk.media/p/new-chip-export-controls-explained
- https://www.graphcore.ai/posts/the-next-big-thing-introducing-ipu-pod128-and-ipu-pod256
- https://www.graphcore.ai/posts/the-wow-factor-graphcore-systems-get-huge-power-and-efficiency-boost
- https://www.hpcwire.com/2020/11/02/aws-ultraclusters-with-new-p4-a100-instances/
- https://www.lesswrong.com/posts/gPmGTND8Kroxgpgsn/how-fast-can-we-perform-a-forward-pass
- https://www.newyorker.com/tech/annals-of-technology/the-worlds-largest-computer-chip
- https://www.nytimes.com/2022/10/13/us/politics/biden-china-technology-semiconductors.html
- https://www.top500.org/news/ornls-frontier-first-to-break-the-exaflop-ceiling/
Link Bibliography
- https://nitter.moomoo.me/davidtayar5/status/1627690520456691712: “Context on the NVIDIA ChatGPT Opportunity—and Ramifications of Large Language Model Enthusiasm”, Morgan Stanley
- https://blogs.microsoft.com/blog/2023/01/23/microsoftandopenaiextendpartnership/: “Microsoft and OpenAI Extend Partnership”, Microsoft
- https://arxiv.org/abs/2211.05102#google: “Efficiently Scaling Transformer Inference”
- https://arxiv.org/abs/2206.03382#microsoft: “Tutel: Adaptive Mixture-of-Experts at Scale”
- https://arxiv.org/abs/2206.01861#microsoft: “ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers”, Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, Yuxiong He
- https://arxiv.org/abs/2205.14135: “FlashAttention: Fast and Memory-Efficient Exact Attention With IO-Awareness”, Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré
- https://arxiv.org/abs/2204.00595: “Monarch: Expressive Structured Matrices for Efficient and Accurate Training”
- https://arxiv.org/abs/2202.06009#microsoft: “Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam”, Yucheng Lu, Conglong Li, Minjia Zhang, Christopher De Sa, Yuxiong He
- https://ai.facebook.com/blog/ai-rsc: “Introducing the AI Research SuperCluster—Meta’s Cutting-edge AI Supercomputer for AI Research”, Kevin Lee, Shubho Sengupta
- https://semiengineering.com/is-programmable-overhead-worth-the-cost/: “Is Programmable Overhead Worth The Cost? How Much Do We Pay for a System to Be Programmable? It Depends upon Who You Ask”, Brian Bailey
- https://arxiv.org/abs/2106.10207: “Distributed Deep Learning in Open Collaborations”
- 2021-jouppi.pdf: “Ten Lessons From Three Generations Shaped Google’s TPUv4i”
- https://chinai.substack.com/p/chinai-141-the-pangu-origin-story: “ChinAI #141: The PanGu Origin Story: Notes from an Informative Zhihu Thread on PanGu”, Jeffrey Ding
- https://arxiv.org/abs/2104.06272#deepmind: “Podracer Architectures for Scalable Reinforcement Learning”, Matteo Hessel, Manuel Kroiss, Aidan Clark, Iurii Kemaev, John Quan, Thomas Keck, Fabio Viola, Hado van Hasselt
- https://arxiv.org/abs/2102.07988: “TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models”, Zhuohan Li, Siyuan Zhuang, Shiyuan Guo, Danyang Zhuo, Hao Zhang, Dawn Song, Ion Stoica
- https://arxiv.org/abs/2102.03161: “PipeTransformer: Automated Elastic Pipelining for Distributed Training of Transformers”, Chaoyang He, Shen Li, Mahdi Soltanolkotabi, Salman Avestimehr
- https://arxiv.org/abs/2012.06373#lighton: “Hardware Beyond Backpropagation: a Photonic Co-Processor for Direct Feedback Alignment”
- 2020-jiang.pdf: “BytePS: A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters”, Yimin Jiang, Yibo Zhu, Chang Lan, Bairen Yi, Yong Cui, Chuanxiong Guo
- https://arxiv.org/abs/2011.00071#google: “Training EfficientNets at Supercomputer Scale: 83% ImageNet Top-1 Accuracy in One Hour”, Arissa Wongpanich, Hieu Pham, James Demmel, Mingxing Tan, Quoc Le, Yang You, Sameer Kumar
- 2020-launay-2.pdf: “Direct Feedback Alignment Scales to Modern Deep Learning Tasks and Architectures”, Julien Launay, Iacopo Poli, François Boniface, Florent Krzakala
- https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/: “DeepSpeed: Extreme-scale Model Training for Everyone”, DeepSpeed Team, Rangan Majumder, Junhua Wang
- https://blogs.microsoft.com/ai/openai-azure-supercomputer/: “Microsoft Announces New Supercomputer, Lays out Vision for Future AI Work”, Jennifer Langston
- https://www.zdnet.com/article/startup-tenstorrent-and-competitors-show-how-computing-is-changing-ai-and-vice-versa/: “Startup Tenstorrent Shows AI Is Changing Computing and vice Versa: Tenstorrent Is One of the Rush of AI Chip Makers Founded in 2016 and Finally Showing Product. The New Wave of Chips Represent a Substantial Departure from How Traditional Computer Chips Work, but Also Point to Ways That Neural Network Design May Change in the Years to Come”, Tiernan Ray
- 2020-khan.pdf: “AI Chips: What They Are and Why They Matter-An AI Chips Reference”, Saif M. Khan, Alexander Mann
- https://arxiv.org/abs/1904.00962#google: “Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes”
- https://arxiv.org/abs/1811.02084#google: “Mesh-TensorFlow: Deep Learning for Supercomputers”
- https://openai.com/blog/ai-and-compute/: “AI and Compute”, Dario Amodei, Danny Hernandez, Girish Sastry, Jack Clark, Greg Brockman, Ilya Sutskever
- https://arxiv.org/abs/1712.01887: “Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training”, Yujun Lin, Song Han, Huizi Mao, Yu Wang, William J. Dally
- slowing-moores-law: “Slowing Moore’s Law: How It Could Happen”, Gwern Branwen
- https://arxiv.org/abs/1102.0183#schmidhuber: “DanNet: Flexible, High Performance Convolutional Neural Networks for Image Classification”, Dan Claudiu Ciresan, Ueli Meier, Jonathan Masci, Luca Maria Gambardella, Jürgen Schmidhuber
- 2009-patarasuk.pdf: “Bandwidth Optimal All-reduce Algorithms for Clusters of Workstations”, Pitch Patarasuk, Xin Yuan
- 1993-olazaran.pdf: “A Sociological Study of the Official History of the Perceptrons Controversy [1993]”, Mikel Olazaran