- See Also
- Gwern
- Links
- “Getting AI Datacenters in the UK: Why the UK Needs to Create Special Compute Zones; and How to Do It”, Wiseman et al 2024
- “The Future of Compute: Nvidia’s Crown Is Slipping”, Dagarwal 2024
- “Jake Sullivan: The American Who Waged a Tech War on China”
- “Nvidia’s AI Chips Are Cheaper to Rent in China Than US: Supply of Processors Helps Chinese Start-Ups Advance Artificial Intelligence Technology despite Washington’s Restrictions”, McMorrow & Olcott 2024
- “Chips or Not, Chinese AI Pushes Ahead: A Host of Chinese AI Startups Are Attempting to Write More Efficient Code for Large Language Models”, Kao & Huang 2024
- “Can AI Scaling Continue Through 2030?”, Sevilla et al 2024
- “UK Government Shelves £1.3bn UK Tech and AI Plans”
- “OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training”, Jaghouar et al 2024
- “Huawei Faces Production Challenges With 20% Yield Rate for AI Chip”, Trendforce 2024
- “RAM Is Practically Endless Now”, fxtentacles 2024
- “Huawei ‘Unable to Secure 3.5 Nanometer Chips’”, Choi 2024
- “China Is Losing the Chip War: Xi Jinping Picked a Fight over Semiconductor Technology—One He Can’t Win”, Schuman 2024
- “Scalable Matmul-Free Language Modeling”, Zhu et al 2024
- “Elon Musk Ordered Nvidia to Ship Thousands of AI Chips Reserved for Tesla to Twitter/xAI”, Kolodny 2024
- “Earnings Call: Tesla Discusses Q1 2024 Challenges and AI Expansion”, Abdulkadir 2024
- “Microsoft, OpenAI Plan $100 Billion Data-Center Project, Media Report Says”, Reuters 2024
- “AI and Memory Wall”, Gholami et al 2024
- “Singapore’s Temasek in Discussions to Invest in OpenAI: State-Backed Group in Talks With ChatGPT Maker’s Chief Sam Altman Who Is Seeking Funding to Build Chips Business”, Murgia & Ruehl 2024
- “China’s Military and Government Acquire Nvidia Chips despite US Ban”, Baptista 2024
- “Generative AI Beyond LLMs: System Implications of Multi-Modal Generation”, Golden et al 2023
- “Real-Time AI & The Future of AI Hardware”, Uberti 2023
- “OpenAI Agreed to Buy $51 Million of AI Chips From a Startup Backed by CEO Sam Altman”, Dave 2023
- “How Jensen Huang’s Nvidia Is Powering the AI Revolution: The Company’s CEO Bet It All on a New Kind of Chip. Now That Nvidia Is One of the Biggest Companies in the World, What Will He Do Next?”, Witt 2023
- “Microsoft Swallows OpenAI’s Core Team § Compute Is King”, Patel & Nishball 2023
- “Altman Sought Billions For Chip Venture Before OpenAI Ouster: Altman Was Fundraising in the Middle East for New Chip Venture; The Project, Code-Named Tigris, Is Intended to Rival Nvidia”, Ludlow & Vance 2023
- “DiLoCo: Distributed Low-Communication Training of Language Models”, Douillard et al 2023
- “LSS Transformer: Ultra-Long Sequence Distributed Transformer”, Wang et al 2023
- “ChipNeMo: Domain-Adapted LLMs for Chip Design”, Liu et al 2023
- wagieeacc @ "2023-10-17"
- “Saudi-China Collaboration Raises Concerns about Access to AI Chips: Fears Grow at Gulf Kingdom’s Top University That Ties to Chinese Researchers Risk Upsetting US Government”, Kerr et al 2023
- “Efficient Video and Audio Processing With Loihi 2”, Shrestha et al 2023
- “Biden Is Beating China on Chips. It May Not Be Enough.”, Wang 2023
- “Deep Mind’s Chief on AI’s Dangers—And the UK’s £900 Million Supercomputer: Demis Hassabis Says We Shouldn’t Let AI Fall into the Wrong Hands and the Government’s Plan to Build a Supercomputer for AI Is Likely to Be out of Date Before It Has Even Started”, Sellman 2023
- “Inflection AI Announces $1.3 Billion of Funding Led by Current Investors, Microsoft, and NVIDIA”, Inflection AI 2023
- “U.S. Considers New Curbs on AI Chip Exports to China: Restrictions Come amid Concerns That China Could Use AI Chips from Nvidia and Others for Weapon Development and Hacking”, Fitch et al 2023
- “Unleashing True Utility Computing With Quicksand”, Ruan et al 2023
- “The AI Boom Runs on Chips, but It Can’t Get Enough: ‘It’s like Toilet Paper during the Pandemic.’ Startups, Investors Scrounge for Computational Firepower”, Seetharaman & Dotan 2023
- “Big-PERCIVAL: Exploring the Native Use of 64-Bit Posit Arithmetic in Scientific Computing”, Mallasén et al 2023
- davidtayar5 @ "2023-02-10"
- “SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient”, Ryabinin et al 2023
- “Microsoft and OpenAI Extend Partnership”, Microsoft 2023
- “A 64-Core Mixed-Signal In-Memory Compute Chip Based on Phase-Change Memory for Deep Neural Network Inference”, Gallo et al 2022
- “Efficiently Scaling Transformer Inference”, Pope et al 2022
- “Reserve Capacity of NVIDIA HGX H100s on CoreWeave Now: Available at Scale in Q1 2023 Starting at $2.23/hr”, CoreWeave 2022
- “Petals: Collaborative Inference and Fine-Tuning of Large Models”, Borzunov et al 2022
- “Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training”, You et al 2022
- “Is Integer Arithmetic Enough for Deep Learning Training?”, Ghaffari et al 2022
- “Efficient NLP Inference at the Edge via Elastic Pipelining”, Guo et al 2022
- “Training Transformers Together”, Borzunov et al 2022
- “Tutel: Adaptive Mixture-Of-Experts at Scale”, Hwang et al 2022
- “8-Bit Numerical Formats for Deep Neural Networks”, Noune et al 2022
- “ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers”, Yao et al 2022
- “FlashAttention: Fast and Memory-Efficient Exact Attention With IO-Awareness”, Dao et al 2022
- “A Low-Latency Communication Design for Brain Simulations”, Du 2022
- “Reducing Activation Recomputation in Large Transformer Models”, Korthikanti et al 2022
- “What Language Model to Train If You Have One Million GPU Hours?”, Scao et al 2022
- “Monarch: Expressive Structured Matrices for Efficient and Accurate Training”, Dao et al 2022
- “Pathways: Asynchronous Distributed Dataflow for ML”, Barham et al 2022
- “LiteTransformerSearch: Training-Free Neural Architecture Search for Efficient Language Models”, Javaheripi et al 2022
- “Singularity: Planet-Scale, Preemptive and Elastic Scheduling of AI Workloads”, Shukla et al 2022
- “Maximizing Communication Efficiency for Large-Scale Training via 0/1 Adam”, Lu et al 2022
- “Introducing the AI Research SuperCluster—Meta’s Cutting-Edge AI Supercomputer for AI Research”, Lee & Sengupta 2022
- “Is Programmable Overhead Worth The Cost? How Much Do We Pay for a System to Be Programmable? It Depends upon Who You Ask”, Bailey 2022
- “Spiking Neural Networks and Their Applications: A Review”, Yamazaki et al 2022
- “On the Working Memory of Humans and Great Apes: Strikingly Similar or Remarkably Different?”, Read et al 2021
- “Sustainable AI: Environmental Implications, Challenges and Opportunities”, Wu et al 2021
- “China Has Already Reached Exascale—On Two Separate Systems”, Hemsoth 2021
- “The Efficiency Misnomer”, Dehghani et al 2021
- “Learning to Walk in Minutes Using Massively Parallel Deep Reinforcement Learning”, Rudin et al 2021
- “WarpDrive: Extremely Fast End-To-End Deep Multi-Agent Reinforcement Learning on a GPU”, Lan et al 2021
- “Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning”, Makoviychuk et al 2021
- “PatrickStar: Parallel Training of Pre-Trained Models via Chunk-Based Memory Management”, Fang et al 2021
- “Demonstration of Decentralized, Physics-Driven Learning”, Dillavou et al 2021
- “Chimera: Efficiently Training Large-Scale Neural Networks With Bidirectional Pipelines”, Li & Hoefler 2021
- “First-Generation Inference Accelerator Deployment at Facebook”, Anderson et al 2021
- “Single-Chip Photonic Deep Neural Network for Instantaneous Image Classification”, Ashtiani et al 2021
- “Distributed Deep Learning in Open Collaborations”, Diskin et al 2021
- “Ten Lessons From Three Generations Shaped Google’s TPUv4i”, Jouppi et al 2021
- “Maximizing 3-D Parallelism in Distributed Training for Huge Neural Networks”, Bian et al 2021
- “2.5-Dimensional Distributed Model Training”, Wang et al 2021
- “A Full-Stack Accelerator Search Technique for Vision Applications”, Zhang et al 2021
- “ChinAI #141: The PanGu Origin Story: Notes from an Informative Zhihu Thread on PanGu”, Ding 2021
- “GSPMD: General and Scalable Parallelization for ML Computation Graphs”, Xu et al 2021
- “PanGu-α: Large-Scale Autoregressive Pretrained Chinese Language Models With Auto-Parallel Computation”, Zeng et al 2021
- “ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning”, Rajbhandari et al 2021
- “How to Train BERT With an Academic Budget”, Izsak et al 2021
- “Podracer Architectures for Scalable Reinforcement Learning”, Hessel et al 2021
- “High-Performance, Distributed Training of Large-Scale Deep Learning Recommendation Models (DLRMs)”, Mudigere et al 2021
- “An Efficient 2D Method for Training Super-Large Deep Learning Models”, Xu et al 2021
- “Efficient Large-Scale Language Model Training on GPU Clusters”, Narayanan et al 2021
- “Large Batch Simulation for Deep Reinforcement Learning”, Shacklett et al 2021
- “Warehouse-Scale Video Acceleration (Argos): Co-Design and Deployment in the Wild”, Ranganathan et al 2021
- “TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models”, Li et al 2021
- “PipeTransformer: Automated Elastic Pipelining for Distributed Training of Transformers”, He et al 2021
- “ZeRO-Offload: Democratizing Billion-Scale Model Training”, Ren et al 2021
- “The Design Process for Google’s Training Chips: TPUv2 and TPUv3”, Norrie et al 2021
- “Hardware Beyond Backpropagation: a Photonic Co-Processor for Direct Feedback Alignment”, Launay et al 2020
- “Parallel Training of Deep Networks With Local Updates”, Laskin et al 2020
- “Exploring the Limits of Concurrency in ML Training on Google TPUs”, Kumar et al 2020
- “BytePS: A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters”, Jiang et al 2020
- “Training EfficientNets at Supercomputer Scale: 83% ImageNet Top-1 Accuracy in One Hour”, Wongpanich et al 2020
- “Matrix Engines for High Performance Computing: A Paragon of Performance or Grasping at Straws?”, Domke et al 2020
- “Direct Feedback Alignment Scales to Modern Deep Learning Tasks and Architectures”, Launay et al 2020b
- “L2L: Training Large Neural Networks With Constant Memory Using a New Execution Algorithm”, Pudipeddi et al 2020
- “Interlocking Backpropagation: Improving Depthwise Model-Parallelism”, Gomez et al 2020
- “DeepSpeed: Extreme-Scale Model Training for Everyone”, DeepSpeed Team 2020
- “Measuring Hardware Overhang”, hippke 2020
- “The Node Is Nonsense: There Are Better Ways to Measure Progress Than the Old Moore’s Law Metric”, Moore 2020
- “Are We in an AI Overhang?”, Jones 2020
- “HOBFLOPS CNNs: Hardware Optimized Bitslice-Parallel Floating-Point Operations for Convolutional Neural Networks”, Garland & Gregg 2020
- “The Computational Limits of Deep Learning”, Thompson et al 2020
- “Data Movement Is All You Need: A Case Study on Optimizing Transformers”, Ivanov et al 2020
- “PyTorch Distributed: Experiences on Accelerating Data Parallel Training”, Li et al 2020
- “Japanese Supercomputer Is Crowned World’s Speediest: In the Race for the Most Powerful Computers, Fugaku, a Japanese Supercomputer, Recently Beat American and Chinese Machines”, Clark 2020
- “Sample Factory: Egocentric 3D Control from Pixels at 100,000 FPS With Asynchronous Reinforcement Learning”, Petrenko et al 2020
- “PipeDream-2BW: Memory-Efficient Pipeline-Parallel DNN Training”, Narayanan et al 2020
- “There’s Plenty of Room at the Top: What Will Drive Computer Performance After Moore’s Law?”, Leiserson et al 2020
- “A Domain-Specific Supercomputer for Training Deep Neural Networks”, Jouppi et al 2020
- “Microsoft Announces New Supercomputer, Lays out Vision for Future AI Work”, Langston 2020
- “AI and Efficiency: We’re Releasing an Analysis Showing That Since 2012 the Amount of Compute Needed to Train a Neural Net to the Same Performance on ImageNet Classification Has Been Decreasing by a Factor of 2 Every 16 Months”, Hernandez & Brown 2020
- “Computation in the Human Cerebral Cortex Uses Less Than 0.2 Watts yet This Great Expense Is Optimal When Considering Communication Costs”, Levy & Calvert 2020
- “Startup Tenstorrent Shows AI Is Changing Computing and vice Versa: Tenstorrent Is One of the Rush of AI Chip Makers Founded in 2016 and Finally Showing Product. The New Wave of Chips Represent a Substantial Departure from How Traditional Computer Chips Work, but Also Point to Ways That Neural Network Design May Change in the Years to Come”, Ray 2020
- “AI Chips: What They Are and Why They Matter—An AI Chips Reference”, Khan & Mann 2020
- “2019 Recent Trends in GPU Price per FLOPS”, Bergal 2020
- “Pipelined Backpropagation at Scale: Training Large Models without Batches”, Kosson et al 2020
- “Ultrafast Machine Vision With 2D Material Neural Network Image Sensors”, Mennel et al 2020
- “Towards Spike-Based Machine Intelligence With Neuromorphic Computing”, Roy et al 2019
- “Checkmate: Breaking the Memory Wall With Optimal Tensor Rematerialization”, Jain et al 2019
- “Training Kinetics in 15 Minutes: Large-Scale Distributed Training on Videos”, Lin et al 2019
- “Energy and Policy Considerations for Deep Learning in NLP”, Strubell et al 2019
- “Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes”, You et al 2019
- “GAP: Generalizable Approximate Graph Partitioning Framework”, Nazi et al 2019
- “An Empirical Model of Large-Batch Training”, McCandlish et al 2018
- “Bayesian Layers: A Module for Neural Network Uncertainty”, Tran et al 2018
- “GPipe: Efficient Training of Giant Neural Networks Using Pipeline Parallelism”, Huang et al 2018
- “Measuring the Effects of Data Parallelism on Neural Network Training”, Shallue et al 2018
- “Mesh-TensorFlow: Deep Learning for Supercomputers”, Shazeer et al 2018
- “There Is Plenty of Time at the Bottom: the Economics, Risk and Ethics of Time Compression”, Sandberg 2018
- “Highly Scalable Deep Learning Training System With Mixed-Precision: Training ImageNet in 4 Minutes”, Jia et al 2018
- “AI and Compute”, Amodei et al 2018
- “Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions”, Vasilache et al 2018
- “Loihi: A Neuromorphic Manycore Processor With On-Chip Learning”, Davies et al 2018
- “Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training”, Lin et al 2017
- “Mixed Precision Training”, Micikevicius et al 2017
- “On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima”, Keskar et al 2016
- “Training Deep Nets With Sublinear Memory Cost”, Chen et al 2016
- “GeePS: Scalable Deep Learning on Distributed GPUs With a GPU-Specialized Parameter Server”, Cui et al 2016
- “Convolutional Networks for Fast, Energy-Efficient Neuromorphic Computing”, Esser et al 2016
- “Communication-Efficient Learning of Deep Networks from Decentralized Data”, McMahan et al 2016
- “Persistent RNNs: Stashing Recurrent Weights On-Chip”, Diamos et al 2016
- “The Brain As a Universal Learning Machine”, Cannell 2015
- “Scaling Distributed Machine Learning With the Parameter Server”, Li et al 2014
- “Multi-Column Deep Neural Network for Traffic Sign Classification”, Cireşan et al 2012b
- “Multi-Column Deep Neural Networks for Image Classification”, Cireşan et al 2012
- “Building High-Level Features Using Large Scale Unsupervised Learning”, Le et al 2011
- “Implications of Historical Trends in the Electrical Efficiency of Computing”, Koomey et al 2011
- “HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent”, Niu et al 2011
- “DanNet: Flexible, High Performance Convolutional Neural Networks for Image Classification”, Ciresan et al 2011
- “Goodbye 2010”, Legg 2010
- “Deep Big Simple Neural Nets Excel on Handwritten Digit Recognition”, Ciresan et al 2010
- “The Cat Is out of the Bag: Cortical Simulations With 10⁹ Neurons, 10¹³ Synapses”, Ananthanarayanan et al 2009
- “Large-Scale Deep Unsupervised Learning Using Graphics Processors”, Raina et al 2009
- “Bandwidth Optimal All-Reduce Algorithms for Clusters of Workstations”, Patarasuk & Yuan 2009
- “Whole Brain Emulation: A Roadmap”
- “Moore’s Law and the Technology S-Curve”, Bowden 2004
- “DARPA and the Quest for Machine Intelligence, 1983–1993”, Roland & Shiman 2002
- “Ultimate Physical Limits to Computation”, Lloyd 1999
- “Matrioshka Brains”
- “When Will Computer Hardware Match the Human Brain?”, Moravec 1998
- “Superhumanism: According to Hans Moravec § AI Scaling”, Platt 1995
- “A Sociological Study of the Official History of the Perceptrons Controversy [1993]”, Olazaran 1993
- “Intelligence As an Emergent Behavior; Or, The Songs of Eden”, Hillis 1988
- “The Role Of RAW POWER In INTELLIGENCE”, Moravec 1976
- “Brain Performance in FLOPS”
- “Google Demonstrates Leading Performance in Latest MLPerf Benchmarks”
- “H100 GPUs Set Standard for Gen AI in Debut MLPerf Benchmark”
- “Introducing Cerebras Inference: AI at Instant Speed”, Cerebras 2024
- “Llama-3.1-405B Now Runs at 969 Tokens/s on Cerebras Inference”
- “NVIDIA Hopper Architecture In-Depth”
- “Trends in GPU Price-Performance”
- “NVIDIA/Megatron-LM: Ongoing Research Training Transformer Models at Scale”
- “12 Hours Later, Groq Deploys Llama-3-Instruct (8 & 70B)”
- “The Technology Behind BLOOM Training”
- “From Bare Metal to a 70B Model: Infrastructure Set-Up and Scripts”
- “AI Accelerators, Part IV: The Very Rich Landscape”, Fuchs 2024
- “NVIDIA Announces DGX H100 Systems – World’s Most Advanced Enterprise AI Infrastructure”
- “NVIDIA Launches UK’s Most Powerful Supercomputer, for Research in AI and Healthcare”
- “Perlmutter, Said to Be the World's Fastest AI Supercomputer, Comes Online”
- “TensorFlow Research Cloud (TRC): Accelerate Your Cutting-Edge Machine Learning Research With Free Cloud TPUs”, TRC 2024
- “Cerebras' Tech Trains "Brain-Scale" AIs”
- “Fugaku Holds Top Spot, Exascale Remains Elusive”
- “342 Transistors for Every Person In the World: Cerebras 2nd Gen Wafer Scale Engine Teased”
- “Jim Keller Becomes CTO at Tenstorrent: "The Most Promising Architecture Out There"”
- “NVIDIA Unveils Grace: A High-Performance Arm Server CPU For Use In Big AI Systems”
- “Cerebras Unveils Wafer Scale Engine Two (WSE2): 2.6 Trillion Transistors, 100% Yield”
- “AMD Announces Instinct MI200 Accelerator Family: Taking Servers to Exascale and Beyond”
- “NVIDIA Hopper GPU Architecture and H100 Accelerator Announced: Working Smarter and Harder”
- “Biological Anchors: A Trick That Might Or Might Not Work”
- “Scaling Up and Out: Training Massive Models on Cerebras Systems Using Weight Streaming”
- “Fermi Estimate of Future Training Runs”
- “Carl Shulman #2: AI Takeover, Bio & Cyber Attacks, Detecting Deception, & Humanity's Far Future”
- “Etched Is Making the Biggest Bet in AI”
- “The Emerging Age of AI Diplomacy: To Compete With China, the United States Must Walk a Tightrope in the Gulf”
- “The Resilience Myth: Fatal Flaws in the Push to Secure Chip Supply Chains”
- “Compute Funds and Pre-Trained Models”
- “The Next Big Thing: Introducing IPU-POD128 and IPU-POD256”
- “The WoW Factor: Graphcore Systems Get Huge Power and Efficiency Boost”
- “AWS Enables 4,000-GPU UltraClusters With New P4 A100 Instances”
- “Estimating Training Compute of Deep Learning Models”
- “The Colliding Exponentials of AI”
- “Moore's Law, AI, and the pace of Progress”
- “How Fast Can We Perform a Forward Pass?”
- “"AI and Compute" Trend Isn't Predictive of What Is Happening”
- “Brain Efficiency: Much More Than You Wanted to Know”
- “DeepSpeed: Accelerating Large-Scale Model Inference and Training via System Optimizations and Compression”
- “ZeRO-Infinity and DeepSpeed: Unlocking Unprecedented Model Scale for Deep Learning Training”
- “The World’s Largest Computer Chip”
- “The Billion Dollar AI Problem That Just Keeps Scaling”
- “TSMC Confirms 3nm Tech for 2022, Could Enable Epic 80 Billion Transistor GPUs”
- “ORNL’s Frontier First to Break the Exaflop Ceiling”
- “How to Accelerate Innovation With AI at Scale”
- “48:44—Tesla Vision · 1:13:12—Planning and Control · 1:24:35—Manual Labeling · 1:28:11—Auto Labeling · 1:35:15—Simulation · 1:42:10—Hardware Integration · 1:45:40—Dojo”
- lepikhin
- Sort By Magic
- Wikipedia
- Miscellaneous
- Bibliography
See Also
Gwern
“Hardware Hedging Against Scaling Regime Shifts”, Gwern 2024
“Computer Optimization: Your Computer Is Faster Than You Think”, Gwern 2021
Computer Optimization: Your Computer Is Faster Than You Think
“Slowing Moore’s Law: How It Could Happen”, Gwern 2012
Links
“Getting AI Datacenters in the UK: Why the UK Needs to Create Special Compute Zones; and How to Do It”, Wiseman et al 2024
Getting AI datacenters in the UK: Why the UK needs to create Special Compute Zones; and how to do it
“The Future of Compute: Nvidia’s Crown Is Slipping”, Dagarwal 2024
“Jake Sullivan: The American Who Waged a Tech War on China”
Jake Sullivan: The American Who Waged a Tech War on China: https://www.wired.com/story/jake-sullivan-china-tech-profile/
“Nvidia’s AI Chips Are Cheaper to Rent in China Than US: Supply of Processors Helps Chinese Start-Ups Advance Artificial Intelligence Technology despite Washington’s Restrictions”, McMorrow & Olcott 2024
“Chips or Not, Chinese AI Pushes Ahead: A Host of Chinese AI Startups Are Attempting to Write More Efficient Code for Large Language Models”, Kao & Huang 2024
“Can AI Scaling Continue Through 2030?”, Sevilla et al 2024
“UK Government Shelves £1.3bn UK Tech and AI Plans”
“OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training”, Jaghouar et al 2024
OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training
“Huawei Faces Production Challenges With 20% Yield Rate for AI Chip”, Trendforce 2024
Huawei Faces Production Challenges with 20% Yield Rate for AI Chip
“RAM Is Practically Endless Now”, fxtentacles 2024
“Huawei ‘Unable to Secure 3.5 Nanometer Chips’”, Choi 2024
“China Is Losing the Chip War: Xi Jinping Picked a Fight over Semiconductor Technology—One He Can’t Win”, Schuman 2024
“Scalable Matmul-Free Language Modeling”, Zhu et al 2024
“Elon Musk Ordered Nvidia to Ship Thousands of AI Chips Reserved for Tesla to Twitter/xAI”, Kolodny 2024
Elon Musk ordered Nvidia to ship thousands of AI chips reserved for Tesla to Twitter/xAI
“Earnings Call: Tesla Discusses Q1 2024 Challenges and AI Expansion”, Abdulkadir 2024
Earnings call: Tesla Discusses Q1 2024 Challenges and AI Expansion
“Microsoft, OpenAI Plan $100 Billion Data-Center Project, Media Report Says”, Reuters 2024
Microsoft, OpenAI plan $100 billion data-center project, media report says
“AI and Memory Wall”, Gholami et al 2024
“Singapore’s Temasek in Discussions to Invest in OpenAI: State-Backed Group in Talks With ChatGPT Maker’s Chief Sam Altman Who Is Seeking Funding to Build Chips Business”, Murgia & Ruehl 2024
“China’s Military and Government Acquire Nvidia Chips despite US Ban”, Baptista 2024
China’s military and government acquire Nvidia chips despite US ban
“Generative AI Beyond LLMs: System Implications of Multi-Modal Generation”, Golden et al 2023
Generative AI Beyond LLMs: System Implications of Multi-Modal Generation
“Real-Time AI & The Future of AI Hardware”, Uberti 2023
“OpenAI Agreed to Buy $51 Million of AI Chips From a Startup Backed by CEO Sam Altman”, Dave 2023
OpenAI Agreed to Buy $51 Million of AI Chips From a Startup Backed by CEO Sam Altman
“How Jensen Huang’s Nvidia Is Powering the AI Revolution: The Company’s CEO Bet It All on a New Kind of Chip. Now That Nvidia Is One of the Biggest Companies in the World, What Will He Do Next?”, Witt 2023
“Microsoft Swallows OpenAI’s Core Team § Compute Is King”, Patel & Nishball 2023
“Altman Sought Billions For Chip Venture Before OpenAI Ouster: Altman Was Fundraising in the Middle East for New Chip Venture; The Project, Code-Named Tigris, Is Intended to Rival Nvidia”, Ludlow & Vance 2023
“DiLoCo: Distributed Low-Communication Training of Language Models”, Douillard et al 2023
DiLoCo: Distributed Low-Communication Training of Language Models
“LSS Transformer: Ultra-Long Sequence Distributed Transformer”, Wang et al 2023
LSS Transformer: Ultra-Long Sequence Distributed Transformer
“ChipNeMo: Domain-Adapted LLMs for Chip Design”, Liu et al 2023
wagieeacc @ "2023-10-17"
“Saudi-China Collaboration Raises Concerns about Access to AI Chips: Fears Grow at Gulf Kingdom’s Top University That Ties to Chinese Researchers Risk Upsetting US Government”, Kerr et al 2023
“Efficient Video and Audio Processing With Loihi 2”, Shrestha et al 2023
“Biden Is Beating China on Chips. It May Not Be Enough.”, Wang 2023
“Deep Mind’s Chief on AI’s Dangers—And the UK’s £900 Million Supercomputer: Demis Hassabis Says We Shouldn’t Let AI Fall into the Wrong Hands and the Government’s Plan to Build a Supercomputer for AI Is Likely to Be out of Date Before It Has Even Started”, Sellman 2023
“Inflection AI Announces $1.3 Billion of Funding Led by Current Investors, Microsoft, and NVIDIA”, Inflection AI 2023
Inflection AI announces $1.3 billion of funding led by current investors, Microsoft, and NVIDIA
“U.S. Considers New Curbs on AI Chip Exports to China: Restrictions Come amid Concerns That China Could Use AI Chips from Nvidia and Others for Weapon Development and Hacking”, Fitch et al 2023
“Unleashing True Utility Computing With Quicksand”, Ruan et al 2023
“The AI Boom Runs on Chips, but It Can’t Get Enough: ‘It’s like Toilet Paper during the Pandemic.’ Startups, Investors Scrounge for Computational Firepower”, Seetharaman & Dotan 2023
“Big-PERCIVAL: Exploring the Native Use of 64-Bit Posit Arithmetic in Scientific Computing”, Mallasén et al 2023
Big-PERCIVAL: Exploring the Native Use of 64-Bit Posit Arithmetic in Scientific Computing
davidtayar5 @ "2023-02-10"
Context on the NVIDIA ChatGPT opportunity—and ramifications of large language model enthusiasm
“SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient”, Ryabinin et al 2023
SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient
“Microsoft and OpenAI Extend Partnership”, Microsoft 2023
“A 64-Core Mixed-Signal In-Memory Compute Chip Based on Phase-Change Memory for Deep Neural Network Inference”, Gallo et al 2022
“Efficiently Scaling Transformer Inference”, Pope et al 2022
“Reserve Capacity of NVIDIA HGX H100s on CoreWeave Now: Available at Scale in Q1 2023 Starting at $2.23/hr”, CoreWeave 2022
“Petals: Collaborative Inference and Fine-Tuning of Large Models”, Borzunov et al 2022
Petals: Collaborative Inference and Fine-tuning of Large Models
“Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training”, You et al 2022
Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training
“Is Integer Arithmetic Enough for Deep Learning Training?”, Ghaffari et al 2022
“Efficient NLP Inference at the Edge via Elastic Pipelining”, Guo et al 2022
“Training Transformers Together”, Borzunov et al 2022
“Tutel: Adaptive Mixture-Of-Experts at Scale”, Hwang et al 2022
“8-Bit Numerical Formats for Deep Neural Networks”, Noune et al 2022
“ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers”, Yao et al 2022
ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers
“FlashAttention: Fast and Memory-Efficient Exact Attention With IO-Awareness”, Dao et al 2022
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
“A Low-Latency Communication Design for Brain Simulations”, Du 2022
“Reducing Activation Recomputation in Large Transformer Models”, Korthikanti et al 2022
Reducing Activation Recomputation in Large Transformer Models
“What Language Model to Train If You Have One Million GPU Hours?”, Scao et al 2022
What Language Model to Train if You Have One Million GPU Hours?
“Monarch: Expressive Structured Matrices for Efficient and Accurate Training”, Dao et al 2022
Monarch: Expressive Structured Matrices for Efficient and Accurate Training
“Pathways: Asynchronous Distributed Dataflow for ML”, Barham et al 2022
“LiteTransformerSearch: Training-Free Neural Architecture Search for Efficient Language Models”, Javaheripi et al 2022
LiteTransformerSearch: Training-free Neural Architecture Search for Efficient Language Models
“Singularity: Planet-Scale, Preemptive and Elastic Scheduling of AI Workloads”, Shukla et al 2022
Singularity: Planet-Scale, Preemptive and Elastic Scheduling of AI Workloads
“Maximizing Communication Efficiency for Large-Scale Training via 0/1 Adam”, Lu et al 2022
Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam
“Introducing the AI Research SuperCluster—Meta’s Cutting-Edge AI Supercomputer for AI Research”, Lee & Sengupta 2022
Introducing the AI Research SuperCluster—Meta’s cutting-edge AI supercomputer for AI research
“Is Programmable Overhead Worth The Cost? How Much Do We Pay for a System to Be Programmable? It Depends upon Who You Ask”, Bailey 2022
“Spiking Neural Networks and Their Applications: A Review”, Yamazaki et al 2022
“On the Working Memory of Humans and Great Apes: Strikingly Similar or Remarkably Different?”, Read et al 2021
On the Working Memory of Humans and Great Apes: Strikingly Similar or Remarkably Different?
“Sustainable AI: Environmental Implications, Challenges and Opportunities”, Wu et al 2021
Sustainable AI: Environmental Implications, Challenges and Opportunities
“China Has Already Reached Exascale—On Two Separate Systems”, Hemsoth 2021
“The Efficiency Misnomer”, Dehghani et al 2021
“Learning to Walk in Minutes Using Massively Parallel Deep Reinforcement Learning”, Rudin et al 2021
Learning to Walk in Minutes Using Massively Parallel Deep Reinforcement Learning
“WarpDrive: Extremely Fast End-To-End Deep Multi-Agent Reinforcement Learning on a GPU”, Lan et al 2021
WarpDrive: Extremely Fast End-to-End Deep Multi-Agent Reinforcement Learning on a GPU
“Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning”, Makoviychuk et al 2021
Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning
“PatrickStar: Parallel Training of Pre-Trained Models via Chunk-Based Memory Management”, Fang et al 2021
PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management
“Demonstration of Decentralized, Physics-Driven Learning”, Dillavou et al 2021
“Chimera: Efficiently Training Large-Scale Neural Networks With Bidirectional Pipelines”, Li & Hoefler 2021
Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines
“First-Generation Inference Accelerator Deployment at Facebook”, Anderson et al 2021
First-Generation Inference Accelerator Deployment at Facebook
“Single-Chip Photonic Deep Neural Network for Instantaneous Image Classification”, Ashtiani et al 2021
Single-chip photonic deep neural network for instantaneous image classification
“Distributed Deep Learning in Open Collaborations”, Diskin et al 2021
“Ten Lessons From Three Generations Shaped Google’s TPUv4i”, Jouppi et al 2021
“Maximizing 3-D Parallelism in Distributed Training for Huge Neural Networks”, Bian et al 2021
Maximizing 3-D Parallelism in Distributed Training for Huge Neural Networks
“2.5-Dimensional Distributed Model Training”, Wang et al 2021
“A Full-Stack Accelerator Search Technique for Vision Applications”, Zhang et al 2021
A Full-stack Accelerator Search Technique for Vision Applications
“ChinAI #141: The PanGu Origin Story: Notes from an Informative Zhihu Thread on PanGu”, Ding 2021
ChinAI #141: The PanGu Origin Story: Notes from an informative Zhihu Thread on PanGu
“GSPMD: General and Scalable Parallelization for ML Computation Graphs”, Xu et al 2021
GSPMD: General and Scalable Parallelization for ML Computation Graphs
“PanGu-α: Large-Scale Autoregressive Pretrained Chinese Language Models With Auto-Parallel Computation”, Zeng et al 2021
“ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning”, Rajbhandari et al 2021
ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning
“How to Train BERT With an Academic Budget”, Izsak et al 2021
“Podracer Architectures for Scalable Reinforcement Learning”, Hessel et al 2021
“High-Performance, Distributed Training of Large-Scale Deep Learning Recommendation Models (DLRMs)”, Mudigere et al 2021
High-performance, Distributed Training of Large-scale Deep Learning Recommendation Models (DLRMs)
“An Efficient 2D Method for Training Super-Large Deep Learning Models”, Xu et al 2021
An Efficient 2D Method for Training Super-Large Deep Learning Models
“Efficient Large-Scale Language Model Training on GPU Clusters”, Narayanan et al 2021
Efficient Large-Scale Language Model Training on GPU Clusters
“Large Batch Simulation for Deep Reinforcement Learning”, Shacklett et al 2021
“Warehouse-Scale Video Acceleration (Argos): Co-Design and Deployment in the Wild”, Ranganathan et al 2021
Warehouse-Scale Video Acceleration (Argos): Co-design and Deployment in the Wild
“TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models”, Li et al 2021
TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models
“PipeTransformer: Automated Elastic Pipelining for Distributed Training of Transformers”, He et al 2021
PipeTransformer: Automated Elastic Pipelining for Distributed Training of Transformers
“ZeRO-Offload: Democratizing Billion-Scale Model Training”, Ren et al 2021
“The Design Process for Google’s Training Chips: TPUv2 and TPUv3”, Norrie et al 2021
The Design Process for Google’s Training Chips: TPUv2 and TPUv3
“Hardware Beyond Backpropagation: a Photonic Co-Processor for Direct Feedback Alignment”, Launay et al 2020
Hardware Beyond Backpropagation: a Photonic Co-Processor for Direct Feedback Alignment
“Parallel Training of Deep Networks With Local Updates”, Laskin et al 2020
“Exploring the Limits of Concurrency in ML Training on Google TPUs”, Kumar et al 2020
Exploring the limits of Concurrency in ML Training on Google TPUs
“BytePS: A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters”, Jiang et al 2020
“Training EfficientNets at Supercomputer Scale: 83% ImageNet Top-1 Accuracy in One Hour”, Wongpanich et al 2020
Training EfficientNets at Supercomputer Scale: 83% ImageNet Top-1 Accuracy in One Hour
“Matrix Engines for High Performance Computing: A Paragon of Performance or Grasping at Straws?”, Domke et al 2020
Matrix Engines for High Performance Computing: A Paragon of Performance or Grasping at Straws?
“Direct Feedback Alignment Scales to Modern Deep Learning Tasks and Architectures”, Launay et al 2020b
Direct Feedback Alignment Scales to Modern Deep Learning Tasks and Architectures
“L2L: Training Large Neural Networks With Constant Memory Using a New Execution Algorithm”, Pudipeddi et al 2020
L2L: Training Large Neural Networks with Constant Memory using a New Execution Algorithm
“Interlocking Backpropagation: Improving Depthwise Model-Parallelism”, Gomez et al 2020
Interlocking Backpropagation: Improving depthwise model-parallelism
“DeepSpeed: Extreme-Scale Model Training for Everyone”, DeepSpeed Team 2020
“Measuring Hardware Overhang”, hippke 2020
“The Node Is Nonsense: There Are Better Ways to Measure Progress Than the Old Moore’s Law Metric”, Moore 2020
The Node Is Nonsense: There are better ways to measure progress than the old Moore’s law metric
“Are We in an AI Overhang?”, Jones 2020
“HOBFLOPS CNNs: Hardware Optimized Bitslice-Parallel Floating-Point Operations for Convolutional Neural Networks”, Garland & Gregg 2020
“The Computational Limits of Deep Learning”, Thompson et al 2020
“Data Movement Is All You Need: A Case Study on Optimizing Transformers”, Ivanov et al 2020
Data Movement Is All You Need: A Case Study on Optimizing Transformers
“PyTorch Distributed: Experiences on Accelerating Data Parallel Training”, Li et al 2020
PyTorch Distributed: Experiences on Accelerating Data Parallel Training
“Japanese Supercomputer Is Crowned World’s Speediest: In the Race for the Most Powerful Computers, Fugaku, a Japanese Supercomputer, Recently Beat American and Chinese Machines”, Clark 2020
“Sample Factory: Egocentric 3D Control from Pixels at 100,000 FPS With Asynchronous Reinforcement Learning”, Petrenko et al 2020
“PipeDream-2BW: Memory-Efficient Pipeline-Parallel DNN Training”, Narayanan et al 2020
PipeDream-2BW: Memory-Efficient Pipeline-Parallel DNN Training
“There’s Plenty of Room at the Top: What Will Drive Computer Performance After Moore’s Law?”, Leiserson et al 2020
There’s plenty of room at the Top: What will drive computer performance after Moore’s law?
“A Domain-Specific Supercomputer for Training Deep Neural Networks”, Jouppi et al 2020
A domain-specific supercomputer for training deep neural networks
“Microsoft Announces New Supercomputer, Lays out Vision for Future AI Work”, Langston 2020
Microsoft announces new supercomputer, lays out vision for future AI work
“AI and Efficiency: We’re Releasing an Analysis Showing That Since 2012 the Amount of Compute Needed to Train a Neural Net to the Same Performance on ImageNet Classification Has Been Decreasing by a Factor of 2 Every 16 Months”, Hernandez & Brown 2020
“Computation in the Human Cerebral Cortex Uses Less Than 0.2 Watts yet This Great Expense Is Optimal When Considering Communication Costs”, Levy & Calvert 2020
“Startup Tenstorrent Shows AI Is Changing Computing and vice Versa: Tenstorrent Is One of the Rush of AI Chip Makers Founded in 2016 and Finally Showing Product. The New Wave of Chips Represent a Substantial Departure from How Traditional Computer Chips Work, but Also Point to Ways That Neural Network Design May Change in the Years to Come”, Ray 2020
“AI Chips: What They Are and Why They Matter—An AI Chips Reference”, Khan & Mann 2020
AI Chips: What They Are and Why They Matter—An AI Chips Reference
“2019 Recent Trends in GPU Price per FLOPS”, Bergal 2020
“Pipelined Backpropagation at Scale: Training Large Models without Batches”, Kosson et al 2020
Pipelined Backpropagation at Scale: Training Large Models without Batches
“Ultrafast Machine Vision With 2D Material Neural Network Image Sensors”, Mennel et al 2020
Ultrafast machine vision with 2D material neural network image sensors
“Towards Spike-Based Machine Intelligence With Neuromorphic Computing”, Roy et al 2019
Towards spike-based machine intelligence with neuromorphic computing
“Checkmate: Breaking the Memory Wall With Optimal Tensor Rematerialization”, Jain et al 2019
Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization
“Training Kinetics in 15 Minutes: Large-Scale Distributed Training on Videos”, Lin et al 2019
Training Kinetics in 15 Minutes: Large-scale Distributed Training on Videos
“Energy and Policy Considerations for Deep Learning in NLP”, Strubell et al 2019
“Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes”, You et al 2019
Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
“GAP: Generalizable Approximate Graph Partitioning Framework”, Nazi et al 2019
“An Empirical Model of Large-Batch Training”, McCandlish et al 2018
“Bayesian Layers: A Module for Neural Network Uncertainty”, Tran et al 2018
“GPipe: Efficient Training of Giant Neural Networks Using Pipeline Parallelism”, Huang et al 2018
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
“Measuring the Effects of Data Parallelism on Neural Network Training”, Shallue et al 2018
Measuring the Effects of Data Parallelism on Neural Network Training
“Mesh-TensorFlow: Deep Learning for Supercomputers”, Shazeer et al 2018
“There Is Plenty of Time at the Bottom: the Economics, Risk and Ethics of Time Compression”, Sandberg 2018
There is plenty of time at the bottom: the economics, risk and ethics of time compression
“Highly Scalable Deep Learning Training System With Mixed-Precision: Training ImageNet in 4 Minutes”, Jia et al 2018
Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in 4 Minutes
“AI and Compute”, Amodei et al 2018
“Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions”, Vasilache et al 2018
Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions
“Loihi: A Neuromorphic Manycore Processor With On-Chip Learning”, Davies et al 2018
Loihi: A Neuromorphic Manycore Processor with On-Chip Learning
“Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training”, Lin et al 2017
Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training
“Mixed Precision Training”, Micikevicius et al 2017
“On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima”, Keskar et al 2016
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
“Training Deep Nets With Sublinear Memory Cost”, Chen et al 2016
“GeePS: Scalable Deep Learning on Distributed GPUs With a GPU-Specialized Parameter Server”, Cui et al 2016
GeePS: scalable deep learning on distributed GPUs with a GPU-specialized parameter server
“Convolutional Networks for Fast, Energy-Efficient Neuromorphic Computing”, Esser et al 2016
Convolutional Networks for Fast, Energy-Efficient Neuromorphic Computing
“Communication-Efficient Learning of Deep Networks from Decentralized Data”, McMahan et al 2016
Communication-Efficient Learning of Deep Networks from Decentralized Data
“Persistent RNNs: Stashing Recurrent Weights On-Chip”, Diamos et al 2016
“The Brain As a Universal Learning Machine”, Cannell 2015
“Scaling Distributed Machine Learning With the Parameter Server”, Li et al 2014
Scaling Distributed Machine Learning with the Parameter Server
“Multi-Column Deep Neural Network for Traffic Sign Classification”, Cireşan et al 2012b
Multi-column deep neural network for traffic sign classification
“Multi-Column Deep Neural Networks for Image Classification”, Cireşan et al 2012
“Building High-Level Features Using Large Scale Unsupervised Learning”, Le et al 2011
Building high-level features using large scale unsupervised learning
“Implications of Historical Trends in the Electrical Efficiency of Computing”, Koomey et al 2011
Implications of Historical Trends in the Electrical Efficiency of Computing
“HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent”, Niu et al 2011
HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent
“DanNet: Flexible, High Performance Convolutional Neural Networks for Image Classification”, Ciresan et al 2011
DanNet: Flexible, High Performance Convolutional Neural Networks for Image Classification
“Goodbye 2010”, Legg 2010
“Deep Big Simple Neural Nets Excel on Handwritten Digit Recognition”, Ciresan et al 2010
Deep Big Simple Neural Nets Excel on Handwritten Digit Recognition
“The Cat Is out of the Bag: Cortical Simulations With 10⁹ Neurons, 10¹³ Synapses”, Ananthanarayanan et al 2009
The cat is out of the bag: cortical simulations with 10⁹ neurons, 10¹³ synapses
“Large-Scale Deep Unsupervised Learning Using Graphics Processors”, Raina et al 2009
Large-scale deep unsupervised learning using graphics processors
“Bandwidth Optimal All-Reduce Algorithms for Clusters of Workstations”, Patarasuk & Yuan 2009
Bandwidth optimal all-reduce algorithms for clusters of workstations
“Whole Brain Emulation: A Roadmap”
“Moore’s Law and the Technology S-Curve”, Bowden 2004
“DARPA and the Quest for Machine Intelligence, 1983–1993”, Roland & Shiman 2002
“Ultimate Physical Limits to Computation”, Lloyd 1999
“Matrioshka Brains”
“When Will Computer Hardware Match the Human Brain?”, Moravec 1998
“Superhumanism: According to Hans Moravec § AI Scaling”, Platt 1995
“A Sociological Study of the Official History of the Perceptrons Controversy [1993]”, Olazaran 1993
A Sociological Study of the Official History of the Perceptrons Controversy [1993]
“Intelligence As an Emergent Behavior; Or, The Songs of Eden”, Hillis 1988
Intelligence as an Emergent Behavior; or, The Songs of Eden
“The Role Of RAW POWER In INTELLIGENCE”, Moravec 1976
“Brain Performance in FLOPS”
“Google Demonstrates Leading Performance in Latest MLPerf Benchmarks”
Google demonstrates leading performance in latest MLPerf Benchmarks
“H100 GPUs Set Standard for Gen AI in Debut MLPerf Benchmark”
“Introducing Cerebras Inference: AI at Instant Speed”, Cerebras 2024
“Llama-3.1-405B Now Runs at 969 Tokens/s on Cerebras Inference”
Llama-3.1-405B now runs at 969 tokens/s on Cerebras Inference
“NVIDIA Hopper Architecture In-Depth”
“Trends in GPU Price-Performance”
“NVIDIA/Megatron-LM: Ongoing Research Training Transformer Models at Scale”
NVIDIA/Megatron-LM: Ongoing research training transformer models at scale
“12 Hours Later, Groq Deploys Llama-3-Instruct (8 & 70B)”
“The Technology Behind BLOOM Training”
“From Bare Metal to a 70B Model: Infrastructure Set-Up and Scripts”
From bare metal to a 70B model: infrastructure set-up and scripts
“AI Accelerators, Part IV: The Very Rich Landscape”, Fuchs 2024
“NVIDIA Announces DGX H100 Systems – World’s Most Advanced Enterprise AI Infrastructure”
NVIDIA Announces DGX H100 Systems – World’s Most Advanced Enterprise AI Infrastructure
“NVIDIA Launches UK’s Most Powerful Supercomputer, for Research in AI and Healthcare”
NVIDIA Launches UK’s Most Powerful Supercomputer, for Research in AI and Healthcare
“Perlmutter, Said to Be the World's Fastest AI Supercomputer, Comes Online”
Perlmutter, said to be the world's fastest AI supercomputer, comes online
“TensorFlow Research Cloud (TRC): Accelerate Your Cutting-Edge Machine Learning Research With Free Cloud TPUs”, TRC 2024
“Cerebras' Tech Trains "Brain-Scale" AIs”
“Fugaku Holds Top Spot, Exascale Remains Elusive”
“342 Transistors for Every Person In the World: Cerebras 2nd Gen Wafer Scale Engine Teased”
342 Transistors for Every Person In the World: Cerebras 2nd Gen Wafer Scale Engine Teased
“Jim Keller Becomes CTO at Tenstorrent: "The Most Promising Architecture Out There"”
Jim Keller Becomes CTO at Tenstorrent: "The Most Promising Architecture Out There"
“NVIDIA Unveils Grace: A High-Performance Arm Server CPU For Use In Big AI Systems”
NVIDIA Unveils Grace: A High-Performance Arm Server CPU For Use In Big AI Systems
“Cerebras Unveils Wafer Scale Engine Two (WSE2): 2.6 Trillion Transistors, 100% Yield”
Cerebras Unveils Wafer Scale Engine Two (WSE2): 2.6 Trillion Transistors, 100% Yield
“AMD Announces Instinct MI200 Accelerator Family: Taking Servers to Exascale and Beyond”
AMD Announces Instinct MI200 Accelerator Family: Taking Servers to Exascale and Beyond
“NVIDIA Hopper GPU Architecture and H100 Accelerator Announced: Working Smarter and Harder”
NVIDIA Hopper GPU Architecture and H100 Accelerator Announced: Working Smarter and Harder
“Biological Anchors: A Trick That Might Or Might Not Work”
“Scaling Up and Out: Training Massive Models on Cerebras Systems Using Weight Streaming”
Scaling Up and Out: Training Massive Models on Cerebras Systems using Weight Streaming
“Fermi Estimate of Future Training Runs”
“Carl Shulman #2: AI Takeover, Bio & Cyber Attacks, Detecting Deception, & Humanity's Far Future”
Carl Shulman #2: AI Takeover, Bio & Cyber Attacks, Detecting Deception, & Humanity's Far Future
“Etched Is Making the Biggest Bet in AI”
“The Emerging Age of AI Diplomacy: To Compete With China, the United States Must Walk a Tightrope in the Gulf”
“The Resilience Myth: Fatal Flaws in the Push to Secure Chip Supply Chains”
The resilience myth: fatal flaws in the push to secure chip supply chains
“Compute Funds and Pre-Trained Models”
“The Next Big Thing: Introducing IPU-POD128 and IPU-POD256”
“The WoW Factor: Graphcore Systems Get Huge Power and Efficiency Boost”
The WoW Factor: Graphcore systems get huge power and efficiency boost
“AWS Enables 4,000-GPU UltraClusters With New P4 A100 Instances”
AWS Enables 4,000-GPU UltraClusters with New P4 A100 Instances
“Estimating Training Compute of Deep Learning Models”
“The Colliding Exponentials of AI”
“Moore's Law, AI, and the pace of Progress”
“How Fast Can We Perform a Forward Pass?”
“"AI and Compute" Trend Isn't Predictive of What Is Happening”
"AI and Compute" trend isn't predictive of what is happening:
“Brain Efficiency: Much More Than You Wanted to Know”
“DeepSpeed: Accelerating Large-Scale Model Inference and Training via System Optimizations and Compression”
“ZeRO-Infinity and DeepSpeed: Unlocking Unprecedented Model Scale for Deep Learning Training”
ZeRO-Infinity and DeepSpeed: Unlocking unprecedented model scale for deep learning training
“The World’s Largest Computer Chip”
“The Billion Dollar AI Problem That Just Keeps Scaling”
“TSMC Confirms 3nm Tech for 2022, Could Enable Epic 80 Billion Transistor GPUs”
TSMC confirms 3nm tech for 2022, could enable epic 80 billion transistor GPUs
“ORNL’s Frontier First to Break the Exaflop Ceiling”
“How to Accelerate Innovation With AI at Scale”
“48:44—Tesla Vision · 1:13:12—Planning and Control · 1:24:35—Manual Labeling · 1:28:11—Auto Labeling · 1:35:15—Simulation · 1:42:10—Hardware Integration · 1:45:40—Dojo”
lepikhin
Sort By Magic
Annotations sorted by machine learning into inferred 'tags'. This provides an alternative way to browse: instead of by date order, one can browse in topic order. The 'sorted' list has been automatically clustered into multiple sections & auto-labeled for easier browsing.
Beginning with the newest annotation, it uses the embedding of each annotation to attempt to create a list of nearest-neighbor annotations, creating a progression of topics. For more details, see the link.
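As a rough illustration of the ordering described above (not the site's actual code), the sketch below greedily chains annotations by embedding similarity, starting from the newest one. The function name `sort_by_magic`, the cosine-similarity metric, and the toy data are assumptions of the sketch; the real feature additionally clusters the resulting list into auto-labeled sections such as the tags listed below.

```python
import numpy as np

def sort_by_magic(embeddings: np.ndarray) -> list[int]:
    """Return annotation indices ordered as a topical progression.

    Row 0 of `embeddings` is assumed to be the newest annotation.
    """
    # Normalize rows so that dot products are cosine similarities.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    order = [0]                       # start from the newest annotation
    remaining = set(range(1, len(unit)))
    while remaining:
        last = unit[order[-1]]
        # Greedily append the unvisited annotation most similar to the last
        # one placed, so adjacent entries tend to share a topic.
        nearest = max(remaining, key=lambda i: float(last @ unit[i]))
        order.append(nearest)
        remaining.remove(nearest)
    return order

# Toy usage: 5 random "annotation" embeddings of dimension 32.
rng = np.random.default_rng(0)
print(sort_by_magic(rng.normal(size=(5, 32))))
```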
ai-hardware
compute-access
large-scale-dnn
Wikipedia
Miscellaneous
- /doc/ai/scaling/mixture-of-experts/2021-04-12-jensenhuang-gtc2021keynote-eAn_oiZwUXA.en.vtt.txt
- /doc/ai/scaling/hardware/2021-jouppi-table1-keycharacteristicsoftpus.png
- /doc/ai/scaling/hardware/2021-ren-zerooffload-cpugpudataflow.png
- /doc/ai/scaling/hardware/2020-08-05-hippke-measuringhardwareoverhang-chessscaling19902020.png
- /doc/ai/scaling/hardware/2020-kumar-figure11-tpumultipodspeedups.png
- /doc/ai/scaling/hardware/2017-jouppi.pdf
- /doc/ai/scaling/hardware/1998-moravec-figure2-evolutionofcomputerpowercost19001998.csv
- /doc/ai/scaling/hardware/1998-moravec-figure2-evolutionofcomputerpowercost19001998.jpg
- /doc/ai/scaling/hardware/1998-moravec-figure3-peakcomputeuseinai19501998.jpg
- https://ai.facebook.com/blog/meta-training-inference-accelerator-AI-MTIA/
- https://ai.meta.com/blog/next-generation-meta-training-inference-accelerator-AI-MTIA/
- https://blogs.nvidia.com/blog/2021/04/12/cpu-grace-cscs-alps/
- https://blogs.nvidia.com/blog/2022/09/08/hopper-mlperf-inference/
- https://caseyhandmer.wordpress.com/2024/03/12/how-to-feed-the-ais/
- https://chipsandcheese.com/2023/07/02/nvidias-h100-funny-l2-and-tons-of-bandwidth/
- https://cloud.google.com/blog/topics/systems/the-evolution-of-googles-jupiter-data-center-network
- https://evabehrens.substack.com/p/the-agi-race-between-the-us-and-china
- https://gpus.llm-utils.org/nvidia-h100-gpus-supply-and-demand/
- https://gpus.llm-utils.org/nvidia-h100-gpus-supply-and-demand/#how-do-the-big-clouds-compare
- https://newsletter.pragmaticengineer.com/p/scaling-chatgpt#%C2%A7five-scaling-challenges
- https://openai.com/blog/techniques-for-training-large-neural-networks/
- https://openai.com/research/scaling-kubernetes-to-7500-nodes
- https://research.google/blog/tensorstore-for-high-performance-scalable-array-storage/
- https://spectrum.ieee.org/computing/hardware/the-future-of-deep-learning-is-photonic
- https://thechipletter.substack.com/p/googles-first-tpu-architecture
- https://venturebeat.com/2020/11/17/cerebras-wafer-size-chip-is-10000-times-faster-than-a-gpu/
- https://warontherocks.com/2024/04/how-washington-can-save-its-semiconductor-controls-on-china/
- https://www.abortretry.fail/p/the-rise-and-fall-of-silicon-graphics
- https://www.cerebras.net/blog/introducing-gigagpt-gpt-3-sized-models-in-565-lines-of-code
- https://www.cerebras.net/press-release/cerebras-announces-third-generation-wafer-scale-engine
- https://www.chinatalk.media/p/new-chip-export-controls-explained
- https://www.chinatalk.media/p/new-sexport-controls-semianalysis
- https://www.ft.com/content/25337df3-5b98-4dd1-b7a9-035dcc130d6a (local mirror: /doc/www/www.ft.com/a4fffe2cf8eaafe5b95e2fac832d5df662785479.html)
- https://www.fujitsu.com/global/about/resources/news/press-releases/2024/0510-01.html
- https://www.lesswrong.com/posts/YKfNZAmiLdepDngwi/gpt-175bee
- https://www.nytimes.com/2022/10/13/us/politics/biden-china-technology-semiconductors.html
- https://www.nytimes.com/2023/07/12/magazine/semiconductor-chips-us-china.html
- https://www.reddit.com/r/MachineLearning/comments/1dlsogx/d_academic_ml_labs_how_many_gpus/
- https://www.reuters.com/technology/inside-metas-scramble-catch-up-ai-2023-04-25/
- https://www.theregister.com/2023/11/07/bing_gpu_oracle/
- https://www.yitay.net/blog/training-great-llms-entirely-from-ground-zero-in-the-wilderness
Bibliography
- https://www.theatlantic.com/international/archive/2024/06/china-microchip-technology-competition/678612/: “China Is Losing the Chip War: Xi Jinping Picked a Fight over Semiconductor Technology—One He Can’t Win”, Schuman 2024
- https://www.ft.com/content/8e8a65a0-a990-4c77-a6e8-ec4e5d247f80: “Singapore’s Temasek in Discussions to Invest in OpenAI: State-Backed Group in Talks With ChatGPT Maker’s Chief Sam Altman Who Is Seeking Funding to Build Chips Business”, Murgia & Ruehl 2024
- https://www.wired.com/story/openai-buy-ai-chips-startup-sam-altman/: “OpenAI Agreed to Buy $51 Million of AI Chips From a Startup Backed by CEO Sam Altman”, Dave 2023
- https://www.newyorker.com/magazine/2023/12/04/how-jensen-huangs-nvidia-is-powering-the-ai-revolution: “How Jensen Huang’s Nvidia Is Powering the AI Revolution: The Company’s CEO Bet It All on a New Kind of Chip. Now That Nvidia Is One of the Biggest Companies in the World, What Will He Do Next?”, Witt 2023
- https://www.semianalysis.com/p/microsoft-swallows-openais-core-team#%C2%A7compute-is-king: “Microsoft Swallows OpenAI’s Core Team § Compute Is King”, Patel & Nishball 2023
- https://www.bloomberg.com/news/articles/2023-11-19/altman-sought-billions-for-ai-chip-venture-before-openai-ouster: “Altman Sought Billions For Chip Venture Before OpenAI Ouster: Altman Was Fundraising in the Middle East for New Chip Venture; The Project, Code-Named Tigris, Is Intended to Rival Nvidia”, Ludlow & Vance 2023
- https://www.ft.com/content/2a636cee-b0d2-45c2-a815-11ca32371763: “Saudi-China Collaboration Raises Concerns about Access to AI Chips: Fears Grow at Gulf Kingdom’s Top University That Ties to Chinese Researchers Risk Upsetting US Government”, Kerr et al 2023
- https://www.nytimes.com/2023/07/16/opinion/biden-china-ai-chips-trade.html: “Biden Is Beating China on Chips. It May Not Be Enough.”, Wang 2023
- https://inflection.ai/inflection-ai-announces-1-3-billion-of-funding: “Inflection AI Announces $1.3 Billion of Funding Led by Current Investors, Microsoft, and NVIDIA”, Inflection AI 2023
- https://www.wsj.com/articles/the-ai-boom-runs-on-chips-but-it-cant-get-enough-9f76f554: “The AI Boom Runs on Chips, but It Can’t Get Enough: ‘It’s like Toilet Paper during the Pandemic.’ Startups, Investors Scrounge for Computational Firepower”, Seetharaman & Dotan 2023
- https://arxiv.org/abs/2305.06946: “Big-PERCIVAL: Exploring the Native Use of 64-Bit Posit Arithmetic in Scientific Computing”, Mallasén et al 2023
- https://x.com/davidtayar5/status/1627690520456691712: “Context on the NVIDIA ChatGPT Opportunity—And Ramifications of Large Language Model Enthusiasm”, davidtayar5 2023
- https://blogs.microsoft.com/blog/2023/01/23/microsoftandopenaiextendpartnership/: “Microsoft and OpenAI Extend Partnership”, Microsoft 2023
- https://arxiv.org/abs/2211.05102#google: “Efficiently Scaling Transformer Inference”, Pope et al 2022
- https://arxiv.org/abs/2206.03382#microsoft: “Tutel: Adaptive Mixture-Of-Experts at Scale”, Hwang et al 2022
- https://arxiv.org/abs/2206.01861#microsoft: “ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers”, Yao et al 2022
- https://arxiv.org/abs/2205.14135: “FlashAttention: Fast and Memory-Efficient Exact Attention With IO-Awareness”, Dao et al 2022
- https://arxiv.org/abs/2204.00595: “Monarch: Expressive Structured Matrices for Efficient and Accurate Training”, Dao et al 2022
- https://arxiv.org/abs/2203.02094#microsoft: “LiteTransformerSearch: Training-Free Neural Architecture Search for Efficient Language Models”, Javaheripi et al 2022
- https://arxiv.org/abs/2202.06009#microsoft: “Maximizing Communication Efficiency for Large-Scale Training via 0/1 Adam”, Lu et al 2022
- https://ai.meta.com/blog/ai-rsc/: “Introducing the AI Research SuperCluster—Meta’s Cutting-Edge AI Supercomputer for AI Research”, Lee & Sengupta 2022
- https://semiengineering.com/is-programmable-overhead-worth-the-cost/: “Is Programmable Overhead Worth The Cost? How Much Do We Pay for a System to Be Programmable? It Depends upon Who You Ask”, Bailey 2022
- https://arxiv.org/abs/2106.10207: “Distributed Deep Learning in Open Collaborations”, Diskin et al 2021
- 2021-jouppi.pdf: “Ten Lessons From Three Generations Shaped Google’s TPUv4i”, Jouppi et al 2021
- https://chinai.substack.com/p/chinai-141-the-pangu-origin-story: “ChinAI #141: The PanGu Origin Story: Notes from an Informative Zhihu Thread on PanGu”, Ding 2021
- https://arxiv.org/abs/2104.07705: “How to Train BERT With an Academic Budget”, Izsak et al 2021
- https://arxiv.org/abs/2104.06272#deepmind: “Podracer Architectures for Scalable Reinforcement Learning”, Hessel et al 2021
- https://arxiv.org/abs/2104.05343: “An Efficient 2D Method for Training Super-Large Deep Learning Models”, Xu et al 2021
- https://arxiv.org/abs/2102.07988: “TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models”, Li et al 2021
- https://arxiv.org/abs/2102.03161: “PipeTransformer: Automated Elastic Pipelining for Distributed Training of Transformers”, He et al 2021
- 2020-jiang.pdf: “BytePS: A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters”, Jiang et al 2020
- https://arxiv.org/abs/2011.00071#google: “Training EfficientNets at Supercomputer Scale: 83% ImageNet Top-1 Accuracy in One Hour”, Wongpanich et al 2020
- 2020-launay-2.pdf: “Direct Feedback Alignment Scales to Modern Deep Learning Tasks and Architectures”, Launay et al 2020b
- https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/: “DeepSpeed: Extreme-Scale Model Training for Everyone”, DeepSpeed Team 2020
- https://www.lesswrong.com/posts/N6vZEnCn6A95Xn39p/are-we-in-an-ai-overhang: “Are We in an AI Overhang?”, Jones 2020
- https://news.microsoft.com/source/features/ai/openai-azure-supercomputer/: “Microsoft Announces New Supercomputer, Lays out Vision for Future AI Work”, Langston 2020
- https://www.zdnet.com/article/startup-tenstorrent-and-competitors-show-how-computing-is-changing-ai-and-vice-versa/: “Startup Tenstorrent Shows AI Is Changing Computing and vice Versa: Tenstorrent Is One of the Rush of AI Chip Makers Founded in 2016 and Finally Showing Product. The New Wave of Chips Represent a Substantial Departure from How Traditional Computer Chips Work, but Also Point to Ways That Neural Network Design May Change in the Years to Come”, Ray 2020
- 2020-khan.pdf: “AI Chips: What They Are and Why They Matter—An AI Chips Reference”, Khan & Mann 2020
- https://arxiv.org/abs/1904.00962#google: “Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes”, You et al 2019
- https://arxiv.org/abs/1811.02084#google: “Mesh-TensorFlow: Deep Learning for Supercomputers”, Shazeer et al 2018
- https://openai.com/research/ai-and-compute: “AI and Compute”, Amodei et al 2018
- https://arxiv.org/abs/1712.01887: “Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training”, Lin et al 2017
- https://arxiv.org/abs/1102.0183#schmidhuber: “DanNet: Flexible, High Performance Convolutional Neural Networks for Image Classification”, Ciresan et al 2011
- https://www.vetta.org/2010/12/goodbye-2010/: “Goodbye 2010”, Legg 2010
- 2009-raina.pdf: “Large-Scale Deep Unsupervised Learning Using Graphics Processors”, Raina et al 2009
- 2009-patarasuk.pdf: “Bandwidth Optimal All-Reduce Algorithms for Clusters of Workstations”, Patarasuk & Yuan 2009
- https://jetpress.org/volume1/moravec.htm: “When Will Computer Hardware Match the Human Brain?”, Moravec 1998
- https://www.wired.com/1995/10/moravec/#scaling: “Superhumanism: According to Hans Moravec § AI Scaling”, Platt 1995
- 1993-olazaran.pdf: “A Sociological Study of the Official History of the Perceptrons Controversy [1993]”, Olazaran 1993
- https://web.archive.org/web/20230710000944/https://frc.ri.cmu.edu/~hpm/project.archive/general.articles/1975/Raw.Power.html: “The Role Of RAW POWER In INTELLIGENCE”, Moravec 1976
- https://sites.research.google/trc/: “TensorFlow Research Cloud (TRC): Accelerate Your Cutting-Edge Machine Learning Research With Free Cloud TPUs”, TRC 2024