2:4 Sparse Llama: Smaller Models for Efficient GPU Inference
OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training
Probing the Decision Boundaries of In-context Learning in Large Language Models
Neural Networks (MNIST Inference) on the ‘3¢’ Microcontroller
How Good Are Low-bit Quantized LLaMA-3 Models? An Empirical Study
Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws
LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models
Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design
Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws
TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones
SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
Big-PERCIVAL: Exploring the Native Use of 64-Bit Posit Arithmetic in Scientific Computing
Int-4 LLaMA is not enough—Int-3 and beyond: More compression, easier to build apps on LLMs that run locally
SpikeGPT: Generative Pre-trained Language Model with Spiking Neural Networks
Who Says Elephants Can’t Run: Bringing Large Scale MoE Models into Cloud Scale Production
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
XTC: Extreme Compression for Pre-trained Transformers Made Simple and Efficient
ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers
Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam
Is Programmable Overhead Worth The Cost? How much do we pay for a system to be programmable? It depends upon who you ask
FQ-ViT: Fully Quantized Vision Transformer without Retraining
𝜇NCA: Texture Generation with Ultra-Compact Neural Cellular Automata
Understanding and Overcoming the Challenges of Efficient Transformer Quantization
Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better
A Winning Hand: Compressing Deep Networks Can Improve Out-Of-Distribution Robustness
High-performance, Distributed Training of Large-scale Deep Learning Recommendation Models (DLRMs)
1-bit Adam: Communication Efficient Large-Scale Training with Adam’s Convergence Speed
ES-ENAS: Blackbox Optimization over Hybrid Spaces via Combinatorial and Continuous Evolution
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
L2L: Training Large Neural Networks with Constant Memory using a New Execution Algorithm
HOBFLOPS CNNs: Hardware Optimized Bitslice-Parallel Floating-Point Operations for Convolutional Neural Networks
General Purpose Text Embeddings from Pre-trained Language Models for Scalable Inference
Training with Quantization Noise for Extreme Model Compression
Moniqua: Modulo Quantized Communication in Decentralized SGD
Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers
ScaNN: Accelerating Large-Scale Inference with Anisotropic Vector Quantization
And the Bit Goes Down: Revisiting the Quantization of Neural Networks
Rethinking Numerical Representations for Deep Neural Networks
Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in 4 Minutes
Quantization Mimic: Towards Very Tiny CNN for Object Detection
Training Imagenet in 3 hours for $25; and CIFAR-10 for $0.26
Training wide residual networks for deployment using a single bit for each weight
Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training
Shift: A Zero FLOP, Zero Parameter Alternative to Spatial Convolutions
Training Simplification and Model Simplification for Deep Learning: A Minimal Effort Back Propagation Method
Compressing Word Embeddings via Deep Compositional Code Learning
Learning Discrete Weights Using the Local Reparameterization Trick
TensorQuant—A Simulation Toolbox for Deep Neural Network Quantization
Bolt: Accelerated Data Mining with Fast Vector Compression
Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
Ternary Neural Networks for Resource-Efficient AI Applications
Deep neural networks are robust to weight binarization and other non-linear distortions
Convolutional Networks for Fast, Energy-Efficient Neuromorphic Computing
XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks
Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or −1
BinaryConnect: Training Deep Neural Networks with binary weights during propagations
Efficient supervised learning in networks with binary synapses
A self-optimizing, non-symmetrical neural net for content addressable memory and pattern recognition
Building a Vector Database in 2GB for 36 Million Wikipedia Passages
FlashAttention-3: Fast and Accurate Attention With Asynchrony and Low-Precision
Xu et al 2023, Table 1: enwik8 text-prediction results using SpikeGPT and other Transformer/RNN baselines [figure]
Pope et al 2022, Figure 1: TPU cost vs. sampling latency of the PaLM-540B model on a TPU cluster [figure]
Fedus et al 2021, Figure 13: Switch Transformer knowledge vs. reasoning scaling [figure]
https://blog.pgvecto.rs/my-binary-vector-search-is-better-than-your-fp32-vectors
https://cpldcpu.wordpress.com/2024/04/24/implementing-neural-networks-on-the-10-cent-risc-v-mcu-without-multiplier/
https://github.com/THUDM/GLM-130B/blob/main/doc/quantization.md
https://github.com/vitoplantamura/OnnxStream/tree/846da873570a737b49154e8f835704264864b0fe
https://observablehq.com/@rreusser/half-precision-floating-point-visualized
https://research.google/blog/quantization-for-fast-and-environmentally-sustainable-reinforcement-learning/
https://www.reddit.com/r/LocalLLaMA/comments/1gsyp7q/humaneval_benchmark_of_exl2_quants_of_popular/
https://www.reddit.com/r/mlscaling/comments/146rgq2/chatgpt_is_running_quantized/