Bibliography:

  1. ‘NN sparsity’ tag

  2. ‘AI hardware’ tag

  3. 2:4 Sparse Llama: Smaller Models for Efficient GPU Inference

  4. Model Equality Testing: Which Model Is This API Serving?

  5. A Visual Guide to Quantization

  7. OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training

  8. Probing the Decision Boundaries of In-context Learning in Large Language Models

  9. Nemotron-4 340B Technical Report

  10. Scalable Matmul-free Language Modeling

  11. Neural Networks (MNIST Inference) on the ‘3¢’ Microcontroller

  12. How Good Are Low-bit Quantized LLaMA-3 Models? An Empirical Study

  13. Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws

  14. LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

  15. Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression

  16. The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

  17. FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design

  18. Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws

  19. TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones

  20. LLM-FP4: 4-Bit Floating-Point Quantized Transformers

  21. Training Transformers with 4-bit Integers

  22. SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression

  23. Binary and Ternary Natural Language Generation

  24. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

  25. Big-PERCIVAL: Exploring the Native Use of 64-Bit Posit Arithmetic in Scientific Computing

  26. Int-4 LLaMa is not enough—Int-3 and beyond: More compression, easier to build apps on LLMs that run locally

  27. SpikeGPT: Generative Pre-trained Language Model with Spiking Neural Networks

  28. BMT: Binarized Neural Machine Translation

  29. Self-Compressing Neural Networks

  30. Who Says Elephants Can’t Run: Bringing Large Scale MoE Models into Cloud Scale Production

  31. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

  32. Efficiently Scaling Transformer Inference

  33. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

  34. Fast DistilBERT on CPUs

  35. Broken Neural Scaling Laws

  36. GLM-130B: An Open Bilingual Pre-trained Model

  37. FP8 Formats for Deep Learning

  38. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

  39. Is Integer Arithmetic Enough for Deep Learning Training?

  40. On-Device Training Under 256KB Memory

  41. How to train accurate BNNs for embedded systems?

  42. Director: Deep Hierarchical Planning from Pixels

  43. 8-bit Numerical Formats for Deep Neural Networks

  44. XTC: Extreme Compression for Pre-trained Transformers Made Simple and Efficient

  45. ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers

  46. Matryoshka Representations for Adaptive Deployment

  47. PLAID: An Efficient Engine for Late Interaction Retrieval

  48. Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam

  49. Is Programmable Overhead Worth The Cost? How much do we pay for a system to be programmable? It depends upon who you ask

  50. Boosted Dense Retriever

  51. FQ-ViT: Fully Quantized Vision Transformer without Retraining

  52. 𝜇NCA: Texture Generation with Ultra-Compact Neural Cellular Automata

  53. Prune Once for All: Sparse Pre-Trained Language Models

  54. 8-bit Optimizers via Block-wise Quantization

  55. Understanding and Overcoming the Challenges of Efficient Transformer Quantization

  56. Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better

  57. A Winning Hand: Compressing Deep Networks Can Improve Out-Of-Distribution Robustness

  58. Ten Lessons From Three Generations Shaped Google’s TPUv4i

  59. High-performance, Distributed Training of Large-scale Deep Learning Recommendation Models (DLRMs)

  60. Deep Residual Learning in Spiking Neural Networks

  61. 1-bit Adam: Communication Efficient Large-Scale Training with Adam’s Convergence Speed

  62. ES-ENAS: Blackbox Optimization over Hybrid Spaces via Combinatorial and Continuous Evolution

  63. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

  64. A Primer in BERTology: What we know about how BERT works

  65. L2L: Training Large Neural Networks with Constant Memory using a New Execution Algorithm

  66. RegDeepDanbooru: Yet another Deep Danbooru project

  67. TernaryBERT: Distillation-aware Ultra-low Bit BERT

  68. HOBFLOPS CNNs: Hardware Optimized Bitslice-Parallel Floating-Point Operations for Convolutional Neural Networks

  69. Bayesian Bits: Unifying Quantization and Pruning

  70. General Purpose Text Embeddings from Pre-trained Language Models for Scalable Inference

  71. Lite Transformer with Long-Short Range Attention

  72. Training with Quantization Noise for Extreme Model Compression

  73. Moniqua: Modulo Quantized Communication in Decentralized SGD

  74. Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers

  75. SWAT: Sparse Weight Activation Training

  76. QUARL: Quantized Reinforcement Learning (ActorQ)

  77. SCaNN: Accelerating Large-Scale Inference with Anisotropic Vector Quantization

  78. And the Bit Goes Down: Revisiting the Quantization of Neural Networks

  79. Surrogate Gradient Learning in Spiking Neural Networks

  80. Rethinking floating point for deep learning

  81. Learning Recurrent Binary/Ternary Weights

  82. Rethinking Numerical Representations for Deep Neural Networks

  83. Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in 4 Minutes

  84. Quantization Mimic: Towards Very Tiny CNN for Object Detection

  85. Training Imagenet in 3 hours for $25; and CIFAR-10 for $0.26

  86. High-Accuracy Low-Precision Training

  87. Training wide residual networks for deployment using a single bit for each weight

  88. Universal Deep Neural Network Compression

  89. Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training

  90. Shift: A Zero FLOP, Zero Parameter Alternative to Spatial Convolutions

  91. Training Simplification and Model Simplification for Deep Learning: A Minimal Effort Back Propagation Method

  92. Compressing Word Embeddings via Deep Compositional Code Learning

  93. Learning Discrete Weights Using the Local Reparameterization Trick

  94. TensorQuant—A Simulation Toolbox for Deep Neural Network Quantization

  95. Mixed Precision Training

  96. BitNet: Bit-Regularized Deep Neural Networks

  97. Beating Floating Point at its Own Game: Posit Arithmetic

  98. Bolt: Accelerated Data Mining with Fast Vector Compression

  99. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

  100. Ternary Neural Networks for Resource-Efficient AI Applications

  101. Deep neural networks are robust to weight binarization and other non-linear distortions

  102. Convolutional Networks for Fast, Energy-Efficient Neuromorphic Computing

  103. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks

  104. Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or −1

  105. BinaryConnect: Training Deep Neural Networks with binary weights during propagations

  106. Efficient supervised learning in networks with binary synapses

  107. A self-optimizing, non-symmetrical neural net for content addressable memory and pattern recognition

  108. Binary Vector Embeddings Are so Cool

  109. Building a Vector Database in 2GB for 36 Million Wikipedia Passages

  110. FlashAttention-3: Fast and Accurate Attention With Asynchrony and Low-Precision

  113. Xu et al 2023, Table 1: enwik8 text-prediction results using SpikeGPT and other Transformer/RNN baselines

  114. Pope et al 2022, Figure 1: TPU cost vs. sampling latency of the PaLM-540B model on a TPU cluster

  115. Fedus et al 2021, Figure 1: Switch MoE Transformer scaling

  116. Fedus et al 2021, Figure 13: Switch Transformer knowledge vs. reasoning scaling

  117. https://blog.pgvecto.rs/my-binary-vector-search-is-better-than-your-fp32-vectors

  119. https://cpldcpu.wordpress.com/2024/04/24/implementing-neural-networks-on-the-10-cent-risc-v-mcu-without-multiplier/

  121. https://github.com/NolanoOrg/llama-int4-quant/

  122. https://github.com/THUDM/GLM-130B/blob/main/doc/quantization.md

  123. https://github.com/qwopqwop200/GPTQ-for-LLaMa

  124. https://github.com/vitoplantamura/OnnxStream

  125. https://github.com/vitoplantamura/OnnxStream/tree/846da873570a737b49154e8f835704264864b0fe

  126. https://huggingface.co/blog/embedding-quantization

  128. https://justine.lol/matmul/

  130. https://lightning.ai/pages/community/lora-insights/

  131. https://observablehq.com/@rreusser/half-precision-floating-point-visualized

  133. https://research.google/blog/quantization-for-fast-and-environmentally-sustainable-reinforcement-learning/

  134. https://txt.cohere.com/int8-binary-embeddings/

  136. https://www.reddit.com/r/LocalLLaMA/comments/1gsyp7q/humaneval_benchmark_of_exl2_quants_of_popular/

  137. https://www.reddit.com/r/mlscaling/comments/146rgq2/chatgpt_is_running_quantized/

  139. https://x.com/NolanoOrg/status/1634027966651834370

  140. https://x.com/aidan_mclau/status/1822830757137596521

  141. https://x.com/moyix/status/1582213498703990784

  142. https://x.com/thecharlieblake/status/1581913495670755328

  143. https://x.com/thiteanish/status/1635188333705043969
