Bibliography:

  1. ‘NN sparsity’ tag

  2. ‘inner monologue (AI)’ tag

  3. ‘dark knowledge (human)’ tag

  4. ‘brain imitation learning’ tag

  5. Research Ideas

  6. A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs

  7. LoLCATs: On Low-Rank Linearizing of Large Language Models

  8. The Mamba in the Llama: Distilling and Accelerating Hybrid Models

  9. Gemma 2: Improving Open Language Models at a Practical Size

  10. Scaling the Codebook Size of VQGAN to 100,000 with a Utilization Rate of 99%

  11. From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step

  12. Streamlining Redundant Layers to Compress Large Language Models

  13. SDXS: Real-Time One-Step Latent Diffusion Models with Image Conditions

  14. Do Not Worry if You Do Not Have Data: Building Pretrained Language Models Using Translationese

  15. CLLMs: Consistency Large Language Models

  16. Bridging the Gap: Sketch to Color Diffusion Model with Semantic Prompt Learning

  17. Improving Text Embeddings with Large Language Models

  18. ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent

  19. ByteDance is secretly using OpenAI’s tech to build a competitor

  20. SMERF: Streamable Memory Efficient Radiance Fields for Real-Time Large-Scene Exploration

  21. Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models (ReST^EM)

  22. Generative Models: What do they know? Do they know things? Let’s find out!

  23. Efficient Transformer Knowledge Distillation: A Performance Review

  24. Implicit Chain-of-Thought Reasoning via Knowledge Distillation

  25. Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling

  26. HyperFields: Towards Zero-Shot Generation of NeRFs from Text

  27. Polynomial Time Cryptanalytic Extraction of Neural Network Models

  28. OSD: Online Speculative Decoding

  29. ReST: Reinforced Self-Training (ReST) for Language Modeling

  30. Composable Function-preserving Expansions for Transformer Architectures

  31. Distilling Large Language Models for Biomedical Knowledge Extraction: A Case Study on Adverse Drug Events

  32. Explaining Competitive-Level Programming Solutions using LLMs

  33. GKD: Generalized Knowledge Distillation for Auto-regressive Sequence Models

  34. WikiChat: Stopping the Hallucination of Large Language Model Chatbots by Few-Shot Grounding on Wikipedia

  35. VanillaNet: the Power of Minimalism in Deep Learning

  36. Mimetic Initialization of Self-Attention Layers

  37. TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

  38. Dr. LLaMA: Improving Small Language Models in Domain-Specific QA via Generative Data Augmentation

  39. Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

  40. LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions

  41. Learning Agile Soccer Skills for a Bipedal Robot with Deep Reinforcement Learning

  42. A Cookbook of Self-Supervised Learning

  43. KD-DLGAN: Data Limited Image Generation via Knowledge Distillation

  44. TRACT: Denoising Diffusion Models with Transitive Closure Time-Distillation

  45. Learning Humanoid Locomotion with Transformers

  46. Consistency Models

  47. ProofNet: Autoformalizing and Formally Proving Undergraduate-Level Mathematics

  48. Scaling Vision Transformers to 22 Billion Parameters

  49. BMT: Binarized Neural Machine Translation

  50. Use GPT-3 incorrectly: reduce costs 40× and increase speed by 5×

  51. TinyMIM: An Empirical Study of Distilling MIM Pre-trained Models

  52. Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints

  53. Solving math word problems with process & outcome-based feedback

  54. Distilled DeepConsensus: Knowledge distillation for fast and accurate DNA sequence correction

  55. MaskDistill: A Unified View of Masked Image Modeling

  56. EVA: Exploring the Limits of Masked Visual Representation Learning at Scale

  57. Legged Locomotion in Challenging Terrains using Egocentric Vision

  58. eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

  59. Fast DistilBERT on CPUs

  60. Large Language Models Can Self-Improve

  61. Exclusive Supermask Subnetwork Training for Continual Learning

  62. The Unreasonable Effectiveness of Fully-Connected Layers for Low-Data Regimes

  63. On Distillation of Guided Diffusion Models

  64. Small Character Models Match Large Word Models for Autocomplete Under Memory Constraints

  65. Omnigrok: Grokking Beyond Algorithmic Data

  66. Human-level Atari 200× faster

  67. On the Effectiveness of Compact Biomedical Transformers (✱BioBERT)

  68. Neural Payoff Machines: Predicting Fair and Stable Payoff Allocations Among Team Members

  69. Re2G: Retrieve, Rerank, Generate

  70. Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter Encoders for Natural Language Understanding Systems

  71. SBERT studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features

  72. ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers

  73. Dataset Condensation via Efficient Synthetic-Data Parameterization

  74. UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes

  75. Dialog Inpainting: Turning Documents into Dialogues

  76. Solving ImageNet: a Unified Scheme for Training any Backbone to Top Results

  77. STaR: Bootstrapping Reasoning With Reasoning

  78. Knowledge Distillation: Bad Models Can Be Good Role Models

  79. PPCD-GAN: Progressive Pruning and Class-Aware Distillation for Large-Scale Conditional GANs Compression

  80. Self-Distilled StyleGAN: Towards Generation from Internet Photos

  81. AutoDistil: Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models

  82. DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale

  83. Microdosing: Knowledge Distillation for GAN based Compression

  84. ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation

  85. Amortized Noisy Channel Neural Machine Translation

  86. Causal Distillation for Language Models

  87. Extrapolating from a Single Image to a Thousand Classes using Distillation

  88. Prune Once for All: Sparse Pre-Trained Language Models

  89. Training Verifiers to Solve Math Word Problems

  90. Wav2CLIP: Learning Robust Audio Representations From CLIP

  91. When in Doubt, Summon the Titans: Efficient Inference with Large Models

  92. Lifelong Pretraining: Continually Adapting Language Models to Emerging Corpora

  93. Symbolic Knowledge Distillation: from General Language Models to Commonsense Models

  94. Language Modeling via Learning to Rank

  95. Beyond Pick-and-Place: Tackling Robotic Stacking of Diverse Shapes

  96. Unsupervised Neural Machine Translation with Generative Language Models Only

  97. OTTER: Data Efficient Language-Supervised Zero-Shot Recognition with Optimal Transport Distillation

  98. Progressive Distillation for Fast Sampling of Diffusion Models

  99. On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis

  100. ZSD-YOLO: Zero-Shot YOLO Detection using Vision-Language Knowledge Distillation

  101. Beyond Distillation: Task-level Mixture-of-Experts (TaskMoE) for Efficient Inference

  102. SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval

  103. KroneckerBERT: Learning Kronecker Decomposition for Pre-trained Language Models via Knowledge Distillation

  104. Multi-Task Self-Training for Learning General Representations

  105. Dataset Distillation with Infinitely Wide Convolutional Networks

  106. Knowledge-Adaptation Priors

  107. Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better

  108. Knowledge distillation: A good teacher is patient and consistent

  109. ResMLP: Feedforward networks for image classification with data-efficient training

  110. DINO: Emerging Properties in Self-Supervised Vision Transformers

  111. Zero-Shot Detection via Vision and Language Knowledge Distillation

  112. Data-Efficient Language-Supervised Zero-Shot Learning with Self-Distillation

  113. ALD: Efficient Transformers in Reinforcement Learning using Actor-Learner Distillation

  114. KiloNeRF: Speeding up Neural Radiance Fields with Thousands of Tiny MLPs

  115. China’s GPT-3? BAAI Introduces Superscale Intelligence Model ‘Wu Dao 1.0’: The Beijing Academy of Artificial Intelligence (BAAI) releases Wu Dao 1.0, China’s first large-scale pretraining model.

  116. Distilling Large Language Models into Tiny and Effective Students using pQRNN

  117. Training data-efficient image transformers & distillation through attention

  118. Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning

  119. Towards Playing Full MOBA Games with Deep Reinforcement Learning

  120. A Primer in BERTology: What we know about how BERT works

  121. Dataset Meta-Learning from Kernel Ridge-Regression

  122. TernaryBERT: Distillation-aware Ultra-low Bit BERT

  123. SimCLRv2: Big Self-Supervised Models are Strong Semi-Supervised Learners

  124. Movement Pruning: Adaptive Sparsity by Fine-Tuning

  125. General Purpose Text Embeddings from Pre-trained Language Models for Scalable Inference

  126. Cryptanalytic Extraction of Neural Network Models

  127. MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers

  128. Towards a Conversational Agent that Can Chat About…Anything

  129. Understanding the generalization of ‘lottery tickets’ in neural networks

  130. Self-training with Noisy Student improves ImageNet classification

  131. On Warm-Starting Neural Network Training

  132. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

  133. TinyBERT: Distilling BERT for Natural Language Understanding

  134. Smaller, faster, cheaper, lighter: Introducing DistilGPT, a distilled version of GPT

  135. Well-Read Students Learn Better: On the Importance of Pre-training Compact Models

  136. ICML 2019 Notes

  137. NoGAN: Decrappification, DeOldification, and Super Resolution

  138. Mask-Predict: Parallel Decoding of Conditional Masked Language Models

  139. Distilling Policy Distillation

  140. Compressing GANs using Knowledge Distillation

  141. Neural probabilistic motor primitives for humanoid control

  142. Dataset Distillation

  143. Exploration by Random Network Distillation

  144. OCD: Optimal Completion Distillation for Sequence Learning

  145. Network Recasting: A Universal Method for Network Architecture Transformation

  146. ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech

  147. Self-Net: Lifelong Learning via Continual Self-Modeling

  148. Self-distillation: Born Again Neural Networks

  149. Kickstarting Deep Reinforcement Learning

  150. Faster gaze prediction with dense networks and Fisher pruning

  151. Parallel WaveNet: Fast High-Fidelity Speech Synthesis

  152. Knowledge Concentration: Learning 100K Object Classifiers in a Single CNN

  153. Policy Optimization by Genetic Distillation

  154. N2N Learning: Network to Network Compression via Policy Gradient Reinforcement Learning

  155. Training Shallow and Thin Networks for Acceleration via Knowledge Distillation with Conditional Adversarial Networks

  156. Distral: Robust Multitask Reinforcement Learning

  157. Biased Importance Sampling for Deep Neural Network Training

  158. Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer

  159. FractalNet: Ultra-Deep Neural Networks without Residuals

  160. Do Deep Convolutional Nets Really Need to be Deep and Convolutional?

  161. Face Model Compression by Distilling Knowledge from Neurons

  162. Policy Distillation

  163. Actor-Mimic: Deep Multitask and Transfer Reinforcement Learning

  164. Net2Net: Accelerating Learning via Knowledge Transfer

  165. Bayesian Dark Knowledge

  166. Distilling the Knowledge in a Neural Network

  167. FitNets: Hints for Thin Deep Nets

  168. Do Deep Nets Really Need to be Deep?

  169. Model Compression

  170. Learning Complex, Extended Sequences Using the Principle of History Compression

  171. Dota 2 with Large Scale Deep Reinforcement Learning § pg. 11

  173. Google DeepMind’s Grandmaster-Level Chess Without Search

  174. From Vision to Language: Semi-Supervised Learning in Action…at Scale

  175. design#future-tag-features

  176. Dehghani et al 2023, Figure 8: shape bias of the ViT-22B model is almost human-like as compared to past NN models

  177. Balaji et al 2022, Figure 2: eDiff-I as multiple unrolled models during diffusion phases

  178. Balaji et al 2022, Table 1: zero-shot FID comparison between eDiff-I and other SOTA image-generation models, showing eDiff-I wins

  179. Beyer et al 2021, Figure 3: knowledge distillation over 1 million epochs

  180. Urban et al 2016, Figure 1: MLP vs. CNN scaling

  181. http://www.cs.cornell.edu/~caruana/compression.kdd06.pdf

  183. https://blog.helix.ml/p/how-we-got-fine-tuning-mistral-7b

  184. https://blog.segmind.com/introducing-segmind-ssd-1b/

  185. https://discuss.luxonis.com/blog/3272-datadreamer-creating-custom-datasets-made-easy

  187. https://eugeneyan.com/writing/synthetic/

  188. https://github.com/mbzuai-nlp/LaMini-LM

  189. https://github.com/nomic-ai/gpt4all

  190. https://medium.com/neuralmachine/knowledge-distillation-dc241d7c2322

  191. https://sander.ai/2024/02/28/paradox.html

  192. https://www.nature.com/articles/s41593-023-01382-9

  193. https://www.reddit.com/r/MachineLearning/comments/1fyb9jj/p_model2vec_distill_a_small_fast_model_from_any/

  194. https://www.theverge.com/2023/3/29/23662621/google-bard-chatgpt-sharegpt-training-denies

  195. https://x.com/EMostaque/status/1641796736879587329

  196. https://x.com/ESYudkowsky/status/1635577836525469697

  197. The Mamba in the Llama: Distilling and Accelerating Hybrid Models

  198. Junxiong Wang

  199. Alexander M. Rush’s Homepage (https://rush-nlp.com/)

  200. Tri Dao

  201. https://arxiv.org/abs/2408.15237

  202. Gemma 2: Improving Open Language Models at a Practical Size

  203. Behnam Neyshabur

  204. Koray Kavukcuoglu

  205. https://arxiv.org/abs/2408.00118#google

  206. Scaling the Codebook Size of VQGAN to 100,000 with a Utilization Rate of 99%

  207. https://arxiv.org/abs/2406.11837

  208. From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step

  209. https://arxiv.org/abs/2405.14838

  210. Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models (ReST^EM)

  211. Abhishek Kumar

  212. Igor Mordatch

  213. Behnam Neyshabur

  214. Jascha Sohl-Dickstein

  215. https://arxiv.org/abs/2312.06585#deepmind

  216. Efficient Transformer Knowledge Distillation: A Performance Review

  217. https://arxiv.org/abs/2311.13657

  218. Polynomial Time Cryptanalytic Extraction of Neural Network Models

  219. https://arxiv.org/abs/2310.08708

  220. Distilling Large Language Models for Biomedical Knowledge Extraction: A Case Study on Adverse Drug Events

  221. https://arxiv.org/abs/2307.06439#microsoft

  222. VanillaNet: the Power of Minimalism in Deep Learning

  223. https://arxiv.org/abs/2305.12972

  224. Mimetic Initialization of Self-Attention Layers

  225. https://arxiv.org/abs/2305.09828

  226. TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

  227. https://arxiv.org/abs/2305.07759#microsoft

  228. Dr. LLaMA: Improving Small Language Models in Domain-Specific QA via Generative Data Augmentation

  229. https://arxiv.org/abs/2305.07804

  230. Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

  231. https://arxiv.org/abs/2305.02301#google

  232. Learning Agile Soccer Skills for a Bipedal Robot with Deep Reinforcement Learning

  233. Guy Lever

  234. Nicolas Heess

  235. https://arxiv.org/abs/2304.13653#deepmind

  236. Consistency Models

  237. Speaker Details: EmTech MIT 2023

  238. https://arxiv.org/abs/2303.01469#openai

  239. ProofNet: Autoformalizing and Formally Proving Undergraduate-Level Mathematics

  240. https://arxiv.org/abs/2302.12433

  241. Scaling Vision Transformers to 22 Billion Parameters

  242. Robert Geirhos

  243. Lucas Beyer

  244. Yi Tay

  245. Neil Houlsby

  246. https://arxiv.org/abs/2302.05442#google

  247. BMT: Binarized Neural Machine Translation

  248. https://arxiv.org/abs/2302.04907#google

  249. TinyMIM: An Empirical Study of Distilling MIM Pre-trained Models

  250. https://arxiv.org/abs/2301.01296#microsoft

  251. Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints

  252. Yi Tay

  253. Neil Houlsby

  254. https://arxiv.org/abs/2212.05055#google

  255. MaskDistill: A Unified View of Masked Image Modeling

  256. https://openreview.net/forum?id=wmGlMhaBe0

  257. EVA: Exploring the Limits of Masked Visual Representation Learning at Scale

  258. https://arxiv.org/abs/2211.07636#baai

  259. Legged Locomotion in Challenging Terrains using Egocentric Vision

  260. https://arxiv.org/abs/2211.07638

  261. eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

  262. https://arxiv.org/abs/2211.01324#nvidia

  263. Large Language Models Can Self-Improve

  264. https://arxiv.org/abs/2210.11610#google

  265. On Distillation of Guided Diffusion Models

  266. Stefano Ermon

  267. Jonathan Ho

  268. Tim Salimans

  269. https://arxiv.org/abs/2210.03142#google

  270. Omnigrok: Grokking Beyond Algorithmic Data

  271. https://arxiv.org/abs/2210.01117

  272. Human-level Atari 200× faster

  273. https://arxiv.org/abs/2209.07550#deepmind

  274. Re2G: Retrieve, Rerank, Generate

  275. https://arxiv.org/abs/2207.06300#ibm

  276. Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter Encoders for Natural Language Understanding Systems

  277. https://arxiv.org/abs/2206.07808#amazon

  278. ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers

  279. https://arxiv.org/abs/2206.01861#microsoft

  280. Dialog Inpainting: Turning Documents into Dialogues

  281. https://arxiv.org/abs/2205.09073#google

  282. Solving ImageNet: a Unified Scheme for Training any Backbone to Top Results

  283. https://arxiv.org/abs/2204.03475#alibaba

  284. Self-Distilled StyleGAN: Towards Generation from Internet Photos

  285. https://arxiv.org/abs/2202.12211#google

  286. DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale

  287. https://arxiv.org/abs/2201.05596#microsoft

  288. Prune Once for All: Sparse Pre-Trained Language Models

  289. https://arxiv.org/abs/2111.05754

  290. Training Verifiers to Solve Math Word Problems

  291. Jacob Hilton’s Homepage

  292. John Schulman’s Homepage

  293. https://arxiv.org/abs/2110.14168#openai

  294. Language Modeling via Learning to Rank

  295. https://arxiv.org/abs/2110.06961

  296. OTTER: Data Efficient Language-Supervised Zero-Shot Recognition with Optimal Transport Distillation

  297. https://openreview.net/forum?id=G89-1yZLFHk

  298. ZSD-YOLO: Zero-Shot YOLO Detection using Vision-Language Knowledge Distillation

  299. https://arxiv.org/abs/2109.12066

  300. KroneckerBERT: Learning Kronecker Decomposition for Pre-trained Language Models via Knowledge Distillation

  301. https://arxiv.org/abs/2109.06243#huawei

  302. Knowledge distillation: A good teacher is patient and consistent

  303. Lucas Beyer

  304. https://arxiv.org/abs/2106.05237#google

  305. DINO: Emerging Properties in Self-Supervised Vision Transformers

  306. https://arxiv.org/abs/2104.14294#facebook

  307. Zero-Shot Detection via Vision and Language Knowledge Distillation

  308. https://arxiv.org/abs/2104.13921#google

  309. Data-Efficient Language-Supervised Zero-Shot Learning with Self-Distillation

  310. https://arxiv.org/abs/2104.08945#facebook

  311. China’s GPT-3? BAAI Introduces Superscale Intelligence Model ‘Wu Dao 1.0’: The Beijing Academy of Artificial Intelligence (BAAI) releases Wu Dao 1.0, China’s first large-scale pretraining model.

  312. https://syncedreview.com/2021/03/23/chinas-gpt-3-baai-introduces-superscale-intelligence-model-wu-dao-1-0/#baai

  313. Training data-efficient image transformers & distillation through attention

  314. https://arxiv.org/abs/2012.12877#facebook

  315. Towards Playing Full MOBA Games with Deep Reinforcement Learning

  316. https://arxiv.org/abs/2011.12692#tencent

  317. MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers

  318. Furu Wei

  319. https://arxiv.org/abs/2002.10957#microsoft

  320. Towards a Conversational Agent that Can Chat About…Anything

  321. https://research.google/blog/towards-a-conversational-agent-that-can-chat-aboutanything/

  322. Self-training with Noisy Student improves ImageNet classification

  323. https://arxiv.org/abs/1911.04252#google

  324. TinyBERT: Distilling BERT for Natural Language Understanding

  325. https://arxiv.org/abs/1909.10351

  326. ICML 2019 Notes

  327. https://david-abel.github.io/notes/icml_2019.pdf

  328. Distilling Policy Distillation

  329. Razvan Pascanu’s Homepage (https://sites.google.com/view/razp/home)

  330. https://arxiv.org/abs/1902.02186#deepmind

  331. Face Model Compression by Distilling Knowledge from Neurons

  332. /doc/ai/nn/sparsity/knowledge-distillation/2016-luo.pdf