- See Also
- Links
- “HyperFields: Towards Zero-Shot Generation of NeRFs from Text”, Babu et al 2023
- “Polynomial Time Cryptanalytic Extraction of Neural Network Models”, Shamir et al 2023
- “Composable Function-preserving Expansions for Transformer Architectures”, Gesmundo & Maile 2023
- “Distilling Large Language Models for Biomedical Knowledge Extraction: A Case Study on Adverse Drug Events”, Gu et al 2023
- “Explaining Competitive-Level Programming Solutions Using LLMs”, Li et al 2023
- “GKD: Generalized Knowledge Distillation for Auto-regressive Sequence Models”, Agarwal et al 2023
- “VanillaNet: the Power of Minimalism in Deep Learning”, Chen et al 2023
- “Dr. LLaMa: Improving Small Language Models in Domain-Specific QA via Generative Data Augmentation”, Guo et al 2023
- “TinyStories: How Small Can Language Models Be and Still Speak Coherent English?”, Eldan & Li 2023
- “LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions”, Wu et al 2023
- “Learning Agile Soccer Skills for a Bipedal Robot With Deep Reinforcement Learning”, Haarnoja et al 2023
- “A Cookbook of Self-Supervised Learning”, Balestriero et al 2023
- “KD-DLGAN: Data Limited Image Generation via Knowledge Distillation”, Cui et al 2023
- “Learning Humanoid Locomotion With Transformers”, Radosavovic et al 2023
- “Consistency Models”, Song et al 2023
- “ProofNet: Autoformalizing and Formally Proving Undergraduate-Level Mathematics”, Azerbayev et al 2023
- “Scaling Vision Transformers to 22 Billion Parameters”, Dehghani et al 2023
- “BMT: Binarized Neural Machine Translation”, Zhang et al 2023
- “Use GPT-3 Incorrectly: Reduce Costs 40× and Increase Speed by 5×”, Pullen 2023
- “TinyMIM: An Empirical Study of Distilling MIM Pre-trained Models”, Ren et al 2023
- “Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints”, Komatsuzaki et al 2022
- “Solving Math Word Problems With Process & Outcome-based Feedback”, Uesato et al 2022
- “MaskDistill: A Unified View of Masked Image Modeling”, Anonymous 2022
- “Distilled DeepConsensus: Knowledge Distillation for Fast and Accurate DNA Sequence Correction”, Belyaeva et al 2022
- “Legged Locomotion in Challenging Terrains Using Egocentric Vision”, Agarwal et al 2022
- “EVA: Exploring the Limits of Masked Visual Representation Learning at Scale”, Fang et al 2022
- “EDiff-I: Text-to-Image Diffusion Models With an Ensemble of Expert Denoisers”, Balaji et al 2022
- “Fast DistilBERT on CPUs”, Shen et al 2022
- “Large Language Models Can Self-Improve”, Huang et al 2022
- “Exclusive Supermask Subnetwork Training for Continual Learning”, Yadav & Bansal 2022
- “The Unreasonable Effectiveness of Fully-Connected Layers for Low-Data Regimes”, Kocsis et al 2022
- “On Distillation of Guided Diffusion Models”, Meng et al 2022
- “Omnigrok: Grokking Beyond Algorithmic Data”, Liu et al 2022
- “Human-level Atari 200× Faster”, Kapturowski et al 2022
- “On the Effectiveness of Compact Biomedical Transformers (✱BioBERT)”, Rohanian et al 2022
- “Neural Payoff Machines: Predicting Fair and Stable Payoff Allocations Among Team Members”, Cornelisse et al 2022
- “Re2G: Retrieve, Rerank, Generate”, Glass et al 2022
- “Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter Encoders for Natural Language Understanding Systems”, FitzGerald et al 2022
- “SBERT Studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features”, Opitz & Frank 2022
- “ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers”, Yao et al 2022
- “Dataset Condensation via Efficient Synthetic-Data Parameterization”, Kim et al 2022
- “UViM: A Unified Modeling Approach for Vision With Learned Guiding Codes”, Kolesnikov et al 2022
- “Dialog Inpainting: Turning Documents into Dialogues”, Dai et al 2022
- “Solving ImageNet: a Unified Scheme for Training Any Backbone to Top Results”, Ridnik et al 2022
- “Knowledge Distillation: Bad Models Can Be Good Role Models”, Kaplun et al 2022
- “STaR: Bootstrapping Reasoning With Reasoning”, Zelikman et al 2022
- “PPCD-GAN: Progressive Pruning and Class-Aware Distillation for Large-Scale Conditional GANs Compression”, Vo et al 2022
- “Self-Distilled StyleGAN: Towards Generation from Internet Photos”, Mokady et al 2022
- “AutoDistil: Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models”, Xu et al 2022
- “DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale”, Rajbhandari et al 2022
- “Microdosing: Knowledge Distillation for GAN Based Compression”, Helminger et al 2022
- “ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation”, Wang et al 2021
- “Amortized Noisy Channel Neural Machine Translation”, Pang et al 2021
- “Causal Distillation for Language Models”, Wu et al 2021
- “Extrapolating from a Single Image to a Thousand Classes Using Distillation”, Asano & Saeed 2021
- “Prune Once for All: Sparse Pre-Trained Language Models”, Zafrir et al 2021
- “Training Verifiers to Solve Math Word Problems”, Cobbe et al 2021
- “Wav2CLIP: Learning Robust Audio Representations From CLIP”, Wu et al 2021
- “When in Doubt, Summon the Titans: Efficient Inference With Large Models”, Rawat et al 2021
- “Lifelong Pretraining: Continually Adapting Language Models to Emerging Corpora”, Jin et al 2021
- “Symbolic Knowledge Distillation: from General Language Models to Commonsense Models”, West et al 2021
- “Language Modelling via Learning to Rank”, Frydenlund et al 2021
- “Beyond Pick-and-Place: Tackling Robotic Stacking of Diverse Shapes”, Lee et al 2021
- “Unsupervised Neural Machine Translation With Generative Language Models Only”, Han et al 2021
- “Progressive Distillation for Fast Sampling of Diffusion Models”, Salimans & Ho 2021
- “OTTER: Data Efficient Language-Supervised Zero-Shot Recognition With Optimal Transport Distillation”, Wu et al 2021
- “On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis”, Lai et al 2021
- “Beyond Distillation: Task-level Mixture-of-Experts (TaskMoE) for Efficient Inference”, Kudugunta et al 2021
- “ZSD-YOLO: Zero-Shot YOLO Detection Using Vision-Language Knowledge Distillation”, Xie & Zheng 2021
- “SPLADE V2: Sparse Lexical and Expansion Model for Information Retrieval”, Formal et al 2021
- “KroneckerBERT: Learning Kronecker Decomposition for Pre-trained Language Models via Knowledge Distillation”, Tahaei et al 2021
- “Multi-Task Self-Training for Learning General Representations”, Ghiasi et al 2021
- “Dataset Distillation With Infinitely Wide Convolutional Networks”, Nguyen et al 2021
- “Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better”, Menghani 2021
- “Knowledge-Adaptation Priors”, Khan & Swaroop 2021
- “Knowledge Distillation: A Good Teacher Is Patient and Consistent”, Beyer et al 2021
- “ResMLP: Feedforward Networks for Image Classification With Data-efficient Training”, Touvron et al 2021
- “DINO: Emerging Properties in Self-Supervised Vision Transformers”, Caron et al 2021
- “Zero-Shot Detection via Vision and Language Knowledge Distillation”, Gu et al 2021
- “Data-Efficient Language-Supervised Zero-Shot Learning With Self-Distillation”, Cheng et al 2021
- “Efficient Transformers in Reinforcement Learning Using Actor-Learner Distillation”, Parisotto & Salakhutdinov 2021
- “KiloNeRF: Speeding up Neural Radiance Fields With Thousands of Tiny MLPs”, Reiser et al 2021
- “China’s GPT-3? BAAI Introduces Superscale Intelligence Model ‘Wu Dao 1.0’: The Beijing Academy of Artificial Intelligence (BAAI) Releases Wu Dao 1.0, China’s First Large-scale Pretraining Model.”, Synced 2021
- “Distilling Large Language Models into Tiny and Effective Students Using PQRNN”, Kaliamoorthi et al 2021
- “Training Data-efficient Image Transformers & Distillation through Attention”, Touvron et al 2020
- “Towards Playing Full MOBA Games With Deep Reinforcement Learning”, Ye et al 2020
- “A Primer in BERTology: What We Know about How BERT Works”, Rogers et al 2020
- “Dataset Meta-Learning from Kernel Ridge-Regression”, Nguyen et al 2020
- “TernaryBERT: Distillation-aware Ultra-low Bit BERT”, Zhang et al 2020
- “SimCLRv2: Big Self-Supervised Models Are Strong Semi-Supervised Learners”, Chen et al 2020
- “Movement Pruning: Adaptive Sparsity by Fine-Tuning”, Sanh et al 2020
- “General Purpose Text Embeddings from Pre-trained Language Models for Scalable Inference”, Du et al 2020
- “Cryptanalytic Extraction of Neural Network Models”, Carlini et al 2020
- “MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers”, Wang et al 2020
- “Towards a Conversational Agent That Can Chat About…Anything”, Adiwardana & Luong 2020
- “Understanding the Generalization of ‘Lottery Tickets’ in Neural Networks”, Morcos & Tian 2019
- “Self-training With Noisy Student Improves ImageNet Classification”, Xie et al 2019
- “On Warm-Starting Neural Network Training”, Ash & Adams 2019
- “DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter”, Sanh et al 2019
- “TinyBERT: Distilling BERT for Natural Language Understanding”, Jiao et al 2019
- “Smaller, Faster, Cheaper, Lighter: Introducing DistilGPT, a Distilled Version of GPT”, Sanh 2019
- “ICML 2019 Notes”, Abel 2019
- “Mask-Predict: Parallel Decoding of Conditional Masked Language Models”, Ghazvininejad et al 2019
- “Distilling Policy Distillation”, Czarnecki et al 2019
- “Compressing GANs Using Knowledge Distillation”, Aguinaldo et al 2019
- “Neural Probabilistic Motor Primitives for Humanoid Control”, Merel et al 2018
- “Dataset Distillation”, Wang et al 2018
- “Exploration by Random Network Distillation”, Burda et al 2018
- “OCD: Optimal Completion Distillation for Sequence Learning”, Sabour et al 2018
- “Network Recasting: A Universal Method for Network Architecture Transformation”, Yu et al 2018
- “ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech”, Ping et al 2018
- “Self-Net: Lifelong Learning via Continual Self-Modeling”, Camp et al 2018
- “Kickstarting Deep Reinforcement Learning”, Schmitt et al 2018
- “Faster Gaze Prediction With Dense Networks and Fisher Pruning”, Theis et al 2018
- “Parallel WaveNet: Fast High-Fidelity Speech Synthesis”, Oord et al 2017
- “Knowledge Concentration: Learning 100K Object Classifiers in a Single CNN”, Gao et al 2017
- “Policy Optimization by Genetic Distillation”, Gangwani & Peng 2017
- “N2N Learning: Network to Network Compression via Policy Gradient Reinforcement Learning”, Ashok et al 2017
- “Training Shallow and Thin Networks for Acceleration via Knowledge Distillation With Conditional Adversarial Networks”, Xu et al 2017
- “Distral: Robust Multitask Reinforcement Learning”, Teh et al 2017
- “Biased Importance Sampling for Deep Neural Network Training”, Katharopoulos & Fleuret 2017
- “Research Ideas”, Gwern 2017
- “Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer”, Zagoruyko & Komodakis 2016
- “Do Deep Convolutional Nets Really Need to Be Deep and Convolutional?”, Urban et al 2016
- “Face Model Compression by Distilling Knowledge from Neurons”, Luo et al 2016
- “Actor-Mimic: Deep Multitask and Transfer Reinforcement Learning”, Parisotto et al 2015
- “Policy Distillation”, Rusu et al 2015
- “Net2Net: Accelerating Learning via Knowledge Transfer”, Chen et al 2015
- “Bayesian Dark Knowledge”, Korattikara et al 2015
- “Distilling the Knowledge in a Neural Network”, Hinton et al 2015
- “FitNets: Hints for Thin Deep Nets”, Romero et al 2014
- “Do Deep Nets Really Need to Be Deep?”, Ba & Caruana 2013
- “Model Compression”, Bucila 2006
- “Learning Complex, Extended Sequences Using the Principle of History Compression”, Schmidhuber 1992
- “From Vision to Language: Semi-Supervised Learning in Action…at Scale”
- Wikipedia
- Miscellaneous
- Link Bibliography
Wikipedia
Miscellaneous
Link Bibliography
- https://arxiv.org/abs/2310.08708: “Polynomial Time Cryptanalytic Extraction of Neural Network Models”, Adi Shamir, Isaac Canales-Martinez, Anna Hambitzer, Jorge Chavez-Saab, Francisco Rodriguez-Henriquez, Nitin Satpute
- https://arxiv.org/abs/2307.06439#microsoft: “Distilling Large Language Models for Biomedical Knowledge Extraction: A Case Study on Adverse Drug Events”
- https://arxiv.org/abs/2305.12972: “VanillaNet: the Power of Minimalism in Deep Learning”, Hanting Chen, Yunhe Wang, Jianyuan Guo, Dacheng Tao
- https://arxiv.org/abs/2305.07804: “Dr. LLaMa: Improving Small Language Models in Domain-Specific QA via Generative Data Augmentation”, Zhen Guo, Peiqi Wang, Yanwei Wang, Shangdi Yu
- https://arxiv.org/abs/2305.07759#microsoft: “TinyStories: How Small Can Language Models Be and Still Speak Coherent English?”, Ronen Eldan, Yuanzhi Li
- https://arxiv.org/abs/2304.13653#deepmind: “Learning Agile Soccer Skills for a Bipedal Robot With Deep Reinforcement Learning”
- https://arxiv.org/abs/2303.01469#openai: “Consistency Models”, Yang Song, Prafulla Dhariwal, Mark Chen, Ilya Sutskever
- https://arxiv.org/abs/2302.12433: “ProofNet: Autoformalizing and Formally Proving Undergraduate-Level Mathematics”, Zhangir Azerbayev, Bartosz Piotrowski, Hailey Schoelkopf, Edward W. Ayers, Dragomir Radev, Jeremy Avigad
- https://arxiv.org/abs/2302.05442#google: “Scaling Vision Transformers to 22 Billion Parameters”
- https://arxiv.org/abs/2302.04907#google: “BMT: Binarized Neural Machine Translation”, Yichi Zhang, Ankush Garg, Yuan Cao, Łukasz Lew, Behrooz Ghorbani, Zhiru Zhang, Orhan Firat
- https://arxiv.org/abs/2301.01296#microsoft: “TinyMIM: An Empirical Study of Distilling MIM Pre-trained Models”, Sucheng Ren, Fangyun Wei, Zheng Zhang, Han Hu
- https://arxiv.org/abs/2212.05055#google: “Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints”
- https://openreview.net/forum?id=wmGlMhaBe0: “MaskDistill: A Unified View of Masked Image Modeling”, Anonymous
- https://arxiv.org/abs/2211.07638: “Legged Locomotion in Challenging Terrains Using Egocentric Vision”, Ananye Agarwal, Ashish Kumar, Jitendra Malik, Deepak Pathak
- https://arxiv.org/abs/2211.07636#baai: “EVA: Exploring the Limits of Masked Visual Representation Learning at Scale”, Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, Yue Cao
- https://arxiv.org/abs/2211.01324#nvidia: “EDiff-I: Text-to-Image Diffusion Models With an Ensemble of Expert Denoisers”
- https://arxiv.org/abs/2210.11610#google: “Large Language Models Can Self-Improve”, Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, Jiawei Han
- https://arxiv.org/abs/2210.03142#google: “On Distillation of Guided Diffusion Models”, Chenlin Meng, Ruiqi Gao, Diederik P. Kingma, Stefano Ermon, Jonathan Ho, Tim Salimans
- https://arxiv.org/abs/2209.07550#deepmind: “Human-level Atari 200× Faster”, Steven Kapturowski, Víctor Campos, Ray Jiang, Nemanja Rakićević, Hado van Hasselt, Charles Blundell, Adrià Puigdomènech Badia
- https://arxiv.org/abs/2207.06300#ibm: “Re2G: Retrieve, Rerank, Generate”, Michael Glass, Gaetano Rossiello, Md Faisal Mahbub Chowdhury, Ankita Rajaram Naik, Pengshan Cai, Alfio Gliozzo
- https://arxiv.org/abs/2206.07808#amazon: “Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter Encoders for Natural Language Understanding Systems”
- https://arxiv.org/abs/2206.01861#microsoft: “ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers”, Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, Yuxiong He
- https://arxiv.org/abs/2205.09073#google: “Dialog Inpainting: Turning Documents into Dialogues”, Zhuyun Dai, Arun Tejasvi Chaganty, Vincent Zhao, Aida Amini, Qazi Mamunur Rashid, Mike Green, Kelvin Guu
- https://arxiv.org/abs/2204.03475#alibaba: “Solving ImageNet: a Unified Scheme for Training Any Backbone to Top Results”, Tal Ridnik, Hussam Lawen, Emanuel Ben-Baruch, Asaf Noy
- https://arxiv.org/abs/2202.12211#google: “Self-Distilled StyleGAN: Towards Generation from Internet Photos”, Ron Mokady, Michal Yarom, Omer Tov, Oran Lang, Daniel Cohen-Or, Tali Dekel, Michal Irani, Inbar Mosseri
- https://arxiv.org/abs/2201.05596#microsoft: “DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale”
- https://arxiv.org/abs/2111.05754: “Prune Once for All: Sparse Pre-Trained Language Models”, Ofir Zafrir, Ariel Larey, Guy Boudoukh, Haihao Shen, Moshe Wasserblat
- https://arxiv.org/abs/2110.14168#openai: “Training Verifiers to Solve Math Word Problems”, Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, John Schulman
- https://arxiv.org/abs/2110.06961: “Language Modelling via Learning to Rank”, Arvid Frydenlund, Gagandeep Singh, Frank Rudzicz
- https://openreview.net/forum?id=G89-1yZLFHk: “OTTER: Data Efficient Language-Supervised Zero-Shot Recognition With Optimal Transport Distillation”, Bichen Wu, Ruizhe Cheng, Peizhao Zhang, Peter Vajda, Joseph E. Gonzalez
- https://arxiv.org/abs/2109.12066: “ZSD-YOLO: Zero-Shot YOLO Detection Using Vision-Language Knowledge Distillation”, Johnathan Xie, Shuai Zheng
- https://arxiv.org/abs/2109.06243#huawei: “KroneckerBERT: Learning Kronecker Decomposition for Pre-trained Language Models via Knowledge Distillation”, Marzieh S. Tahaei, Ella Charlaix, Vahid Partovi Nia, Ali Ghodsi, Mehdi Rezagholizadeh
- https://arxiv.org/abs/2106.05237#google: “Knowledge Distillation: A Good Teacher Is Patient and Consistent”, Lucas Beyer, Xiaohua Zhai, Amélie Royer, Larisa Markeeva, Rohan Anil, Alexander Kolesnikov
- https://arxiv.org/abs/2104.13921#google: “Zero-Shot Detection via Vision and Language Knowledge Distillation”, Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, Yin Cui
- https://arxiv.org/abs/2104.08945#facebook: “Data-Efficient Language-Supervised Zero-Shot Learning With Self-Distillation”, Ruizhe Cheng, Bichen Wu, Peizhao Zhang, Peter Vajda, Joseph E. Gonzalez
- https://syncedreview.com/2021/03/23/chinas-gpt-3-baai-introduces-superscale-intelligence-model-wu-dao-1-0/#baai: “China’s GPT-3? BAAI Introduces Superscale Intelligence Model ‘Wu Dao 1.0’: The Beijing Academy of Artificial Intelligence (BAAI) Releases Wu Dao 1.0, China’s First Large-scale Pretraining Model.”, Synced
- https://arxiv.org/abs/2012.12877#facebook: “Training Data-efficient Image Transformers & Distillation through Attention”, Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou
- https://arxiv.org/abs/2011.12692#tencent: “Towards Playing Full MOBA Games With Deep Reinforcement Learning”
- https://arxiv.org/abs/2002.10957#microsoft: “MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers”, Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, Ming Zhou
- https://blog.research.google/2020/01/towards-conversational-agent-that-can.html: “Towards a Conversational Agent That Can Chat About…Anything”, Daniel Adiwardana, Thang Luong
- https://arxiv.org/abs/1911.04252#google: “Self-training With Noisy Student Improves ImageNet Classification”, Qizhe Xie, Minh-Thang Luong, Eduard Hovy, Quoc V. Le
- https://arxiv.org/abs/1909.10351: “TinyBERT: Distilling BERT for Natural Language Understanding”, Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, Qun Liu
- https://david-abel.github.io/notes/icml_2019.pdf: “ICML 2019 Notes”, David Abel
- https://arxiv.org/abs/1902.02186#deepmind: “Distilling Policy Distillation”, Wojciech Marian Czarnecki, Razvan Pascanu, Simon Osindero, Siddhant M. Jayakumar, Grzegorz Swirszcz, Max Jaderberg
- idea: “Research Ideas”, Gwern
- 2016-luo.pdf: “Face Model Compression by Distilling Knowledge from Neurons”, Ping Luo, Zhenyao Zhu, Ziwei Liu, Xiaogang Wang, Xiaoou Tang