- See Also
- Links
- “Learning Humanoid Locomotion With Transformers”, Et Al 2023
- “ProofNet: Autoformalizing and Formally Proving Undergraduate-Level Mathematics”, Et Al 2023
- “Scaling Vision Transformers to 22 Billion Parameters”, Et Al 2023
- “BMT: Binarized Neural Machine Translation”, Et Al 2023
- “Use GPT-3 Incorrectly: Reduce Costs 40× and Increase Speed by 5×”, 2023
- “TinyMIM: An Empirical Study of Distilling MIM Pre-trained Models”, Et Al 2023
- “Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints”, Et Al 2022
- “MaskDistill: A Unified View of Masked Image Modeling”, 2022
- “Distilled DeepConsensus: Knowledge Distillation for Fast and Accurate DNA Sequence Correction”, Et Al 2022
- “Legged Locomotion in Challenging Terrains Using Egocentric Vision”, Et Al 2022
- “EVA: Exploring the Limits of Masked Visual Representation Learning at Scale”, Et Al 2022
- “EDiff-I: Text-to-Image Diffusion Models With an Ensemble of Expert Denoisers”, Et Al 2022
- “Fast DistilBERT on CPUs”, Et Al 2022
- “Large Language Models Can Self-Improve”, Et Al 2022
- “Exclusive Supermask Subnetwork Training for Continual Learning”, 2022
- “The Unreasonable Effectiveness of Fully-Connected Layers for Low-Data Regimes”, Et Al 2022
- “On Distillation of Guided Diffusion Models”, Et Al 2022
- “Human-level Atari 200× Faster”, Et Al 2022
- “On the Effectiveness of Compact Biomedical Transformers (✱BioBERT)”, Et Al 2022
- “Neural Payoff Machines: Predicting Fair and Stable Payoff Allocations Among Team Members”, Et Al 2022
- “Re2G: Retrieve, Rerank, Generate”, Et Al 2022
- “Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter Encoders for Natural Language Understanding Systems”, FitzGerald Et Al 2022
- “SBERT Studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features”, 2022
- “ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers”, Et Al 2022
- “Dataset Condensation via Efficient Synthetic-Data Parameterization”, Et Al 2022
- “Dialog Inpainting: Turning Documents into Dialogues”, Et Al 2022
- “Solving ImageNet: a Unified Scheme for Training Any Backbone to Top Results”, Et Al 2022
- “Knowledge Distillation: Bad Models Can Be Good Role Models”, Et Al 2022
- “STaR: Bootstrapping Reasoning With Reasoning”, Et Al 2022
- “PPCD-GAN: Progressive Pruning and Class-Aware Distillation for Large-Scale Conditional GANs Compression”, Et Al 2022
- “Self-Distilled StyleGAN: Towards Generation from Internet Photos”, Et Al 2022
- “AutoDistil: Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models”, Et Al 2022
- “DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale”, Et Al 2022
- “Microdosing: Knowledge Distillation for GAN Based Compression”, Et Al 2022
- “ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation”, Et Al 2021
- “Amortized Noisy Channel Neural Machine Translation”, Et Al 2021
- “Causal Distillation for Language Models”, Et Al 2021
- “Extrapolating from a Single Image to a Thousand Classes Using Distillation”, 2021
- “Prune Once for All: Sparse Pre-Trained Language Models”, Et Al 2021
- “Training Verifiers to Solve Math Word Problems”, Et Al 2021
- “Wav2CLIP: Learning Robust Audio Representations From CLIP”, Et Al 2021
- “When in Doubt, Summon the Titans: Efficient Inference With Large Models”, Et Al 2021
- “Lifelong Pretraining: Continually Adapting Language Models to Emerging Corpora”, Et Al 2021
- “Symbolic Knowledge Distillation: from General Language Models to Commonsense Models”, Et Al 2021
- “Language Modelling via Learning to Rank”, Et Al 2021
- “Beyond Pick-and-Place: Tackling Robotic Stacking of Diverse Shapes”, Et Al 2021
- “Unsupervised Neural Machine Translation With Generative Language Models Only”, Et Al 2021
- “Progressive Distillation for Fast Sampling of Diffusion Models”, 2021
- “OTTER: Data Efficient Language-Supervised Zero-Shot Recognition With Optimal Transport Distillation”, Et Al 2021
- “On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis”, Et Al 2021
- “Beyond Distillation: Task-level Mixture-of-Experts (TaskMoE) for Efficient Inference”, Et Al 2021
- “ZSD-YOLO: Zero-Shot YOLO Detection Using Vision-Language Knowledge Distillation”, 2021
- “SPLADE V2: Sparse Lexical and Expansion Model for Information Retrieval”, Et Al 2021
- “KroneckerBERT: Learning Kronecker Decomposition for Pre-trained Language Models via Knowledge Distillation”, Et Al 2021
- “Multi-Task Self-Training for Learning General Representations”, Et Al 2021
- “Dataset Distillation With Infinitely Wide Convolutional Networks”, Et Al 2021
- “Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better”, 2021
- “Knowledge Distillation: A Good Teacher Is Patient and Consistent”, Et Al 2021
- “ResMLP: Feedforward Networks for Image Classification With Data-efficient Training”, Et Al 2021
- “DINO: Emerging Properties in Self-Supervised Vision Transformers”, Et Al 2021
- “Zero-Shot Detection via Vision and Language Knowledge Distillation”, Et Al 2021
- “Data-Efficient Language-Supervised Zero-Shot Learning With Self-Distillation”, Et Al 2021
- “Efficient Transformers in Reinforcement Learning Using Actor-Learner Distillation”, 2021
- “KiloNeRF: Speeding up Neural Radiance Fields With Thousands of Tiny MLPs”, Et Al 2021
- “China’s GPT-3? BAAI Introduces Superscale Intelligence Model ‘Wu Dao 1.0’: The Beijing Academy of Artificial Intelligence (BAAI) Releases Wu Dao 1.0, China’s First Large-scale Pretraining Model.”, 2021
- “Distilling Large Language Models into Tiny and Effective Students Using PQRNN”, Et Al 2021
- “Training Data-efficient Image Transformers & Distillation through Attention”, Et Al 2020
- “Towards Playing Full MOBA Games With Deep Reinforcement Learning”, Et Al 2020
- “A Primer in BERTology: What We Know about How BERT Works”, Et Al 2020
- “Dataset Meta-Learning from Kernel Ridge-Regression”, Et Al 2020
- “TernaryBERT: Distillation-aware Ultra-low Bit BERT”, Et Al 2020
- “SimCLRv2: Big Self-Supervised Models Are Strong Semi-Supervised Learners”, Et Al 2020
- “Movement Pruning: Adaptive Sparsity by Fine-Tuning”, Et Al 2020
- “General Purpose Text Embeddings from Pre-trained Language Models for Scalable Inference”, Et Al 2020
- “MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers”, Et Al 2020
- “Towards a Conversational Agent That Can Chat About…Anything”, 2020
- “Understanding the Generalization Of ‘Lottery Tickets’ In Neural Networks”, 2019
- “Self-training With Noisy Student Improves ImageNet Classification”, Et Al 2019
- “On Warm-Starting Neural Network Training”, 2019
- “DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter”, Et Al 2019
- “TinyBERT: Distilling BERT for Natural Language Understanding”, Et Al 2019
- “Smaller, Faster, Cheaper, Lighter: Introducing DistilGPT, a Distilled Version of GPT”, 2019
- “ICML 2019 Notes”, 2019
- “Mask-Predict: Parallel Decoding of Conditional Masked Language Models”, Et Al 2019
- “Distilling Policy Distillation”, Et Al 2019
- “Compressing GANs Using Knowledge Distillation”, Et Al 2019
- “Neural Probabilistic Motor Primitives for Humanoid Control”, Et Al 2018
- “Dataset Distillation”, Et Al 2018
- “Exploration by Random Network Distillation”, Et Al 2018
- “OCD: Optimal Completion Distillation for Sequence Learning”, Et Al 2018
- “Network Recasting: A Universal Method for Network Architecture Transformation”, Et Al 2018
- “ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech”, Et Al 2018
- “Self-Net: Lifelong Learning via Continual Self-Modeling”, Et Al 2018
- “Kickstarting Deep Reinforcement Learning”, Et Al 2018
- “Faster Gaze Prediction With Dense Networks and Fisher Pruning”, Et Al 2018
- “Parallel WaveNet: Fast High-Fidelity Speech Synthesis”, Et Al 2017
- “Knowledge Concentration: Learning 100K Object Classifiers in a Single CNN”, Et Al 2017
- “Policy Optimization by Genetic Distillation”, 2017
- “N2N Learning: Network to Network Compression via Policy Gradient Reinforcement Learning”, Et Al 2017
- “Training Shallow and Thin Networks for Acceleration via Knowledge Distillation With Conditional Adversarial Networks”, Et Al 2017
- “Distral: Robust Multitask Reinforcement Learning”, Et Al 2017
- “Biased Importance Sampling for Deep Neural Network Training”, 2017
- “Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer”, 2016
- “Do Deep Convolutional Nets Really Need to Be Deep and Convolutional?”, Et Al 2016
- “Face Model Compression by Distilling Knowledge from Neurons”, Et Al 2016
- “Actor-Mimic: Deep Multitask and Transfer Reinforcement Learning”, Et Al 2015
- “Policy Distillation”, Et Al 2015
- “Net2Net: Accelerating Learning via Knowledge Transfer”, Et Al 2015
- “Bayesian Dark Knowledge”, Et Al 2015
- “Distilling the Knowledge in a Neural Network”, Et Al 2015
- “FitNets: Hints for Thin Deep Nets”, Et Al 2014
- “Do Deep Nets Really Need to Be Deep?”, 2013
- “Model Compression”, Bucila Et Al 2006
- “Learning Complex, Extended Sequences Using the Principle of History Compression”, 1992
- “From Vision to Language: Semi-Supervised Learning in Action…at Scale”
- Wikipedia
- Miscellaneous
- Link Bibliography
See Also
Links
“Learning Humanoid Locomotion With Transformers”, Et Al 2023
“Learning Humanoid Locomotion with Transformers”, 2023-03-06 ( ; similar)
“ProofNet: Autoformalizing and Formally Proving Undergraduate-Level Mathematics”, Et Al 2023
“ProofNet: Autoformalizing and Formally Proving Undergraduate-Level Mathematics”, 2023-02-24 ( ; similar; bibliography)
“Scaling Vision Transformers to 22 Billion Parameters”, Et Al 2023
“Scaling Vision Transformers to 22 Billion Parameters”, 2023-02-10 ( ; similar; bibliography)
“BMT: Binarized Neural Machine Translation”, Et Al 2023
“BMT: Binarized Neural Machine Translation”, 2023-02-09 ( ; similar; bibliography)
“Use GPT-3 Incorrectly: Reduce Costs 40× and Increase Speed by 5×”, 2023
“Use GPT-3 incorrectly: reduce costs 40× and increase speed by 5×”, 2023-02-06 ( ; backlinks; similar)
“TinyMIM: An Empirical Study of Distilling MIM Pre-trained Models”, Et Al 2023
“TinyMIM: An Empirical Study of Distilling MIM Pre-trained Models”, 2023-01-03 ( ; similar; bibliography)
“Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints”, Et Al 2022
“Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints”, 2022-12-09 ( ; similar; bibliography)
“MaskDistill: A Unified View of Masked Image Modeling”, 2022
“MaskDistill: A Unified View of Masked Image Modeling”, 2022-11-17 ( ; similar; bibliography)
“Distilled DeepConsensus: Knowledge Distillation for Fast and Accurate DNA Sequence Correction”, Et Al 2022
“Distilled DeepConsensus: Knowledge distillation for fast and accurate DNA sequence correction”, 2022-11-17 ( ; similar)
“Legged Locomotion in Challenging Terrains Using Egocentric Vision”, Et Al 2022
“Legged Locomotion in Challenging Terrains using Egocentric Vision”, 2022-11-14 ( ; similar; bibliography)
“EVA: Exploring the Limits of Masked Visual Representation Learning at Scale”, Et Al 2022
“EVA: Exploring the Limits of Masked Visual Representation Learning at Scale”, 2022-11-14 ( ; similar; bibliography)
“EDiff-I: Text-to-Image Diffusion Models With an Ensemble of Expert Denoisers”, Et Al 2022
“eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers”, 2022-11-02 ( ; similar; bibliography)
“Fast DistilBERT on CPUs”, Et Al 2022
“Fast DistilBERT on CPUs”, 2022-10-27 ( ; similar)
“Large Language Models Can Self-Improve”, Et Al 2022
“Large Language Models Can Self-Improve”, 2022-10-20 ( ; similar; bibliography)
“Exclusive Supermask Subnetwork Training for Continual Learning”, 2022
“Exclusive Supermask Subnetwork Training for Continual Learning”, 2022-10-18 (similar)
“The Unreasonable Effectiveness of Fully-Connected Layers for Low-Data Regimes”, Et Al 2022
“The Unreasonable Effectiveness of Fully-Connected Layers for Low-Data Regimes”, 2022-10-11 ( ; similar)
“On Distillation of Guided Diffusion Models”, Et Al 2022
“On Distillation of Guided Diffusion Models”, 2022-10-06 ( ; similar; bibliography)
“Human-level Atari 200× Faster”, Et Al 2022
“Human-level Atari 200× faster”, 2022-09-15 ( ; similar; bibliography)
“On the Effectiveness of Compact Biomedical Transformers (✱BioBERT)”, Et Al 2022
“On the Effectiveness of Compact Biomedical Transformers (✱BioBERT)”, 2022-09-07 ( ; similar)
“Neural Payoff Machines: Predicting Fair and Stable Payoff Allocations Among Team Members”, Et Al 2022
“Neural Payoff Machines: Predicting Fair and Stable Payoff Allocations Among Team Members”, 2022-08-18 ( ; similar)
“Re2G: Retrieve, Rerank, Generate”, Et Al 2022
“Re2G: Retrieve, Rerank, Generate”, 2022-07-13 ( ; similar; bibliography)
“Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter Encoders for Natural Language Understanding Systems”, FitzGerald Et Al 2022
“Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter Encoders for Natural Language Understanding Systems”, 2022-06-15 ( ; similar; bibliography)
“SBERT Studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features”, 2022
“SBERT studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features”, 2022-06-14 ( ; similar)
“ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers”, Et Al 2022
“ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers”, 2022-06-04 ( ; similar; bibliography)
“Dataset Condensation via Efficient Synthetic-Data Parameterization”, Et Al 2022
“Dataset Condensation via Efficient Synthetic-Data Parameterization”, 2022-05-30 ( ; similar)
“Dialog Inpainting: Turning Documents into Dialogues”, Et Al 2022
“Dialog Inpainting: Turning Documents into Dialogues”, 2022-05-18 ( ; similar; bibliography)
“Solving ImageNet: a Unified Scheme for Training Any Backbone to Top Results”, Et Al 2022
“Solving ImageNet: a Unified Scheme for Training any Backbone to Top Results”, 2022-04-07 (similar; bibliography)
“Knowledge Distillation: Bad Models Can Be Good Role Models”, Et Al 2022
“Knowledge Distillation: Bad Models Can Be Good Role Models”, 2022-03-28 (similar)
“STaR: Bootstrapping Reasoning With Reasoning”, Et Al 2022
“STaR: Bootstrapping Reasoning With Reasoning”, 2022-03-28 ( ; backlinks; similar)
“PPCD-GAN: Progressive Pruning and Class-Aware Distillation for Large-Scale Conditional GANs Compression”, Et Al 2022
“PPCD-GAN: Progressive Pruning and Class-Aware Distillation for Large-Scale Conditional GANs Compression”, 2022-03-16 ( ; similar)
“Self-Distilled StyleGAN: Towards Generation from Internet Photos”, Et Al 2022
“Self-Distilled StyleGAN: Towards Generation from Internet Photos”, 2022-02-24 ( ; similar; bibliography)
“AutoDistil: Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models”, Et Al 2022
“AutoDistil: Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models”, 2022-01-29 ( ; similar)
“DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale”, Et Al 2022
“DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale”, 2022-01-14 ( ; similar; bibliography)
“Microdosing: Knowledge Distillation for GAN Based Compression”, Et Al 2022
“Microdosing: Knowledge Distillation for GAN based Compression”, 2022-01-07 ( ; similar)
“ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation”, Et Al 2021
“ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation”, 2021-12-23 ( ; similar)
“Amortized Noisy Channel Neural Machine Translation”, Et Al 2021
“Amortized Noisy Channel Neural Machine Translation”, 2021-12-16 ( ; similar)
“Causal Distillation for Language Models”, Et Al 2021
“Causal Distillation for Language Models”, 2021-12-05 (similar)
“Extrapolating from a Single Image to a Thousand Classes Using Distillation”, 2021
“Extrapolating from a Single Image to a Thousand Classes using Distillation”, 2021-12-01 (similar)
“Prune Once for All: Sparse Pre-Trained Language Models”, Et Al 2021
“Prune Once for All: Sparse Pre-Trained Language Models”, 2021-11-10 ( ; similar; bibliography)
“Training Verifiers to Solve Math Word Problems”, Et Al 2021
“Training Verifiers to Solve Math Word Problems”, 2021-10-27 ( ; similar)
“Wav2CLIP: Learning Robust Audio Representations From CLIP”, Et Al 2021
“Wav2CLIP: Learning Robust Audio Representations From CLIP”, 2021-10-21 ( ; similar)
“When in Doubt, Summon the Titans: Efficient Inference With Large Models”, Et Al 2021
“When in Doubt, Summon the Titans: Efficient Inference with Large Models”, 2021-10-19 ( ; similar)
“Lifelong Pretraining: Continually Adapting Language Models to Emerging Corpora”, Et Al 2021
“Lifelong Pretraining: Continually Adapting Language Models to Emerging Corpora”, 2021-10-16 ( ; similar)
“Symbolic Knowledge Distillation: from General Language Models to Commonsense Models”, Et Al 2021
“Symbolic Knowledge Distillation: from General Language Models to Commonsense Models”, 2021-10-14 ( ; similar)
“Language Modelling via Learning to Rank”, Et Al 2021
“Language Modelling via Learning to Rank”, 2021-10-13 (similar; bibliography)
“Beyond Pick-and-Place: Tackling Robotic Stacking of Diverse Shapes”, Et Al 2021
“Beyond Pick-and-Place: Tackling Robotic Stacking of Diverse Shapes”, 2021-10-12 ( ; similar)
“Unsupervised Neural Machine Translation With Generative Language Models Only”, Et Al 2021
“Unsupervised Neural Machine Translation with Generative Language Models Only”, 2021-10-11 ( ; similar)
“Progressive Distillation for Fast Sampling of Diffusion Models”, 2021
“Progressive Distillation for Fast Sampling of Diffusion Models”, 2021-10-05 ( ; similar)
“OTTER: Data Efficient Language-Supervised Zero-Shot Recognition With Optimal Transport Distillation”, Et Al 2021
“OTTER: Data Efficient Language-Supervised Zero-Shot Recognition with Optimal Transport Distillation”, 2021-10-05 ( ; similar; bibliography)
“On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis”, Et Al 2021
“On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis”, 2021-10-04 ( ; similar)
“Beyond Distillation: Task-level Mixture-of-Experts (TaskMoE) for Efficient Inference”, Et Al 2021
“Beyond Distillation: Task-level Mixture-of-Experts (TaskMoE) for Efficient Inference”, 2021-09-24 ( ; similar)
“ZSD-YOLO: Zero-Shot YOLO Detection Using Vision-Language Knowledge Distillation”, 2021
“ZSD-YOLO: Zero-Shot YOLO Detection using Vision-Language Knowledge Distillation”, 2021-09-24 ( ; similar; bibliography)
“SPLADE V2: Sparse Lexical and Expansion Model for Information Retrieval”, Et Al 2021
“SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval”, 2021-09-21 ( ; similar)
“KroneckerBERT: Learning Kronecker Decomposition for Pre-trained Language Models via Knowledge Distillation”, Et Al 2021
“KroneckerBERT: Learning Kronecker Decomposition for Pre-trained Language Models via Knowledge Distillation”, 2021-09-13 ( ; similar; bibliography)
“Multi-Task Self-Training for Learning General Representations”, Et Al 2021
“Multi-Task Self-Training for Learning General Representations”, 2021-08-25 ( ; similar)
“Dataset Distillation With Infinitely Wide Convolutional Networks”, Et Al 2021
“Dataset Distillation with Infinitely Wide Convolutional Networks”, 2021-07-27 ( ; similar)
“Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better”, 2021
“Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better”, 2021-06-16 ( ; similar)
“Knowledge Distillation: A Good Teacher Is Patient and Consistent”, Et Al 2021
“Knowledge distillation: A good teacher is patient and consistent”, 2021-06-09 ( ; similar; bibliography)
“ResMLP: Feedforward Networks for Image Classification With Data-efficient Training”, Et Al 2021
“ResMLP: Feedforward networks for image classification with data-efficient training”, 2021-05-07 ( ; similar)
“DINO: Emerging Properties in Self-Supervised Vision Transformers”, Et Al 2021
“DINO: Emerging Properties in Self-Supervised Vision Transformers”, 2021-04-29 ( ; similar)
“Zero-Shot Detection via Vision and Language Knowledge Distillation”, Et Al 2021
“Zero-Shot Detection via Vision and Language Knowledge Distillation”, 2021-04-28 ( ; similar; bibliography)
“Data-Efficient Language-Supervised Zero-Shot Learning With Self-Distillation”, Et Al 2021
“Data-Efficient Language-Supervised Zero-Shot Learning with Self-Distillation”, 2021-04-18 ( ; similar; bibliography)
“Efficient Transformers in Reinforcement Learning Using Actor-Learner Distillation”, 2021
“Efficient Transformers in Reinforcement Learning using Actor-Learner Distillation”, 2021-04-04 ( ; backlinks; similar)
“KiloNeRF: Speeding up Neural Radiance Fields With Thousands of Tiny MLPs”, Et Al 2021
“KiloNeRF: Speeding up Neural Radiance Fields with Thousands of Tiny MLPs”, 2021-03-25 ( ; backlinks; similar)
“China’s GPT-3? BAAI Introduces Superscale Intelligence Model ‘Wu Dao 1.0’: The Beijing Academy of Artificial Intelligence (BAAI) Releases Wu Dao 1.0, China’s First Large-scale Pretraining Model.”, 2021
“China’s GPT-3? BAAI Introduces Superscale Intelligence Model ‘Wu Dao 1.0’: The Beijing Academy of Artificial Intelligence (BAAI) releases Wu Dao 1.0, China’s first large-scale pretraining model.”, 2021-03-23 ( ; similar; bibliography)
“Distilling Large Language Models into Tiny and Effective Students Using PQRNN”, Et Al 2021
“Distilling Large Language Models into Tiny and Effective Students using pQRNN”, 2021-01-21 ( ; similar)
“Training Data-efficient Image Transformers & Distillation through Attention”, Et Al 2020
“Training data-efficient image transformers & distillation through attention”, 2020-12-23 ( ; similar; bibliography)
“Towards Playing Full MOBA Games With Deep Reinforcement Learning”, Et Al 2020
“Towards Playing Full MOBA Games with Deep Reinforcement Learning”, 2020-11-25 ( ; similar; bibliography)
“A Primer in BERTology: What We Know about How BERT Works”, Et Al 2020
“A Primer in BERTology: What we know about how BERT works”, 2020-11-09 ( ; similar)
“Dataset Meta-Learning from Kernel Ridge-Regression”, Et Al 2020
“Dataset Meta-Learning from Kernel Ridge-Regression”, 2020-10-30 ( ; similar)
“TernaryBERT: Distillation-aware Ultra-low Bit BERT”, Et Al 2020
“TernaryBERT: Distillation-aware Ultra-low Bit BERT”, 2020-09-27 ( ; similar)
“SimCLRv2: Big Self-Supervised Models Are Strong Semi-Supervised Learners”, Et Al 2020
“SimCLRv2: Big Self-Supervised Models are Strong Semi-Supervised Learners”, 2020-06-17 ( ; similar)
“Movement Pruning: Adaptive Sparsity by Fine-Tuning”, Et Al 2020
“Movement Pruning: Adaptive Sparsity by Fine-Tuning”, 2020-05-15 ( ; similar)
“General Purpose Text Embeddings from Pre-trained Language Models for Scalable Inference”, Et Al 2020
“General Purpose Text Embeddings from Pre-trained Language Models for Scalable Inference”, 2020-04-29 ( ; similar)
“MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers”, Et Al 2020
“MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers”, 2020-02-25 ( ; similar; bibliography)
“Towards a Conversational Agent That Can Chat About…Anything”, 2020
“Towards a Conversational Agent that Can Chat About…Anything”, 2020-01-28 ( ; similar; bibliography)
“Understanding the Generalization Of ‘Lottery Tickets’ In Neural Networks”, 2019
“Understanding the generalization of ‘lottery tickets’ in neural networks”, 2019-11-25 ( ; backlinks; similar)
“Self-training With Noisy Student Improves ImageNet Classification”, Et Al 2019
“Self-training with Noisy Student improves ImageNet classification”, 2019-11-11 ( ; similar; bibliography)
“On Warm-Starting Neural Network Training”, 2019
“On Warm-Starting Neural Network Training”, 2019-10-18 ( ; similar)
“DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter”, Et Al 2019
“DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter”, 2019-10-02 ( ; backlinks; similar)
“TinyBERT: Distilling BERT for Natural Language Understanding”, Et Al 2019
“TinyBERT: Distilling BERT for Natural Language Understanding”, 2019-09-23 ( ; backlinks; similar; bibliography)
“Smaller, Faster, Cheaper, Lighter: Introducing DistilGPT, a Distilled Version of GPT”, 2019
“Smaller, faster, cheaper, lighter: Introducing DistilGPT, a distilled version of GPT”, 2019-08-28 ( ; similar)
“ICML 2019 Notes”, 2019
“ICML 2019 Notes”, 2019-06 ( ; similar; bibliography)
“Mask-Predict: Parallel Decoding of Conditional Masked Language Models”, Et Al 2019
“Mask-Predict: Parallel Decoding of Conditional Masked Language Models”, 2019-04-19 ( ; similar)
“Distilling Policy Distillation”, Et Al 2019
“Distilling Policy Distillation”, 2019-02-06 ( ; similar; bibliography)
“Compressing GANs Using Knowledge Distillation”, Et Al 2019
“Compressing GANs using Knowledge Distillation”, 2019-02-01 ( ; similar)
“Neural Probabilistic Motor Primitives for Humanoid Control”, Et Al 2018
“Neural probabilistic motor primitives for humanoid control”, 2018-11-28 ( ; similar)
“Dataset Distillation”, Et Al 2018
“Dataset Distillation”, 2018-11-27 ( ; backlinks; similar)
“Exploration by Random Network Distillation”, Et Al 2018
“Exploration by Random Network Distillation”, 2018-10-30 ( ; similar)
“OCD: Optimal Completion Distillation for Sequence Learning”, Et Al 2018
“OCD: Optimal Completion Distillation for Sequence Learning”, 2018-10-02 (backlinks; similar)
“Network Recasting: A Universal Method for Network Architecture Transformation”, Et Al 2018
“Network Recasting: A Universal Method for Network Architecture Transformation”, 2018-09-14 (similar)
“ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech”, Et Al 2018
“ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech”, 2018-07-19 (similar)
“Self-Net: Lifelong Learning via Continual Self-Modeling”, Et Al 2018
“Self-Net: Lifelong Learning via Continual Self-Modeling”, 2018-05-25 ( ; similar)
“Kickstarting Deep Reinforcement Learning”, Et Al 2018
“Kickstarting Deep Reinforcement Learning”, 2018-03-10 ( ; similar)
“Faster Gaze Prediction With Dense Networks and Fisher Pruning”, Et Al 2018
“Faster gaze prediction with dense networks and Fisher pruning”, 2018-01-17 ( ; similar)
“Parallel WaveNet: Fast High-Fidelity Speech Synthesis”, Et Al 2017
“Parallel WaveNet: Fast High-Fidelity Speech Synthesis”, 2017-11-28 (similar)
“Knowledge Concentration: Learning 100K Object Classifiers in a Single CNN”, Et Al 2017
“Knowledge Concentration: Learning 100K Object Classifiers in a Single CNN”, 2017-11-21 ( ; similar)
“Policy Optimization by Genetic Distillation”, 2017
“Policy Optimization by Genetic Distillation”, 2017-11-03 ( ; backlinks; similar)
“N2N Learning: Network to Network Compression via Policy Gradient Reinforcement Learning”, Et Al 2017
“N2N Learning: Network to Network Compression via Policy Gradient Reinforcement Learning”, 2017-09-18 ( ; backlinks; similar)
“Training Shallow and Thin Networks for Acceleration via Knowledge Distillation With Conditional Adversarial Networks”, Et Al 2017
“Training Shallow and Thin Networks for Acceleration via Knowledge Distillation with Conditional Adversarial Networks”, 2017-09-02 ( ; similar)
“Distral: Robust Multitask Reinforcement Learning”, Et Al 2017
“Distral: Robust Multitask Reinforcement Learning”, 2017-07-13 ( ; similar)
“Biased Importance Sampling for Deep Neural Network Training”, 2017
“Biased Importance Sampling for Deep Neural Network Training”, 2017-05-31 ( ; backlinks; similar)
“Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer”, 2016
“Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer”, 2016-12-12 ( ; similar)
“Do Deep Convolutional Nets Really Need to Be Deep and Convolutional?”, Et Al 2016
“Do Deep Convolutional Nets Really Need to be Deep and Convolutional?”, 2016-03-17 ( ; backlinks; similar)
“Face Model Compression by Distilling Knowledge from Neurons”, Et Al 2016
“Face Model Compression by Distilling Knowledge from Neurons”, 2016-03-05 (similar; bibliography)
“Actor-Mimic: Deep Multitask and Transfer Reinforcement Learning”, Et Al 2015
“Actor-Mimic: Deep Multitask and Transfer Reinforcement Learning”, 2015-11-19 ( ; similar)
“Policy Distillation”, Et Al 2015
“Policy Distillation”, 2015-11-19 ( ; similar)
“Net2Net: Accelerating Learning via Knowledge Transfer”, Et Al 2015
“Net2Net: Accelerating Learning via Knowledge Transfer”, 2015-11-18 ( ; backlinks; similar)
“Bayesian Dark Knowledge”, Et Al 2015
“Bayesian Dark Knowledge”, 2015-06-14 ( ; similar)
“Distilling the Knowledge in a Neural Network”, Et Al 2015
“Distilling the Knowledge in a Neural Network”, 2015-03-09 ( ; similar)
“FitNets: Hints for Thin Deep Nets”, Et Al 2014
“FitNets: Hints for Thin Deep Nets”, 2014-12-19 (similar)
“Do Deep Nets Really Need to Be Deep?”, 2013
“Do Deep Nets Really Need to be Deep?”, 2013-12-21 ( ; backlinks; similar)
“Model Compression”, Bucila Et Al 2006
“Model Compression”, 2006 (backlinks)
“Learning Complex, Extended Sequences Using the Principle of History Compression”, 1992
“Learning Complex, Extended Sequences Using the Principle of History Compression”, 1992 ( ; similar)
“From Vision to Language: Semi-Supervised Learning in Action…at Scale”
Wikipedia
Miscellaneous
Link Bibliography
- “ProofNet: Autoformalizing and Formally Proving Undergraduate-Level Mathematics”, Zhangir Azerbayev, Bartosz Piotrowski, Hailey Schoelkopf, Edward W. Ayers, Dragomir Radev, Jeremy Avigad: https://arxiv.org/abs/2302.12433
- “Scaling Vision Transformers to 22 Billion Parameters”: https://arxiv.org/abs/2302.05442#google
- “BMT: Binarized Neural Machine Translation”, Yichi Zhang, Ankush Garg, Yuan Cao, Łukasz Lew, Behrooz Ghorbani, Zhiru Zhang, Orhan Firat: https://arxiv.org/abs/2302.04907#google
- “TinyMIM: An Empirical Study of Distilling MIM Pre-trained Models”, Sucheng Ren, Fangyun Wei, Zheng Zhang, Han Hu: https://arxiv.org/abs/2301.01296#microsoft
- “Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints”: https://arxiv.org/abs/2212.05055#google
- “MaskDistill: A Unified View of Masked Image Modeling”, Anonymous: https://openreview.net/forum?id=wmGlMhaBe0
- “Legged Locomotion in Challenging Terrains Using Egocentric Vision”, Ananye Agarwal, Ashish Kumar, Jitendra Malik, Deepak Pathak: https://arxiv.org/abs/2211.07638
- “EVA: Exploring the Limits of Masked Visual Representation Learning at Scale”, Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, Yue Cao: https://arxiv.org/abs/2211.07636#baai
- “EDiff-I: Text-to-Image Diffusion Models With an Ensemble of Expert Denoisers”: https://arxiv.org/abs/2211.01324#nvidia
- “Large Language Models Can Self-Improve”, Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, Jiawei Han: https://arxiv.org/abs/2210.11610#google
- “On Distillation of Guided Diffusion Models”, Chenlin Meng, Ruiqi Gao, Diederik P. Kingma, Stefano Ermon, Jonathan Ho, Tim Salimans: https://arxiv.org/abs/2210.03142#google
- “Human-level Atari 200× Faster”, Steven Kapturowski, Víctor Campos, Ray Jiang, Nemanja Rakićević, Hado van Hasselt, Charles Blundell, Adrià Puigdomènech Badia: https://arxiv.org/abs/2209.07550#deepmind
- “Re2G: Retrieve, Rerank, Generate”, Michael Glass, Gaetano Rossiello, Md Faisal Mahbub Chowdhury, Ankita Rajaram Naik, Pengshan Cai, Alfio Gliozzo: https://arxiv.org/abs/2207.06300#ibm
- “Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter Encoders for Natural Language Understanding Systems”: https://arxiv.org/abs/2206.07808#amazon
- “ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers”, Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, Yuxiong He: https://arxiv.org/abs/2206.01861#microsoft
- “Dialog Inpainting: Turning Documents into Dialogues”, Zhuyun Dai, Arun Tejasvi Chaganty, Vincent Zhao, Aida Amini, Qazi Mamunur Rashid, Mike Green, Kelvin Guu: https://arxiv.org/abs/2205.09073#google
- “Solving ImageNet: a Unified Scheme for Training Any Backbone to Top Results”, Tal Ridnik, Hussam Lawen, Emanuel Ben-Baruch, Asaf Noy: https://arxiv.org/abs/2204.03475#alibaba
- “Self-Distilled StyleGAN: Towards Generation from Internet Photos”, Ron Mokady, Michal Yarom, Omer Tov, Oran Lang, Daniel Cohen-Or, Tali Dekel, Michal Irani, Inbar Mosseri: https://arxiv.org/abs/2202.12211#google
- “DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale”: https://arxiv.org/abs/2201.05596#microsoft
- “Prune Once for All: Sparse Pre-Trained Language Models”, Ofir Zafrir, Ariel Larey, Guy Boudoukh, Haihao Shen, Moshe Wasserblat: https://arxiv.org/abs/2111.05754
- “Language Modelling via Learning to Rank”, Arvid Frydenlund, Gagandeep Singh, Frank Rudzicz: https://arxiv.org/abs/2110.06961
- “OTTER: Data Efficient Language-Supervised Zero-Shot Recognition With Optimal Transport Distillation”, Bichen Wu, Ruizhe Cheng, Peizhao Zhang, Peter Vajda, Joseph E. Gonzalez: https://openreview.net/forum?id=G89-1yZLFHk
- “ZSD-YOLO: Zero-Shot YOLO Detection Using Vision-Language Knowledge Distillation”, Johnathan Xie, Shuai Zheng: https://arxiv.org/abs/2109.12066
- “KroneckerBERT: Learning Kronecker Decomposition for Pre-trained Language Models via Knowledge Distillation”, Marzieh S. Tahaei, Ella Charlaix, Vahid Partovi Nia, Ali Ghodsi, Mehdi Rezagholizadeh: https://arxiv.org/abs/2109.06243#huawei
- “Knowledge Distillation: A Good Teacher Is Patient and Consistent”, Lucas Beyer, Xiaohua Zhai, Amélie Royer, Larisa Markeeva, Rohan Anil, Alexander Kolesnikov: https://arxiv.org/abs/2106.05237#google
- “Zero-Shot Detection via Vision and Language Knowledge Distillation”, Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, Yin Cui: https://arxiv.org/abs/2104.13921#google
- “Data-Efficient Language-Supervised Zero-Shot Learning With Self-Distillation”, Ruizhe Cheng, Bichen Wu, Peizhao Zhang, Peter Vajda, Joseph E. Gonzalez: https://arxiv.org/abs/2104.08945#facebook
- “China’s GPT-3? BAAI Introduces Superscale Intelligence Model ‘Wu Dao 1.0’: The Beijing Academy of Artificial Intelligence (BAAI) Releases Wu Dao 1.0, China’s First Large-scale Pretraining Model.”, Synced: https://syncedreview.com/2021/03/23/chinas-gpt-3-baai-introduces-superscale-intelligence-model-wu-dao-1-0/#baai
- “Training Data-efficient Image Transformers & Distillation through Attention”, Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou: https://arxiv.org/abs/2012.12877#facebook
- “Towards Playing Full MOBA Games With Deep Reinforcement Learning”: https://arxiv.org/abs/2011.12692#tencent
- “MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers”, Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, Ming Zhou: https://arxiv.org/abs/2002.10957#microsoft
- “Towards a Conversational Agent That Can Chat About…Anything”, Daniel Adiwardana, Thang Luong: https://ai.googleblog.com/2020/01/towards-conversational-agent-that-can.html
- “Self-training With Noisy Student Improves ImageNet Classification”, Qizhe Xie, Minh-Thang Luong, Eduard Hovy, Quoc V. Le: https://arxiv.org/abs/1911.04252#google
- “TinyBERT: Distilling BERT for Natural Language Understanding”, Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, Qun Liu: https://arxiv.org/abs/1909.10351
- “ICML 2019 Notes”, David Abel: https://david-abel.github.io/notes/icml_2019.pdf
- “Distilling Policy Distillation”, Wojciech Marian Czarnecki, Razvan Pascanu, Simon Osindero, Siddhant M. Jayakumar, Grzegorz Swirszcz, Max Jaderberg: https://arxiv.org/abs/1902.02186#deepmind
- “Face Model Compression by Distilling Knowledge from Neurons”, Ping Luo, Zhenyao Zhu, Ziwei Liu, Xiaogang Wang, Xiaoou Tang: 2016-luo.pdf