A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs
The Mamba in the Llama: Distilling and Accelerating Hybrid Models
Gemma 2: Improving Open Language Models at a Practical Size
Scaling the Codebook Size of VQGAN to 100,000 with a Utilization Rate of 99%
From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step
Streamlining Redundant Layers to Compress Large Language Models
SDXS: Real-Time One-Step Latent Diffusion Models with Image Conditions
Do Not Worry if You Do Not Have Data: Building Pretrained Language Models Using Translationese
Bridging the Gap: Sketch to Color Diffusion Model with Semantic Prompt Learning
ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent
ByteDance is secretly using OpenAI’s tech to build a competitor
SMERF: Streamable Memory Efficient Radiance Fields for Real-Time Large-Scene Exploration
Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models (ReSTEM)
Generative Models: What do they know? Do they know things? Let’s find out!
Efficient Transformer Knowledge Distillation: A Performance Review
Implicit Chain-of-Thought Reasoning via Knowledge Distillation
Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling
HyperFields: Towards Zero-Shot Generation of NeRFs from Text
Polynomial Time Cryptanalytic Extraction of Neural Network Models
Reinforced Self-Training (ReST) for Language Modeling
Composable Function-preserving Expansions for Transformer Architectures
Distilling Large Language Models for Biomedical Knowledge Extraction: A Case Study on Adverse Drug Events
Explaining Competitive-Level Programming Solutions using LLMs
GKD: Generalized Knowledge Distillation for Auto-regressive Sequence Models
WikiChat: Stopping the Hallucination of Large Language Model Chatbots by Few-Shot Grounding on Wikipedia
TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
Dr. LLaMA: Improving Small Language Models in Domain-Specific QA via Generative Data Augmentation
Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions
Learning Agile Soccer Skills for a Bipedal Robot with Deep Reinforcement Learning
KD-DLGAN: Data Limited Image Generation via Knowledge Distillation
TRACT: Denoising Diffusion Models with Transitive Closure Time-Distillation
ProofNet: Autoformalizing and Formally Proving Undergraduate-Level Mathematics
Use GPT-3 incorrectly: reduce costs 40× and increase speed by 5×
TinyMIM: An Empirical Study of Distilling MIM Pre-trained Models
Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints
Solving math word problems with process- and outcome-based feedback
Distilled DeepConsensus: Knowledge distillation for fast and accurate DNA sequence correction
EVA: Exploring the Limits of Masked Visual Representation Learning at Scale
Legged Locomotion in Challenging Terrains using Egocentric Vision
eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers
Exclusive Supermask Subnetwork Training for Continual Learning
The Unreasonable Effectiveness of Fully-Connected Layers for Low-Data Regimes
Small Character Models Match Large Word Models for Autocomplete Under Memory Constraints
On the Effectiveness of Compact Biomedical Transformers (BioBERT)
Neural Payoff Machines: Predicting Fair and Stable Payoff Allocations Among Team Members
Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter Encoders for Natural Language Understanding Systems
SBERT studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features
ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers
Dataset Condensation via Efficient Synthetic-Data Parameterization
UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes
Solving ImageNet: a Unified Scheme for Training any Backbone to Top Results
Knowledge Distillation: Bad Models Can Be Good Role Models
PPCD-GAN: Progressive Pruning and Class-Aware Distillation for Large-Scale Conditional GANs Compression
Self-Distilled StyleGAN: Towards Generation from Internet Photos
AutoDistil: Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models
DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
Microdosing: Knowledge Distillation for GAN based Compression
ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation
Extrapolating from a Single Image to a Thousand Classes using Distillation
When in Doubt, Summon the Titans: Efficient Inference with Large Models
Lifelong Pretraining: Continually Adapting Language Models to Emerging Corpora
Symbolic Knowledge Distillation: from General Language Models to Commonsense Models
Beyond Pick-and-Place: Tackling Robotic Stacking of Diverse Shapes
Unsupervised Neural Machine Translation with Generative Language Models Only
OTTER: Data Efficient Language-Supervised Zero-Shot Recognition with Optimal Transport Distillation
Progressive Distillation for Fast Sampling of Diffusion Models
On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis
ZSD-YOLO: Zero-Shot YOLO Detection using Vision-Language Knowledge Distillation
Beyond Distillation: Task-level Mixture-of-Experts (TaskMoE) for Efficient Inference
SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval
KroneckerBERT: Learning Kronecker Decomposition for Pre-trained Language Models via Knowledge Distillation
Multi-Task Self-Training for Learning General Representations
Dataset Distillation with Infinitely Wide Convolutional Networks
Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better
Knowledge distillation: A good teacher is patient and consistent
ResMLP: Feedforward networks for image classification with data-efficient training
DINO: Emerging Properties in Self-Supervised Vision Transformers
Zero-Shot Detection via Vision and Language Knowledge Distillation
Data-Efficient Language-Supervised Zero-Shot Learning with Self-Distillation
ALD: Efficient Transformers in Reinforcement Learning using Actor-Learner Distillation
KiloNeRF: Speeding up Neural Radiance Fields with Thousands of Tiny MLPs
China’s GPT-3? BAAI Introduces Superscale Intelligence Model ‘Wu Dao 1.0’: The Beijing Academy of Artificial Intelligence (BAAI) releases Wu Dao 1.0, China’s first large-scale pretraining model.
Distilling Large Language Models into Tiny and Effective Students using pQRNN
Training data-efficient image transformers & distillation through attention
Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning
Towards Playing Full MOBA Games with Deep Reinforcement Learning
SimCLRv2: Big Self-Supervised Models are Strong Semi-Supervised Learners
General Purpose Text Embeddings from Pre-trained Language Models for Scalable Inference
MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers
Towards a Conversational Agent that Can Chat About…Anything
Understanding the generalization of ‘lottery tickets’ in neural networks
Self-training with Noisy Student improves ImageNet classification
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
TinyBERT: Distilling BERT for Natural Language Understanding
Smaller, faster, cheaper, lighter: Introducing DistilGPT, a distilled version of GPT
Well-Read Students Learn Better: On the Importance of Pre-training Compact Models
NoGAN: Decrappification, DeOldification, and Super Resolution
Mask-Predict: Parallel Decoding of Conditional Masked Language Models
Neural probabilistic motor primitives for humanoid control
OCD: Optimal Completion Distillation for Sequence Learning
Network Recasting: A Universal Method for Network Architecture Transformation
ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech
Faster gaze prediction with dense networks and Fisher pruning
Knowledge Concentration: Learning 100K Object Classifiers in a Single CNN
N2N Learning: Network to Network Compression via Policy Gradient Reinforcement Learning
Training Shallow and Thin Networks for Acceleration via Knowledge Distillation with Conditional Adversarial Networks
Biased Importance Sampling for Deep Neural Network Training
Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer
Do Deep Convolutional Nets Really Need to be Deep and Convolutional?
Face Model Compression by Distilling Knowledge from Neurons
Actor-Mimic: Deep Multitask and Transfer Reinforcement Learning
Learning Complex, Extended Sequences Using the Principle of History Compression
Dota 2 with Large Scale Deep Reinforcement Learning § pg. 11
From Vision to Language: Semi-Supervised Learning in Action…at Scale
Dehghani et al 2023, Figure 8: shape bias of the ViT-22B model is almost human-like, as compared to past NN models
Balaji et al 2022, Figure 2: eDiff-I as multiple unrolled models during diffusion phases
Balaji et al 2022, Table 1: zero-shot FID comparison between eDiff-I and other SOTA image generation models, showing eDiff-I wins
Beyer et al 2021, Figure 3: knowledge distillation over 1 million epochs
https://discuss.luxonis.com/blog/3272-datadreamer-creating-custom-datasets-made-easy
https://medium.com/neuralmachine/knowledge-distillation-dc241d7c2322
https://www.reddit.com/r/MachineLearning/comments/1fyb9jj/p_model2vec_distill_a_small_fast_model_from_any/
https://www.theverge.com/2023/3/29/23662621/google-bard-chatgpt-sharegpt-training-denies