- See Also
- Gwern
- Links
- “Improving Text Embeddings With Large Language Models”, Wang et al 2023
- “ReST Meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent”, Aksitov et al 2023
- “ByteDance Is Secretly Using OpenAI’s Tech to Build a Competitor”, Heath 2023
- “SMERF: Streamable Memory Efficient Radiance Fields for Real-Time Large-Scene Exploration”, Duckworth et al 2023
- “Beyond Human Data: Scaling Self-Training for Problem-Solving With Language Models (ReSTEM)”, Singh et al 2023
- “Generative Models: What Do They Know? Do They Know Things? Let’s Find Out!”, Du et al 2023
- “Efficient Transformer Knowledge Distillation: A Performance Review”, Brown et al 2023
- “Implicit Chain-of-Thought Reasoning via Knowledge Distillation”, Deng et al 2023
- “Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling”, Gandhi et al 2023
- “HyperFields: Towards Zero-Shot Generation of NeRFs from Text”, Babu et al 2023
- “Polynomial Time Cryptanalytic Extraction of Neural Network Models”, Shamir et al 2023
- “OSD: Online Speculative Decoding”, Liu et al 2023
- “ReST: Reinforced Self-Training (ReST) for Language Modeling”, Gulcehre et al 2023
- “Composable Function-preserving Expansions for Transformer Architectures”, Gesmundo & Maile 2023
- “Distilling Large Language Models for Biomedical Knowledge Extraction: A Case Study on Adverse Drug Events”, Gu et al 2023
- “Explaining Competitive-Level Programming Solutions Using LLMs”, Li et al 2023
- “GKD: Generalized Knowledge Distillation for Auto-regressive Sequence Models”, Agarwal et al 2023
- “WikiChat: Stopping the Hallucination of Large Language Model Chatbots by Few-Shot Grounding on Wikipedia”, Semnani et al 2023
- “VanillaNet: the Power of Minimalism in Deep Learning”, Chen et al 2023
- “TinyStories: How Small Can Language Models Be and Still Speak Coherent English?”, Eldan & Li 2023
- “Dr. LLaMa: Improving Small Language Models in Domain-Specific QA via Generative Data Augmentation”, Guo et al 2023
- “Distilling Step-by-Step! Outperforming Larger Language Models With Less Training Data and Smaller Model Sizes”, Hsieh et al 2023
- “LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions”, Wu et al 2023
- “Learning Agile Soccer Skills for a Bipedal Robot With Deep Reinforcement Learning”, Haarnoja et al 2023
- “A Cookbook of Self-Supervised Learning”, Balestriero et al 2023
- “KD-DLGAN: Data Limited Image Generation via Knowledge Distillation”, Cui et al 2023
- “TRACT: Denoising Diffusion Models With Transitive Closure Time-Distillation”, Berthelot et al 2023
- “Learning Humanoid Locomotion With Transformers”, Radosavovic et al 2023
- “Consistency Models”, Song et al 2023
- “ProofNet: Autoformalizing and Formally Proving Undergraduate-Level Mathematics”, Azerbayev et al 2023
- “Scaling Vision Transformers to 22 Billion Parameters”, Dehghani et al 2023
- “BMT: Binarized Neural Machine Translation”, Zhang et al 2023
- “Use GPT-3 Incorrectly: Reduce Costs 40× and Increase Speed by 5×”, Pullen 2023
- “TinyMIM: An Empirical Study of Distilling MIM Pre-trained Models”, Ren et al 2023
- “Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints”, Komatsuzaki et al 2022
- “Solving Math Word Problems With Process & Outcome-based Feedback”, Uesato et al 2022
- “Distilled DeepConsensus: Knowledge Distillation for Fast and Accurate DNA Sequence Correction”, Belyaeva et al 2022
- “MaskDistill: A Unified View of Masked Image Modeling”, Anonymous 2022
- “EVA: Exploring the Limits of Masked Visual Representation Learning at Scale”, Fang et al 2022
- “Legged Locomotion in Challenging Terrains Using Egocentric Vision”, Agarwal et al 2022
- “eDiff-I: Text-to-Image Diffusion Models With an Ensemble of Expert Denoisers”, Balaji et al 2022
- “Fast DistilBERT on CPUs”, Shen et al 2022
- “Large Language Models Can Self-Improve”, Huang et al 2022
- “Exclusive Supermask Subnetwork Training for Continual Learning”, Yadav & Bansal 2022
- “The Unreasonable Effectiveness of Fully-Connected Layers for Low-Data Regimes”, Kocsis et al 2022
- “On Distillation of Guided Diffusion Models”, Meng et al 2022
- “Small Character Models Match Large Word Models for Autocomplete Under Memory Constraints”, Jawahar et al 2022
- “Omnigrok: Grokking Beyond Algorithmic Data”, Liu et al 2022
- “Human-level Atari 200× Faster”, Kapturowski et al 2022
- “On the Effectiveness of Compact Biomedical Transformers (✱BioBERT)”, Rohanian et al 2022
- “Neural Payoff Machines: Predicting Fair and Stable Payoff Allocations Among Team Members”, Cornelisse et al 2022
- “Re2G: Retrieve, Rerank, Generate”, Glass et al 2022
- “Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter Encoders for Natural Language Understanding Systems”, FitzGerald et al 2022
- “SBERT Studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features”, Opitz & Frank 2022
- “ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers”, Yao et al 2022
- “Dataset Condensation via Efficient Synthetic-Data Parameterization”, Kim et al 2022
- “UViM: A Unified Modeling Approach for Vision With Learned Guiding Codes”, Kolesnikov et al 2022
- “Dialog Inpainting: Turning Documents into Dialogues”, Dai et al 2022
- “Solving ImageNet: a Unified Scheme for Training Any Backbone to Top Results”, Ridnik et al 2022
- “STaR: Bootstrapping Reasoning With Reasoning”, Zelikman et al 2022
- “Knowledge Distillation: Bad Models Can Be Good Role Models”, Kaplun et al 2022
- “PPCD-GAN: Progressive Pruning and Class-Aware Distillation for Large-Scale Conditional GANs Compression”, Vo et al 2022
- “Self-Distilled StyleGAN: Towards Generation from Internet Photos”, Mokady et al 2022
- “AutoDistil: Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models”, Xu et al 2022
- “DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale”, Rajbhandari et al 2022
- “Microdosing: Knowledge Distillation for GAN Based Compression”, Helminger et al 2022
- “ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation”, Wang et al 2021
- “Amortized Noisy Channel Neural Machine Translation”, Pang et al 2021
- “Causal Distillation for Language Models”, Wu et al 2021
- “Extrapolating from a Single Image to a Thousand Classes Using Distillation”, Asano & Saeed 2021
- “Prune Once for All: Sparse Pre-Trained Language Models”, Zafrir et al 2021
- “Training Verifiers to Solve Math Word Problems”, Cobbe et al 2021
- “Wav2CLIP: Learning Robust Audio Representations From CLIP”, Wu et al 2021
- “When in Doubt, Summon the Titans: Efficient Inference With Large Models”, Rawat et al 2021
- “Lifelong Pretraining: Continually Adapting Language Models to Emerging Corpora”, Jin et al 2021
- “Symbolic Knowledge Distillation: from General Language Models to Commonsense Models”, West et al 2021
- “Language Modelling via Learning to Rank”, Frydenlund et al 2021
- “Beyond Pick-and-Place: Tackling Robotic Stacking of Diverse Shapes”, Lee et al 2021
- “Unsupervised Neural Machine Translation With Generative Language Models Only”, Han et al 2021
- “OTTER: Data Efficient Language-Supervised Zero-Shot Recognition With Optimal Transport Distillation”, Wu et al 2021
- “Progressive Distillation for Fast Sampling of Diffusion Models”, Salimans & Ho 2021
- “On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis”, Lai et al 2021
- “ZSD-YOLO: Zero-Shot YOLO Detection Using Vision-Language Knowledge Distillation”, Xie & Zheng 2021
- “Beyond Distillation: Task-level Mixture-of-Experts (TaskMoE) for Efficient Inference”, Kudugunta et al 2021
- “SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval”, Formal et al 2021
- “KroneckerBERT: Learning Kronecker Decomposition for Pre-trained Language Models via Knowledge Distillation”, Tahaei et al 2021
- “Multi-Task Self-Training for Learning General Representations”, Ghiasi et al 2021
- “Dataset Distillation With Infinitely Wide Convolutional Networks”, Nguyen et al 2021
- “Knowledge-Adaptation Priors”, Khan & Swaroop 2021
- “Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better”, Menghani 2021
- “Knowledge Distillation: A Good Teacher Is Patient and Consistent”, Beyer et al 2021
- “ResMLP: Feedforward Networks for Image Classification With Data-efficient Training”, Touvron et al 2021
- “DINO: Emerging Properties in Self-Supervised Vision Transformers”, Caron et al 2021
- “Zero-Shot Detection via Vision and Language Knowledge Distillation”, Gu et al 2021
- “Data-Efficient Language-Supervised Zero-Shot Learning With Self-Distillation”, Cheng et al 2021
- “ALD: Efficient Transformers in Reinforcement Learning Using Actor-Learner Distillation”, Parisotto & Salakhutdinov 2021
- “KiloNeRF: Speeding up Neural Radiance Fields With Thousands of Tiny MLPs”, Reiser et al 2021
- “China’s GPT-3? BAAI Introduces Superscale Intelligence Model ‘Wu Dao 1.0’: The Beijing Academy of Artificial Intelligence (BAAI) Releases Wu Dao 1.0, China’s First Large-scale Pretraining Model.”, Synced 2021
- “Distilling Large Language Models into Tiny and Effective Students Using pQRNN”, Kaliamoorthi et al 2021
- “Training Data-efficient Image Transformers & Distillation through Attention”, Touvron et al 2020
- “Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning”, Allen-Zhu & Li 2020
- “Towards Playing Full MOBA Games With Deep Reinforcement Learning”, Ye et al 2020
- “A Primer in BERTology: What We Know about How BERT Works”, Rogers et al 2020
- “Dataset Meta-Learning from Kernel Ridge-Regression”, Nguyen et al 2020
- “TernaryBERT: Distillation-aware Ultra-low Bit BERT”, Zhang et al 2020
- “SimCLRv2: Big Self-Supervised Models Are Strong Semi-Supervised Learners”, Chen et al 2020
- “Movement Pruning: Adaptive Sparsity by Fine-Tuning”, Sanh et al 2020
- “General Purpose Text Embeddings from Pre-trained Language Models for Scalable Inference”, Du et al 2020
- “Cryptanalytic Extraction of Neural Network Models”, Carlini et al 2020
- “MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers”, Wang et al 2020
- “Towards a Conversational Agent That Can Chat About…Anything”, Adiwardana & Luong 2020
- “Understanding the Generalization of ‘Lottery Tickets’ in Neural Networks”, Morcos & Tian 2019
- “Self-training With Noisy Student Improves ImageNet Classification”, Xie et al 2019
- “On Warm-Starting Neural Network Training”, Ash & Adams 2019
- “DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter”, Sanh et al 2019
- “TinyBERT: Distilling BERT for Natural Language Understanding”, Jiao et al 2019
- “Smaller, Faster, Cheaper, Lighter: Introducing DistilGPT, a Distilled Version of GPT”, Sanh 2019
- “Well-Read Students Learn Better: On the Importance of Pre-training Compact Models”, Turc et al 2019
- “ICML 2019 Notes”, Abel 2019
- “Mask-Predict: Parallel Decoding of Conditional Masked Language Models”, Ghazvininejad et al 2019
- “Distilling Policy Distillation”, Czarnecki et al 2019
- “Compressing GANs Using Knowledge Distillation”, Aguinaldo et al 2019
- “Neural Probabilistic Motor Primitives for Humanoid Control”, Merel et al 2018
- “Dataset Distillation”, Wang et al 2018
- “Exploration by Random Network Distillation”, Burda et al 2018
- “OCD: Optimal Completion Distillation for Sequence Learning”, Sabour et al 2018
- “Network Recasting: A Universal Method for Network Architecture Transformation”, Yu et al 2018
- “ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech”, Ping et al 2018
- “Self-Net: Lifelong Learning via Continual Self-Modeling”, Camp et al 2018
- “Self-distillation: Born Again Neural Networks”, Furlanello et al 2018
- “Kickstarting Deep Reinforcement Learning”, Schmitt et al 2018
- “Faster Gaze Prediction With Dense Networks and Fisher Pruning”, Theis et al 2018
- “Parallel WaveNet: Fast High-Fidelity Speech Synthesis”, Oord et al 2017
- “Knowledge Concentration: Learning 100K Object Classifiers in a Single CNN”, Gao et al 2017
- “Policy Optimization by Genetic Distillation”, Gangwani & Peng 2017
- “N2N Learning: Network to Network Compression via Policy Gradient Reinforcement Learning”, Ashok et al 2017
- “Training Shallow and Thin Networks for Acceleration via Knowledge Distillation With Conditional Adversarial Networks”, Xu et al 2017
- “Distral: Robust Multitask Reinforcement Learning”, Teh et al 2017
- “Biased Importance Sampling for Deep Neural Network Training”, Katharopoulos & Fleuret 2017
- “Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer”, Zagoruyko & Komodakis 2016
- “Do Deep Convolutional Nets Really Need to Be Deep and Convolutional?”, Urban et al 2016
- “Face Model Compression by Distilling Knowledge from Neurons”, Luo et al 2016
- “Policy Distillation”, Rusu et al 2015
- “Actor-Mimic: Deep Multitask and Transfer Reinforcement Learning”, Parisotto et al 2015
- “Net2Net: Accelerating Learning via Knowledge Transfer”, Chen et al 2015
- “Bayesian Dark Knowledge”, Korattikara et al 2015
- “Distilling the Knowledge in a Neural Network”, Hinton et al 2015
- “FitNets: Hints for Thin Deep Nets”, Romero et al 2014
- “Do Deep Nets Really Need to Be Deep?”, Ba & Caruana 2013
- “Model Compression”, Bucila 2006
- “Learning Complex, Extended Sequences Using the Principle of History Compression”, Schmidhuber 1992
- “Dota 2 With Large Scale Deep Reinforcement Learning § Pg11”, Rerun 2024 (page 11 org openai)
- “From Vision to Language: Semi-Supervised Learning in Action…at Scale”
- Sort By Magic
- Wikipedia
- Miscellaneous
- Link Bibliography
See Also
Gwern
“Research Ideas”, Gwern 2017
Links
“Improving Text Embeddings With Large Language Models”, Wang et al 2023
“ReST Meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent”, Aksitov et al 2023
“ByteDance Is Secretly Using OpenAI’s Tech to Build a Competitor”, Heath 2023
“SMERF: Streamable Memory Efficient Radiance Fields for Real-Time Large-Scene Exploration”, Duckworth et al 2023
“Beyond Human Data: Scaling Self-Training for Problem-Solving With Language Models (ReSTEM)”, Singh et al 2023
“Generative Models: What Do They Know? Do They Know Things? Let’s Find Out!”, Du et al 2023
“Efficient Transformer Knowledge Distillation: A Performance Review”, Brown et al 2023
“Implicit Chain-of-Thought Reasoning via Knowledge Distillation”, Deng et al 2023
“Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling”, Gandhi et al 2023
“HyperFields: Towards Zero-Shot Generation of NeRFs from Text”, Babu et al 2023
“Polynomial Time Cryptanalytic Extraction of Neural Network Models”, Shamir et al 2023
“OSD: Online Speculative Decoding”, Liu et al 2023
“ReST: Reinforced Self-Training (ReST) for Language Modeling”, Gulcehre et al 2023
“Composable Function-preserving Expansions for Transformer Architectures”, Gesmundo & Maile 2023
“Distilling Large Language Models for Biomedical Knowledge Extraction: A Case Study on Adverse Drug Events”, Gu et al 2023
“Explaining Competitive-Level Programming Solutions Using LLMs”, Li et al 2023
“GKD: Generalized Knowledge Distillation for Auto-regressive Sequence Models”, Agarwal et al 2023
“WikiChat: Stopping the Hallucination of Large Language Model Chatbots by Few-Shot Grounding on Wikipedia”, Semnani et al 2023
“VanillaNet: the Power of Minimalism in Deep Learning”, Chen et al 2023
“TinyStories: How Small Can Language Models Be and Still Speak Coherent English?”, Eldan & Li 2023
“Dr. LLaMa: Improving Small Language Models in Domain-Specific QA via Generative Data Augmentation”, Guo et al 2023
“Distilling Step-by-Step! Outperforming Larger Language Models With Less Training Data and Smaller Model Sizes”, Hsieh et al 2023
“LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions”, Wu et al 2023
“Learning Agile Soccer Skills for a Bipedal Robot With Deep Reinforcement Learning”, Haarnoja et al 2023
“A Cookbook of Self-Supervised Learning”, Balestriero et al 2023
“KD-DLGAN: Data Limited Image Generation via Knowledge Distillation”, Cui et al 2023
“TRACT: Denoising Diffusion Models With Transitive Closure Time-Distillation”, Berthelot et al 2023
“Learning Humanoid Locomotion With Transformers”, Radosavovic et al 2023
“Consistency Models”, Song et al 2023
“ProofNet: Autoformalizing and Formally Proving Undergraduate-Level Mathematics”, Azerbayev et al 2023
“Scaling Vision Transformers to 22 Billion Parameters”, Dehghani et al 2023
“BMT: Binarized Neural Machine Translation”, Zhang et al 2023
“Use GPT-3 Incorrectly: Reduce Costs 40× and Increase Speed by 5×”, Pullen 2023
“TinyMIM: An Empirical Study of Distilling MIM Pre-trained Models”, Ren et al 2023
“Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints”, Komatsuzaki et al 2022
“Solving Math Word Problems With Process & Outcome-based Feedback”, Uesato et al 2022
“Distilled DeepConsensus: Knowledge Distillation for Fast and Accurate DNA Sequence Correction”, Belyaeva et al 2022
“MaskDistill: A Unified View of Masked Image Modeling”, Anonymous 2022
“EVA: Exploring the Limits of Masked Visual Representation Learning at Scale”, Fang et al 2022
“Legged Locomotion in Challenging Terrains Using Egocentric Vision”, Agarwal et al 2022
“eDiff-I: Text-to-Image Diffusion Models With an Ensemble of Expert Denoisers”, Balaji et al 2022
“Fast DistilBERT on CPUs”, Shen et al 2022
“Large Language Models Can Self-Improve”, Huang et al 2022
“Exclusive Supermask Subnetwork Training for Continual Learning”, Yadav & Bansal 2022
“The Unreasonable Effectiveness of Fully-Connected Layers for Low-Data Regimes”, Kocsis et al 2022
“On Distillation of Guided Diffusion Models”, Meng et al 2022
“Small Character Models Match Large Word Models for Autocomplete Under Memory Constraints”, Jawahar et al 2022
“Omnigrok: Grokking Beyond Algorithmic Data”, Liu et al 2022
“Human-level Atari 200× Faster”, Kapturowski et al 2022
“On the Effectiveness of Compact Biomedical Transformers (✱BioBERT)”, Rohanian et al 2022
“Neural Payoff Machines: Predicting Fair and Stable Payoff Allocations Among Team Members”, Cornelisse et al 2022
“Re2G: Retrieve, Rerank, Generate”, Glass et al 2022
“Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter Encoders for Natural Language Understanding Systems”, FitzGerald et al 2022
“SBERT Studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features”, Opitz & Frank 2022
“ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers”, Yao et al 2022
“Dataset Condensation via Efficient Synthetic-Data Parameterization”, Kim et al 2022
“UViM: A Unified Modeling Approach for Vision With Learned Guiding Codes”, Kolesnikov et al 2022
“Dialog Inpainting: Turning Documents into Dialogues”, Dai et al 2022
“Solving ImageNet: a Unified Scheme for Training Any Backbone to Top Results”, Ridnik et al 2022
“STaR: Bootstrapping Reasoning With Reasoning”, Zelikman et al 2022
“Knowledge Distillation: Bad Models Can Be Good Role Models”, Kaplun et al 2022
“PPCD-GAN: Progressive Pruning and Class-Aware Distillation for Large-Scale Conditional GANs Compression”, Vo et al 2022
“Self-Distilled StyleGAN: Towards Generation from Internet Photos”, Mokady et al 2022
“AutoDistil: Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models”, Xu et al 2022
“DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale”, Rajbhandari et al 2022
“Microdosing: Knowledge Distillation for GAN Based Compression”, Helminger et al 2022
“ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation”, Wang et al 2021
“Amortized Noisy Channel Neural Machine Translation”, Pang et al 2021
“Causal Distillation for Language Models”, Wu et al 2021
“Extrapolating from a Single Image to a Thousand Classes Using Distillation”, Asano & Saeed 2021
“Prune Once for All: Sparse Pre-Trained Language Models”, Zafrir et al 2021
“Training Verifiers to Solve Math Word Problems”, Cobbe et al 2021
“Wav2CLIP: Learning Robust Audio Representations From CLIP”, Wu et al 2021
“When in Doubt, Summon the Titans: Efficient Inference With Large Models”, Rawat et al 2021
“Lifelong Pretraining: Continually Adapting Language Models to Emerging Corpora”, Jin et al 2021
“Symbolic Knowledge Distillation: from General Language Models to Commonsense Models”, West et al 2021
“Language Modelling via Learning to Rank”, Frydenlund et al 2021
“Beyond Pick-and-Place: Tackling Robotic Stacking of Diverse Shapes”, Lee et al 2021
“Unsupervised Neural Machine Translation With Generative Language Models Only”, Han et al 2021
“OTTER: Data Efficient Language-Supervised Zero-Shot Recognition With Optimal Transport Distillation”, Wu et al 2021
“Progressive Distillation for Fast Sampling of Diffusion Models”, Salimans & Ho 2021
“On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis”, Lai et al 2021
“ZSD-YOLO: Zero-Shot YOLO Detection Using Vision-Language Knowledge Distillation”, Xie & Zheng 2021
“Beyond Distillation: Task-level Mixture-of-Experts (TaskMoE) for Efficient Inference”, Kudugunta et al 2021
“SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval”, Formal et al 2021
“KroneckerBERT: Learning Kronecker Decomposition for Pre-trained Language Models via Knowledge Distillation”, Tahaei et al 2021
“Multi-Task Self-Training for Learning General Representations”, Ghiasi et al 2021
“Dataset Distillation With Infinitely Wide Convolutional Networks”, Nguyen et al 2021
“Knowledge-Adaptation Priors”, Khan & Swaroop 2021
“Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better”, Menghani 2021
“Knowledge Distillation: A Good Teacher Is Patient and Consistent”, Beyer et al 2021
“ResMLP: Feedforward Networks for Image Classification With Data-efficient Training”, Touvron et al 2021
“DINO: Emerging Properties in Self-Supervised Vision Transformers”, Caron et al 2021
“Zero-Shot Detection via Vision and Language Knowledge Distillation”, Gu et al 2021
“Data-Efficient Language-Supervised Zero-Shot Learning With Self-Distillation”, Cheng et al 2021
“ALD: Efficient Transformers in Reinforcement Learning Using Actor-Learner Distillation”, Parisotto & Salakhutdinov 2021
“KiloNeRF: Speeding up Neural Radiance Fields With Thousands of Tiny MLPs”, Reiser et al 2021
“China’s GPT-3? BAAI Introduces Superscale Intelligence Model ‘Wu Dao 1.0’: The Beijing Academy of Artificial Intelligence (BAAI) Releases Wu Dao 1.0, China’s First Large-scale Pretraining Model.”, Synced 2021
“Distilling Large Language Models into Tiny and Effective Students Using pQRNN”, Kaliamoorthi et al 2021
“Training Data-efficient Image Transformers & Distillation through Attention”, Touvron et al 2020
“Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning”, Allen-Zhu & Li 2020
“Towards Playing Full MOBA Games With Deep Reinforcement Learning”, Ye et al 2020
“A Primer in BERTology: What We Know about How BERT Works”, Rogers et al 2020
“Dataset Meta-Learning from Kernel Ridge-Regression”, Nguyen et al 2020
“TernaryBERT: Distillation-aware Ultra-low Bit BERT”, Zhang et al 2020
“SimCLRv2: Big Self-Supervised Models Are Strong Semi-Supervised Learners”, Chen et al 2020
“Movement Pruning: Adaptive Sparsity by Fine-Tuning”, Sanh et al 2020
“General Purpose Text Embeddings from Pre-trained Language Models for Scalable Inference”, Du et al 2020
“Cryptanalytic Extraction of Neural Network Models”, Carlini et al 2020
“MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers”, Wang et al 2020
“Towards a Conversational Agent That Can Chat About…Anything”, Adiwardana & Luong 2020
“Understanding the Generalization of ‘Lottery Tickets’ in Neural Networks”, Morcos & Tian 2019
“Self-training With Noisy Student Improves ImageNet Classification”, Xie et al 2019
“On Warm-Starting Neural Network Training”, Ash & Adams 2019
“DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter”, Sanh et al 2019
“TinyBERT: Distilling BERT for Natural Language Understanding”, Jiao et al 2019
“Smaller, Faster, Cheaper, Lighter: Introducing DistilGPT, a Distilled Version of GPT”, Sanh 2019
“Well-Read Students Learn Better: On the Importance of Pre-training Compact Models”, Turc et al 2019
“ICML 2019 Notes”, Abel 2019
“Mask-Predict: Parallel Decoding of Conditional Masked Language Models”, Ghazvininejad et al 2019
“Distilling Policy Distillation”, Czarnecki et al 2019
“Compressing GANs Using Knowledge Distillation”, Aguinaldo et al 2019
“Neural Probabilistic Motor Primitives for Humanoid Control”, Merel et al 2018
“Dataset Distillation”, Wang et al 2018
“Exploration by Random Network Distillation”, Burda et al 2018
“OCD: Optimal Completion Distillation for Sequence Learning”, Sabour et al 2018
“Network Recasting: A Universal Method for Network Architecture Transformation”, Yu et al 2018
“ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech”, Ping et al 2018
“Self-Net: Lifelong Learning via Continual Self-Modeling”, Camp et al 2018
“Self-distillation: Born Again Neural Networks”, Furlanello et al 2018
“Kickstarting Deep Reinforcement Learning”, Schmitt et al 2018
“Faster Gaze Prediction With Dense Networks and Fisher Pruning”, Theis et al 2018
“Parallel WaveNet: Fast High-Fidelity Speech Synthesis”, Oord et al 2017
“Knowledge Concentration: Learning 100K Object Classifiers in a Single CNN”, Gao et al 2017
“Policy Optimization by Genetic Distillation”, Gangwani & Peng 2017
“N2N Learning: Network to Network Compression via Policy Gradient Reinforcement Learning”, Ashok et al 2017
“Training Shallow and Thin Networks for Acceleration via Knowledge Distillation With Conditional Adversarial Networks”, Xu et al 2017
“Distral: Robust Multitask Reinforcement Learning”, Teh et al 2017
“Biased Importance Sampling for Deep Neural Network Training”, Katharopoulos & Fleuret 2017
“Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer”, Zagoruyko & Komodakis 2016
“Do Deep Convolutional Nets Really Need to Be Deep and Convolutional?”, Urban et al 2016
“Face Model Compression by Distilling Knowledge from Neurons”, Luo et al 2016
“Policy Distillation”, Rusu et al 2015
“Actor-Mimic: Deep Multitask and Transfer Reinforcement Learning”, Parisotto et al 2015
“Net2Net: Accelerating Learning via Knowledge Transfer”, Chen et al 2015
“Bayesian Dark Knowledge”, Korattikara et al 2015
“Distilling the Knowledge in a Neural Network”, Hinton et al 2015
“FitNets: Hints for Thin Deep Nets”, Romero et al 2014
“Do Deep Nets Really Need to Be Deep?”, Ba & Caruana 2013
“Model Compression”, Bucila 2006
“Learning Complex, Extended Sequences Using the Principle of History Compression”, Schmidhuber 1992
“Dota 2 With Large Scale Deep Reinforcement Learning § Pg11”, Berner et al 2019
“From Vision to Language: Semi-Supervised Learning in Action…at Scale”
Miscellaneous
- /doc/ai/nn/sparsity/knowledge-distillation/2016-urban-figure1-mlpvscnnscaling.png
- https://discuss.luxonis.com/blog/3272-datadreamer-creating-custom-datasets-made-easy
- https://medium.com/neuralmachine/knowledge-distillation-dc241d7c2322
- https://www.theverge.com/2023/3/29/23662621/google-bard-chatgpt-sharegpt-training-denies
Link Bibliography
- https://arxiv.org/abs/2312.06585#deepmind: “Beyond Human Data: Scaling Self-Training for Problem-Solving With Language Models (ReSTEM)”
- https://arxiv.org/abs/2311.13657: “Efficient Transformer Knowledge Distillation: A Performance Review”, Nathan Brown, Ashton Williamson, Tahj Anderson, Logan Lawrence
- https://arxiv.org/abs/2311.00430: “Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling”, Sanchit Gandhi, Patrick von Platen, Alexander M. Rush
- https://arxiv.org/abs/2310.08708: “Polynomial Time Cryptanalytic Extraction of Neural Network Models”, Adi Shamir, Isaac Canales-Martinez, Anna Hambitzer, Jorge Chavez-Saab, Francisco Rodriguez-Henriquez, Nitin Satpute
- https://arxiv.org/abs/2307.06439#microsoft: “Distilling Large Language Models for Biomedical Knowledge Extraction: A Case Study on Adverse Drug Events”
- https://arxiv.org/abs/2305.12972: “VanillaNet: the Power of Minimalism in Deep Learning”, Hanting Chen, Yunhe Wang, Jianyuan Guo, Dacheng Tao
- https://arxiv.org/abs/2305.07759#microsoft: “TinyStories: How Small Can Language Models Be and Still Speak Coherent English?”, Ronen Eldan, Yuanzhi Li
- https://arxiv.org/abs/2305.07804: “Dr. LLaMa: Improving Small Language Models in Domain-Specific QA via Generative Data Augmentation”, Zhen Guo, Peiqi Wang, Yanwei Wang, Shangdi Yu
- https://arxiv.org/abs/2305.02301#google: “Distilling Step-by-Step! Outperforming Larger Language Models With Less Training Data and Smaller Model Sizes”
- https://arxiv.org/abs/2304.13653#deepmind: “Learning Agile Soccer Skills for a Bipedal Robot With Deep Reinforcement Learning”
- https://arxiv.org/abs/2303.01469#openai: “Consistency Models”, Yang Song, Prafulla Dhariwal, Mark Chen, Ilya Sutskever
- https://arxiv.org/abs/2302.12433: “ProofNet: Autoformalizing and Formally Proving Undergraduate-Level Mathematics”, Zhangir Azerbayev, Bartosz Piotrowski, Hailey Schoelkopf, Edward W. Ayers, Dragomir Radev, Jeremy Avigad
- https://arxiv.org/abs/2302.05442#google: “Scaling Vision Transformers to 22 Billion Parameters”
- https://arxiv.org/abs/2302.04907#google: “BMT: Binarized Neural Machine Translation”, Yichi Zhang, Ankush Garg, Yuan Cao, Łukasz Lew, Behrooz Ghorbani, Zhiru Zhang, Orhan Firat
- https://arxiv.org/abs/2301.01296#microsoft: “TinyMIM: An Empirical Study of Distilling MIM Pre-trained Models”, Sucheng Ren, Fangyun Wei, Zheng Zhang, Han Hu
- https://arxiv.org/abs/2212.05055#google: “Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints”
- https://openreview.net/forum?id=wmGlMhaBe0: “MaskDistill: A Unified View of Masked Image Modeling”, Anonymous
- https://arxiv.org/abs/2211.07636#baai: “EVA: Exploring the Limits of Masked Visual Representation Learning at Scale”, Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, Yue Cao
- https://arxiv.org/abs/2211.07638: “Legged Locomotion in Challenging Terrains Using Egocentric Vision”, Ananye Agarwal, Ashish Kumar, Jitendra Malik, Deepak Pathak
- https://arxiv.org/abs/2211.01324#nvidia: “EDiff-I: Text-to-Image Diffusion Models With an Ensemble of Expert Denoisers”
- https://arxiv.org/abs/2210.11610#google: “Large Language Models Can Self-Improve”, Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, Jiawei Han
- https://arxiv.org/abs/2210.03142#google: “On Distillation of Guided Diffusion Models”, Chenlin Meng, Ruiqi Gao, Diederik P. Kingma, Stefano Ermon, Jonathan Ho, Tim Salimans
- https://arxiv.org/abs/2209.07550#deepmind: “Human-level Atari 200× Faster”, Steven Kapturowski, Víctor Campos, Ray Jiang, Nemanja Rakićević, Hado van Hasselt, Charles Blundell, Adrià Puigdomènech Badia
- https://arxiv.org/abs/2207.06300#ibm: “Re2G: Retrieve, Rerank, Generate”, Michael Glass, Gaetano Rossiello, Md Faisal Mahbub Chowdhury, Ankita Rajaram Naik, Pengshan Cai, Alfio Gliozzo
- https://arxiv.org/abs/2206.07808#amazon: “Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter Encoders for Natural Language Understanding Systems”
- https://arxiv.org/abs/2206.01861#microsoft: “ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers”, Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, Yuxiong He
- https://arxiv.org/abs/2205.09073#google: “Dialog Inpainting: Turning Documents into Dialogues”, Zhuyun Dai, Arun Tejasvi Chaganty, Vincent Zhao, Aida Amini, Qazi Mamunur Rashid, Mike Green, Kelvin Guu
- https://arxiv.org/abs/2204.03475#alibaba: “Solving ImageNet: a Unified Scheme for Training Any Backbone to Top Results”, Tal Ridnik, Hussam Lawen, Emanuel Ben-Baruch, Asaf Noy
- https://arxiv.org/abs/2202.12211#google: “Self-Distilled StyleGAN: Towards Generation from Internet Photos”, Ron Mokady, Michal Yarom, Omer Tov, Oran Lang, Daniel Cohen-Or, Tali Dekel, Michal Irani, Inbar Mosseri
- https://arxiv.org/abs/2201.05596#microsoft: “DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale”
- https://arxiv.org/abs/2111.05754: “Prune Once for All: Sparse Pre-Trained Language Models”, Ofir Zafrir, Ariel Larey, Guy Boudoukh, Haihao Shen, Moshe Wasserblat
- https://arxiv.org/abs/2110.14168#openai: “Training Verifiers to Solve Math Word Problems”, Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, John Schulman
- https://arxiv.org/abs/2110.06961: “Language Modelling via Learning to Rank”, Arvid Frydenlund, Gagandeep Singh, Frank Rudzicz
- https://openreview.net/forum?id=G89-1yZLFHk: “OTTER: Data Efficient Language-Supervised Zero-Shot Recognition With Optimal Transport Distillation”, Bichen Wu, Ruizhe Cheng, Peizhao Zhang, Peter Vajda, Joseph E. Gonzalez
- https://arxiv.org/abs/2109.12066: “ZSD-YOLO: Zero-Shot YOLO Detection Using Vision-Language Knowledge Distillation”, Johnathan Xie, Shuai Zheng
- https://arxiv.org/abs/2109.06243#huawei: “KroneckerBERT: Learning Kronecker Decomposition for Pre-trained Language Models via Knowledge Distillation”, Marzieh S. Tahaei, Ella Charlaix, Vahid Partovi Nia, Ali Ghodsi, Mehdi Rezagholizadeh
- https://arxiv.org/abs/2106.05237#google: “Knowledge Distillation: A Good Teacher Is Patient and Consistent”, Lucas Beyer, Xiaohua Zhai, Amélie Royer, Larisa Markeeva, Rohan Anil, Alexander Kolesnikov
- https://arxiv.org/abs/2104.14294#facebook: “DINO: Emerging Properties in Self-Supervised Vision Transformers”, Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, Armand Joulin
- https://arxiv.org/abs/2104.13921#google: “Zero-Shot Detection via Vision and Language Knowledge Distillation”, Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, Yin Cui
- https://arxiv.org/abs/2104.08945#facebook: “Data-Efficient Language-Supervised Zero-Shot Learning With Self-Distillation”, Ruizhe Cheng, Bichen Wu, Peizhao Zhang, Peter Vajda, Joseph E. Gonzalez
- https://syncedreview.com/2021/03/23/chinas-gpt-3-baai-introduces-superscale-intelligence-model-wu-dao-1-0/#baai: “China’s GPT-3? BAAI Introduces Superscale Intelligence Model ‘Wu Dao 1.0’: The Beijing Academy of Artificial Intelligence (BAAI) Releases Wu Dao 1.0, China’s First Large-scale Pretraining Model.”, Synced
- https://arxiv.org/abs/2012.12877#facebook: “Training Data-efficient Image Transformers & Distillation through Attention”, Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou
- https://arxiv.org/abs/2011.12692#tencent: “Towards Playing Full MOBA Games With Deep Reinforcement Learning”
- https://arxiv.org/abs/2002.10957#microsoft: “MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers”, Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, Ming Zhou
- https://blog.research.google/2020/01/towards-conversational-agent-that-can.html: “Towards a Conversational Agent That Can Chat About…Anything”, Daniel Adiwardana, Thang Luong
- https://arxiv.org/abs/1911.04252#google: “Self-training With Noisy Student Improves ImageNet Classification”, Qizhe Xie, Minh-Thang Luong, Eduard Hovy, Quoc V. Le
- https://arxiv.org/abs/1909.10351: “TinyBERT: Distilling BERT for Natural Language Understanding”, Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, Qun Liu
- https://david-abel.github.io/notes/icml_2019.pdf: “ICML 2019 Notes”, David Abel
- https://arxiv.org/abs/1902.02186#deepmind: “Distilling Policy Distillation”, Wojciech Marian Czarnecki, Razvan Pascanu, Simon Osindero, Siddhant M. Jayakumar, Grzegorz Swirszcz, Max Jaderberg
- 2016-luo.pdf: “Face Model Compression by Distilling Knowledge from Neurons”, Ping Luo, Zhenyao Zhu, Ziwei Liu, Xiaogang Wang, Xiaoou Tang