- See Also
Links
- “GLaMM: Pixel Grounding Large Multimodal Model”, Rasheed et al 2023
- “CogVLM: Visual Expert for Pretrained Language Models”, Wang et al 2023
- “ProSG: Using Prompt Synthetic Gradients to Alleviate Prompt Forgetting of RNN-like Language Models”, Luo et al 2023
- “Don’t Make Your LLM an Evaluation Benchmark Cheater”, Zhou et al 2023
- “Will Releasing the Weights of Large Language Models Grant Widespread Access to Pandemic Agents?”, Gopal et al 2023
- “Sparse Universal Transformer”, Tan et al 2023
- “Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning”, Xia et al 2023
- “Language Models Represent Space and Time”, Gurnee & Tegmark 2023
- “Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions”, Chebotar et al 2023
- “Predicting Brain Activity Using Transformers”, Adeli et al 2023
- “Copy Is All You Need”, Lan et al 2023
- “Whisper-AT: Noise-Robust Automatic Speech Recognizers Are Also Strong General Audio Event Taggers”, Gong et al 2023
- “HEADLINES: A Massive Scale Semantic Similarity Dataset of Historical English”, Silcock & Dell 2023
- “OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents”, Laurençon et al 2023
- “Expanding the Methodological Toolbox: Machine-based Item Desirability Ratings As an Alternative to Human-based Ratings”, Hommel 2023
- “RGD: Stochastic Re-weighted Gradient Descent via Distributionally Robust Optimization”, Kumar et al 2023
- “SequenceMatch: Imitation Learning for Autoregressive Sequence Modelling With Backtracking”, Cundy & Ermon 2023
- “Binary and Ternary Natural Language Generation”, Liu et al 2023
- “The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora With Web Data, and Web Data Only”, Penedo et al 2023
- “AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration”, Lin et al 2023
- “Deep Learning Based Forecasting: a Case Study from the Online Fashion Industry”, Kunz et al 2023
- “Scaling Laws for Language Encoding Models in fMRI”, Antonello et al 2023
- “DarkBERT: A Language Model for the Dark Side of the Internet”, Jin et al 2023
- “Mitigating Lies in Vision-Language Models”, Li et al 2023
- “VendorLink: An NLP Approach for Identifying & Linking Vendor Migrants & Potential Aliases on Darknet Markets”, Saxena et al 2023
- “Visual Instruction Tuning”, Liu et al 2023
- “ImageNet-Hard: The Hardest Images Remaining from a Study of the Power of Zoom and Spatial Biases in Image Classification”, Taesiri et al 2023
- “Segment Anything”, Kirillov et al 2023
- “When and How Artificial Intelligence Augments Employee Creativity”, Jia et al 2023
- “Mitigating YouTube Recommendation Polarity Using BERT and K-Means Clustering”, Ahmad et al 2023
- “The Man of Your Dreams: For $300, Replika Sells an AI Companion Who Will Never Die, Argue, or Cheat—until His Algorithm Is Updated”, Singh-Kurtz 2023
- “Tag2Text: Guiding Vision-Language Model via Image Tagging”, Huang et al 2023
- “Towards Democratizing Joint-Embedding Self-Supervised Learning”, Bordes et al 2023
- “MUX-PLMs: Pre-training Language Models With Data Multiplexing”, Murahari et al 2023
- “Optical Transformers”, Anderson et al 2023
- “Scaling Vision Transformers to 22 Billion Parameters”, Dehghani et al 2023
- “BMT: Binarized Neural Machine Translation”, Zhang et al 2023
- “V1T: Large-scale Mouse V1 Response Prediction Using a Vision Transformer”, Li et al 2023
- “The BabyLM Challenge: Sample-efficient Pretraining on a Developmentally Plausible Corpus”, Warstadt et al 2023
- “XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models”, Liang et al 2023
- “ClimaX: A Foundation Model for Weather and Climate”, Nguyen et al 2023
- “DataMUX: Data Multiplexing for Neural Networks”, Murahari et al 2023
- “Tracr: Compiled Transformers As a Laboratory for Interpretability”, Lindner et al 2023
- “Vision Transformers Are Good Mask Auto-Labelers”, Lan et al 2023
- “Scaling Laws for Generative Mixed-Modal Language Models”, Aghajanyan et al 2023
- “Why Do Nearest Neighbor Language Models Work?”, Xu et al 2023
- “Less Is More: Parameter-Free Text Classification With Gzip”, Jiang et al 2022
- “POM: A Principal Odor Map Unifies Diverse Tasks in Human Olfactory Perception”, Lee et al 2022
- “What Do Vision Transformers Learn? A Visual Exploration”, Ghiasi et al 2022
- “MAGVIT: Masked Generative Video Transformer”, Yu et al 2022
- “VindLU: A Recipe for Effective Video-and-Language Pretraining”, Cheng et al 2022
- “Discovering Latent Knowledge in Language Models Without Supervision”, Burns et al 2022
- “Text Embeddings by Weakly-Supervised Contrastive Pre-training”, Wang et al 2022
- “Robust Speech Recognition via Large-Scale Weak Supervision”, Radford et al 2022
- “RGB No More: Minimally-decoded JPEG Vision Transformers”, Park & Johnson 2022
- “BARTSmiles: Generative Masked Language Models for Molecular Representations”, Chilingaryan et al 2022
- “What Learning Algorithm Is In-context Learning? Investigations With Linear Models”, Akyürek et al 2022
- “A Deep Learning and Digital Archaeology Approach for Mosquito Repellent Discovery”, Wei et al 2022
- “GENIUS: Sketch-based Language Model Pre-training via Extreme and Selective Masking for Text Generation and Augmentation”, Guo et al 2022
- “Distilled DeepConsensus: Knowledge Distillation for Fast and Accurate DNA Sequence Correction”, Belyaeva et al 2022
- “Uni-Perceiver V2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks”, Li et al 2022
- “UniSumm: Unified Few-shot Summarization With Multi-Task Pre-Training and Prefix-Tuning”, Chen et al 2022
- “OneFormer: One Transformer to Rule Universal Image Segmentation”, Jain et al 2022
- “Massively Multilingual ASR on 70 Languages: Tokenization, Architecture, and Generalization Capabilities”, Tjandra et al 2022
- “Characterizing Intrinsic Compositionality in Transformers With Tree Projections”, Murty et al 2022
- “Fast DistilBERT on CPUs”, Shen et al 2022
- “n-gram Is Back: Residual Learning of Neural Text Generation With n-gram Language Model”, Li et al 2022
- “Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models”, Liu et al 2022
- “Noise-Robust De-Duplication at Scale”, Silcock et al 2022
- “Small Character Models Match Large Word Models for Autocomplete Under Memory Constraints”, Jawahar et al 2022
- “Improving Sample Quality of Diffusion Models Using Self-Attention Guidance”, Hong et al 2022
- “Semantic Scene Descriptions As an Objective of Human Vision”, Doerig et al 2022
- “A Generalist Neural Algorithmic Learner”, Ibarz et al 2022
- “SetFit: Efficient Few-Shot Learning Without Prompts”, Tunstall et al 2022
- “Machine Reading, Fast and Slow: When Do Models "Understand" Language?”, Choudhury et al 2022
- “On the Effectiveness of Compact Biomedical Transformers (✱BioBERT)”, Rohanian et al 2022
- “ASR2K: Speech Recognition for Around 2000 Languages without Audio”, Li et al 2022
- “Analyzing Transformers in Embedding Space”, Dar et al 2022
- “MeloForm: Generating Melody With Musical Form Based on Expert Systems and Neural Networks”, Lu et al 2022
- “CorpusBrain: Pre-train a Generative Retrieval Model for Knowledge-Intensive Language Tasks”, Chen et al 2022
- “PatchDropout: Economizing Vision Transformers Using Patch Dropout”, Liu et al 2022
- “Why Do Tree-based Models Still Outperform Deep Learning on Tabular Data?”, Grinsztajn et al 2022
- “Re2G: Retrieve, Rerank, Generate”, Glass et al 2022
- “Transformer Neural Processes: Uncertainty-Aware Meta Learning Via Sequence Modeling”, Nguyen & Grover 2022
- “Neural Networks and the Chomsky Hierarchy”, Delétang et al 2022
- “TabPFN: Meta-Learning a Real-Time Tabular AutoML Method For Small Data”, Hollmann et al 2022
- “Do Loyal Users Enjoy Better Recommendations? Understanding Recommender Accuracy from a Time Perspective”, Ji et al 2022
- “Transfer Learning With Deep Tabular Models”, Levin et al 2022
- “BertNet: Harvesting Knowledge Graphs from Pretrained Language Models”, Hao et al 2022
- “ProGen2: Exploring the Boundaries of Protein Language Models”, Nijkamp et al 2022
- “LAVENDER: Unifying Video-Language Understanding As Masked Language Modeling”, Li et al 2022
- “RHO-LOSS: Prioritized Training on Points That Are Learnable, Worth Learning, and Not Yet Learnt”, Mindermann et al 2022
- “SBERT Studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features”, Opitz & Frank 2022
- “Language Models Are General-Purpose Interfaces”, Hao et al 2022
- “Reconstructing the Cascade of Language Processing in the Brain Using the Internal Computations of a Transformer-based Language Model”, Kumar et al 2022
- “Uni-Perceiver-MoE: Learning Sparse Generalist Models With Conditional MoEs”, Zhu et al 2022
- “A Neural Corpus Indexer for Document Retrieval”, Wang et al 2022
- “XTC: Extreme Compression for Pre-trained Transformers Made Simple and Efficient”, Wu et al 2022
- “Toward a Realistic Model of Speech Processing in the Brain With Self-supervised Learning”, Millet et al 2022
- “Text2Human: Text-Driven Controllable Human Image Generation”, Jiang et al 2022
- “Anime Character Recognition Using Intermediate Features Aggregation”, Rios et al 2022
- “Towards Learning Universal Hyperparameter Optimizers With Transformers”, Chen et al 2022
- “On the Paradox of Learning to Reason from Data”, Zhang et al 2022
- “HTPS: HyperTree Proof Search for Neural Theorem Proving”, Lample et al 2022
- “Housekeep: Tidying Virtual Households Using Commonsense Reasoning”, Kant et al 2022
- “Tradformer: A Transformer Model of Traditional Music Transcriptions”, Casini & Sturm 2022
- “UViM: A Unified Modeling Approach for Vision With Learned Guiding Codes”, Kolesnikov et al 2022
- “PLAID: An Efficient Engine for Late Interaction Retrieval”, Santhanam et al 2022
- “Continual Pre-Training Mitigates Forgetting in Language and Vision”, Cossu et al 2022
- “Few-Shot Parameter-Efficient Fine-Tuning Is Better and Cheaper Than In-Context Learning”, Liu et al 2022
- “SymphonyNet: Symphony Generation With Permutation Invariant Language Model”, Liu et al 2022
- “When Does Dough Become a Bagel? Analyzing the Remaining Mistakes on ImageNet”, Vasudevan et al 2022
- “A Challenging Benchmark of Anime Style Recognition”, Li et al 2022
- “Data Distributional Properties Drive Emergent Few-Shot Learning in Transformers”, Chan et al 2022
- “Masked Siamese Networks for Label-Efficient Learning”, Assran et al 2022
- “DualPrompt: Complementary Prompting for Rehearsal-free Continual Learning”, Wang et al 2022
- “Language Models That Seek for Knowledge: Modular Search & Generation for Dialogue and Prompt Completion”, Shuster et al 2022
- “On Embeddings for Numerical Features in Tabular Deep Learning”, Gorishniy et al 2022
- “In-context Learning and Induction Heads”, Olsson et al 2022
- “LiteTransformerSearch: Training-free Neural Architecture Search for Efficient Language Models”, Javaheripi et al 2022
- “Pretraining without Wordpieces: Learning Over a Vocabulary of Millions of Words”, Feng et al 2022
- “TACTiS: Transformer-Attentional Copulas for Time Series”, Drouin et al 2022
- “OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework”, Wang et al 2022
- “AutoDistil: Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models”, Xu et al 2022
- “FIGARO: Generating Symbolic Music With Fine-Grained Artistic Control”, Rütte et al 2022
- “Robust Contrastive Learning against Noisy Views”, Chuang et al 2022
- “HyperTransformer: Model Generation for Supervised and Semi-Supervised Few-Shot Learning”, Zhmoginov et al 2022
- “A Mathematical Framework for Transformer Circuits”, Elhage et al 2021
- “XGLM: Few-shot Learning With Multilingual Language Models”, Lin et al 2021
- “PFNs: Transformers Can Do Bayesian Inference”, Müller et al 2021
- “AI Improvements in Chemical Calculations”, Lowe 2021
- “An Empirical Investigation of the Role of Pre-training in Lifelong Learning”, Mehta et al 2021
- “You Only Need One Model for Open-domain Question Answering”, Lee et al 2021
- “Human Parity on CommonsenseQA: Augmenting Self-Attention With External Attention”, Xu et al 2021
- “Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks”, Zhu et al 2021
- “Inducing Causal Structure for Interpretable Neural Networks (IIT)”, Geiger et al 2021
- “OCR-free Document Understanding Transformer”, Kim et al 2021
- “FQ-ViT: Fully Quantized Vision Transformer without Retraining”, Lin et al 2021
- “Semi-Supervised Music Tagging Transformer”, Won et al 2021
- “LEMON: Scaling Up Vision-Language Pre-training for Image Captioning”, Hu et al 2021
- “UNICORN: Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling”, Yang et al 2021
- “Compositional Transformers for Scene Generation”, Hudson & Zitnick 2021
- “A Survey of Visual Transformers”, Liu et al 2021
- “Improving Visual Quality of Image Synthesis by A Token-based Generator With Transformers”, Zeng et al 2021
- “STransGAN: An Empirical Study on Transformer in GANs”, Xu et al 2021
- “The Efficiency Misnomer”, Dehghani et al 2021
- “Lifelong Pretraining: Continually Adapting Language Models to Emerging Corpora”, Jin et al 2021
- “The Dangers of Underclaiming: Reasons for Caution When Reporting How NLP Systems Fail”, Bowman 2021
- “Palette: Image-to-Image Diffusion Models”, Saharia et al 2021
- “Autoregressive Latent Video Prediction With High-Fidelity Image Generator”, Seo et al 2021
- “Transformers Are Meta-Reinforcement Learners”, Anonymous 2021
- “Skill Induction and Planning With Latent Language”, Sharma et al 2021
- “Text2Brain: Synthesis of Brain Activation Maps from Free-form Text Query”, Ngo et al 2021
- “BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition”, Zhang et al 2021
- “Understanding and Overcoming the Challenges of Efficient Transformer Quantization”, Bondarenko et al 2021
- “TrOCR: Transformer-based Optical Character Recognition With Pre-trained Models”, Li et al 2021
- “KroneckerBERT: Learning Kronecker Decomposition for Pre-trained Language Models via Knowledge Distillation”, Tahaei et al 2021
- “Block Pruning For Faster Transformers”, Lagunas et al 2021
- “The Sensory Neuron As a Transformer: Permutation-Invariant Neural Networks for Reinforcement Learning”, Tang & Ha 2021
- “DeepConsensus: Gap-Aware Sequence Transformers for Sequence Correction”, Baid et al 2021
- “ImageBART: Bidirectional Context With Multinomial Diffusion for Autoregressive Image Synthesis”, Esser et al 2021
- “Modeling Protein Using Large-scale Pretrain Language Model”, Xiao et al 2021
- “Billion-Scale Pretraining With Vision Transformers for Multi-Task Visual Representations”, Beal et al 2021
- “EVA: An Open-Domain Chinese Dialogue System With Large-Scale Generative Pre-Training”, Zhou et al 2021
- “Internet-Augmented Dialogue Generation”, Komeili et al 2021
- “ViTGAN: Training GANs With Vision Transformers”, Lee et al 2021
- “ARM-Net: Adaptive Relation Modeling Network for Structured Data”, Cai et al 2021
- “SCARF: Self-Supervised Contrastive Learning Using Random Feature Corruption”, Bahri et al 2021
- “Charformer: Fast Character Transformers via Gradient-based Subword Tokenization”, Tay et al 2021
- “BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models”, Zaken et al 2021
- “Revisiting the Calibration of Modern Neural Networks”, Minderer et al 2021
- “Scaling Laws for Acoustic Models”, Droppo & Elibol 2021
- “CoAtNet: Marrying Convolution and Attention for All Data Sizes”, Dai et al 2021
- “Chasing Sparsity in Vision Transformers: An End-to-End Exploration”, Chen et al 2021
- “Tabular Data: Deep Learning Is Not All You Need”, Shwartz-Ziv & Armon 2021
- “Self-Attention Between Datapoints: Going Beyond Individual Input-Output Pairs in Deep Learning”, Kossen et al 2021
- “SegFormer: Simple and Efficient Design for Semantic Segmentation With Transformers”, Xie et al 2021
- “Exploring Transfer Learning Techniques for Named Entity Recognition in Noisy User-Generated Text”, Bogensperger 2021
- “Maximizing 3-D Parallelism in Distributed Training for Huge Neural Networks”, Bian et al 2021
- “One4all User Representation for Recommender Systems in E-commerce”, Shin et al 2021
- “MathBERT: A Pre-Trained Model for Mathematical Formula Understanding”, Peng et al 2021
- “MDETR—Modulated Detection for End-to-End Multi-Modal Understanding”, Kamath et al 2021
- “XLM-T: Multilingual Language Models in Twitter for Sentiment Analysis and Beyond”, Barbieri et al 2021
- “SimCSE: Simple Contrastive Learning of Sentence Embeddings”, Gao et al 2021
- “Memorization versus Generalisation in Pre-trained Language Models”, Tänzer et al 2021
- “Robust Open-Vocabulary Translation from Visual Text Representations”, Salesky et al 2021
- “Gradient-based Adversarial Attacks against Text Transformers”, Guo et al 2021
- “Retrieval Augmentation Reduces Hallucination in Conversation”, Shuster et al 2021
- “Machine Translation Decoding beyond Beam Search”, Leblond et al 2021
- “ChinAI #137: Year 3 of ChinAI: Reflections on the Newsworthiness of Machine Translation”, Ding 2021
- “SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network”, Chan et al 2021
- “GPV-1: Towards General Purpose Vision Systems”, Gupta et al 2021
- “DeepViT: Towards Deeper Vision Transformer”, Zhou et al 2021
- “ConViT: Improving Vision Transformers With Soft Convolutional Inductive Biases”, d’Ascoli et al 2021
- “Get Your Vitamin C! Robust Fact Verification With Contrastive Evidence (VitaminC)”, Schuster et al 2021
- “Are NLP Models Really Able to Solve Simple Math Word Problems?”, Patel et al 2021
- “Learning from Videos to Understand the World”, Zweig et al 2021
- “CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation”, Clark et al 2021
- “TransGAN: Two Transformers Can Make One Strong GAN”, Jiang et al 2021
- “ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision”, Kim et al 2021
- “Baller2vec: A Multi-Entity Transformer For Multi-Agent Spatiotemporal Modeling”, Alcorn & Nguyen 2021
- “Video Transformer Network”, Neimark et al 2021
- “BENDR: Using Transformers and a Contrastive Self-supervised Learning Task to Learn from Massive Amounts of EEG Data”, Kostas et al 2021
- “Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet”, Yuan et al 2021
- “Bottleneck Transformers for Visual Recognition”, Srinivas et al 2021
- “DAF:re: A Challenging, Crowd-Sourced, Large-Scale, Long-Tailed Dataset For Anime Character Recognition”, Rios et al 2021
- “UPDeT: Universal Multi-agent Reinforcement Learning via Policy Decoupling With Transformers”, Hu et al 2021
- “MSR-VTT: A Large Video Description Dataset for Bridging Video and Language”, Xu et al 2021
- “XMC-GAN: Cross-Modal Contrastive Learning for Text-to-Image Generation”, Zhang et al 2021
- “Transformer Feed-Forward Layers Are Key-Value Memories”, Geva et al 2020
- “Training Data-efficient Image Transformers & Distillation through Attention”, Touvron et al 2020
- “VQ-GAN: Taming Transformers for High-Resolution Image Synthesis”, Esser et al 2020
- “Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences”, Rives et al 2020
- “Object-based Attention for Spatio-temporal Reasoning: Outperforming Neuro-symbolic Models With Flexible Distributed Architectures”, Ding et al 2020
- “TStarBot-X: An Open-Sourced and Comprehensive Study for Efficient League Training in StarCraft II Full Game”, Han et al 2020
- “A Recurrent Vision-and-Language BERT for Navigation”, Hong et al 2020
- “A Primer in BERTology: What We Know about How BERT Works”, Rogers et al 2020
- “CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters”, Boukkouri et al 2020
- “TernaryBERT: Distillation-aware Ultra-low Bit BERT”, Zhang et al 2020
- “Weird AI Yankovic: Generating Parody Lyrics”, Riedl 2020
- “It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners”, Schick & Schütze 2020
- “DeepSpeed: Extreme-scale Model Training for Everyone”, Team et al 2020
- “Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing”, Gu et al 2020
- “Hopfield Networks Is All You Need”, Ramsauer et al 2020
- “Modern Hopfield Networks and Attention for Immune Repertoire Classification”, Widrich et al 2020
- “DeepSinger: Singing Voice Synthesis With Data Mined From the Web”, Ren et al 2020
- “Data Movement Is All You Need: A Case Study on Optimizing Transformers”, Ivanov et al 2020
- “Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations”, Baevski et al 2020
- “Learning to Learn With Feedback and Local Plasticity”, Lindsey & Litwin-Kumar 2020
- “PipeDream-2BW: Memory-Efficient Pipeline-Parallel DNN Training”, Narayanan et al 2020
- “Improving GAN Training With Probability Ratio Clipping and Sample Reweighting”, Wu et al 2020
- “DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations”, Giorgi et al 2020
- “DeBERTa: Decoding-enhanced BERT With Disentangled Attention”, He et al 2020
- “DETR: End-to-End Object Detection With Transformers”, Carion et al 2020
- “TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data”, Yin et al 2020
- “ForecastQA: A Question Answering Challenge for Event Forecasting With Temporal Text Data”, Jin et al 2020
- “VLN-BERT: Improving Vision-and-Language Navigation With Image-Text Pairs from the Web”, Majumdar et al 2020
- “General Purpose Text Embeddings from Pre-trained Language Models for Scalable Inference”, Du et al 2020
- “Blender: A State-of-the-art Open Source Chatbot”, Roller et al 2020
- “Recipes for Building an Open-domain Chatbot”, Roller et al 2020
- “Rapformer: Conditional Rap Lyrics Generation With Denoising Autoencoders”, Nikolov et al 2020
- “On the Effect of Dropping Layers of Pre-trained Transformer Models”, Sajjad et al 2020
- “TAPAS: Weakly Supervised Table Parsing via Pre-training”, Herzig et al 2020
- “A Hundred Visions and Revisions”, Binder 2020
- “Rethinking Parameter Counting in Deep Models: Effective Dimensionality Revisited”, Maddox et al 2020
- “AraBERT: Transformer-based Model for Arabic Language Understanding”, Antoun et al 2020
- “MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers”, Wang et al 2020
- “GNS: Learning to Simulate Complex Physics With Graph Networks”, Sanchez-Gonzalez et al 2020
- “Bayesian Deep Learning and a Probabilistic Perspective of Generalization”, Wilson & Izmailov 2020
- “Do We Need Zero Training Loss After Achieving Zero Training Error?”, Ishida et al 2020
- “Transformers As Soft Reasoners over Language”, Clark et al 2020
- “Towards a Conversational Agent That Can Chat About…Anything”, Adiwardana & Luong 2020
- “Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference”, Schick & Schütze 2020
- “VIME: Extending the Success of Self-supervised and Semi-supervised Learning to Tabular Domain”, Yoon et al 2020
- “Mastering Complex Control in MOBA Games With Deep Reinforcement Learning”, Ye et al 2019
- “Measuring Compositional Generalization: A Comprehensive Method on Realistic Data”, Keysers et al 2019
- “PEGASUS: Pre-training With Extracted Gap-sentences for Abstractive Summarization”, Zhang et al 2019
- “Encoding Musical Style With Transformer Autoencoders”, Choi et al 2019
- “Deep Double Descent”, Nakkiran et al 2019 - “We show that the double descent phenomenon occurs in CNNs, ResNets, and transformers: performance first improves, then gets worse, and then improves again with increasing model size, data size, or training time. This effect is often avoided through careful regularization. While this behavior appears to be fairly universal, we don’t yet fully understand why it happens, and view further study of this phenomenon as an important research direction.”
- “Detecting GAN Generated Errors”, Zhu et al 2019
- “SimpleBooks: Long-term Dependency Book Dataset With Simplified English Vocabulary for Word-level Language Modeling”, Nguyen 2019
- “Unsupervised Cross-lingual Representation Learning at Scale”, Conneau et al 2019
- “DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter”, Sanh et al 2019
- “Multiplicative Interactions and Where to Find Them”, Jayakumar et al 2019
- “TinyBERT: Distilling BERT for Natural Language Understanding”, Jiao et al 2019
- “Do NLP Models Know Numbers? Probing Numeracy in Embeddings”, Wallace et al 2019
- “PubMedQA: A Dataset for Biomedical Research Question Answering”, Jin et al 2019
- “Frustratingly Easy Natural Question Answering”, Pan et al 2019
- “Distributionally Robust Language Modeling”, Oren et al 2019
- “The Bottom-up Evolution of Representations in the Transformer: A Study With Machine Translation and Language Modeling Objectives”, Voita et al 2019
- “Encode, Tag, Realize: High-Precision Text Editing”, Malmi et al 2019
- “Language Models As Knowledge Bases?”, Petroni et al 2019
- “Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks”, Reimers & Gurevych 2019
- “TabNet: Attentive Interpretable Tabular Learning”, Arik & Pfister 2019
- “StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding”, Wang et al 2019
- “RoBERTa: A Robustly Optimized BERT Pretraining Approach”, Liu et al 2019
- “Theoretical Limitations of Self-Attention in Neural Sequence Models”, Hahn 2019
- “Energy and Policy Considerations for Deep Learning in NLP”, Strubell et al 2019
- “Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned”, Voita et al 2019
- “HellaSwag: Can a Machine Really Finish Your Sentence?”, Zellers et al 2019
- “UniLM: Unified Language Model Pre-training for Natural Language Understanding and Generation”, Dong et al 2019
- “MASS: Masked Sequence to Sequence Pre-training for Language Generation”, Song et al 2019
- “Mask-Predict: Parallel Decoding of Conditional Masked Language Models”, Ghazvininejad et al 2019
- “Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes”, You et al 2019
- “LIGHT: Learning to Speak and Act in a Fantasy Text Adventure Game”, Urbanek et al 2019
- “Insertion Transformer: Flexible Sequence Generation via Insertion Operations”, Stern et al 2019
- “Adapter: Parameter-Efficient Transfer Learning for NLP”, Houlsby et al 2019
- “BioBERT: a Pre-trained Biomedical Language Representation Model for Biomedical Text Mining”, Lee et al 2019
- “Bayesian Layers: A Module for Neural Network Uncertainty”, Tran et al 2018
- “Blockwise Parallel Decoding for Deep Autoregressive Models”, Stern et al 2018
- “Object Hallucination in Image Captioning”, Rohrbach et al 2018
- “Self-Attention Generative Adversarial Networks”, Zhang et al 2018
- “Universal Sentence Encoder”, Cer et al 2018
- “Self-Attention With Relative Position Representations”, Shaw et al 2018
- “Learning Longer-term Dependencies in RNNs With Auxiliary Losses”, Trinh et al 2018
- “Generating Structured Music through Self-Attention”, Huang et al 2018
- “A Simple Neural Attentive Meta-Learner”, Mishra et al 2017
- “Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer”, Zagoruyko & Komodakis 2016
- “QRNNs: Quasi-Recurrent Neural Networks”, Bradbury et al 2016
- “Gaussian Error Linear Units (GELUs)”, Hendrycks & Gimpel 2016
- “Pointer Networks”, Vinyals et al 2015
- “Huggingface: transformers Repo”, Huggingface 2023 - “Transformers are a very exciting family of machine learning architectures. Many good tutorials exist (e.g. [1, 2]) but in the last few years, transformers have mostly become simpler, so that it is now much more straightforward to explain how modern architectures work. This post is an attempt to explain directly [in PyTorch] how modern transformers work, and why, without some of the historical baggage.” (A minimal PyTorch self-attention sketch follows at the end of this list.)
- Sort By Magic
- Wikipedia
- Miscellaneous
- Link Bibliography
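
The Huggingface entry above quotes the claim that modern transformers have become simple enough to explain directly in PyTorch. As a rough illustration of that point only (not taken from any of the linked papers; the module names and layer sizes below are arbitrary choices for the sketch), a single-head self-attention block with a residual connection fits in a few lines:

```python
# Minimal sketch of a single-head self-attention block (pre-LayerNorm style).
# Assumes PyTorch; dimensions and names are illustrative, not from any cited paper.
import torch
import torch.nn as nn

class SelfAttentionBlock(nn.Module):
    def __init__(self, d_model: int = 64):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.qkv = nn.Linear(d_model, 3 * d_model)   # joint query/key/value projection
        self.proj = nn.Linear(d_model, d_model)      # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, sequence length, d_model)
        h = self.norm(x)
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)  # scaled dot-product
        attn = scores.softmax(dim=-1)                # attention weights over the sequence
        return x + self.proj(attn @ v)               # residual connection

x = torch.randn(2, 10, 64)            # toy batch: 2 sequences of 10 tokens
print(SelfAttentionBlock()(x).shape)  # torch.Size([2, 10, 64])
```

A full transformer block would add a feed-forward sublayer, multiple heads, and positional information, but the attention core is no more than the scaled dot-product shown here.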
See Also
Links
“GLaMM: Pixel Grounding Large Multimodal Model”, Rasheed et al 2023
“CogVLM: Visual Expert for Pretrained Language Models”, Wang et al 2023
“ProSG: Using Prompt Synthetic Gradients to Alleviate Prompt Forgetting of RNN-like Language Models”, Luo et al 2023
“ProSG: Using Prompt Synthetic Gradients to Alleviate Prompt Forgetting of RNN-like Language Models”
“Don’t Make Your LLM an Evaluation Benchmark Cheater”, Zhou et al 2023
“Will Releasing the Weights of Large Language Models Grant Widespread Access to Pandemic Agents?”, Gopal et al 2023
“Will releasing the weights of large language models grant widespread access to pandemic agents?”
“Sparse Universal Transformer”, Tan et al 2023
“Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning”, Xia et al 2023
“Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning”
“Language Models Represent Space and Time”, Gurnee & Tegmark 2023
“Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions”, Chebotar et al 2023
“Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions”
“Predicting Brain Activity Using Transformers”, Adeli et al 2023
“Copy Is All You Need”, Lan et al 2023
“Whisper-AT: Noise-Robust Automatic Speech Recognizers Are Also Strong General Audio Event Taggers”, Gong et al 2023
“Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers”
“HEADLINES: A Massive Scale Semantic Similarity Dataset of Historical English”, Silcock & Dell 2023
“HEADLINES: A Massive Scale Semantic Similarity Dataset of Historical English”
“OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents”, Laurençon et al 2023
“OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents”
“Expanding the Methodological Toolbox: Machine-based Item Desirability Ratings As an Alternative to Human-based Ratings”, Hommel 2023
“RGD: Stochastic Re-weighted Gradient Descent via Distributionally Robust Optimization”, Kumar et al 2023
“RGD: Stochastic Re-weighted Gradient Descent via Distributionally Robust Optimization”
“SequenceMatch: Imitation Learning for Autoregressive Sequence Modelling With Backtracking”, Cundy & Ermon 2023
“SequenceMatch: Imitation Learning for Autoregressive Sequence Modelling with Backtracking”
“Binary and Ternary Natural Language Generation”, Liu et al 2023
“The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora With Web Data, and Web Data Only”, Penedo et al 2023
“AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration”, Lin et al 2023
“AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration”
“Deep Learning Based Forecasting: a Case Study from the Online Fashion Industry”, Kunz et al 2023
“Deep Learning based Forecasting: a case study from the online fashion industry”
“Scaling Laws for Language Encoding Models in FMRI”, Antonello et al 2023
“DarkBERT: A Language Model for the Dark Side of the Internet”, Jin et al 2023
“DarkBERT: A Language Model for the Dark Side of the Internet”
“Mitigating Lies in Vision-Language Models”, Li et al 2023
“VendorLink: An NLP Approach for Identifying & Linking Vendor Migrants & Potential Aliases on Darknet Markets”, Saxena et al 2023
“Visual Instruction Tuning”, Liu et al 2023
“ImageNet-Hard: The Hardest Images Remaining from a Study of the Power of Zoom and Spatial Biases in Image Classification”, Taesiri et al 2023
“Segment Anything”, Kirillov et al 2023
“When and How Artificial Intelligence Augments Employee Creativity”, Jia et al 2023
“When and How Artificial Intelligence Augments Employee Creativity”
“Mitigating YouTube Recommendation Polarity Using BERT and K-Means Clustering”, Ahmad et al 2023
“Mitigating YouTube Recommendation Polarity using BERT and K-Means Clustering”
“The Man of Your Dreams For $300, Replika Sells an AI Companion Who Will Never Die, Argue, or Cheat—until His Algorithm Is Updated”, Singh-Kurtz 2023
“Tag2Text: Guiding Vision-Language Model via Image Tagging”, Huang et al 2023
“Towards Democratizing Joint-Embedding Self-Supervised Learning”, Bordes et al 2023
“Towards Democratizing Joint-Embedding Self-Supervised Learning”
“MUX-PLMs: Pre-training Language Models With Data Multiplexing”, Murahari et al 2023
“MUX-PLMs: Pre-training Language Models with Data Multiplexing”
“Optical Transformers”, Anderson et al 2023
“Scaling Vision Transformers to 22 Billion Parameters”, Dehghani et al 2023
“BMT: Binarized Neural Machine Translation”, Zhang et al 2023
“V1T: Large-scale Mouse V1 Response Prediction Using a Vision Transformer”, Li et al 2023
“V1T: large-scale mouse V1 response prediction using a Vision Transformer”
“The BabyLM Challenge: Sample-efficient Pretraining on a Developmentally Plausible Corpus”, Warstadt et al 2023
“The BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus”
“XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models”, Liang et al 2023
“XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models”
“ClimaX: A Foundation Model for Weather and Climate”, Nguyen et al 2023
“DataMUX: Data Multiplexing for Neural Networks”, Murahari et al 2023
“Tracr: Compiled Transformers As a Laboratory for Interpretability”, Lindner et al 2023
“Tracr: Compiled Transformers as a Laboratory for Interpretability”
“Vision Transformers Are Good Mask Auto-Labelers”, Lan et al 2023
“Scaling Laws for Generative Mixed-Modal Language Models”, Aghajanyan et al 2023
“Why Do Nearest Neighbor Language Models Work?”, Xu et al 2023
“Less Is More: Parameter-Free Text Classification With Gzip”, Jiang et al 2022
“Less is More: Parameter-Free Text Classification with Gzip”
“POM: A Principal Odor Map Unifies Diverse Tasks in Human Olfactory Perception”, Lee et al 2022
“POM: A Principal Odor Map Unifies Diverse Tasks in Human Olfactory Perception”
“What Do Vision Transformers Learn? A Visual Exploration”, Ghiasi et al 2022
“MAGVIT: Masked Generative Video Transformer”, Yu et al 2022
“VindLU: A Recipe for Effective Video-and-Language Pretraining”, Cheng et al 2022
“VindLU: A Recipe for Effective Video-and-Language Pretraining”
“Discovering Latent Knowledge in Language Models Without Supervision”, Burns et al 2022
“Discovering Latent Knowledge in Language Models Without Supervision”
“Text Embeddings by Weakly-Supervised Contrastive Pre-training”, Wang et al 2022
“Text Embeddings by Weakly-Supervised Contrastive Pre-training”
“Robust Speech Recognition via Large-Scale Weak Supervision”, Radford et al 2022
“Robust Speech Recognition via Large-Scale Weak Supervision”
“RGB No More: Minimally-decoded JPEG Vision Transformers”, Park & Johnson 2022
“BARTSmiles: Generative Masked Language Models for Molecular Representations”, Chilingaryan et al 2022
“BARTSmiles: Generative Masked Language Models for Molecular Representations”
“What Learning Algorithm Is In-context Learning? Investigations With Linear Models”, Akyürek et al 2022
“What learning algorithm is in-context learning? Investigations with linear models”
“A Deep Learning and Digital Archaeology Approach for Mosquito Repellent Discovery”, Wei et al 2022
“A deep learning and digital archaeology approach for mosquito repellent discovery”
“GENIUS: Sketch-based Language Model Pre-training via Extreme and Selective Masking for Text Generation and Augmentation”, Guo et al 2022
“Distilled DeepConsensus: Knowledge Distillation for Fast and Accurate DNA Sequence Correction”, Belyaeva et al 2022
“Distilled DeepConsensus: Knowledge distillation for fast and accurate DNA sequence correction”
“Uni-Perceiver V2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks”, Li et al 2022
“Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks”
“UniSumm: Unified Few-shot Summarization With Multi-Task Pre-Training and Prefix-Tuning”, Chen et al 2022
“UniSumm: Unified Few-shot Summarization with Multi-Task Pre-Training and Prefix-Tuning”
“OneFormer: One Transformer to Rule Universal Image Segmentation”, Jain et al 2022
“OneFormer: One Transformer to Rule Universal Image Segmentation”
“Massively Multilingual ASR on 70 Languages: Tokenization, Architecture, and Generalization Capabilities”, Tjandra et al 2022
“Characterizing Intrinsic Compositionality in Transformers With Tree Projections”, Murty et al 2022
“Characterizing Intrinsic Compositionality in Transformers with Tree Projections”
“Fast DistilBERT on CPUs”, Shen et al 2022
“n-gram Is Back: Residual Learning of Neural Text Generation With n-gram Language Model”, Li et al 2022
“n-gram Is Back: Residual Learning of Neural Text Generation with n-gram Language Model”
“Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models”, Liu et al 2022
“Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models”
“Noise-Robust De-Duplication at Scale”, Silcock et al 2022
“Small Character Models Match Large Word Models for Autocomplete Under Memory Constraints”, Jawahar et al 2022
“Small Character Models Match Large Word Models for Autocomplete Under Memory Constraints”
“Improving Sample Quality of Diffusion Models Using Self-Attention Guidance”, Hong et al 2022
“Improving Sample Quality of Diffusion Models Using Self-Attention Guidance”
“Semantic Scene Descriptions As an Objective of Human Vision”, Doerig et al 2022
“Semantic scene descriptions as an objective of human vision”
“A Generalist Neural Algorithmic Learner”, Ibarz et al 2022
“SetFit: Efficient Few-Shot Learning Without Prompts”, Tunstall et al 2022
“Machine Reading, Fast and Slow: When Do Models "Understand" Language?”, Choudhury et al 2022
“Machine Reading, Fast and Slow: When Do Models "Understand" Language?”
“On the Effectiveness of Compact Biomedical Transformers (✱BioBERT)”, Rohanian et al 2022
“On the Effectiveness of Compact Biomedical Transformers (✱BioBERT)”
“ASR2K: Speech Recognition for Around 2000 Languages without Audio”, Li et al 2022
“ASR2K: Speech Recognition for Around 2000 Languages without Audio”
“Analyzing Transformers in Embedding Space”, Dar et al 2022
“MeloForm: Generating Melody With Musical Form Based on Expert Systems and Neural Networks”, Lu et al 2022
“MeloForm: Generating Melody with Musical Form based on Expert Systems and Neural Networks”
“CorpusBrain: Pre-train a Generative Retrieval Model for Knowledge-Intensive Language Tasks”, Chen et al 2022
“CorpusBrain: Pre-train a Generative Retrieval Model for Knowledge-Intensive Language Tasks”
“PatchDropout: Economizing Vision Transformers Using Patch Dropout”, Liu et al 2022
“PatchDropout: Economizing Vision Transformers Using Patch Dropout”
“Why Do Tree-based Models Still Outperform Deep Learning on Tabular Data?”, Grinsztajn et al 2022
“Why do tree-based models still outperform deep learning on tabular data?”
“Re2G: Retrieve, Rerank, Generate”, Glass et al 2022
“Transformer Neural Processes: Uncertainty-Aware Meta Learning Via Sequence Modeling”, Nguyen & Grover 2022
“Transformer Neural Processes: Uncertainty-Aware Meta Learning Via Sequence Modeling”
“Neural Networks and the Chomsky Hierarchy”, Delétang et al 2022
“TabPFN: Meta-Learning a Real-Time Tabular AutoML Method For Small Data”, Hollmann et al 2022
“TabPFN: Meta-Learning a Real-Time Tabular AutoML Method For Small Data”
“Do Loyal Users Enjoy Better Recommendations? Understanding Recommender Accuracy from a Time Perspective”, Ji et al 2022
“Transfer Learning With Deep Tabular Models”, Levin et al 2022
“BertNet: Harvesting Knowledge Graphs from Pretrained Language Models”, Hao et al 2022
“BertNet: Harvesting Knowledge Graphs from Pretrained Language Models”
“ProGen2: Exploring the Boundaries of Protein Language Models”, Nijkamp et al 2022
“ProGen2: Exploring the Boundaries of Protein Language Models”
“LAVENDER: Unifying Video-Language Understanding As Masked Language Modeling”, Li et al 2022
“LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling”
“RHO-LOSS: Prioritized Training on Points That Are Learnable, Worth Learning, and Not Yet Learnt”, Mindermann et al 2022
“RHO-LOSS: Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt”
“SBERT Studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features”, Opitz & Frank 2022
“Language Models Are General-Purpose Interfaces”, Hao et al 2022
“Reconstructing the Cascade of Language Processing in the Brain Using the Internal Computations of a Transformer-based Language Model”, Kumar et al 2022
“Uni-Perceiver-MoE: Learning Sparse Generalist Models With Conditional MoEs”, Zhu et al 2022
“Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs”
“A Neural Corpus Indexer for Document Retrieval”, Wang et al 2022
“XTC: Extreme Compression for Pre-trained Transformers Made Simple and Efficient”, Wu et al 2022
“XTC: Extreme Compression for Pre-trained Transformers Made Simple and Efficient”
“Toward a Realistic Model of Speech Processing in the Brain With Self-supervised Learning”, Millet et al 2022
“Toward a realistic model of speech processing in the brain with self-supervised learning”
“Text2Human: Text-Driven Controllable Human Image Generation”, Jiang et al 2022
“Text2Human: Text-Driven Controllable Human Image Generation”
“Anime Character Recognition Using Intermediate Features Aggregation”, Rios et al 2022
“Anime Character Recognition using Intermediate Features Aggregation”
“Towards Learning Universal Hyperparameter Optimizers With Transformers”, Chen et al 2022
“Towards Learning Universal Hyperparameter Optimizers with Transformers”
“On the Paradox of Learning to Reason from Data”, Zhang et al 2022
“HTPS: HyperTree Proof Search for Neural Theorem Proving”, Lample et al 2022
“Housekeep: Tidying Virtual Households Using Commonsense Reasoning”, Kant et al 2022
“Housekeep: Tidying Virtual Households using Commonsense Reasoning”
“Tradformer: A Transformer Model of Traditional Music Transcriptions”, Casini & Sturm 2022
“Tradformer: A Transformer Model of Traditional Music Transcriptions”
“UViM: A Unified Modeling Approach for Vision With Learned Guiding Codes”, Kolesnikov et al 2022
“UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes”
“PLAID: An Efficient Engine for Late Interaction Retrieval”, Santhanam et al 2022
“Continual Pre-Training Mitigates Forgetting in Language and Vision”, Cossu et al 2022
“Continual Pre-Training Mitigates Forgetting in Language and Vision”
“Few-Shot Parameter-Efficient Fine-Tuning Is Better and Cheaper Than In-Context Learning”, Liu et al 2022
“Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning”
“SymphonyNet: Symphony Generation With Permutation Invariant Language Model”, Liu et al 2022
“SymphonyNet: Symphony Generation with Permutation Invariant Language Model”
“When Does Dough Become a Bagel? Analyzing the Remaining Mistakes on ImageNet”, Vasudevan et al 2022
“When does dough become a bagel? Analyzing the remaining mistakes on ImageNet”
“A Challenging Benchmark of Anime Style Recognition”, Li et al 2022
“Data Distributional Properties Drive Emergent Few-Shot Learning in Transformers”, Chan et al 2022
“Data Distributional Properties Drive Emergent Few-Shot Learning in Transformers”
“Masked Siamese Networks for Label-Efficient Learning”, Assran et al 2022
“DualPrompt: Complementary Prompting for Rehearsal-free Continual Learning”, Wang et al 2022
“DualPrompt: Complementary Prompting for Rehearsal-free Continual Learning”
“Language Models That Seek for Knowledge: Modular Search & Generation for Dialogue and Prompt Completion”, Shuster et al 2022
“On Embeddings for Numerical Features in Tabular Deep Learning”, Gorishniy et al 2022
“On Embeddings for Numerical Features in Tabular Deep Learning”
“In-context Learning and Induction Heads”, Olsson et al 2022
“LiteTransformerSearch: Training-free Neural Architecture Search for Efficient Language Models”, Javaheripi et al 2022
“LiteTransformerSearch: Training-free Neural Architecture Search for Efficient Language Models”
“Pretraining without Wordpieces: Learning Over a Vocabulary of Millions of Words”, Feng et al 2022
“Pretraining without Wordpieces: Learning Over a Vocabulary of Millions of Words”
“TACTiS: Transformer-Attentional Copulas for Time Series”, Drouin et al 2022
“OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework”, Wang et al 2022
“AutoDistil: Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models”, Xu et al 2022
“AutoDistil: Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models”
“FIGARO: Generating Symbolic Music With Fine-Grained Artistic Control”, Rütte et al 2022
“FIGARO: Generating Symbolic Music with Fine-Grained Artistic Control”
“Robust Contrastive Learning against Noisy Views”, Chuang et al 2022
“HyperTransformer: Model Generation for Supervised and Semi-Supervised Few-Shot Learning”, Zhmoginov et al 2022
“HyperTransformer: Model Generation for Supervised and Semi-Supervised Few-Shot Learning”
“A Mathematical Framework for Transformer Circuits”, Elhage et al 2021
“XGLM: Few-shot Learning With Multilingual Language Models”, Lin et al 2021
“PFNs: Transformers Can Do Bayesian Inference”, Müller et al 2021
“AI Improvements in Chemical Calculations”, Lowe 2021
“An Empirical Investigation of the Role of Pre-training in Lifelong Learning”, Mehta et al 2021
“An Empirical Investigation of the Role of Pre-training in Lifelong Learning”
“You Only Need One Model for Open-domain Question Answering”, Lee et al 2021
“You Only Need One Model for Open-domain Question Answering”
“Human Parity on CommonsenseQA: Augmenting Self-Attention With External Attention”, Xu et al 2021
“Human Parity on CommonsenseQA: Augmenting Self-Attention with External Attention”
“Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks”, Zhu et al 2021
“Inducing Causal Structure for Interpretable Neural Networks (IIT)”, Geiger et al 2021
“Inducing Causal Structure for Interpretable Neural Networks (IIT)”
“OCR-free Document Understanding Transformer”, Kim et al 2021
“FQ-ViT: Fully Quantized Vision Transformer without Retraining”, Lin et al 2021
“FQ-ViT: Fully Quantized Vision Transformer without Retraining”
“Semi-Supervised Music Tagging Transformer”, Won et al 2021
“LEMON: Scaling Up Vision-Language Pre-training for Image Captioning”, Hu et al 2021
“LEMON: Scaling Up Vision-Language Pre-training for Image Captioning”
“UNICORN: Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling”, Yang et al 2021
“UNICORN: Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling”
“Compositional Transformers for Scene Generation”, Hudson & Zitnick 2021
“A Survey of Visual Transformers”, Liu et al 2021
“Improving Visual Quality of Image Synthesis by A Token-based Generator With Transformers”, Zeng et al 2021
“Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers”
“STransGAN: An Empirical Study on Transformer in GANs”, Xu et al 2021
“The Efficiency Misnomer”, Dehghani et al 2021
“Lifelong Pretraining: Continually Adapting Language Models to Emerging Corpora”, Jin et al 2021
“Lifelong Pretraining: Continually Adapting Language Models to Emerging Corpora”
“The Dangers of Underclaiming: Reasons for Caution When Reporting How NLP Systems Fail”, Bowman 2021
“The Dangers of Underclaiming: Reasons for Caution When Reporting How NLP Systems Fail”
“Palette: Image-to-Image Diffusion Models”, Saharia et al 2021
“Autoregressive Latent Video Prediction With High-Fidelity Image Generator”, Seo et al 2021
“Autoregressive Latent Video Prediction with High-Fidelity Image Generator”
“Transformers Are Meta-Reinforcement Learners”, Anonymous 2021
“Text2Brain: Synthesis of Brain Activation Maps from Free-form Text Query”, Ngo et al 2021
“Text2Brain: Synthesis of Brain Activation Maps from Free-form Text Query”
“BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition”, Zhang et al 2021
“Understanding and Overcoming the Challenges of Efficient Transformer Quantization”, Bondarenko et al 2021
“Understanding and Overcoming the Challenges of Efficient Transformer Quantization”
“TrOCR: Transformer-based Optical Character Recognition With Pre-trained Models”, Li et al 2021
“TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models”
“KroneckerBERT: Learning Kronecker Decomposition for Pre-trained Language Models via Knowledge Distillation”, Tahaei et al 2021
“Block Pruning For Faster Transformers”, Lagunas et al 2021
“The Sensory Neuron As a Transformer: Permutation-Invariant Neural Networks for Reinforcement Learning”, Tang & Ha 2021
“DeepConsensus: Gap-Aware Sequence Transformers for Sequence Correction”, Baid et al 2021
“DeepConsensus: Gap-Aware Sequence Transformers for Sequence Correction”
“ImageBART: Bidirectional Context With Multinomial Diffusion for Autoregressive Image Synthesis”, Esser et al 2021
“ImageBART: Bidirectional Context with Multinomial Diffusion for Autoregressive Image Synthesis”
“Modeling Protein Using Large-scale Pretrain Language Model”, Xiao et al 2021
“Modeling Protein Using Large-scale Pretrain Language Model”
“Billion-Scale Pretraining With Vision Transformers for Multi-Task Visual Representations”, Beal et al 2021
“Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations”
“EVA: An Open-Domain Chinese Dialogue System With Large-Scale Generative Pre-Training”, Zhou et al 2021
“EVA: An Open-Domain Chinese Dialogue System with Large-Scale Generative Pre-Training”
“Internet-Augmented Dialogue Generation”, Komeili et al 2021
“ViTGAN: Training GANs With Vision Transformers”, Lee et al 2021
“ARM-Net: Adaptive Relation Modeling Network for Structured Data”, Cai et al 2021
“ARM-Net: Adaptive Relation Modeling Network for Structured Data”
“SCARF: Self-Supervised Contrastive Learning Using Random Feature Corruption”, Bahri et al 2021
“SCARF: Self-Supervised Contrastive Learning using Random Feature Corruption”
“Charformer: Fast Character Transformers via Gradient-based Subword Tokenization”, Tay et al 2021
“Charformer: Fast Character Transformers via Gradient-based Subword Tokenization”
“BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models”, Zaken et al 2021
“BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models”
“Revisiting the Calibration of Modern Neural Networks”, Minderer et al 2021
“Scaling Laws for Acoustic Models”, Droppo & Elibol 2021
“CoAtNet: Marrying Convolution and Attention for All Data Sizes”, Dai et al 2021
“CoAtNet: Marrying Convolution and Attention for All Data Sizes”
“Chasing Sparsity in Vision Transformers: An End-to-End Exploration”, Chen et al 2021
“Chasing Sparsity in Vision Transformers: An End-to-End Exploration”
“Tabular Data: Deep Learning Is Not All You Need”, Shwartz-Ziv & Armon 2021
“Self-Attention Between Datapoints: Going Beyond Individual Input-Output Pairs in Deep Learning”, Kossen et al 2021
“Self-Attention Between Datapoints: Going Beyond Individual Input-Output Pairs in Deep Learning”
“SegFormer: Simple and Efficient Design for Semantic Segmentation With Transformers”, Xie et al 2021
“SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers”
“Exploring Transfer Learning Techniques for Named Entity Recognition in Noisy User-Generated Text”, Bogensperger 2021
“Exploring Transfer Learning techniques for Named Entity Recognition in Noisy User-Generated Text”
“Maximizing 3-D Parallelism in Distributed Training for Huge Neural Networks”, Bian et al 2021
“Maximizing 3-D Parallelism in Distributed Training for Huge Neural Networks”
“One4all User Representation for Recommender Systems in E-commerce”, Shin et al 2021
“One4all User Representation for Recommender Systems in E-commerce”
“MathBERT: A Pre-Trained Model for Mathematical Formula Understanding”, Peng et al 2021
“MathBERT: A Pre-Trained Model for Mathematical Formula Understanding”
“MDETR—Modulated Detection for End-to-End Multi-Modal Understanding”, Kamath et al 2021
“MDETR—Modulated Detection for End-to-End Multi-Modal Understanding”
“XLM-T: Multilingual Language Models in Twitter for Sentiment Analysis and Beyond”, Barbieri et al 2021
“XLM-T: Multilingual Language Models in Twitter for Sentiment Analysis and Beyond”
“SimCSE: Simple Contrastive Learning of Sentence Embeddings”, Gao et al 2021
“SimCSE: Simple Contrastive Learning of Sentence Embeddings”
“Memorization versus Generalisation in Pre-trained Language Models”, Tänzer et al 2021
“Memorization versus Generalisation in Pre-trained Language Models”
“Robust Open-Vocabulary Translation from Visual Text Representations”, Salesky et al 2021
“Robust Open-Vocabulary Translation from Visual Text Representations”
“Gradient-based Adversarial Attacks against Text Transformers”, Guo et al 2021
“Gradient-based Adversarial Attacks against Text Transformers”
“Retrieval Augmentation Reduces Hallucination in Conversation”, Shuster et al 2021
“Retrieval Augmentation Reduces Hallucination in Conversation”
“Machine Translation Decoding beyond Beam Search”, Leblond et al 2021
“ChinAI #137: Year 3 of ChinAI: Reflections on the Newsworthiness of Machine Translation”, Ding 2021
“ChinAI #137: Year 3 of ChinAI: Reflections on the newsworthiness of machine translation”
“SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network”, Chan et al 2021
“SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network”
“GPV-1: Towards General Purpose Vision Systems”, Gupta et al 2021
“DeepViT: Towards Deeper Vision Transformer”, Zhou et al 2021
“ConViT: Improving Vision Transformers With Soft Convolutional Inductive Biases”, d’Ascoli et al 2021
“ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases”
“Get Your Vitamin C! Robust Fact Verification With Contrastive Evidence (VitaminC)”, Schuster et al 2021
“Get Your Vitamin C! Robust Fact Verification with Contrastive Evidence (VitaminC)”
“Are NLP Models Really Able to Solve Simple Math Word Problems?”, Patel et al 2021
“Are NLP Models really able to Solve Simple Math Word Problems?”
“Learning from Videos to Understand the World”, Zweig et al 2021
“CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation”, Clark et al 2021
“CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation”
“TransGAN: Two Transformers Can Make One Strong GAN”, Jiang et al 2021
“ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision”, Kim et al 2021
“ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision”
“Baller2vec: A Multi-Entity Transformer For Multi-Agent Spatiotemporal Modeling”, Alcorn & Nguyen 2021
“baller2vec: A Multi-Entity Transformer For Multi-Agent Spatiotemporal Modeling”
“Video Transformer Network”, Neimark et al 2021
“BENDR: Using Transformers and a Contrastive Self-supervised Learning Task to Learn from Massive Amounts of EEG Data”, Kostas et al 2021
“Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet”, Yuan et al 2021
“Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet”
“Bottleneck Transformers for Visual Recognition”, Srinivas et al 2021
“DAF:re: A Challenging, Crowd-Sourced, Large-Scale, Long-Tailed Dataset For Anime Character Recognition”, Rios et al 2021
“UPDeT: Universal Multi-agent Reinforcement Learning via Policy Decoupling With Transformers”, Hu et al 2021
“UPDeT: Universal Multi-agent Reinforcement Learning via Policy Decoupling with Transformers”
“MSR-VTT: A Large Video Description Dataset for Bridging Video and Language”, Xu et al 2021
“MSR-VTT: A Large Video Description Dataset for Bridging Video and Language”
“XMC-GAN: Cross-Modal Contrastive Learning for Text-to-Image Generation”, Zhang et al 2021
“XMC-GAN: Cross-Modal Contrastive Learning for Text-to-Image Generation”
“Transformer Feed-Forward Layers Are Key-Value Memories”, Geva et al 2020
“Training Data-efficient Image Transformers & Distillation through Attention”, Touvron et al 2020
“Training data-efficient image transformers & distillation through attention”
“VQ-GAN: Taming Transformers for High-Resolution Image Synthesis”, Esser et al 2020
“VQ-GAN: Taming Transformers for High-Resolution Image Synthesis”
“Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences”, Rives et al 2020
“Object-based Attention for Spatio-temporal Reasoning: Outperforming Neuro-symbolic Models With Flexible Distributed Architectures”, Ding et al 2020
“TStarBot-X: An Open-Sourced and Comprehensive Study for Efficient League Training in StarCraft II Full Game”, Han et al 2020
“A Recurrent Vision-and-Language BERT for Navigation”, Hong et al 2020
“A Primer in BERTology: What We Know about How BERT Works”, Rogers et al 2020
“CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters”, Boukkouri et al 2020
“TernaryBERT: Distillation-aware Ultra-low Bit BERT”, Zhang et al 2020
“Weird AI Yankovic: Generating Parody Lyrics”, Riedl 2020
“It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners”, Schick & Schütze 2020
“DeepSpeed: Extreme-scale Model Training for Everyone”, DeepSpeed Team et al 2020
“Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing”, Gu et al 2020
“Hopfield Networks Is All You Need”, Ramsauer et al 2020
“Modern Hopfield Networks and Attention for Immune Repertoire Classification”, Widrich et al 2020
“DeepSinger: Singing Voice Synthesis With Data Mined From the Web”, Ren et al 2020
“Data Movement Is All You Need: A Case Study on Optimizing Transformers”, Ivanov et al 2020
“Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations”, Baevski et al 2020
“Learning to Learn With Feedback and Local Plasticity”, Lindsey & Litwin-Kumar 2020
“PipeDream-2BW: Memory-Efficient Pipeline-Parallel DNN Training”, Narayanan et al 2020
“Improving GAN Training With Probability Ratio Clipping and Sample Reweighting”, Wu et al 2020
“DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations”, Giorgi et al 2020
“DeBERTa: Decoding-enhanced BERT With Disentangled Attention”, He et al 2020
“DETR: End-to-End Object Detection With Transformers”, Carion et al 2020
“TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data”, Yin et al 2020
“ForecastQA: A Question Answering Challenge for Event Forecasting With Temporal Text Data”, Jin et al 2020
“VLN-BERT: Improving Vision-and-Language Navigation With Image-Text Pairs from the Web”, Majumdar et al 2020
“General Purpose Text Embeddings from Pre-trained Language Models for Scalable Inference”, Du et al 2020
“Blender: A State-of-the-art Open Source Chatbot”, Roller et al 2020
“Recipes for Building an Open-domain Chatbot”, Roller et al 2020
“Rapformer: Conditional Rap Lyrics Generation With Denoising Autoencoders”, Nikolov et al 2020
“On the Effect of Dropping Layers of Pre-trained Transformer Models”, Sajjad et al 2020
“TAPAS: Weakly Supervised Table Parsing via Pre-training”, Herzig et al 2020
“A Hundred Visions and Revisions”, Binder 2020
“Rethinking Parameter Counting in Deep Models: Effective Dimensionality Revisited”, Maddox et al 2020
“AraBERT: Transformer-based Model for Arabic Language Understanding”, Antoun et al 2020
“MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers”, Wang et al 2020
“GNS: Learning to Simulate Complex Physics With Graph Networks”, Sanchez-Gonzalez et al 2020
“Bayesian Deep Learning and a Probabilistic Perspective of Generalization”, Wilson & Izmailov 2020
“Do We Need Zero Training Loss After Achieving Zero Training Error?”, Ishida et al 2020
“Transformers As Soft Reasoners over Language”, Clark et al 2020
“Towards a Conversational Agent That Can Chat About…Anything”, Adiwardana & Luong 2020
“Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference”, Schick & Schütze 2020
“VIME: Extending the Success of Self-supervised and Semi-supervised Learning to Tabular Domain”, Yoon et al 2020
“Mastering Complex Control in MOBA Games With Deep Reinforcement Learning”, Ye et al 2019
“Measuring Compositional Generalization: A Comprehensive Method on Realistic Data”, Keysers et al 2019
“PEGASUS: Pre-training With Extracted Gap-sentences for Abstractive Summarization”, Zhang et al 2019
“Encoding Musical Style With Transformer Autoencoders”, Choi et al 2019
“Deep Double Descent: We Show That the Double Descent Phenomenon Occurs in CNNs, ResNets, and Transformers: Performance First Improves, Then Gets Worse, and Then Improves Again With Increasing Model Size, Data Size, or Training Time. This Effect Is Often Avoided through Careful Regularization. While This Behavior Appears to Be Fairly Universal, We Don’t yet Fully Understand Why It Happens, and View Further Study of This Phenomenon As an Important Research Direction.”, Nakkiran et al 2019
“Detecting GAN Generated Errors”, Zhu et al 2019
“SimpleBooks: Long-term Dependency Book Dataset With Simplified English Vocabulary for Word-level Language Modeling”, Nguyen 2019
“Unsupervised Cross-lingual Representation Learning at Scale”, Conneau et al 2019
“DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter”, Sanh et al 2019
“Multiplicative Interactions and Where to Find Them”, Jayakumar et al 2019
“TinyBERT: Distilling BERT for Natural Language Understanding”, Jiao et al 2019
“Do NLP Models Know Numbers? Probing Numeracy in Embeddings”, Wallace et al 2019
“PubMedQA: A Dataset for Biomedical Research Question Answering”, Jin et al 2019
“Frustratingly Easy Natural Question Answering”, Pan et al 2019
“Distributionally Robust Language Modeling”, Oren et al 2019
“The Bottom-up Evolution of Representations in the Transformer: A Study With Machine Translation and Language Modeling Objectives”, Voita et al 2019
“Encode, Tag, Realize: High-Precision Text Editing”, Malmi et al 2019
“Language Models As Knowledge Bases?”, Petroni et al 2019
“Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks”, Reimers & Gurevych 2019
“TabNet: Attentive Interpretable Tabular Learning”, Arik & Pfister 2019
“StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding”, Wang et al 2019
“RoBERTa: A Robustly Optimized BERT Pretraining Approach”, Liu et al 2019
“Theoretical Limitations of Self-Attention in Neural Sequence Models”, Hahn 2019
“Energy and Policy Considerations for Deep Learning in NLP”, Strubell et al 2019
“Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned”, Voita et al 2019
“HellaSwag: Can a Machine Really Finish Your Sentence?”, Zellers et al 2019
“UniLM: Unified Language Model Pre-training for Natural Language Understanding and Generation”, Dong et al 2019
“MASS: Masked Sequence to Sequence Pre-training for Language Generation”, Song et al 2019
“Mask-Predict: Parallel Decoding of Conditional Masked Language Models”, Ghazvininejad et al 2019
“Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes”, You et al 2019
“LIGHT: Learning to Speak and Act in a Fantasy Text Adventure Game”, Urbanek et al 2019
“Insertion Transformer: Flexible Sequence Generation via Insertion Operations”, Stern et al 2019
“Adapter: Parameter-Efficient Transfer Learning for NLP”, Houlsby et al 2019
“BioBERT: a Pre-trained Biomedical Language Representation Model for Biomedical Text Mining”, Lee et al 2019
“Bayesian Layers: A Module for Neural Network Uncertainty”, Tran et al 2018
“Blockwise Parallel Decoding for Deep Autoregressive Models”, Stern et al 2018
“Object Hallucination in Image Captioning”, Rohrbach et al 2018
“Self-Attention Generative Adversarial Networks”, Zhang et al 2018
“Universal Sentence Encoder”, Cer et al 2018
“Self-Attention With Relative Position Representations”, Shaw et al 2018
“Learning Longer-term Dependencies in RNNs With Auxiliary Losses”, Trinh et al 2018
“Generating Structured Music through Self-Attention”, Huang et al 2018
“A Simple Neural Attentive Meta-Learner”, Mishra et al 2017
“Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer”, Zagoruyko & Komodakis 2016
“QRNNs: Quasi-Recurrent Neural Networks”, Bradbury et al 2016
“Gaussian Error Linear Units (GELUs)”, Hendrycks & Gimpel 2016
“Pointer Networks”, Vinyals et al 2015
“Huggingface: transformers Repo”, Huggingface 2023
“Transformers Are a Very Exciting Family of Machine Learning Architectures. Many Good Tutorials Exist (eg. [1, 2]) but in the Last Few Years, Transformers Have Mostly Become Simpler, so That It Is Now Much More Straightforward to Explain How Modern Architectures Work. This Post Is an Attempt to Explain Directly [in PyTorch] How Modern Transformers Work, and Why, without Some of the Historical Baggage.”
Sort By Magic
Annotations sorted by machine learning into inferred 'tags'. This provides an alternative way to browse: instead of by date order, one can browse in topic order. The 'sorted' list has been automatically clustered into multiple sections & auto-labeled for easier browsing.
Beginning with the newest annotation, it uses each annotation’s embedding to find its nearest-neighbor annotations, chaining them into a progression of topics (a minimal sketch of this ordering follows the tag list below). For more details, see the link.
few-shot-learning
language-models
pretraining
transformer-applications
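As a rough illustration of the nearest-neighbor ordering described above, here is a minimal Python sketch. It assumes the annotation embeddings have already been computed; the function names and use of NumPy are illustrative assumptions, not the site’s actual implementation.

```python
# Minimal sketch of embedding-based "sort by magic": greedily chain each
# annotation to its most similar unvisited neighbor, starting from the newest.
# Illustrative only -- not the site's actual code.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def sort_by_magic(embeddings: list[np.ndarray], newest: int = 0) -> list[int]:
    """Return annotation indices as a topic progression: start from the
    newest annotation, then repeatedly append the nearest unvisited
    neighbor of the most recently added annotation."""
    unvisited = set(range(len(embeddings)))
    order = [newest]
    unvisited.remove(newest)
    while unvisited:
        current = embeddings[order[-1]]
        nearest = max(unvisited, key=lambda i: cosine(current, embeddings[i]))
        order.append(nearest)
        unvisited.remove(nearest)
    return order

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_embeddings = [rng.normal(size=64) for _ in range(10)]  # stand-ins for real annotation embeddings
    print(sort_by_magic(fake_embeddings))
```

A fuller version would presumably also cluster the resulting order into sections and auto-label them (eg. k-means over the same embeddings), which is what produces the tag headings above.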
Wikipedia
Miscellaneous
- /doc/ai/nn/transformer/2021-hu-figure6-largerlemoncaptionmodelsaremoresampleefficient.png
- /doc/ai/nn/transformer/2021-hu-figure2-b-datascalingfinetuningperformanceonnocaps.png
- https://blog.research.google/2023/09/on-device-content-distillation-with.html
- https://iaml-it.github.io/posts/2021-04-28-transformers-in-vision/
- https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/
- https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms
- https://mchromiak.github.io/articles/2017/Sep/12/Transformer-Attention-is-all-you-need/
- https://sander.ai/2023/01/09/diffusion-language.html#deepmind
- https://twitter.com/stephenroller/status/1579993017234382849
- https://www.quantamagazine.org/how-ai-transformers-mimic-parts-of-the-brain-20220912/
Link Bibliography
- https://arxiv.org/abs/2311.03079#zhipu: “CogVLM: Visual Expert for Pretrained Language Models”
- https://arxiv.org/abs/2310.07096#ibm: “Sparse Universal Transformer”, Shawn Tan, Yikang Shen, Zhenfang Chen, Aaron Courville, Chuang Gan
- https://arxiv.org/abs/2310.06694: “Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning”, Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, Danqi Chen
- https://arxiv.org/abs/2310.02207: “Language Models Represent Space and Time”, Wes Gurnee, Max Tegmark
- https://arxiv.org/abs/2306.09222#google: “RGD: Stochastic Re-weighted Gradient Descent via Distributionally Robust Optimization”, Ramnath Kumar, Kushal Majmundar, Dheeraj Nagaraj, Arun Sai Suggala
- https://arxiv.org/abs/2306.05426: “SequenceMatch: Imitation Learning for Autoregressive Sequence Modelling With Backtracking”, Chris Cundy, Stefano Ermon
- https://arxiv.org/abs/2305.11863: “Scaling Laws for Language Encoding Models in FMRI”, Richard Antonello, Aditya Vaidya, Alexander G. Huth
- https://arxiv.org/abs/2304.05538: “ImageNet-Hard: The Hardest Images Remaining from a Study of the Power of Zoom and Spatial Biases in Image Classification”, Mohammad Reza Taesiri, Giang Nguyen, Sarra Habchi, Cor-Paul Bezemer, Anh Nguyen
- 2023-jia.pdf: “When and How Artificial Intelligence Augments Employee Creativity”, Nan Jia, Xueming Luo, Zheng Fang, Chengcheng Liao
- https://arxiv.org/abs/2302.12441: “MUX-PLMs: Pre-training Language Models With Data Multiplexing”, Vishvak Murahari, Ameet Deshpande, Carlos E. Jimenez, Izhak Shafran, Mingqiu Wang, Yuan Cao, Karthik Narasimhan
- https://arxiv.org/abs/2302.05442#google: “Scaling Vision Transformers to 22 Billion Parameters”
- https://arxiv.org/abs/2302.04907#google: “BMT: Binarized Neural Machine Translation”, Yichi Zhang, Ankush Garg, Yuan Cao, Łukasz Lew, Behrooz Ghorbani, Zhiru Zhang, Orhan Firat
- https://arxiv.org/abs/2301.03992#nvidia: “Vision Transformers Are Good Mask Auto-Labelers”, Shiyi Lan, Xitong Yang, Zhiding Yu, Zuxuan Wu, Jose M. Alvarez, Anima Anandkumar
- https://arxiv.org/abs/2301.03728#facebook: “Scaling Laws for Generative Mixed-Modal Language Models”
- https://arxiv.org/abs/2212.09410: “Less Is More: Parameter-Free Text Classification With Gzip”, Zhiying Jiang, Matthew Y. R. Yang, Mikhail Tsirlin, Raphael Tang, Jimmy Lin
- https://arxiv.org/abs/2212.06727: “What Do Vision Transformers Learn? A Visual Exploration”
- https://arxiv.org/abs/2212.05199#google: “MAGVIT: Masked Generative Video Transformer”
- https://arxiv.org/abs/2212.05051: “VindLU: A Recipe for Effective Video-and-Language Pretraining”, Feng Cheng, Xizi Wang, Jie Lei, David Crandall, Mohit Bansal, Gedas Bertasius
- https://arxiv.org/abs/2212.03533#microsoft: “Text Embeddings by Weakly-Supervised Contrastive Pre-training”, Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, Furu Wei
- https://arxiv.org/abs/2211.09808: “Uni-Perceiver V2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks”
- https://arxiv.org/abs/2211.06220: “OneFormer: One Transformer to Rule Universal Image Segmentation”, Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi
- https://arxiv.org/abs/2209.11737: “Semantic Scene Descriptions As an Objective of Human Vision”, Adrien Doerig, Tim C. Kietzmann, Emily Allen, Yihan Wu, Thomas Naselaris, Kendrick Kay, Ian Charest
- https://arxiv.org/abs/2209.11055: “SetFit: Efficient Few-Shot Learning Without Prompts”, Lewis Tunstall, Nils Reimers, Unso Eun Seo Jo, Luke Bates, Daniel Korat, Moshe Wasserblat, Oren Pereg
- https://arxiv.org/abs/2209.02535: “Analyzing Transformers in Embedding Space”, Guy Dar, Mor Geva, Ankit Gupta, Jonathan Berant
- https://arxiv.org/abs/2207.06300#ibm: “Re2G: Retrieve, Rerank, Generate”, Michael Glass, Gaetano Rossiello, Md Faisal Mahbub Chowdhury, Ankita Rajaram Naik, Pengshan Cai, Alfio Gliozzo
- https://arxiv.org/abs/2207.01848: “TabPFN: Meta-Learning a Real-Time Tabular AutoML Method For Small Data”, Noah Hollmann, Samuel Müller, Katharina Eggensperger, Frank Hutter
- https://arxiv.org/abs/2204.05927: “Do Loyal Users Enjoy Better Recommendations? Understanding Recommender Accuracy from a Time Perspective”, Yitong Ji, Aixin Sun, Jie Zhang, Chenliang Li
- https://arxiv.org/abs/2206.07160#microsoft: “LAVENDER: Unifying Video-Language Understanding As Masked Language Modeling”, Linjie Li, Zhe Gan, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Ce Liu, Lijuan Wang
- https://arxiv.org/abs/2206.07137: “RHO-LOSS: Prioritized Training on Points That Are Learnable, Worth Learning, and Not Yet Learnt”
- https://www.biorxiv.org/content/10.1101/2022.06.08.495348.full: “Reconstructing the Cascade of Language Processing in the Brain Using the Internal Computations of a Transformer-based Language Model”
- https://arxiv.org/abs/2206.01859#microsoft: “XTC: Extreme Compression for Pre-trained Transformers Made Simple and Efficient”, Xiaoxia Wu, Zhewei Yao, Minjia Zhang, Conglong Li, Yuxiong He
- https://arxiv.org/abs/2206.01685: “Toward a Realistic Model of Speech Processing in the Brain With Self-supervised Learning”
- 2022-rios.pdf: “Anime Character Recognition Using Intermediate Features Aggregation”, Edwin Arkel Rios, Min-Chun Hu, Bo-Cheng Lai
- https://arxiv.org/abs/2205.13320#google: “Towards Learning Universal Hyperparameter Optimizers With Transformers”
- https://arxiv.org/abs/2205.11491#facebook: “HTPS: HyperTree Proof Search for Neural Theorem Proving”
- https://arxiv.org/abs/2205.04596#google: “When Does Dough Become a Bagel? Analyzing the Remaining Mistakes on ImageNet”, Vijay Vasudevan, Benjamin Caine, Raphael Gontijo-Lopes, Sara Fridovich-Keil, Rebecca Roelofs
- https://arxiv.org/abs/2203.13224#facebook: “Language Models That Seek for Knowledge: Modular Search & Generation for Dialogue and Prompt Completion”, Kurt Shuster, Mojtaba Komeili, Leonard Adolphs, Stephen Roller, Arthur Szlam, Jason Weston
- https://arxiv.org/abs/2203.02094#microsoft: “LiteTransformerSearch: Training-free Neural Architecture Search for Efficient Language Models”
- https://arxiv.org/abs/2202.03052#alibaba: “OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework”
- https://arxiv.org/abs/2112.10510: “PFNs: Transformers Can Do Bayesian Inference”, Samuel Müller, Noah Hollmann, Sebastian Pineda Arango, Josif Grabocka, Frank Hutter
- https://arxiv.org/abs/2112.07381#samsung: “You Only Need One Model for Open-domain Question Answering”, Haejun Lee, Akhil Kedia, Jongwon Lee, Ashwin Paranjape, Christopher D. Manning, Kyoung-Gu Woo
- https://arxiv.org/abs/2111.13824: “FQ-ViT: Fully Quantized Vision Transformer without Retraining”, Yang Lin, Tianyu Zhang, Peiqin Sun, Zheng Li, Shuchang Zhou
- https://arxiv.org/abs/2111.12233#microsoft: “LEMON: Scaling Up Vision-Language Pre-training for Image Captioning”, Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, Lijuan Wang
- https://arxiv.org/abs/2111.06091: “A Survey of Visual Transformers”
- https://arxiv.org/abs/2109.12948: “Understanding and Overcoming the Challenges of Efficient Transformer Quantization”, Yelysei Bondarenko, Markus Nagel, Tijmen Blankevoort
- https://arxiv.org/abs/2109.10282#microsoft: “TrOCR: Transformer-based Optical Character Recognition With Pre-trained Models”, Minghao Li, Tengchao Lv, Jingye Chen, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei
- https://arxiv.org/abs/2109.06243#huawei: “KroneckerBERT: Learning Kronecker Decomposition for Pre-trained Language Models via Knowledge Distillation”, Marzieh S. Tahaei, Ella Charlaix, Vahid Partovi Nia, Ali Ghodsi, Mehdi Rezagholizadeh
- https://arxiv.org/abs/2107.07566#facebook: “Internet-Augmented Dialogue Generation”, Mojtaba Komeili, Kurt Shuster, Jason Weston
- https://arxiv.org/abs/2107.04589: “ViTGAN: Training GANs With Vision Transformers”, Kwonjoon Lee, Huiwen Chang, Lu Jiang, Han Zhang, Zhuowen Tu, Ce Liu
- https://arxiv.org/abs/2106.12672#google: “Charformer: Fast Character Transformers via Gradient-based Subword Tokenization”
- https://arxiv.org/abs/2106.10199: “BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models”, Elad Ben Zaken, Shauli Ravfogel, Yoav Goldberg
- https://arxiv.org/abs/2106.09488#amazon: “Scaling Laws for Acoustic Models”, Jasha Droppo, Oguz Elibol
- https://arxiv.org/abs/2106.04803#google: “CoAtNet: Marrying Convolution and Attention for All Data Sizes”, Zihang Dai, Hanxiao Liu, Quoc V. Le, Mingxing Tan
- https://arxiv.org/abs/2106.04533: “Chasing Sparsity in Vision Transformers: An End-to-End Exploration”, Tianlong Chen, Yu Cheng, Zhe Gan, Lu Yuan, Lei Zhang, Zhangyang Wang
- https://arxiv.org/abs/2105.15203: “SegFormer: Simple and Efficient Design for Semantic Segmentation With Transformers”, Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo
- https://arxiv.org/abs/2104.07567#facebook: “Retrieval Augmentation Reduces Hallucination in Conversation”, Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, Jason Weston
- https://chinai.substack.com/p/chinai-137-year-3-of-chinai: “ChinAI #137: Year 3 of ChinAI: Reflections on the Newsworthiness of Machine Translation”, Jeffrey Ding
- https://arxiv.org/abs/2103.11886#bytedance: “DeepViT: Towards Deeper Vision Transformer”, Daquan Zhou, Bingyi Kang, Xiaojie Jin, Linjie Yang, Xiaochen Lian, Zihang Jiang, Qibin Hou, Jiashi Feng
- https://arxiv.org/abs/2103.10697#facebook: “ConViT: Improving Vision Transformers With Soft Convolutional Inductive Biases”, Stéphane d’Ascoli, Hugo Touvron, Matthew Leavitt, Ari Morcos, Giulio Biroli, Levent Sagun
- https://ai.facebook.com/blog/learning-from-videos-to-understand-the-world/: “Learning from Videos to Understand the World”, Geoffrey Zweig, Polina Kuznetsova, Michael Auli, Francois Fagan
- https://arxiv.org/abs/2102.07074: “TransGAN: Two Transformers Can Make One Strong GAN”, Yifan Jiang, Shiyu Chang, Zhangyang Wang
- https://arxiv.org/abs/2102.03334: “ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision”, Wonjae Kim, Bokyung Son, Ildoo Kim
- https://arxiv.org/abs/2102.00719: “Video Transformer Network”, Daniel Neimark, Omri Bar, Maya Zohar, Dotan Asselmann
- https://arxiv.org/abs/2101.11986: “Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet”, Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Francis E. H. Tay, Jiashi Feng, Shuicheng Yan
- https://arxiv.org/abs/2101.11605#google: “Bottleneck Transformers for Visual Recognition”, Aravind Srinivas, Tsung-Yi Lin, Niki Parmar, Jonathon Shlens, Pieter Abbeel, Ashish Vaswani
- https://arxiv.org/abs/2101.08674: “DAF:re: A Challenging, Crowd-Sourced, Large-Scale, Long-Tailed Dataset For Anime Character Recognition”, Edwin Arkel Rios, Wen-Huang Cheng, Bo-Cheng Lai
- https://arxiv.org/abs/2101.04702#google: “XMC-GAN: Cross-Modal Contrastive Learning for Text-to-Image Generation”, Han Zhang, Jing Yu Koh, Jason Baldridge, Honglak Lee, Yinfei Yang
- https://arxiv.org/abs/2012.12877#facebook: “Training Data-efficient Image Transformers & Distillation through Attention”, Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou
- https://arxiv.org/abs/2012.08508#deepmind: “Object-based Attention for Spatio-temporal Reasoning: Outperforming Neuro-symbolic Models With Flexible Distributed Architectures”, David Ding, Felix Hill, Adam Santoro, Matt Botvinick
- https://arxiv.org/abs/2011.13729#tencent: “TStarBot-X: An Open-Sourced and Comprehensive Study for Efficient League Training in StarCraft II Full Game”
- https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/: “DeepSpeed: Extreme-scale Model Training for Everyone”, DeepSpeed Team, Rangan Majumder, Junhua Wang
- https://arxiv.org/abs/2008.02217: “Hopfield Networks Is All You Need”
- https://arxiv.org/abs/2006.03654#microsoft: “DeBERTa: Decoding-enhanced BERT With Disentangled Attention”, Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen
- https://arxiv.org/abs/2005.12872#facebook: “DETR: End-to-End Object Detection With Transformers”, Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko
- https://ai.meta.com/blog/state-of-the-art-open-source-chatbot/: “Blender: A State-of-the-art Open Source Chatbot”, Stephen Roller, Jason Weston, Emily Dinan
- https://arxiv.org/abs/2004.03965: “Rapformer: Conditional Rap Lyrics Generation With Denoising Autoencoders”, Nikola I. Nikolov, Eric Malmi, Curtis G. Northcutt, Loreto Parisi
- https://arxiv.org/abs/2004.03844: “On the Effect of Dropping Layers of Pre-trained Transformer Models”, Hassan Sajjad, Fahim Dalvi, Nadir Durrani, Preslav Nakov
- https://arxiv.org/abs/2002.10957#microsoft: “MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers”, Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, Ming Zhou
- https://blog.research.google/2020/01/towards-conversational-agent-that-can.html: “Towards a Conversational Agent That Can Chat About…Anything”, Daniel Adiwardana, Thang Luong
- https://openai.com/research/deep-double-descent: “Deep Double Descent: We Show That the Double Descent Phenomenon Occurs in CNNs, ResNets, and Transformers: Performance First Improves, Then Gets Worse, and Then Improves Again With Increasing Model Size, Data Size, or Training Time. This Effect Is Often Avoided through Careful Regularization. While This Behavior Appears to Be Fairly Universal, We Don’t yet Fully Understand Why It Happens, and View Further Study of This Phenomenon As an Important Research Direction.”, Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, Ilya Sutskever
- https://arxiv.org/abs/1911.02116#facebook: “Unsupervised Cross-lingual Representation Learning at Scale”
- https://arxiv.org/abs/1909.10351: “TinyBERT: Distilling BERT for Natural Language Understanding”, Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, Qun Liu
- https://arxiv.org/abs/1909.05286#ibm: “Frustratingly Easy Natural Question Answering”
- https://arxiv.org/abs/1908.04577#alibaba: “StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding”, Wei Wang, Bin Bi, Ming Yan, Chen Wu, Zuyi Bao, Jiangnan Xia, Liwei Peng, Luo Si
- https://arxiv.org/abs/1907.11692#facebook: “RoBERTa: A Robustly Optimized BERT Pretraining Approach”
- https://arxiv.org/abs/1905.03197: “UniLM: Unified Language Model Pre-training for Natural Language Understanding and Generation”, Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, Hsiao-Wuen Hon
- https://arxiv.org/abs/1904.00962#google: “Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes”
- https://arxiv.org/abs/1902.03249#google: “Insertion Transformer: Flexible Sequence Generation via Insertion Operations”, Mitchell Stern, William Chan, Jamie Kiros, Jakob Uszkoreit
- https://arxiv.org/abs/1901.08746: “BioBERT: a Pre-trained Biomedical Language Representation Model for Biomedical Text Mining”, Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, Jaewoo Kang
- 2018-huang.pdf: “Generating Structured Music through Self-Attention”
- https://github.com/huggingface/transformers: “Huggingface: transformers Repo”, Huggingface