- See Also
- Links
- “The Man of Your Dreams: For $300, Replika Sells an AI Companion Who Will Never Die, Argue, or Cheat—until His Algorithm Is Updated”, Singh-Kurtz 2023
- “Tag2Text: Guiding Vision-Language Model via Image Tagging”, Et Al 2023
- “Towards Democratizing Joint-Embedding Self-Supervised Learning”, Et Al 2023
- “MUX-PLMs: Pre-training Language Models With Data Multiplexing”, Et Al 2023
- “Scaling Vision Transformers to 22 Billion Parameters”, Et Al 2023
- “BMT: Binarized Neural Machine Translation”, Et Al 2023
- “V1T: Large-scale Mouse V1 Response Prediction Using a Vision Transformer”, Et Al 2023
- “XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models”, Et Al 2023
- “ClimaX: A Foundation Model for Weather and Climate”, Et Al 2023
- “DataMUX: Data Multiplexing for Neural Networks”, Et Al 2023
- “Tracr: Compiled Transformers As a Laboratory for Interpretability”, Et Al 2023
- “Vision Transformers Are Good Mask Auto-Labelers”, Et Al 2023
- “Scaling Laws for Generative Mixed-Modal Language Models”, Et Al 2023
- “Why Do Nearest Neighbor Language Models Work?”, Et Al 2023
- “POM: A Principal Odor Map Unifies Diverse Tasks in Human Olfactory Perception”, Et Al 2022
- “What Do Vision Transformers Learn? A Visual Exploration”, Et Al 2022
- “MAGVIT: Masked Generative Video Transformer”, Et Al 2022
- “VindLU: A Recipe for Effective Video-and-Language Pretraining”, Et Al 2022
- “Discovering Latent Knowledge in Language Models Without Supervision”, Et Al 2022
- “BARTSmiles: Generative Masked Language Models for Molecular Representations”, Et Al 2022
- “What Learning Algorithm Is In-context Learning? Investigations With Linear Models”, Et Al 2022
- “A Deep Learning and Digital Archaeology Approach for Mosquito Repellent Discovery”, Et Al 2022
- “Distilled DeepConsensus: Knowledge Distillation for Fast and Accurate DNA Sequence Correction”, Et Al 2022
- “Uni-Perceiver V2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks”, Et Al 2022
- “UniSumm: Unified Few-shot Summarization With Multi-Task Pre-Training and Prefix-Tuning”, Et Al 2022
- “OneFormer: One Transformer to Rule Universal Image Segmentation”, Et Al 2022
- “Massively Multilingual ASR on 70 Languages: Tokenization, Architecture, and Generalization Capabilities”, Et Al 2022
- “Characterizing Intrinsic Compositionality in Transformers With Tree Projections”, Et Al 2022
- “Fast DistilBERT on CPUs”, Et Al 2022
- “N-gram Is Back: Residual Learning of Neural Text Generation With N-gram Language Model”, Et Al 2022
- “Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models”, Et Al 2022
- “Small Character Models Match Large Word Models for Autocomplete Under Memory Constraints”, Et Al 2022
- “Transformers Implement First-Order Logic With Majority Quantifiers”, 2022
- “Improving Sample Quality of Diffusion Models Using Self-Attention Guidance”, Et Al 2022
- “Semantic Scene Descriptions As an Objective of Human Vision”, Et Al 2022
- “A Generalist Neural Algorithmic Learner”, Et Al 2022
- “SetFit: Efficient Few-Shot Learning Without Prompts”, Et Al 2022
- “Machine Reading, Fast and Slow: When Do Models ‘Understand’ Language?”, Et Al 2022
- “On the Effectiveness of Compact Biomedical Transformers (✱BioBERT)”, Et Al 2022
- “ASR2K: Speech Recognition for Around 2000 Languages without Audio”, Et Al 2022
- “Analyzing Transformers in Embedding Space”, Et Al 2022
- “MeloForm: Generating Melody With Musical Form Based on Expert Systems and Neural Networks”, Et Al 2022
- “CorpusBrain: Pre-train a Generative Retrieval Model for Knowledge-Intensive Language Tasks”, Et Al 2022
- “PatchDropout: Economizing Vision Transformers Using Patch Dropout”, Et Al 2022
- “Why Do Tree-based Models Still Outperform Deep Learning on Tabular Data?”, Et Al 2022
- “Re2G: Retrieve, Rerank, Generate”, Et Al 2022
- “Transformer Neural Processes: Uncertainty-Aware Meta Learning Via Sequence Modeling”, 2022
- “Neural Networks and the Chomsky Hierarchy”, Et Al 2022
- “TabPFN: Meta-Learning a Real-Time Tabular AutoML Method For Small Data”, Et Al 2022
- “Transfer Learning With Deep Tabular Models”, Et Al 2022
- “BertNet: Harvesting Knowledge Graphs from Pretrained Language Models”, Et Al 2022
- “ProGen2: Exploring the Boundaries of Protein Language Models”, Et Al 2022
- “LAVENDER: Unifying Video-Language Understanding As Masked Language Modeling”, Et Al 2022
- “RHO-LOSS: Prioritized Training on Points That Are Learnable, Worth Learning, and Not Yet Learnt”, Et Al 2022
- “SBERT Studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features”, 2022
- “Language Models Are General-Purpose Interfaces”, Et Al 2022
- “Reconstructing the Cascade of Language Processing in the Brain Using the Internal Computations of a Transformer-based Language Model”, Et Al 2022
- “Uni-Perceiver-MoE: Learning Sparse Generalist Models With Conditional MoEs”, Et Al 2022
- “A Neural Corpus Indexer for Document Retrieval”, Et Al 2022
- “XTC: Extreme Compression for Pre-trained Transformers Made Simple and Efficient”, Et Al 2022
- “Toward a Realistic Model of Speech Processing in the Brain With Self-supervised Learning”, Et Al 2022
- “Text2Human: Text-Driven Controllable Human Image Generation”, Et Al 2022
- “Anime Character Recognition Using Intermediate Features Aggregation”, Et Al 2022
- “On the Paradox of Learning to Reason from Data”, Et Al 2022
- “HTPS: HyperTree Proof Search for Neural Theorem Proving”, Et Al 2022
- “Housekeep: Tidying Virtual Households Using Commonsense Reasoning”, Et Al 2022
- “Tradformer: A Transformer Model of Traditional Music Transcriptions”, 2022
- “Continual Pre-Training Mitigates Forgetting in Language and Vision”, Et Al 2022
- “Few-Shot Parameter-Efficient Fine-Tuning Is Better and Cheaper Than In-Context Learning”, Et Al 2022
- “SymphonyNet: Symphony Generation With Permutation Invariant Language Model”, Et Al 2022
- “When Does Dough Become a Bagel? Analyzing the Remaining Mistakes on ImageNet”, Et Al 2022
- “A Challenging Benchmark of Anime Style Recognition”, Et Al 2022
- “Data Distributional Properties Drive Emergent Few-Shot Learning in Transformers”, Et Al 2022
- “Masked Siamese Networks for Label-Efficient Learning”, Et Al 2022
- “DualPrompt: Complementary Prompting for Rehearsal-free Continual Learning”, Et Al 2022
- “Language Models That Seek for Knowledge: Modular Search & Generation for Dialogue and Prompt Completion”, Et Al 2022
- “On Embeddings for Numerical Features in Tabular Deep Learning”, Et Al 2022
- “In-context Learning and Induction Heads”, Et Al 2022
- “Pretraining without Wordpieces: Learning Over a Vocabulary of Millions of Words”, Et Al 2022
- “The Dual Form of Neural Networks Revisited: Connecting Test Time Predictions to Training Patterns via Spotlights of Attention”, Et Al 2022
- “TACTiS: Transformer-Attentional Copulas for Time Series”, Et Al 2022
- “Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework”, Et Al 2022
- “AutoDistil: Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models”, Et Al 2022
- “FIGARO: Generating Symbolic Music With Fine-Grained Artistic Control”, Et Al 2022
- “Robust Contrastive Learning against Noisy Views”, Et Al 2022
- “HyperTransformer: Model Generation for Supervised and Semi-Supervised Few-Shot Learning”, Et Al 2022
- “A Mathematical Framework for Transformer Circuits”, Et Al 2021
- “XGLM: Few-shot Learning With Multilingual Language Models”, Et Al 2021
- “PFNs: Transformers Can Do Bayesian Inference”, Et Al 2021
- “AI Improvements in Chemical Calculations”, 2021
- “An Empirical Investigation of the Role of Pre-training in Lifelong Learning”, Et Al 2021
- “You Only Need One Model for Open-domain Question Answering”, Et Al 2021
- “Human Parity on CommonsenseQA: Augmenting Self-Attention With External Attention”, Et Al 2021
- “Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks”, Et Al 2021
- “Inducing Causal Structure for Interpretable Neural Networks (IIT)”, Et Al 2021
- “FQ-ViT: Fully Quantized Vision Transformer without Retraining”, Et Al 2021
- “Semi-Supervised Music Tagging Transformer”, Et Al 2021
- “LEMON: Scaling Up Vision-Language Pre-training for Image Captioning”, Et Al 2021
- “UNICORN: Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling”, Et Al 2021
- “Compositional Transformers for Scene Generation”, 2021
- “A Survey of Visual Transformers”, Et Al 2021
- “Improving Visual Quality of Image Synthesis by A Token-based Generator With Transformers”, Et Al 2021
- “STransGAN: An Empirical Study on Transformer in GANs”, Et Al 2021
- “The Efficiency Misnomer”, Et Al 2021
- “Lifelong Pretraining: Continually Adapting Language Models to Emerging Corpora”, Et Al 2021
- “The Dangers of Underclaiming: Reasons for Caution When Reporting How NLP Systems Fail”, 2021
- “Palette: Image-to-Image Diffusion Models”, Et Al 2021
- “Autoregressive Latent Video Prediction With High-Fidelity Image Generator”, Et Al 2021
- “Transformers Are Meta-Reinforcement Learners”, 2021
- “Skill Induction and Planning With Latent Language”, Et Al 2021
- “Text2Brain: Synthesis of Brain Activation Maps from Free-form Text Query”, Et Al 2021
- “Understanding and Overcoming the Challenges of Efficient Transformer Quantization”, Et Al 2021
- “KroneckerBERT: Learning Kronecker Decomposition for Pre-trained Language Models via Knowledge Distillation”, Et Al 2021
- “Block Pruning For Faster Transformers”, Et Al 2021
- “The Sensory Neuron As a Transformer: Permutation-Invariant Neural Networks for Reinforcement Learning”, 2021
- “DeepConsensus: Gap-Aware Sequence Transformers for Sequence Correction”, Et Al 2021
- “ImageBART: Bidirectional Context With Multinomial Diffusion for Autoregressive Image Synthesis”, Et Al 2021
- “Modeling Protein Using Large-scale Pretrain Language Model”, Et Al 2021
- “Billion-Scale Pretraining With Vision Transformers for Multi-Task Visual Representations”, Et Al 2021
- “EVA: An Open-Domain Chinese Dialogue System With Large-Scale Generative Pre-Training”, Et Al 2021
- “Internet-Augmented Dialogue Generation”, Et Al 2021
- “ViTGAN: Training GANs With Vision Transformers”, Et Al 2021
- “ARM-Net: Adaptive Relation Modeling Network for Structured Data”, Et Al 2021
- “SCARF: Self-Supervised Contrastive Learning Using Random Feature Corruption”, Et Al 2021
- “Charformer: Fast Character Transformers via Gradient-based Subword Tokenization”, Et Al 2021
- “Revisiting the Calibration of Modern Neural Networks”, Et Al 2021
- “Scaling Laws for Acoustic Models”, 2021
- “CoAtNet: Marrying Convolution and Attention for All Data Sizes”, Et Al 2021
- “Chasing Sparsity in Vision Transformers: An End-to-End Exploration”, Et Al 2021
- “Tabular Data: Deep Learning Is Not All You Need”, Shwartz-Ziv 2021
- “Self-Attention Between Datapoints: Going Beyond Individual Input-Output Pairs in Deep Learning”, Et Al 2021
- “SegFormer: Simple and Efficient Design for Semantic Segmentation With Transformers”, Et Al 2021
- “Exploring Transfer Learning Techniques for Named Entity Recognition in Noisy User-Generated Text”, 2021
- “Maximizing 3-D Parallelism in Distributed Training for Huge Neural Networks”, Et Al 2021
- “MathBERT: A Pre-Trained Model for Mathematical Formula Understanding”, Et Al 2021
- “MDETR—Modulated Detection for End-to-End Multi-Modal Understanding”, Et Al 2021
- “SimCSE: Simple Contrastive Learning of Sentence Embeddings”, Et Al 2021
- “Robust Open-Vocabulary Translation from Visual Text Representations”, Et Al 2021
- “Gradient-based Adversarial Attacks against Text Transformers”, Et Al 2021
- “Retrieval Augmentation Reduces Hallucination in Conversation”, Et Al 2021
- “Machine Translation Decoding beyond Beam Search”, Et Al 2021
- “ChinAI #137: Year 3 of ChinAI: Reflections on the Newsworthiness of Machine Translation”, 2021
- “SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network”, Et Al 2021
- “GPV-1: Towards General Purpose Vision Systems”, Et Al 2021
- “DeepViT: Towards Deeper Vision Transformer”, Et Al 2021
- “ConViT: Improving Vision Transformers With Soft Convolutional Inductive Biases”, D’Ascoli Et Al 2021
- “Get Your Vitamin C! Robust Fact Verification With Contrastive Evidence (VitaminC)”, Et Al 2021
- “Are NLP Models Really Able to Solve Simple Math Word Problems?”, Et Al 2021
- “Learning from Videos to Understand the World”, Et Al 2021
- “CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation”, Et Al 2021
- “TransGAN: Two Transformers Can Make One Strong GAN”, Et Al 2021
- “ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision”, Et Al 2021
- “Baller2vec: A Multi-Entity Transformer For Multi-Agent Spatiotemporal Modeling”, 2021
- “Video Transformer Network”, Et Al 2021
- “BENDR: Using Transformers and a Contrastive Self-supervised Learning Task to Learn from Massive Amounts of EEG Data”, Et Al 2021
- “Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet”, Et Al 2021
- “Bottleneck Transformers for Visual Recognition”, Et Al 2021
- “DAF:re: A Challenging, Crowd-Sourced, Large-Scale, Long-Tailed Dataset For Anime Character Recognition”, Et Al 2021
- “UPDeT: Universal Multi-agent Reinforcement Learning via Policy Decoupling With Transformers”, Et Al 2021
- “MSR-VTT: A Large Video Description Dataset for Bridging Video and Language”, Et Al 2021
- “XMC-GAN: Cross-Modal Contrastive Learning for Text-to-Image Generation”, Et Al 2021
- “Transformer Feed-Forward Layers Are Key-Value Memories”, Et Al 2020
- “Training Data-efficient Image Transformers & Distillation through Attention”, Et Al 2020
- “VQ-GAN: Taming Transformers for High-Resolution Image Synthesis”, Et Al 2020
- “Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences”, Et Al 2020
- “Object-based Attention for Spatio-temporal Reasoning: Outperforming Neuro-symbolic Models With Flexible Distributed Architectures”, Et Al 2020
- “TStarBot-X: An Open-Sourced and Comprehensive Study for Efficient League Training in StarCraft II Full Game”, Et Al 2020
- “A Recurrent Vision-and-Language BERT for Navigation”, Et Al 2020
- “A Primer in BERTology: What We Know about How BERT Works”, Et Al 2020
- “CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters”, Et Al 2020
- “TernaryBERT: Distillation-aware Ultra-low Bit BERT”, Et Al 2020
- “Weird AI Yankovic: Generating Parody Lyrics”, 2020
- “DeepSpeed: Extreme-scale Model Training for Everyone”, Et Al 2020
- “Hopfield Networks Is All You Need”, Et Al 2020
- “Modern Hopfield Networks and Attention for Immune Repertoire Classification”, Et Al 2020
- “DeepSinger: Singing Voice Synthesis With Data Mined From the Web”, Et Al 2020
- “Leveraging Passage Retrieval With Generative Models for Open Domain Question Answering”, 2020
- “Data Movement Is All You Need: A Case Study on Optimizing Transformers”, Et Al 2020
- “Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations”, Et Al 2020
- “Learning to Learn With Feedback and Local Plasticity”, Lindsey & Litwin-Kumar 2020
- “PipeDream-2BW: Memory-Efficient Pipeline-Parallel DNN Training”, Et Al 2020
- “Improving GAN Training With Probability Ratio Clipping and Sample Reweighting”, Et Al 2020
- “DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations”, Et Al 2020
- “DeBERTa: Decoding-enhanced BERT With Disentangled Attention”, Et Al 2020
- “DETR: End-to-End Object Detection With Transformers”, Et Al 2020
- “TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data”, Et Al 2020
- “VLN-BERT: Improving Vision-and-Language Navigation With Image-Text Pairs from the Web”, Et Al 2020
- “General Purpose Text Embeddings from Pre-trained Language Models for Scalable Inference”, Et Al 2020
- “Blender: A State-of-the-art Open Source Chatbot”, Et Al 2020
- “Recipes for Building an Open-domain Chatbot”, Et Al 2020
- “Rapformer: Conditional Rap Lyrics Generation With Denoising Autoencoders”, Et Al 2020
- “On the Effect of Dropping Layers of Pre-trained Transformer Models”, Et Al 2020
- “TAPAS: Weakly Supervised Table Parsing via Pre-training”, Et Al 2020
- “A Hundred Visions and Revisions”, 2020
- “Rethinking Parameter Counting in Deep Models: Effective Dimensionality Revisited”, Et Al 2020
- “AraBERT: Transformer-based Model for Arabic Language Understanding”, Et Al 2020
- “MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers”, Et Al 2020
- “Bayesian Deep Learning and a Probabilistic Perspective of Generalization”, 2020
- “Do We Need Zero Training Loss After Achieving Zero Training Error?”, Et Al 2020
- “Transformers As Soft Reasoners over Language”, Et Al 2020
- “Towards a Conversational Agent That Can Chat About…Anything”, 2020
- “Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference”, Schick & Schütze 2020
- “VIME: Extending the Success of Self-supervised and Semi-supervised Learning to Tabular Domain”, Et Al 2020
- “Mastering Complex Control in MOBA Games With Deep Reinforcement Learning”, Et Al 2019
- “PEGASUS: Pre-training With Extracted Gap-sentences for Abstractive Summarization”, Et Al 2019
- “Encoding Musical Style With Transformer Autoencoders”, Et Al 2019
- “Deep Double Descent: We Show That the Double Descent Phenomenon Occurs in CNNs, ResNets, and Transformers: Performance First Improves, Then Gets Worse, and Then Improves Again With Increasing Model Size, Data Size, or Training Time. This Effect Is Often Avoided through Careful Regularization. While This Behavior Appears to Be Fairly Universal, We Don’t yet Fully Understand Why It Happens, and View Further Study of This Phenomenon As an Important Research Direction.”, Et Al 2019
- “Detecting GAN Generated Errors”, Et Al 2019
- “Unsupervised Cross-lingual Representation Learning at Scale”, Et Al 2019
- “DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter”, Et Al 2019
- “Multiplicative Interactions and Where to Find Them”, Et Al 2019
- “TinyBERT: Distilling BERT for Natural Language Understanding”, Et Al 2019
- “Do NLP Models Know Numbers? Probing Numeracy in Embeddings”, Et Al 2019
- “PubMedQA: A Dataset for Biomedical Research Question Answering”, Et Al 2019
- “The Bottom-up Evolution of Representations in the Transformer: A Study With Machine Translation and Language Modeling Objectives”, Et Al 2019
- “Language Models As Knowledge Bases?”, Et Al 2019
- “Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks”, 2019
- “TabNet: Attentive Interpretable Tabular Learning”, 2019
- “StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding”, Et Al 2019
- “RoBERTa: A Robustly Optimized BERT Pretraining Approach”, Et Al 2019
- “Theoretical Limitations of Self-Attention in Neural Sequence Models”, 2019
- “Energy and Policy Considerations for Deep Learning in NLP”, Et Al 2019
- “Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned”, Et Al 2019
- “HellaSwag: Can a Machine Really Finish Your Sentence?”, Et Al 2019
- “UniLM: Unified Language Model Pre-training for Natural Language Understanding and Generation”, Et Al 2019
- “MASS: Masked Sequence to Sequence Pre-training for Language Generation”, Et Al 2019
- “Mask-Predict: Parallel Decoding of Conditional Masked Language Models”, Et Al 2019
- “Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes”, Et Al 2019
- “Adapter: Parameter-Efficient Transfer Learning for NLP”, Et Al 2019
- “BioBERT: a Pre-trained Biomedical Language Representation Model for Biomedical Text Mining”, Et Al 2019
- “Bayesian Layers: A Module for Neural Network Uncertainty”, Et Al 2018
- “Character-Level Language Modeling With Deeper Self-Attention”, Al-Rfou Et Al 2018
- “Self-Attention Generative Adversarial Networks”, Et Al 2018
- “Universal Sentence Encoder”, Et Al 2018
- “Self-Attention With Relative Position Representations”, Et Al 2018
- “Learning Longer-term Dependencies in RNNs With Auxiliary Losses”, Et Al 2018
- “Generating Structured Music through Self-Attention”, Et Al 2018
- “A Simple Neural Attentive Meta-Learner”, Et Al 2017
- “Attention Is All You Need”, Et Al 2017
- “RAM: Dynamic Computational Time for Visual Attention”, Et Al 2017
- “Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer”, 2016
- “QRNNs: Quasi-Recurrent Neural Networks”, Et Al 2016
- “Modeling Human Reading With Neural Attention”, 2016
- “Gaussian Error Linear Units (GELUs)”, 2016
- “Pointer Networks”, Et Al 2015
- “Neural Machine Translation by Jointly Learning to Align and Translate”, Et Al 2014
- “Huggingface: ‘Transformers’ Repo”, 2023
- “Transformers Are a Very Exciting Family of Machine Learning Architectures. Many Good Tutorials Exist (eg. [1, 2]) but in the Last Few Years, Transformers Have Mostly Become Simpler, so That It Is Now Much More Straightforward to Explain How Modern Architectures Work. This Post Is an Attempt to Explain Directly [in PyTorch] How Modern Transformers Work, and Why, without Some of the Historical Baggage.”
- Wikipedia
- Miscellaneous
- Link Bibliography
See Also
Links
“The Man of Your Dreams: For $300, Replika Sells an AI Companion Who Will Never Die, Argue, or Cheat—until His Algorithm Is Updated”, Singh-Kurtz 2023
“The Man of Your Dreams: For $300, Replika sells an AI companion who will never die, argue, or cheat—until his algorithm is updated”, 2023-03-10 ( ; backlinks; similar)
“Tag2Text: Guiding Vision-Language Model via Image Tagging”, Et Al 2023
“Tag2Text: Guiding Vision-Language Model via Image Tagging”, 2023-03-10 ( ; similar)
“Towards Democratizing Joint-Embedding Self-Supervised Learning”, Et Al 2023
“Towards Democratizing Joint-Embedding Self-Supervised Learning”, 2023-03-03 (similar)
“MUX-PLMs: Pre-training Language Models With Data Multiplexing”, Et Al 2023
“MUX-PLMs: Pre-training Language Models with Data Multiplexing”, 2023-02-24 ( ; similar; bibliography)
“Scaling Vision Transformers to 22 Billion Parameters”, Et Al 2023
“Scaling Vision Transformers to 22 Billion Parameters”, 2023-02-10 ( ; similar; bibliography)
“BMT: Binarized Neural Machine Translation”, Et Al 2023
“BMT: Binarized Neural Machine Translation”, 2023-02-09 ( ; similar; bibliography)
“V1T: Large-scale Mouse V1 Response Prediction Using a Vision Transformer”, Et Al 2023
“V1T: large-scale mouse V1 response prediction using a Vision Transformer”, 2023-02-06 ( ; similar)
“XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models”, Et Al 2023
“XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models”, 2023-01-25 ( ; similar)
“ClimaX: A Foundation Model for Weather and Climate”, Et Al 2023
“ClimaX: A foundation model for weather and climate”, 2023-01-24 ( ; similar)
“DataMUX: Data Multiplexing for Neural Networks”, Et Al 2023
“DataMUX: Data Multiplexing for Neural Networks”, 2023-01-13 ( ; backlinks; similar)
“Tracr: Compiled Transformers As a Laboratory for Interpretability”, Et Al 2023
“Tracr: Compiled Transformers as a Laboratory for Interpretability”, 2023-01-12 ( ; similar)
“Vision Transformers Are Good Mask Auto-Labelers”, Et Al 2023
“Vision Transformers Are Good Mask Auto-Labelers”, 2023-01-10 (similar; bibliography)
“Scaling Laws for Generative Mixed-Modal Language Models”, Et Al 2023
“Scaling Laws for Generative Mixed-Modal Language Models”, 2023-01-10 ( ; similar; bibliography)
“Why Do Nearest Neighbor Language Models Work?”, Et Al 2023
“Why do Nearest Neighbor Language Models Work?”, 2023-01-07 ( ; similar)
“POM: A Principal Odor Map Unifies Diverse Tasks in Human Olfactory Perception”, Et Al 2022
“POM: A Principal Odor Map Unifies Diverse Tasks in Human Olfactory Perception”, 2022-12-13 ( ; similar)
“What Do Vision Transformers Learn? A Visual Exploration”, Et Al 2022
“What do Vision Transformers Learn? A Visual Exploration”, 2022-12-13 ( ; similar; bibliography)
“MAGVIT: Masked Generative Video Transformer”, Et Al 2022
“MAGVIT: Masked Generative Video Transformer”, 2022-12-10 ( ; similar; bibliography)
“VindLU: A Recipe for Effective Video-and-Language Pretraining”, Et Al 2022
“VindLU: A Recipe for Effective Video-and-Language Pretraining”, 2022-12-09 ( ; similar; bibliography)
“Discovering Latent Knowledge in Language Models Without Supervision”, Et Al 2022
“Discovering Latent Knowledge in Language Models Without Supervision”, 2022-12-07 (similar)
“BARTSmiles: Generative Masked Language Models for Molecular Representations”, Et Al 2022
“BARTSmiles: Generative Masked Language Models for Molecular Representations”, 2022-11-29 ( ; similar)
“What Learning Algorithm Is In-context Learning? Investigations With Linear Models”, Et Al 2022
“What learning algorithm is in-context learning? Investigations with linear models”, 2022-11-28 ( ; similar)
“A Deep Learning and Digital Archaeology Approach for Mosquito Repellent Discovery”, Et Al 2022
“A deep learning and digital archaeology approach for mosquito repellent discovery”, 2022-11-21 ( ; similar)
“Distilled DeepConsensus: Knowledge Distillation for Fast and Accurate DNA Sequence Correction”, Et Al 2022
“Distilled DeepConsensus: Knowledge distillation for fast and accurate DNA sequence correction”, 2022-11-17 ( ; similar)
“Uni-Perceiver V2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks”, Et Al 2022
“Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks”, 2022-11-17 (similar; bibliography)
“UniSumm: Unified Few-shot Summarization With Multi-Task Pre-Training and Prefix-Tuning”, Et Al 2022
“UniSumm: Unified Few-shot Summarization with Multi-Task Pre-Training and Prefix-Tuning”, 2022-11-17 ( ; similar)
“OneFormer: One Transformer to Rule Universal Image Segmentation”, Et Al 2022
“OneFormer: One Transformer to Rule Universal Image Segmentation”, 2022-11-10 (similar; bibliography)
“Massively Multilingual ASR on 70 Languages: Tokenization, Architecture, and Generalization Capabilities”, Et Al 2022
“Massively Multilingual ASR on 70 Languages: Tokenization, Architecture, and Generalization Capabilities”, 2022-11-10 ( ; similar)
“Characterizing Intrinsic Compositionality in Transformers With Tree Projections”, Et Al 2022
“Characterizing Intrinsic Compositionality in Transformers with Tree Projections”, 2022-11-02 (similar)
“Fast DistilBERT on CPUs”, Et Al 2022
“Fast DistilBERT on CPUs”, 2022-10-27 ( ; similar)
“N-gram Is Back: Residual Learning of Neural Text Generation With N-gram Language Model”, Et Al 2022
“n-gram Is Back: Residual Learning of Neural Text Generation with n-gram Language Model”, 2022-10-26 ( ; similar)
“Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models”, Et Al 2022
“Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models”, 2022-10-25 (similar)
“Small Character Models Match Large Word Models for Autocomplete Under Memory Constraints”, Et Al 2022
“Small Character Models Match Large Word Models for Autocomplete Under Memory Constraints”, 2022-10-06 ( ; similar)
“Transformers Implement First-Order Logic With Majority Quantifiers”, 2022
“Transformers Implement First-Order Logic with Majority Quantifiers”, 2022-10-06 ( ; similar)
“Improving Sample Quality of Diffusion Models Using Self-Attention Guidance”, Et Al 2022
“Improving Sample Quality of Diffusion Models Using Self-Attention Guidance”, 2022-10-03 ( ; similar)
“Semantic Scene Descriptions As an Objective of Human Vision”, Et Al 2022
“Semantic scene descriptions as an objective of human vision”, 2022-09-23 ( ; similar; bibliography)
“A Generalist Neural Algorithmic Learner”, Et Al 2022
“A Generalist Neural Algorithmic Learner”, 2022-09-22 ( ; similar)
“SetFit: Efficient Few-Shot Learning Without Prompts”, Et Al 2022
“SetFit: Efficient Few-Shot Learning Without Prompts”, 2022-09-22 (similar; bibliography)
“Machine Reading, Fast and Slow: When Do Models ‘Understand’ Language?”, Et Al 2022
“Machine Reading, Fast and Slow: When Do Models ‘Understand’ Language?”, 2022-09-15 ( ; similar)
“On the Effectiveness of Compact Biomedical Transformers (✱BioBERT)”, Et Al 2022
“On the Effectiveness of Compact Biomedical Transformers (✱BioBERT)”, 2022-09-07 ( ; similar)
“ASR2K: Speech Recognition for Around 2000 Languages without Audio”, Et Al 2022
“ASR2K: Speech Recognition for Around 2000 Languages without Audio”, 2022-09-06 (similar)
“Analyzing Transformers in Embedding Space”, Et Al 2022
“Analyzing Transformers in Embedding Space”, 2022-09-06 (similar; bibliography)
“MeloForm: Generating Melody With Musical Form Based on Expert Systems and Neural Networks”, Et Al 2022
“MeloForm: Generating Melody with Musical Form based on Expert Systems and Neural Networks”, 2022-08-30 ( ; similar)
“CorpusBrain: Pre-train a Generative Retrieval Model for Knowledge-Intensive Language Tasks”, Et Al 2022
“CorpusBrain: Pre-train a Generative Retrieval Model for Knowledge-Intensive Language Tasks”, 2022-08-16 ( ; similar)
“PatchDropout: Economizing Vision Transformers Using Patch Dropout”, Et Al 2022
“PatchDropout: Economizing Vision Transformers Using Patch Dropout”, 2022-08-10 (similar)
“Why Do Tree-based Models Still Outperform Deep Learning on Tabular Data?”, Et Al 2022
“Why do tree-based models still outperform deep learning on tabular data?”, 2022-07-18 ( ; similar)
“Re2G: Retrieve, Rerank, Generate”, Et Al 2022
“Re2G: Retrieve, Rerank, Generate”, 2022-07-13 ( ; similar; bibliography)
“Transformer Neural Processes: Uncertainty-Aware Meta Learning Via Sequence Modeling”, 2022
“Transformer Neural Processes: Uncertainty-Aware Meta Learning Via Sequence Modeling”, 2022-07-09 ( ; similar)
“Neural Networks and the Chomsky Hierarchy”, Et Al 2022
“Neural Networks and the Chomsky Hierarchy”, 2022-07-05 ( ; similar)
“TabPFN: Meta-Learning a Real-Time Tabular AutoML Method For Small Data”, Et Al 2022
“TabPFN: Meta-Learning a Real-Time Tabular AutoML Method For Small Data”, 2022-07-05 ( ; backlinks; similar; bibliography)
“Transfer Learning With Deep Tabular Models”, Et Al 2022
“Transfer Learning with Deep Tabular Models”, 2022-06-30 ( ; similar)
“BertNet: Harvesting Knowledge Graphs from Pretrained Language Models”, Et Al 2022
“BertNet: Harvesting Knowledge Graphs from Pretrained Language Models”, 2022-06-28 (similar)
“ProGen2: Exploring the Boundaries of Protein Language Models”, Et Al 2022
“ProGen2: Exploring the Boundaries of Protein Language Models”, 2022-06-27 ( ; similar)
“LAVENDER: Unifying Video-Language Understanding As Masked Language Modeling”, Et Al 2022
“LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling”, 2022-06-14 ( ; similar; bibliography)
“RHO-LOSS: Prioritized Training on Points That Are Learnable, Worth Learning, and Not Yet Learnt”, Et Al 2022
“RHO-LOSS: Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt”, 2022-06-14 ( ; similar; bibliography)
“SBERT Studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features”, 2022
“SBERT studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features”, 2022-06-14 ( ; similar)
“Language Models Are General-Purpose Interfaces”, Et Al 2022
“Language Models are General-Purpose Interfaces”, 2022-06-13 (similar)
“Reconstructing the Cascade of Language Processing in the Brain Using the Internal Computations of a Transformer-based Language Model”, Et Al 2022
“Reconstructing the cascade of language processing in the brain using the internal computations of a transformer-based language model”, 2022-06-09 ( ; similar; bibliography)
“Uni-Perceiver-MoE: Learning Sparse Generalist Models With Conditional MoEs”, Et Al 2022
“Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs”, 2022-06-09 ( ; backlinks; similar)
“A Neural Corpus Indexer for Document Retrieval”, Et Al 2022
“A Neural Corpus Indexer for Document Retrieval”, 2022-06-06 ( ; similar)
“XTC: Extreme Compression for Pre-trained Transformers Made Simple and Efficient”, Et Al 2022
“XTC: Extreme Compression for Pre-trained Transformers Made Simple and Efficient”, 2022-06-04 ( ; similar; bibliography)
“Toward a Realistic Model of Speech Processing in the Brain With Self-supervised Learning”, Et Al 2022
“Toward a realistic model of speech processing in the brain with self-supervised learning”, 2022-06-03 ( ; backlinks; similar; bibliography)
“Text2Human: Text-Driven Controllable Human Image Generation”, Et Al 2022
“Text2Human: Text-Driven Controllable Human Image Generation”, 2022-05-31 ( ; similar)
“Anime Character Recognition Using Intermediate Features Aggregation”, Et Al 2022
“Anime Character Recognition using Intermediate Features Aggregation”, 2022-05-27 ( ; bibliography)
“On the Paradox of Learning to Reason from Data”, Et Al 2022
“On the Paradox of Learning to Reason from Data”, 2022-05-23 ( ; similar)
“HTPS: HyperTree Proof Search for Neural Theorem Proving”, Et Al 2022
“HTPS: HyperTree Proof Search for Neural Theorem Proving”, 2022-05-23 ( ; similar)
“Housekeep: Tidying Virtual Households Using Commonsense Reasoning”, Et Al 2022
“Housekeep: Tidying Virtual Households using Commonsense Reasoning”, 2022-05-22 ( ; backlinks; similar)
“Tradformer: A Transformer Model of Traditional Music Transcriptions”, 2022
“Tradformer: A Transformer Model of Traditional Music Transcriptions”, 2022-05-20 ( ; similar)
“Continual Pre-Training Mitigates Forgetting in Language and Vision”, Et Al 2022
“Continual Pre-Training Mitigates Forgetting in Language and Vision”, 2022-05-19 ( ; backlinks; similar)
“Few-Shot Parameter-Efficient Fine-Tuning Is Better and Cheaper Than In-Context Learning”, Et Al 2022
“Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning”, 2022-05-11 (backlinks; similar)
“SymphonyNet: Symphony Generation With Permutation Invariant Language Model”, Et Al 2022
“SymphonyNet: Symphony Generation with Permutation Invariant Language Model”, 2022-05-10 ( ; similar)
“When Does Dough Become a Bagel? Analyzing the Remaining Mistakes on ImageNet”, Et Al 2022
“When does dough become a bagel? Analyzing the remaining mistakes on ImageNet”, 2022-05-09 ( ; similar; bibliography)
“A Challenging Benchmark of Anime Style Recognition”, Et Al 2022
“A Challenging Benchmark of Anime Style Recognition”, 2022-04-29 ( ; similar)
“Data Distributional Properties Drive Emergent Few-Shot Learning in Transformers”, Et Al 2022
“Data Distributional Properties Drive Emergent Few-Shot Learning in Transformers”, 2022-04-22 ( ; similar)
“Masked Siamese Networks for Label-Efficient Learning”, Et Al 2022
“Masked Siamese Networks for Label-Efficient Learning”, 2022-04-14 (similar)
“DualPrompt: Complementary Prompting for Rehearsal-free Continual Learning”, Et Al 2022
“DualPrompt: Complementary Prompting for Rehearsal-free Continual Learning”, 2022-04-10 ( ; similar)
“Language Models That Seek for Knowledge: Modular Search & Generation for Dialogue and Prompt Completion”, Et Al 2022
“Language Models that Seek for Knowledge: Modular Search & Generation for Dialogue and Prompt Completion”, 2022-03-24 ( ; similar; bibliography)
“On Embeddings for Numerical Features in Tabular Deep Learning”, Et Al 2022
“On Embeddings for Numerical Features in Tabular Deep Learning”, 2022-03-10 ( ; similar)
“In-context Learning and Induction Heads”, Et Al 2022
“In-context Learning and Induction Heads”, 2022-03-08
“Pretraining without Wordpieces: Learning Over a Vocabulary of Millions of Words”, Et Al 2022
“Pretraining without Wordpieces: Learning Over a Vocabulary of Millions of Words”, 2022-02-24 ( ; similar)
“The Dual Form of Neural Networks Revisited: Connecting Test Time Predictions to Training Patterns via Spotlights of Attention”, Et Al 2022
“The Dual Form of Neural Networks Revisited: Connecting Test Time Predictions to Training Patterns via Spotlights of Attention”, 2022-02-11 ( ; similar)
“TACTiS: Transformer-Attentional Copulas for Time Series”, Et Al 2022
“TACTiS: Transformer-Attentional Copulas for Time Series”, 2022-02-07 ( ; similar)
“Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework”, Et Al 2022
“Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework”, 2022-02-07 ( ; similar; bibliography)
“AutoDistil: Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models”, Et Al 2022
“AutoDistil: Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models”, 2022-01-29 ( ; similar)
“FIGARO: Generating Symbolic Music With Fine-Grained Artistic Control”, Et Al 2022
“FIGARO: Generating Symbolic Music with Fine-Grained Artistic Control”, 2022-01-26 ( ; similar)
“Robust Contrastive Learning against Noisy Views”, Et Al 2022
“Robust Contrastive Learning against Noisy Views”, 2022-01-12 (similar)
“HyperTransformer: Model Generation for Supervised and Semi-Supervised Few-Shot Learning”, Et Al 2022
“HyperTransformer: Model Generation for Supervised and Semi-Supervised Few-Shot Learning”, 2022-01-11 ( ; similar)
“A Mathematical Framework for Transformer Circuits”, Et Al 2021
“A Mathematical Framework for Transformer Circuits”, 2021-12-22
“XGLM: Few-shot Learning With Multilingual Language Models”, Et Al 2021
“XGLM: Few-shot Learning with Multilingual Language Models”, 2021-12-20 ( ; similar)
“PFNs: Transformers Can Do Bayesian Inference”, Et Al 2021
“PFNs: Transformers Can Do Bayesian Inference”, 2021-12-20 ( ; backlinks; similar; bibliography)
“AI Improvements in Chemical Calculations”, 2021
“AI Improvements in Chemical Calculations”, 2021-12-16 ( ; backlinks; similar)
“An Empirical Investigation of the Role of Pre-training in Lifelong Learning”, Et Al 2021
“An Empirical Investigation of the Role of Pre-training in Lifelong Learning”, 2021-12-16 ( ; backlinks; similar)
“You Only Need One Model for Open-domain Question Answering”, Et Al 2021
“You Only Need One Model for Open-domain Question Answering”, 2021-12-14 ( ; similar)
“Human Parity on CommonsenseQA: Augmenting Self-Attention With External Attention”, Et Al 2021
“Human Parity on CommonsenseQA: Augmenting Self-Attention with External Attention”, 2021-12-06 ( ; similar)
“Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks”, Et Al 2021
“Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks”, 2021-12-02 (backlinks; similar)
“Inducing Causal Structure for Interpretable Neural Networks (IIT)”, Et Al 2021
“Inducing Causal Structure for Interpretable Neural Networks (IIT)”, 2021-12-01 ( ; similar)
“FQ-ViT: Fully Quantized Vision Transformer without Retraining”, Et Al 2021
“FQ-ViT: Fully Quantized Vision Transformer without Retraining”, 2021-11-27 ( ; similar; bibliography)
“Semi-Supervised Music Tagging Transformer”, Et Al 2021
“Semi-Supervised Music Tagging Transformer”, 2021-11-26 ( ; similar)
“LEMON: Scaling Up Vision-Language Pre-training for Image Captioning”, Et Al 2021
“LEMON: Scaling Up Vision-Language Pre-training for Image Captioning”, 2021-11-24 ( ; similar; bibliography)
“UNICORN: Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling”, Et Al 2021
“UNICORN: Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling”, 2021-11-23 (similar)
“Compositional Transformers for Scene Generation”, 2021
“Compositional Transformers for Scene Generation”, 2021-11-17 ( ; similar)
“A Survey of Visual Transformers”, Et Al 2021
“A Survey of Visual Transformers”, 2021-11-11 (similar; bibliography)
“Improving Visual Quality of Image Synthesis by A Token-based Generator With Transformers”, Et Al 2021
“Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers”, 2021-11-05 ( ; similar)
“STransGAN: An Empirical Study on Transformer in GANs”, Et Al 2021
“STransGAN: An Empirical Study on Transformer in GANs”, 2021-10-25 ( ; similar)
“The Efficiency Misnomer”, Et Al 2021
“The Efficiency Misnomer”, 2021-10-25 ( ; similar)
“Lifelong Pretraining: Continually Adapting Language Models to Emerging Corpora”, Et Al 2021
“Lifelong Pretraining: Continually Adapting Language Models to Emerging Corpora”, 2021-10-16 ( ; similar)
“The Dangers of Underclaiming: Reasons for Caution When Reporting How NLP Systems Fail”, 2021
“The Dangers of Underclaiming: Reasons for Caution When Reporting How NLP Systems Fail”, 2021-10-15 ( ; similar)
“Palette: Image-to-Image Diffusion Models”, Et Al 2021
“Palette: Image-to-Image Diffusion Models”, 2021-10-06 ( ; similar)
“Autoregressive Latent Video Prediction With High-Fidelity Image Generator”, Et Al 2021
“Autoregressive Latent Video Prediction with High-Fidelity Image Generator”, 2021-10-05 ( ; similar)
“Transformers Are Meta-Reinforcement Learners”, 2021
“Transformers are Meta-Reinforcement Learners”, 2021-10-05 ( ; similar)
“Text2Brain: Synthesis of Brain Activation Maps from Free-form Text Query”, Et Al 2021
“Text2Brain: Synthesis of Brain Activation Maps from Free-form Text Query”, 2021-09-28 ( ; similar)
“Understanding and Overcoming the Challenges of Efficient Transformer Quantization”, Et Al 2021
“Understanding and Overcoming the Challenges of Efficient Transformer Quantization”, 2021-09-27 ( ; similar; bibliography)
“KroneckerBERT: Learning Kronecker Decomposition for Pre-trained Language Models via Knowledge Distillation”, Et Al 2021
“KroneckerBERT: Learning Kronecker Decomposition for Pre-trained Language Models via Knowledge Distillation”, 2021-09-13 ( ; similar; bibliography)
“Block Pruning For Faster Transformers”, Et Al 2021
“Block Pruning For Faster Transformers”, 2021-09-10 ( ; similar)
“The Sensory Neuron As a Transformer: Permutation-Invariant Neural Networks for Reinforcement Learning”, 2021
“The Sensory Neuron as a Transformer: Permutation-Invariant Neural Networks for Reinforcement Learning”, 2021-09-07 ( ; similar)
“DeepConsensus: Gap-Aware Sequence Transformers for Sequence Correction”, Et Al 2021
“DeepConsensus: Gap-Aware Sequence Transformers for Sequence Correction”, 2021-08-31 ( ; backlinks; similar)
“ImageBART: Bidirectional Context With Multinomial Diffusion for Autoregressive Image Synthesis”, Et Al 2021
“ImageBART: Bidirectional Context with Multinomial Diffusion for Autoregressive Image Synthesis”, 2021-08-19 ( ; similar)
“Modeling Protein Using Large-scale Pretrain Language Model”, Et Al 2021
“Modeling Protein Using Large-scale Pretrain Language Model”, 2021-08-17 ( ; similar)
“Billion-Scale Pretraining With Vision Transformers for Multi-Task Visual Representations”, Et Al 2021
“Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations”, 2021-08-12 ( ; similar)
“EVA: An Open-Domain Chinese Dialogue System With Large-Scale Generative Pre-Training”, Et Al 2021
“EVA: An Open-Domain Chinese Dialogue System with Large-Scale Generative Pre-Training”, 2021-08-03 ( ; similar)
“Internet-Augmented Dialogue Generation”, Et Al 2021
“Internet-Augmented Dialogue Generation”, 2021-07-15 ( ; similar; bibliography)
“ViTGAN: Training GANs With Vision Transformers”, Et Al 2021
“ViTGAN: Training GANs with Vision Transformers”, 2021-07-09 ( ; similar; bibliography)
“ARM-Net: Adaptive Relation Modeling Network for Structured Data”, Et Al 2021
“ARM-Net: Adaptive Relation Modeling Network for Structured Data”, 2021-07-05 ( ; similar)
“SCARF: Self-Supervised Contrastive Learning Using Random Feature Corruption”, Et Al 2021
“SCARF: Self-Supervised Contrastive Learning using Random Feature Corruption”, 2021-06-29 ( ; similar)
“Charformer: Fast Character Transformers via Gradient-based Subword Tokenization”, Et Al 2021
“Charformer: Fast Character Transformers via Gradient-based Subword Tokenization”, 2021-06-23 ( ; similar; bibliography)
“Revisiting the Calibration of Modern Neural Networks”, Et Al 2021
“Revisiting the Calibration of Modern Neural Networks”, 2021-06-15 ( ; similar)
“Scaling Laws for Acoustic Models”, 2021
“Scaling Laws for Acoustic Models”, 2021-06-11 ( ; similar; bibliography)
“CoAtNet: Marrying Convolution and Attention for All Data Sizes”, Et Al 2021
“CoAtNet: Marrying Convolution and Attention for All Data Sizes”, 2021-06-09 ( ; similar; bibliography)
“Chasing Sparsity in Vision Transformers: An End-to-End Exploration”, Et Al 2021
“Chasing Sparsity in Vision Transformers: An End-to-End Exploration”, 2021-06-08 ( ; similar; bibliography)
“Tabular Data: Deep Learning Is Not All You Need”, Shwartz-Ziv 2021
“Tabular Data: Deep Learning is Not All You Need”, 2021-06-06 ( ; similar)
“Self-Attention Between Datapoints: Going Beyond Individual Input-Output Pairs in Deep Learning”, Et Al 2021
“Self-Attention Between Datapoints: Going Beyond Individual Input-Output Pairs in Deep Learning”, 2021-06-04 ( ; similar)
“SegFormer: Simple and Efficient Design for Semantic Segmentation With Transformers”, Et Al 2021
“SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers”, 2021-05-31 (backlinks; similar; bibliography)
“Exploring Transfer Learning Techniques for Named Entity Recognition in Noisy User-Generated Text”, 2021
“Exploring Transfer Learning techniques for Named Entity Recognition in Noisy User-Generated Text”, 2021-05-31 ( ; similar)
“Maximizing 3-D Parallelism in Distributed Training for Huge Neural Networks”, Et Al 2021
“Maximizing 3-D Parallelism in Distributed Training for Huge Neural Networks”, 2021-05-30 ( ; similar)
“MathBERT: A Pre-Trained Model for Mathematical Formula Understanding”, Et Al 2021
“MathBERT: A Pre-Trained Model for Mathematical Formula Understanding”, 2021-05-02 ( ; similar)
“MDETR—Modulated Detection for End-to-End Multi-Modal Understanding”, Et Al 2021
“MDETR—Modulated Detection for End-to-End Multi-Modal Understanding”, 2021-04-26 (similar)
“SimCSE: Simple Contrastive Learning of Sentence Embeddings”, Et Al 2021
“SimCSE: Simple Contrastive Learning of Sentence Embeddings”, 2021-04-18 ( ; backlinks; similar)
“Robust Open-Vocabulary Translation from Visual Text Representations”, Et Al 2021
“Robust Open-Vocabulary Translation from Visual Text Representations”, 2021-04-16 ( ; backlinks; similar)
“Gradient-based Adversarial Attacks against Text Transformers”, Et Al 2021
“Gradient-based Adversarial Attacks against Text Transformers”, 2021-04-15 ( ; similar)
“Retrieval Augmentation Reduces Hallucination in Conversation”, Et Al 2021
“Retrieval Augmentation Reduces Hallucination in Conversation”, 2021-04-15 ( ; similar; bibliography)
“Machine Translation Decoding beyond Beam Search”, Et Al 2021
“Machine Translation Decoding beyond Beam Search”, 2021-04-12 ( ; similar)
“ChinAI #137: Year 3 of ChinAI: Reflections on the Newsworthiness of Machine Translation”, 2021
“ChinAI #137: Year 3 of ChinAI: Reflections on the newsworthiness of machine translation”, 2021-04-05 ( ; similar; bibliography)
“SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network”, Et Al 2021
“SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network”, 2021-04-05 ( ; similar)
“GPV-1: Towards General Purpose Vision Systems”, Et Al 2021
“GPV-1: Towards General Purpose Vision Systems”, 2021-04-01 (similar)
“DeepViT: Towards Deeper Vision Transformer”, Et Al 2021
“DeepViT: Towards Deeper Vision Transformer”, 2021-03-22 (similar; bibliography)
“ConViT: Improving Vision Transformers With Soft Convolutional Inductive Biases”, D’Ascoli Et Al 2021
“ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases”, 2021-03-19 (similar; bibliography)
“Get Your Vitamin C! Robust Fact Verification With Contrastive Evidence (VitaminC)”, Et Al 2021
“Get Your Vitamin C! Robust Fact Verification with Contrastive Evidence (VitaminC)”, 2021-03-15 ( ; backlinks; similar)
“Are NLP Models Really Able to Solve Simple Math Word Problems?”, Et Al 2021
“Are NLP Models really able to Solve Simple Math Word Problems?”, 2021-03-12 ( ; similar)
“Learning from Videos to Understand the World”, Et Al 2021
“Learning from videos to understand the world”, 2021-03-12 ( ; similar; bibliography)
“CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation”, Et Al 2021
“CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation”, 2021-03-11 ( ; similar)
“TransGAN: Two Transformers Can Make One Strong GAN”, Et Al 2021
“TransGAN: Two Transformers Can Make One Strong GAN”, 2021-02-14 ( ; similar; bibliography)
“ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision”, Et Al 2021
“ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision”, 2021-02-05 (backlinks; similar; bibliography)
“Baller2vec: A Multi-Entity Transformer For Multi-Agent Spatiotemporal Modeling”, 2021
“baller2vec: A Multi-Entity Transformer For Multi-Agent Spatiotemporal Modeling”, 2021-02-05 ( ; backlinks; similar)
“Video Transformer Network”, Et Al 2021
“Video Transformer Network”, 2021-02-01 ( ; backlinks; similar; bibliography)
“BENDR: Using Transformers and a Contrastive Self-supervised Learning Task to Learn from Massive Amounts of EEG Data”, Et Al 2021
“BENDR: using transformers and a contrastive self-supervised learning task to learn from massive amounts of EEG data”, 2021-01-28 ( ; similar)
“Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet”, Et Al 2021
“Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet”, 2021-01-28 (similar; bibliography)
“Bottleneck Transformers for Visual Recognition”, Et Al 2021
“Bottleneck Transformers for Visual Recognition”, 2021-01-27 (similar; bibliography)
“DAF:re: A Challenging, Crowd-Sourced, Large-Scale, Long-Tailed Dataset For Anime Character Recognition”, Et Al 2021
“DAF:re: A Challenging, Crowd-Sourced, Large-Scale, Long-Tailed Dataset For Anime Character Recognition”, 2021-01-21 ( ; backlinks; bibliography)
“UPDeT: Universal Multi-agent Reinforcement Learning via Policy Decoupling With Transformers”, Et Al 2021
“UPDeT: Universal Multi-agent Reinforcement Learning via Policy Decoupling with Transformers”, 2021-01-20 ( ; similar)
“MSR-VTT: A Large Video Description Dataset for Bridging Video and Language”, Et Al 2021
“MSR-VTT: A Large Video Description Dataset for Bridging Video and Language”, 2021-01-17 ( ; similar)
“XMC-GAN: Cross-Modal Contrastive Learning for Text-to-Image Generation”, Et Al 2021
“XMC-GAN: Cross-Modal Contrastive Learning for Text-to-Image Generation”, 2021-01-12 ( ; similar; bibliography)
“Transformer Feed-Forward Layers Are Key-Value Memories”, Et Al 2020
“Transformer Feed-Forward Layers Are Key-Value Memories”, 2020-12-29 (backlinks; similar)
“Training Data-efficient Image Transformers & Distillation through Attention”, Et Al 2020
“Training data-efficient image transformers & distillation through attention”, 2020-12-23 ( ; similar; bibliography)
“VQ-GAN: Taming Transformers for High-Resolution Image Synthesis”, Et Al 2020
“VQ-GAN: Taming Transformers for High-Resolution Image Synthesis”, 2020-12-17 ( ; backlinks; similar)
“Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences”, Et Al 2020
“Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences”, 2020-12-15 ( ; similar)
“Object-based Attention for Spatio-temporal Reasoning: Outperforming Neuro-symbolic Models With Flexible Distributed Architectures”, Et Al 2020
“Object-based attention for spatio-temporal reasoning: Outperforming neuro-symbolic models with flexible distributed architectures”, 2020-12-15 ( ; similar; bibliography)
“TStarBot-X: An Open-Sourced and Comprehensive Study for Efficient League Training in StarCraft II Full Game”, Et Al 2020
“TStarBot-X: An Open-Sourced and Comprehensive Study for Efficient League Training in StarCraft II Full Game”, 2020-11-27 ( ; similar; bibliography)
“A Recurrent Vision-and-Language BERT for Navigation”, Et Al 2020
“A Recurrent Vision-and-Language BERT for Navigation”, 2020-11-26 ( ; similar)
“A Primer in BERTology: What We Know about How BERT Works”, Et Al 2020
“A Primer in BERTology: What we know about how BERT works”, 2020-11-09 ( ; similar)
“CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters”, Et Al 2020
“CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters”, 2020-10-20 ( ; backlinks; similar)
“TernaryBERT: Distillation-aware Ultra-low Bit BERT”, Et Al 2020
“TernaryBERT: Distillation-aware Ultra-low Bit BERT”, 2020-09-27 ( ; similar)
“Weird AI Yankovic: Generating Parody Lyrics”, 2020
“Weird AI Yankovic: Generating Parody Lyrics”, 2020-09-25 ( ; similar)
“DeepSpeed: Extreme-scale Model Training for Everyone”, Et Al 2020
“DeepSpeed: Extreme-scale model training for everyone”, 2020-09-10 ( ; backlinks; similar; bibliography)
“Hopfield Networks Is All You Need”, Et Al 2020
“Hopfield Networks is All You Need”, 2020-07-16 ( ; backlinks; similar; bibliography)
“Modern Hopfield Networks and Attention for Immune Repertoire Classification”, Et Al 2020
“Modern Hopfield Networks and Attention for Immune Repertoire Classification”, 2020-07-16 ( ; backlinks; similar)
“DeepSinger: Singing Voice Synthesis With Data Mined From the Web”, Et Al 2020
“DeepSinger: Singing Voice Synthesis with Data Mined From the Web”, 2020-07-09 ( ; similar)
“Leveraging Passage Retrieval With Generative Models for Open Domain Question Answering”, 2020
“Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering”, 2020-07-02 ( ; similar)
“Data Movement Is All You Need: A Case Study on Optimizing Transformers”, Et Al 2020
“Data Movement Is All You Need: A Case Study on Optimizing Transformers”, 2020-06-30 ( ; similar)
“Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations”, Et Al 2020
“wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations”, 2020-06-20 ( ; similar)
“Learning to Learn With Feedback and Local Plasticity”, Lindsey & Litwin-Kumar 2020
“Learning to Learn with Feedback and Local Plasticity”, 2020-06-16 ( ; backlinks; similar)
“PipeDream-2BW: Memory-Efficient Pipeline-Parallel DNN Training”, Et Al 2020
“PipeDream-2BW: Memory-Efficient Pipeline-Parallel DNN Training”, 2020-06-16 ( ; similar)
“Improving GAN Training With Probability Ratio Clipping and Sample Reweighting”, Et Al 2020
“Improving GAN Training with Probability Ratio Clipping and Sample Reweighting”, 2020-06-12 ( ; backlinks; similar)
“DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations”, Et Al 2020
“DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations”, 2020-06-05 (backlinks; similar)
“DeBERTa: Decoding-enhanced BERT With Disentangled Attention”, Et Al 2020
“DeBERTa: Decoding-enhanced BERT with Disentangled Attention”, 2020-06-05 (similar; bibliography)
“DETR: End-to-End Object Detection With Transformers”, Et Al 2020
“DETR: End-to-End Object Detection with Transformers”, 2020-05-26 (similar; bibliography)
“TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data”, Et Al 2020
“TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data”, 2020-05-17 ( ; similar)
“VLN-BERT: Improving Vision-and-Language Navigation With Image-Text Pairs from the Web”, Et Al 2020
“VLN-BERT: Improving Vision-and-Language Navigation with Image-Text Pairs from the Web”, 2020-04-30 ( ; similar)
“General Purpose Text Embeddings from Pre-trained Language Models for Scalable Inference”, Et Al 2020
“General Purpose Text Embeddings from Pre-trained Language Models for Scalable Inference”, 2020-04-29 ( ; similar)
“Blender: A State-of-the-art Open Source Chatbot”, Et Al 2020
“Blender: A state-of-the-art open source chatbot”, 2020-04-29 ( ; similar; bibliography)
“Recipes for Building an Open-domain Chatbot”, Et Al 2020
“Recipes for building an open-domain chatbot”, 2020-04-28 (similar)
“Rapformer: Conditional Rap Lyrics Generation With Denoising Autoencoders”, Et Al 2020
“Rapformer: Conditional Rap Lyrics Generation with Denoising Autoencoders”, 2020-04-08 ( ; similar; bibliography)
“On the Effect of Dropping Layers of Pre-trained Transformer Models”, Et Al 2020
“On the Effect of Dropping Layers of Pre-trained Transformer Models”, 2020-04-08 ( ; similar; bibliography)
“TAPAS: Weakly Supervised Table Parsing via Pre-training”, Et Al 2020
“TAPAS: Weakly Supervised Table Parsing via Pre-training”, 2020-04-05 ( ; similar)
“A Hundred Visions and Revisions”, 2020
“A Hundred Visions and Revisions”, 2020-03-11 ( ; backlinks; similar)
“Rethinking Parameter Counting in Deep Models: Effective Dimensionality Revisited”, Et Al 2020
“Rethinking Parameter Counting in Deep Models: Effective Dimensionality Revisited”, 2020-03-04 ( ; backlinks; similar)
“AraBERT: Transformer-based Model for Arabic Language Understanding”, Et Al 2020
“AraBERT: Transformer-based Model for Arabic Language Understanding”, 2020-02-28 (backlinks; similar)
“MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers”, Et Al 2020
“MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers”, 2020-02-25 ( ; similar; bibliography)
“Bayesian Deep Learning and a Probabilistic Perspective of Generalization”, 2020
“Bayesian Deep Learning and a Probabilistic Perspective of Generalization”, 2020-02-20 ( ; backlinks; similar)
“Do We Need Zero Training Loss After Achieving Zero Training Error?”, Et Al 2020
“Do We Need Zero Training Loss After Achieving Zero Training Error?”, 2020-02-20 ( ; backlinks; similar)
“Transformers As Soft Reasoners over Language”, Et Al 2020
“Transformers as Soft Reasoners over Language”, 2020-02-14 ( ; backlinks; similar)
“Towards a Conversational Agent That Can Chat About…Anything”, 2020
“Towards a Conversational Agent that Can Chat About…Anything”, 2020-01-28 ( ; similar; bibliography)
“Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference”, Schick & Schütze 2020
“Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference”, 2020-01-21 (backlinks; similar)
“VIME: Extending the Success of Self-supervised and Semi-supervised Learning to Tabular Domain”, Et Al 2020
“VIME: Extending the Success of Self-supervised and Semi-supervised Learning to Tabular Domain”, 2020 ( ; similar)
“Mastering Complex Control in MOBA Games With Deep Reinforcement Learning”, Et Al 2019
“Mastering Complex Control in MOBA Games with Deep Reinforcement Learning”, 2019-12-20 ( ; similar)
“PEGASUS: Pre-training With Extracted Gap-sentences for Abstractive Summarization”, Et Al 2019
“PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization”, 2019-12-18 (similar)
“Encoding Musical Style With Transformer Autoencoders”, Et Al 2019
“Encoding Musical Style with Transformer Autoencoders”, 2019-12-10 ( ; backlinks; similar)
“Deep Double Descent: We Show That the Double Descent Phenomenon Occurs in CNNs, ResNets, and Transformers: Performance First Improves, Then Gets Worse, and Then Improves Again With Increasing Model Size, Data Size, or Training Time. This Effect Is Often Avoided through Careful Regularization. While This Behavior Appears to Be Fairly Universal, We Don’t yet Fully Understand Why It Happens, and View Further Study of This Phenomenon As an Important Research Direction.”, Et Al 2019
“Deep Double Descent: We show that the double descent phenomenon occurs in CNNs, ResNets, and transformers: performance first improves, then gets worse, and then improves again with increasing model size, data size, or training time. This effect is often avoided through careful regularization. While this behavior appears to be fairly universal, we don’t yet fully understand why it happens, and view further study of this phenomenon as an important research direction.”, 2019-12-05 ( ; backlinks; similar; bibliography)
“Detecting GAN Generated Errors”, Et Al 2019
“Detecting GAN generated errors”, 2019-12-02 ( ; backlinks; similar)
“Unsupervised Cross-lingual Representation Learning at Scale”, Et Al 2019
“Unsupervised Cross-lingual Representation Learning at Scale”, 2019-11-05 ( ; similar; bibliography)
“DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter”, Et Al 2019
“DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter”, 2019-10-02 ( ; backlinks; similar)
“Multiplicative Interactions and Where to Find Them”, Et Al 2019
“Multiplicative Interactions and Where to Find Them”, 2019-09-25 ( ; similar)
“TinyBERT: Distilling BERT for Natural Language Understanding”, Et Al 2019
“TinyBERT: Distilling BERT for Natural Language Understanding”, 2019-09-23 ( ; backlinks; similar; bibliography)
“Do NLP Models Know Numbers? Probing Numeracy in Embeddings”, Et Al 2019
“Do NLP Models Know Numbers? Probing Numeracy in Embeddings”, 2019-09-17 ( ; backlinks; similar)
“PubMedQA: A Dataset for Biomedical Research Question Answering”, Et Al 2019
“PubMedQA: A Dataset for Biomedical Research Question Answering”, 2019-09-13 ( ; similar)
“The Bottom-up Evolution of Representations in the Transformer: A Study With Machine Translation and Language Modeling Objectives”, Et Al 2019
“The Bottom-up Evolution of Representations in the Transformer: A Study with Machine Translation and Language Modeling Objectives”, 2019-09-03 (backlinks; similar)
“Language Models As Knowledge Bases?”, Et Al 2019
“Language Models as Knowledge Bases?”, 2019-09-03 ( ; similar)
“Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks”, 2019
“Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks”, 2019-08-27 (backlinks; similar)
“TabNet: Attentive Interpretable Tabular Learning”, 2019
“TabNet: Attentive Interpretable Tabular Learning”, 2019-08-20 ( ; similar)
“StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding”, Et Al 2019
“StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding”, 2019-08-13 (similar; bibliography)
“RoBERTa: A Robustly Optimized BERT Pretraining Approach”, Et Al 2019
“RoBERTa: A Robustly Optimized BERT Pretraining Approach”, 2019-07-26 ( ; similar; bibliography)
“Theoretical Limitations of Self-Attention in Neural Sequence Models”, 2019
“Theoretical Limitations of Self-Attention in Neural Sequence Models”, 2019-06-16 (backlinks; similar)
“Energy and Policy Considerations for Deep Learning in NLP”, Et Al 2019
“Energy and Policy Considerations for Deep Learning in NLP”, 2019-06-05 ( ; backlinks; similar)
“Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned”, Et Al 2019
“Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned”, 2019-05-23 ( ; similar)
“HellaSwag: Can a Machine Really Finish Your Sentence?”, Et Al 2019
“HellaSwag: Can a Machine Really Finish Your Sentence?”, 2019-05-19 (backlinks; similar)
“UniLM: Unified Language Model Pre-training for Natural Language Understanding and Generation”, Et Al 2019
“UniLM: Unified Language Model Pre-training for Natural Language Understanding and Generation”, 2019-05-08 ( ; backlinks; similar; bibliography)
“MASS: Masked Sequence to Sequence Pre-training for Language Generation”, Et Al 2019
“MASS: Masked Sequence to Sequence Pre-training for Language Generation”, 2019-05-07 (similar)
“Mask-Predict: Parallel Decoding of Conditional Masked Language Models”, Et Al 2019
“Mask-Predict: Parallel Decoding of Conditional Masked Language Models”, 2019-04-19 ( ; similar)
“Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes”, Et Al 2019
“Large Batch Optimization for Deep Learning: Training BERT in 76 minutes”, 2019-04-01 ( ; similar; bibliography)
“Adapter: Parameter-Efficient Transfer Learning for NLP”, Et Al 2019
“Adapter: Parameter-Efficient Transfer Learning for NLP”, 2019-02-02 (similar)
“BioBERT: a Pre-trained Biomedical Language Representation Model for Biomedical Text Mining”, Et Al 2019
“BioBERT: a pre-trained biomedical language representation model for biomedical text mining”, 2019-01-25 (backlinks; similar; bibliography)
“Bayesian Layers: A Module for Neural Network Uncertainty”, Et Al 2018
“Bayesian Layers: A Module for Neural Network Uncertainty”, 2018-12-10 ( ; similar)
“Character-Level Language Modeling With Deeper Self-Attention”, Al-Rfou Et Al 2018
“Character-Level Language Modeling with Deeper Self-Attention”, 2018-08-09 ( ; backlinks; similar)
“Self-Attention Generative Adversarial Networks”, Et Al 2018
“Self-Attention Generative Adversarial Networks”, 2018-05-21 ( ; backlinks; similar)
“Universal Sentence Encoder”, Et Al 2018
“Universal Sentence Encoder”, 2018-03-29 (similar)
“Self-Attention With Relative Position Representations”, Et Al 2018
“Self-Attention with Relative Position Representations”, 2018-03-06 (similar)
“Learning Longer-term Dependencies in RNNs With Auxiliary Losses”, Et Al 2018
“Learning Longer-term Dependencies in RNNs with Auxiliary Losses”, 2018-03-01 ( ; similar)
“Generating Structured Music through Self-Attention”, Et Al 2018
“Generating Structured Music through Self-Attention”, 2018 ( ; similar; bibliography)
“A Simple Neural Attentive Meta-Learner”, Et Al 2017
“A Simple Neural Attentive Meta-Learner”, 2017-07-11 ( ; backlinks; similar)
“Attention Is All You Need”, Et Al 2017
“Attention Is All You Need”, 2017-06-12 (similar)
“RAM: Dynamic Computational Time for Visual Attention”, Et Al 2017
“RAM: Dynamic Computational Time for Visual Attention”, 2017-03-30 ( ; backlinks; similar)
“Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer”, 2016
“Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer”, 2016-12-12 ( ; similar)
“QRNNs: Quasi-Recurrent Neural Networks”, Et Al 2016
“QRNNs: Quasi-Recurrent Neural Networks”, 2016-11-05 ( ; similar)
“Modeling Human Reading With Neural Attention”, 2016
“Modeling Human Reading with Neural Attention”, 2016-08-19 ( ; backlinks; similar)
“Gaussian Error Linear Units (GELUs)”, 2016
“Gaussian Error Linear Units (GELUs)”, 2016-06-27 (backlinks; similar)
“Pointer Networks”, Et Al 2015
“Pointer Networks”, 2015-06-09 ( ; backlinks; similar)
“Neural Machine Translation by Jointly Learning to Align and Translate”, Et Al 2014
“Neural Machine Translation by Jointly Learning to Align and Translate”, 2014-09-01 ( ; backlinks; similar)
“Huggingface: ‘Transformers’ Repo”, 2023
“Huggingface: 'transformers' repo”, (similar; bibliography)
“Transformers Are a Very Exciting Family of Machine Learning Architectures. Many Good Tutorials Exist (eg. [1, 2]) but in the Last Few Years, Transformers Have Mostly Become Simpler, so That It Is Now Much More Straightforward to Explain How Modern Architectures Work. This Post Is an Attempt to Explain Directly [in PyTorch] How Modern Transformers Work, and Why, without Some of the Historical Baggage.”
“Transformers are a very exciting family of machine learning architectures. Many good tutorials exist (eg. [1, 2]) but in the last few years, transformers have mostly become simpler, so that it is now much more straightforward to explain how modern architectures work. This post is an attempt to explain directly [in PyTorch] how modern transformers work, and why, without some of the historical baggage.” (backlinks)
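As a companion to the post quoted above, the following is a minimal sketch of scaled dot-product self-attention in PyTorch—the core operation shared by every Transformer listed on this page. It is an illustrative sketch by this bibliography's editor, not code from the quoted post: the class name, single-head simplification, and dimensions are all assumptions made for brevity.

```python
# Minimal single-head self-attention sketch in PyTorch (illustrative only).
# Assumes input of shape (batch, sequence_length, embed_dim).
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """One attention head: project tokens to queries/keys/values, then mix values by softmax weights."""
    def __init__(self, embed_dim: int):
        super().__init__()
        self.q = nn.Linear(embed_dim, embed_dim, bias=False)
        self.k = nn.Linear(embed_dim, embed_dim, bias=False)
        self.v = nn.Linear(embed_dim, embed_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.q(x), self.k(x), self.v(x)                  # each (B, T, D)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # (B, T, T) similarity of every token pair
        weights = scores.softmax(dim=-1)                           # attention distribution per query token
        return weights @ v                                         # (B, T, D) weighted mix of value vectors

# Usage: a batch of 2 sequences, 5 tokens each, 16-dimensional embeddings.
x = torch.randn(2, 5, 16)
out = SelfAttention(16)(x)
print(out.shape)  # torch.Size([2, 5, 16])
```

A full Transformer block wraps this operation with multiple heads, residual connections, layer normalization, and a feedforward sublayer, as described in "Attention Is All You Need" and the tutorial quoted above.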
Wikipedia
Miscellaneous
Link Bibliography
-
https://arxiv.org/abs/2302.12441
: “MUX-PLMs: Pre-training Language Models With Data Multiplexing”, Vishvak Murahari, Ameet Deshpande, Carlos E. Jimenez, Izhak Shafran, Mingqiu Wang, Yuan Cao, Karthik Narasimhan: -
https://arxiv.org/abs/2302.05442#google
: “Scaling Vision Transformers to 22 Billion Parameters”, : -
https://arxiv.org/abs/2302.04907#google
: “BMT: Binarized Neural Machine Translation”, Yichi Zhang, Ankush Garg, Yuan Cao, Łukasz Lew, Behrooz Ghorbani, Zhiru Zhang, Orhan Firat: -
https://arxiv.org/abs/2301.03992#nvidia
: “Vision Transformers Are Good Mask Auto-Labelers”, Shiyi Lan, Xitong Yang, Zhiding Yu, Zuxuan Wu, Jose M. Alvarez, Anima Anandkumar: -
https://arxiv.org/abs/2301.03728#facebook
: “Scaling Laws for Generative Mixed-Modal Language Models”, : -
https://arxiv.org/abs/2212.06727
: “What Do Vision Transformers Learn? A Visual Exploration”, : -
https://arxiv.org/abs/2212.05199
: “MAGVIT: Masked Generative Video Transformer”, : -
https://arxiv.org/abs/2212.05051
: “VindLU: A Recipe for Effective Video-and-Language Pretraining”, Feng Cheng, Xizi Wang, Jie Lei, David Crandall, Mohit Bansal, Gedas Bertasius: -
https://arxiv.org/abs/2211.09808
: “Uni-Perceiver V2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks”, : -
https://arxiv.org/abs/2211.06220
: “OneFormer: One Transformer to Rule Universal Image Segmentation”, Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi: -
https://arxiv.org/abs/2209.11737
: “Semantic Scene Descriptions As an Objective of Human Vision”, Adrien Doerig, Tim C. Kietzmann, Emily Allen, Yihan Wu, Thomas Naselaris, Kendrick Kay, Ian Charest: -
https://arxiv.org/abs/2209.11055
: “SetFit: Efficient Few-Shot Learning Without Prompts”, Lewis Tunstall, Nils Reimers, Unso Eun Seo Jo, Luke Bates, Daniel Korat, Moshe Wasserblat, Oren Pereg: -
https://arxiv.org/abs/2209.02535
: “Analyzing Transformers in Embedding Space”, Guy Dar, Mor Geva, Ankit Gupta, Jonathan Berant: -
https://arxiv.org/abs/2207.06300#ibm
: “Re2G: Retrieve, Rerank, Generate”, Michael Glass, Gaetano Rossiello, Md Faisal Mahbub Chowdhury, Ankita Rajaram Naik, Pengshan Cai, Alfio Gliozzo: -
https://arxiv.org/abs/2207.01848
: “TabPFN: Meta-Learning a Real-Time Tabular AutoML Method For Small Data”, Noah Hollmann, Samuel Müller, Katharina Eggensperger, Frank Hutter: -
https://arxiv.org/abs/2206.07160#microsoft
: “LAVENDER: Unifying Video-Language Understanding As Masked Language Modeling”, Linjie Li, Zhe Gan, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Ce Liu, Lijuan Wang: -
https://arxiv.org/abs/2206.07137
: “RHO-LOSS: Prioritized Training on Points That Are Learnable, Worth Learning, and Not Yet Learnt”, : -
https://www.biorxiv.org/content/10.1101/2022.06.08.495348.full
: “Reconstructing the Cascade of Language Processing in the Brain Using the Internal Computations of a Transformer-based Language Model”, : -
https://arxiv.org/abs/2206.01859#microsoft
: “XTC: Extreme Compression for Pre-trained Transformers Made Simple and Efficient”, Xiaoxia Wu, Zhewei Yao, Minjia Zhang, Conglong Li, Yuxiong He: -
https://arxiv.org/abs/2206.01685
: “Toward a Realistic Model of Speech Processing in the Brain With Self-supervised Learning”, : -
2022-rios.pdf
: “Anime Character Recognition Using Intermediate Features Aggregation”, Edwin Arkel Rios, Min-Chun Hu, Bo-Cheng Lai: -
https://arxiv.org/abs/2205.04596#google
: “When Does Dough Become a Bagel? Analyzing the Remaining Mistakes on ImageNet”, Vijay Vasudevan, Benjamin Caine, Raphael Gontijo-Lopes, Sara Fridovich-Keil, Rebecca Roelofs: -
https://arxiv.org/abs/2203.13224#facebook
: “Language Models That Seek for Knowledge: Modular Search & Generation for Dialogue and Prompt Completion”, Kurt Shuster, Mojtaba Komeili, Leonard Adolphs, Stephen Roller, Arthur Szlam, Jason Weston: -
https://arxiv.org/abs/2202.03052#alibaba
: “Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework”, : -
https://arxiv.org/abs/2112.10510
: “PFNs: Transformers Can Do Bayesian Inference”, Samuel Müller, Noah Hollmann, Sebastian Pineda Arango, Josif Grabocka, Frank Hutter: -
https://arxiv.org/abs/2111.13824
: “FQ-ViT: Fully Quantized Vision Transformer without Retraining”, Yang Lin, Tianyu Zhang, Peiqin Sun, Zheng Li, Shuchang Zhou: -
https://arxiv.org/abs/2111.12233#microsoft
: “LEMON: Scaling Up Vision-Language Pre-training for Image Captioning”, Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, Lijuan Wang: -
https://arxiv.org/abs/2111.06091
: “A Survey of Visual Transformers”, : -
https://arxiv.org/abs/2109.12948
: “Understanding and Overcoming the Challenges of Efficient Transformer Quantization”, Yelysei Bondarenko, Markus Nagel, Tijmen Blankevoort: -
https://arxiv.org/abs/2109.06243#huawei
: “KroneckerBERT: Learning Kronecker Decomposition for Pre-trained Language Models via Knowledge Distillation”, Marzieh S. Tahaei, Ella Charlaix, Vahid Partovi Nia, Ali Ghodsi, Mehdi Rezagholizadeh: -
https://arxiv.org/abs/2107.07566#facebook
: “Internet-Augmented Dialogue Generation”, Mojtaba Komeili, Kurt Shuster, Jason Weston: -
https://arxiv.org/abs/2107.04589
: “ViTGAN: Training GANs With Vision Transformers”, Kwonjoon Lee, Huiwen Chang, Lu Jiang, Han Zhang, Zhuowen Tu, Ce Liu: -
https://arxiv.org/abs/2106.12672#google
: “Charformer: Fast Character Transformers via Gradient-based Subword Tokenization”, : -
https://arxiv.org/abs/2106.09488#amazon
: “Scaling Laws for Acoustic Models”, Jasha Droppo, Oguz Elibol: -
https://arxiv.org/abs/2106.04803#google
: “CoAtNet: Marrying Convolution and Attention for All Data Sizes”, Zihang Dai, Hanxiao Liu, Quoc V. Le, Mingxing Tan: -
https://arxiv.org/abs/2106.04533
: “Chasing Sparsity in Vision Transformers: An End-to-End Exploration”, Tianlong Chen, Yu Cheng, Zhe Gan, Lu Yuan, Lei Zhang, Zhangyang Wang: -
https://arxiv.org/abs/2105.15203
: “SegFormer: Simple and Efficient Design for Semantic Segmentation With Transformers”, Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo: -
https://arxiv.org/abs/2104.07567#facebook
: “Retrieval Augmentation Reduces Hallucination in Conversation”, Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, Jason Weston: -
https://chinai.substack.com/p/chinai-137-year-3-of-chinai
: “ChinAI #137: Year 3 of ChinAI: Reflections on the Newsworthiness of Machine Translation”, Jeffrey Ding: -
https://arxiv.org/abs/2103.11886#bytedance
: “DeepViT: Towards Deeper Vision Transformer”, Daquan Zhou, Bingyi Kang, Xiaojie Jin, Linjie Yang, Xiaochen Lian, Zihang Jiang, Qibin Hou, Jiashi Feng: -
https://arxiv.org/abs/2103.10697#facebook
: “ConViT: Improving Vision Transformers With Soft Convolutional Inductive Biases”, Stéphane d’Ascoli, Hugo Touvron, Matthew Leavitt, Ari Morcos, Giulio Biroli, Levent Sagun: -
https://ai.facebook.com/blog/learning-from-videos-to-understand-the-world/
: “Learning from Videos to Understand the World”, Geoffrey Zweig, Polina Kuznetsova, Michael Auli, Francois Fagan: -
https://arxiv.org/abs/2102.07074
: “TransGAN: Two Transformers Can Make One Strong GAN”, Yifan Jiang, Shiyu Chang, Zhangyang Wang: -
https://arxiv.org/abs/2102.03334
: “ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision”, Wonjae Kim, Bokyung Son, Ildoo Kim: -
https://arxiv.org/abs/2102.00719
: “Video Transformer Network”, Daniel Neimark, Omri Bar, Maya Zohar, Dotan Asselmann: -
https://arxiv.org/abs/2101.11986
: “Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet”, Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Francis EH Tay, Jiashi Feng, Shuicheng Yan: -
https://arxiv.org/abs/2101.11605#google
: “Bottleneck Transformers for Visual Recognition”, Aravind Srinivas, Tsung-Yi Lin, Niki Parmar, Jonathon Shlens, Pieter Abbeel, Ashish Vaswani: -
https://arxiv.org/abs/2101.08674
: “DAF:re: A Challenging, Crowd-Sourced, Large-Scale, Long-Tailed Dataset For Anime Character Recognition”, Edwin Arkel Rios, Wen-Huang Cheng, Bo-Cheng Lai: -
https://arxiv.org/abs/2101.04702#google
: “XMC-GAN: Cross-Modal Contrastive Learning for Text-to-Image Generation”, Han Zhang, Jing Yu Koh, Jason Baldridge, Honglak Lee, Yinfei Yang: -
https://arxiv.org/abs/2012.12877#facebook
: “Training Data-efficient Image Transformers & Distillation through Attention”, Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou: -
https://arxiv.org/abs/2012.08508#deepmind
: “Object-based Attention for Spatio-temporal Reasoning: Outperforming Neuro-symbolic Models With Flexible Distributed Architectures”, David Ding, Felix Hill, Adam Santoro, Matt Botvinick: -
https://arxiv.org/abs/2011.13729#tencent
: “TStarBot-X: An Open-Sourced and Comprehensive Study for Efficient League Training in StarCraft II Full Game”, : -
https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/
: “DeepSpeed: Extreme-scale Model Training for Everyone”, DeepSpeed Team, Rangan Majumder, Junhua Wang: -
https://arxiv.org/abs/2008.02217
: “Hopfield Networks Is All You Need”, : -
https://arxiv.org/abs/2006.03654#microsoft
: “DeBERTa: Decoding-enhanced BERT With Disentangled Attention”, Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen: -
https://arxiv.org/abs/2005.12872#facebook
: “DETR: End-to-End Object Detection With Transformers”, Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko: -
https://ai.facebook.com/blog/state-of-the-art-open-source-chatbot
: “Blender: A State-of-the-art Open Source Chatbot”, Stephen Roller, Jason Weston, Emily Dinan: -
https://arxiv.org/abs/2004.03965
: “Rapformer: Conditional Rap Lyrics Generation With Denoising Autoencoders”, Nikola I. Nikolov, Eric Malmi, Curtis G. Northcutt, Loreto Parisi: -
https://arxiv.org/abs/2004.03844
: “On the Effect of Dropping Layers of Pre-trained Transformer Models”, Hassan Sajjad, Fahim Dalvi, Nadir Durrani, Preslav Nakov: -
https://arxiv.org/abs/2002.10957#microsoft
: “MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers”, Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, Ming Zhou: -
https://ai.googleblog.com/2020/01/towards-conversational-agent-that-can.html
: “Towards a Conversational Agent That Can Chat About…Anything”, Daniel Adiwardana, Thang Luong: -
https://openai.com/blog/deep-double-descent/
: “Deep Double Descent: We Show That the Double Descent Phenomenon Occurs in CNNs, ResNets, and Transformers: Performance First Improves, Then Gets Worse, and Then Improves Again With Increasing Model Size, Data Size, or Training Time. This Effect Is Often Avoided through Careful Regularization. While This Behavior Appears to Be Fairly Universal, We Don’t yet Fully Understand Why It Happens, and View Further Study of This Phenomenon As an Important Research Direction.”, Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, Ilya Sutskever: -
https://arxiv.org/abs/1911.02116#facebook
: “Unsupervised Cross-lingual Representation Learning at Scale”, : -
https://arxiv.org/abs/1909.10351
: “TinyBERT: Distilling BERT for Natural Language Understanding”, Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, Qun Liu: -
https://arxiv.org/abs/1908.04577#alibaba
: “StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding”, Wei Wang, Bin Bi, Ming Yan, Chen Wu, Zuyi Bao, Jiangnan Xia, Liwei Peng, Luo Si: -
https://arxiv.org/abs/1907.11692#facebook
: “RoBERTa: A Robustly Optimized BERT Pretraining Approach”, : -
https://arxiv.org/abs/1905.03197
: “UniLM: Unified Language Model Pre-training for Natural Language Understanding and Generation”, Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, Hsiao-Wuen Hon: -
https://arxiv.org/abs/1904.00962#google
: “Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes”, : -
https://arxiv.org/abs/1901.08746
: “BioBERT: a Pre-trained Biomedical Language Representation Model for Biomedical Text Mining”, Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, Jaewoo Kang: -
2018-huang.pdf
: “Generating Structured Music through Self-Attention”, : -
https://github.com/huggingface/transformers
: “Huggingface: 'transformers' Repo”, Huggingface: