Bibliography:

  1. ‘neural net’ tag

  2. ‘AlphaFold’ tag

  3. ‘compressed Transformers’ tag

  4. ‘multi-scale Transformers’ tag

  5. ‘self-attention’ tag

  6. ‘Transformer matrix optimizations’ tag

  7. ‘recurrent Transformers’ tag

  8. ‘sparse Transformers’ tag

  9. ‘CLIP’ tag

  10. ‘CLIP samples’ tag

  11. ‘GPT-2 fiction’ tag

  12. ‘GPT-2’ tag

  13. ‘GPT-2 nonfiction’ tag

  14. ‘GPT-2 poetry’ tag

  15. ‘GPT-3 fiction’ tag

  16. ‘GPT-3 humor’ tag

  17. ‘GPT-3’ tag

  18. ‘GPT-3 nonfiction’ tag

  19. ‘GPT-3 poetry’ tag

  20. ‘GPT-4 fiction’ tag

  21. ‘GPT-4’ tag

  22. ‘GPT-4 nonfiction’ tag

  23. ‘GPT-4 poetry’ tag

  24. ‘Sydney (AI)’ tag

  25. ‘GPT-5’ tag

  26. ‘GPT calibration’ tag

  27. ‘Claude AI’ tag

  28. ‘Codex’ tag

  29. ‘DALL·E 1’ tag

  30. ‘DALL·E 2’ tag

  31. ‘DALL·E 3’ tag

  32. ‘DALL·E’ tag

  33. ‘GPT fiction’ tag

  34. ‘GPT’ tag

  35. ‘inner monologue (AI)’ tag

  36. ‘instruct-tuning LLMs’ tag

  37. ‘Jukebox’ tag

  38. ‘LaMDA’ tag

  39. ‘GPT non-fiction’ tag

  40. ‘PaLM 2’ tag

  41. ‘PaLM’ tag

  42. ‘GPT poetry’ tag

  43. ‘Whisper NN’ tag

  44. ‘T5 Transformer’ tag

  45. ‘MLP NN’ tag

  46. ‘BigGAN’ tag

  47. ‘masked autoencoder’ tag

  48. ‘AI scaling’ tag

  49. ‘MoE NN’ tag

  50. ‘tabular ML’ tag

  51. ‘AI video’ tag

  52. ‘AlphaStar’ tag

  53. ‘OA5’ tag

  54. Gemma 2: Improving Open Language Models at a Practical Size

  55. Investigating the Ability of LLMs to Recognize Their Own Writing

  56. Questionable practices in machine learning

  57. Revealing Fine-Grained Values and Opinions in Large Language Models

  58. BERTs are Generative In-Context Learners

  59. Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks

  60. Grokfast: Accelerated Grokking by Amplifying Slow Gradients

  61. Not All Language Model Features Are Linear

  62. You Only Cache Once: Decoder-Decoder Architectures for Language Models

  63. Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge

  64. Chinchilla Scaling: A replication attempt

  65. Exploring Concept Depth: How Large Language Models Acquire Knowledge at Different Layers?

  66. Conformer-1: Robust ASR via Large-Scale Semi-supervised Bootstrapping

  67. MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

  68. Language models accurately infer correlations between psychological items and scales from text alone

  69. Privacy Backdoors: Stealing Data with Corrupted Pretrained Models

  70. Language Models Learn Rare Phenomena from Less Rare Phenomena: The Case of the Missing AANNs

  71. A Study in Dataset Pruning for Image Super-Resolution

  72. AI and Memory Wall

  73. Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey

  74. Inflection-2.5: meet the world’s best personal AI

  75. LTE: Training Neural Networks from Scratch with Parallel Low-Rank Adapters

  76. Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping (Searchformer)

  77. KARL: Knowledge-Aware Retrieval and Representations aid Retention and Learning in Students

  78. Do Llamas Work in English? On the Latent Language of Multilingual Transformers

  79. DE-COP: Detecting Copyrighted Content in Language Models Training Data

  80. Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift

  81. The Manga Whisperer: Automatically Generating Transcriptions for Comics

  82. A Philosophical Introduction to Language Models—Part I: Continuity With Classic Debates

  83. Solving olympiad geometry without human demonstrations

  84. Real-Time AI & The Future of AI Hardware

  85. Seamless: Multilingual Expressive and Streaming Speech Translation

  86. Scaling transformer neural networks for skillful and reliable medium-range weather forecasting

  87. The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning

  88. GIVT: Generative Infinite-Vocabulary Transformers

  89. Sequential Modeling Enables Scalable Learning for Large Vision Models

  90. DiLoCo: Distributed Low-Communication Training of Language Models

  91. CogVLM: Visual Expert for Pretrained Language Models

  92. GLaMM: Pixel Grounding Large Multimodal Model

  93. Don’t Make Your LLM an Evaluation Benchmark Cheater

  94. ProSG: Using Prompt Synthetic Gradients to Alleviate Prompt Forgetting of RNN-like Language Models

  95. EELBERT: Tiny Models through Dynamic Embeddings

  96. LLM-FP4: 4-Bit Floating-Point Quantized Transformers

  97. Will releasing the weights of large language models grant widespread access to pandemic agents?

  98. Model Merging by Uncertainty-Based Gradient Matching

  99. To grok or not to grok: Disentangling generalization and memorization on corrupted algorithmic datasets

  100. Sparse Universal Transformer

  101. Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning

  102. Language Models Represent Space and Time

  103. DeWave: Discrete EEG Waves Encoding for Brain Dynamics to Text Translation

  104. Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions

  105. Demystifying RCE Vulnerabilities in LLM-Integrated Apps

  106. A Pooled Cell Painting CRISPR Screening Platform Enables de novo Inference of Gene Function by Self-supervised Deep Learning

  107. Nougat: Neural Optical Understanding for Academic Documents

  108. SeamlessM4T: Massively Multilingual & Multimodal Machine Translation

  109. Predicting brain activity using Transformers

  110. Copy Is All You Need

  111. HEADLINES: A Massive Scale Semantic Similarity Dataset of Historical English

  112. Expanding the methodological toolbox: Machine-based item desirability ratings as an alternative to human-based ratings

  113. OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents

  114. RGD: Stochastic Re-weighted Gradient Descent via Distributionally Robust Optimization

  115. SequenceMatch: Imitation Learning for Autoregressive Sequence Modeling with Backtracking

  116. Using Sequences of Life-events to Predict Human Lives

  117. Binary and Ternary Natural Language Generation

  118. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

  119. The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

  120. Learning Transformer Programs

  121. FERMAT: An Alternative to Accuracy for Numerical Reasoning

  122. Translatotron 3: Speech to Speech Translation with Monolingual Data

  123. Deep Learning based Forecasting: a case study from the online fashion industry

  124. Scaling laws for language encoding models in fMRI

  125. DarkBERT: A Language Model for the Dark Side of the Internet

  126. Mitigating Lies in Vision-Language Models

  127. VendorLink: An NLP approach for Identifying & Linking Vendor Migrants & Potential Aliases on Darknet Markets

  128. Visual Instruction Tuning

  129. Segment Anything

  130. A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision

  131. When and How Artificial Intelligence Augments Employee Creativity

  132. Trained on 100 million words and still in shape: BERT meets British National Corpus

  133. Mitigating YouTube Recommendation Polarity using BERT and K-Means Clustering

  134. Model scale versus domain knowledge in statistical forecasting of chaotic systems

  135. Tag2Text: Guiding Vision-Language Model via Image Tagging

  136. The Man of Your Dreams: For $300, Replika sells an AI companion who will never die, argue, or cheat—until his algorithm is updated

  137. Towards Democratizing Joint-Embedding Self-Supervised Learning

  138. MUX-PLMs: Pre-training Language Models with Data Multiplexing

  139. Optical Transformers

  140. Scaling Vision Transformers to 22 Billion Parameters

  141. BMT: Binarized Neural Machine Translation

  142. V1T: large-scale mouse V1 response prediction using a Vision Transformer

  143. The BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus

  144. SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient

  145. XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models

  146. ClimaX: A foundation model for weather and climate

  147. DataMUX: Data Multiplexing for Neural Networks

  148. Progress measures for grokking via mechanistic interpretability

  149. Scaling Laws for Generative Mixed-Modal Language Models

  150. Vision Transformers Are Good Mask Auto-Labelers

  151. Why do Nearest Neighbor Language Models Work?

  152. Cramming: Training a Language Model on a Single GPU in One Day

  153. Less is More: Parameter-Free Text Classification with Gzip

  154. NBC-Softmax: Darkweb Author fingerprinting and migration tracking

  155. What do Vision Transformers Learn? A Visual Exploration

  156. POM: A Principal Odor Map Unifies Diverse Tasks in Human Olfactory Perception

  157. MAGVIT: Masked Generative Video Transformer

  158. VindLU: A Recipe for Effective Video-and-Language Pretraining

  159. Text Embeddings by Weakly-Supervised Contrastive Pre-training

  160. Discovering Latent Knowledge in Language Models Without Supervision

  161. NPM: Nonparametric Masked Language Modeling

  162. BARTSmiles: Generative Masked Language Models for Molecular Representations

  163. RGB no more: Minimally-decoded JPEG Vision Transformers

  164. Self-Destructing Models: Increasing the Costs of Harmful Dual Uses of Foundation Models

  165. A deep learning and digital archaeology approach for mosquito repellent discovery

  166. GENIUS: Sketch-based Language Model Pre-training via Extreme and Selective Masking for Text Generation and Augmentation

  167. UniSumm: Unified Few-shot Summarization with Multi-Task Pre-Training and Prefix-Tuning

  168. Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks

  169. Distilled DeepConsensus: Knowledge distillation for fast and accurate DNA sequence correction

  170. Massively Multilingual ASR on 70 Languages: Tokenization, Architecture, and Generalization Capabilities

  171. OneFormer: One Transformer to Rule Universal Image Segmentation

  172. Characterizing Intrinsic Compositionality in Transformers with Tree Projections

  173. Fast DistilBERT on CPUs

  174. n-gram Is Back: Residual Learning of Neural Text Generation with n-gram Language Model

  175. Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models

  176. The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers

  177. Noise-Robust De-Duplication at Scale

  178. Small Character Models Match Large Word Models for Autocomplete Under Memory Constraints

  179. Improving Sample Quality of Diffusion Models Using Self-Attention Guidance

  180. Semantic scene descriptions as an objective of human vision

  181. SetFit: Efficient Few-Shot Learning Without Prompts

  182. A Generalist Neural Algorithmic Learner

  183. Machine Reading, Fast and Slow: When Do Models "Understand" Language?

  184. On the Effectiveness of Compact Biomedical Transformers (BioBERT)

  185. Analyzing Transformers in Embedding Space

  186. ASR2K: Speech Recognition for Around 2,000 Languages without Audio

  187. MeloForm: Generating Melody with Musical Form based on Expert Systems and Neural Networks

  188. CorpusBrain: Pre-train a Generative Retrieval Model for Knowledge-Intensive Language Tasks

  189. PatchDropout: Economizing Vision Transformers Using Patch Dropout

  190. Why do tree-based models still outperform deep learning on tabular data?

  191. Re2G: Retrieve, Rerank, Generate

  192. Transformer Neural Processes: Uncertainty-Aware Meta Learning Via Sequence Modeling

  193. TabPFN: Meta-Learning a Real-Time Tabular AutoML Method For Small Data

  194. Neural Networks and the Chomsky Hierarchy

  195. Do Loyal Users Enjoy Better Recommendations? Understanding Recommender Accuracy from a Time Perspective

  196. Transfer Learning with Deep Tabular Models

  197. BertNet: Harvesting Knowledge Graphs from Pretrained Language Models

  198. ProGen2: Exploring the Boundaries of Protein Language Models

  199. SBERT studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features

  200. RHO-LOSS: Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt

  201. LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling

  202. Language Models are General-Purpose Interfaces

  203. Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs

  204. Reconstructing the cascade of language processing in the brain using the internal computations of a transformer-based language model

  205. A Neural Corpus Indexer for Document Retrieval

  206. XTC: Extreme Compression for Pre-trained Transformers Made Simple and Efficient

  207. Toward a realistic model of speech processing in the brain with self-supervised learning

  208. Text2Human: Text-Driven Controllable Human Image Generation

  209. Anime Character Recognition using Intermediate Features Aggregation

  210. Towards Learning Universal Hyperparameter Optimizers with Transformers

  211. FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech

  212. HTPS: HyperTree Proof Search for Neural Theorem Proving

  213. On the Paradox of Learning to Reason from Data

  214. Housekeep: Tidying Virtual Households using Commonsense Reasoning

  215. UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes

  216. Tradformer: A Transformer Model of Traditional Music Transcriptions

  217. Continual Pre-Training Mitigates Forgetting in Language and Vision

  218. PLAID: An Efficient Engine for Late Interaction Retrieval

  219. Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning

  220. SymphonyNet: Symphony Generation with Permutation Invariant Language Model

  221. When does dough become a bagel? Analyzing the remaining mistakes on ImageNet

  222. A Challenging Benchmark of Anime Style Recognition

  223. Data Distributional Properties Drive Emergent Few-Shot Learning in Transformers

  224. Masked Siamese Networks for Label-Efficient Learning

  225. DualPrompt: Complementary Prompting for Rehearsal-free Continual Learning

  226. Language Models that Seek for Knowledge: Modular Search & Generation for Dialogue and Prompt Completion

  227. On Embeddings for Numerical Features in Tabular Deep Learning

  228. In-Context Learning and Induction Heads

  229. LiteTransformerSearch: Training-free Neural Architecture Search for Efficient Language Models

  230. Pretraining without Wordpieces: Learning Over a Vocabulary of Millions of Words

  231. OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

  232. TACTiS: Transformer-Attentional Copulas for Time Series

  233. AutoDistil: Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models

  234. FIGARO: Generating Symbolic Music with Fine-Grained Artistic Control

  235. Robust Contrastive Learning against Noisy Views

  236. HyperTransformer: Model Generation for Supervised and Semi-Supervised Few-Shot Learning

  237. A Mathematical Framework for Transformer Circuits

  238. PFNs: Transformers Can Do Bayesian Inference

  239. XGLM: Few-shot Learning with Multilingual Language Models

  240. An Empirical Investigation of the Role of Pre-training in Lifelong Learning

  241. AI Improvements in Chemical Calculations

  242. You Only Need One Model for Open-domain Question Answering

  243. Human Parity on CommonsenseQA: Augmenting Self-Attention with External Attention

  244. ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction

  245. Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks

  246. Inducing Causal Structure for Interpretable Neural Networks (IIT)

  247. OCR-free Document Understanding Transformer

  248. FQ-ViT: Fully Quantized Vision Transformer without Retraining

  249. Semi-Supervised Music Tagging Transformer

  250. LEMON: Scaling Up Vision-Language Pre-training for Image Captioning

  251. UNICORN: Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling

  252. Compositional Transformers for Scene Generation

  253. It’s About Time: Analog Clock Reading in the Wild

  254. XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale

  255. A Survey of Visual Transformers

  256. Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers

  257. The Efficiency Misnomer

  258. STransGAN: An Empirical Study on Transformer in GANs

  259. Lifelong Pretraining: Continually Adapting Language Models to Emerging Corpora

  260. The Dangers of Underclaiming: Reasons for Caution When Reporting How NLP Systems Fail

  261. Palette: Image-to-Image Diffusion Models

  262. Transformers are Meta-Reinforcement Learners

  263. Autoregressive Latent Video Prediction with High-Fidelity Image Generator

  264. Skill Induction and Planning with Latent Language

  265. Text2Brain: Synthesis of Brain Activation Maps from Free-form Text Query

  266. Understanding and Overcoming the Challenges of Efficient Transformer Quantization

  267. BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition

  268. TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models

  269. MeLT: Message-Level Transformer with Masked Document Representations as Pre-Training for Stance Detection

  270. KroneckerBERT: Learning Kronecker Decomposition for Pre-trained Language Models via Knowledge Distillation

  271. Block Pruning For Faster Transformers

  272. The Sensory Neuron as a Transformer: Permutation-Invariant Neural Networks for Reinforcement Learning

  273. DeepConsensus: Gap-Aware Sequence Transformers for Sequence Correction

  274. A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP

  275. Data and Parameter Scaling Laws for Neural Machine Translation

  276. ImageBART: Bidirectional Context with Multinomial Diffusion for Autoregressive Image Synthesis

  277. Modeling Protein Using Large-scale Pretrain Language Model

  278. Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations

  279. EVA: An Open-Domain Chinese Dialogue System with Large-Scale Generative Pre-Training

  280. Internet-Augmented Dialogue Generation

  281. HTLM: Hyper-Text Pre-Training and Prompting of Language Models

  282. SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking

  283. ViTGAN: Training GANs with Vision Transformers

  284. ARM-Net: Adaptive Relation Modeling Network for Structured Data

  285. SCARF: Self-Supervised Contrastive Learning using Random Feature Corruption

  286. Charformer: Fast Character Transformers via Gradient-based Subword Tokenization

  287. BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models

  288. Revisiting the Calibration of Modern Neural Networks

  289. Scaling Laws for Acoustic Models

  290. CoAtNet: Marrying Convolution and Attention for All Data Sizes

  291. Chasing Sparsity in Vision Transformers: An End-to-End Exploration

  292. Tabular Data: Deep Learning is Not All You Need

  293. Self-Attention Between Datapoints: Going Beyond Individual Input-Output Pairs in Deep Learning

  294. Exploring Transfer Learning techniques for Named Entity Recognition in Noisy User-Generated Text

  295. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers

  296. Maximizing 3-D Parallelism in Distributed Training for Huge Neural Networks

  297. One4all User Representation for Recommender Systems in E-commerce

  298. QASPER: A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers

  299. MathBERT: A Pre-Trained Model for Mathematical Formula Understanding

  300. MDETR—Modulated Detection for End-to-End Multi-Modal Understanding

  301. XLM-T: Multilingual Language Models in Twitter for Sentiment Analysis and Beyond

  302. [Alibaba releases PLUG: 27 billion parameters, the largest pre-trained language model in the Chinese community]

  303. SimCSE: Simple Contrastive Learning of Sentence Embeddings

  304. Robust Open-Vocabulary Translation from Visual Text Representations

  305. Memorization versus Generalization in Pre-trained Language Models

  306. Retrieval Augmentation Reduces Hallucination in Conversation

  307. Gradient-based Adversarial Attacks against Text Transformers

  308. TSDAE: Using Transformer-based Sequential Denoising Autoencoder for Unsupervised Sentence Embedding Learning

  309. Machine Translation Decoding beyond Beam Search

  310. An Empirical Study of Training Self-Supervised Vision Transformers

  311. ChinAI #137: Year 3 of ChinAI: Reflections on the newsworthiness of machine translation

  312. GPV-1: Towards General Purpose Vision Systems

  313. DeepViT: Towards Deeper Vision Transformer

  314. ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases

  315. Get Your Vitamin C! Robust Fact Verification with Contrastive Evidence (VitaminC)

  316. Learning from videos to understand the world

  317. Are NLP Models really able to Solve Simple Math Word Problems?

  318. CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

  319. TransGAN: Two Transformers Can Make One Strong GAN

  320. baller2vec: A Multi-Entity Transformer For Multi-Agent Spatiotemporal Modeling

  321. ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision

  322. Video Transformer Network

  323. Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

  324. BENDR: using transformers and a contrastive self-supervised learning task to learn from massive amounts of EEG data

  325. Bottleneck Transformers for Visual Recognition

  326. DAF:re: A Challenging, Crowd-Sourced, Large-Scale, Long-Tailed Dataset For Anime Character Recognition

  327. UPDeT: Universal Multi-agent Reinforcement Learning via Policy Decoupling with Transformers

  328. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language

  329. XMC-GAN: Cross-Modal Contrastive Learning for Text-to-Image Generation

  330. Superbizarre Is Not Superb: Derivational Morphology Improves BERT’s Interpretation of Complex Words

  331. Training data-efficient image transformers & distillation through attention

  332. VQ-GAN: Taming Transformers for High-Resolution Image Synthesis

  333. Object-based attention for spatio-temporal reasoning: Outperforming neuro-symbolic models with flexible distributed architectures

  334. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences

  335. Progressively Stacking 2.0: A Multi-stage Layerwise Training Method for BERT Training Speedup

  336. TStarBot-X: An Open-Sourced and Comprehensive Study for Efficient League Training in StarCraft II Full Game

  337. A Recurrent Vision-and-Language BERT for Navigation

  338. A Primer in BERTology: What we know about how BERT works

  339. CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters

  340. TernaryBERT: Distillation-aware Ultra-low Bit BERT

  341. Weird AI Yankovic: Generating Parody Lyrics

  342. It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners

  343. DeepSpeed: Extreme-scale model training for everyone

  344. Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing

  345. CoVoST 2 and Massively Multilingual Speech-to-Text Translation

  346. Modern Hopfield Networks and Attention for Immune Repertoire Classification

  347. Hopfield Networks is All You Need

  348. Can neural networks acquire a structural bias from raw linguistic data?

  349. DeepSinger: Singing Voice Synthesis with Data Mined From the Web

  350. Data Movement Is All You Need: A Case Study on Optimizing Transformers

  351. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

  352. PipeDream-2BW: Memory-Efficient Pipeline-Parallel DNN Training

  353. Learning to Learn with Feedback and Local Plasticity

  354. Improving GAN Training with Probability Ratio Clipping and Sample Reweighting

  355. DeBERTa: Decoding-enhanced BERT with Disentangled Attention

  356. DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations

  357. DETR: End-to-End Object Detection with Transformers

  358. Open-Retrieval Conversational Question Answering

  359. TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data

  360. ForecastQA: A Question Answering Challenge for Event Forecasting with Temporal Text Data

  361. VLN-BERT: Improving Vision-and-Language Navigation with Image-Text Pairs from the Web

  362. Blender: A state-of-the-art open source chatbot

  363. General Purpose Text Embeddings from Pre-trained Language Models for Scalable Inference

  364. Recipes for building an open-domain chatbot

  365. Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks

  366. On the Effect of Dropping Layers of Pre-trained Transformer Models

  367. Rapformer: Conditional Rap Lyrics Generation with Denoising Autoencoders

  368. TAPAS: Weakly Supervised Table Parsing via Pre-training

  369. A Hundred Visions and Revisions

  370. Rethinking Parameter Counting in Deep Models: Effective Dimensionality Revisited

  371. AraBERT: Transformer-based Model for Arabic Language Understanding

  372. MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers

  373. GNS: Learning to Simulate Complex Physics with Graph Networks

  374. Do We Need Zero Training Loss After Achieving Zero Training Error?

  375. Bayesian Deep Learning and a Probabilistic Perspective of Generalization

  376. Transformers as Soft Reasoners over Language

  377. Towards a Conversational Agent that Can Chat About…Anything

  378. Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference

  379. Improving Transformer Optimization Through Better Initialization

  380. VIME: Extending the Success of Self-supervised and Semi-supervised Learning to Tabular Domain

  381. Measuring Compositional Generalization: A Comprehensive Method on Realistic Data

  382. Mastering Complex Control in MOBA Games with Deep Reinforcement Learning

  383. PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization

  384. Encoding Musical Style with Transformer Autoencoders

  385. Deep Double Descent: We show that the double descent phenomenon occurs in CNNs, ResNets, and transformers: performance first improves, then gets worse, and then improves again with increasing model size, data size, or training time

  386. Detecting GAN generated errors

  387. SimpleBooks: Long-term dependency book dataset with simplified English vocabulary for word-level language modeling

  388. Unsupervised Cross-lingual Representation Learning at Scale

  389. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

  390. TinyBERT: Distilling BERT for Natural Language Understanding

  391. Do NLP Models Know Numbers? Probing Numeracy in Embeddings

  392. PubMedQA: A Dataset for Biomedical Research Question Answering

  393. Frustratingly Easy Natural Question Answering

  394. Distributionally Robust Language Modeling

  395. Language Models as Knowledge Bases?

  396. Encode, Tag, Realize: High-Precision Text Editing

  397. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

  398. Well-Read Students Learn Better: On the Importance of Pre-training Compact Models

  399. TabNet: Attentive Interpretable Tabular Learning

  400. StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding

  401. What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models

  402. RoBERTa: A Robustly Optimized BERT Pretraining Approach

  403. Theoretical Limitations of Self-Attention in Neural Sequence Models

  404. Energy and Policy Considerations for Deep Learning in NLP

  405. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned

  406. HellaSwag: Can a Machine Really Finish Your Sentence?

  407. UniLM: Unified Language Model Pre-training for Natural Language Understanding and Generation

  408. MASS: Masked Sequence to Sequence Pre-training for Language Generation

  409. Mask-Predict: Parallel Decoding of Conditional Masked Language Models

  410. Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

  411. LIGHT: Learning to Speak and Act in a Fantasy Text Adventure Game

  412. Insertion Transformer: Flexible Sequence Generation via Insertion Operations

  413. Adapter: Parameter-Efficient Transfer Learning for NLP

  414. Learning and Evaluating General Linguistic Intelligence

  415. BioBERT: a pre-trained biomedical language representation model for biomedical text mining

  416. Efficient Training of BERT by Progressively Stacking

  417. Bayesian Layers: A Module for Neural Network Uncertainty

  418. Blockwise Parallel Decoding for Deep Autoregressive Models

  419. Object Hallucination in Image Captioning

  420. Self-Attention Generative Adversarial Networks

  421. Universal Sentence Encoder

  422. Self-Attention with Relative Position Representations

  423. Learning Longer-term Dependencies in RNNs with Auxiliary Losses

  424. Generating Structured Music through Self-Attention

  425. GPipe: Easy Scaling With Micro-Batch Pipeline Parallelism § Pg4

  427. A Simple Neural Attentive Meta-Learner

  428. Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer

  429. QRNNs: Quasi-Recurrent Neural Networks

  430. Gaussian Error Linear Units (GELUs)

  431. Pointer Networks

  432. No Physics? No Problem. AI Weather Forecasting Is Already Making Huge Strides.

  433. Huggingface: transformers repo

  434. Transformers in Vision

  436. The Illustrated GPT-2 (Visualizing Transformer Language Models)

  438. The Illustrated Transformer

  440. Autoregressive Long-Context Music Generation With Perceiver AR

  441. The Transformer—Attention Is All You Need.

  443. Understanding BERT Transformer: Attention Isn’t All You Need

  444. Etched Is Making the Biggest Bet in AI

  446. Was Linguistic A.I. Created by Accident?

  447. Transformers are a very exciting family of machine learning architectures

  448. design#future-tag-features

  449. Nguyen et al 2023, Figure 12: bigger climate-forecasting models are more sample-efficient on low-resolution data (2023-nguyen-figure12-biggerclimateforecastingmodelsaremoresampleefficientonlowresolutiondata.jpg)

  450. Cheng et al 2022, Figure 2: ablation of VindLU text-video model performance by source of performance changes (2022-cheng-figure2-ablationofvindlutextvideomodelperformancebysourceofperformancechanges.jpg)

  451. Hu et al 2021, Figure 2(b): data scaling of fine-tuning performance on nocaps (2021-hu-figure2-b-datascalingfinetuningperformanceonnocaps.jpg)

  452. Hu et al 2021, Figure 6: larger LEMON caption models are more sample-efficient (2021-hu-figure6-largerlemoncaptionmodelsaremoresampleefficient.jpg)

  453. Zaken et al 2021, Figure 2: scaling curve of fine-tuning vs. bias-tuning shows curves cross as dataset size increases (2021-zaken-figure2-scalingcurveoffinetuningvsbiastuningshowscurvescrossasdatasetsizeincreases.png)

  454. https://aclanthology.org/2020.wmt-1.1.pdf

  456. https://aclanthology.org/2021.emnlp-main.563.pdf

  457. https://aclanthology.org/D18-1092/

  458. https://ai.facebook.com/blog/harmful-content-can-evolve-quickly-our-new-ai-system-adapts-to-tackle-it

  459. https://aimoprize.com/

  461. https://bellard.org/ts_server/ts_zip.html

  462. https://blog.floydhub.com/the-transformer-in-pytorch/

  464. https://github.com/Mozilla-Ocho/llamafile

  465. https://github.com/NVIDIA/FasterTransformer

  466. https://github.com/huggingface/transformers/tree/main/src/transformers

  467. https://github.com/lukas-blecher/LaTeX-OCR

  468. https://gonzoml.substack.com/p/you-only-cache-once-decoder-decoder

  469. https://justine.lol/matmul/

  471. https://kenschutte.com/gzip-knn-paper2/

  472. https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/

  473. https://linktransformer.github.io/

  474. https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms

  475. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4588941

  476. https://research.google/blog/on-device-content-distillation-with-graph-neural-networks/

  477. https://research.google/blog/unsupervised-speech-to-speech-translation-from-monolingual-data/

  478. https://sander.ai/2023/01/09/diffusion-language.html#deepmind

  480. https://sites.google.com/view/medusa-llm

  482. https://www.csm.ai/commonsim-1-generating-3d-worlds

  483. https://www.lesswrong.com/posts/2JJtxitp6nqu6ffak/basic-facts-about-language-models-during-training-1

  484. https://www.lesswrong.com/posts/4Hnso8NMAeeYs8Cta/revealing-intentionality-in-language-models-through-adavae#BigVAE_and_Its_Samplers

  485. https://www.quantamagazine.org/how-ai-transformers-mimic-parts-of-the-brain-20220912/

  487. https://www.reddit.com/r/MachineLearning/comments/yxt8sa/r_rwkv4_7b_release_an_attentionfree_rnn_language/

  489. https://x.com/JosephJacks_/status/1647328379266551808

  490. https://x.com/alyssamvance/status/1612580727744520192

  491. https://x.com/jconorgrogan/status/1820212444016345146

  492. https://x.com/kanishkamisra/status/1775156612988088736

  493. https://x.com/karpathy/status/1765473722985771335

  494. https://x.com/stephenroller/status/1579993017234382849
