‘Transformer’ tag
- See Also
- Links
- “Gemma 2: Improving Open Language Models at a Practical Size”, Riviere et al 2024
- “Investigating the Ability of LLMs to Recognize Their Own Writing”, Ackerman & Panickssery 2024
- “Questionable Practices in Machine Learning”, Leech et al 2024
- “Revealing Fine-Grained Values and Opinions in Large Language Models”, Wright et al 2024
- “BERTs Are Generative In-Context Learners”, Samuel 2024
- “Learning to Grok: Emergence of In-Context Learning and Skill Composition in Modular Arithmetic Tasks”, He et al 2024
- “Grokfast: Accelerated Grokking by Amplifying Slow Gradients”, Lee et al 2024
- “Not All Language Model Features Are Linear”, Engels et al 2024
- “You Only Cache Once: Decoder-Decoder Architectures for Language Models”, Sun et al 2024
- “Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge”, Batsuren et al 2024
- “Chinchilla Scaling: A Replication Attempt”, Besiroglu et al 2024
- “Exploring Concept Depth: How Large Language Models Acquire Knowledge at Different Layers?”, Jin et al 2024
- “Conformer-1: Robust ASR via Large-Scale Semi-Supervised Bootstrapping”, Zhang et al 2024
- “MiniCPM: Unveiling the Potential of Small Language Models With Scalable Training Strategies”, Hu et al 2024
- “Language Models Accurately Infer Correlations between Psychological Items and Scales from Text Alone”, Hommel & Arslan 2024
- “Privacy Backdoors: Stealing Data With Corrupted Pretrained Models”, Feng & Tramèr 2024
- “Language Models Learn Rare Phenomena from Less Rare Phenomena: The Case of the Missing AANNs”, Misra & Mahowald 2024
- “A Study in Dataset Pruning for Image Super-Resolution”, Moser et al 2024
- “AI and Memory Wall”, Gholami et al 2024
- “Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey”, Han et al 2024
- “Inflection-2.5: Meet the World’s Best Personal AI”, Inflection 2024
- “LTE: Training Neural Networks from Scratch With Parallel Low-Rank Adapters”, Huh et al 2024
- “Beyond A✱: Better Planning With Transformers via Search Dynamics Bootstrapping (Searchformer)”, Lehnert et al 2024
- “KARL: Knowledge-Aware Retrieval and Representations Aid Retention and Learning in Students”, Shu et al 2024
- “Do Llamas Work in English? On the Latent Language of Multilingual Transformers”, Wendler et al 2024
- “DE-COP: Detecting Copyrighted Content in Language Models Training Data”, Duarte et al 2024
- “Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift”, Qiu et al 2024
- “The Manga Whisperer: Automatically Generating Transcriptions for Comics”, Sachdeva & Zisserman 2024
- “A Philosophical Introduction to Language Models—Part I: Continuity With Classic Debates”, Millière & Buckner 2024
- “Solving Olympiad Geometry without Human Demonstrations”, Trinh et al 2024
- “Real-Time AI & The Future of AI Hardware”, Uberti 2023
- “Seamless: Multilingual Expressive and Streaming Speech Translation”, Communication et al 2023
- “Scaling Transformer Neural Networks for Skillful and Reliable Medium-Range Weather Forecasting”, Nguyen et al 2023
- “The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning”, Lin et al 2023
- “GIVT: Generative Infinite-Vocabulary Transformers”, Tschannen et al 2023
- “Sequential Modeling Enables Scalable Learning for Large Vision Models”, Bai et al 2023
- “DiLoCo: Distributed Low-Communication Training of Language Models”, Douillard et al 2023
- “CogVLM: Visual Expert for Pretrained Language Models”, Wang et al 2023
- “GLaMM: Pixel Grounding Large Multimodal Model”, Rasheed et al 2023
- “Don’t Make Your LLM an Evaluation Benchmark Cheater”, Zhou et al 2023
- “ProSG: Using Prompt Synthetic Gradients to Alleviate Prompt Forgetting of RNN-Like Language Models”, Luo et al 2023
- “EELBERT: Tiny Models through Dynamic Embeddings”, Cohn et al 2023
- “LLM-FP4: 4-Bit Floating-Point Quantized Transformers”, Liu et al 2023
- “Will Releasing the Weights of Large Language Models Grant Widespread Access to Pandemic Agents?”, Gopal et al 2023
- “Model Merging by Uncertainty-Based Gradient Matching”, Daheim et al 2023
- “To Grok or Not to Grok: Disentangling Generalization and Memorization on Corrupted Algorithmic Datasets”, Doshi et al 2023
- “Sparse Universal Transformer”, Tan et al 2023
- “Sheared LLaMA: Accelerating Language Model Pre-Training via Structured Pruning”, Xia et al 2023
- “Language Models Represent Space and Time”, Gurnee & Tegmark 2023
- “DeWave: Discrete EEG Waves Encoding for Brain Dynamics to Text Translation”, Duan et al 2023
- “Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions”, Chebotar et al 2023
- “Demystifying RCE Vulnerabilities in LLM-Integrated Apps”, Liu et al 2023
- “A Pooled Cell Painting CRISPR Screening Platform Enables de Novo Inference of Gene Function by Self-Supervised Deep Learning”, Sivanandan et al 2023
- “Nougat: Neural Optical Understanding for Academic Documents”, Blecher et al 2023
- “SeamlessM4T: Massively Multilingual & Multimodal Machine Translation”, Communication et al 2023
- “Predicting Brain Activity Using Transformers”, Adeli et al 2023
- “Copy Is All You Need”, Lan et al 2023
- “HEADLINES: A Massive Scale Semantic Similarity Dataset of Historical English”, Silcock & Dell 2023
- “Expanding the Methodological Toolbox: Machine-Based Item Desirability Ratings As an Alternative to Human-Based Ratings”, Hommel 2023
- “OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents”, Laurençon et al 2023
- “RGD: Stochastic Re-Weighted Gradient Descent via Distributionally Robust Optimization”, Kumar et al 2023
- “SequenceMatch: Imitation Learning for Autoregressive Sequence Modeling With Backtracking”, Cundy & Ermon 2023
- “Using Sequences of Life-Events to Predict Human Lives”, Savcisens et al 2023
- “Binary and Ternary Natural Language Generation”, Liu et al 2023
- “AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration”, Lin et al 2023
- “The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora With Web Data, and Web Data Only”, Penedo et al 2023
- “Learning Transformer Programs”, Friedman et al 2023
- “FERMAT: An Alternative to Accuracy for Numerical Reasoning”, Sivakumar & Moosavi 2023
- “Translatotron 3: Speech to Speech Translation With Monolingual Data”, Nachmani et al 2023
- “Deep Learning Based Forecasting: a Case Study from the Online Fashion Industry”, Kunz et al 2023
- “Scaling Laws for Language Encoding Models in fMRI”, Antonello et al 2023
- “DarkBERT: A Language Model for the Dark Side of the Internet”, Jin et al 2023
- “Mitigating Lies in Vision-Language Models”, Li et al 2023
- “VendorLink: An NLP Approach for Identifying & Linking Vendor Migrants & Potential Aliases on Darknet Markets”, Saxena et al 2023
- “Visual Instruction Tuning”, Liu et al 2023
- “Segment Anything”, Kirillov et al 2023
- “A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision”, Beyer et al 2023
- “When and How Artificial Intelligence Augments Employee Creativity”, Jia et al 2023
- “Trained on 100 Million Words and Still in Shape: BERT Meets British National Corpus”, Samuel et al 2023
- “Mitigating YouTube Recommendation Polarity Using BERT and K-Means Clustering”, Ahmad et al 2023
- “Model Scale versus Domain Knowledge in Statistical Forecasting of Chaotic Systems”, Gilpin 2023
- “Tag2Text: Guiding Vision-Language Model via Image Tagging”, Huang et al 2023
- “The Man of Your Dreams: For $300, Replika Sells an AI Companion Who Will Never Die, Argue, or Cheat—Until His Algorithm Is Updated”, Singh-Kurtz 2023
- “Towards Democratizing Joint-Embedding Self-Supervised Learning”, Bordes et al 2023
- “MUX-PLMs: Pre-Training Language Models With Data Multiplexing”, Murahari et al 2023
- “Optical Transformers”, Anderson et al 2023
- “Scaling Vision Transformers to 22 Billion Parameters”, Dehghani et al 2023
- “BMT: Binarized Neural Machine Translation”, Zhang et al 2023
- “V1T: Large-Scale Mouse V1 Response Prediction Using a Vision Transformer”, Li et al 2023
- “The BabyLM Challenge: Sample-Efficient Pretraining on a Developmentally Plausible Corpus”, Warstadt et al 2023
- “SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient”, Ryabinin et al 2023
- “XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models”, Liang et al 2023
- “ClimaX: A Foundation Model for Weather and Climate”, Nguyen et al 2023
- “DataMUX: Data Multiplexing for Neural Networks”, Murahari et al 2023
- “Progress Measures for Grokking via Mechanistic Interpretability”, Nanda et al 2023
- “Scaling Laws for Generative Mixed-Modal Language Models”, Aghajanyan et al 2023
- “Vision Transformers Are Good Mask Auto-Labelers”, Lan et al 2023
- “Why Do Nearest Neighbor Language Models Work?”, Xu et al 2023
- “Cramming: Training a Language Model on a Single GPU in One Day”, Geiping & Goldstein 2022
- “Less Is More: Parameter-Free Text Classification With Gzip”, Jiang et al 2022
- “NBC-Softmax: Darkweb Author Fingerprinting and Migration Tracking”, Kulatilleke et al 2022
- “What Do Vision Transformers Learn? A Visual Exploration”, Ghiasi et al 2022
- “POM: A Principal Odor Map Unifies Diverse Tasks in Human Olfactory Perception”, Lee et al 2022
- “MAGVIT: Masked Generative Video Transformer”, Yu et al 2022
- “VindLU: A Recipe for Effective Video-And-Language Pretraining”, Cheng et al 2022
- “Text Embeddings by Weakly-Supervised Contrastive Pre-Training”, Wang et al 2022
- “Discovering Latent Knowledge in Language Models Without Supervision”, Burns et al 2022
- “NPM: Nonparametric Masked Language Modeling”, Min et al 2022
- “BARTSmiles: Generative Masked Language Models for Molecular Representations”, Chilingaryan et al 2022
- “RGB No More: Minimally-Decoded JPEG Vision Transformers”, Park & Johnson 2022
- “Self-Destructing Models: Increasing the Costs of Harmful Dual Uses of Foundation Models”, Henderson et al 2022
- “A Deep Learning and Digital Archaeology Approach for Mosquito Repellent Discovery”, Wei et al 2022
- “GENIUS: Sketch-Based Language Model Pre-Training via Extreme and Selective Masking for Text Generation and Augmentation”, Guo et al 2022
- “UniSumm: Unified Few-Shot Summarization With Multi-Task Pre-Training and Prefix-Tuning”, Chen et al 2022
- “Uni-Perceiver V2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks”, Li et al 2022
- “Distilled DeepConsensus: Knowledge Distillation for Fast and Accurate DNA Sequence Correction”, Belyaeva et al 2022
- “Massively Multilingual ASR on 70 Languages: Tokenization, Architecture, and Generalization Capabilities”, Tjandra et al 2022
- “OneFormer: One Transformer to Rule Universal Image Segmentation”, Jain et al 2022
- “Characterizing Intrinsic Compositionality in Transformers With Tree Projections”, Murty et al 2022
- “Fast DistilBERT on CPUs”, Shen et al 2022
- “n-Gram Is Back: Residual Learning of Neural Text Generation With n-Gram Language Model”, Li et al 2022
- “Same Pre-Training Loss, Better Downstream: Implicit Bias Matters for Language Models”, Liu et al 2022
- “The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers”, Li et al 2022
- “Noise-Robust De-Duplication at Scale”, Silcock et al 2022
- “Small Character Models Match Large Word Models for Autocomplete Under Memory Constraints”, Jawahar et al 2022
- “Improving Sample Quality of Diffusion Models Using Self-Attention Guidance”, Hong et al 2022
- “Semantic Scene Descriptions As an Objective of Human Vision”, Doerig et al 2022
- “SetFit: Efficient Few-Shot Learning Without Prompts”, Tunstall et al 2022
- “A Generalist Neural Algorithmic Learner”, Ibarz et al 2022
- “Machine Reading, Fast and Slow: When Do Models ‘Understand’ Language?”, Choudhury et al 2022
- “On the Effectiveness of Compact Biomedical Transformers (✱BioBERT)”, Rohanian et al 2022
- “Analyzing Transformers in Embedding Space”, Dar et al 2022
- “ASR2K: Speech Recognition for Around 2,000 Languages without Audio”, Li et al 2022
- “MeloForm: Generating Melody With Musical Form Based on Expert Systems and Neural Networks”, Lu et al 2022
- “CorpusBrain: Pre-Train a Generative Retrieval Model for Knowledge-Intensive Language Tasks”, Chen et al 2022
- “PatchDropout: Economizing Vision Transformers Using Patch Dropout”, Liu et al 2022
- “Why Do Tree-Based Models Still Outperform Deep Learning on Tabular Data?”, Grinsztajn et al 2022
- “Re2G: Retrieve, Rerank, Generate”, Glass et al 2022
- “Transformer Neural Processes: Uncertainty-Aware Meta Learning Via Sequence Modeling”, Nguyen & Grover 2022
- “TabPFN: Meta-Learning a Real-Time Tabular AutoML Method For Small Data”, Hollmann et al 2022
- “Neural Networks and the Chomsky Hierarchy”, Delétang et al 2022
- “Do Loyal Users Enjoy Better Recommendations? Understanding Recommender Accuracy from a Time Perspective”, Ji et al 2022
- “Transfer Learning With Deep Tabular Models”, Levin et al 2022
- “BertNet: Harvesting Knowledge Graphs from Pretrained Language Models”, Hao et al 2022
- “ProGen2: Exploring the Boundaries of Protein Language Models”, Nijkamp et al 2022
- “SBERT Studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features”, Opitz & Frank 2022
- “RHO-LOSS: Prioritized Training on Points That Are Learnable, Worth Learning, and Not Yet Learnt”, Mindermann et al 2022
- “LAVENDER: Unifying Video-Language Understanding As Masked Language Modeling”, Li et al 2022
- “Language Models Are General-Purpose Interfaces”, Hao et al 2022
- “Uni-Perceiver-MoE: Learning Sparse Generalist Models With Conditional MoEs”, Zhu et al 2022
- “Reconstructing the Cascade of Language Processing in the Brain Using the Internal Computations of a Transformer-Based Language Model”, Kumar et al 2022
- “A Neural Corpus Indexer for Document Retrieval”, Wang et al 2022
- “XTC: Extreme Compression for Pre-Trained Transformers Made Simple and Efficient”, Wu et al 2022
- “Toward a Realistic Model of Speech Processing in the Brain With Self-Supervised Learning”, Millet et al 2022
- “Text2Human: Text-Driven Controllable Human Image Generation”, Jiang et al 2022
- “Anime Character Recognition Using Intermediate Features Aggregation”, Rios et al 2022
- “Towards Learning Universal Hyperparameter Optimizers With Transformers”, Chen et al 2022
- “FLEURS: Few-Shot Learning Evaluation of Universal Representations of Speech”, Conneau et al 2022
- “HTPS: HyperTree Proof Search for Neural Theorem Proving”, Lample et al 2022
- “On the Paradox of Learning to Reason from Data”, Zhang et al 2022
- “Housekeep: Tidying Virtual Households Using Commonsense Reasoning”, Kant et al 2022
- “UViM: A Unified Modeling Approach for Vision With Learned Guiding Codes”, Kolesnikov et al 2022
- “Tradformer: A Transformer Model of Traditional Music Transcriptions”, Casini & Sturm 2022
- “Continual Pre-Training Mitigates Forgetting in Language and Vision”, Cossu et al 2022
- “PLAID: An Efficient Engine for Late Interaction Retrieval”, Santhanam et al 2022
- “Few-Shot Parameter-Efficient Fine-Tuning Is Better and Cheaper Than In-Context Learning”, Liu et al 2022
- “SymphonyNet: Symphony Generation With Permutation Invariant Language Model”, Liu et al 2022
- “When Does Dough Become a Bagel? Analyzing the Remaining Mistakes on ImageNet”, Vasudevan et al 2022
- “A Challenging Benchmark of Anime Style Recognition”, Li et al 2022
- “Data Distributional Properties Drive Emergent Few-Shot Learning in Transformers”, Chan et al 2022
- “Masked Siamese Networks for Label-Efficient Learning”, Assran et al 2022
- “DualPrompt: Complementary Prompting for Rehearsal-Free Continual Learning”, Wang et al 2022
- “Language Models That Seek for Knowledge: Modular Search & Generation for Dialogue and Prompt Completion”, Shuster et al 2022
- “On Embeddings for Numerical Features in Tabular Deep Learning”, Gorishniy et al 2022
- “In-Context Learning and Induction Heads”, Olsson et al 2022
- “LiteTransformerSearch: Training-Free Neural Architecture Search for Efficient Language Models”, Javaheripi et al 2022
- “Pretraining without Wordpieces: Learning Over a Vocabulary of Millions of Words”, Feng et al 2022
- “OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-To-Sequence Learning Framework”, Wang et al 2022
- “TACTiS: Transformer-Attentional Copulas for Time Series”, Drouin et al 2022
- “AutoDistil: Few-Shot Task-Agnostic Neural Architecture Search for Distilling Large Language Models”, Xu et al 2022
- “FIGARO: Generating Symbolic Music With Fine-Grained Artistic Control”, Rütte et al 2022
- “Robust Contrastive Learning against Noisy Views”, Chuang et al 2022
- “HyperTransformer: Model Generation for Supervised and Semi-Supervised Few-Shot Learning”, Zhmoginov et al 2022
- “A Mathematical Framework for Transformer Circuits”, Elhage et al 2021
- “PFNs: Transformers Can Do Bayesian Inference”, Müller et al 2021
- “XGLM: Few-Shot Learning With Multilingual Language Models”, Lin et al 2021
- “An Empirical Investigation of the Role of Pre-Training in Lifelong Learning”, Mehta et al 2021
- “AI Improvements in Chemical Calculations”, Lowe 2021
- “You Only Need One Model for Open-Domain Question Answering”, Lee et al 2021
- “Human Parity on CommonsenseQA: Augmenting Self-Attention With External Attention”, Xu et al 2021
- “ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction”, Santhanam et al 2021
- “Uni-Perceiver: Pre-Training Unified Architecture for Generic Perception for Zero-Shot and Few-Shot Tasks”, Zhu et al 2021
- “Inducing Causal Structure for Interpretable Neural Networks (IIT)”, Geiger et al 2021
- “OCR-Free Document Understanding Transformer”, Kim et al 2021
- “FQ-ViT: Fully Quantized Vision Transformer without Retraining”, Lin et al 2021
- “Semi-Supervised Music Tagging Transformer”, Won et al 2021
- “LEMON: Scaling Up Vision-Language Pre-Training for Image Captioning”, Hu et al 2021
- “UNICORN: Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling”, Yang et al 2021
- “Compositional Transformers for Scene Generation”, Hudson & Zitnick 2021
- “It’s About Time: Analog Clock Reading in the Wild”, Yang et al 2021
- “XLS-R: Self-Supervised Cross-Lingual Speech Representation Learning at Scale”, Babu et al 2021
- “A Survey of Visual Transformers”, Liu et al 2021
- “Improving Visual Quality of Image Synthesis by A Token-Based Generator With Transformers”, Zeng et al 2021
- “The Efficiency Misnomer”, Dehghani et al 2021
- “STransGAN: An Empirical Study on Transformer in GANs”, Xu et al 2021
- “Lifelong Pretraining: Continually Adapting Language Models to Emerging Corpora”, Jin et al 2021
- “The Dangers of Underclaiming: Reasons for Caution When Reporting How NLP Systems Fail”, Bowman 2021
- “Palette: Image-To-Image Diffusion Models”, Saharia et al 2021
- “Transformers Are Meta-Reinforcement Learners”, Anonymous 2021
- “Autoregressive Latent Video Prediction With High-Fidelity Image Generator”, Seo et al 2021
- “Skill Induction and Planning With Latent Language”, Sharma et al 2021
- “Text2Brain: Synthesis of Brain Activation Maps from Free-Form Text Query”, Ngo et al 2021
- “Understanding and Overcoming the Challenges of Efficient Transformer Quantization”, Bondarenko et al 2021
- “BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition”, Zhang et al 2021
- “TrOCR: Transformer-Based Optical Character Recognition With Pre-Trained Models”, Li et al 2021
- “MeLT: Message-Level Transformer With Masked Document Representations As Pre-Training for Stance Detection”, Matero et al 2021
- “KroneckerBERT: Learning Kronecker Decomposition for Pre-Trained Language Models via Knowledge Distillation”, Tahaei et al 2021
- “Block Pruning For Faster Transformers”, Lagunas et al 2021
- “The Sensory Neuron As a Transformer: Permutation-Invariant Neural Networks for Reinforcement Learning”, Tang & Ha 2021
- “DeepConsensus: Gap-Aware Sequence Transformers for Sequence Correction”, Baid et al 2021
- “A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP”, Zhao et al 2021
- “Data and Parameter Scaling Laws for Neural Machine Translation”, Gordon et al 2021
- “ImageBART: Bidirectional Context With Multinomial Diffusion for Autoregressive Image Synthesis”, Esser et al 2021
- “Modeling Protein Using Large-Scale Pretrain Language Model”, Xiao et al 2021
- “Billion-Scale Pretraining With Vision Transformers for Multi-Task Visual Representations”, Beal et al 2021
- “EVA: An Open-Domain Chinese Dialogue System With Large-Scale Generative Pre-Training”, Zhou et al 2021
- “Internet-Augmented Dialogue Generation”, Komeili et al 2021
- “HTLM: Hyper-Text Pre-Training and Prompting of Language Models”, Aghajanyan et al 2021
- “SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking”, Formal et al 2021
- “ViTGAN: Training GANs With Vision Transformers”, Lee et al 2021
- “ARM-Net: Adaptive Relation Modeling Network for Structured Data”, Cai et al 2021
- “SCARF: Self-Supervised Contrastive Learning Using Random Feature Corruption”, Bahri et al 2021
- “Charformer: Fast Character Transformers via Gradient-Based Subword Tokenization”, Tay et al 2021
- “BitFit: Simple Parameter-Efficient Fine-Tuning for Transformer-Based Masked Language-Models”, Zaken et al 2021
- “Revisiting the Calibration of Modern Neural Networks”, Minderer et al 2021
- “Scaling Laws for Acoustic Models”, Droppo & Elibol 2021
- “CoAtNet: Marrying Convolution and Attention for All Data Sizes”, Dai et al 2021
- “Chasing Sparsity in Vision Transformers: An End-To-End Exploration”, Chen et al 2021
- “Tabular Data: Deep Learning Is Not All You Need”, Shwartz-Ziv & Armon 2021
- “Self-Attention Between Datapoints: Going Beyond Individual Input-Output Pairs in Deep Learning”, Kossen et al 2021
- “Exploring Transfer Learning Techniques for Named Entity Recognition in Noisy User-Generated Text”, Bogensperger 2021
- “SegFormer: Simple and Efficient Design for Semantic Segmentation With Transformers”, Xie et al 2021
- “Maximizing 3-D Parallelism in Distributed Training for Huge Neural Networks”, Bian et al 2021
- “One4all User Representation for Recommender Systems in E-Commerce”, Shin et al 2021
- “QASPER: A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers”, Dasigi et al 2021
- “MathBERT: A Pre-Trained Model for Mathematical Formula Understanding”, Peng et al 2021
- “MDETR—Modulated Detection for End-To-End Multi-Modal Understanding”, Kamath et al 2021
- “XLM-T: Multilingual Language Models in Twitter for Sentiment Analysis and Beyond”, Barbieri et al 2021
- “[Ali Released PLUG: 27 Billion Parameters, the Largest Pre-Trained Language Model in the Chinese Community]”, Yuying 2021
- “SimCSE: Simple Contrastive Learning of Sentence Embeddings”, Gao et al 2021
- “Robust Open-Vocabulary Translation from Visual Text Representations”, Salesky et al 2021
- “Memorization versus Generalization in Pre-Trained Language Models”, Tänzer et al 2021
- “Retrieval Augmentation Reduces Hallucination in Conversation”, Shuster et al 2021
- “Gradient-Based Adversarial Attacks against Text Transformers”, Guo et al 2021
- “TSDAE: Using Transformer-Based Sequential Denoising Autoencoder for Unsupervised Sentence Embedding Learning”, Wang et al 2021
- “Machine Translation Decoding beyond Beam Search”, Leblond et al 2021
- “An Empirical Study of Training Self-Supervised Vision Transformers”, Chen et al 2021
- “ChinAI #137: Year 3 of ChinAI: Reflections on the Newsworthiness of Machine Translation”, Ding 2021
- “GPV-1: Towards General Purpose Vision Systems”, Gupta et al 2021
- “DeepViT: Towards Deeper Vision Transformer”, Zhou et al 2021
- “ConViT: Improving Vision Transformers With Soft Convolutional Inductive Biases”, d’Ascoli et al 2021
- “Get Your Vitamin C! Robust Fact Verification With Contrastive Evidence (VitaminC)”, Schuster et al 2021
- “Learning from Videos to Understand the World”, Zweig et al 2021
- “Are NLP Models Really Able to Solve Simple Math Word Problems?”, Patel et al 2021
- “CANINE: Pre-Training an Efficient Tokenization-Free Encoder for Language Representation”, Clark et al 2021
- “TransGAN: Two Transformers Can Make One Strong GAN”, Jiang et al 2021
- “Baller2vec: A Multi-Entity Transformer For Multi-Agent Spatiotemporal Modeling”, Alcorn & Nguyen 2021
- “ViLT: Vision-And-Language Transformer Without Convolution or Region Supervision”, Kim et al 2021
- “Video Transformer Network”, Neimark et al 2021
- “Tokens-To-Token ViT: Training Vision Transformers from Scratch on ImageNet”, Yuan et al 2021
- “BENDR: Using Transformers and a Contrastive Self-Supervised Learning Task to Learn from Massive Amounts of EEG Data”, Kostas et al 2021
- “Bottleneck Transformers for Visual Recognition”, Srinivas et al 2021
- “DAF:re: A Challenging, Crowd-Sourced, Large-Scale, Long-Tailed Dataset For Anime Character Recognition”, Rios et al 2021
- “UPDeT: Universal Multi-Agent Reinforcement Learning via Policy Decoupling With Transformers”, Hu et al 2021
- “MSR-VTT: A Large Video Description Dataset for Bridging Video and Language”, Xu et al 2021
- “XMC-GAN: Cross-Modal Contrastive Learning for Text-To-Image Generation”, Zhang et al 2021
- “Superbizarre Is Not Superb: Derivational Morphology Improves BERT’s Interpretation of Complex Words”, Hofmann et al 2021
- “Training Data-Efficient Image Transformers & Distillation through Attention”, Touvron et al 2020
- “VQ-GAN: Taming Transformers for High-Resolution Image Synthesis”, Esser et al 2020
- “Object-Based Attention for Spatio-Temporal Reasoning: Outperforming Neuro-Symbolic Models With Flexible Distributed Architectures”, Ding et al 2020
- “Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences”, Rives et al 2020
- “Progressively Stacking 2.0: A Multi-Stage Layerwise Training Method for BERT Training Speedup”, Yang et al 2020
- “TStarBot-X: An Open-Sourced and Comprehensive Study for Efficient League Training in StarCraft II Full Game”, Han et al 2020
- “A Recurrent Vision-And-Language BERT for Navigation”, Hong et al 2020
- “A Primer in BERTology: What We Know about How BERT Works”, Rogers et al 2020
- “CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters”, Boukkouri et al 2020
- “TernaryBERT: Distillation-Aware Ultra-Low Bit BERT”, Zhang et al 2020
- “Weird AI Yankovic: Generating Parody Lyrics”, Riedl 2020
- “It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners”, Schick & Schütze 2020
- “DeepSpeed: Extreme-Scale Model Training for Everyone”, Team et al 2020
- “Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing”, Gu et al 2020
- “CoVoST 2 and Massively Multilingual Speech-To-Text Translation”, Wang et al 2020
- “Modern Hopfield Networks and Attention for Immune Repertoire Classification”, Widrich et al 2020
- “Hopfield Networks Is All You Need”, Ramsauer et al 2020
- “Can Neural Networks Acquire a Structural Bias from Raw Linguistic Data?”, Warstadt & Bowman 2020
- “DeepSinger: Singing Voice Synthesis With Data Mined From the Web”, Ren et al 2020
- “Data Movement Is All You Need: A Case Study on Optimizing Transformers”, Ivanov et al 2020
- “Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations”, Baevski et al 2020
- “PipeDream-2BW: Memory-Efficient Pipeline-Parallel DNN Training”, Narayanan et al 2020
- “Learning to Learn With Feedback and Local Plasticity”, Lindsey & Litwin-Kumar 2020
- “Improving GAN Training With Probability Ratio Clipping and Sample Reweighting”, Wu et al 2020
- “DeBERTa: Decoding-Enhanced BERT With Disentangled Attention”, He et al 2020
- “DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations”, Giorgi et al 2020
- “DETR: End-To-End Object Detection With Transformers”, Carion et al 2020
- “Open-Retrieval Conversational Question Answering”, Qu et al 2020
- “TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data”, Yin et al 2020
- “ForecastQA: A Question Answering Challenge for Event Forecasting With Temporal Text Data”, Jin et al 2020
- “VLN-BERT: Improving Vision-And-Language Navigation With Image-Text Pairs from the Web”, Majumdar et al 2020
- “Blender: A State-Of-The-Art Open Source Chatbot”, Roller et al 2020
- “General Purpose Text Embeddings from Pre-Trained Language Models for Scalable Inference”, Du et al 2020
- “Recipes for Building an Open-Domain Chatbot”, Roller et al 2020
- “Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks”, Gururangan et al 2020
- “On the Effect of Dropping Layers of Pre-Trained Transformer Models”, Sajjad et al 2020
- “Rapformer: Conditional Rap Lyrics Generation With Denoising Autoencoders”, Nikolov et al 2020
- “TAPAS: Weakly Supervised Table Parsing via Pre-Training”, Herzig et al 2020
- “A Hundred Visions and Revisions”, Binder 2020
- “Rethinking Parameter Counting in Deep Models: Effective Dimensionality Revisited”, Maddox et al 2020
- “AraBERT: Transformer-Based Model for Arabic Language Understanding”, Antoun et al 2020
- “MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers”, Wang et al 2020
- “GNS: Learning to Simulate Complex Physics With Graph Networks”, Sanchez-Gonzalez et al 2020
- “Do We Need Zero Training Loss After Achieving Zero Training Error?”, Ishida et al 2020
- “Bayesian Deep Learning and a Probabilistic Perspective of Generalization”, Wilson & Izmailov 2020
- “Transformers As Soft Reasoners over Language”, Clark et al 2020
- “Towards a Conversational Agent That Can Chat About…Anything”, Adiwardana & Luong 2020
- “Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference”, Schick & Schütze 2020
- “Improving Transformer Optimization Through Better Initialization”, Huang 2020
- “VIME: Extending the Success of Self-Supervised and Semi-Supervised Learning to Tabular Domain”, Yoon et al 2020
- “Measuring Compositional Generalization: A Comprehensive Method on Realistic Data”, Keysers et al 2019
- “Mastering Complex Control in MOBA Games With Deep Reinforcement Learning”, Ye et al 2019
- “PEGASUS: Pre-Training With Extracted Gap-Sentences for Abstractive Summarization”, Zhang et al 2019
- “Encoding Musical Style With Transformer Autoencoders”, Choi et al 2019
- “Deep Double Descent: We Show That the Double Descent Phenomenon Occurs in CNNs, ResNets, and Transformers: Performance First Improves, Then Gets Worse, and Then Improves Again With Increasing Model Size, Data Size, or Training Time”, Nakkiran et al 2019
- “Detecting GAN Generated Errors”, Zhu et al 2019
- “SimpleBooks: Long-Term Dependency Book Dataset With Simplified English Vocabulary for Word-Level Language Modeling”, Nguyen 2019
- “Unsupervised Cross-Lingual Representation Learning at Scale”, Conneau et al 2019
- “DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter”, Sanh et al 2019
- “TinyBERT: Distilling BERT for Natural Language Understanding”, Jiao et al 2019
- “Do NLP Models Know Numbers? Probing Numeracy in Embeddings”, Wallace et al 2019
- “PubMedQA: A Dataset for Biomedical Research Question Answering”, Jin et al 2019
- “Frustratingly Easy Natural Question Answering”, Pan et al 2019
- “Distributionally Robust Language Modeling”, Oren et al 2019
- “Language Models As Knowledge Bases?”, Petroni et al 2019
- “Encode, Tag, Realize: High-Precision Text Editing”, Malmi et al 2019
- “Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks”, Reimers & Gurevych 2019
- “Well-Read Students Learn Better: On the Importance of Pre-Training Compact Models”, Turc et al 2019
- “TabNet: Attentive Interpretable Tabular Learning”, Arik & Pfister 2019
- “StructBERT: Incorporating Language Structures into Pre-Training for Deep Language Understanding”, Wang et al 2019
- “What BERT Is Not: Lessons from a New Suite of Psycholinguistic Diagnostics for Language Models”, Ettinger 2019
- “RoBERTa: A Robustly Optimized BERT Pretraining Approach”, Liu et al 2019
- “Theoretical Limitations of Self-Attention in Neural Sequence Models”, Hahn 2019
- “Energy and Policy Considerations for Deep Learning in NLP”, Strubell et al 2019
- “Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned”, Voita et al 2019
- “HellaSwag: Can a Machine Really Finish Your Sentence?”, Zellers et al 2019
- “UniLM: Unified Language Model Pre-Training for Natural Language Understanding and Generation”, Dong et al 2019
- “MASS: Masked Sequence to Sequence Pre-Training for Language Generation”, Song et al 2019
- “Mask-Predict: Parallel Decoding of Conditional Masked Language Models”, Ghazvininejad et al 2019
- “Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes”, You et al 2019
- “LIGHT: Learning to Speak and Act in a Fantasy Text Adventure Game”, Urbanek et al 2019
- “Insertion Transformer: Flexible Sequence Generation via Insertion Operations”, Stern et al 2019
- “Adapter: Parameter-Efficient Transfer Learning for NLP”, Houlsby et al 2019
- “Learning and Evaluating General Linguistic Intelligence”, Yogatama et al 2019
- “BioBERT: a Pre-Trained Biomedical Language Representation Model for Biomedical Text Mining”, Lee et al 2019
- “Efficient Training of BERT by Progressively Stacking”, Gong et al 2019
- “Bayesian Layers: A Module for Neural Network Uncertainty”, Tran et al 2018
- “Blockwise Parallel Decoding for Deep Autoregressive Models”, Stern et al 2018
- “Object Hallucination in Image Captioning”, Rohrbach et al 2018
- “Self-Attention Generative Adversarial Networks”, Zhang et al 2018
- “Universal Sentence Encoder”, Cer et al 2018
- “Self-Attention With Relative Position Representations”, Shaw et al 2018
- “Learning Longer-Term Dependencies in RNNs With Auxiliary Losses”, Trinh et al 2018
- “Generating Structured Music through Self-Attention”, Huang et al 2018
- “GPipe: Easy Scaling With Micro-Batch Pipeline Parallelism”, Huang 2018 (page 4)
- “A Simple Neural Attentive Meta-Learner”, Mishra et al 2017
- “Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer”, Zagoruyko & Komodakis 2016
- “QRNNs: Quasi-Recurrent Neural Networks”, Bradbury et al 2016
- “Gaussian Error Linear Units (GELUs)”, Hendrycks & Gimpel 2016
- “Pointer Networks”, Vinyals et al 2015
- “No Physics? No Problem. AI Weather Forecasting Is Already Making Huge Strides.”
- “Huggingface: transformers Repo”, Huggingface 2024
- “Transformers in Vision”
- “The Illustrated GPT-2 (Visualizing Transformer Language Models)”
- “The Illustrated Transformer”
- “Autoregressive Long-Context Music Generation With Perceiver AR”
- “The Transformer—Attention Is All You Need.”
- “Understanding BERT Transformer: Attention Isn’t All You Need”, Sileo 2024
- “Etched Is Making the Biggest Bet in AI”
- “Was Linguistic A.I. Created by Accident?”
- “Transformers Are a Very Exciting Family of Machine Learning Architectures”, Bloem 2024
- Sort By Magic
- Wikipedia
- Miscellaneous
- Bibliography
See Also
Links
“Gemma 2: Improving Open Language Models at a Practical Size”, Riviere et al 2024
“Investigating the Ability of LLMs to Recognize Their Own Writing”, Ackerman & Panickssery 2024
Investigating the Ability of LLMs to Recognize Their Own Writing
“Questionable Practices in Machine Learning”, Leech et al 2024
“Revealing Fine-Grained Values and Opinions in Large Language Models”, Wright et al 2024
Revealing Fine-Grained Values and Opinions in Large Language Models
“BERTs Are Generative In-Context Learners”, Samuel 2024
“Learning to Grok: Emergence of In-Context Learning and Skill Composition in Modular Arithmetic Tasks”, He et al 2024
Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks
“Grokfast: Accelerated Grokking by Amplifying Slow Gradients”, Lee et al 2024
“Not All Language Model Features Are Linear”, Engels et al 2024
“You Only Cache Once: Decoder-Decoder Architectures for Language Models”, Sun et al 2024
You Only Cache Once: Decoder-Decoder Architectures for Language Models
“Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge”, Batsuren et al 2024
Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge
“Chinchilla Scaling: A Replication Attempt”, Besiroglu et al 2024
“Exploring Concept Depth: How Large Language Models Acquire Knowledge at Different Layers?”, Jin et al 2024
Exploring Concept Depth: How Large Language Models Acquire Knowledge at Different Layers?
“Conformer-1: Robust ASR via Large-Scale Semi-Supervised Bootstrapping”, Zhang et al 2024
Conformer-1: Robust ASR via Large-Scale Semi-supervised Bootstrapping
“MiniCPM: Unveiling the Potential of Small Language Models With Scalable Training Strategies”, Hu et al 2024
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
“Language Models Accurately Infer Correlations between Psychological Items and Scales from Text Alone”, Hommel & Arslan 2024
Language models accurately infer correlations between psychological items and scales from text alone
“Privacy Backdoors: Stealing Data With Corrupted Pretrained Models”, Feng & Tramèr 2024
Privacy Backdoors: Stealing Data with Corrupted Pretrained Models
“Language Models Learn Rare Phenomena from Less Rare Phenomena: The Case of the Missing AANNs”, Misra & Mahowald 2024
Language Models Learn Rare Phenomena from Less Rare Phenomena: The Case of the Missing AANNs
“A Study in Dataset Pruning for Image Super-Resolution”, Moser et al 2024
“AI and Memory Wall”, Gholami et al 2024
“Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey”, Han et al 2024
Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey
“Inflection-2.5: Meet the World’s Best Personal AI”, Inflection 2024
“LTE: Training Neural Networks from Scratch With Parallel Low-Rank Adapters”, Huh et al 2024
LTE: Training Neural Networks from Scratch with Parallel Low-Rank Adapters
“Beyond A✱: Better Planning With Transformers via Search Dynamics Bootstrapping (Searchformer)”, Lehnert et al 2024
Beyond A✱: Better Planning with Transformers via Search Dynamics Bootstrapping (Searchformer)
“KARL: Knowledge-Aware Retrieval and Representations Aid Retention and Learning in Students”, Shu et al 2024
KARL: Knowledge-Aware Retrieval and Representations aid Retention and Learning in Students
“Do Llamas Work in English? On the Latent Language of Multilingual Transformers”, Wendler et al 2024
Do Llamas Work in English? On the Latent Language of Multilingual Transformers
“DE-COP: Detecting Copyrighted Content in Language Models Training Data”, Duarte et al 2024
DE-COP: Detecting Copyrighted Content in Language Models Training Data
“Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift”, Qiu et al 2024
Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift
“The Manga Whisperer: Automatically Generating Transcriptions for Comics”, Sachdeva & Zisserman 2024
The Manga Whisperer: Automatically Generating Transcriptions for Comics
“A Philosophical Introduction to Language Models—Part I: Continuity With Classic Debates”, Millière & Buckner 2024
A Philosophical Introduction to Language Models—Part I: Continuity With Classic Debates
“Solving Olympiad Geometry without Human Demonstrations”, Trinh et al 2024
“Real-Time AI & The Future of AI Hardware”, Uberti 2023
“Seamless: Multilingual Expressive and Streaming Speech Translation”, Communication et al 2023
Seamless: Multilingual Expressive and Streaming Speech Translation
“Scaling Transformer Neural Networks for Skillful and Reliable Medium-Range Weather Forecasting”, Nguyen et al 2023
Scaling transformer neural networks for skillful and reliable medium-range weather forecasting
“The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning”, Lin et al 2023
The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning
“GIVT: Generative Infinite-Vocabulary Transformers”, Tschannen et al 2023
“Sequential Modeling Enables Scalable Learning for Large Vision Models”, Bai et al 2023
Sequential Modeling Enables Scalable Learning for Large Vision Models
“DiLoCo: Distributed Low-Communication Training of Language Models”, Douillard et al 2023
DiLoCo: Distributed Low-Communication Training of Language Models
“CogVLM: Visual Expert for Pretrained Language Models”, Wang et al 2023
“GLaMM: Pixel Grounding Large Multimodal Model”, Rasheed et al 2023
“Don’t Make Your LLM an Evaluation Benchmark Cheater”, Zhou et al 2023
“ProSG: Using Prompt Synthetic Gradients to Alleviate Prompt Forgetting of RNN-Like Language Models”, Luo et al 2023
ProSG: Using Prompt Synthetic Gradients to Alleviate Prompt Forgetting of RNN-like Language Models
“EELBERT: Tiny Models through Dynamic Embeddings”, Cohn et al 2023
“LLM-FP4: 4-Bit Floating-Point Quantized Transformers”, Liu et al 2023
“Will Releasing the Weights of Large Language Models Grant Widespread Access to Pandemic Agents?”, Gopal et al 2023
Will releasing the weights of large language models grant widespread access to pandemic agents?
“Model Merging by Uncertainty-Based Gradient Matching”, Daheim et al 2023
“To Grok or Not to Grok: Disentangling Generalization and Memorization on Corrupted Algorithmic Datasets”, Doshi et al 2023
“Sparse Universal Transformer”, Tan et al 2023
“Sheared LLaMA: Accelerating Language Model Pre-Training via Structured Pruning”, Xia et al 2023
Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
“Language Models Represent Space and Time”, Gurnee & Tegmark 2023
“DeWave: Discrete EEG Waves Encoding for Brain Dynamics to Text Translation”, Duan et al 2023
DeWave: Discrete EEG Waves Encoding for Brain Dynamics to Text Translation
“Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions”, Chebotar et al 2023
Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions
“Demystifying RCE Vulnerabilities in LLM-Integrated Apps”, Liu et al 2023
“A Pooled Cell Painting CRISPR Screening Platform Enables de Novo Inference of Gene Function by Self-Supervised Deep Learning”, Sivanandan et al 2023
“Nougat: Neural Optical Understanding for Academic Documents”, Blecher et al 2023
“SeamlessM4T: Massively Multilingual & Multimodal Machine Translation”, Communication et al 2023
SeamlessM4T: Massively Multilingual & Multimodal Machine Translation
“Predicting Brain Activity Using Transformers”, Adeli et al 2023
“Copy Is All You Need”, Lan et al 2023
“HEADLINES: A Massive Scale Semantic Similarity Dataset of Historical English”, Silcock & Dell 2023
HEADLINES: A Massive Scale Semantic Similarity Dataset of Historical English
“Expanding the Methodological Toolbox: Machine-Based Item Desirability Ratings As an Alternative to Human-Based Ratings”, Hommel 2023
“OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents”, Laurençon et al 2023
OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents
“RGD: Stochastic Re-Weighted Gradient Descent via Distributionally Robust Optimization”, Kumar et al 2023
RGD: Stochastic Re-weighted Gradient Descent via Distributionally Robust Optimization
“SequenceMatch: Imitation Learning for Autoregressive Sequence Modeling With Backtracking”, Cundy & Ermon 2023
SequenceMatch: Imitation Learning for Autoregressive Sequence Modeling with Backtracking
“Using Sequences of Life-Events to Predict Human Lives”, Savcisens et al 2023
“Binary and Ternary Natural Language Generation”, Liu et al 2023
“AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration”, Lin et al 2023
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
“The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora With Web Data, and Web Data Only”, Penedo et al 2023
“Learning Transformer Programs”, Friedman et al 2023
“FERMAT: An Alternative to Accuracy for Numerical Reasoning”, Sivakumar & Moosavi 2023
“Translatotron 3: Speech to Speech Translation With Monolingual Data”, Nachmani et al 2023
Translatotron 3: Speech to Speech Translation with Monolingual Data
“Deep Learning Based Forecasting: a Case Study from the Online Fashion Industry”, Kunz et al 2023
Deep Learning based Forecasting: a case study from the online fashion industry
“Scaling Laws for Language Encoding Models in FMRI”, Antonello et al 2023
“DarkBERT: A Language Model for the Dark Side of the Internet”, Jin et al 2023
DarkBERT: A Language Model for the Dark Side of the Internet
“Mitigating Lies in Vision-Language Models”, Li et al 2023
“VendorLink: An NLP Approach for Identifying & Linking Vendor Migrants & Potential Aliases on Darknet Markets”, Saxena et al 2023
“Visual Instruction Tuning”, Liu et al 2023
“Segment Anything”, Kirillov et al 2023
“A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision”, Beyer et al 2023
A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision
“When and How Artificial Intelligence Augments Employee Creativity”, Jia et al 2023
When and How Artificial Intelligence Augments Employee Creativity
“Trained on 100 Million Words and Still in Shape: BERT Meets British National Corpus”, Samuel et al 2023
Trained on 100 million words and still in shape: BERT meets British National Corpus
“Mitigating YouTube Recommendation Polarity Using BERT and K-Means Clustering”, Ahmad et al 2023
Mitigating YouTube Recommendation Polarity using BERT and K-Means Clustering
“Model Scale versus Domain Knowledge in Statistical Forecasting of Chaotic Systems”, Gilpin 2023
Model scale versus domain knowledge in statistical forecasting of chaotic systems
“Tag2Text: Guiding Vision-Language Model via Image Tagging”, Huang et al 2023
“The Man of Your Dreams For $300, Replika Sells an AI Companion Who Will Never Die, Argue, or Cheat—Until His Algorithm Is Updated”, Singh-Kurtz 2023
“Towards Democratizing Joint-Embedding Self-Supervised Learning”, Bordes et al 2023
Towards Democratizing Joint-Embedding Self-Supervised Learning
“MUX-PLMs: Pre-Training Language Models With Data Multiplexing”, Murahari et al 2023
MUX-PLMs: Pre-training Language Models with Data Multiplexing
“Optical Transformers”, Anderson et al 2023
“Scaling Vision Transformers to 22 Billion Parameters”, Dehghani et al 2023
“BMT: Binarized Neural Machine Translation”, Zhang et al 2023
“V1T: Large-Scale Mouse V1 Response Prediction Using a Vision Transformer”, Li et al 2023
V1T: large-scale mouse V1 response prediction using a Vision Transformer
“The BabyLM Challenge: Sample-Efficient Pretraining on a Developmentally Plausible Corpus”, Warstadt et al 2023
The BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus
“SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient”, Ryabinin et al 2023
SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient
“XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models”, Liang et al 2023
XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models
“ClimaX: A Foundation Model for Weather and Climate”, Nguyen et al 2023
“DataMUX: Data Multiplexing for Neural Networks”, Murahari et al 2023
“Progress Measures for Grokking via Mechanistic Interpretability”, Nanda et al 2023
Progress measures for grokking via mechanistic interpretability
“Scaling Laws for Generative Mixed-Modal Language Models”, Aghajanyan et al 2023
“Vision Transformers Are Good Mask Auto-Labelers”, Lan et al 2023
“Why Do Nearest Neighbor Language Models Work?”, Xu et al 2023
“Cramming: Training a Language Model on a Single GPU in One Day”, Geiping & Goldstein 2022
Cramming: Training a Language Model on a Single GPU in One Day
“Less Is More: Parameter-Free Text Classification With Gzip”, Jiang et al 2022
“NBC-Softmax: Darkweb Author Fingerprinting and Migration Tracking”, Kulatilleke et al 2022
NBC-Softmax: Darkweb Author fingerprinting and migration tracking
“What Do Vision Transformers Learn? A Visual Exploration”, Ghiasi et al 2022
“POM: A Principal Odor Map Unifies Diverse Tasks in Human Olfactory Perception”, Lee et al 2022
POM: A Principal Odor Map Unifies Diverse Tasks in Human Olfactory Perception
“MAGVIT: Masked Generative Video Transformer”, Yu et al 2022
“VindLU: A Recipe for Effective Video-And-Language Pretraining”, Cheng et al 2022
VindLU: A Recipe for Effective Video-and-Language Pretraining
“Text Embeddings by Weakly-Supervised Contrastive Pre-Training”, Wang et al 2022
Text Embeddings by Weakly-Supervised Contrastive Pre-training
“Discovering Latent Knowledge in Language Models Without Supervision”, Burns et al 2022
Discovering Latent Knowledge in Language Models Without Supervision
“NPM: Nonparametric Masked Language Modeling”, Min et al 2022
“BARTSmiles: Generative Masked Language Models for Molecular Representations”, Chilingaryan et al 2022
BARTSmiles: Generative Masked Language Models for Molecular Representations
“RGB No More: Minimally-Decoded JPEG Vision Transformers”, Park & Johnson 2022
“Self-Destructing Models: Increasing the Costs of Harmful Dual Uses of Foundation Models”, Henderson et al 2022
Self-Destructing Models: Increasing the Costs of Harmful Dual Uses of Foundation Models
“A Deep Learning and Digital Archaeology Approach for Mosquito Repellent Discovery”, Wei et al 2022
A deep learning and digital archaeology approach for mosquito repellent discovery
“GENIUS: Sketch-Based Language Model Pre-Training via Extreme and Selective Masking for Text Generation and Augmentation”, Guo et al 2022
“UniSumm: Unified Few-Shot Summarization With Multi-Task Pre-Training and Prefix-Tuning”, Chen et al 2022
UniSumm: Unified Few-shot Summarization with Multi-Task Pre-Training and Prefix-Tuning
“Uni-Perceiver V2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks”, Li et al 2022
Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks
“Distilled DeepConsensus: Knowledge Distillation for Fast and Accurate DNA Sequence Correction”, Belyaeva et al 2022
Distilled DeepConsensus: Knowledge distillation for fast and accurate DNA sequence correction
“Massively Multilingual ASR on 70 Languages: Tokenization, Architecture, and Generalization Capabilities”, Tjandra et al 2022
“OneFormer: One Transformer to Rule Universal Image Segmentation”, Jain et al 2022
OneFormer: One Transformer to Rule Universal Image Segmentation
“Characterizing Intrinsic Compositionality in Transformers With Tree Projections”, Murty et al 2022
Characterizing Intrinsic Compositionality in Transformers with Tree Projections
“Fast DistilBERT on CPUs”, Shen et al 2022
“n-Gram Is Back: Residual Learning of Neural Text Generation With n-Gram Language Model”, Li et al 2022
n-gram Is Back: Residual Learning of Neural Text Generation with n-gram Language Model
“Same Pre-Training Loss, Better Downstream: Implicit Bias Matters for Language Models”, Liu et al 2022
Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models
“The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers”, Li et al 2022
The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers
“Noise-Robust De-Duplication at Scale”, Silcock et al 2022
“Small Character Models Match Large Word Models for Autocomplete Under Memory Constraints”, Jawahar et al 2022
Small Character Models Match Large Word Models for Autocomplete Under Memory Constraints
“Improving Sample Quality of Diffusion Models Using Self-Attention Guidance”, Hong et al 2022
Improving Sample Quality of Diffusion Models Using Self-Attention Guidance
“Semantic Scene Descriptions As an Objective of Human Vision”, Doerig et al 2022
“SetFit: Efficient Few-Shot Learning Without Prompts”, Tunstall et al 2022
“A Generalist Neural Algorithmic Learner”, Ibarz et al 2022
“Machine Reading, Fast and Slow: When Do Models "Understand" Language?”, Choudhury et al 2022
Machine Reading, Fast and Slow: When Do Models "Understand" Language?
“On the Effectiveness of Compact Biomedical Transformers (✱BioBERT)”, Rohanian et al 2022
On the Effectiveness of Compact Biomedical Transformers (✱BioBERT)
“Analyzing Transformers in Embedding Space”, Dar et al 2022
“ASR2K: Speech Recognition for Around 2,000 Languages without Audio”, Li et al 2022
ASR2K: Speech Recognition for Around 2,000 Languages without Audio
“MeloForm: Generating Melody With Musical Form Based on Expert Systems and Neural Networks”, Lu et al 2022
MeloForm: Generating Melody with Musical Form based on Expert Systems and Neural Networks
“CorpusBrain: Pre-Train a Generative Retrieval Model for Knowledge-Intensive Language Tasks”, Chen et al 2022
CorpusBrain: Pre-train a Generative Retrieval Model for Knowledge-Intensive Language Tasks
“PatchDropout: Economizing Vision Transformers Using Patch Dropout”, Liu et al 2022
PatchDropout: Economizing Vision Transformers Using Patch Dropout
“Why Do Tree-Based Models Still Outperform Deep Learning on Tabular Data?”, Grinsztajn et al 2022
- “Re2G: Retrieve, Rerank, Generate”, Glass et al 2022
- “Transformer Neural Processes: Uncertainty-Aware Meta Learning Via Sequence Modeling”, Nguyen & Grover 2022
- “TabPFN: Meta-Learning a Real-Time Tabular AutoML Method For Small Data”, Hollmann et al 2022
- “Neural Networks and the Chomsky Hierarchy”, Delétang et al 2022
- “Do Loyal Users Enjoy Better Recommendations? Understanding Recommender Accuracy from a Time Perspective”, Ji et al 2022
- “Transfer Learning With Deep Tabular Models”, Levin et al 2022
- “BertNet: Harvesting Knowledge Graphs from Pretrained Language Models”, Hao et al 2022
- “ProGen2: Exploring the Boundaries of Protein Language Models”, Nijkamp et al 2022
- “SBERT Studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features”, Opitz & Frank 2022
- “RHO-LOSS: Prioritized Training on Points That Are Learnable, Worth Learning, and Not Yet Learnt”, Mindermann et al 2022
- “LAVENDER: Unifying Video-Language Understanding As Masked Language Modeling”, Li et al 2022
- “Language Models Are General-Purpose Interfaces”, Hao et al 2022
- “Uni-Perceiver-MoE: Learning Sparse Generalist Models With Conditional MoEs”, Zhu et al 2022
- “Reconstructing the Cascade of Language Processing in the Brain Using the Internal Computations of a Transformer-Based Language Model”, Kumar et al 2022
- “A Neural Corpus Indexer for Document Retrieval”, Wang et al 2022
- “XTC: Extreme Compression for Pre-Trained Transformers Made Simple and Efficient”, Wu et al 2022
- “Toward a Realistic Model of Speech Processing in the Brain With Self-Supervised Learning”, Millet et al 2022
- “Text2Human: Text-Driven Controllable Human Image Generation”, Jiang et al 2022
- “Anime Character Recognition Using Intermediate Features Aggregation”, Rios et al 2022
- “Towards Learning Universal Hyperparameter Optimizers With Transformers”, Chen et al 2022
- “FLEURS: Few-Shot Learning Evaluation of Universal Representations of Speech”, Conneau et al 2022
- “HTPS: HyperTree Proof Search for Neural Theorem Proving”, Lample et al 2022
- “On the Paradox of Learning to Reason from Data”, Zhang et al 2022
- “Housekeep: Tidying Virtual Households Using Commonsense Reasoning”, Kant et al 2022
- “UViM: A Unified Modeling Approach for Vision With Learned Guiding Codes”, Kolesnikov et al 2022
- “Tradformer: A Transformer Model of Traditional Music Transcriptions”, Casini & Sturm 2022
- “Continual Pre-Training Mitigates Forgetting in Language and Vision”, Cossu et al 2022
- “PLAID: An Efficient Engine for Late Interaction Retrieval”, Santhanam et al 2022
- “Few-Shot Parameter-Efficient Fine-Tuning Is Better and Cheaper Than In-Context Learning”, Liu et al 2022
- “SymphonyNet: Symphony Generation With Permutation Invariant Language Model”, Liu et al 2022
- “When Does Dough Become a Bagel? Analyzing the Remaining Mistakes on ImageNet”, Vasudevan et al 2022
- “A Challenging Benchmark of Anime Style Recognition”, Li et al 2022
- “Data Distributional Properties Drive Emergent Few-Shot Learning in Transformers”, Chan et al 2022
- “Masked Siamese Networks for Label-Efficient Learning”, Assran et al 2022
- “DualPrompt: Complementary Prompting for Rehearsal-Free Continual Learning”, Wang et al 2022
- “Language Models That Seek for Knowledge: Modular Search & Generation for Dialogue and Prompt Completion”, Shuster et al 2022
- “On Embeddings for Numerical Features in Tabular Deep Learning”, Gorishniy et al 2022
- “In-Context Learning and Induction Heads”, Olsson et al 2022
- “LiteTransformerSearch: Training-Free Neural Architecture Search for Efficient Language Models”, Javaheripi et al 2022
- “Pretraining without Wordpieces: Learning Over a Vocabulary of Millions of Words”, Feng et al 2022
- “OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-To-Sequence Learning Framework”, Wang et al 2022
- “TACTiS: Transformer-Attentional Copulas for Time Series”, Drouin et al 2022
- “AutoDistil: Few-Shot Task-Agnostic Neural Architecture Search for Distilling Large Language Models”, Xu et al 2022
- “FIGARO: Generating Symbolic Music With Fine-Grained Artistic Control”, Rütte et al 2022
- “Robust Contrastive Learning against Noisy Views”, Chuang et al 2022
- “HyperTransformer: Model Generation for Supervised and Semi-Supervised Few-Shot Learning”, Zhmoginov et al 2022
- “A Mathematical Framework for Transformer Circuits”, Elhage et al 2021
- “PFNs: Transformers Can Do Bayesian Inference”, Müller et al 2021
- “XGLM: Few-Shot Learning With Multilingual Language Models”, Lin et al 2021
- “An Empirical Investigation of the Role of Pre-Training in Lifelong Learning”, Mehta et al 2021
- “AI Improvements in Chemical Calculations”, Lowe 2021
- “You Only Need One Model for Open-Domain Question Answering”, Lee et al 2021
- “Human Parity on CommonsenseQA: Augmenting Self-Attention With External Attention”, Xu et al 2021
- “ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction”, Santhanam et al 2021
- “Uni-Perceiver: Pre-Training Unified Architecture for Generic Perception for Zero-Shot and Few-Shot Tasks”, Zhu et al 2021
- “Inducing Causal Structure for Interpretable Neural Networks (IIT)”, Geiger et al 2021
- “OCR-Free Document Understanding Transformer”, Kim et al 2021
- “FQ-ViT: Fully Quantized Vision Transformer without Retraining”, Lin et al 2021
- “Semi-Supervised Music Tagging Transformer”, Won et al 2021
- “LEMON: Scaling Up Vision-Language Pre-Training for Image Captioning”, Hu et al 2021
- “UNICORN: Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling”, Yang et al 2021
- “Compositional Transformers for Scene Generation”, Hudson & Zitnick 2021
- “It’s About Time: Analog Clock Reading in the Wild”, Yang et al 2021
- “XLS-R: Self-Supervised Cross-Lingual Speech Representation Learning at Scale”, Babu et al 2021
- “A Survey of Visual Transformers”, Liu et al 2021
- “Improving Visual Quality of Image Synthesis by A Token-Based Generator With Transformers”, Zeng et al 2021
- “The Efficiency Misnomer”, Dehghani et al 2021
- “STransGAN: An Empirical Study on Transformer in GANs”, Xu et al 2021
- “Lifelong Pretraining: Continually Adapting Language Models to Emerging Corpora”, Jin et al 2021
- “The Dangers of Underclaiming: Reasons for Caution When Reporting How NLP Systems Fail”, Bowman 2021
- “Palette: Image-To-Image Diffusion Models”, Saharia et al 2021
- “Transformers Are Meta-Reinforcement Learners”, Anonymous 2021
- “Autoregressive Latent Video Prediction With High-Fidelity Image Generator”, Seo et al 2021
- “Text2Brain: Synthesis of Brain Activation Maps from Free-Form Text Query”, Ngo et al 2021
- “Understanding and Overcoming the Challenges of Efficient Transformer Quantization”, Bondarenko et al 2021
- “BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition”, Zhang et al 2021
- “TrOCR: Transformer-Based Optical Character Recognition With Pre-Trained Models”, Li et al 2021
- “MeLT: Message-Level Transformer With Masked Document Representations As Pre-Training for Stance Detection”, Matero et al 2021
- “KroneckerBERT: Learning Kronecker Decomposition for Pre-Trained Language Models via Knowledge Distillation”, Tahaei et al 2021
- “Block Pruning For Faster Transformers”, Lagunas et al 2021
- “The Sensory Neuron As a Transformer: Permutation-Invariant Neural Networks for Reinforcement Learning”, Tang & Ha 2021
- “DeepConsensus: Gap-Aware Sequence Transformers for Sequence Correction”, Baid et al 2021
- “A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP”, Zhao et al 2021
- “Data and Parameter Scaling Laws for Neural Machine Translation”, Gordon et al 2021
- “ImageBART: Bidirectional Context With Multinomial Diffusion for Autoregressive Image Synthesis”, Esser et al 2021
- “Modeling Protein Using Large-Scale Pretrain Language Model”, Xiao et al 2021
- “Billion-Scale Pretraining With Vision Transformers for Multi-Task Visual Representations”, Beal et al 2021
- “EVA: An Open-Domain Chinese Dialogue System With Large-Scale Generative Pre-Training”, Zhou et al 2021
- “Internet-Augmented Dialogue Generation”, Komeili et al 2021
- “HTLM: Hyper-Text Pre-Training and Prompting of Language Models”, Aghajanyan et al 2021
- “SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking”, Formal et al 2021
- “ViTGAN: Training GANs With Vision Transformers”, Lee et al 2021
- “ARM-Net: Adaptive Relation Modeling Network for Structured Data”, Cai et al 2021
- “SCARF: Self-Supervised Contrastive Learning Using Random Feature Corruption”, Bahri et al 2021
- “Charformer: Fast Character Transformers via Gradient-Based Subword Tokenization”, Tay et al 2021
- “BitFit: Simple Parameter-Efficient Fine-Tuning for Transformer-Based Masked Language-Models”, Zaken et al 2021
- “Revisiting the Calibration of Modern Neural Networks”, Minderer et al 2021
- “Scaling Laws for Acoustic Models”, Droppo & Elibol 2021
- “CoAtNet: Marrying Convolution and Attention for All Data Sizes”, Dai et al 2021
- “Chasing Sparsity in Vision Transformers: An End-To-End Exploration”, Chen et al 2021
- “Tabular Data: Deep Learning Is Not All You Need”, Shwartz-Ziv & Armon 2021
- “Self-Attention Between Datapoints: Going Beyond Individual Input-Output Pairs in Deep Learning”, Kossen et al 2021
- “Exploring Transfer Learning Techniques for Named Entity Recognition in Noisy User-Generated Text”, Bogensperger 2021
- “SegFormer: Simple and Efficient Design for Semantic Segmentation With Transformers”, Xie et al 2021
- “Maximizing 3-D Parallelism in Distributed Training for Huge Neural Networks”, Bian et al 2021
- “One4all User Representation for Recommender Systems in E-Commerce”, Shin et al 2021
- “QASPER: A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers”, Dasigi et al 2021
- “MathBERT: A Pre-Trained Model for Mathematical Formula Understanding”, Peng et al 2021
- “MDETR—Modulated Detection for End-To-End Multi-Modal Understanding”, Kamath et al 2021
- “XLM-T: Multilingual Language Models in Twitter for Sentiment Analysis and Beyond”, Barbieri et al 2021
- “[Ali Released PLUG: 27 Billion Parameters, the Largest Pre-Trained Language Model in the Chinese Community]”, Yuying 2021
- “SimCSE: Simple Contrastive Learning of Sentence Embeddings”, Gao et al 2021
- “Robust Open-Vocabulary Translation from Visual Text Representations”, Salesky et al 2021
- “Memorization versus Generalization in Pre-Trained Language Models”, Tänzer et al 2021
- “Retrieval Augmentation Reduces Hallucination in Conversation”, Shuster et al 2021
- “Gradient-Based Adversarial Attacks against Text Transformers”, Guo et al 2021
- “TSDAE: Using Transformer-Based Sequential Denoising Autoencoder for Unsupervised Sentence Embedding Learning”, Wang et al 2021
- “Machine Translation Decoding beyond Beam Search”, Leblond et al 2021
- “An Empirical Study of Training Self-Supervised Vision Transformers”, Chen et al 2021
- “ChinAI #137: Year 3 of ChinAI: Reflections on the Newsworthiness of Machine Translation”, Ding 2021
- “GPV-1: Towards General Purpose Vision Systems”, Gupta et al 2021
- “DeepViT: Towards Deeper Vision Transformer”, Zhou et al 2021
- “ConViT: Improving Vision Transformers With Soft Convolutional Inductive Biases”, d’Ascoli et al 2021
- “Get Your Vitamin C! Robust Fact Verification With Contrastive Evidence (VitaminC)”, Schuster et al 2021
- “Learning from Videos to Understand the World”, Zweig et al 2021
- “Are NLP Models Really Able to Solve Simple Math Word Problems?”, Patel et al 2021
- “CANINE: Pre-Training an Efficient Tokenization-Free Encoder for Language Representation”, Clark et al 2021
- “TransGAN: Two Transformers Can Make One Strong GAN”, Jiang et al 2021
- “Baller2vec: A Multi-Entity Transformer For Multi-Agent Spatiotemporal Modeling”, Alcorn & Nguyen 2021
- “ViLT: Vision-And-Language Transformer Without Convolution or Region Supervision”, Kim et al 2021
- “Video Transformer Network”, Neimark et al 2021
- “Tokens-To-Token ViT: Training Vision Transformers from Scratch on ImageNet”, Yuan et al 2021
- “BENDR: Using Transformers and a Contrastive Self-Supervised Learning Task to Learn from Massive Amounts of EEG Data”, Kostas et al 2021
- “Bottleneck Transformers for Visual Recognition”, Srinivas et al 2021
- “DAF:re: A Challenging, Crowd-Sourced, Large-Scale, Long-Tailed Dataset For Anime Character Recognition”, Rios et al 2021
- “UPDeT: Universal Multi-Agent Reinforcement Learning via Policy Decoupling With Transformers”, Hu et al 2021
- “MSR-VTT: A Large Video Description Dataset for Bridging Video and Language”, Xu et al 2021
- “XMC-GAN: Cross-Modal Contrastive Learning for Text-To-Image Generation”, Zhang et al 2021
- “Superbizarre Is Not Superb: Derivational Morphology Improves BERT’s Interpretation of Complex Words”, Hofmann et al 2021
- “Training Data-Efficient Image Transformers & Distillation through Attention”, Touvron et al 2020
- “VQ-GAN: Taming Transformers for High-Resolution Image Synthesis”, Esser et al 2020
- “Object-Based Attention for Spatio-Temporal Reasoning: Outperforming Neuro-Symbolic Models With Flexible Distributed Architectures”, Ding et al 2020
- “Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences”, Rives et al 2020
- “Progressively Stacking 2.0: A Multi-Stage Layerwise Training Method for BERT Training Speedup”, Yang et al 2020
- “TStarBot-X: An Open-Sourced and Comprehensive Study for Efficient League Training in StarCraft II Full Game”, Han et al 2020
- “A Recurrent Vision-And-Language BERT for Navigation”, Hong et al 2020
- “A Primer in BERTology: What We Know about How BERT Works”, Rogers et al 2020
- “CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters”, Boukkouri et al 2020
- “TernaryBERT: Distillation-Aware Ultra-Low Bit BERT”, Zhang et al 2020
- “Weird AI Yankovic: Generating Parody Lyrics”, Riedl 2020
- “It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners”, Schick & Schütze 2020
- “DeepSpeed: Extreme-Scale Model Training for Everyone”, Team et al 2020
- “Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing”, Gu et al 2020
- “CoVoST 2 and Massively Multilingual Speech-To-Text Translation”, Wang et al 2020
- “Modern Hopfield Networks and Attention for Immune Repertoire Classification”, Widrich et al 2020
- “Hopfield Networks Is All You Need”, Ramsauer et al 2020
- “Can Neural Networks Acquire a Structural Bias from Raw Linguistic Data?”, Warstadt & Bowman 2020
- “DeepSinger: Singing Voice Synthesis With Data Mined From the Web”, Ren et al 2020
- “Data Movement Is All You Need: A Case Study on Optimizing Transformers”, Ivanov et al 2020
- “Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations”, Baevski et al 2020
- “PipeDream-2BW: Memory-Efficient Pipeline-Parallel DNN Training”, Narayanan et al 2020
- “Learning to Learn With Feedback and Local Plasticity”, Lindsey & Litwin-Kumar 2020
- “Improving GAN Training With Probability Ratio Clipping and Sample Reweighting”, Wu et al 2020
- “DeBERTa: Decoding-Enhanced BERT With Disentangled Attention”, He et al 2020
- “DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations”, Giorgi et al 2020
- “DETR: End-To-End Object Detection With Transformers”, Carion et al 2020
- “Open-Retrieval Conversational Question Answering”, Qu et al 2020
- “TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data”, Yin et al 2020
- “ForecastQA: A Question Answering Challenge for Event Forecasting With Temporal Text Data”, Jin et al 2020
- “VLN-BERT: Improving Vision-And-Language Navigation With Image-Text Pairs from the Web”, Majumdar et al 2020
- “Blender: A State-Of-The-Art Open Source Chatbot”, Roller et al 2020
- “General Purpose Text Embeddings from Pre-Trained Language Models for Scalable Inference”, Du et al 2020
- “Recipes for Building an Open-Domain Chatbot”, Roller et al 2020
- “Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks”, Gururangan et al 2020
- “On the Effect of Dropping Layers of Pre-Trained Transformer Models”, Sajjad et al 2020
- “Rapformer: Conditional Rap Lyrics Generation With Denoising Autoencoders”, Nikolov et al 2020
- “TAPAS: Weakly Supervised Table Parsing via Pre-Training”, Herzig et al 2020
- “A Hundred Visions and Revisions”, Binder 2020
- “Rethinking Parameter Counting in Deep Models: Effective Dimensionality Revisited”, Maddox et al 2020
- “AraBERT: Transformer-Based Model for Arabic Language Understanding”, Antoun et al 2020
- “MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers”, Wang et al 2020
- “GNS: Learning to Simulate Complex Physics With Graph Networks”, Sanchez-Gonzalez et al 2020
- “Do We Need Zero Training Loss After Achieving Zero Training Error?”, Ishida et al 2020
- “Bayesian Deep Learning and a Probabilistic Perspective of Generalization”, Wilson & Izmailov 2020
- “Transformers As Soft Reasoners over Language”, Clark et al 2020
- “Towards a Conversational Agent That Can Chat About…Anything”, Adiwardana & Luong 2020
- “Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference”, Schick & Schütze 2020
- “Improving Transformer Optimization Through Better Initialization”, Huang 2020
- “VIME: Extending the Success of Self-Supervised and Semi-Supervised Learning to Tabular Domain”, Yoon et al 2020
- “Measuring Compositional Generalization: A Comprehensive Method on Realistic Data”, Keysers et al 2019
- “Mastering Complex Control in MOBA Games With Deep Reinforcement Learning”, Ye et al 2019
- “PEGASUS: Pre-Training With Extracted Gap-Sentences for Abstractive Summarization”, Zhang et al 2019
- “Encoding Musical Style With Transformer Autoencoders”, Choi et al 2019
- “Deep Double Descent: We Show That the Double Descent Phenomenon Occurs in CNNs, ResNets, and Transformers: Performance First Improves, Then Gets Worse, and Then Improves Again With Increasing Model Size, Data Size, or Training Time”, Nakkiran et al 2019
- “Detecting GAN Generated Errors”, Zhu et al 2019
- “SimpleBooks: Long-Term Dependency Book Dataset With Simplified English Vocabulary for Word-Level Language Modeling”, Nguyen 2019
- “Unsupervised Cross-Lingual Representation Learning at Scale”, Conneau et al 2019
- “DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter”, Sanh et al 2019
- “TinyBERT: Distilling BERT for Natural Language Understanding”, Jiao et al 2019
- “Do NLP Models Know Numbers? Probing Numeracy in Embeddings”, Wallace et al 2019
- “PubMedQA: A Dataset for Biomedical Research Question Answering”, Jin et al 2019
- “Frustratingly Easy Natural Question Answering”, Pan et al 2019
- “Distributionally Robust Language Modeling”, Oren et al 2019
- “Language Models As Knowledge Bases?”, Petroni et al 2019
- “Encode, Tag, Realize: High-Precision Text Editing”, Malmi et al 2019
- “Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks”, Reimers & Gurevych 2019
- “Well-Read Students Learn Better: On the Importance of Pre-Training Compact Models”, Turc et al 2019
- “TabNet: Attentive Interpretable Tabular Learning”, Arik & Pfister 2019
- “StructBERT: Incorporating Language Structures into Pre-Training for Deep Language Understanding”, Wang et al 2019
- “What BERT Is Not: Lessons from a New Suite of Psycholinguistic Diagnostics for Language Models”, Ettinger 2019
- “RoBERTa: A Robustly Optimized BERT Pretraining Approach”, Liu et al 2019
- “Theoretical Limitations of Self-Attention in Neural Sequence Models”, Hahn 2019
- “Energy and Policy Considerations for Deep Learning in NLP”, Strubell et al 2019
- “Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned”, Voita et al 2019
- “HellaSwag: Can a Machine Really Finish Your Sentence?”, Zellers et al 2019
- “UniLM: Unified Language Model Pre-Training for Natural Language Understanding and Generation”, Dong et al 2019
- “MASS: Masked Sequence to Sequence Pre-Training for Language Generation”, Song et al 2019
- “Mask-Predict: Parallel Decoding of Conditional Masked Language Models”, Ghazvininejad et al 2019
- “Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes”, You et al 2019
- “LIGHT: Learning to Speak and Act in a Fantasy Text Adventure Game”, Urbanek et al 2019
- “Insertion Transformer: Flexible Sequence Generation via Insertion Operations”, Stern et al 2019
- “Adapter: Parameter-Efficient Transfer Learning for NLP”, Houlsby et al 2019
- “Learning and Evaluating General Linguistic Intelligence”, Yogatama et al 2019
- “BioBERT: a Pre-Trained Biomedical Language Representation Model for Biomedical Text Mining”, Lee et al 2019
- “Efficient Training of BERT by Progressively Stacking”, Gong et al 2019
- “Bayesian Layers: A Module for Neural Network Uncertainty”, Tran et al 2018
- “Blockwise Parallel Decoding for Deep Autoregressive Models”, Stern et al 2018
- “Object Hallucination in Image Captioning”, Rohrbach et al 2018
- “Self-Attention Generative Adversarial Networks”, Zhang et al 2018
- “Universal Sentence Encoder”, Cer et al 2018
- “Self-Attention With Relative Position Representations”, Shaw et al 2018
- “Learning Longer-Term Dependencies in RNNs With Auxiliary Losses”, Trinh et al 2018
- “Generating Structured Music through Self-Attention”, Huang et al 2018
- “GPipe: Easy Scaling With Micro-Batch Pipeline Parallelism § Pg4”, Huang 2018
- “A Simple Neural Attentive Meta-Learner”, Mishra et al 2017
- “Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer”, Zagoruyko & Komodakis 2016
- “QRNNs: Quasi-Recurrent Neural Networks”, Bradbury et al 2016
- “Gaussian Error Linear Units (GELUs)”, Hendrycks & Gimpel 2016
- “Pointer Networks”, Vinyals et al 2015
- “No Physics? No Problem. AI Weather Forecasting Is Already Making Huge Strides.”
- “Huggingface: transformers Repo”, Huggingface 2024
- “Transformers in Vision”
- “The Illustrated GPT-2 (Visualizing Transformer Language Models)”
- “The Illustrated Transformer”
- “Autoregressive Long-Context Music Generation With Perceiver AR”
- “The Transformer—Attention Is All You Need.”
- “Understanding BERT Transformer: Attention Isn’t All You Need”, Sileo 2024
- “Etched Is Making the Biggest Bet in AI”
- “Was Linguistic A.I. Created by Accident?”
- “Transformers Are a Very Exciting Family of Machine Learning Architectures”, Bloem 2024
Sort By Magic
Annotations are sorted by machine learning into inferred ‘tags’. This provides an alternative way to browse: instead of by date order, one can browse in topic order. The ‘sorted’ list has been automatically clustered into multiple sections & auto-labeled for easier browsing.
Beginning with the newest annotation, the sort uses the embedding of each annotation to build a chain of nearest-neighbor annotations, creating a progression of topics (a minimal sketch of this ordering follows the tag list below). For more details, see the link.
- dialogue-systems
- transformer-optimization
- transformer-compression
- vision-language
- generative-models
- efficient-training
- knowledge-distillation
- scalable-learning
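The ordering described above is essentially a greedy nearest-neighbor walk through embedding space. Here is a minimal Python/NumPy sketch under that assumption; the `Annotation` class, the `sort_by_similarity` name, and the choice of cosine similarity are illustrative assumptions, not the site’s actual implementation:

```python
# Hypothetical sketch of the "sort by magic" ordering: starting from the newest
# annotation, repeatedly hop to the nearest unvisited neighbor in embedding space,
# producing a topic-ordered traversal of the annotation list.
from dataclasses import dataclass

import numpy as np


@dataclass
class Annotation:
    title: str
    embedding: np.ndarray  # e.g. a text-embedding vector for the annotation


def sort_by_similarity(annotations: list[Annotation]) -> list[Annotation]:
    """Greedy nearest-neighbor walk; assumes `annotations` is given newest-first."""
    if not annotations:
        return []
    # Normalize so that dot products are cosine similarities.
    vecs = np.stack([a.embedding for a in annotations])
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

    ordered = [0]                              # start from the newest annotation
    remaining = set(range(1, len(annotations)))
    while remaining:
        last = vecs[ordered[-1]]
        # Pick the unvisited annotation most similar to the last one chosen.
        nxt = max(remaining, key=lambda i: float(last @ vecs[i]))
        ordered.append(nxt)
        remaining.remove(nxt)
    return [annotations[i] for i in ordered]
```

In such a scheme, section breaks for the auto-labeled clusters could be placed wherever the similarity between consecutive annotations drops sharply, with labels like those in the tag list generated separately; that detail is speculative and not specified here.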
Wikipedia
Miscellaneous
- /doc/ai/nn/transformer/2021-hu-figure2-b-datascalingfinetuningperformanceonnocaps.jpg
- /doc/ai/nn/transformer/2021-hu-figure6-largerlemoncaptionmodelsaremoresampleefficient.jpg
- https://aclanthology.org/D18-1092/
- https://github.com/huggingface/transformers/tree/main/src/transformers
- https://gonzoml.substack.com/p/you-only-cache-once-decoder-decoder
- https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/
- https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms
- https://research.google/blog/on-device-content-distillation-with-graph-neural-networks/
- https://research.google/blog/unsupervised-speech-to-speech-translation-from-monolingual-data/
- https://sander.ai/2023/01/09/diffusion-language.html#deepmind
- https://www.quantamagazine.org/how-ai-transformers-mimic-parts-of-the-brain-20220912/
Bibliography
- https://arxiv.org/abs/2408.00118#google: “Gemma 2: Improving Open Language Models at a Practical Size”
- https://www.lesswrong.com/posts/ADrTuuus6JsQr5CSi/investigating-the-ability-of-llms-to-recognize-their-own: “Investigating the Ability of LLMs to Recognize Their Own Writing”
- https://arxiv.org/abs/2405.20233: “Grokfast: Accelerated Grokking by Amplifying Slow Gradients”
- https://arxiv.org/abs/2405.14860: “Not All Language Model Features Are Linear”
- https://arxiv.org/abs/2405.05254#microsoft: “You Only Cache Once: Decoder-Decoder Architectures for Language Models”
- https://arxiv.org/abs/2404.13292: “Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge”
- https://arxiv.org/abs/2404.10102: “Chinchilla Scaling: A Replication Attempt”
- https://osf.io/preprints/psyarxiv/kjuce: “Language Models Accurately Infer Correlations between Psychological Items and Scales from Text Alone”
- https://inflection.ai/inflection-2-5: “Inflection-2.5: Meet the World’s Best Personal AI”
- https://arxiv.org/abs/2312.03876: “Scaling Transformer Neural Networks for Skillful and Reliable Medium-Range Weather Forecasting”
- https://arxiv.org/abs/2312.02116: “GIVT: Generative Infinite-Vocabulary Transformers”
- https://arxiv.org/abs/2311.03079#zhipu: “CogVLM: Visual Expert for Pretrained Language Models”
- https://arxiv.org/abs/2310.16836: “LLM-FP4: 4-Bit Floating-Point Quantized Transformers”
- https://arxiv.org/abs/2310.13061: “To Grok or Not to Grok: Disentangling Generalization and Memorization on Corrupted Algorithmic Datasets”
- https://arxiv.org/abs/2310.07096#ibm: “Sparse Universal Transformer”
- https://arxiv.org/abs/2310.06694: “Sheared LLaMA: Accelerating Language Model Pre-Training via Structured Pruning”
- https://arxiv.org/abs/2310.02207: “Language Models Represent Space and Time”
- https://arxiv.org/abs/2308.13418#facebook: “Nougat: Neural Optical Understanding for Academic Documents”
- https://arxiv.org/abs/2308.11596#facebook: “SeamlessM4T: Massively Multilingual & Multimodal Machine Translation”
- https://arxiv.org/abs/2306.09222#google: “RGD: Stochastic Re-Weighted Gradient Descent via Distributionally Robust Optimization”
- https://arxiv.org/abs/2306.05426: “SequenceMatch: Imitation Learning for Autoregressive Sequence Modeling With Backtracking”
- https://arxiv.org/abs/2305.11863: “Scaling Laws for Language Encoding Models in FMRI”
- 2023-jia.pdf: “When and How Artificial Intelligence Augments Employee Creativity”
- https://arxiv.org/abs/2302.12441: “MUX-PLMs: Pre-Training Language Models With Data Multiplexing”
- https://arxiv.org/abs/2302.05442#google: “Scaling Vision Transformers to 22 Billion Parameters”
- https://arxiv.org/abs/2302.04907#google: “BMT: Binarized Neural Machine Translation”
- https://arxiv.org/abs/2301.05217: “Progress Measures for Grokking via Mechanistic Interpretability”
- https://arxiv.org/abs/2301.03728#facebook: “Scaling Laws for Generative Mixed-Modal Language Models”
- https://arxiv.org/abs/2301.03992#nvidia: “Vision Transformers Are Good Mask Auto-Labelers”
- https://arxiv.org/abs/2212.14034: “Cramming: Training a Language Model on a Single GPU in One Day”
- https://arxiv.org/abs/2212.09410: “Less Is More: Parameter-Free Text Classification With Gzip”
- https://arxiv.org/abs/2212.06727: “What Do Vision Transformers Learn? A Visual Exploration”
- https://arxiv.org/abs/2212.05199#google: “MAGVIT: Masked Generative Video Transformer”
- https://arxiv.org/abs/2212.05051: “VindLU: A Recipe for Effective Video-And-Language Pretraining”
- https://arxiv.org/abs/2212.03533#microsoft: “Text Embeddings by Weakly-Supervised Contrastive Pre-Training”
- https://arxiv.org/abs/2212.01349#facebook: “NPM: Nonparametric Masked Language Modeling”
- https://arxiv.org/abs/2211.09808: “Uni-Perceiver V2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks”
- https://arxiv.org/abs/2211.06220: “OneFormer: One Transformer to Rule Universal Image Segmentation”
- https://arxiv.org/abs/2210.06313#google: “The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers”
- https://arxiv.org/abs/2209.11737: “Semantic Scene Descriptions As an Objective of Human Vision”
- https://arxiv.org/abs/2209.11055: “SetFit: Efficient Few-Shot Learning Without Prompts”
- https://arxiv.org/abs/2209.02535: “Analyzing Transformers in Embedding Space”
- https://arxiv.org/abs/2207.06300#ibm: “Re2G: Retrieve, Rerank, Generate”
- https://arxiv.org/abs/2207.01848: “TabPFN: Meta-Learning a Real-Time Tabular AutoML Method For Small Data”
- https://arxiv.org/abs/2204.05927: “Do Loyal Users Enjoy Better Recommendations? Understanding Recommender Accuracy from a Time Perspective”
- https://arxiv.org/abs/2206.07137: “RHO-LOSS: Prioritized Training on Points That Are Learnable, Worth Learning, and Not Yet Learnt”
- https://arxiv.org/abs/2206.07160#microsoft: “LAVENDER: Unifying Video-Language Understanding As Masked Language Modeling”
- https://www.biorxiv.org/content/10.1101/2022.06.08.495348.full: “Reconstructing the Cascade of Language Processing in the Brain Using the Internal Computations of a Transformer-Based Language Model”
- https://arxiv.org/abs/2206.01859#microsoft: “XTC: Extreme Compression for Pre-Trained Transformers Made Simple and Efficient”
- https://arxiv.org/abs/2206.01685: “Toward a Realistic Model of Speech Processing in the Brain With Self-Supervised Learning”
- 2022-rios.pdf: “Anime Character Recognition Using Intermediate Features Aggregation”
- https://arxiv.org/abs/2205.13320#google: “Towards Learning Universal Hyperparameter Optimizers With Transformers”
- https://arxiv.org/abs/2205.11491#facebook: “HTPS: HyperTree Proof Search for Neural Theorem Proving”
- https://arxiv.org/abs/2205.04596#google: “When Does Dough Become a Bagel? Analyzing the Remaining Mistakes on ImageNet”
- https://arxiv.org/abs/2203.13224#facebook: “Language Models That Seek for Knowledge: Modular Search & Generation for Dialogue and Prompt Completion”
- https://arxiv.org/abs/2203.02094#microsoft: “LiteTransformerSearch: Training-Free Neural Architecture Search for Efficient Language Models”
- https://arxiv.org/abs/2202.03052#alibaba: “OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-To-Sequence Learning Framework”
- https://arxiv.org/abs/2112.10510: “PFNs: Transformers Can Do Bayesian Inference”
- https://arxiv.org/abs/2111.13824: “FQ-ViT: Fully Quantized Vision Transformer without Retraining”
- https://arxiv.org/abs/2111.12233#microsoft: “LEMON: Scaling Up Vision-Language Pre-Training for Image Captioning”
- https://arxiv.org/abs/2111.09162: “It’s About Time: Analog Clock Reading in the Wild”
- https://arxiv.org/abs/2111.06091: “A Survey of Visual Transformers”
- https://arxiv.org/abs/2109.12948: “Understanding and Overcoming the Challenges of Efficient Transformer Quantization”
- https://arxiv.org/abs/2109.10282#microsoft: “TrOCR: Transformer-Based Optical Character Recognition With Pre-Trained Models”
- https://arxiv.org/abs/2109.06243#huawei: “KroneckerBERT: Learning Kronecker Decomposition for Pre-Trained Language Models via Knowledge Distillation”
- https://arxiv.org/abs/2108.13002#microsoft: “A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP”
- https://arxiv.org/abs/2107.07566#facebook: “Internet-Augmented Dialogue Generation”
- https://arxiv.org/abs/2107.04589: “ViTGAN: Training GANs With Vision Transformers”
- https://arxiv.org/abs/2106.12672#google: “Charformer: Fast Character Transformers via Gradient-Based Subword Tokenization”
- https://arxiv.org/abs/2106.10199: “BitFit: Simple Parameter-Efficient Fine-Tuning for Transformer-Based Masked Language-Models”
- https://arxiv.org/abs/2106.09488#amazon: “Scaling Laws for Acoustic Models”
- https://arxiv.org/abs/2106.04803#google: “CoAtNet: Marrying Convolution and Attention for All Data Sizes”
- https://arxiv.org/abs/2106.04533: “Chasing Sparsity in Vision Transformers: An End-To-End Exploration”
- https://arxiv.org/abs/2105.15203: “SegFormer: Simple and Efficient Design for Semantic Segmentation With Transformers”
- https://arxiv.org/abs/2104.07567#facebook: “Retrieval Augmentation Reduces Hallucination in Conversation”
- https://chinai.substack.com/p/chinai-137-year-3-of-chinai: “ChinAI #137: Year 3 of ChinAI: Reflections on the Newsworthiness of Machine Translation”
- https://arxiv.org/abs/2103.10697#facebook: “ConViT: Improving Vision Transformers With Soft Convolutional Inductive Biases”
- https://ai.facebook.com/blog/learning-from-videos-to-understand-the-world/: “Learning from Videos to Understand the World”
- https://arxiv.org/abs/2102.07074: “TransGAN: Two Transformers Can Make One Strong GAN”
- https://arxiv.org/abs/2102.03334: “ViLT: Vision-And-Language Transformer Without Convolution or Region Supervision”
- https://arxiv.org/abs/2101.11986: “Tokens-To-Token ViT: Training Vision Transformers from Scratch on ImageNet”
- https://arxiv.org/abs/2101.11605#google: “Bottleneck Transformers for Visual Recognition”
- https://arxiv.org/abs/2101.08674: “DAF:re: A Challenging, Crowd-Sourced, Large-Scale, Long-Tailed Dataset For Anime Character Recognition”
- https://arxiv.org/abs/2101.04702#google: “XMC-GAN: Cross-Modal Contrastive Learning for Text-To-Image Generation”
- https://arxiv.org/abs/2012.12877#facebook: “Training Data-Efficient Image Transformers & Distillation through Attention”
- https://arxiv.org/abs/2012.08508#deepmind: “Object-Based Attention for Spatio-Temporal Reasoning: Outperforming Neuro-Symbolic Models With Flexible Distributed Architectures”
- https://arxiv.org/abs/2011.13729#tencent: “TStarBot-X: An Open-Sourced and Comprehensive Study for Efficient League Training in StarCraft II Full Game”
- https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/: “DeepSpeed: Extreme-Scale Model Training for Everyone”
- https://arxiv.org/abs/2008.02217: “Hopfield Networks Is All You Need”
- https://arxiv.org/abs/2006.03654#microsoft: “DeBERTa: Decoding-Enhanced BERT With Disentangled Attention”
- https://arxiv.org/abs/2005.12872#facebook: “DETR: End-To-End Object Detection With Transformers”
- https://ai.meta.com/blog/state-of-the-art-open-source-chatbot/: “Blender: A State-Of-The-Art Open Source Chatbot”
- https://arxiv.org/abs/2004.03844: “On the Effect of Dropping Layers of Pre-Trained Transformer Models”
- https://arxiv.org/abs/2004.03965: “Rapformer: Conditional Rap Lyrics Generation With Denoising Autoencoders”
- https://arxiv.org/abs/2002.10957#microsoft: “MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers”
- https://research.google/blog/towards-a-conversational-agent-that-can-chat-aboutanything/: “Towards a Conversational Agent That Can Chat About…Anything”
- https://openai.com/research/deep-double-descent: “Deep Double Descent: We Show That the Double Descent Phenomenon Occurs in CNNs, ResNets, and Transformers: Performance First Improves, Then Gets Worse, and Then Improves Again With Increasing Model Size, Data Size, or Training Time”
- https://arxiv.org/abs/1911.02116#facebook: “Unsupervised Cross-Lingual Representation Learning at Scale”
- https://arxiv.org/abs/1909.10351: “TinyBERT: Distilling BERT for Natural Language Understanding”
- https://arxiv.org/abs/1909.05286#ibm: “Frustratingly Easy Natural Question Answering”
- https://arxiv.org/abs/1908.04577#alibaba: “StructBERT: Incorporating Language Structures into Pre-Training for Deep Language Understanding”
- https://arxiv.org/abs/1907.11692#facebook: “RoBERTa: A Robustly Optimized BERT Pretraining Approach”
- https://arxiv.org/abs/1905.03197: “UniLM: Unified Language Model Pre-Training for Natural Language Understanding and Generation”
- https://arxiv.org/abs/1904.00962#google: “Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes”
- https://arxiv.org/abs/1901.08746: “BioBERT: a Pre-Trained Biomedical Language Representation Model for Biomedical Text Mining”
- 2018-huang.pdf: “Generating Structured Music through Self-Attention”
- https://github.com/huggingface/transformers: “Huggingface: transformers Repo”