Gemma 2: Improving Open Language Models at a Practical Size
Investigating the Ability of LLMs to Recognize Their Own Writing
Revealing Fine-Grained Values and Opinions in Large Language Models
Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks
Grokfast: Accelerated Grokking by Amplifying Slow Gradients
You Only Cache Once: Decoder-Decoder Architectures for Language Models
Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge
Exploring Concept Depth: How Large Language Models Acquire Knowledge at Different Layers?
Conformer-1: Robust ASR via Large-Scale Semi-supervised Bootstrapping
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
Language models accurately infer correlations between psychological items and scales from text alone
Privacy Backdoors: Stealing Data with Corrupted Pretrained Models
Language Models Learn Rare Phenomena from Less Rare Phenomena: The Case of the Missing AANNs
Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey
LTE: Training Neural Networks from Scratch with Parallel Low-Rank Adapters
Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping (Searchformer)
KARL: Knowledge-Aware Retrieval and Representations aid Retention and Learning in Students
Do Llamas Work in English? On the Latent Language of Multilingual Transformers
DE-COP: Detecting Copyrighted Content in Language Models Training Data
Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift
The Manga Whisperer: Automatically Generating Transcriptions for Comics
A Philosophical Introduction to Language Models—Part I: Continuity With Classic Debates
Seamless: Multilingual Expressive and Streaming Speech Translation
Scaling transformer neural networks for skillful and reliable medium-range weather forecasting
The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning
Sequential Modeling Enables Scalable Learning for Large Vision Models
DiLoCo: Distributed Low-Communication Training of Language Models
ProSG: Using Prompt Synthetic Gradients to Alleviate Prompt Forgetting of RNN-like Language Models
Will releasing the weights of large language models grant widespread access to pandemic agents?
To grok or not to grok: Disentangling generalization and memorization on corrupted algorithmic datasets
Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
DeWave: Discrete EEG Waves Encoding for Brain Dynamics to Text Translation
Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions
A Pooled Cell Painting CRISPR Screening Platform Enables de novo Inference of Gene Function by Self-supervised Deep Learning
Nougat: Neural Optical Understanding for Academic Documents
SeamlessM4T: Massively Multilingual & Multimodal Machine Translation
HEADLINES: A Massive Scale Semantic Similarity Dataset of Historical English
Expanding the methodological toolbox: Machine-based item desirability ratings as an alternative to human-based ratings
OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents
RGD: Stochastic Re-weighted Gradient Descent via Distributionally Robust Optimization
SequenceMatch: Imitation Learning for Autoregressive Sequence Modeling with Backtracking
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
FERMAT: An Alternative to Accuracy for Numerical Reasoning
Translatotron 3: Speech to Speech Translation with Monolingual Data
Deep Learning based Forecasting: a case study from the online fashion industry
DarkBERT: A Language Model for the Dark Side of the Internet
VendorLink: An NLP approach for Identifying & Linking Vendor Migrants & Potential Aliases on Darknet Markets
A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision
When and How Artificial Intelligence Augments Employee Creativity
Trained on 100 million words and still in shape: BERT meets British National Corpus
Mitigating YouTube Recommendation Polarity using BERT and K-Means Clustering
Model scale versus domain knowledge in statistical forecasting of chaotic systems
The Man of Your Dreams: For $300, Replika sells an AI companion who will never die, argue, or cheat—until his algorithm is updated
Towards Democratizing Joint-Embedding Self-Supervised Learning
MUX-PLMs: Pre-training Language Models with Data Multiplexing
V1T: large-scale mouse V1 response prediction using a Vision Transformer
The BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus
SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient
XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models
Progress measures for grokking via mechanistic interpretability
Cramming: Training a Language Model on a Single GPU in One Day
Less is More: Parameter-Free Text Classification with Gzip
NBC-Softmax: Darkweb Author fingerprinting and migration tracking
POM: A Principal Odor Map Unifies Diverse Tasks in Human Olfactory Perception
VindLU: A Recipe for Effective Video-and-Language Pretraining
Text Embeddings by Weakly-Supervised Contrastive Pre-training
Discovering Latent Knowledge in Language Models Without Supervision
BARTSmiles: Generative Masked Language Models for Molecular Representations
Self-Destructing Models: Increasing the Costs of Harmful Dual Uses of Foundation Models
A deep learning and digital archaeology approach for mosquito repellent discovery
GENIUS: Sketch-based Language Model Pre-training via Extreme and Selective Masking for Text Generation and Augmentation
UniSumm: Unified Few-shot Summarization with Multi-Task Pre-Training and Prefix-Tuning
Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks
Distilled DeepConsensus: Knowledge distillation for fast and accurate DNA sequence correction
Massively Multilingual ASR on 70 Languages: Tokenization, Architecture, and Generalization Capabilities
OneFormer: One Transformer to Rule Universal Image Segmentation
Characterizing Intrinsic Compositionality in Transformers with Tree Projections
n-gram Is Back: Residual Learning of Neural Text Generation with n-gram Language Model
Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models
The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers
Small Character Models Match Large Word Models for Autocomplete Under Memory Constraints
Improving Sample Quality of Diffusion Models Using Self-Attention Guidance
Semantic scene descriptions as an objective of human vision
Machine Reading, Fast and Slow: When Do Models "Understand" Language?
On the Effectiveness of Compact Biomedical Transformers (*BioBERT)
ASR2K: Speech Recognition for Around 2,000 Languages without Audio
MeloForm: Generating Melody with Musical Form based on Expert Systems and Neural Networks
CorpusBrain: Pre-train a Generative Retrieval Model for Knowledge-Intensive Language Tasks
PatchDropout: Economizing Vision Transformers Using Patch Dropout
Why do tree-based models still outperform deep learning on tabular data?
Transformer Neural Processes: Uncertainty-Aware Meta Learning Via Sequence Modeling
TabPFN: Meta-Learning a Real-Time Tabular AutoML Method For Small Data
Do Loyal Users Enjoy Better Recommendations? Understanding Recommender Accuracy from a Time Perspective
BertNet: Harvesting Knowledge Graphs from Pretrained Language Models
ProGen2: Exploring the Boundaries of Protein Language Models
SBERT studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features
RHO-LOSS: Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt
LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling
Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs
Reconstructing the cascade of language processing in the brain using the internal computations of a transformer-based language model
XTC: Extreme Compression for Pre-trained Transformers Made Simple and Efficient
Toward a realistic model of speech processing in the brain with self-supervised learning
Text2Human: Text-Driven Controllable Human Image Generation
Anime Character Recognition using Intermediate Features Aggregation
Towards Learning Universal Hyperparameter Optimizers with Transformers
FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech
Housekeep: Tidying Virtual Households using Commonsense Reasoning
UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes
Tradformer: A Transformer Model of Traditional Music Transcriptions
Continual Pre-Training Mitigates Forgetting in Language and Vision
Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning
SymphonyNet: Symphony Generation with Permutation Invariant Language Model
When does dough become a bagel? Analyzing the remaining mistakes on ImageNet
Data Distributional Properties Drive Emergent Few-Shot Learning in Transformers
DualPrompt: Complementary Prompting for Rehearsal-free Continual Learning
Language Models that Seek for Knowledge: Modular Search & Generation for Dialogue and Prompt Completion
On Embeddings for Numerical Features in Tabular Deep Learning
LiteTransformerSearch: Training-free Neural Architecture Search for Efficient Language Models
Pretraining without Wordpieces: Learning Over a Vocabulary of Millions of Words
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
AutoDistil: Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models
FIGARO: Generating Symbolic Music with Fine-Grained Artistic Control
HyperTransformer: Model Generation for Supervised and Semi-Supervised Few-Shot Learning
An Empirical Investigation of the Role of Pre-training in Lifelong Learning
You Only Need One Model for Open-domain Question Answering
Human Parity on CommonsenseQA: Augmenting Self-Attention with External Attention
ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction
Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks
Inducing Causal Structure for Interpretable Neural Networks (IIT)
FQ-ViT: Fully Quantized Vision Transformer without Retraining
LEMON: Scaling Up Vision-Language Pre-training for Image Captioning
UNICORN: Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling
XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale
Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers
Lifelong Pretraining: Continually Adapting Language Models to Emerging Corpora
The Dangers of Underclaiming: Reasons for Caution When Reporting How NLP Systems Fail
Autoregressive Latent Video Prediction with High-Fidelity Image Generator
Text2Brain: Synthesis of Brain Activation Maps from Free-form Text Query
Understanding and Overcoming the Challenges of Efficient Transformer Quantization
BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition
TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models
MeLT: Message-Level Transformer with Masked Document Representations as Pre-Training for Stance Detection
KroneckerBERT: Learning Kronecker Decomposition for Pre-trained Language Models via Knowledge Distillation
The Sensory Neuron as a Transformer: Permutation-Invariant Neural Networks for Reinforcement Learning
DeepConsensus: Gap-Aware Sequence Transformers for Sequence Correction
A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP
Data and Parameter Scaling Laws for Neural Machine Translation
ImageBART: Bidirectional Context with Multinomial Diffusion for Autoregressive Image Synthesis
Modeling Protein Using Large-scale Pretrain Language Model
Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations
EVA: An Open-Domain Chinese Dialogue System with Large-Scale Generative Pre-Training
HTLM: Hyper-Text Pre-Training and Prompting of Language Models
SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking
ARM-Net: Adaptive Relation Modeling Network for Structured Data
SCARF: Self-Supervised Contrastive Learning using Random Feature Corruption
Charformer: Fast Character Transformers via Gradient-based Subword Tokenization
BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models
CoAtNet: Marrying Convolution and Attention for All Data Sizes
Chasing Sparsity in Vision Transformers: An End-to-End Exploration
Self-Attention Between Datapoints: Going Beyond Individual Input-Output Pairs in Deep Learning
Exploring Transfer Learning techniques for Named Entity Recognition in Noisy User-Generated Text
SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers
Maximizing 3-D Parallelism in Distributed Training for Huge Neural Networks
One4all User Representation for Recommender Systems in E-commerce
QASPER: A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers
MathBERT: A Pre-Trained Model for Mathematical Formula Understanding
MDETR—Modulated Detection for End-to-End Multi-Modal Understanding
XLM-T: Multilingual Language Models in Twitter for Sentiment Analysis and Beyond
[Alibaba releases PLUG: at 27 billion parameters, the largest pre-trained language model in the Chinese NLP community]
SimCSE: Simple Contrastive Learning of Sentence Embeddings
Robust Open-Vocabulary Translation from Visual Text Representations
Memorization versus Generalization in Pre-trained Language Models
Retrieval Augmentation Reduces Hallucination in Conversation
Gradient-based Adversarial Attacks against Text Transformers
TSDAE: Using Transformer-based Sequential Denoising Autoencoder for Unsupervised Sentence Embedding Learning
An Empirical Study of Training Self-Supervised Vision Transformers
ChinAI #137: Year 3 of ChinAI: Reflections on the newsworthiness of machine translation
ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases
Get Your Vitamin C! Robust Fact Verification with Contrastive Evidence (VitaminC)
Are NLP Models really able to Solve Simple Math Word Problems?
CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation
baller2vec: A Multi-Entity Transformer For Multi-Agent Spatiotemporal Modeling
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet
BENDR: using transformers and a contrastive self-supervised learning task to learn from massive amounts of EEG data
DAF:re: A Challenging, Crowd-Sourced, Large-Scale, Long-Tailed Dataset For Anime Character Recognition
UPDeT: Universal Multi-agent Reinforcement Learning via Policy Decoupling with Transformers
MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
XMC-GAN: Cross-Modal Contrastive Learning for Text-to-Image Generation
Superbizarre Is Not Superb: Derivational Morphology Improves BERT’s Interpretation of Complex Words
Training data-efficient image transformers & distillation through attention
VQ-GAN: Taming Transformers for High-Resolution Image Synthesis
Object-based attention for spatio-temporal reasoning: Outperforming neuro-symbolic models with flexible distributed architectures
Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences
Progressively Stacking 2.0: A Multi-stage Layerwise Training Method for BERT Training Speedup
TStarBot-X: An Open-Sourced and Comprehensive Study for Efficient League Training in StarCraft II Full Game
CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters
It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners
Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing
CoVoST 2 and Massively Multilingual Speech-to-Text Translation
Modern Hopfield Networks and Attention for Immune Repertoire Classification
Can neural networks acquire a structural bias from raw linguistic data?
DeepSinger: Singing Voice Synthesis with Data Mined From the Web
Data Movement Is All You Need: A Case Study on Optimizing Transformers
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
PipeDream-2BW: Memory-Efficient Pipeline-Parallel DNN Training
Improving GAN Training with Probability Ratio Clipping and Sample Reweighting
DeBERTa: Decoding-enhanced BERT with Disentangled Attention
DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations
TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data
ForecastQA: A Question Answering Challenge for Event Forecasting with Temporal Text Data
VLN-BERT: Improving Vision-and-Language Navigation with Image-Text Pairs from the Web
General Purpose Text Embeddings from Pre-trained Language Models for Scalable Inference
Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks
On the Effect of Dropping Layers of Pre-trained Transformer Models
Rapformer: Conditional Rap Lyrics Generation with Denoising Autoencoders
Rethinking Parameter Counting in Deep Models: Effective Dimensionality Revisited
AraBERT: Transformer-based Model for Arabic Language Understanding
MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers
GNS: Learning to Simulate Complex Physics with Graph Networks
Do We Need Zero Training Loss After Achieving Zero Training Error?
Bayesian Deep Learning and a Probabilistic Perspective of Generalization
Towards a Conversational Agent that Can Chat About…Anything
Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference
Improving Transformer Optimization Through Better Initialization
VIME: Extending the Success of Self-supervised and Semi-supervised Learning to Tabular Domain
Measuring Compositional Generalization: A Comprehensive Method on Realistic Data
Mastering Complex Control in MOBA Games with Deep Reinforcement Learning
PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization
Deep Double Descent: the double descent phenomenon occurs in CNNs, ResNets, and Transformers, where performance first improves, then gets worse, and then improves again as model size, data size, or training time increases
SimpleBooks: Long-term dependency book dataset with simplified English vocabulary for word-level language modeling
Unsupervised Cross-lingual Representation Learning at Scale
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
TinyBERT: Distilling BERT for Natural Language Understanding
Do NLP Models Know Numbers? Probing Numeracy in Embeddings
PubMedQA: A Dataset for Biomedical Research Question Answering
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Well-Read Students Learn Better: On the Importance of Pre-training Compact Models
StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding
What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models
Theoretical Limitations of Self-Attention in Neural Sequence Models
Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned
UniLM: Unified Language Model Pre-training for Natural Language Understanding and Generation
MASS: Masked Sequence to Sequence Pre-training for Language Generation
Mask-Predict: Parallel Decoding of Conditional Masked Language Models
Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
LIGHT: Learning to Speak and Act in a Fantasy Text Adventure Game
Insertion Transformer: Flexible Sequence Generation via Insertion Operations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
Blockwise Parallel Decoding for Deep Autoregressive Models
Learning Longer-term Dependencies in RNNs with Auxiliary Losses
GPipe: Easy Scaling With Micro-Batch Pipeline Parallelism (§ pg. 4)
Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer
No Physics? No Problem. AI Weather Forecasting Is Already Making Huge Strides.
The Illustrated GPT-2 (Visualizing Transformer Language Models)
Autoregressive Long-Context Music Generation With Perceiver AR
Understanding BERT Transformer: Attention Isn’t All You Need
Transformers are a very exciting family of machine learning architectures
Nguyen 2023, Figure 12: bigger climate-forecasting models are more sample-efficient on low-resolution data
Cheng 2022, Figure 2: ablation of VindLU text-video model performance by source of performance changes
Hu 2021, Figure 2(b): data-scaling finetuning performance on nocaps
Hu 2021, Figure 6: larger LEMON caption models are more sample-efficient
Zaken 2021, Figure 2: scaling curve of finetuning vs. bias-tuning shows the curves cross as dataset size increases
https://ai.facebook.com/blog/harmful-content-can-evolve-quickly-our-new-ai-system-adapts-to-tackle-it
https://github.com/huggingface/transformers/tree/main/src/transformers
https://gonzoml.substack.com/p/you-only-cache-once-decoder-decoder
https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/
https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4588941
https://research.google/blog/on-device-content-distillation-with-graph-neural-networks/
https://research.google/blog/unsupervised-speech-to-speech-translation-from-monolingual-data/
https://sander.ai/2023/01/09/diffusion-language.html#deepmind
https://www.lesswrong.com/posts/2JJtxitp6nqu6ffak/basic-facts-about-language-models-during-training-1
https://www.lesswrong.com/posts/4Hnso8NMAeeYs8Cta/revealing-intentionality-in-language-models-through-adavae#BigVAE_and_Its_Samplers
https://www.quantamagazine.org/how-ai-transformers-mimic-parts-of-the-brain-20220912/
https://www.reddit.com/r/MachineLearning/comments/yxt8sa/r_rwkv4_7b_release_an_attentionfree_rnn_language/
https://arxiv.org/abs/2408.00118#google
https://www.lesswrong.com/posts/ADrTuuus6JsQr5CSi/investigating-the-ability-of-llms-to-recognize-their-own
https://arxiv.org/abs/2405.05254#microsoft
https://osf.io/preprints/psyarxiv/kjuce
https://arxiv.org/abs/2311.03079#zhipu
https://arxiv.org/abs/2310.07096#ibm
https://arxiv.org/abs/2308.13418#facebook
https://arxiv.org/abs/2308.11596#facebook
https://arxiv.org/abs/2306.09222#google
/doc/economics/automation/2023-jia.pdf
https://arxiv.org/abs/2302.05442#google
https://arxiv.org/abs/2302.04907#google
https://arxiv.org/abs/2301.03728#facebook
https://arxiv.org/abs/2301.03992#nvidia
https://arxiv.org/abs/2212.05199#google
https://arxiv.org/abs/2212.03533#microsoft
https://arxiv.org/abs/2212.01349#facebook
https://arxiv.org/abs/2210.06313#google
https://arxiv.org/abs/2207.06300#ibm
https://arxiv.org/abs/2206.07160#microsoft
https://www.biorxiv.org/content/10.1101/2022.06.08.495348.full
https://arxiv.org/abs/2206.01859#microsoft
/doc/ai/anime/danbooru/2022-rios.pdf
https://arxiv.org/abs/2205.13320#google
https://arxiv.org/abs/2205.11491#facebook
https://arxiv.org/abs/2205.04596#google
https://arxiv.org/abs/2203.13224#facebook
https://arxiv.org/abs/2203.02094#microsoft
https://arxiv.org/abs/2202.03052#alibaba
https://arxiv.org/abs/2111.12233#microsoft
https://arxiv.org/abs/2109.10282#microsoft
https://arxiv.org/abs/2109.06243#huawei
https://arxiv.org/abs/2108.13002#microsoft
https://arxiv.org/abs/2107.07566#facebook
https://arxiv.org/abs/2106.12672#google
https://arxiv.org/abs/2106.09488#amazon
https://arxiv.org/abs/2106.04803#google
https://arxiv.org/abs/2104.07567#facebook
https://chinai.substack.com/p/chinai-137-year-3-of-chinai
https://arxiv.org/abs/2103.10697#facebook
https://ai.facebook.com/blog/learning-from-videos-to-understand-the-world/
https://arxiv.org/abs/2101.11605#google
https://arxiv.org/abs/2101.04702#google
https://arxiv.org/abs/2012.12877#facebook
https://arxiv.org/abs/2012.08508#deepmind
https://arxiv.org/abs/2011.13729#tencent
https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/
https://arxiv.org/abs/2006.03654#microsoft
https://arxiv.org/abs/2005.12872#facebook
https://ai.meta.com/blog/state-of-the-art-open-source-chatbot/
https://arxiv.org/abs/2002.10957#microsoft
https://research.google/blog/towards-a-conversational-agent-that-can-chat-aboutanything/
https://openai.com/research/deep-double-descent
https://arxiv.org/abs/1911.02116#facebook
https://arxiv.org/abs/1909.05286#ibm
https://arxiv.org/abs/1908.04577#alibaba
https://arxiv.org/abs/1907.11692#facebook
https://arxiv.org/abs/1904.00962#google
https://github.com/huggingface/transformers
Wikipedia Bibliography: