See Also

Links
- “HtmlRAG: HTML Is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems”, Tan et al 2024
- “Centaur: a Foundation Model of Human Cognition”, Binz et al 2024
- “SimpleStrat: Diversifying Language Model Generation With Stratification”, Wong et al 2024
- “MLE-Bench: Evaluating Machine Learning Agents on Machine Learning Engineering”, Chan et al 2024
- “Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making”, Li et al 2024
- “Seeing Faces in Things: A Model and Dataset for Pareidolia”, Hamilton et al 2024
- “H-ARC: A Robust Estimate of Human Performance on the Abstraction and Reasoning Corpus Benchmark”, LeGris et al 2024
- “How to Evaluate Jailbreak Methods: A Case Study With the StrongREJECT Benchmark”, Bowen et al 2024
- “To Code, or Not To Code? Exploring Impact of Code in Pre-Training”, Aryabumi et al 2024
- “Tails Tell Tales: Chapter-Wide Manga Transcriptions With Character Names”, Sachdeva et al 2024
- “ImagiNet: A Multi-Content Dataset for Generalizable Synthetic Image Detection via Contrastive Learning”, Boychev & Cholakov 2024
- “Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs”, Laine et al 2024
- “Future Events As Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs”, Price et al 2024
- “Sonnet or Not, Bot? Poetry Evaluation for Large Models and Datasets”, Walsh et al 2024
- “APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets”, Liu et al 2024
- “Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?”, Lee et al 2024
- “OlympicArena: Benchmarking Multi-Discipline Cognitive Reasoning for Superintelligent AI”, Huang et al 2024
- “DataComp-LM: In Search of the next Generation of Training Sets for Language Models”, Li et al 2024
- “GUI-WORLD: A Dataset for GUI-Oriented Multimodal LLM-Based Agents”, Chen et al 2024
- “Newswire: A Large-Scale Structured Database of a Century of Historical News”, Silcock et al 2024
- “Are We Done With MMLU?”, Gema et al 2024
- “MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark”, Wang et al 2024
- “LLMs Achieve Adult Human Performance on Higher-Order Theory of Mind Tasks”, Street et al 2024
- “DeTikZify: Synthesizing Graphics Programs for Scientific Figures and Sketches With TikZ”, Belouadi et al 2024
- “Sakuga-42M Dataset: Scaling Up Cartoon Research”, Pan et al 2024
- “Can Language Models Explain Their Own Classification Behavior?”, Sherburn et al 2024
- “Special Characters Attack: Toward Scalable Training Data Extraction From Large Language Models”, Bai et al 2024
- “ImageInWords: Unlocking Hyper-Detailed Image Descriptions”, Garg et al 2024
- “GSM1k: A Careful Examination of Large Language Model Performance on Grade School Arithmetic”, Zhang et al 2024
- “Building a Large Japanese Web Corpus for Large Language Models”, Okazaki et al 2024
- “CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs’ (Lack Of) Multicultural Knowledge”, Chiu et al 2024
- “VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?”, Liu et al 2024
- “Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators”, Dubois et al 2024
- “How Tech Giants Cut Corners to Harvest Data for AI: OpenAI, Google and Meta Ignored Corporate Policies, Altered Their Own Rules and Discussed Skirting Copyright Law As They Sought Online Information to Train Their Newest Artificial Intelligence Systems”, Metz et al 2024
- “Vulnerability Detection With Code Language Models: How Far Are We?”, Ding et al 2024
- “Long-Form Factuality in Large Language Models”, Wei et al 2024
- “COIG-CQIA: Quality Is All You Need for Chinese Instruction Fine-Tuning”, Bai et al 2024
- “RewardBench: Evaluating Reward Models for Language Modeling”, Lambert et al 2024
- “Evaluating Text to Image Synthesis: Survey and Taxonomy of Image Quality Metrics”, Hartwig et al 2024
- “Hierarchical Feature Warping and Blending for Talking Head Animation”, Zhang et al 2024
- “Mastering Text, Code and Math Simultaneously via Fusing Highly Specialized Language Models”, Ding et al 2024
- “ELLA: Equip Diffusion Models With LLM for Enhanced Semantic Alignment”, Hu et al 2024
- “Investigating Continual Pretraining in Large Language Models: Insights and Implications”, Yıldız et al 2024
- “Hal-Eval: A Universal and Fine-Grained Hallucination Evaluation Framework for Large Vision Language Models”, Jiang et al 2024
- “ArtPrompt: ASCII Art-Based Jailbreak Attacks against Aligned LLMs”, Jiang et al 2024
- “DE-COP: Detecting Copyrighted Content in Language Models Training Data”, Duarte et al 2024
- “I Think, Therefore I Am: Benchmarking Awareness of Large Language Models Using AwareBench”, Li et al 2024
- “Can AI Assistants Know What They Don’t Know?”, Cheng et al 2024
- “AnimeDiffusion: Anime Diffusion Colorization”, Cao et al 2024
- “I Am a Strange Dataset: Metalinguistic Tests for Language Models”, Thrush et al 2024
- “Generative AI for Math: Part I—MathPile: A Billion-Token-Scale Pretraining Corpus for Math”, Wang et al 2023
- “WaveCoder: Widespread And Versatile Enhanced Instruction Tuning With Refined Data Generation”, Yu et al 2023
- “Large Language Models Play StarCraft II: Benchmarks and A Chain of Summarization Approach”, Ma et al 2023
- “StarVector: Generating Scalable Vector Graphics Code from Images”, Rodriguez et al 2023
- “Rich Human Feedback for Text-To-Image Generation”, Liang et al 2023
- “TinyGSM: Achieving >80% on GSM8k With Small Language Models”, Liu et al 2023
- “EQ-Bench: An Emotional Intelligence Benchmark for Large Language Models”, Paech 2023
- “Retrieving Conditions from Reference Images for Diffusion Models”, Tang et al 2023
- “Sequential Modeling Enables Scalable Learning for Large Vision Models”, Bai et al 2023
- “BioCLIP: A Vision Foundation Model for the Tree of Life”, Stevens et al 2023
- “Efficient Transformer Knowledge Distillation: A Performance Review”, Brown et al 2023
- “GPQA: A Graduate-Level Google-Proof Q&A Benchmark”, Rein et al 2023
- “Dazed & Confused: A Large-Scale Real-World User Study of ReCAPTCHAv2”, Searles et al 2023
- “Instruction-Following Evaluation for Large Language Models”, Zhou et al 2023
- “In Search of the Long-Tail: Systematic Generation of Long-Tail Inferential Knowledge via Logical Rule Guided Search”, Li et al 2023
- “AnyText: Multilingual Visual Text Generation And Editing”, Tuo et al 2023
- “GLaMM: Pixel Grounding Large Multimodal Model”, Rasheed et al 2023
- “Don’t Make Your LLM an Evaluation Benchmark Cheater”, Zhou et al 2023
- “CommonCanvas: An Open Diffusion Model Trained With Creative-Commons Images”, Gokaslan et al 2023
- “FANToM: A Benchmark for Stress-Testing Machine Theory of Mind in Interactions”, Kim et al 2023
- “MuSR: Testing the Limits of Chain-Of-Thought With Multistep Soft Reasoning”, Sprague et al 2023
- “Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition”, Schulhoff et al 2023
- “Llemma: An Open Language Model For Mathematics”, Azerbayev et al 2023
- “From Scarcity to Efficiency: Improving CLIP Training via Visual-Enriched Captions”, Lai et al 2023
- “TabLib: A Dataset of 627M Tables With Context”, Eggert et al 2023
- “SWE-Bench: Can Language Models Resolve Real-World GitHub Issues?”, Jimenez et al 2023
- “OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text”, Paster et al 2023
- “FreshLLMs: Refreshing Large Language Models With Search Engine Augmentation”, Vu et al 2023
- “UltraFeedback: Boosting Language Models With High-Quality Feedback”, Cui et al 2023
- “MTOB: A Benchmark for Learning to Translate a New Language from One Grammar Book”, Tanzer et al 2023
- “Demystifying CLIP Data”, Xu et al 2023
- “The Cambridge Law Corpus: A Corpus for Legal AI Research”, Östling et al 2023
- “MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models”, Yu et al 2023
- “LongLoRA: Efficient Fine-Tuning of Long-Context Large Language Models”, Chen et al 2023
- “SlimPajama-DC: Understanding Data Combinations for LLM Training”, Shen et al 2023
- “MADLAD-400: A Multilingual And Document-Level Large Audited Dataset”, Kudugunta et al 2023
- “GoodWiki”, Choi 2023
- “From Sparse to Dense: GPT-4 Summarization With Chain of Density (CoD) Prompting”, Adams et al 2023
- “FIMO: A Challenge Formal Dataset for Automated Theorem Proving”, Liu et al 2023
- “American Stories: A Large-Scale Structured Text Dataset of Historical U.S. Newspapers”, Dell et al 2023
- “LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models”, Guha et al 2023
- “The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain”, Moskvichev et al 2023
- “Android in the Wild: A Large-Scale Dataset for Android Device Control”, Rawles et al 2023
- “DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection for Conversational AI”, Zhang et al 2023
- “AlpaGasus: Training A Better Alpaca With Fewer Data”, Chen et al 2023
- “InternVid: A Large-Scale Video-Text Dataset for Multimodal Understanding and Generation”, Wang et al 2023
- “Instruction Mining: High-Quality Instruction Data Selection for Large Language Models”, Cao et al 2023
- “Test-Time Training on Video Streams”, Wang et al 2023
- “HEADLINES: A Massive Scale Semantic Similarity Dataset of Historical English”, Silcock & Dell 2023
- “LeanDojo: Theorem Proving With Retrieval-Augmented Language Models”, Yang et al 2023
- “SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality”, Hsieh et al 2023
- “ARIES: A Corpus of Scientific Paper Edits Made in Response to Peer Reviews”, D’Arcy et al 2023
- “Understanding Social Reasoning in Language Models With Language Models”, Gandhi et al 2023
- “OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents”, Laurençon et al 2023
- “AI Is a Lot of Work: As the Technology Becomes Ubiquitous, a Vast Tasker Underclass Is Emerging—And Not Going Anywhere”, Dzieza 2023
- “Anime Character Identification and Tag Prediction by Multimodality Modeling: Dataset and Model”, Yi et al 2023
- “ChessGPT: Bridging Policy Learning and Language Modeling”, Feng et al 2023
- “Why YouTube Could Give Google an Edge in AI”, Victor 2023
- “Artificial Artificial Artificial Intelligence: Crowd Workers Widely Use Large Language Models for Text Production Tasks”, Veselovsky et al 2023
- “The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora With Web Data, and Web Data Only”, Penedo et al 2023
- “Let’s Verify Step by Step”, Lightman et al 2023
- “WikiChat: Stopping the Hallucination of Large Language Model Chatbots by Few-Shot Grounding on Wikipedia”, Semnani et al 2023
- “SeeGULL: A Stereotype Benchmark With Broad Geo-Cultural Coverage Leveraging Generative Models”, Jha et al 2023
- “C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models”, Huang et al 2023
- “TinyStories: How Small Can Language Models Be and Still Speak Coherent English?”, Eldan & Li 2023
- “Pick-A-Pic: An Open Dataset of User Preferences for Text-To-Image Generation”, Kirstain et al 2023
- “LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions”, Wu et al 2023
- “Multi-Party Chat (MultiLIGHT): Conversational Agents in Group Settings With Humans and Models”, Wei et al 2023
- “ImageNet-Hard: The Hardest Images Remaining from a Study of the Power of Zoom and Spatial Biases in Image Classification”, Taesiri et al 2023
- “Parsing-Conditioned Anime Translation: A New Dataset and Method”, Li et al 2023c
- “Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling”, Biderman et al 2023
- “Abstraction-Perception Preserving Cartoon Face Synthesis”, Ho et al 2023
- “How Well Do Large Language Models Perform in Arithmetic Tasks?”, Yuan et al 2023
- “The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset”, Laurençon et al 2023
- “Large Language Models Are State-Of-The-Art Evaluators of Translation Quality”, Kocmi & Federmann 2023
- “Benchmarks for Automated Commonsense Reasoning: A Survey”, Davis 2023
- “Data Selection for Language Models via Importance Resampling”, Xie et al 2023
- “Off-The-Grid MARL (OG-MARL): Datasets With Baselines for Offline Multi-Agent Reinforcement Learning”, Formanek et al 2023
- “The BabyLM Challenge: Sample-Efficient Pretraining on a Developmentally Plausible Corpus”, Warstadt et al 2023
- “The Semantic Scholar Open Data Platform”, Kinney et al 2023
- “Interactive-Chain-Prompting (INTERCPT): Ambiguity Resolution for Crosslingual Conditional Generation With Interaction”, Pilault et al 2023
- “How Close Is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection”, Guo et al 2023
- “Med-PaLM: Large Language Models Encode Clinical Knowledge”, Singhal et al 2022
- “Unnatural Instructions: Tuning Language Models With (Almost) No Human Labor”, Honovich et al 2022
- “HALIE: Evaluating Human-Language Model Interaction”, Lee et al 2022
- “A Whack-A-Mole Dilemma: Shortcuts Come in Multiples Where Mitigating One Amplifies Others”, Li et al 2022
- “Text Embeddings by Weakly-Supervised Contrastive Pre-Training”, Wang et al 2022
- “The Stack: 3 TB of Permissively Licensed Source Code”, Kocetkov et al 2022
- “UniSumm: Unified Few-Shot Summarization With Multi-Task Pre-Training and Prefix-Tuning”, Chen et al 2022
- “A Creative Industry Image Generation Dataset Based on Captions”, Yuejia et al 2022
- “AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities”, Chen et al 2022
- “AnimeRun: 2D Animation Visual Correspondence from Open Source 3D Movies”, Siyao et al 2022
- “MMDialog: A Large-Scale Multi-Turn Dialogue Dataset Towards Multi-Modal Open-Domain Conversation”, Feng et al 2022
- “BLOOMZ/mT0: Crosslingual Generalization through Multitask Finetuning”, Muennighoff et al 2022
- “Dungeons and Data: A Large-Scale NetHack Dataset”, Hambro et al 2022
- “Will We Run out of Data? An Analysis of the Limits of Scaling Datasets in Machine Learning”, Villalobos et al 2022
- “Large Language Models Can Self-Improve”, Huang et al 2022
- “CARP: Robust Preference Learning for Storytelling via Contrastive Reinforcement Learning”, Castricato et al 2022
- “MTEB: Massive Text Embedding Benchmark”, Muennighoff et al 2022
- “Most Language Models Can Be Poets Too: An AI Writing Assistant and Constrained Text Generation Studio”, Roush et al 2022
- “Self-Ask: Measuring and Narrowing the Compositionality Gap in Language Models (Bamboogle)”, Press et al 2022
- “Dynamic Prompt Learning via Policy Gradient for Semi-Structured Mathematical Reasoning”, Lu et al 2022
- “Brain Imaging Generation With Latent Diffusion Models”, Pinaya et al 2022
- “PaLI: A Jointly-Scaled Multilingual Language-Image Model”, Chen et al 2022
- “FOLIO: Natural Language Reasoning With First-Order Logic”, Han et al 2022
- “Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned”, Ganguli et al 2022
- “Bugs in the Data: How ImageNet Misrepresents Biodiversity”, Luccioni & Rolnick 2022
- “Discovering Bugs in Vision Models Using Off-The-Shelf Image Generation and Captioning”, Wiles et al 2022
- “Benchmarking Compositionality With Formal Languages”, Valvoda et al 2022
- “Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP”, Nguyen et al 2022
- “Learning to Generalize With Object-Centric Agents in the Open World Survival Game Crafter”, Stanić et al 2022
- “Few-Shot Adaptation Works With UnpredicTable Data”, Chan et al 2022
- “Language Models Can Teach Themselves to Program Better”, Haluptzok et al 2022
- “RealTime QA: What’s the Answer Right Now?”, Kasai et al 2022
- “NewsStories: Illustrating Articles With Visual Summaries”, Tan et al 2022
- “CelebV-HQ: A Large-Scale Video Facial Attributes Dataset”, Zhu et al 2022
- “Why Do Tree-Based Models Still Outperform Deep Learning on Tabular Data?”, Grinsztajn et al 2022
- “Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset”, Henderson et al 2022
- “Forecasting Future World Events With Neural Networks”, Zou et al 2022
- “RST: ReStructured Pre-Training”, Yuan & Liu 2022
- “Learning to Generate Artistic Character Line Drawing”, Fang et al 2022
- “Dataset Condensation via Efficient Synthetic-Data Parameterization”, Kim et al 2022
- “Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions”, Jiang et al 2022
- “Fine-Grained Image Captioning With CLIP Reward”, Cho et al 2022
- “FLEURS: Few-Shot Learning Evaluation of Universal Representations of Speech”, Conneau et al 2022
- “InstructDial: Improving Zero and Few-Shot Generalization in Dialogue through Instruction Tuning”, Gupta et al 2022
- “Learning to Model Editing Processes”, Reid & Neubig 2022
- “Flexible Diffusion Modeling of Long Videos”, Harvey et al 2022
- “Housekeep: Tidying Virtual Households Using Commonsense Reasoning”, Kant et al 2022
- “Instruction Induction: From Few Examples to Natural Language Task Descriptions”, Honovich et al 2022
- “Down and Across: Introducing Crossword-Solving As a New NLP Benchmark”, Kulshreshtha et al 2022
- “Automated Crossword Solving”, Wallace et al 2022
- “Dialog Inpainting: Turning Documents into Dialogues”, Dai et al 2022
- “SymphonyNet: Symphony Generation With Permutation Invariant Language Model”, Liu et al 2022
- “Building Machine Translation Systems for the Next Thousand Languages”, Bapna et al 2022
- “When Does Dough Become a Bagel? Analyzing the Remaining Mistakes on ImageNet”, Vasudevan et al 2022
- “Data Determines Distributional Robustness in Contrastive Language Image Pre-Training (CLIP)”, Fang et al 2022
- “A Challenging Benchmark of Anime Style Recognition”, Li et al 2022
- “Tk-Instruct: Benchmarking Generalization via In-Context Instructions on 1,600+ Language Tasks”, Wang et al 2022
- “Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality”, Thrush et al 2022
- “KNN-Diffusion: Image Generation via Large-Scale Retrieval”, Ashual et al 2022
- “ByT5 Model for Massively Multilingual Grapheme-To-Phoneme Conversion”, Zhu et al 2022
- “STaR: Bootstrapping Reasoning With Reasoning”, Zelikman et al 2022
- “CLIP Meets GamePhysics: Towards Bug Identification in Gameplay Videos Using Zero-Shot Transfer Learning”, Taesiri et al 2022
- “Bamboo: Building Mega-Scale Vision Dataset Continually With Human-Machine Synergy”, Zhang et al 2022
- “Self-Distilled StyleGAN: Towards Generation from Internet Photos”, Mokady et al 2022
- “RuCLIP—New Models and Experiments: a Technical Report”, Shonenkov et al 2022
- “Wukong: 100 Million Large-Scale Chinese Cross-Modal Pre-Training Dataset and A Foundation Framework”, Gu et al 2022
- “ROME: Locating and Editing Factual Associations in GPT”, Meng et al 2022
- “DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-To-Image Generative Transformers”, Cho et al 2022
- “PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts”, Bach et al 2022
- “StyleGAN-XL: Scaling StyleGAN to Large Diverse Datasets”, Sauer et al 2022
- “BLIP: Bootstrapping Language-Image Pre-Training for Unified Vision-Language Understanding and Generation”, Li et al 2022
- “Can Wikipedia Help Offline Reinforcement Learning?”, Reid et al 2022
- “SWAG: Revisiting Weakly Supervised Pre-Training of Visual Perception Models”, Singh et al 2022
- “CoAuthor: Designing a Human-AI Collaborative Writing Dataset for Exploring Language Model Capabilities”, Lee et al 2022
- “WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation”, Liu et al 2022
- “SynthBio: A Case Study in Faster Curation of Text Datasets”, Yuan et al 2022
- “BigDatasetGAN: Synthesizing ImageNet With Pixel-Wise Annotations”, Li et al 2022
- “ERNIE-ViLG: Unified Generative Pre-Training for Bidirectional Vision-Language Generation”, Zhang et al 2021
- “A Fistful of Words: Learning Transferable Visual Models from Bag-Of-Words Supervision”, Tejankar et al 2021
- “GLIDE: Towards Photorealistic Image Generation and Editing With Text-Guided Diffusion Models”, Nichol et al 2021
- “QuALITY: Question Answering With Long Input Texts, Yes!”, Pang et al 2021
- “FRUIT: Faithfully Reflecting Updated Information in Text”, Logan et al 2021
- “Models in the Loop: Aiding Crowdworkers With Generative Annotation Assistants”, Bartolo et al 2021
- “WebGPT: Browser-Assisted Question-Answering With Human Feedback”, Nakano et al 2021
- “GLaM: Efficient Scaling of Language Models With Mixture-Of-Experts”, Du et al 2021
- “MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions”, Soldan et al 2021
- “BASIC: Combined Scaling for Open-Vocabulary Image Classification”, Pham et al 2021
- “It’s About Time: Analog Clock Reading in the Wild”, Yang et al 2021
- “Solving Probability and Statistics Problems by Program Synthesis”, Tang et al 2021
- “Few-Shot Self-Rationalization With Natural Language Prompts”, Marasović et al 2021
- “AnimeCeleb: Large-Scale Animation CelebHeads Dataset for Head Reenactment”, Kim et al 2021
- “RLDS: an Ecosystem to Generate, Share and Use Datasets in Reinforcement Learning”, Ramos et al 2021
- “An Explanation of In-Context Learning As Implicit Bayesian Inference”, Xie et al 2021
- “LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs”, Schuhmann et al 2021
- “Training Verifiers to Solve Math Word Problems”, Cobbe et al 2021
- “A Connectome of the Drosophila Central Complex Reveals Network Motifs Suitable for Flexible Navigation and Context-Dependent Action Selection”, Hulse et al 2021
- “HTCN: Harmonious Text Colorization Network for Visual-Textual Presentation Design”, Yang et al 2021c
- “T0: Multitask Prompted Training Enables Zero-Shot Task Generalization”, Sanh et al 2021
- “Can Machines Learn Morality? The Delphi Experiment”, Jiang et al 2021
- “Situated Dialogue Learning through Procedural Environment Generation”, Ammanabrolu et al 2021
- “MiniHack the Planet: A Sandbox for Open-Ended Reinforcement Learning Research”, Samvelyan et al 2021
- “TruthfulQA: Measuring How Models Mimic Human Falsehoods”, Lin et al 2021
- “MiniF2F: a Cross-System Benchmark for Formal Olympiad-Level Mathematics”, Zheng et al 2021
- “LAION-400-Million Open Dataset”, Schuhmann 2021
- “Transfer Learning for Pose Estimation of Illustrated Characters”, Chen & Zwicker 2021
- “MuSiQue: Multi-Hop Questions via Single-Hop Question Composition”, Trivedi et al 2021
- “Scaling Vision Transformers”, Zhai et al 2021
- “QASPER: A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers”, Dasigi et al 2021
- “XLM-T: Multilingual Language Models in Twitter for Sentiment Analysis and Beyond”, Barbieri et al 2021
- “BEIR: A Heterogenous Benchmark for Zero-Shot Evaluation of Information Retrieval Models”, Thakur et al 2021
- “SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network”, Chan et al 2021
- “Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks”, Northcutt et al 2021
- “NaturalProofs: Mathematical Theorem Proving in Natural Language”, Welleck et al 2021
- “Get Your Vitamin C! Robust Fact Verification With Contrastive Evidence (VitaminC)”, Schuster et al 2021
- “Are NLP Models Really Able to Solve Simple Math Word Problems?”, Patel et al 2021
- “Measuring Mathematical Problem Solving With the MATH Dataset”, Hendrycks et al 2021
- “WIT: Wikipedia-Based Image Text Dataset for Multimodal Multilingual Machine Learning”, Srinivasan et al 2021
- “A Massive 7T FMRI Dataset to Bridge Cognitive and Computational Neuroscience”, Allen et al 2021
- “Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts”, Changpinyo et al 2021
- “ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision”, Jia et al 2021
- “Mind the Gap: Assessing Temporal Generalization in Neural Language Models § Scaling”, Lazaridou et al 2021
- “Scaling Laws for Transfer”, Hernandez et al 2021
- “Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning”, Lee et al 2021
- “MSR-VTT: A Large Video Description Dataset for Bridging Video and Language”, Xu et al 2021
- “CLIP: Learning Transferable Visual Models From Natural Language Supervision”, Radford et al 2021
- “CLIP: Connecting Text and Images: We’re Introducing a Neural Network Called CLIP Which Efficiently Learns Visual Concepts from Natural Language Supervision. CLIP Can Be Applied to Any Visual Classification Benchmark by Simply Providing the Names of the Visual Categories to Be Recognized, Similar to the ‘Zero-Shot’ Capabilities of GPT-2 and GPT-3”, Radford et al 2021
- “The Pile: An 800GB Dataset of Diverse Text for Language Modeling”, Gao et al 2021
- “Selective Eye-Gaze Augmentation To Enhance Imitation Learning In Atari Games”, Thammineni et al 2020
- “VoxLingua107: a Dataset for Spoken Language Recognition”, Valk & Alumäe 2020
- “MoGaze: A Dataset of Full-Body Motions That Includes Workspace Geometry and Eye-Gaze”, Kratzer et al 2020
- “End-To-End Chinese Landscape Painting Creation Using Generative Adversarial Networks”, Xue 2020
- “Hypersim: A Photorealistic Synthetic Dataset for Holistic Indoor Scene Understanding”, Roberts et al 2020
- “Constructing A Multi-Hop QA Dataset for Comprehensive Evaluation of Reasoning Steps”, Ho et al 2020
- “Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus”, Caswell et al 2020
- “Open-Domain Question Answering Goes Conversational via Question Rewriting”, Anantha et al 2020
- “Digital Voicing of Silent Speech”, Gaddy & Klein 2020
- “A C/C++ Code Vulnerability Dataset With Code Changes and CVE Summaries”, Fan et al 2020
- “MMLU: Measuring Massive Multitask Language Understanding”, Hendrycks et al 2020
- “ETHICS: Aligning AI With Shared Human Values”, Hendrycks et al 2020
- “Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing”, Gu et al 2020
- “CoVoST 2 and Massively Multilingual Speech-To-Text Translation”, Wang et al 2020
- “The Many Faces of Robustness: A Critical Analysis of Out-Of-Distribution Generalization”, Hendrycks et al 2020
- “The NetHack Learning Environment”, Küttler et al 2020
- “Anime Crop Datasets: Faces, Figures, & Hands”, Gwern et al 2020
- “ForecastQA: A Question Answering Challenge for Event Forecasting With Temporal Text Data”, Jin et al 2020
- “Shortcut Learning in Deep Neural Networks”, Geirhos et al 2020
- “D4RL: Datasets for Deep Data-Driven Reinforcement Learning”, Fu et al 2020
- “TyDiQA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages”, Clark et al 2020
- “SAYCam: A Large, Longitudinal Audiovisual Dataset Recorded from the Infant’s Perspective”, Sullivan et al 2020
- “ImageNet-A: Natural Adversarial Examples”, Hendrycks et al 2020
- “Measuring Compositional Generalization: A Comprehensive Method on Realistic Data”, Keysers et al 2019
- “Libri-Light: A Benchmark for ASR With Limited or No Supervision”, Kahn et al 2019
- “How Can We Know What Language Models Know?”, Jiang et al 2019
- “SimpleBooks: Long-Term Dependency Book Dataset With Simplified English Vocabulary for Word-Level Language Modeling”, Nguyen 2019
- “How Machine Learning Can Help Unlock the World of Ancient Japan”, Lamb 2019
- “Compressive Transformers for Long-Range Sequence Modeling”, Rae et al 2019
- “CommonGen: A Constrained Text Generation Challenge for Generative Commonsense Reasoning”, Lin et al 2019
- “CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data”, Wenzek et al 2019
- “T5: Exploring the Limits of Transfer Learning With a Unified Text-To-Text Transformer”, Raffel et al 2019
- “Restoring Ancient Text Using Deep Learning (Pythia): a Case Study on Greek Epigraphy”, Assael et al 2019
- “CATER: A Diagnostic Dataset for Compositional Actions and TEmporal Reasoning”, Girdhar & Ramanan 2019
- “PubMedQA: A Dataset for Biomedical Research Question Answering”, Jin et al 2019
- “ObjectNet: A Large-Scale Bias-Controlled Dataset for Pushing the Limits of Object Recognition Models”, Barbu et al 2019
- “No Press Diplomacy: Modeling Multi-Agent Gameplay”, Paquette et al 2019
- “Language Modeling State-Of-The-Art Leaderboards”, paperswithcode.com 2019
- “LVIS: A Dataset for Large Vocabulary Instance Segmentation”, Gupta et al 2019
- “Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank”, Socher et al 2019
- “A Large Single-Participant FMRI Dataset for Probing Brain Responses to Naturalistic Stimuli in Space and Time”, Seeliger et al 2019
- “OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge”, Marino et al 2019
- “ImageNet-Sketch: Learning Robust Global Representations by Penalizing Local Predictive Power”, Wang et al 2019
- “Cold Case: The Lost MNIST Digits”, Yadav & Bottou 2019
- “SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems”, Wang et al 2019
- “The MineRL 2019 Competition on Sample Efficient Reinforcement Learning Using Human Priors”, Guss et al 2019
- “ProductNet: a Collection of High-Quality Datasets for Product Representation Learning”, Wang et al 2019
- “Benchmarking Neural Network Robustness to Common Corruptions and Perturbations”, Hendrycks & Dietterich 2019
- “Atari-HEAD: Atari Human Eye-Tracking and Demonstration Dataset”, Zhang et al 2019
- “LIGHT: Learning to Speak and Act in a Fantasy Text Adventure Game”, Urbanek et al 2019
- “A Replication Study: Machine Learning Models Are Capable of Predicting Sexual Orientation From Facial Images”, Leuner 2019
- “Language Models Are Unsupervised Multitask Learners”, Radford et al 2019
- “The Omniglot Challenge: a 3-Year Progress Report”, Lake et al 2019
- “Do We Train on Test Data? Purging CIFAR of Near-Duplicates”, Barz & Denzler 2019
- “The RobotriX: An EXtremely Photorealistic and Very-Large-Scale Indoor Dataset of Sequences With Robot Trajectories and Interactions”, Garcia-Garcia et al 2019
- “FIGR: Few-Shot Image Generation With Reptile”, Clouâtre & Demers 2019
- “Natural Questions: A Benchmark for Question Answering Research”, Kwiatkowski et al 2019
- “A Style-Based Generator Architecture for Generative Adversarial Networks”, Karras et al 2018
- “ImageNet-Trained CNNs Are Biased towards Texture; Increasing Shape Bias Improves Accuracy and Robustness”, Geirhos et al 2018
- “CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge”, Talmor et al 2018
- “The Open Images Dataset V4: Unified Image Classification, Object Detection, and Visual Relationship Detection at Scale”, Kuznetsova et al 2018
- “HotpotQA: A Dataset for Diverse, Explainable Multi-Hop Question Answering”, Yang et al 2018
- “Don’t Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization”, Narayan et al 2018
- “CurriculumNet: Weakly Supervised Learning from Large-Scale Web Images”, Guo et al 2018
- “A Short Note about Kinetics-600”, Carreira et al 2018
- “Cartoon Set”, Royer et al 2018
- “Benchmarking Neural Network Robustness to Common Corruptions and Surface Variations”, Hendrycks & Dietterich 2018
- “Conceptual Captions: A Cleaned, Hypernymed, Image Alt-Text Dataset For Automatic Image Captioning”, Sharma et al 2018
- “Know What You Don’t Know: Unanswerable Questions for SQuAD”, Rajpurkar et al 2018
- “BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning”, Yu et al 2018
- “Exploring the Limits of Weakly Supervised Pretraining”, Mahajan et al 2018
- “Newsroom: A Dataset of 1.3 Million Summaries With Diverse Extractive Strategies”, Grusky et al 2018
- “GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding”, Wang et al 2018
- “The Sound of Pixels”, Zhao et al 2018
- “FEVER: a Large-Scale Dataset for Fact Extraction and VERification”, Thorne et al 2018
- “Think You Have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge”, Clark et al 2018
- “SCUT-FBP5500: A Diverse Benchmark Dataset for Multi-Paradigm Facial Beauty Prediction”, Liang et al 2018
- “11K Hands: Gender Recognition and Biometric Identification Using a Large Dataset of Hand Images”, Afifi 2017
- “Progressive Growing of GANs for Improved Quality, Stability, and Variation”, Karras et al 2017
- “OpenML Benchmarking Suites”, Bischl et al 2017
- “WebVision Database: Visual Learning and Understanding from Web Data”, Li et al 2017
- “A Downsampled Variant of ImageNet As an Alternative to the CIFAR Datasets”, Chrabaszcz et al 2017
- “Revisiting Unreasonable Effectiveness of Data in Deep Learning Era”, Sun et al 2017
- “Driver Identification Using Automobile Sensor Data from a Single Turn”, Hallac et al 2017
- “StreetStyle: Exploring World-Wide Clothing Styles from Millions of Photos”, Matzen et al 2017
- “The Kinetics Human Action Video Dataset”, Kay et al 2017
- “WebVision Challenge: Visual Learning and Understanding With Web Data”, Li et al 2017
- “TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension”, Joshi et al 2017
- “Dense-Captioning Events in Videos”, Krishna et al 2017
- “BAM! The Behance Artistic Media Dataset for Recognition Beyond Photography”, Wilber et al 2017
- “SearchQA: A New Q&A Dataset Augmented With Context from a Search Engine”, Dunn et al 2017
- “RACE: Large-Scale ReAding Comprehension Dataset From Examinations”, Lai et al 2017
- “NewsQA: A Machine Comprehension Dataset”, Trischler et al 2016
- “MS MARCO: A Human Generated MAchine Reading COmprehension Dataset”, Bajaj et al 2016
- “Lip Reading Sentences in the Wild”, Chung et al 2016
- “Pointer Sentinel Mixture Models”, Merity et al 2016
- “Deep Learning the City: Quantifying Urban Perception At A Global Scale”, Dubey et al 2016
- “Solving General Arithmetic Word Problems”, Roy & Roth 2016
- “The LAMBADA Dataset: Word Prediction Requiring a Broad Discourse Context”, Paperno et al 2016
- “SQuAD: 100,000+ Questions for Machine Comprehension of Text”, Rajpurkar et al 2016
- “Matching Networks for One Shot Learning”, Vinyals et al 2016
- “Convolutional Sketch Inversion”, Güçlütürk et al 2016
- “The MovieLens Datasets: History and Context”, Harper & Konstan 2015
- “Neural Module Networks”, Andreas et al 2015
- “Sketch-Based Manga Retrieval Using Manga109 Dataset”, Matsui et al 2015
- “Amazon Reviews: Image-Based Recommendations on Styles and Substitutes”, McAuley et al 2015
- “Teaching Machines to Read and Comprehend”, Hermann et al 2015
- “LSUN: Construction of a Large-Scale Image Dataset Using Deep Learning With Humans in the Loop”, Yu et al 2015
- “VQA: Visual Question Answering”, Agrawal et al 2015
- “YFCC100M: The New Data in Multimedia Research”, Thomee et al 2015
- “ImageNet Large Scale Visual Recognition Challenge”, Russakovsky et al 2014
- “Microsoft COCO: Common Objects in Context”, Lin et al 2014
- “N-Gram Counts and Language Models from the Common Crawl”, Buck et al 2014
- “Ukiyo-E Search”, Resig 2013
- “UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild”, Soomro et al 2012
- “The Caltech-UCSD Birds-200-2011 Dataset”, Wah et al 2011
- “Unbiased Look at Dataset Bias”, Torralba & Efros 2011
- “Caltech-UCSD Birds 200”, Welinder et al 2010
- “Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments”, Huang et al 2008
- “Building a Large Annotated Corpus of English: The Penn Treebank”, Marcus et al 1993
- “About the Test Data”
- “DataGemma: AI Open Models Connecting LLMs to Google’s Data Commons”
- “Scale AI Secures $1B Funding at $14B Valuation As Its CEO Predicts Big Revenue Growth and Profitability by Year-End [On Very High Quality Data]”
- “No Robots: Look Ma, an Instruction Dataset That Wasn’t Generated by GPTs!”, HuggingFace 2024
- “Psych-101 Dataset [For Centaur]”
- “FineWeb: Decanting the Web for the Finest Text Data at Scale”
- “Solving Math Word Problems: We’ve Trained a System That Solves Grade School Math Problems With Nearly Twice the Accuracy of a Fine-Tuned GPT-3 Model. It Solves about 90% As Many Problems As Real Kids: a Small Sample of 9-12 Year Olds Scored 60% on a Test from Our Dataset, While Our System Scored 55% on Those Same Problems. This Is Important Because Today’s AI Is Still Quite Weak at Commonsense Multistep Reasoning, Which Is Easy Even for Grade School Kids. We Achieved These Results by Training Our Model to Recognize Its Mistakes, so That It Can Try Repeatedly Until It Finds a Solution That Works”
- Sort By Magic
- Wikipedia
- Miscellaneous
- Bibliography
- “MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark”, Wang et al 2024
- “LLMs Achieve Adult Human Performance on Higher-Order Theory of Mind Tasks”, Street et al 2024
- “DeTikZify: Synthesizing Graphics Programs for Scientific Figures and Sketches With TikZ”, Belouadi et al 2024
- “Sakuga-42M Dataset: Scaling Up Cartoon Research”, Pan et al 2024
- “Can Language Models Explain Their Own Classification Behavior?”, Sherburn et al 2024
- “Special Characters Attack: Toward Scalable Training Data Extraction From Large Language Models”, Bai et al 2024
- “ImageInWords: Unlocking Hyper-Detailed Image Descriptions”, Garg et al 2024
- “GSM1k: A Careful Examination of Large Language Model Performance on Grade School Arithmetic”, Zhang et al 2024
- “Building a Large Japanese Web Corpus for Large Language Models”, Okazaki et al 2024
- “CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs’ (Lack Of) Multicultural Knowledge”, Chiu et al 2024
- “VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?”, Liu et al 2024
- “Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators”, Dubois et al 2024
- “How Tech Giants Cut Corners to Harvest Data for AI: OpenAI, Google and Meta Ignored Corporate Policies, Altered Their Own Rules and Discussed Skirting Copyright Law As They Sought Online Information to Train Their Newest Artificial Intelligence Systems”, Metz et al 2024
- “Vulnerability Detection With Code Language Models: How Far Are We?”, Ding et al 2024
- “Long-Form Factuality in Large Language Models”, Wei et al 2024
- “COIG-CQIA: Quality Is All You Need for Chinese Instruction Fine-Tuning”, Bai et al 2024
- “RewardBench: Evaluating Reward Models for Language Modeling”, Lambert et al 2024
- “Evaluating Text to Image Synthesis: Survey and Taxonomy of Image Quality Metrics”, Hartwig et al 2024
- “Hierarchical Feature Warping and Blending for Talking Head Animation”, Zhang et al 2024
- “Mastering Text, Code and Math Simultaneously via Fusing Highly Specialized Language Models”, Ding et al 2024
- “ELLA: Equip Diffusion Models With LLM for Enhanced Semantic Alignment”, Hu et al 2024
- “Investigating Continual Pretraining in Large Language Models: Insights and Implications”, Yıldız et al 2024
- “Hal-Eval: A Universal and Fine-Grained Hallucination Evaluation Framework for Large Vision Language Models”, Jiang et al 2024
- “ArtPrompt: ASCII Art-Based Jailbreak Attacks against Aligned LLMs”, Jiang et al 2024
- “DE-COP: Detecting Copyrighted Content in Language Models Training Data”, Duarte et al 2024
- “I Think, Therefore I Am: Benchmarking Awareness of Large Language Models Using AwareBench”, Li et al 2024
- “Can AI Assistants Know What They Don’t Know?”, Cheng et al 2024
- “AnimeDiffusion: Anime Diffusion Colorization”, Cao et al 2024
- “I Am a Strange Dataset: Metalinguistic Tests for Language Models”, Thrush et al 2024
- “Generative AI for Math: Part I—MathPile: A Billion-Token-Scale Pretraining Corpus for Math”, Wang et al 2023
- “WaveCoder: Widespread And Versatile Enhanced Instruction Tuning With Refined Data Generation”, Yu et al 2023
- “Large Language Models Play StarCraft II: Benchmarks and A Chain of Summarization Approach”, Ma et al 2023
- “StarVector: Generating Scalable Vector Graphics Code from Images”, Rodriguez et al 2023
- “Rich Human Feedback for Text-To-Image Generation”, Liang et al 2023
- “TinyGSM: Achieving >80% on GSM8k With Small Language Models”, Liu et al 2023
- “EQ-Bench: An Emotional Intelligence Benchmark for Large Language Models”, Paech 2023
- “Retrieving Conditions from Reference Images for Diffusion Models”, Tang et al 2023
- “Sequential Modeling Enables Scalable Learning for Large Vision Models”, Bai et al 2023
- “BioCLIP: A Vision Foundation Model for the Tree of Life”, Stevens et al 2023
- “Efficient Transformer Knowledge Distillation: A Performance Review”, Brown et al 2023
- “GPQA: A Graduate-Level Google-Proof Q&A Benchmark”, Rein et al 2023
- “Dazed & Confused: A Large-Scale Real-World User Study of ReCAPTCHAv2”, Searles et al 2023
- “Instruction-Following Evaluation for Large Language Models”, Zhou et al 2023
- “In Search of the Long-Tail: Systematic Generation of Long-Tail Inferential Knowledge via Logical Rule Guided Search”, Li et al 2023
- “AnyText: Multilingual Visual Text Generation And Editing”, Tuo et al 2023
- “GLaMM: Pixel Grounding Large Multimodal Model”, Rasheed et al 2023
- “Don’t Make Your LLM an Evaluation Benchmark Cheater”, Zhou et al 2023
- “CommonCanvas: An Open Diffusion Model Trained With Creative-Commons Images”, Gokaslan et al 2023
- “FANToM: A Benchmark for Stress-Testing Machine Theory of Mind in Interactions”, Kim et al 2023
- “MuSR: Testing the Limits of Chain-Of-Thought With Multistep Soft Reasoning”, Sprague et al 2023
- “Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition”, Schulhoff et al 2023
- “Llemma: An Open Language Model For Mathematics”, Azerbayev et al 2023
- “From Scarcity to Efficiency: Improving CLIP Training via Visual-Enriched Captions”, Lai et al 2023
- “TabLib: A Dataset of 627M Tables With Context”, Eggert et al 2023
- “SWE-Bench: Can Language Models Resolve Real-World GitHub Issues?”, Jimenez et al 2023
- “OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text”, Paster et al 2023
- “FreshLLMs: Refreshing Large Language Models With Search Engine Augmentation”, Vu et al 2023
- “UltraFeedback: Boosting Language Models With High-Quality Feedback”, Cui et al 2023
- “MTOB: A Benchmark for Learning to Translate a New Language from One Grammar Book”, Tanzer et al 2023
- “Demystifying CLIP Data”, Xu et al 2023
- “The Cambridge Law Corpus: A Corpus for Legal AI Research”, Östling et al 2023
- “MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models”, Yu et al 2023
- “LongLoRA: Efficient Fine-Tuning of Long-Context Large Language Models”, Chen et al 2023
- “SlimPajama-DC: Understanding Data Combinations for LLM Training”, Shen et al 2023
- “MADLAD-400: A Multilingual And Document-Level Large Audited Dataset”, Kudugunta et al 2023
- “GoodWiki”, Choi 2023
- “From Sparse to Dense: GPT-4 Summarization With Chain of Density (CoD) Prompting”, Adams et al 2023
- “FIMO: A Challenge Formal Dataset for Automated Theorem Proving”, Liu et al 2023
- “American Stories: A Large-Scale Structured Text Dataset of Historical U.S. Newspapers”, Dell et al 2023
- “LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models”, Guha et al 2023
- “The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain”, Moskvichev et al 2023
- “Android in the Wild: A Large-Scale Dataset for Android Device Control”, Rawles et al 2023
- “DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection for Conversational AI”, Zhang et al 2023
- “AlpaGasus: Training A Better Alpaca With Fewer Data”, Chen et al 2023
- “InternVid: A Large-Scale Video-Text Dataset for Multimodal Understanding and Generation”, Wang et al 2023
- “Instruction Mining: High-Quality Instruction Data Selection for Large Language Models”, Cao et al 2023
- “Test-Time Training on Video Streams”, Wang et al 2023
- “HEADLINES: A Massive Scale Semantic Similarity Dataset of Historical English”, Silcock & Dell 2023
- “LeanDojo: Theorem Proving With Retrieval-Augmented Language Models”, Yang et al 2023
- “SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality”, Hsieh et al 2023
- “ARIES: A Corpus of Scientific Paper Edits Made in Response to Peer Reviews”, D’Arcy et al 2023
- “Understanding Social Reasoning in Language Models With Language Models”, Gandhi et al 2023
- “OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents”, Laurençon et al 2023
- “AI Is a Lot of Work: As the Technology Becomes Ubiquitous, a Vast Tasker Underclass Is Emerging—And Not Going Anywhere”, Dzieza 2023
- “Anime Character Identification and Tag Prediction by Multimodality Modeling: Dataset and Model”, Yi et al 2023
- “ChessGPT: Bridging Policy Learning and Language Modeling”, Feng et al 2023
- “Why YouTube Could Give Google an Edge in AI”, Victor 2023
- “Artificial Artificial Artificial Intelligence: Crowd Workers Widely Use Large Language Models for Text Production Tasks”, Veselovsky et al 2023
- “The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora With Web Data, and Web Data Only”, Penedo et al 2023
- “Let’s Verify Step by Step”, Lightman et al 2023
- “WikiChat: Stopping the Hallucination of Large Language Model Chatbots by Few-Shot Grounding on Wikipedia”, Semnani et al 2023
- “SeeGULL: A Stereotype Benchmark With Broad Geo-Cultural Coverage Leveraging Generative Models”, Jha et al 2023
- “C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models”, Huang et al 2023
- “TinyStories: How Small Can Language Models Be and Still Speak Coherent English?”, Eldan & Li 2023
- “Pick-A-Pic: An Open Dataset of User Preferences for Text-To-Image Generation”, Kirstain et al 2023
- “LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions”, Wu et al 2023
- “Multi-Party Chat (MultiLIGHT): Conversational Agents in Group Settings With Humans and Models”, Wei et al 2023
- “ImageNet-Hard: The Hardest Images Remaining from a Study of the Power of Zoom and Spatial Biases in Image Classification”, Taesiri et al 2023
- “Parsing-Conditioned Anime Translation: A New Dataset and Method”, Li et al 2023c
- “Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling”, Biderman et al 2023
- “Abstraction-Perception Preserving Cartoon Face Synthesis”, Ho et al 2023
- “How Well Do Large Language Models Perform in Arithmetic Tasks?”, Yuan et al 2023
- “The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset”, Laurençon et al 2023
- “Large Language Models Are State-Of-The-Art Evaluators of Translation Quality”, Kocmi & Federmann 2023
- “Benchmarks for Automated Commonsense Reasoning: A Survey”, Davis 2023
- “Data Selection for Language Models via Importance Resampling”, Xie et al 2023
- “Off-The-Grid MARL (OG-MARL): Datasets With Baselines for Offline Multi-Agent Reinforcement Learning”, Formanek et al 2023
- “The BabyLM Challenge: Sample-Efficient Pretraining on a Developmentally Plausible Corpus”, Warstadt et al 2023
- “The Semantic Scholar Open Data Platform”, Kinney et al 2023
- “Interactive-Chain-Prompting (INTERCPT): Ambiguity Resolution for Crosslingual Conditional Generation With Interaction”, Pilault et al 2023
- “How Close Is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection”, Guo et al 2023
- “Med-PaLM: Large Language Models Encode Clinical Knowledge”, Singhal et al 2022
- “Unnatural Instructions: Tuning Language Models With (Almost) No Human Labor”, Honovich et al 2022
- “HALIE: Evaluating Human-Language Model Interaction”, Lee et al 2022
- “A Whack-A-Mole Dilemma: Shortcuts Come in Multiples Where Mitigating One Amplifies Others”, Li et al 2022
- “Text Embeddings by Weakly-Supervised Contrastive Pre-Training”, Wang et al 2022
- “The Stack: 3 TB of Permissively Licensed Source Code”, Kocetkov et al 2022
- “UniSumm: Unified Few-Shot Summarization With Multi-Task Pre-Training and Prefix-Tuning”, Chen et al 2022
- “A Creative Industry Image Generation Dataset Based on Captions”, Yuejia et al 2022
- “AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities”, Chen et al 2022
- “AnimeRun: 2D Animation Visual Correspondence from Open Source 3D Movies”, Siyao et al 2022
- “MMDialog: A Large-Scale Multi-Turn Dialogue Dataset Towards Multi-Modal Open-Domain Conversation”, Feng et al 2022
- “BLOOMZ/mT0: Crosslingual Generalization through Multitask Finetuning”, Muennighoff et al 2022
- “Dungeons and Data: A Large-Scale NetHack Dataset”, Hambro et al 2022
- “Will We Run out of Data? An Analysis of the Limits of Scaling Datasets in Machine Learning”, Villalobos et al 2022
- “Large Language Models Can Self-Improve”, Huang et al 2022
- “CARP: Robust Preference Learning for Storytelling via Contrastive Reinforcement Learning”, Castricato et al 2022
- “MTEB: Massive Text Embedding Benchmark”, Muennighoff et al 2022
- “Most Language Models Can Be Poets Too: An AI Writing Assistant and Constrained Text Generation Studio”, Roush et al 2022
- “Self-Ask: Measuring and Narrowing the Compositionality Gap in Language Models (Bamboogle)”, Press et al 2022
- “Dynamic Prompt Learning via Policy Gradient for Semi-Structured Mathematical Reasoning”, Lu et al 2022
- “Brain Imaging Generation With Latent Diffusion Models”, Pinaya et al 2022
- “PaLI: A Jointly-Scaled Multilingual Language-Image Model”, Chen et al 2022
- “FOLIO: Natural Language Reasoning With First-Order Logic”, Han et al 2022
- “Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned”, Ganguli et al 2022
- “Bugs in the Data: How ImageNet Misrepresents Biodiversity”, Luccioni & Rolnick 2022
- “Discovering Bugs in Vision Models Using Off-The-Shelf Image Generation and Captioning”, Wiles et al 2022
- “Benchmarking Compositionality With Formal Languages”, Valvoda et al 2022
- “Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP”, Nguyen et al 2022
- “Learning to Generalize With Object-Centric Agents in the Open World Survival Game Crafter”, Stanić et al 2022
- “Few-Shot Adaptation Works With UnpredicTable Data”, Chan et al 2022
- “Language Models Can Teach Themselves to Program Better”, Haluptzok et al 2022
- “RealTime QA: What’s the Answer Right Now?”, Kasai et al 2022
- “NewsStories: Illustrating Articles With Visual Summaries”, Tan et al 2022
- “CelebV-HQ: A Large-Scale Video Facial Attributes Dataset”, Zhu et al 2022
- “Why Do Tree-Based Models Still Outperform Deep Learning on Tabular Data?”, Grinsztajn et al 2022
- “Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset”, Henderson et al 2022
- “Forecasting Future World Events With Neural Networks”, Zou et al 2022
- “RST: ReStructured Pre-Training”, Yuan & Liu 2022
- “Learning to Generate Artistic Character Line Drawing”, Fang et al 2022
- “Dataset Condensation via Efficient Synthetic-Data Parameterization”, Kim et al 2022
- “Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions”, Jiang et al 2022
- “Fine-Grained Image Captioning With CLIP Reward”, Cho et al 2022
- “FLEURS: Few-Shot Learning Evaluation of Universal Representations of Speech”, Conneau et al 2022
- “InstructDial: Improving Zero and Few-Shot Generalization in Dialogue through Instruction Tuning”, Gupta et al 2022
- “Learning to Model Editing Processes”, Reid & Neubig 2022
- “Flexible Diffusion Modeling of Long Videos”, Harvey et al 2022
- “Housekeep: Tidying Virtual Households Using Commonsense Reasoning”, Kant et al 2022
- “Instruction Induction: From Few Examples to Natural Language Task Descriptions”, Honovich et al 2022
- “Down and Across: Introducing Crossword-Solving As a New NLP Benchmark”, Kulshreshtha et al 2022
- “Automated Crossword Solving”, Wallace et al 2022
- “Dialog Inpainting: Turning Documents into Dialogues”, Dai et al 2022
- “SymphonyNet: Symphony Generation With Permutation Invariant Language Model”, Liu et al 2022
“Building Machine Translation Systems for the Next Thousand Languages”, Bapna et al 2022
Building Machine Translation Systems for the Next Thousand Languages
“When Does Dough Become a Bagel? Analyzing the Remaining Mistakes on ImageNet”, Vasudevan et al 2022
When does dough become a bagel? Analyzing the remaining mistakes on ImageNet
“Data Determines Distributional Robustness in Contrastive Language Image Pre-Training (CLIP)”, Fang et al 2022
Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP)
“A Challenging Benchmark of Anime Style Recognition”, Li et al 2022
“Tk-Instruct: Benchmarking Generalization via In-Context Instructions on 1,600+ Language Tasks”, Wang et al 2022
Tk-Instruct: Benchmarking Generalization via In-Context Instructions on 1,600+ Language Tasks
“Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality”, Thrush et al 2022
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality
“KNN-Diffusion: Image Generation via Large-Scale Retrieval”, Ashual et al 2022
“ByT5 Model for Massively Multilingual Grapheme-To-Phoneme Conversion”, Zhu et al 2022
“STaR: Bootstrapping Reasoning With Reasoning”, Zelikman et al 2022
“CLIP Meets GamePhysics: Towards Bug Identification in Gameplay Videos Using Zero-Shot Transfer Learning”, Taesiri et al 2022
“Bamboo: Building Mega-Scale Vision Dataset Continually With Human-Machine Synergy”, Zhang et al 2022
“Self-Distilled StyleGAN: Towards Generation from Internet Photos”, Mokady et al 2022
“RuCLIP—New Models and Experiments: a Technical Report”, Shonenkov et al 2022
“Wukong: 100 Million Large-Scale Chinese Cross-Modal Pre-Training Dataset and A Foundation Framework”, Gu et al 2022
“ROME: Locating and Editing Factual Associations in GPT”, Meng et al 2022
“DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-To-Image Generative Transformers”, Cho et al 2022
“PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts”, Bach et al 2022
“StyleGAN-XL: Scaling StyleGAN to Large Diverse Datasets”, Sauer et al 2022
“BLIP: Bootstrapping Language-Image Pre-Training for Unified Vision-Language Understanding and Generation”, Li et al 2022
“Can Wikipedia Help Offline Reinforcement Learning?”, Reid et al 2022
“SWAG: Revisiting Weakly Supervised Pre-Training of Visual Perception Models”, Singh et al 2022
“CoAuthor: Designing a Human-AI Collaborative Writing Dataset for Exploring Language Model Capabilities”, Lee et al 2022
“WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation”, Liu et al 2022
“SynthBio: A Case Study in Faster Curation of Text Datasets”, Yuan et al 2022
“BigDatasetGAN: Synthesizing ImageNet With Pixel-Wise Annotations”, Li et al 2022
“ERNIE-ViLG: Unified Generative Pre-Training for Bidirectional Vision-Language Generation”, Zhang et al 2021
“A Fistful of Words: Learning Transferable Visual Models from Bag-Of-Words Supervision”, Tejankar et al 2021
“GLIDE: Towards Photorealistic Image Generation and Editing With Text-Guided Diffusion Models”, Nichol et al 2021
“QuALITY: Question Answering With Long Input Texts, Yes!”, Pang et al 2021
“FRUIT: Faithfully Reflecting Updated Information in Text”, Logan et al 2021
“Models in the Loop: Aiding Crowdworkers With Generative Annotation Assistants”, Bartolo et al 2021
“WebGPT: Browser-Assisted Question-Answering With Human Feedback”, Nakano et al 2021
“GLaM: Efficient Scaling of Language Models With Mixture-Of-Experts”, Du et al 2021
“MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions”, Soldan et al 2021
“BASIC: Combined Scaling for Open-Vocabulary Image Classification”, Pham et al 2021
“It’s About Time: Analog Clock Reading in the Wild”, Yang et al 2021
“Solving Probability and Statistics Problems by Program Synthesis”, Tang et al 2021
“Few-Shot Self-Rationalization With Natural Language Prompts”, Marasović et al 2021
“AnimeCeleb: Large-Scale Animation CelebHeads Dataset for Head Reenactment”, Kim et al 2021
“RLDS: an Ecosystem to Generate, Share and Use Datasets in Reinforcement Learning”, Ramos et al 2021
“An Explanation of In-Context Learning As Implicit Bayesian Inference”, Xie et al 2021
“LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs”, Schuhmann et al 2021
“Training Verifiers to Solve Math Word Problems”, Cobbe et al 2021
“A Connectome of the Drosophila Central Complex Reveals Network Motifs Suitable for Flexible Navigation and Context-Dependent Action Selection”, Hulse et al 2021
“HTCN: Harmonious Text Colorization Network for Visual-Textual Presentation Design”, Yang et al 2021c
“T0: Multitask Prompted Training Enables Zero-Shot Task Generalization”, Sanh et al 2021
“Can Machines Learn Morality? The Delphi Experiment”, Jiang et al 2021
“Situated Dialogue Learning through Procedural Environment Generation”, Ammanabrolu et al 2021
“MiniHack the Planet: A Sandbox for Open-Ended Reinforcement Learning Research”, Samvelyan et al 2021
“TruthfulQA: Measuring How Models Mimic Human Falsehoods”, Lin et al 2021
“MiniF2F: a Cross-System Benchmark for Formal Olympiad-Level Mathematics”, Zheng et al 2021
“LAION-400-Million Open Dataset”, Schuhmann 2021
“Transfer Learning for Pose Estimation of Illustrated Characters”, Chen & Zwicker 2021
“MuSiQue: Multi-Hop Questions via Single-Hop Question Composition”, Trivedi et al 2021
“Scaling Vision Transformers”, Zhai et al 2021
“QASPER: A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers”, Dasigi et al 2021
“XLM-T: Multilingual Language Models in Twitter for Sentiment Analysis and Beyond”, Barbieri et al 2021
“BEIR: A Heterogenous Benchmark for Zero-Shot Evaluation of Information Retrieval Models”, Thakur et al 2021
“SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network”, Chan et al 2021
“Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks”, Northcutt et al 2021
“NaturalProofs: Mathematical Theorem Proving in Natural Language”, Welleck et al 2021
“Get Your Vitamin C! Robust Fact Verification With Contrastive Evidence (VitaminC)”, Schuster et al 2021
“Are NLP Models Really Able to Solve Simple Math Word Problems?”, Patel et al 2021
“Measuring Mathematical Problem Solving With the MATH Dataset”, Hendrycks et al 2021
“WIT: Wikipedia-Based Image Text Dataset for Multimodal Multilingual Machine Learning”, Srinivasan et al 2021
“A Massive 7T FMRI Dataset to Bridge Cognitive and Computational Neuroscience”, Allen et al 2021
“Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts”, Changpinyo et al 2021
“ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision”, Jia et al 2021
“Mind the Gap: Assessing Temporal Generalization in Neural Language Models § Scaling”, Lazaridou et al 2021
“Scaling Laws for Transfer”, Hernandez et al 2021
“Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning”, Lee et al 2021
“MSR-VTT: A Large Video Description Dataset for Bridging Video and Language”, Xu et al 2021
“CLIP: Learning Transferable Visual Models From Natural Language Supervision”, Radford et al 2021
“CLIP: Connecting Text and Images: We’re Introducing a Neural Network Called CLIP Which Efficiently Learns Visual Concepts from Natural Language Supervision. CLIP Can Be Applied to Any Visual Classification Benchmark by Simply Providing the Names of the Visual Categories to Be Recognized, Similar to the ‘Zero-Shot’ Capabilities of GPT-2 and GPT-3”, Radford et al 2021
“The Pile: An 800GB Dataset of Diverse Text for Language Modeling”, Gao et al 2021
“Selective Eye-Gaze Augmentation To Enhance Imitation Learning In Atari Games”, Thammineni et al 2020
“VoxLingua107: a Dataset for Spoken Language Recognition”, Valk & Alumäe 2020
“MoGaze: A Dataset of Full-Body Motions That Includes Workspace Geometry and Eye-Gaze”, Kratzer et al 2020
“End-To-End Chinese Landscape Painting Creation Using Generative Adversarial Networks”, Xue 2020
“Hypersim: A Photorealistic Synthetic Dataset for Holistic Indoor Scene Understanding”, Roberts et al 2020
“Constructing A Multi-Hop QA Dataset for Comprehensive Evaluation of Reasoning Steps”, Ho et al 2020
“Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus”, Caswell et al 2020
“Open-Domain Question Answering Goes Conversational via Question Rewriting”, Anantha et al 2020
“Digital Voicing of Silent Speech”, Gaddy & Klein 2020
“A C/C++ Code Vulnerability Dataset With Code Changes and CVE Summaries”, Fan et al 2020
“MMLU: Measuring Massive Multitask Language Understanding”, Hendrycks et al 2020
“ETHICS: Aligning AI With Shared Human Values”, Hendrycks et al 2020
“Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing”, Gu et al 2020
“CoVoST 2 and Massively Multilingual Speech-To-Text Translation”, Wang et al 2020
“The Many Faces of Robustness: A Critical Analysis of Out-Of-Distribution Generalization”, Hendrycks et al 2020
“The NetHack Learning Environment”, Küttler et al 2020
“Anime Crop Datasets: Faces, Figures, & Hands”, Gwern et al 2020
“ForecastQA: A Question Answering Challenge for Event Forecasting With Temporal Text Data”, Jin et al 2020
“Shortcut Learning in Deep Neural Networks”, Geirhos et al 2020
“D4RL: Datasets for Deep Data-Driven Reinforcement Learning”, Fu et al 2020
“TyDiQA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages”, Clark et al 2020
“SAYCam: A Large, Longitudinal Audiovisual Dataset Recorded from the Infant’s Perspective”, Sullivan et al 2020
“ImageNet-A: Natural Adversarial Examples”, Hendrycks et al 2020
“Measuring Compositional Generalization: A Comprehensive Method on Realistic Data”, Keysers et al 2019
“Libri-Light: A Benchmark for ASR With Limited or No Supervision”, Kahn et al 2019
“How Can We Know What Language Models Know?”, Jiang et al 2019
“SimpleBooks: Long-Term Dependency Book Dataset With Simplified English Vocabulary for Word-Level Language Modeling”, Nguyen 2019
“How Machine Learning Can Help Unlock the World of Ancient Japan”, Lamb 2019
“Compressive Transformers for Long-Range Sequence Modeling”, Rae et al 2019
“CommonGen: A Constrained Text Generation Challenge for Generative Commonsense Reasoning”, Lin et al 2019
“CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data”, Wenzek et al 2019
“T5: Exploring the Limits of Transfer Learning With a Unified Text-To-Text Transformer”, Raffel et al 2019
“Restoring Ancient Text Using Deep Learning (Pythia): a Case Study on Greek Epigraphy”, Assael et al 2019
“CATER: A Diagnostic Dataset for Compositional Actions and TEmporal Reasoning”, Girdhar & Ramanan 2019
“PubMedQA: A Dataset for Biomedical Research Question Answering”, Jin et al 2019
“ObjectNet: A Large-Scale Bias-Controlled Dataset for Pushing the Limits of Object Recognition Models”, Barbu et al 2019
“No Press Diplomacy: Modeling Multi-Agent Gameplay”, Paquette et al 2019
“Language Modeling State-Of-The-Art Leaderboards”, paperswithcode.com 2019
“LVIS: A Dataset for Large Vocabulary Instance Segmentation”, Gupta et al 2019
“Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank”, Socher et al 2019
“A Large Single-Participant FMRI Dataset for Probing Brain Responses to Naturalistic Stimuli in Space and Time”, Seeliger et al 2019
“OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge”, Marino et al 2019
“ImageNet-Sketch: Learning Robust Global Representations by Penalizing Local Predictive Power”, Wang et al 2019
“Cold Case: The Lost MNIST Digits”, Yadav & Bottou 2019
“SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems”, Wang et al 2019
“The MineRL 2019 Competition on Sample Efficient Reinforcement Learning Using Human Priors”, Guss et al 2019
“ProductNet: a Collection of High-Quality Datasets for Product Representation Learning”, Wang et al 2019
“Benchmarking Neural Network Robustness to Common Corruptions and Perturbations”, Hendrycks & Dietterich 2019
“Atari-HEAD: Atari Human Eye-Tracking and Demonstration Dataset”, Zhang et al 2019
“LIGHT: Learning to Speak and Act in a Fantasy Text Adventure Game”, Urbanek et al 2019
“A Replication Study: Machine Learning Models Are Capable of Predicting Sexual Orientation From Facial Images”, Leuner 2019
“Language Models Are Unsupervised Multitask Learners”, Radford et al 2019
“The Omniglot Challenge: a 3-Year Progress Report”, Lake et al 2019
“Do We Train on Test Data? Purging CIFAR of Near-Duplicates”, Barz & Denzler 2019
“The RobotriX: An EXtremely Photorealistic and Very-Large-Scale Indoor Dataset of Sequences With Robot Trajectories and Interactions”, Garcia-Garcia et al 2019
“FIGR: Few-Shot Image Generation With Reptile”, Clouâtre & Demers 2019
“Natural Questions: A Benchmark for Question Answering Research”, Kwiatkowski et al 2019
“A Style-Based Generator Architecture for Generative Adversarial Networks”, Karras et al 2018
“ImageNet-Trained CNNs Are Biased towards Texture; Increasing Shape Bias Improves Accuracy and Robustness”, Geirhos et al 2018
“CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge”, Talmor et al 2018
“The Open Images Dataset V4: Unified Image Classification, Object Detection, and Visual Relationship Detection at Scale”, Kuznetsova et al 2018
“HotpotQA: A Dataset for Diverse, Explainable Multi-Hop Question Answering”, Yang et al 2018
“Don’t Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization”, Narayan et al 2018
“CurriculumNet: Weakly Supervised Learning from Large-Scale Web Images”, Guo et al 2018
“A Short Note about Kinetics-600”, Carreira et al 2018
“Cartoon Set”, Royer et al 2018
“Benchmarking Neural Network Robustness to Common Corruptions and Surface Variations”, Hendrycks & Dietterich 2018
“Know What You Don’t Know: Unanswerable Questions for SQuAD”, Rajpurkar et al 2018
“BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning”, Yu et al 2018
“Exploring the Limits of Weakly Supervised Pretraining”, Mahajan et al 2018
“Newsroom: A Dataset of 1.3 Million Summaries With Diverse Extractive Strategies”, Grusky et al 2018
“GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding”, Wang et al 2018
“The Sound of Pixels”, Zhao et al 2018
“FEVER: a Large-Scale Dataset for Fact Extraction and VERification”, Thorne et al 2018
“Think You Have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge”, Clark et al 2018
“SCUT-FBP5500: A Diverse Benchmark Dataset for Multi-Paradigm Facial Beauty Prediction”, Liang et al 2018
“11K Hands: Gender Recognition and Biometric Identification Using a Large Dataset of Hand Images”, Afifi 2017
“Progressive Growing of GANs for Improved Quality, Stability, and Variation”, Karras et al 2017
“OpenML Benchmarking Suites”, Bischl et al 2017
“WebVision Database: Visual Learning and Understanding from Web Data”, Li et al 2017
“A Downsampled Variant of ImageNet As an Alternative to the CIFAR Datasets”, Chrabaszcz et al 2017
“Revisiting Unreasonable Effectiveness of Data in Deep Learning Era”, Sun et al 2017
“Driver Identification Using Automobile Sensor Data from a Single Turn”, Hallac et al 2017
“StreetStyle: Exploring World-Wide Clothing Styles from Millions of Photos”, Matzen et al 2017
“The Kinetics Human Action Video Dataset”, Kay et al 2017
“WebVision Challenge: Visual Learning and Understanding With Web Data”, Li et al 2017
“TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension”, Joshi et al 2017
“Dense-Captioning Events in Videos”, Krishna et al 2017
“BAM! The Behance Artistic Media Dataset for Recognition Beyond Photography”, Wilber et al 2017
“SearchQA: A New Q&A Dataset Augmented With Context from a Search Engine”, Dunn et al 2017
“RACE: Large-Scale ReAding Comprehension Dataset From Examinations”, Lai et al 2017
“NewsQA: A Machine Comprehension Dataset”, Trischler et al 2016
“MS MARCO: A Human Generated MAchine Reading COmprehension Dataset”, Bajaj et al 2016
“Lip Reading Sentences in the Wild”, Chung et al 2016
“Pointer Sentinel Mixture Models”, Merity et al 2016
“Deep Learning the City: Quantifying Urban Perception At A Global Scale”, Dubey et al 2016
“Solving General Arithmetic Word Problems”, Roy & Roth 2016
“The LAMBADA Dataset: Word Prediction Requiring a Broad Discourse Context”, Paperno et al 2016
“SQuAD: 100,000+ Questions for Machine Comprehension of Text”, Rajpurkar et al 2016
“Matching Networks for One Shot Learning”, Vinyals et al 2016
“Convolutional Sketch Inversion”, Güçlütürk et al 2016
“The MovieLens Datasets: History and Context”, Harper & Konstan 2015
“Neural Module Networks”, Andreas et al 2015
“Sketch-Based Manga Retrieval Using Manga109 Dataset”, Matsui et al 2015
“Amazon Reviews: Image-Based Recommendations on Styles and Substitutes”, McAuley et al 2015
“Teaching Machines to Read and Comprehend”, Hermann et al 2015
“LSUN: Construction of a Large-Scale Image Dataset Using Deep Learning With Humans in the Loop”, Yu et al 2015
“VQA: Visual Question Answering”, Agrawal et al 2015
“YFCC100M: The New Data in Multimedia Research”, Thomee et al 2015
“ImageNet Large Scale Visual Recognition Challenge”, Russakovsky et al 2014
“Microsoft COCO: Common Objects in Context”, Lin et al 2014
“N-Gram Counts and Language Models from the Common Crawl”, Buck et al 2014
“Ukiyo-E Search”, Resig 2013
“UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild”, Soomro et al 2012
“The Caltech-UCSD Birds-200-2011 Dataset”, Wah et al 2011
“Unbiased Look at Dataset Bias”, Torralba & Efros 2011
“Caltech-UCSD Birds 200”, Welinder et al 2010
“Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments”, Huang et al 2008
“Building a Large Annotated Corpus of English: The Penn Treebank”, Marcus et al 1993
“About the Test Data”
“DataGemma: AI Open Models Connecting LLMs to Google’s Data Commons”
“Scale AI Secures $1B Funding at $14B Valuation As Its CEO Predicts Big Revenue Growth and Profitability by Year-End [On Very High Quality Data]”
“No Robots: Look Ma, an Instruction Dataset That Wasn’t Generated by GPTs!”, HuggingFace 2024
“Psych-101 Dataset [For Centaur]”
“FineWeb: Decanting the Web for the Finest Text Data at Scale”
“Solving Math Word Problems: We’ve Trained a System That Solves Grade School Math Problems With Nearly Twice the Accuracy of a Fine-Tuned GPT-3 Model. It Solves about 90% As Many Problems As Real Kids: a Small Sample of 9-12 Year Olds Scored 60% on a Test from Our Dataset, While Our System Scored 55% on Those Same Problems. This Is Important Because Today’s AI Is Still Quite Weak at Commonsense Multistep Reasoning, Which Is Easy Even for Grade School Kids. We Achieved These Results by Training Our Model to Recognize Its Mistakes, so That It Can Try Repeatedly Until It Finds a Solution That Works”
Sort By Magic
Annotations sorted by machine learning into inferred 'tags'. This provides an alternative way to browse: instead of by date order, one can browse in topic order. The 'sorted' list has been automatically clustered into multiple sections & auto-labeled for easier browsing.
Beginning with the newest annotation, it uses the embedding of each annotation to attempt to create a list of nearest-neighbor annotations, creating a progression of topics. For more details, see the link.
bioclip
model-interaction
dataset-robustness
Wikipedia
Miscellaneous
- /doc/ai/dataset/2023-pilaut-figure1-interactivechainpromptingforqaabouttranslationambiguities.jpg
- /doc/ai/dataset/2020-caswell-table2-examplesofmisleadingtextlanguageassociations.png
- /doc/ai/dataset/2008-sandhaus.pdf
- http://cl-informatik.uibk.ac.at/cek/holstep/ckfccs-holstep-submitted.pdf
- https://karpathy.github.io/2011/04/27/manually-classifying-cifar10/
- https://openaccess.thecvf.com/content_cvpr_2014/papers/Andriluka_2D_Human_Pose_2014_CVPR_paper.pdf
- https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37648.pdf
Bibliography
- https://arxiv.org/abs/2410.07095#openai: “MLE-Bench: Evaluating Machine Learning Agents on Machine Learning Engineering”
- https://arxiv.org/abs/2407.20020: “ImagiNet: A Multi-Content Dataset for Generalizable Synthetic Image Detection via Contrastive Learning”
- https://arxiv.org/abs/2407.04694: “Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs”
- https://arxiv.org/abs/2407.04108: “Future Events As Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs”
- https://arxiv.org/abs/2406.18906: “Sonnet or Not, Bot? Poetry Evaluation for Large Models and Datasets”
- https://arxiv.org/abs/2406.18518#salesforce: “APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets”
- https://arxiv.org/abs/2406.13121#google: “Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?”
- https://arxiv.org/abs/2406.11794: “DataComp-LM: In Search of the next Generation of Training Sets for Language Models”
- https://arxiv.org/abs/2405.18870#google: “LLMs Achieve Adult Human Performance on Higher-Order Theory of Mind Tasks”
- https://arxiv.org/abs/2405.15306: “DeTikZify: Synthesizing Graphics Programs for Scientific Figures and Sketches With TikZ”
- https://arxiv.org/abs/2405.02793#google: “ImageInWords: Unlocking Hyper-Detailed Image Descriptions”
- https://arxiv.org/abs/2405.00332#scale: “GSM1k: A Careful Examination of Large Language Model Performance on Grade School Arithmetic”
- https://arxiv.org/abs/2404.06664: “CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs’ (Lack Of) Multicultural Knowledge”
- https://arxiv.org/abs/2404.05955: “VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?”
- https://www.nytimes.com/2024/04/06/technology/tech-giants-harvest-data-artificial-intelligence.html: “How Tech Giants Cut Corners to Harvest Data for AI: OpenAI, Google and Meta Ignored Corporate Policies, Altered Their Own Rules and Discussed Skirting Copyright Law As They Sought Online Information to Train Their Newest Artificial Intelligence Systems”
- https://arxiv.org/abs/2403.18624: “Vulnerability Detection With Code Language Models: How Far Are We?”
- https://arxiv.org/abs/2403.18802#deepmind: “Long-Form Factuality in Large Language Models”
- https://arxiv.org/abs/2402.11753: “ArtPrompt: ASCII Art-Based Jailbreak Attacks against Aligned LLMs”
- https://arxiv.org/abs/2312.11556: “StarVector: Generating Scalable Vector Graphics Code from Images”
- https://arxiv.org/abs/2312.06281: “EQ-Bench: An Emotional Intelligence Benchmark for Large Language Models”
- https://arxiv.org/abs/2311.13657: “Efficient Transformer Knowledge Distillation: A Performance Review”
- https://arxiv.org/abs/2310.16825: “CommonCanvas: An Open Diffusion Model Trained With Creative-Commons Images”
- https://arxiv.org/abs/2310.06786: “OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text”
- https://arxiv.org/abs/2310.03214#google: “FreshLLMs: Refreshing Large Language Models With Search Engine Augmentation”
- https://arxiv.org/abs/2310.01377: “UltraFeedback: Boosting Language Models With High-Quality Feedback”
- https://arxiv.org/abs/2309.16671: “Demystifying CLIP Data”
- https://arxiv.org/abs/2309.12269: “The Cambridge Law Corpus: A Corpus for Legal AI Research”
- https://arxiv.org/abs/2309.12284: “MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models”
- https://arxiv.org/abs/2309.10818#cerebras: “SlimPajama-DC: Understanding Data Combinations for LLM Training”
- https://arxiv.org/abs/2309.04269: “From Sparse to Dense: GPT-4 Summarization With Chain of Density (CoD) Prompting”
- https://arxiv.org/abs/2308.12477: “American Stories: A Large-Scale Structured Text Dataset of Historical U.S. Newspapers”
- https://arxiv.org/abs/2307.08701#samsung: “AlpaGasus: Training A Better Alpaca With Fewer Data”
- https://arxiv.org/abs/2307.05014: “Test-Time Training on Video Streams”
- https://arxiv.org/abs/2306.15626: “LeanDojo: Theorem Proving With Retrieval-Augmented Language Models”
- https://arxiv.org/abs/2306.12587: “ARIES: A Corpus of Scientific Paper Edits Made in Response to Peer Reviews”
- https://arxiv.org/abs/2306.15448: “Understanding Social Reasoning in Language Models With Language Models”
- https://www.theverge.com/features/23764584/ai-artificial-intelligence-data-notation-labor-scale-surge-remotasks-openai-chatbots: “AI Is a Lot of Work: As the Technology Becomes Ubiquitous, a Vast Tasker Underclass Is Emerging—And Not Going Anywhere”
- 2023-yi.pdf: “Anime Character Identification and Tag Prediction by Multimodality Modeling: Dataset and Model”
- https://www.theinformation.com/articles/why-youtube-could-give-google-an-edge-in-ai: “Why YouTube Could Give Google an Edge in AI”
- https://arxiv.org/abs/2305.20050#openai: “Let’s Verify Step by Step”
- https://arxiv.org/abs/2305.11840#google: “SeeGULL: A Stereotype Benchmark With Broad Geo-Cultural Coverage Leveraging Generative Models”
- https://arxiv.org/abs/2305.07759#microsoft: “TinyStories: How Small Can Language Models Be and Still Speak Coherent English?”
- https://arxiv.org/abs/2305.01569: “Pick-A-Pic: An Open Dataset of User Preferences for Text-To-Image Generation”
- https://arxiv.org/abs/2304.05538: “ImageNet-Hard: The Hardest Images Remaining from a Study of the Power of Zoom and Spatial Biases in Image Classification”
- https://arxiv.org/abs/2304.02015#alibaba: “How Well Do Large Language Models Perform in Arithmetic Tasks?”
- https://arxiv.org/abs/2302.14520: “Large Language Models Are State-Of-The-Art Evaluators of Translation Quality”
- https://arxiv.org/abs/2302.03169: “Data Selection for Language Models via Importance Resampling”
- https://arxiv.org/abs/2212.13138#google: “Med-PaLM: Large Language Models Encode Clinical Knowledge”
https://arxiv.org/abs/2212.03533#microsoft
: “Text Embeddings by Weakly-Supervised Contrastive Pre-Training”, -
https://arxiv.org/abs/2211.15533
: “The Stack: 3 TB of Permissively Licensed Source Code”, -
https://arxiv.org/abs/2211.06679#baai
: “AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities”, -
https://arxiv.org/abs/2211.01786
: “BLOOMZ/mT0: Crosslingual Generalization through Multitask Finetuning”, -
https://arxiv.org/abs/2210.11610#google
: “Large Language Models Can Self-Improve”, -
https://arxiv.org/abs/2210.07792#eleutherai
: “CARP: Robust Preference Learning for Storytelling via Contrastive Reinforcement Learning”, -
https://aclanthology.org/2022.cai-1.2.pdf
: “Most Language Models Can Be Poets Too: An AI Writing Assistant and Constrained Text Generation Studio”, -
https://arxiv.org/abs/2210.03350#allen
: “Self-Ask: Measuring and Narrowing the Compositionality Gap in Language Models (Bamboogle)”, -
https://arxiv.org/abs/2209.00840
: “FOLIO: Natural Language Reasoning With First-Order Logic”, -
https://www.anthropic.com/red_teaming.pdf
: “Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned”, -
https://arxiv.org/abs/2208.08831#deepmind
: “Discovering Bugs in Vision Models Using Off-The-Shelf Image Generation and Captioning”, -
https://arxiv.org/abs/2208.05516
: “Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP”, -
https://arxiv.org/abs/2207.13061
: “NewsStories: Illustrating Articles With Visual Summaries”, -
https://arxiv.org/abs/2206.15474
: “Forecasting Future World Events With Neural Networks”, -
https://arxiv.org/abs/2205.09665#bair
: “Automated Crossword Solving”, -
https://arxiv.org/abs/2205.09073#google
: “Dialog Inpainting: Turning Documents into Dialogues”, -
https://arxiv.org/abs/2205.03983#google
: “Building Machine Translation Systems for the Next Thousand Languages”, -
https://arxiv.org/abs/2205.04596#google
: “When Does Dough Become a Bagel? Analyzing the Remaining Mistakes on ImageNet”, -
https://arxiv.org/abs/2205.01397
: “Data Determines Distributional Robustness in Contrastive Language Image Pre-Training (CLIP)”, -
https://arxiv.org/abs/2204.07705
: “Tk-Instruct: Benchmarking Generalization via In-Context Instructions on 1,600+ Language Tasks”, -
https://arxiv.org/abs/2204.03067
: “ByT5 Model for Massively Multilingual Grapheme-To-Phoneme Conversion”, -
https://arxiv.org/abs/2203.11096
: “CLIP Meets GamePhysics: Towards Bug Identification in Gameplay Videos Using Zero-Shot Transfer Learning”, -
https://arxiv.org/abs/2202.12211#google
: “Self-Distilled StyleGAN: Towards Generation from Internet Photos”, -
https://arxiv.org/abs/2202.06767#huawei
: “Wukong: 100 Million Large-Scale Chinese Cross-Modal Pre-Training Dataset and A Foundation Framework”, -
https://arxiv.org/abs/2202.00273
: “StyleGAN-XL: Scaling StyleGAN to Large Diverse Datasets”, -
https://arxiv.org/abs/2201.12086#salesforce
: “BLIP: Bootstrapping Language-Image Pre-Training for Unified Vision-Language Understanding and Generation”, -
https://arxiv.org/abs/2201.08371#facebook
: “SWAG: Revisiting Weakly Supervised Pre-Training of Visual Perception Models”, -
https://swabhs.com/assets/pdf/wanli.pdf#allen
: “WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation”, -
https://arxiv.org/abs/2201.04684
: “BigDatasetGAN: Synthesizing ImageNet With Pixel-Wise Annotations”, -
https://arxiv.org/abs/2112.15283#baidu
: “ERNIE-ViLG: Unified Generative Pre-Training for Bidirectional Vision-Language Generation”, -
https://arxiv.org/abs/2112.09332#openai
: “WebGPT: Browser-Assisted Question-Answering With Human Feedback”, -
https://arxiv.org/abs/2111.10050#google
: “BASIC: Combined Scaling for Open-Vocabulary Image Classification”, -
https://arxiv.org/abs/2111.09162
: “It’s About Time: Analog Clock Reading in the Wild”, -
https://arxiv.org/abs/2111.08267
: “Solving Probability and Statistics Problems by Program Synthesis”, -
https://arxiv.org/abs/2111.02114#laion
: “LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs”, -
https://arxiv.org/abs/2110.14168#openai
: “Training Verifiers to Solve Math Word Problems”, -
https://elifesciences.org/articles/66039
: “A Connectome of the Drosophila Central Complex Reveals Network Motifs Suitable for Flexible Navigation and Context-Dependent Action Selection”, -
https://arxiv.org/abs/2109.07958
: “TruthfulQA: Measuring How Models Mimic Human Falsehoods”, -
https://laion.ai/blog/laion-400-open-dataset/
: “LAION-400-Million Open Dataset”, -
https://arxiv.org/abs/2106.04560#google
: “Scaling Vision Transformers”, -
https://arxiv.org/abs/2103.14749
: “Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks”, -
https://arxiv.org/abs/2102.05918#google
: “ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision”, -
https://arxiv.org/abs/2102.01951#scaling&org=deepmind
: “Mind the Gap: Assessing Temporal Generalization in Neural Language Models § Scaling”, -
https://cdn.openai.com/papers/Learning_Transferable_Visual_Models_From_Natural_Language_Supervision.pdf
: “CLIP: Learning Transferable Visual Models From Natural Language Supervision”, -
https://arxiv.org/abs/2101.00027#eleutherai
: “The Pile: An 800GB Dataset of Diverse Text for Language Modeling”, -
https://arxiv.org/abs/2010.14571#google
: “Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus”, -
https://arxiv.org/abs/2009.03300
: “MMLU: Measuring Massive Multitask Language Understanding”, -
https://arxiv.org/abs/1911.05507#deepmind
: “Compressive Transformers for Long-Range Sequence Modeling”, -
https://paperswithcode.com/task/language-modelling
: “Language Modeling State-Of-The-Art Leaderboards”, -
https://arxiv.org/abs/1905.00537
: “SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems”, -
https://arxiv.org/abs/1808.01097
: “CurriculumNet: Weakly Supervised Learning from Large-Scale Web Images”, -
https://arxiv.org/abs/1808.01340#deepmind
: “A Short Note about Kinetics-600”, -
2018-sharma.pdf#google
: “Conceptual Captions: A Cleaned, Hypernymed, Image Alt-Text Dataset For Automatic Image Captioning”, -
https://arxiv.org/abs/1805.00932#facebook
: “Exploring the Limits of Weakly Supervised Pretraining”, -
https://arxiv.org/abs/1707.08819
: “A Downsampled Variant of ImageNet As an Alternative to the CIFAR Datasets”, -
https://arxiv.org/abs/1705.05640
: “WebVision Challenge: Visual Learning and Understanding With Web Data”, -
https://arxiv.org/abs/1704.05179
: “SearchQA: A New Q&A Dataset Augmented With Context from a Search Engine”, -
http://www.lrec-conf.org/proceedings/lrec2014/pdf/1097_Paper.pdf
: “N-Gram Counts and Language Models from the Common Crawl”, -
2011-torralba.pdf
: “Unbiased Look at Dataset Bias”,