Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?
BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games
HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems
SimpleStrat: Diversifying Language Model Generation with Stratification
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making
Seeing Faces in Things: A Model and Dataset for Pareidolia
H-ARC: A Robust Estimate of Human Performance on the Abstraction and Reasoning Corpus Benchmark
How to Evaluate Jailbreak Methods: A Case Study with the StrongREJECT Benchmark
To Code, or Not To Code? Exploring Impact of Code in Pre-training
Tails Tell Tales: Chapter-Wide Manga Transcriptions with Character Names
ImagiNet: A Multi-Content Dataset for Generalizable Synthetic Image Detection via Contrastive Learning
Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs
Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs
Sonnet or Not, Bot? Poetry Evaluation for Large Models and Datasets
APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets
Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?
OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI
DataComp-LM: In search of the next generation of training sets for language models
GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents
Newswire: A Large-Scale Structured Database of a Century of Historical News
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
LLMs achieve adult human performance on higher-order theory of mind tasks
DeTikZify: Synthesizing Graphics Programs for Scientific Figures and Sketches with TikZ
Can Language Models Explain Their Own Classification Behavior?
Special Characters Attack: Toward Scalable Training Data Extraction From Large Language Models
GSM1k: A Careful Examination of Large Language Model Performance on Grade School Arithmetic
Building a Large Japanese Web Corpus for Large Language Models
CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs’ (Lack of) Multicultural Knowledge
VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?
Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators
How Tech Giants Cut Corners to Harvest Data for AI: OpenAI, Google and Meta ignored corporate policies, altered their own rules and discussed skirting copyright law as they sought online information to train their newest artificial intelligence systems
Vulnerability Detection with Code Language Models: How Far Are We?
COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning
RewardBench: Evaluating Reward Models for Language Modeling
Evaluating Text to Image Synthesis: Survey and Taxonomy of Image Quality Metrics
Hierarchical Feature Warping and Blending for Talking Head Animation
Mastering Text, Code and Math Simultaneously via Fusing Highly Specialized Language Models
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
Investigating Continual Pretraining in Large Language Models: Insights and Implications
Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models
ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs
DE-COP: Detecting Copyrighted Content in Language Models Training Data
I Think, Therefore I am: Benchmarking Awareness of Large Language Models Using AwareBench
I am a Strange Dataset: Metalinguistic Tests for Language Models
Generative AI for Math: Part I—MathPile: A Billion-Token-Scale Pretraining Corpus for Math
WaveCoder: Widespread And Versatile Enhanced Instruction Tuning with Refined Data Generation
Large Language Models Play StarCraft II: Benchmarks and A Chain of Summarization Approach
StarVector: Generating Scalable Vector Graphics Code from Images
TinyGSM: achieving >80% on GSM8k with small language models
EQ-Bench: An Emotional Intelligence Benchmark for Large Language Models
Retrieving Conditions from Reference Images for Diffusion Models
Sequential Modeling Enables Scalable Learning for Large Vision Models
Efficient Transformer Knowledge Distillation: A Performance Review
Dazed & Confused: A Large-Scale Real-World User Study of reCAPTCHAv2
Instruction-Following Evaluation for Large Language Models
In Search of the Long-Tail: Systematic Generation of Long-Tail Inferential Knowledge via Logical Rule Guided Search
CommonCanvas: An Open Diffusion Model Trained with Creative-Commons Images
FANToM: A Benchmark for Stress-testing Machine Theory of Mind in Interactions
MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning
Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition
From Scarcity to Efficiency: Improving CLIP Training via Visual-enriched Captions
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text
FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation
UltraFeedback: Boosting Language Models with High-quality Feedback
MTOB: A Benchmark for Learning to Translate a New Language from One Grammar Book
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models
LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models
SlimPajama-DC: Understanding Data Combinations for LLM Training
MADLAD-400: A Multilingual And Document-Level Large Audited Dataset
From Sparse to Dense: GPT-4 Summarization with Chain of Density (CoD) Prompting
FIMO: A Challenge Formal Dataset for Automated Theorem Proving
American Stories: A Large-Scale Structured Text Dataset of Historical U.S. Newspapers
LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models
The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain
Android in the Wild: A Large-Scale Dataset for Android Device Control
DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection for Conversational AI
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
Instruction Mining: High-Quality Instruction Data Selection for Large Language Models
HEADLINES: A Massive Scale Semantic Similarity Dataset of Historical English
LeanDojo: Theorem Proving with Retrieval-Augmented Language Models
SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality
ARIES: A Corpus of Scientific Paper Edits Made in Response to Peer Reviews
Understanding Social Reasoning in Language Models with Language Models
OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents
AI Is a Lot of Work: As the technology becomes ubiquitous, a vast tasker underclass is emerging—and not going anywhere
Anime Character Identification and Tag Prediction by Multimodality Modeling: Dataset and Model
Artificial Artificial Artificial Intelligence: Crowd Workers Widely Use Large Language Models for Text Production Tasks
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
WikiChat: Stopping the Hallucination of Large Language Model Chatbots by Few-Shot Grounding on Wikipedia
SeeGULL: A Stereotype Benchmark with Broad Geo-Cultural Coverage Leveraging Generative Models
C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models
TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation
LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions
Multi-Party Chat (MultiLIGHT): Conversational Agents in Group Settings with Humans and Models
ImageNet-Hard: The Hardest Images Remaining from a Study of the Power of Zoom and Spatial Biases in Image Classification
Parsing-Conditioned Anime Translation: A New Dataset and Method
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
How well do Large Language Models perform in Arithmetic tasks?
The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset
Large Language Models Are State-of-the-Art Evaluators of Translation Quality
Data Selection for Language Models via Importance Resampling
Off-the-Grid MARL (OG-MARL): Datasets with Baselines for Offline Multi-Agent Reinforcement Learning
The BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus
Interactive-Chain-Prompting (INTERCPT): Ambiguity Resolution for Crosslingual Conditional Generation with Interaction
How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection
Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor
A Whack-A-Mole Dilemma: Shortcuts Come in Multiples Where Mitigating One Amplifies Others
Text Embeddings by Weakly-Supervised Contrastive Pre-training
UniSumm: Unified Few-shot Summarization with Multi-Task Pre-Training and Prefix-Tuning
A Creative Industry Image Generation Dataset Based on Captions
AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities
AnimeRun: 2D Animation Visual Correspondence from Open Source 3D Movies
MMDialog: A Large-scale Multi-turn Dialogue Dataset Towards Multi-modal Open-domain Conversation
BLOOMZ/mT0: Crosslingual Generalization through Multitask Finetuning
Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning
CARP: Robust Preference Learning for Storytelling via Contrastive Reinforcement Learning
Most Language Models can be Poets too: An AI Writing Assistant and Constrained Text Generation Studio
Self-Ask: Measuring and Narrowing the Compositionality Gap in Language Models (Bamboogle)
Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Discovering Bugs in Vision Models using Off-the-shelf Image Generation and Captioning
Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP
Learning to Generalize with Object-centric Agents in the Open World Survival Game Crafter
Why do tree-based models still outperform deep learning on tabular data?
Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset
Dataset Condensation via Efficient Synthetic-Data Parameterization
Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions
FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech
InstructDial: Improving Zero and Few-shot Generalization in Dialogue through Instruction Tuning
Housekeep: Tidying Virtual Households using Commonsense Reasoning
Instruction Induction: From Few Examples to Natural Language Task Descriptions
Down and Across: Introducing Crossword-Solving as a New NLP Benchmark
SymphonyNet: Symphony Generation with Permutation Invariant Language Model
Building Machine Translation Systems for the Next Thousand Languages
When does dough become a bagel? Analyzing the remaining mistakes on ImageNet
Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP)
Tk-Instruct: Benchmarking Generalization via In-Context Instructions on 1,600+ Language Tasks
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality
ByT5 model for massively multilingual grapheme-to-phoneme conversion
CLIP meets GamePhysics: Towards bug identification in gameplay videos using zero-shot transfer learning
Bamboo: Building Mega-Scale Vision Dataset Continually with Human-Machine Synergy
Self-Distilled StyleGAN: Towards Generation from Internet Photos
Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework
DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers
PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
SWAG: Revisiting Weakly Supervised Pre-Training of Visual Perception Models
CoAuthor: Designing a Human-AI Collaborative Writing Dataset for Exploring Language Model Capabilities
WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation
SynthBio: A Case Study in Faster Curation of Text Datasets
BigDatasetGAN: Synthesizing ImageNet with Pixel-wise Annotations
ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation
A Fistful of Words: Learning Transferable Visual Models from Bag-of-Words Supervision
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
Models in the Loop: Aiding Crowdworkers with Generative Annotation Assistants
WebGPT: Browser-assisted question-answering with human feedback
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions
BASIC: Combined Scaling for Open-Vocabulary Image Classification
Solving Probability and Statistics Problems by Program Synthesis
Few-Shot Self-Rationalization with Natural Language Prompts
AnimeCeleb: Large-Scale Animation CelebHeads Dataset for Head Reenactment
RLDS: an Ecosystem to Generate, Share and Use Datasets in Reinforcement Learning
An Explanation of In-context Learning as Implicit Bayesian Inference
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs
A connectome of the Drosophila central complex reveals network motifs suitable for flexible navigation and context-dependent action selection
HTCN: Harmonious Text Colorization Network for Visual-Textual Presentation Design
T0: Multitask Prompted Training Enables Zero-Shot Task Generalization
Situated Dialogue Learning through Procedural Environment Generation
MiniHack the Planet: A Sandbox for Open-Ended Reinforcement Learning Research
MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics
Transfer Learning for Pose Estimation of Illustrated Characters
MuSiQue: Multi-hop Questions via Single-hop Question Composition
QASPER: A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers
XLM-T: Multilingual Language Models in Twitter for Sentiment Analysis and Beyond
BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models
SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network
Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks
NaturalProofs: Mathematical Theorem Proving in Natural Language
Get Your Vitamin C! Robust Fact Verification with Contrastive Evidence (VitaminC)
Are NLP Models really able to Solve Simple Math Word Problems?
Measuring Mathematical Problem Solving With the MATH Dataset
WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning
A massive 7T fMRI dataset to bridge cognitive and computational neuroscience
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
Mind the Gap: Assessing Temporal Generalization in Neural Language Models § Scaling
Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning
MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
CLIP: Learning Transferable Visual Models From Natural Language Supervision
CLIP: Connecting Text and Images: We’re introducing a neural network called CLIP which efficiently learns visual concepts from natural language supervision. CLIP can be applied to any visual classification benchmark by simply providing the names of the visual categories to be recognized, similar to the ‘zero-shot’ capabilities of GPT-2 and GPT-3
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Selective Eye-gaze Augmentation To Enhance Imitation Learning In Atari Games
MoGaze: A Dataset of Full-Body Motions that Includes Workspace Geometry and Eye-Gaze
End-to-End Chinese Landscape Painting Creation Using Generative Adversarial Networks
Hypersim: A Photorealistic Synthetic Dataset for Holistic Indoor Scene Understanding
Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps
Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus
Open-Domain Question Answering Goes Conversational via Question Rewriting
A C/C++ Code Vulnerability Dataset with Code Changes and CVE Summaries
Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing
CoVoST 2 and Massively Multilingual Speech-to-Text Translation
The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization
ForecastQA: A Question Answering Challenge for Event Forecasting with Temporal Text Data
D4RL: Datasets for Deep Data-Driven Reinforcement Learning
TyDiQA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages
SAYCam: A large, longitudinal audiovisual dataset recorded from the infant’s perspective
Measuring Compositional Generalization: A Comprehensive Method on Realistic Data
Libri-Light: A Benchmark for ASR with Limited or No Supervision
SimpleBooks: Long-term dependency book dataset with simplified English vocabulary for word-level language modeling
How Machine Learning Can Help Unlock the World of Ancient Japan
CommonGen: A Constrained Text Generation Challenge for Generative Commonsense Reasoning
CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data
T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Restoring ancient text using deep learning (Pythia): a case study on Greek epigraphy
CATER: A diagnostic dataset for Compositional Actions and TEmporal Reasoning
PubMedQA: A Dataset for Biomedical Research Question Answering
ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models
LVIS: A Dataset for Large Vocabulary Instance Segmentation
Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
A large single-participant fMRI dataset for probing brain responses to naturalistic stimuli in space and time
OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge
ImageNet-Sketch: Learning Robust Global Representations by Penalizing Local Predictive Power
SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
The MineRL 2019 Competition on Sample Efficient Reinforcement Learning using Human Priors
ProductNet: a Collection of High-Quality Datasets for Product Representation Learning
Benchmarking Neural Network Robustness to Common Corruptions and Perturbations
Atari-HEAD: Atari Human Eye-Tracking and Demonstration Dataset
LIGHT: Learning to Speak and Act in a Fantasy Text Adventure Game
DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs
A Replication Study: Machine Learning Models Are Capable of Predicting Sexual Orientation From Facial Images
Do We Train on Test Data? Purging CIFAR of Near-Duplicates
The RobotriX: An eXtremely Photorealistic and Very-Large-Scale Indoor Dataset of Sequences with Robot Trajectories and Interactions
Natural Questions: A Benchmark for Question Answering Research
A Style-Based Generator Architecture for Generative Adversarial Networks
ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness
CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge
The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale
HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering
Don’t Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization
CurriculumNet: Weakly Supervised Learning from Large-Scale Web Images
Benchmarking Neural Network Robustness to Common Corruptions and Surface Variations
Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning
Know What You Don’t Know: Unanswerable Questions for SQuAD
BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning
Newsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
FEVER: a large-scale dataset for Fact Extraction and VERification
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
SCUT-FBP5500: A Diverse Benchmark Dataset for Multi-Paradigm Facial Beauty Prediction
11K Hands: Gender recognition and biometric identification using a large dataset of hand images
Progressive Growing of GANs for Improved Quality, Stability, and Variation
WebVision Database: Visual Learning and Understanding from Web Data
A Downsampled Variant of ImageNet as an Alternative to the CIFAR datasets
Revisiting Unreasonable Effectiveness of Data in Deep Learning Era
Driver Identification Using Automobile Sensor Data from a Single Turn
StreetStyle: Exploring world-wide clothing styles from millions of photos
WebVision Challenge: Visual Learning and Understanding With Web Data
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
BAM! The Behance Artistic Media Dataset for Recognition Beyond Photography
SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine
RACE: Large-scale ReAding Comprehension Dataset From Examinations
MS MARCO: A Human Generated MAchine Reading COmprehension Dataset
Deep Learning the City: Quantifying Urban Perception At A Global Scale
The LAMBADA dataset: Word prediction requiring a broad discourse context
SQuAD: 100,000+ Questions for Machine Comprehension of Text
Amazon Reviews: Image-based Recommendations on Styles and Substitutes
LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop
UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild
Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments
Building a Large Annotated Corpus of English: The Penn Treebank
DataGemma: AI Open Models Connecting LLMs to Google’s Data Commons
Scale AI Secures $1B Funding at $14B Valuation As Its CEO Predicts Big Revenue Growth and Profitability by Year-End [On Very High Quality Data]
No Robots: Look Ma, an instruction dataset that wasn’t generated by GPTs!
FineWeb: Decanting the Web for the Finest Text Data at Scale
Solving Math Word Problems: We’ve Trained a System That Solves Grade School Math Problems With Nearly Twice the Accuracy of a Fine-Tuned GPT-3 Model. It Solves about 90% As Many Problems As Real Kids: a Small Sample of 9-12 Year Olds Scored 60% on a Test from Our Dataset, While Our System Scored 55% on Those Same Problems. This Is Important Because Today’s AI Is Still Quite Weak at Commonsense Multistep Reasoning, Which Is Easy Even for Grade School Kids. We Achieved These Results by Training Our Model to Recognize Its Mistakes, so That It Can Try Repeatedly Until It Finds a Solution That Works
http://cl-informatik.uibk.ac.at/cek/holstep/ckfccs-holstep-submitted.pdf
https://ai.facebook.com/research/publications/ego4d-unscripted-first-person-video-from-around-the-world-and-a-benchmark-suite-for-egocentric-perception
https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=12d941c445ec477501f78b15dcf84f98173121cf
https://karpathy.github.io/2011/04/27/manually-classifying-cifar10/
https://openaccess.thecvf.com/content_cvpr_2014/papers/Andriluka_2D_Human_Pose_2014_CVPR_paper.pdf
https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37648.pdf
https://www.bloomberg.com/news/features/2023-04-24/a-high-school-teacher-s-free-image-database-powers-ai-unicorns
https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajamacr
Jonathan Frankle—Chief Neural Network Scientist at Databricks
Wikipedia Bibliography: