A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs
Accelerating Large Language Model Pretraining via LFR Pedagogy: Learn, Focus, and Review
DataComp-LM: In search of the next generation of training sets for language models
Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models
Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws
Autonomous Data Selection with Language Models for Mathematical Texts
Rephrasing the Web (WRAP): A Recipe for Compute and Data-Efficient Language Modeling
Bad Students Make Great Teachers: Active Learning Accelerates Large-Scale Visual Understanding
Does CLIP’s Generalization Performance Mainly Stem from High Train-Test Similarity?
SlimPajama-DC: Understanding Data Combinations for LLM Training
Anchor Points: Benchmarking Models with Much Fewer Examples
When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale
Beyond Scale: the Diversity Coefficient as a Data Quality Metric Demonstrates LLMs are Pre-trained on Formally Diverse Data
Data Selection for Language Models via Importance Resampling
Beyond neural scaling laws: beating power law scaling via data pruning
Unadversarial Examples: Designing Objects for Robust Vision
Generative Models are Unsupervised Predictors of Page Quality: A Colossal-Scale Study
Machine Teaching for Bayesian Learners in the Exponential Family
FineWeb: Decanting the Web for the Finest Text Data at Scale
[Figure 1, Rho-1 (Lin et al., 2024, arXiv:2404.07965): Rho-1 loss accelerates training of LLMs while reducing data use.]
[Figure 2, Rho-1 (Lin et al., 2024): Example of token-level cleaning of noisy internet text data.]
[Figure 3, Rho-1 (Lin et al., 2024): The four kinds of tokens: already learned, learnable, unlearnable, and worsening.]
[Figure 6, Rho-1 (Lin et al., 2024): Rho-1 training curves vs. downstream loss, split by token type.]
[Figure 8, Rho-1 (Lin et al., 2024): Perplexity of Rho-1-filtered tokens over the course of training.]
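The Rho-1 figures above describe selective language modeling: score each token by the gap between the training model's loss and a small reference model's loss, then backpropagate only through the highest-scoring ("learnable") tokens. Below is a minimal sketch of that scoring rule, not the authors' code; the tensor shapes and the `keep_ratio` parameter are illustrative assumptions.

```python
# Sketch of Rho-1-style selective token loss (assumed shapes, illustrative only).
import torch
import torch.nn.functional as F

def selective_lm_loss(logits, ref_logits, labels, keep_ratio=0.6):
    """Cross-entropy averaged over only the top `keep_ratio` fraction of
    tokens, ranked by per-token excess loss (training loss - reference loss)."""
    vocab = logits.size(-1)
    # Per-token losses, flattened over batch and sequence dimensions.
    loss = F.cross_entropy(logits.view(-1, vocab), labels.view(-1), reduction="none")
    with torch.no_grad():
        ref_loss = F.cross_entropy(ref_logits.view(-1, vocab), labels.view(-1),
                                   reduction="none")
        # High excess loss = token the model can still learn; low = noise
        # or already learned, so it is dropped from the objective.
        excess = loss - ref_loss
        k = max(1, int(keep_ratio * excess.numel()))
        keep = torch.zeros_like(excess)
        keep[excess.topk(k).indices] = 1.0  # mask selecting the top-k tokens
    return (loss * keep).sum() / keep.sum()

# Toy usage with random logits standing in for the two models.
B, T, V = 2, 16, 100
logits = torch.randn(B, T, V, requires_grad=True)  # training model
ref_logits = torch.randn(B, T, V)                  # small reference model
labels = torch.randint(0, V, (B, T))
selective_lm_loss(logits, ref_logits, labels).backward()
```

Because the mask is built under `torch.no_grad()`, token selection stays out of the gradient path; only the cross-entropy on the retained tokens is optimized.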