Bibliography:

  1. ‘active learning’ tag

  2. Making Anime Faces With StyleGAN

  3. A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs

  4. Accelerating Large Language Model Pretraining via LFR Pedagogy: Learn, Focus, and Review

  5. Improving Pretraining Data Using Perplexity Correlations

  6. DataComp-LM: In search of the next generation of training sets for language models (https://arxiv.org/abs/2406.11794)

  7. Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models (https://arxiv.org/abs/2405.20541)

  8. Rho-1: Not All Tokens Are What You Need (https://arxiv.org/abs/2404.07965)

  9. Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws

  10. A Study in Dataset Pruning for Image Super-Resolution

  11. How to Train Data-Efficient LLMs

  12. Autonomous Data Selection with Language Models for Mathematical Texts (https://arxiv.org/abs/2402.07625)

  13. Rephrasing the Web (WRAP): A Recipe for Compute and Data-Efficient Language Modeling (https://arxiv.org/abs/2401.16380)

  14. Bad Students Make Great Teachers: Active Learning Accelerates Large-Scale Visual Understanding (https://arxiv.org/abs/2312.05328)

  15. Does CLIP’s Generalization Performance Mainly Stem from High Train-Test Similarity?

  16. Data Filtering Networks (https://arxiv.org/abs/2309.17425)

  17. SlimPajama-DC: Understanding Data Combinations for LLM Training (https://arxiv.org/abs/2309.10818)

  18. Anchor Points: Benchmarking Models with Much Fewer Examples

  19. When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale

  20. Beyond Scale: the Diversity Coefficient as a Data Quality Metric Demonstrates LLMs are Pre-trained on Formally Diverse Data

  21. Data Selection for Language Models via Importance Resampling (https://arxiv.org/abs/2302.03169)

  22. Beyond neural scaling laws: beating power law scaling via data pruning (https://arxiv.org/abs/2206.14486)

  23. Unadversarial Examples: Designing Objects for Robust Vision

  24. Generative Models are Unsupervised Predictors of Page Quality: A Colossal-Scale Study

  25. Dataset Distillation

  26. Machine Teaching for Bayesian Learners in the Exponential Family

  27. FineWeb: Decanting the Web for the Finest Text Data at Scale

  28. Design of This Website § future-tag-features

  29. Lin et al 2024, Figure 1: Rho-1 loss accelerates training of LLMs while reducing data use

  30. Lin et al 2024, Figure 2: example of token-level cleaning of noisy Internet text data

  31. Lin et al 2024, Figure 3: the four kinds of tokens: already learned, learnable, unlearnable, and worsening

  32. Lin et al 2024, Figure 6: Rho-1 training curves vs. downstream loss, split by token type

  33. Lin et al 2024, Figure 8: perplexity of Rho-1-filtered tokens over the course of training

  34. https://aclanthology.org/2023.findings-emnlp.18/

  35. https://github.com/Guang000/Awesome-Dataset-Distillation
