When Do You Need Billions of Words of Pretraining Data?
https://www.youtube.com/watch?v=iNhrW0Nt7zs
EleutherAI/gpt-neo: An implementation of model parallel GPT-2 and GPT-3-style models using the mesh-tensorflow library.
GPT-2: Language Models are Unsupervised Multitask Learners
https://huggingface.co/datasets/roneneldan/TinyStories
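A minimal sketch of loading the TinyStories dataset linked above, assuming the Hugging Face `datasets` library is installed (pip install datasets); the "text" field name is taken from the dataset card:

    # Load the TinyStories dataset from the Hugging Face Hub.
    from datasets import load_dataset

    dataset = load_dataset("roneneldan/TinyStories")
    print(dataset["train"][0]["text"])  # print the first story in the training split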
GPT-3: Language Models are Few-Shot Learners
https://openai.com/index/gpt-4-research/
Scaling Laws for Neural Language Models
Chinchilla: Training Compute-Optimal Large Language Models
TinyStories: How Small Can Language Models Be and Still Speak Coherent English? (arXiv:2305.07759, pp. 6-7)
https://arxiv.org/pdf/2305.07759.pdf#page=6
https://arxiv.org/pdf/2305.07759.pdf#page=7