- When Do You Need Billions of Words of Pretraining Data?
- https://www.youtube.com/watch?v=iNhrW0Nt7zs
- EleutherAI/gpt-neo: An implementation of model parallel GPT-2 and GPT-3-style models using the mesh-tensorflow library.
- Language Models are Unsupervised Multitask Learners
- https://huggingface.co/datasets/roneneldan/TinyStories
- GPT-3: Language Models are Few-Shot Learners
- https://openai.com/index/gpt-4-research/
- Scaling Laws for Neural Language Models
- Chinchilla: Training Compute-Optimal Large Language Models
- https://arxiv.org/pdf/2305.07759.pdf#page=6&org=microsoft
- https://arxiv.org/pdf/2305.07759.pdf#page=7&org=microsoft