Bibliography (15):

  1. When Do You Need Billions of Words of Pretraining Data?

  2. https://www.youtube.com/watch?v=iNhrW0Nt7zs

  3. EleutherAI/gpt-Neo: An Implementation of Model Parallel GPT-2 and GPT-3-Style Models Using the Mesh-Tensorflow Library.

  4. Language Models are Unsupervised Multitask Learners

  5. https://huggingface.co/datasets/roneneldan/TinyStories

  6. GPT-3: Language Models are Few-Shot Learners

  7. https://openai.com/index/gpt-4-research/

  8. Scaling Laws for Neural Language Models

  9. Chinchilla: Training Compute-Optimal Large Language Models

  10. https://arxiv.org/pdf/2305.07759.pdf#page=6&org=microsoft

  11. https://arxiv.org/pdf/2305.07759.pdf#page=7&org=microsoft