Bibliography (5):

  1. Attention Is All You Need

  2. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

  3. Language Models are Few-Shot Learners (GPT-3)