- GPT-3: Language Models are Few-Shot Learners
- Attention Is All You Need
- RoBERTa: A Robustly Optimized BERT Pretraining Approach
- DeBERTa: Decoding-enhanced BERT with Disentangled Attention
- Language Models are Unsupervised Multitask Learners
- https://github.com/microsoft/LoRA