Bibliography (4):

Language Models are Unsupervised Multitask Learners
Attention Is All You Need
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Wikipedia Bibliography:
1. Power law