RHO-LOSS: Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt
Selective Backprop: Accelerating Deep Learning by Focusing on the Biggest Losers
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
The Pile: An 800GB Dataset of Diverse Text for Language Modeling