Bibliography (4):

Attention Is All You Need
MAE: Masked Autoencoders Are Scalable Vision Learners
Contrastive Representation Learning: A Framework and Review
https://github.com/zinengtang/TVLT