Bibliography (3):

MAE: Masked Autoencoders Are Scalable Vision Learners
Attention Is All You Need
https://github.com/facebookresearch/AudioMAE