Bibliography (5):

VideoGPT: Video Generation using VQ-VAE and Transformers
VQ-GAN: Taming Transformers for High-Resolution Image Synthesis
UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild
https://www.youtube.com/watch?v=WZj7vW2mTJo
https://songweige.github.io/projects/tats/index.html