Bibliography (7):

Contrastive Representation Learning: A Framework and Review
CoCa: Contrastive Captioners are Image-Text Foundation Models
A Short Note about Kinetics-600
UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild
HMDB: A Large Video Database for Human Motion Recognition (https://serre-lab.clps.brown.edu/wp-content/uploads/2013/10/Kuehne_etal_iccv11.pdf)
MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
Dense-Captioning Events in Videos