“VideoCLIP: Contrastive Pre-Training for Zero-Shot Video-Text Understanding”, Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, Christoph Feichtenhofer, 2021-09-28:

We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks.

VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval.
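The core of this objective is a symmetric contrastive (InfoNCE-style) loss between video and text embeddings. As a minimal sketch, the snippet below implements that symmetric loss with in-batch negatives only; the paper's full method additionally samples temporally overlapping positives and retrieval-based hard negatives, which are omitted here. All function and variable names are hypothetical, not from the released code.

```python
import numpy as np

def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise log-softmax, numerically stabilized."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def video_text_nce(video_emb: np.ndarray, text_emb: np.ndarray,
                   temperature: float = 0.07) -> float:
    """Symmetric InfoNCE loss: row i of video_emb and row i of text_emb
    are a positive pair; all other rows in the batch act as negatives."""
    # L2-normalize so the dot product is cosine similarity.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(len(v))
    # Cross-entropy in both directions: video->text and text->video.
    loss_v2t = -log_softmax(logits)[idx, idx].mean()
    loss_t2v = -log_softmax(logits.T)[idx, idx].mean()
    return (loss_v2t + loss_t2v) / 2
```

Perfectly aligned pairs (each video embedding equal to its text embedding, and far from the others) drive the loss toward zero, while misaligned pairs raise it; hard negatives would simply make the off-diagonal entries of `logits` larger and the task harder.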

Our experiments on a diverse series of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation, reveal state-of-the-art performance, surpassing prior work and in some cases even outperforming supervised approaches.

Code is made available on GitHub.