“VideoCLIP: Contrastive Pre-Training for Zero-Shot Video-Text Understanding”, 2021-09-28:
We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks.
VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval.
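To make the objective concrete, below is a minimal sketch of a symmetric InfoNCE-style contrastive loss over a batch of paired video/text embeddings. The function name, tensor shapes, and temperature value are illustrative assumptions, not taken from the released code; the retrieved hard negatives are simply treated as additional rows of the batch rather than being constructed by the paper's retrieval procedure.

```python
# Sketch only: a symmetric video<->text contrastive (InfoNCE-style) loss.
# Assumes row i of video_emb and text_emb come from a temporally
# overlapping pair; all other rows act as negatives (in VideoCLIP these
# would include hard negatives found via nearest-neighbor retrieval).
import torch
import torch.nn.functional as F

def contrastive_loss(video_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    # L2-normalize so dot products are cosine similarities.
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # (batch, batch) similarity matrix; diagonal holds the positives.
    logits = video_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Average the video-to-text and text-to-video directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```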
Our experiments on a diverse series of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation, reveal state-of-the-art performance, surpassing prior work and in some cases even outperforming supervised approaches.
Code is made available on GitHub.