"TF-T2V: A Recipe for Scaling up Text-To-Video Generation With Text-Free Videos", Xiang Wang, Shiwei Zhang, Hangjie Yuan, Zhiwu Qing, Biao Gong, Yingya Zhang, Yujun Shen, Changxin Gao, Nong Sang, 2023-12-25:

Diffusion-based text-to-video generation has witnessed impressive progress in the past year yet still falls behind text-to-image generation. One of the key reasons is the limited scale of publicly available data (e.g., 10M video-text pairs in WebVid10M vs. 5B image-text pairs in LAION), given the high cost of video captioning. In contrast, it is far easier to collect unlabeled clips from video platforms like YouTube.

Motivated by this, we come up with a novel text-to-video generation framework, termed TF-T2V, which can directly learn from text-free videos. The rationale behind this design is to decouple the process of text decoding from that of temporal modeling. To this end, we employ a content branch and a motion branch, which are jointly optimized with shared weights.
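A minimal structural sketch of this decoupling, assuming the training alternates a text-conditioned content objective (on labeled data) with a text-free motion objective (on unlabeled clips) over the same shared weights. All class and function names here are illustrative assumptions, not the authors' actual code:

```python
# Hypothetical sketch of TF-T2V's two-branch joint training.
# Names (SharedBackbone, content_step, motion_step) are assumptions for
# illustration; the real model is a spatio-temporal diffusion UNet.

class SharedBackbone:
    """Stands in for the shared spatio-temporal UNet weights."""
    def __init__(self):
        self.updates_from = []  # record which branch touched the weights

    def step(self, branch):
        self.updates_from.append(branch)


def content_step(backbone, image_text_batch):
    # Content branch: learns text-conditioned appearance from labeled
    # image-text (or captioned keyframe) data; exercises spatial layers.
    backbone.step("content")


def motion_step(backbone, video_batch):
    # Motion branch: learns temporal dynamics from text-FREE videos,
    # so the text condition is dropped (e.g. an empty/null prompt).
    backbone.step("motion")


def train(backbone, image_text_data, text_free_videos, steps=6):
    # Joint optimization: alternate the two objectives over the SAME
    # shared weights, so motion learned without text still benefits
    # text-conditioned generation.
    for i in range(steps):
        if i % 2 == 0:
            content_step(backbone, image_text_data[i % len(image_text_data)])
        else:
            motion_step(backbone, text_free_videos[i % len(text_free_videos)])


backbone = SharedBackbone()
train(backbone, image_text_data=[("a cat", "img")], text_free_videos=["clip"])
```

The key point the sketch captures is that both objectives update one set of weights, which is what lets unlabeled video scale the temporal modeling.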

Following such a pipeline, we study the effect of doubling the scale of the training set (i.e., video-only WebVid10M) with randomly collected text-free videos and observe an encouraging performance improvement (FID 9.67 → 8.19 and FVD 484 → 441), demonstrating the scalability of our approach. We also find that our model enjoys a sustained performance gain (FID 8.19 → 7.64 and FVD 441 → 366) after reintroducing some text labels for training.

Finally, we validate the effectiveness and generalizability of our approach on both native text-to-video generation and compositional video synthesis paradigms.

Code and models will be publicly available at https://tf-t2v.github.io/.

Figure 9: Scaling trend under semi-supervised settings. In the experiment, labeled WebVid10M and text-free videos from Internal10M are leveraged.

Scaling trend under semi-supervised settings: In Figure 9, we vary the number of text-free videos and explore the scaling trend of TF-T2V under the semi-supervised setting. From the results, we observe that FVD (↓) gradually decreases as the number of text-free videos increases, revealing the strong scaling potential of TF-T2V.