“Learning from Videos to Understand the World”, Geoffrey Zweig, Polina Kuznetsova, Michael Auli, Francois Fagan (2021-03-12):

…Although we’ve just scratched the surface, using semi-supervised and self-supervised learning on the videos uploaded to Facebook has already improved our computer vision and speech recognition systems. Within six months of developing Generalized Data Transformations (GDT), a state-of-the-art, self-supervised framework for video understanding, we’ve built and deployed an AI model in Instagram Reels’ recommendation system. And this is just the beginning of our Learning from Videos project. Early experiments in applying self-supervised learning to real-world videos also show a 20% reduction in speech recognition errors, which could improve a wide range of applications like auto-captioning and tasks that help flag harmful content like hate speech. And we’re researching ways to apply new capabilities, like multimodal video retrieval, in order to make it easier for people to surface key moments in time from their trove of digital memories.

Improving Reels recommendations with self-supervision: Finding similar Reels fits particularly well with self-supervised models because Reels tend to be highly stylized, featuring common patterns across trendy videos. Popular videos often consist of the same music set to the same dance moves, but created and performed by different people. Self-supervised models automatically learn these “themes”, group them together, and implicitly make them available to the recommendation system. We’re using self-supervision to suggest videos that are relevant to recently watched videos while filtering out near-duplicates, all without explicit training labels for each classification task. To achieve this, we leveraged Generalized Data Transformations (GDT), our state-of-the-art method for building video embeddings, which systematically learns the relationships between the sound and the images in a video. Since building this technology last year, we’ve pioneered the large-scale application of GDT to the representation of Reels data by training a series of models on a data set of millions of Reels and videos from Instagram…We ran the model in production and made its output available in real time to the ranking system. Using this approach, we were able to run online A/B tests that showed positive results.
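The retrieval step described above, nearest-neighbor lookup over learned video embeddings with near-duplicates filtered out, can be sketched with the FAISS similarity-search library. Everything below (the embedding dimension, the random vectors standing in for GDT embeddings, and the duplicate-similarity cutoff) is an illustrative assumption, not Facebook’s production system:

```python
import faiss
import numpy as np

# Illustrative stand-ins: in production these would be GDT video embeddings.
dim, n_videos = 128, 10_000
emb = np.random.randn(n_videos, dim).astype("float32")
faiss.normalize_L2(emb)            # unit-length vectors: inner product == cosine

index = faiss.IndexFlatIP(dim)     # exact inner-product (cosine) search
index.add(emb)

def recommend(query_vec, k=20, dup_cutoff=0.95):
    """Return (video_id, similarity) pairs for the k nearest neighbors,
    dropping near-duplicates. dup_cutoff is a made-up threshold: similarity
    above it is treated as a re-upload rather than a shared theme."""
    q = np.array(query_vec, dtype="float32").reshape(1, -1)
    faiss.normalize_L2(q)
    scores, ids = index.search(q, k)
    return [(int(i), float(s)) for s, i in zip(scores[0], ids[0])
            if s < dup_cutoff]

recs = recommend(emb[0])  # neighbors of video 0
```

Because the query video’s own embedding comes back with similarity 1.0, the duplicate cutoff also removes it from its own recommendations.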

Better speech recognition for more languages and domains: Recently, speech models have been able to learn the structure of language mostly from raw speech data, and to improve on traditional supervised methods. Our latest technique for learning speech representations, called wav2vec 2.0, works by first masking a portion of the speech and then learning to predict the masked speech units. To give an idea of the speed of progress: wav2vec 2.0 combined with self-training requires only 10 minutes of transcribed audio to achieve very good speech recognition results on the LibriSpeech industry benchmark. The same results required nearly 1,000 hours of transcribed audio just one year ago.
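Wav2vec 2.0 and its LibriSpeech-trained checkpoints were released openly and can be run through the Hugging Face transformers library. A minimal transcription sketch, assuming a 16 kHz mono WAV file at a placeholder path:

```python
import soundfile as sf
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Public wav2vec 2.0 checkpoint fine-tuned on 960 hours of LibriSpeech.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

audio, rate = sf.read("sample.wav")   # placeholder file; expected 16 kHz mono
inputs = processor(audio, sampling_rate=rate, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits   # (1, frames, vocab)

ids = torch.argmax(logits, dim=-1)               # greedy CTC decoding
print(processor.batch_decode(ids)[0])
```

The pretraining itself masks spans of the latent speech representation and trains the model to identify the correct quantized unit for each masked position among distractors; the snippet above only exercises the fine-tuned recognizer.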

…To test the method on real-world data, we applied wav2vec 2.0 to millions of hours of unlabeled videos plus just 100 hours of labeled data. Compared with a supervised-only baseline trained on the same 100 hours, we achieved a strong improvement of about 20% relative word error reduction. This demonstrates, for the first time, that self-supervised learning with wav2vec 2.0 is effective on real-world data sets that are not as curated as the LibriSpeech corpus used in the original paper. The video data we trained wav2vec on is highly varied, and we found that wav2vec performs particularly well for subdomains and accents where little labeled data exists.
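To make the “20% relative” figure concrete: word error rate (WER) is word-level edit distance divided by reference length, and a relative reduction compares the error rates themselves. A minimal sketch, with the two rates at the end invented purely for illustration:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edits turning the first i reference words into the first j hypothesis words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i - 1][j - 1] + (r[i - 1] != h[j - 1]),  # substitution
                          d[i - 1][j] + 1,                           # deletion
                          d[i][j - 1] + 1)                           # insertion
    return d[len(r)][len(h)] / max(len(r), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167

# A "20% relative" reduction shrinks the error rate itself by a fifth (invented numbers):
baseline, improved = 0.150, 0.120
print((baseline - improved) / baseline)  # 0.20
```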

As a next step, we’re now working on scaling wav2vec 2.0 to more data and more languages. These models will reduce the labeling needed for new automatic speech recognition domains (e.g., AR glasses and virtual gaming), improve the performance of low-resource and medium-resource models, and improve other speech and audio tasks. As part of these efforts, we’re currently training a multilingual model on millions of hours of speech from 25 languages.

Jointly learning video, audio, text to recall digital memories: …Recent self-supervised learning advances have made it possible to create a joint representation of audio, visual, and textual signals in a single vector space. As part of our latest research efforts, we are using the combination of Facebook videos and their associated text (titles, captions, descriptions) as the key lever for multimodal understanding…We’ve previously achieved this for images rather than videos, using billions of public images and thousands of hashtags…In this research model, we extract a visual clip (a short sequence of visual frames) from the video every second. Our system analyzes each clip with a convolutional neural network (CNN) to produce a vector of numbers representing the information in that clip. This information is then aggregated across time, both with another CNN and with an attention model, yielding an overall representation of the information in the visual part of the video. We follow a similar process with audio…As a next step, we’re now working on scaling this system up to millions of videos before we can start testing the feature in production.
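The aggregation step described here, per-second clip vectors collapsed into one video vector by an attention model, can be sketched in PyTorch. The dimensions and the random tensors standing in for CNN features are placeholders; the post does not specify the actual architecture beyond this description:

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Collapse a sequence of per-clip features into one video embedding
    using learned attention weights over time."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # one relevance score per clip

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (T, dim), one row per one-second visual clip
        w = torch.softmax(self.score(clips), dim=0)   # (T, 1) weights summing to 1
        return (w * clips).sum(dim=0)                 # (dim,) weighted average

# Placeholder features for a 30-second video; in the real system each row
# would come from the per-clip CNN, and an audio tower and text encoder
# would be trained to map into the same vector space.
T, dim = 30, 512
clip_feats = torch.randn(T, dim)
video_vec = AttentionPool(dim)(clip_feats)
print(video_vec.shape)   # torch.Size([512])
```

Retrieval of key moments then reduces to nearest-neighbor search between a text query’s embedding and these video embeddings.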

…Our Learning from Videos project signals a paradigm shift in the way machines understand videos, putting us on the path to building smarter AI systems. This work will allow us to move away from AI that requires people to look at and label videos by hand, and will make it possible to build AI systems that use the most advanced techniques, such as self-supervision, to improve recommendations, search, retrieval, and other important applications for everyone on Facebook. As our systems continuously learn, they will become more reliable, efficient, and personalized, so that sharing and rediscovering moments can one day be effortless. We are excited to continue our research in this space as we share more of our findings and work to productionize cutting-edge AI research that improves our core technology systems, unlocking new experiences for the billions of people around the world who use our products and services every day.