We present ImageBind, an approach to learn a joint embedding across 6 different modalities: images, text, audio, depth, thermal, and IMU data. We show that not all combinations of paired data are necessary to train such a joint embedding; image-paired data alone is sufficient to bind the modalities together.
ImageBind can leverage recent large-scale vision-language models [such as CLIP], extending their zero-shot capabilities to new modalities just by using their natural pairing with images. It enables novel emergent applications 'out of the box', including cross-modal retrieval, composing modalities with arithmetic, cross-modal detection, and generation.
The emergent capabilities improve with the strength of the image encoder and we set a new state-of-the-art on emergent zero-shot recognition tasks across modalities, outperforming specialist supervised models.
Finally, we show strong few-shot recognition results outperforming prior work, and that ImageBind serves as a new way to evaluate vision models for visual and non-visual tasks.
Figure 1: ImageBind’s joint embedding space enables novel multimodal capabilities. By aligning 6 modalities’ embeddings into a common space, ImageBind enables: (1) Cross-Modal Retrieval, which shows emergent alignment of modalities such as audio, depth, or text that aren’t observed together; (2) Adding embeddings from different modalities, which naturally composes their semantics; (3) Audio-to-Image generation, by using our audio embeddings with a pre-trained DALLE-2 decoder designed to work with CLIP text embeddings.
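Composition (2) is, at its core, vector addition of unit-normalized embeddings followed by nearest-neighbor retrieval under cosine similarity. A minimal sketch on toy 2-D vectors (the function names and embeddings are illustrative, not from the paper's code):

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def compose(emb_a, emb_b):
    """Compose the semantics of two modalities by adding their
    unit-normalized embeddings, then re-normalizing the sum."""
    summed = [a + b for a, b in zip(l2_normalize(emb_a), l2_normalize(emb_b))]
    return l2_normalize(summed)

def retrieve(query, gallery):
    """Return the index of the gallery embedding with the highest
    cosine similarity to the query."""
    q = l2_normalize(query)
    sims = [sum(a * b for a, b in zip(q, l2_normalize(g))) for g in gallery]
    return max(range(len(gallery)), key=lambda i: sims[i])

# Toy example: composing an "image" embedding with an "audio" embedding
# retrieves the gallery item that carries both semantics.
q = compose([1.0, 0.0], [0.0, 1.0])
best = retrieve(q, [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])  # -> 2
```

In the real system the gallery entries would be ImageBind embeddings of candidate images; the arithmetic itself is this simple because all modalities live in one normalized space.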
…A major obstacle in learning a true joint embedding is the absence of large quantities of multimodal data where all modalities are present together…We leverage the binding property of images: just aligning each modality’s embedding to image embeddings leads to an emergent alignment across all of the modalities. In practice, ImageBind leverages web-scale (image, text) paired data and combines it with naturally occurring paired data such as (video, audio), (image, depth), etc. to learn a single joint embedding space. This allows ImageBind to implicitly align the text embeddings to other modalities such as audio and depth, enabling zero-shot recognition capabilities on those modalities without explicit semantic or textual pairing. Moreover, we show that it can be initialized with large-scale vision-language models such as CLIP, thereby leveraging the rich image and text representations of these models. Thus, ImageBind can be applied to a variety of different modalities and tasks with little training.
We use large-scale image-text paired data along with naturally paired ‘self-supervised’ data across 4 new modalities—audio, depth, thermal, and Inertial Measurement Unit (IMU) readings—and show strong emergent zero-shot classification and retrieval performance on tasks for each of these modalities. These emergent properties improve as the underlying image representation is made stronger. On audio classification and retrieval benchmarks, ImageBind’s emergent zero-shot classification matches or outperforms specialist models trained with direct audio-text supervision on benchmarks like ESC, Clotho, & AudioCaps. ImageBind representations also outperform specialist supervised models on few-shot evaluation benchmarks. Finally, we show that ImageBind’s joint embeddings can be used for a wide variety of compositional tasks as illustrated in Figure 1, including cross-modal retrieval, combining embeddings via arithmetic, detecting audio sources in images, and generating images given audio input.
[Basically just an InfoNCE loss applied on every pair of modalities pairwise in a single embedding?]
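Close: per the method described above, each non-image modality is contrastively aligned to images, so an InfoNCE term is applied per (image, modality) pair rather than over every modality pair. A simplified, one-direction sketch on plain Python lists (the full loss is symmetric and batched; names here are illustrative):

```python
import math

def infonce_loss(image_embs, modality_embs, temperature=0.07):
    """Image-anchored InfoNCE: for a batch of paired (image, modality)
    samples, the i-th image should match the i-th modality sample
    against all other modality samples in the batch."""
    def normalize(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    imgs = [normalize(v) for v in image_embs]
    mods = [normalize(v) for v in modality_embs]
    loss = 0.0
    for i, img in enumerate(imgs):
        # cosine similarity of image i to every modality sample,
        # scaled by the temperature
        logits = [sum(a * b for a, b in zip(img, m)) / temperature
                  for m in mods]
        log_z = math.log(sum(math.exp(l) for l in logits))
        loss += -(logits[i] - log_z)  # cross-entropy with the true pair
    return loss / len(imgs)
```

Correctly paired batches should score a much lower loss than shuffled ones; the symmetric modality-to-image term would just swap the roles of the two embedding lists.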
Figure 5: Object detection with audio queries. Simply replacing Detic’s CLIP-based ‘class’ embeddings with our audio embeddings leads to an object detector promptable with audio. This requires no re-training of any model.
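The swap works because open-vocabulary detectors like Detic score each region by a dot product between region features and a matrix of class embeddings; replacing that matrix with audio embeddings from the same joint space requires no retraining. A generic sketch of the scoring step (not Detic's actual API; names are illustrative):

```python
def detection_scores(region_features, query_embeddings):
    """Score each detected region against each query embedding via
    dot products. In an open-vocabulary detector the queries are
    normally CLIP text/class embeddings; substituting ImageBind
    audio embeddings makes the detector promptable with audio."""
    return [[sum(r * q for r, q in zip(region, query))
             for query in query_embeddings]
            for region in region_features]

# One region feature scored against two (hypothetical) audio queries:
scores = detection_scores([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
```

The only requirement is that the query embeddings live in the same joint space the detector's classifier was built on.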
Figure 6: Scaling the image encoder size while keeping the other modality encoders’ sizes fixed. We measure performance on emergent zero-shot classification of the depth, audio, thermal, and IMU modalities. Scaling the image encoder substantially improves the zero-shot classification results, suggesting that a stronger visual representation improves the ‘binding’ of modalities.
…5.1. Scaling the Image Encoder: The central idea in ImageBind is aligning the embeddings of all modalities to image embeddings. Thus, the image embeddings play a central role in the emergent alignment of unseen modalities, and we study their effect on emergent zero-shot performance. We vary the size of the image encoder and train encoders for the depth, audio, etc. modalities to match the image representation. To isolate the effect of the image representation, we fix the size of the other modality encoders. We use the pretrained CLIP (ViT-B and ViT-L) and OpenCLIP (ViT-H) image and text encoders for this experiment. Our results in Figure 6 show that ImageBind’s emergent zero-shot performance on all modalities improves with better visual features. For depth and audio classification, the stronger ViT-H image encoder provides a gain of 7% and 4%, respectively, over ViT-B. Thus, stronger visual features can improve recognition performance even on non-visual modalities.