Beating OpenAI CLIP with 100x less data and compute

Efficient pre-training of Vision-Language transformers for Semantic Search


Hello from the Unum AI team! For the last year, we have been quietly pre-training numerous multi-modal models for semantic search. Now we are bundling them with our database and releasing several extremely performant checkpoints on the HuggingFace portal!


So how did we do it? Isn't training models like OpenAI CLIP reserved for Google-scale companies? We don't think so, and today we will dissect how to train a Vision-Language transformer that is more accurate than the multilingual mCLIP and also runs much faster!

Semantic Search Vertical

Semantic search is built from three pieces:

  • an embeddings service, used to vectorize the content and the queries,
  • a nearest-neighbors index, used to search for the closest vectors,
  • a persistent database, used to store and retrieve the data.

Countless companies are rushing into the space. Similar to how “Postgres on RocksDB” was the hottest topic of 2021, “Future of Search” is the theme of 2023.

| Product | Providers |
| --- | --- |
| Embeddings service | OpenAI, Co:here, HuggingFace, Unum UForm |
| Nearest-Neighbors index | Elastic, Algolia, Pinecone, Unum USearch |
| Persistent database | MongoDB, CockroachDB, Neo4J, Unum UStore |

In this post, we will not bore you with advances in Computational Geometry applied to indexing or the advanced Linux kernel bypass techniques we have spent years designing.

[Figure: the semantic-search pipeline. A Query is vectorized by the Embeddings Service, the Vector Search Index finds its nearest neighbors, the Persistent Database retrieves their contents, and after reranking the Results are returned.]

This post focuses only on the embeddings, the representation-learning part. And since multi-modal representations started with CLIP, so will we!

CLIP Fundamentals

“Contrastive Language-Image Pre-training” aims to align the vector representations (embeddings) of a text encoder and an image encoder by training them with a contrastive loss function.

  • A BERT model is often used for the text encoder.
  • A ViT is used for images; older versions relied on ResNets.

The original plan was (a toy sketch of the pairing step follows the list):

  1. Take the CommonCrawl dataset,
  2. Extract 400 Million img tags from HTML pages,
  3. Pair each image with 128 tokens of text around it.
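As a rough illustration of that pairing step only, here is a toy sketch assuming BeautifulSoup is available; it is not OpenAI's actual pipeline, and the 128-token truncation is approximated with whitespace-separated tokens.

# Toy sketch: pair <img> tags with nearby text (not OpenAI's real pipeline).
from bs4 import BeautifulSoup

html = '<p>A red panda resting on a branch.</p><img src="panda.jpg" alt="red panda">'
soup = BeautifulSoup(html, 'html.parser')

pairs = []
for img in soup.find_all('img'):
    # Combine the alt text with the closest preceding text node.
    context = ' '.join(filter(None, [img.get('alt', ''), img.find_previous(string=True) or '']))
    # Keep roughly 128 whitespace-separated tokens around the image.
    caption = ' '.join(context.split()[:128])
    pairs.append((img.get('src'), caption))

print(pairs)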

The Contrastive Loss itself is also simple; to experienced ML/CS practitioners, it may be reminiscent of force-based graph-drawing algorithms. The recipe, with a minimal code sketch right after the list:

  1. Sample a random batch of text and image pairs.
  2. Build a similarity matrix between vectors produced by BERT and ViT.
  3. Compare it against the identity matrix, i.e. ground truth.
  4. Use the difference as the loss.
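Here is a minimal PyTorch sketch of that recipe. The tensor names and the fixed temperature are our simplifications, not the exact CLIP implementation.

import torch
import torch.nn.functional as F

def contrastive_loss(image_embeddings: torch.Tensor,
                     text_embeddings: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    # Pairwise similarity matrix between every image and every text in the batch.
    logits = image_embeddings @ text_embeddings.T / temperature
    # Ground truth: the i-th image matches the i-th text, i.e. the identity matrix.
    targets = torch.arange(logits.shape[0], device=logits.device)
    # Symmetric cross-entropy: image-to-text over rows, text-to-image over columns.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.T, targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random, L2-normalized vectors standing in for BERT and ViT outputs.
batch_size, dim = 32, 256
image_embeddings = F.normalize(torch.randn(batch_size, dim), dim=-1)
text_embeddings = F.normalize(torch.randn(batch_size, dim), dim=-1)
print(contrastive_loss(image_embeddings, text_embeddings))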

[Figure from the original CLIP paper.]

The ablation studies from the original CLIP paper suggest that bigger global batch sizes with significantly more negative samples lead to better models. Still, growing the global batch size means allocating more GPUs.

  • The original CLIP was trained on hundreds of Nvidia V100 GPUs.
  • The latest Open_CLIP models were trained on 1024 GPUs.

So where is the catch? How can one compete in Vision-Language Pre-training (VLP) with 100x fewer GPUs?

Data-efficient Approaches

TLDR: if you want to be efficient, you need more pre-training tasks, often cross-modal ones, and higher-quality data.

ALBEF by Salesforce

[Figure from the original ALBEF paper.]

ALBEF was trained on two datasets, with 4 and 14 million text-image pairs. The smaller version takes a day to train on 8 GPUs. Unlike CLIP, data quality was prioritized over quantity. Here is a list of its innovations:

  • A multimodal encoder applies cross-attention between the text-encoder embeddings and the output features of the image encoder.
  • An Image-Text Matching (ITM) loss, based on the predicted probability that the image matches the text.
  • A Masked Language Modeling (MLM) task stacked on top of the multimodal encoder.

We can use a multimodal encoder as a re-ranker during inference!
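In practice, that means a two-stage pipeline: shortlist candidates cheaply by embedding similarity, then re-score only the shortlist with the heavier multimodal encoder. Below is a toy sketch of the idea; multimodal_match_score is a stand-in for ALBEF's cross-attention encoder with its ITM head, not the real implementation.

import torch
import torch.nn.functional as F

def multimodal_match_score(query_embedding: torch.Tensor,
                           candidate_embeddings: torch.Tensor) -> torch.Tensor:
    # Stand-in for the multimodal encoder + ITM head; here just a dot product.
    return (query_embedding * candidate_embeddings).sum(-1)

# Toy index: 10,000 pre-computed, normalized image embeddings.
index = F.normalize(torch.randn(10_000, 256), dim=-1)
query = F.normalize(torch.randn(256), dim=-1)

# Stage 1: cheap shortlist by cosine similarity (what the vector index does).
shortlist = torch.topk(index @ query, k=100).indices

# Stage 2: expensive re-ranking of the shortlist only.
scores = multimodal_match_score(query, index[shortlist])
reranked = shortlist[scores.argsort(descending=True)]
print(reranked[:10])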

ViCHA

[Figure from the original ViCHA paper.]

For ViCHA pre-training, the authors used datasets of 1.1 million and 800 thousand pairs. On 4 GPUs, training takes less than a day.

Innovations:

  • Hierarchical Image-Text Contrastive (H-ITC) alignment compares representations across layers.
  • Visual Concepts Extraction (VCE) uses the Stanford Scene Graph Parser to extract something akin to object detections from the attached textual captions.
  • A self-supervised Masked Image Modeling objective on top of the image encoder (a toy illustration of patch masking follows).
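As a toy illustration of the patch masking behind Masked Image Modeling (our simplification, not ViCHA's implementation): a fraction of the ViT patch embeddings is hidden, and the training objective only scores the encoder on the hidden positions.

import torch

patches = torch.randn(1, 196, 768)   # a 14x14 grid of ViT patch embeddings
mask = torch.rand(1, 196) < 0.6      # hide roughly 60% of the patches
mask_token = torch.zeros(768)        # learnable in practice, constant here
masked_patches = torch.where(mask.unsqueeze(-1), mask_token, patches)

# The encoder sees `masked_patches`; the loss is computed only on masked positions.
print(masked_patches.shape, int(mask.sum()), 'patches hidden')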

UForm by Unum

We trained on a setup of 3 workstations, each with 4 consumer-grade RTX 3090 GPUs, connected over 200 GBit InfiniBand HDR. A single experiment takes around a day. Let's evaluate the results.

Zero-Shot Image Retrieval, English-only

To produce the UForm checkpoints, we further extended the ALBEF setup and filtered the datasets. Garbage in, garbage out, after all. Let's start with the Flickr dataset.

| Model | Dataset Size | Recall@1 | Recall@5 | Recall@10 |
| --- | --- | --- | --- | --- |
| CLIP | 400 M | 0.687 | 0.906 | 0.952 |
| ALBEF | 14 M | 0.759 | 0.963 | 0.981 |
| ViCHA | 1.1 M | 0.726 | 0.911 | 0.950 |
| UForm | 4 M | 0.727 | 0.915 | 0.949 |

With a 100x smaller dataset, ALBEF, ViCHA, and our UForm all essentially match or surpass the original OpenAI CLIP. A good start! AI evaluation, however, is even trickier than database benchmarking. ALBEF, for one, was fine-tuned on MS-COCO before being evaluated on Flickr, so its reported results are not truly zero-shot. Let's compare the MS-COCO results.

| Model | Dataset Size | Recall@1 | Recall@5 | Recall@10 |
| --- | --- | --- | --- | --- |
| CLIP | 400 M | 0.378 | 0.624 | 0.722 |
| ViCHA | 1.1 M | 0.471 | 0.738 | 0.828 |
| UForm | 4 M | 0.510 | 0.761 | 0.838 |

ALBEF wasn’t evaluated on the MS-COCO dataset.

Zero-shot Image Retrieval, Multilingual

The classic approach to adding multilingual capabilities to CLIP is to distill that knowledge from much larger language models. We took a different path and added a few more cross-lingual pre-training tasks instead.

| Model | English | German | Spanish | French | Italian | Russian | Japanese | Korean | Turkish | Chinese |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mCLIP | 95.0 | 93.0 | 93.6 | 93.1 | 93.1 | 90.0 | 84.2 | 89.0 | 93.0 | 94.0 |
| UForm | 96.6 | 93.3 | 94.7 | 94.0 | 93.9 | 90.6 | 88.0 | 92.5 | 94.8 | 93.4 |

UForm outperforms mCLIP in every language except Chinese.

Inference Speed

Accuracy improvements are fine for a PhD paper, but a production model also has to be efficient. We previously went to extreme lengths to manually quantize third-party models. Now we are shipping our own!

Let's compare our uform text encoder to bert-base-uncased, as used in CLIP-style models, and to distilbert-base-uncased, the smallest commonly used transformer.

| Model | Multilingual | Backend | Samples per Second | Speedup |
| --- | --- | --- | --- | --- |
| bert-base-uncased | No | PyTorch | 1’612 | |
| distilbert-base-uncased | No | PyTorch | 3’174 | x 1.96 |
| MiniLM-L12 | Yes | PyTorch | 3’604 | x 2.24 |
| MiniLM-L6 | No | PyTorch | 6’107 | x 3.79 |
| uform | Yes | PyTorch | 6’809 | x 4.22 |
| UForm | Yes | - | 20’787 | x 12.89 |

All of these measurements were conducted on a consumer-grade Nvidia RTX 3090 GPU. Best of all, you don't need to wait for our UForm SaaS product; you can already start using our neural networks!
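For reference, a rough way to estimate samples-per-second throughput for any Hugging Face text encoder could look like the snippet below. This is not our exact benchmark script, and the batch size and model name are only examples.

import time
import torch
from transformers import AutoModel, AutoTokenizer

device = 'cuda' if torch.cuda.is_available() else 'cpu'
name = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).to(device).eval()

# A single batch of 256 identical short sentences, just to measure throughput.
batch = tokenizer(['a small red panda in a zoo'] * 256,
                  padding=True, return_tensors='pt').to(device)

with torch.inference_mode():
    for _ in range(3):            # warm-up runs
        model(**batch)
    if device == 'cuda':
        torch.cuda.synchronize()
    start = time.perf_counter()
    runs = 20
    for _ in range(runs):
        model(**batch)
    if device == 'cuda':
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

print(f'{runs * 256 / elapsed:,.0f} samples per second')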

Try It!

There must be a paywall, right? No! Go get it on the Hugging Face portal!

Python

import uform
from PIL import Image

text = 'a small red panda in a zoo'
image = Image.open('red_panda.jpg')

model = uform.get_model('unum-cloud/uform-vl-multilingual') # or 'english'

# Pre-process the inputs; the image gets a leading batch dimension.
image_info = model.preprocess_image(image).unsqueeze(0)
text_info = model.preprocess_text(text)

# Uni-modal embeddings for retrieval, plus a joint embedding from the multimodal encoder.
image_embedding = model.encode_image(image_info)
text_embedding = model.encode_text(text_info)
joint_embedding = model.encode_multimodal(image=image_info, text=text_info)
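Once you have the embeddings, ranking candidates is a cosine similarity away. A minimal follow-up, assuming both uni-modal embeddings come out as single-row tensors of the same width:

import torch.nn.functional as F

# Higher cosine similarity means the caption matches the image better.
similarity = F.cosine_similarity(image_embedding, text_embedding)
print(similarity.item())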

UForm models come with a package of the same name, already available on our GitHub. It not only extends the Hugging Face transformers library to support the mid-fusion used in our models, but also brings nifty helpers for all things multi-modal!


We are actively adding new modalities, like documents, audio, and video, and you can request early access on our Discord. Who knows, maybe once we reach 1’000 stars on GitHub, we will release our pre-training libraries as well 😉
