“Text and Code Embeddings by Contrastive Pre-Training”, Arvind Neelakantan, Tao Xu, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek, Qiming Yuan, Nikolas Tezak, Jong Wook Kim, Chris Hallacy, Johannes Heidecke, Pranav Shyam, Boris Power, Tyna Eloundou Nekoul, Girish Sastry, Gretchen Krueger, David Schnurr, Felipe Petroski Such, Kenny Hsu, Madeleine Thompson, Tabarak Khan, Toki Sherbakov, Joanne Jang, Peter Welinder, Lilian Weng; 2022-01-24:

Text embeddings are useful features in many applications such as semantic search and computing text similarity. Previous work typically trains models customized for different use cases, varying in dataset choice, training objective and model architecture.

In this work, we show that contrastive learning pre-training on unsupervised data at scale leads to high quality vector representations of text and code. The same unsupervised text embeddings that achieve new state-of-the-art results in linear-probe classification also display impressive semantic search capabilities and sometimes even perform competitively with fine-tuned models.

Averaged over 7 linear-probe classification tasks, our best unsupervised model achieves a relative improvement in accuracy of 4% and 1.8% over the previous best unsupervised and supervised text embedding models, respectively. The same text embeddings, when evaluated on large-scale semantic search, attain a relative improvement of 23.4%, 14.7%, and 10.6% over previous best unsupervised methods on the MS MARCO, Natural Questions, and TriviaQA benchmarks, respectively.

Similarly to text embeddings, we train code embedding models on (text, code) pairs, obtaining a 20.8% relative improvement over prior best work on code search.

Figure 1: Average performance of unsupervised cpt-text models of different sizes across 22 tasks consisting of linear-probe classification, text search, and sentence similarity tasks.

…we leverage naturally occurring paired data to construct training data with no explicit labels. Text embedding models are trained on paired text data where we consider neighboring pieces of text on the Internet as positive pairs. Code embedding models treat the top-level docstring in a function along with its implementation as a (text, code) pair. The training signal of the contrastive objective on its own is not sufficient to learn useful representations and we overcome this by initializing our model with other pretrained models (Brown et al 2020; Chen et al 2021). Finally, we find that it is critical to use a sufficiently large batch to achieve the optimal performance. We show that this simple recipe combining pre-trained model initialization, large-batch contrastive learning and training at scale, can produce text and code embeddings that possess a broad range of capabilities.
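The recipe above amounts to a symmetric contrastive (InfoNCE-style) objective over batches of (text, text) or (text, code) pairs, where each example's positive is its paired row and every other row in the batch serves as a negative. A minimal numpy sketch, assuming cosine similarity and a hypothetical temperature value (the paper's exact implementation details may differ):

```python
import numpy as np

def info_nce_loss(text_emb, pair_emb, temperature=0.07):
    """Symmetric contrastive loss with in-batch negatives.

    Row i of `text_emb` is paired with row i of `pair_emb`; the other
    batch rows act as negatives. Illustrative sketch only.
    """
    # L2-normalize so dot products are cosine similarities.
    a = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    b = pair_emb / np.linalg.norm(pair_emb, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # (batch, batch) similarity matrix
    idx = np.arange(len(a))         # positives lie on the diagonal

    def xent(l):
        # Numerically stable log-softmax cross-entropy on the diagonal.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # Average the two directions: text→pair and pair→text.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Because the negatives come for free from the rest of the batch, no explicit negative mining or labeling is required; this is what makes the "naturally occurring paired data" sufficient as a training signal.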

We train a series of unsupervised text embedding models (cpt-text) of different sizes, ranging from 300M to 175B parameters, and observe a consistent performance improvement with increasing model size (Figure 1). On classification accuracy averaged across 7 linear-probe classification tasks in SentEval (Conneau & Kiela 2018), our largest unsupervised model achieves new state-of-the-art results with a relative improvement of 4% and 1.8% over the previous best unsupervised (Giorgi et al 2020) and supervised (Gao et al 2021) text embedding models, respectively.

…Next, we train code embedding models (cpt-code) using the same recipe. Our models learn via (text, code) pairs, extracted from open source code. We evaluate our model on CodeSearchNet (Husain et al 2020), a commonly used code search benchmark, where the task is to find the most relevant code snippet given a natural language query. Our models achieve new state-of-the-art results with a 20.8% relative improvement over the previous best result (Guo et al 2021). Unlike text embedding models, we observe no performance improvement on code search when increasing the number of parameters of cpt-code from 300M to 1.2B.

Table 9: Performance of the cpt-text 300M model on NQ dev set given different training batch sizes.
Batch Size MRR@10
1,536 71.4
12,288 84.7

Finally, we experiment with fine-tuning our models on several supervised datasets and study the transfer learning performance. When fine-tuned on NLI (Natural Language Inference) datasets, we see a further boost in linear-probe classification, outperforming the previous best transfer method (Gao et al 2021) by 2.2%. On SST-2 sentiment classification (Socher et al 2013), we find that our representations are sufficiently descriptive that even a simple k-NN classifier achieves results comparable to a linear-probe classifier. Interestingly, zero-shot performance with our embeddings outperforms the supervised neural network models introduced along with the release of the SST-2 dataset. We also fine-tune the unsupervised model on MS MARCO and evaluate it on a suite of zero-shot search tasks in the BEIR benchmark (Thakur et al 2021). In the transfer setting, our models achieve a 5.2% relative improvement over previous methods (Izacard et al 2021) and are comparable even with methods (Santhanam et al 2021; Formal et al 2021; Wang et al 2020) that demand substantially more computation at test time.
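The k-NN result above is notable because k-NN uses the embedding geometry directly, with no learned classifier head. A minimal sketch of such a classifier over precomputed embedding vectors, using cosine similarity and majority vote (all names here are illustrative, not from the paper):

```python
import numpy as np

def knn_predict(train_emb, train_labels, query_emb, k=5):
    """Classify each query by majority vote among its k nearest
    training embeddings under cosine similarity. Illustrative sketch."""
    # Normalize so dot products are cosine similarities.
    a = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    sims = q @ a.T                            # (n_query, n_train)
    nearest = np.argsort(-sims, axis=1)[:, :k]  # indices of top-k neighbors
    preds = []
    for row in nearest:
        votes = train_labels[row]
        vals, counts = np.unique(votes, return_counts=True)
        preds.append(vals[np.argmax(counts)])  # majority label
    return np.array(preds)
```

If embeddings of same-class sentences cluster together, this simple rule approaches linear-probe accuracy; that the paper observes exactly this on SST-2 is evidence the representations are linearly and locally well-separated.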

3.4.1. Effect Of Batch Size: Our ablation study highlights the effect of the model’s batch size on the final performance. Table 9 compares the performance of the cpt-text S (300M) model trained with different batch sizes on the NQ development set. Since we train with in-batch negative samples, a larger batch increases the chances of having hard negatives in a batch, resulting in a substantial performance boost. [as usual for contrastive learning or GANs]
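The mechanism is simple arithmetic: with in-batch negatives, each example in a batch of size B is contrasted against the other B − 1 examples, so the negative pool grows linearly (and the similarity matrix quadratically) with batch size. Applying that to the two batch sizes in Table 9:

```python
def in_batch_negatives(batch_size: int) -> int:
    """Negatives each example sees when positives are the batch diagonal."""
    return batch_size - 1

def similarity_matrix_entries(batch_size: int) -> int:
    """Entries in the (B, B) logit matrix computed per batch."""
    return batch_size * batch_size

# The two batch sizes compared in Table 9:
print(in_batch_negatives(1536))    # 1535 negatives per example
print(in_batch_negatives(12288))   # 12287 negatives per example
```

Eight times as many candidates per example makes it far more likely that some negative in the batch is genuinely hard, which is the mechanism the ablation credits for the MRR@10 jump.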