“One Embedder, Any Task: Instruction-Finetuned Text Embeddings (INSTRUCTOR)”, Hongjin Su, Weijia Shi, Jungo Kasai, Yizhong Wang, Yushi Hu, Mari Ostendorf, Wen-tau Yih, Noah Smith, Luke Zettlemoyer, Tao Yu; 2022-12-19:

We introduce INSTRUCTOR, a new method for computing text embeddings given task instructions: every text input is embedded together with instructions explaining the use case (eg. task and domain descriptions). Unlike encoders from prior work that are more specialized, INSTRUCTOR is a single embedder that can generate text embeddings tailored to different downstream tasks and domains, without any further training.

We first annotate instructions for 330 diverse tasks and train INSTRUCTOR on this multitask mixture with a contrastive loss.
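The contrastive objective pulls an instruction-prefixed query toward its paired positive and away from negatives. A minimal NumPy sketch of an InfoNCE-style loss over one query (this is an illustration of the general technique, not the paper's training code; the temperature value and toy embeddings are assumptions):

```python
import numpy as np

def info_nce(query_emb, pos_emb, neg_embs, temperature=0.05):
    """InfoNCE-style contrastive loss for one query: softmax over the
    positive and negative similarities, then -log p(positive).
    All embeddings are assumed L2-normalized, so dot product = cosine."""
    sims = np.array([query_emb @ pos_emb] + [query_emb @ n for n in neg_embs])
    logits = sims / temperature
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])

def unit(v):
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
q = unit(rng.normal(size=8))                    # toy query embedding
pos = unit(q + 0.1 * rng.normal(size=8))        # positive: close to the query
negs = [unit(rng.normal(size=8)) for _ in range(4)]  # random negatives

loss_good = info_nce(q, pos, negs)              # correct pairing: low loss
loss_bad = info_nce(q, negs[0], [pos] + negs[1:])  # swapped pairing: high loss
```

Minimizing this loss over the multitask mixture is what ties the instruction text into the embedding: the same sentence under different instructions must land near different positives.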

We evaluate INSTRUCTOR on 70 embedding evaluation tasks (66 of which are unseen during training), ranging from classification and information retrieval to semantic textual similarity and text generation evaluation. INSTRUCTOR, while having an order of magnitude fewer parameters than the previous best model, achieves state-of-the-art performance, with an average improvement of 3.4% compared to the previous best results on the 70 diverse datasets. [#1 as of 2023-01-31 on the MTEB leaderboard]

Our analysis suggests that INSTRUCTOR is robust to changes in instructions, and that instruction finetuning mitigates the challenge of training a single model on diverse datasets.

Our model, code, and data are available at https://instructor-embedding.github.io/.

4.3 Complexity of Instructions: Here we further analyze the role of instructions over varying degrees of their complexity. Specifically, we consider 4 levels of instruction complexity: N/A (no instructions), dataset tags, simple instructions, and detailed instructions (the original instruction format, §2.3). In the dataset tag setup, each example is prepended with its dataset name. For instance, on the Natural Questions dataset, the query is formatted as “Natural Questions; Input: who sings the song Love Story”. In the simple instruction setup, we use one or two words to describe the domain (eg. for Natural Questions, the input query is “Wikipedia Questions; Input: who sings the song Love Story”). Figure 5 shows their average performances across all task categories. Even with trivial dataset tags, INSTRUCTOR outperforms the original GTR model, illustrating the effectiveness of instructions for diverse training.
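The four complexity levels differ only in what is prepended to the raw input. A small sketch assembling the formatted query strings from the Natural Questions example above (the function name and argument names are illustrative, not the paper's code):

```python
def format_query(text, level, tag=None, simple=None, detailed=None):
    """Prepend instruction text of varying complexity to a raw query,
    following the 'X; Input: Y' pattern described in Section 4.3."""
    prefix = {"na": None, "tag": tag, "simple": simple, "detailed": detailed}[level]
    return text if prefix is None else f"{prefix}; Input: {text}"

q = "who sings the song Love Story"
no_instr = format_query(q, "na")
tagged = format_query(q, "tag", tag="Natural Questions")
simple = format_query(q, "simple", simple="Wikipedia Questions")
# tagged  -> "Natural Questions; Input: who sings the song Love Story"
# simple  -> "Wikipedia Questions; Input: who sings the song Love Story"
```

The detailed level would substitute the full task-and-domain instruction of §2.3 as the prefix; the embedder itself is unchanged across all four levels.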

As more information is provided in the instruction (from tag to simple, and from simple to detailed), we observe consistent improvements.

Figure 5: Average performance over varying degrees of instruction details. As the instructions become more detailed, the performance improves. N/A: no instructions are given; tag: dataset names are prepended; simple: one word or two for the task domain are given (eg. Wikipedia); and detailed: our proposed instructions (§2.3).

4.4 Model Sizes and Instruction Finetuning: Figure 6 studies the influence of model sizes. Specifically, we use GTR-Base (0.1B), GTR-Large (0.3B), and GTR-XL (1.5B). They are pretrained on the same corpus and differ only in the encoder size (the embedding sizes are the same). We compare models of various sizes and report the average performance across all the categories. As the encoder transformer model scales up, the performance continues to increase for both GTR and INSTRUCTOR. Nonetheless, the improvement in INSTRUCTOR is more pronounced, perhaps because embeddings with instructions benefit from larger capacities. This suggests that larger models generalize better across varied domains and task types, yielding embeddings suitable for general-purpose use. Further scale-ups are left to future work.

Figure 6: Average performance comparisons with varying sizes of models. INSTRUCTOR benefits more from scaling up, perhaps because instructions require additional computations.

4.5 Instructions Mitigate Domain Shifts: One advantage of instruction-based finetuning is that it improves models’ ability to generalize to unseen domains and tasks. To demonstrate this effectiveness, we evaluate on 3 unseen domains that INSTRUCTOR was not trained on: geography, biology, and civil comments. As shown in Table 3, INSTRUCTOR improves on GTR-Large’s performance in all 3 domains by margins above its average improvement, indicating that instructions help more when applying models to unseen or uncommon domains.

Table 3: Results of GTR-Large and INSTRUCTOR on unseen domains: geography, biology and civil comments. Domain-specific datasets benefit particularly from instruction finetuning. More results can be found in Table 9 & 10 in the appendix; they also show similar trends.
Model Geography Biology Civil Comments
GTR-Large 53.4 25.7 71.8
INSTRUCTOR 62.3 30.2 77.2
Relative gain (%) +16.7 +17.5 +7.5
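The relative-gain row of Table 3 is just the percentage change of INSTRUCTOR over GTR-Large. A quick sketch reproducing it (variable names are illustrative):

```python
def relative_gain(baseline, improved):
    """Relative gain in percent, rounded to one decimal as in Table 3."""
    return round((improved - baseline) / baseline * 100, 1)

gtr_large = {"geography": 53.4, "biology": 25.7, "civil_comments": 71.8}
instructor = {"geography": 62.3, "biology": 30.2, "civil_comments": 77.2}
gains = {k: relative_gain(gtr_large[k], instructor[k]) for k in gtr_large}
# gains -> {'geography': 16.7, 'biology': 17.5, 'civil_comments': 7.5}
```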

[Possible use-case: customized semantic maps eg. argument maps. Instead of a fixed layout suitable for only 1 use (or none at all), the user describes the current task (like planning a vacation) as the prompt, and then INSTRUCTOR embeds all the relevant datapoints, and 2 or 3 principal components are extracted from the embedding, and the items are laid out spatially. Presumably the layout would make more sense than a universal generic map, while being far easier for the user to understand, edit, or use.]
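The layout step of this use case is straightforward once the task-conditioned embeddings exist: center them and project onto the top principal components. A minimal NumPy sketch (the random matrix stands in for INSTRUCTOR embeddings, which would require loading the released model):

```python
import numpy as np

def pca_layout(embeddings, n_components=2):
    """Project embeddings onto their top principal components to get
    2-D (or 3-D) map coordinates for spatial layout."""
    X = np.asarray(embeddings, dtype=float)
    X = X - X.mean(axis=0)                       # center the point cloud
    # SVD of the centered matrix: right singular vectors = principal axes
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:n_components].T               # (n_items, n_components)

rng = np.random.default_rng(0)
emb = rng.normal(size=(10, 32))                  # stand-in for real embeddings
coords = pca_layout(emb, n_components=2)         # one (x, y) per item
```

Because the embeddings are conditioned on the user's task prompt, the principal axes of variation should reflect distinctions relevant to that task rather than generic topical similarity.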