“Precise Zero-Shot Dense Retrieval without Relevance Labels”, Luyu Gao, Xueguang Ma, Jimmy Lin, Jamie Callan, 2022-12-20:

While dense retrieval has been shown effective and efficient across tasks and languages, it remains difficult to create effective fully zero-shot dense retrieval systems when no relevance label is available.

In this paper, we recognize the difficulty of zero-shot learning and encoding relevance. Instead, we propose to pivot through Hypothetical Document Embeddings (HyDE). Given a query, HyDE first zero-shot instructs an instruction-following language model (eg. InstructGPT) to generate a hypothetical document. The document captures relevance patterns but is unreal and may contain false details. Then, an unsupervised contrastively learned encoder (eg. Contriever) encodes the document into an embedding vector. This vector identifies a neighborhood in the corpus embedding space, where similar real documents are retrieved based on vector similarity. This second step grounds the generated document to the actual corpus, with the encoder’s dense bottleneck filtering out the incorrect details.
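The two-step pipeline described above (generate a hypothetical document, then embed it and search the real corpus by vector similarity) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `toy_encode` is a hypothetical bag-of-words stand-in for a dense encoder such as Contriever, `fake_generate` is a hypothetical stand-in for an instruction-following language model, and `hyde_retrieve` is a name invented here.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Tuple


def toy_encode(text: str) -> Counter:
    """Hypothetical stand-in for a dense encoder (e.g. Contriever).

    Assumption: a bag-of-words count vector replaces the learned embedding.
    """
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hyde_retrieve(
    query: str,
    generate: Callable[[str], str],
    corpus: List[str],
    k: int = 2,
) -> List[Tuple[float, str]]:
    """Retrieve the k real documents closest to a generated hypothetical one."""
    # Step 1: generate a hypothetical document (it may contain false details).
    hypothetical = generate(query)
    # Step 2: embed the hypothetical document, not the query itself.
    hyp_vec = toy_encode(hypothetical)
    # Step 3: rank real corpus documents by similarity to that embedding.
    scored = [(cosine(hyp_vec, toy_encode(doc)), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def fake_generate(query: str) -> str:
    # Assumption: a fixed answer stands in for a real language model call.
    return "The capital of France is Paris, a city on the Seine"


corpus = [
    "Paris is the capital of France",
    "Bananas are rich in potassium",
    "The Eiffel Tower is in Paris",
]

results = hyde_retrieve("What is the capital of France?", fake_generate, corpus)
for score, doc in results:
    print(f"{score:.3f}  {doc}")
```

Only real corpus documents are ever returned, so invented details in the generated text (here, the Seine) can influence the ranking but cannot appear in the results. In a realistic setup both stand-ins would be replaced by actual model calls.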

Our experiments show that HyDE outperforms the state-of-the-art unsupervised dense retriever Contriever and shows strong performance comparable to fine-tuned retrievers, across various tasks (eg. web search, QA, fact verification) and languages (eg. sw, ko, ja).

…HyDE appears unsupervised. No model is trained in HyDE: both the generative model and the contrastive encoder remain intact. Supervision signals were only involved in instruction learning of our backbone LLM.

Figure 1: An illustration of the HyDE model. Document snippets are shown. HyDE serves all types of queries without changing the underlying GPT-3 and Contriever/mContriever models.

[Does this scale? Interesting possibility: the smarter the model, the better it will hallucinate documents which look as if they ‘answer’ the question or help the task. Does that mean it will also get better retrievals, especially if the number of documents in the retrieval database scales up as well?]

…5.1 Effect of Different Generative Models: In Table 4, we show HyDE using other instruction-following language models. In particular, we consider a 52-billion Cohere model (command-xlarge-20221108) and an 11-billion FLAN model (FLAN-T5-XXL; Wei et al 2022). Generally, we observe that all models bring improvement to the unsupervised Contriever, with larger models bringing larger improvements. At the time of writing, the Cohere model is still experimental without much detail disclosed. We can only tentatively hypothesize that training techniques may have also played some role in the performance difference.