Many important questions (e.g., “How to eat healthier?”) require conversation to establish context and explore in depth. However, conversational question answering (ConvQA) systems have long been stymied by scarce training data that is expensive to collect.
To address this problem, we propose a new technique for synthetically generating diverse and high-quality dialog data: dialog inpainting. Our approach takes the text of any document and transforms it into a two-person dialog between the writer and an imagined reader: we treat sentences from the article as utterances spoken by the writer, and then use a dialog inpainter to predict what the imagined reader asked or said in between each of the writer’s utterances.
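The transformation above can be sketched in code. The snippet below is a minimal illustration, not the paper's implementation: the sentence splitter, turn format, and sentinel token are all assumptions (the sentinel follows T5's masking convention). It builds the partial dialog fed to the inpainter, with the writer's sentences fixed and one imagined reader turn masked for prediction; previously inpainted reader turns can be filled in on later passes.

```python
import re

SENTINEL = "<extra_id_0>"  # T5-style mask token; the exact sentinel is an assumption


def split_sentences(passage):
    """Naive sentence split; treat each sentence as a writer utterance."""
    return [s for s in re.split(r"(?<=[.!?])\s+", passage.strip()) if s]


def partial_dialog(writer_turns, reader_turns, mask_index):
    """Interleave reader/writer turns, masking the reader turn at mask_index.

    reader_turns holds previously inpainted utterances (None where not yet
    generated); a seq2seq inpainter is asked to fill in the masked slot.
    """
    parts = []
    for i, writer in enumerate(writer_turns):
        if i == mask_index:
            parts.append(f"Reader: {SENTINEL}")
        elif reader_turns[i] is not None:
            parts.append(f"Reader: {reader_turns[i]}")
        parts.append(f"Writer: {writer}")
    return " ".join(parts)
```

Iterating `mask_index` over the writer turns and storing each prediction yields the complete two-person dialog for the passage.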
By applying this approach to passages from Wikipedia and the web, we produce WikiDialog and WebDialog, two datasets totaling 19 million diverse information-seeking dialogs—1,000× larger than the largest existing ConvQA dataset. Furthermore, human raters judge the answer adequacy and conversationality of WikiDialog to be as good as or better than that of existing manually collected datasets.
Using our inpainted data to pre-train ConvQA retrieval systems, we substantially advance the state of the art across three benchmarks (QReCC, OR-QuAC, and TREC CAsT), yielding up to 40% relative gains on standard evaluation metrics.
Figure 4: Retriever performance on QReCC when T5-Base DE▷ WikiDialogPT is trained with (a) varying fine-tuning data sizes, (b) varying inpainter model sizes, and (c) varying pre-training data sizes. Results in (a) do not include mined hard negatives.
Does our method scale with inpainting model size and data size? We now explore whether our dialog inpainting method benefits from scaling along two dimensions: the inpainter model size and the inpainted WikiDialog data size. Results are shown in Figures 4b and 4c.
From Figure 4b, we observe that retriever performance increases with inpainter model size with one exception: the T5-XL model slightly outperforms T5-XXL; we hypothesize this is due to insufficient hyperparameter search for T5-XXL. Surprisingly, the quality of data generated by T5-Small is already sufficient to substantially outperform current state-of-the-art methods.
In Figure 4c, we evaluate how retrievers pre-trained with 10K–11M dialogues sampled from WikiDialog perform on QReCC. We observe a roughly log-linear relationship between performance and pre-training data size that has not yet plateaued: simply inpainting more passages may further increase retrieval performance.
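The log-linear trend described above can be made concrete with a small least-squares fit of score = a + b·log10(size). The data points below are purely illustrative placeholders, not the paper's measurements; the sketch only shows how one would fit and extrapolate such a trend.

```python
import math


def fit_log_linear(sizes, scores):
    """Least-squares fit of score = a + b * log10(size)."""
    xs = [math.log10(n) for n in sizes]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(scores) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, scores)) / sum(
        (x - mx) ** 2 for x in xs
    )
    a = my - b * mx
    return a, b


# Hypothetical (not the paper's) points: pre-training dialog count vs. metric.
sizes = [1e4, 1e5, 1e6, 1e7]
scores = [0.30, 0.35, 0.40, 0.45]
a, b = fit_log_linear(sizes, scores)

# Extrapolate to 100M dialogs under the log-linear assumption.
pred = a + b * math.log10(1e8)
```

Under a log-linear fit, each 10× increase in pre-training data adds a roughly constant increment `b` to the metric, which is why the curve in Figure 4c appears as a straight line on a log-scaled x-axis.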