“Large-Scale Self-Supervised and Semi-Supervised Learning for Speech Translation”, 2021-04-14:
In this paper, we improve speech translation (ST) by effectively leveraging large quantities of unlabeled speech and text data in complementary ways. We explore both pretraining and self-training on the large Libri-Light speech audio corpus, as well as language modeling on Common Crawl text.
Our experiments improve over the previous state of the art by an average of 2.6 BLEU across all four CoVoST 2 language pairs considered, via a simple recipe combining wav2vec 2.0 pretraining, a single iteration of self-training, and decoding with a language model.
Unlike existing work, our approach does not leverage any supervision other than ST data. Code and models will be publicly released.
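The decoding-with-a-language-model step can be made concrete with shallow fusion, where the ST model's token log-probability is interpolated with an external LM's score at each beam-search step. Below is a minimal, self-contained sketch; the fusion scheme, the `lm_weight` value, the toy vocabulary, and all function names are illustrative assumptions, not the paper's actual code.

```python
import math
from typing import Callable, Dict, Tuple

# Hypothetical toy vocabulary; a real system would use subword units.
VOCAB = ["<eos>", "the", "cat", "sat"]

# A scorer maps a prefix of tokens to a log-probability per next token.
Scorer = Callable[[Tuple[str, ...]], Dict[str, float]]

def shallow_fusion_beam_search(
    st_logprob: Scorer,       # log P_ST(y_t | audio, y_<t)
    lm_logprob: Scorer,       # log P_LM(y_t | y_<t)
    lm_weight: float = 0.3,   # fusion weight, a tuned hyperparameter (assumed)
    beam_size: int = 2,
    max_len: int = 10,
):
    """Score hypotheses with log P_ST + lm_weight * log P_LM at each step."""
    beams = [((), 0.0)]  # (token sequence, cumulative fused score)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            st, lm = st_logprob(seq), lm_logprob(seq)
            for tok in VOCAB:
                fused = score + st[tok] + lm_weight * lm[tok]
                candidates.append((seq + (tok,), fused))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates:
            # Completed hypotheses leave the beam; others compete for slots.
            (finished if seq[-1] == "<eos>" else beams).append((seq, score))
            if len(beams) == beam_size:
                break
        if not beams:  # every surviving hypothesis has emitted <eos>
            break
    return max(finished + beams, key=lambda c: c[1])

# Toy distributions for demonstration only: the "ST model" prefers one
# reference sequence, and the "LM" is uniform (it only rescales scores).
def toy_st(prefix):
    ref = ["the", "cat", "sat", "<eos>"]
    tgt = ref[len(prefix)] if len(prefix) < len(ref) else "<eos>"
    return {t: math.log(0.7 if t == tgt else 0.1) for t in VOCAB}

def toy_lm(prefix):
    return {t: math.log(1.0 / len(VOCAB)) for t in VOCAB}

print(shallow_fusion_beam_search(toy_st, toy_lm))
# -> (('the', 'cat', 'sat', '<eos>'), <fused score>)
```

In practice the fusion weight is tuned on a development set; with a uniform toy LM as above it only shifts scores, whereas an LM trained on Common Crawl text would actually reshape the ranking of beam candidates.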