“Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations”, 2020-06-20:
We show for the first time that learning powerful representations from speech audio alone, followed by fine-tuning on transcribed speech, can outperform the best semi-supervised methods while being conceptually simpler. The framework, known as wav2vec 2.0, masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations, which are jointly learned.
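As a rough illustration of the contrastive task described above, the sketch below scores one masked time step: the context vector must pick out the true quantized latent from a set of distractors. This is a minimal pure-Python sketch, not the authors' implementation. The function names, the toy 2-dimensional vectors, and the default `temperature=0.1` are assumptions made for illustration; the masking, quantization, and distractor-sampling steps are omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(context, target, distractors, temperature=0.1):
    """Negative log-probability of the true quantized latent.

    context     -- context-network output at a masked step (hypothetical name)
    target      -- the true quantized latent for that step
    distractors -- quantized latents taken from other masked steps
    temperature -- softmax temperature (illustrative default, an assumption)
    """
    candidates = [target] + list(distractors)
    logits = [cosine(context, q) / temperature for q in candidates]
    # Numerically stable log-sum-exp over all candidates.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    # The true target sits at index 0.
    return log_sum - logits[0]


# Toy example: the context points the same way as the true target.
aligned = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
# Toy example: the context points the same way as a distractor instead.
confused = contrastive_loss([1.0, 0.0], [-1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

The loss is close to zero when the context vector is most similar to the true target and grows when a distractor is the better match, which is the signal that drives pre-training in this kind of objective.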
Experiments using all labeled data of LibriSpeech achieve 1.8/3.3 WER (word error rate) on the clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state of the art on the 100-hour subset while using 100× less labeled data. Using just 10 minutes of labeled data and pre-training on 53k hours of unlabeled data, it still achieves 4.8/8.2 WER.
This demonstrates the feasibility of speech recognition with limited amounts of labeled data.
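The WER figures above are percentages: the number of word-level substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The following is a minimal sketch of that calculation, assuming whitespace tokenization and no text normalization; the function name `word_error_rate` and the sample sentences are illustrative and are not taken from the paper's evaluation code.

```python
def word_error_rate(reference, hypothesis):
    """Return WER in percent using word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the words of ref seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return 100.0 * prev[-1] / len(ref)


# One dropped word out of six reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Under this definition, a WER of 1.8 corresponds to roughly 2 word errors per 100 reference words.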