“Robust Self-Supervised Audio-Visual Speech Recognition”, 2022-01-05:
Audio-based automatic speech recognition (ASR) degrades in noisy environments and is particularly vulnerable to interfering speech, as the model cannot determine which speaker to transcribe. Audio-visual speech recognition (AVSR) systems improve robustness by complementing the audio stream with visual information, which is invariant to acoustic noise and helps the model focus on the desired speaker.
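This complementarity is commonly realized by fusing frame-level features from the two streams before a shared encoder. Below is a minimal sketch of that idea, assuming PyTorch; the module names, feature dimensions, and fusion-by-concatenation choice are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of late audio-visual feature fusion (illustrative only):
# per-modality projections whose frame-level features are concatenated
# before a shared transformer backbone.
import torch
import torch.nn as nn

class AVFusionEncoder(nn.Module):
    def __init__(self, audio_dim=104, video_dim=512, d_model=256, n_layers=2):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)  # e.g. stacked filterbank frames
        self.video_proj = nn.Linear(video_dim, d_model)  # e.g. lip-region embeddings
        layer = nn.TransformerEncoderLayer(d_model=2 * d_model, nhead=8,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, audio_feats, video_feats):
        # Both streams are assumed pre-aligned to the same frame rate.
        a = self.audio_proj(audio_feats)   # (B, T, d_model)
        v = self.video_proj(video_feats)   # (B, T, d_model)
        fused = torch.cat([a, v], dim=-1)  # (B, T, 2 * d_model)
        return self.backbone(fused)        # fused, noise-robust representation

# Usage: 3 s of 25 fps video aligned with stacked audio features.
enc = AVFusionEncoder()
out = enc(torch.randn(1, 75, 104), torch.randn(1, 75, 512))
print(out.shape)  # torch.Size([1, 75, 512])
```

When the audio stream is corrupted, a shared encoder of this kind can lean on the visual features, which is the mechanism behind the robustness gains reported below.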
In this work, we present a self-supervised AVSR framework built upon Audio-Visual HuBERT (AV-HuBERT), a state-of-the-art audio-visual speech representation learning model, and evaluate it on LRS3, the largest available AVSR benchmark dataset.
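The noisy test conditions quoted below (babble noise in particular) are typically simulated by mixing a noise clip into clean speech at a controlled signal-to-noise ratio. Here is a minimal sketch of such a mixing recipe, assuming NumPy; this is a standard augmentation pattern, not necessarily the paper's exact pipeline.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Mix a noise clip into speech at a target SNR (dB).

    Standard recipe: scale the noise so that
    10 * log10(speech_power / scaled_noise_power) == snr_db.
    """
    # Tile or crop the noise to match the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example: mix babble at an SNR drawn uniformly from [-10, 10] dB.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)  # 1 s of 16 kHz "speech" (placeholder)
babble = rng.standard_normal(8000)   # placeholder babble clip
noisy = mix_at_snr(speech, babble, snr_db=rng.uniform(-10, 10))
```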
Our approach outperforms the prior state-of-the-art by roughly 50% relative (28.0% vs. 14.1% word error rate, WER) in the presence of babble noise, while using less than 10% of the labeled data (433hr for the prior system vs. 30hr for ours), and it reduces the WER of an audio-based model by over 75% (25.8% vs. 5.8%) on average.
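For reference, both headline figures are relative WER reductions and can be verified directly from the numbers quoted:

```python
# Relative WER reduction: (baseline - ours) / baseline.
def rel_reduction(baseline_wer, our_wer):
    return (baseline_wer - our_wer) / baseline_wer

print(f"{rel_reduction(28.0, 14.1):.1%}")  # 49.6% -> "~50%" vs. prior state-of-the-art
print(f"{rel_reduction(25.8, 5.8):.1%}")   # 77.5% -> "over 75%" vs. the audio-based model
```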
These results represent a significant step for AVSR, showing that self-supervised audio-visual pretraining can deliver robust transcription in acoustic conditions that severely degrade audio-only systems, while requiring far less labeled data.