When scaled to 680,000 hours of multilingual and multitask supervision [much sourced from YouTube—at least 1 million hours], the resulting Whisper models:
generalize well to standard benchmarks and are often competitive with prior fully supervised results, but in a zero-shot transfer setting without the need for any fine-tuning. When compared to humans, the models approach their accuracy and robustness. [It can even translate between languages!]
We are releasing models and inference code to serve as a foundation for further work on robust speech processing.
…In addition to scale, our work also focuses on broadening the scope of weakly supervised pre-training beyond English-only speech recognition to be both multilingual and multitask. Of those 680,000 hours of audio, 117,000 hours cover 96 other languages. The dataset also includes 125,000 hours of X → English translation data. We find that for sufficiently large models there is no drawback and even benefits to joint multilingual and multitask training.
Our work suggests that simple scaling of weakly supervised pre-training has been underappreciated so far for speech recognition. We achieve these results without the need for the self-supervision or self-training techniques that have been a mainstay of recent large-scale speech recognition work.
Data Processing: …In contrast to a lot of work on speech recognition, we train Whisper models to predict the raw text of transcripts without any substantial standardization, relying on the expressiveness of sequence-to-sequence models to learn to map between utterances and their transcribed form.
…We construct the dataset from audio that is paired with transcripts on the Internet. This results in a very diverse dataset covering a broad distribution of audio from many environments, recording setups, speakers, and languages. While diversity in audio quality can help train a model to be robust, diversity in transcript quality is not similarly beneficial. Initial inspection showed a large amount of subpar transcripts in the raw dataset. To address this, we developed several automated filtering methods to improve transcript quality.
…Many transcripts on the internet are not actually human-generated but the output of existing ASR systems. Recent research has shown that training on datasets of mixed human and machine-generated data can substantially impair the performance of translation systems (Ghorbani et al. 2021). In order to avoid learning “transcript-ese”, we developed many heuristics to detect and remove machine-generated transcripts from the training dataset. Many existing ASR systems output only a limited subset of written language which removes or normalizes away aspects that are difficult to predict from only audio signals such as complex punctuation (exclamation points, commas, and question marks), formatting whitespace such as paragraphs, or stylistic aspects such as capitalization. An all-uppercase or all-lowercase transcript is very unlikely to be human generated. While many ASR systems include some level of inverse text normalization, it is often simple or rule-based and still detectable from other unhandled aspects such as never including commas. We also use an audio language detector, which was created by fine-tuning a prototype model trained on a prototype version of the dataset on VoxLingua107 (Valk & Alumäe 2021), to ensure that the spoken language matches the language of the transcript according to CLD2.
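The all-caps/all-lowercase and missing-punctuation cues above lend themselves to simple text filters. The sketch below is our own illustration of such heuristics, not the paper's actual (unpublished) filtering code; the thresholds are assumptions:

```python
import re

def looks_machine_generated(transcript: str) -> bool:
    """Illustrative heuristics only; the paper's exact rules are not published."""
    text = transcript.strip()
    letters = re.sub(r"[^A-Za-z]", "", text)
    # All-uppercase or all-lowercase text is very unlikely to be human-written.
    if letters and (letters.isupper() or letters.islower()):
        return True
    # Many ASR systems never emit commas or other complex punctuation;
    # a long transcript with none of them is suspicious (threshold assumed).
    if len(text.split()) > 50 and not re.search(r"[,!?]", text):
        return True
    return False

print(looks_machine_generated("THIS IS AN ALL CAPS TRANSCRIPT"))
print(looks_machine_generated("Hello, how are you today?"))
```

In practice such filters would be combined with the audio-vs-transcript language-agreement check described above before an example is kept for training.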
…2. Model: Since the focus of our work is on studying the capabilities of large-scale supervised pre-training for speech recognition, we use an off-the-shelf architecture to avoid confounding our findings with model improvements. We chose an encoder-decoder Transformer (Vaswani et al. 2017) as this architecture has been well validated to scale reliably.
The decoder uses learned position embeddings and tied input-output token representations (Press & Wolf 2017). The encoder and decoder have the same width and number of transformer blocks. Figure 1 summarizes the model architecture.
We use the same byte-level BPE text tokenizer used in GPT-2 (Sennrich et al. 2015; Radford et al. 2019) for the English-only models and refit the vocabulary (but keep the same size) for the multilingual models to avoid excessive fragmentation on other languages, since the GPT-2 BPE vocabulary is English-only.
Figure 1: Overview of our approach.
A sequence-to-sequence Transformer model is trained on many different speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. All of these tasks are jointly represented as a sequence of tokens to be predicted by the decoder, allowing for a single model to replace many different stages of a traditional speech processing pipeline.
The multitask training format uses a set of special tokens that serve as task specifiers or classification targets, as further explained in §2.3.
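The §2.3 special-token format can be sketched as a small prompt builder. This is a simplified illustration (`multitask_prompt` is a hypothetical helper; the real format also includes previous-text conditioning and per-segment timestamp tokens):

```python
def multitask_prompt(language: str, task: str, timestamps: bool = False) -> list:
    """Sketch of the special-token sequence the decoder is conditioned on,
    simplified from the multitask format described in the paper's §2.3."""
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")  # omit per-segment timestamp tokens
    return tokens

# e.g. German audio translated into English text:
print(multitask_prompt("de", "translate"))
```

Because the task and language are themselves just predicted tokens, the same decoder weights serve transcription, translation, language ID, and voice activity detection.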
[Note: it is possible to train just the decoder Transformer on pure autoregressive text data to finetune it without corresponding audio data: “Text-only transcription (allows dataset-specific fine-tuning)” (at bottom right). This would be particularly useful with user personalization, or for dynamic evaluation while proofreading Whisper transcripts (especially for fixing Whisper’s numerous errors transcribing rare/novel proper nouns/technical terminology).]
…Due to only training for a few epochs, over-fitting is not a large concern, and we do not use any data augmentation or regularization and instead rely on the diversity contained within such a large dataset to encourage generalization and robustness. Please see Appendix F for full training hyperparameters.
[LLM confabulation] During early development and evaluation we observed that Whisper models had a tendency to transcribe plausible but almost always incorrect guesses for the names of speakers. This happens because many transcripts in the pre-training dataset include the name of the person who is speaking, encouraging the model to try to predict them, but this information is only rarely inferable from only the most recent 30 seconds of audio context. To avoid this, we fine-tune Whisper models briefly on the subset of transcripts that do not include speaker annotations which removes this behavior.
Figure 3: Correlation of pre-training supervision amount with downstream speech recognition performance.
The amount of pre-training speech recognition data for a given language is very predictive of zero-shot performance on that language in Fleurs.
…5. Translation We study the translation capabilities of Whisper models by measuring their performance on the X → English subset of CoVoST2 (Wang et al. 2020b). We compare with Maestro, mSLAM, and XLS-R, the highest-performing prior work.
We achieve a new state-of-the-art of 29.1 BLEU zero-shot without using any of the CoVoST2 training data. We attribute this to the 68,000 hours of X → English translation data for these languages in our pre-training dataset which, although noisy, is vastly larger than the 861 hours of training data for X → English translation in CoVoST2. Since Whisper evaluation is zero-shot, it does particularly well on the lowest resource grouping of CoVoST2, improving over mSLAM by 6.7 BLEU. Conversely, the best Whisper model does not actually improve over Maestro and mSLAM on average for the highest resource languages.
For an additional analysis on an even wider set of languages, we also re-purpose Fleurs, which is a speech recognition dataset, as a translation dataset. Since the same sentences are transcribed for every language, we use the English transcripts as reference translations. In Figure 4 we visualize the correlation between the amount of translation training data per language and the resulting zero-shot BLEU score on Fleurs. While there is a clear trend of improvement with increasing training data, the squared correlation coefficient is only 0.24, much lower than the 0.83 observed for speech recognition. We suspect this is partly caused by the noisier training data due to errors in audio language identification. As an example, Welsh (CY) is an outlier with much worse than expected performance at only 13 BLEU despite supposedly having 9,000 hours of translation data. This large amount of Welsh translation data is surprising, ranking 4th overall for translation data and ahead of some of the most spoken languages in the world like French, Spanish, and Russian. Inspection shows the majority of supposedly Welsh translation data is actually English audio with English captions, where the English audio was mis-classified as Welsh by the language identification system, resulting in it being included as translation training data rather than transcription data according to our dataset creation rules.
Figure 4: Correlation of pre-training supervision amount with downstream translation performance.
The amount of pre-training translation data for a given language is only moderately predictive of Whisper’s zero-shot performance on that language in Fleurs.
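The squared correlation coefficients quoted here (0.24 for translation, 0.83 for recognition) are ordinary Pearson r² values between log training hours and the metric. A minimal sketch of the computation; the (hours, BLEU) pairs below are made-up placeholders, not Fleurs numbers:

```python
import math

def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov * cov / (vx * vy)

# Hypothetical (log-hours, BLEU) pairs -- placeholders, not the paper's data.
log_hours = [math.log10(h) for h in (10, 100, 1000, 9000)]
bleu = [2.0, 8.0, 20.0, 13.0]  # one outlier like Welsh drags r^2 down
print(round(r_squared(log_hours, bleu), 2))
```

A single mislabeled high-volume language, as with Welsh above, is enough to depress r² substantially even when the overall trend is real.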
Figure 6: Whisper is competitive with state-of-the-art commercial and open-source ASR systems in long-form transcription.
The distributions of word error rates from 6 ASR systems on 7 long-form datasets are compared, where the input lengths range from a few minutes to a few hours. The boxes show the quartiles of per-example WERs, and the per-dataset aggregate WERs are annotated on each box. Our model outperforms the best open source model (NVIDIA STT [Conformer]) on all datasets, and in most cases, commercial ASR systems as well.
Figure 7: Whisper’s performance is close to that of professional human transcribers.
This plot shows the WER distributions of 25 recordings from the Kincaid46 dataset transcribed by Whisper, the same 4 commercial ASR systems from Figure 6 (A-D), one computer-assisted human transcription service (E) and 4 human transcription services (F-I). The box plot is superimposed with dots indicating the WERs on individual recordings, and the aggregate WER over the 25 recordings is annotated on each box.
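The WER values in Figures 6 and 7 are word-level Levenshtein distances normalized by reference length. A self-contained reference implementation (papers typically apply a text normalizer before scoring, which is omitted here):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: (substitutions + insertions + deletions) at the word
    level, divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution ("sat" -> "sit") plus one insertion ("on") over 3 words:
print(word_error_rate("the cat sat", "the cat sit on"))
```

Note WER can exceed 1.0 when the hypothesis contains many insertions, which is why long-form evaluation (where segmentation drift compounds) is a harder test than short utterances.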
…we study the zero-shot generalization of Whisper models as a function of the model size. Our analysis is summarized in Figure 8.
With the exception of English speech recognition, performance continues to increase with model size across multilingual speech recognition, speech translation, and language identification. The diminishing returns for English speech recognition could be due to saturation effects from approaching human-level performance.
Figure 8: Zero-shot Whisper performance scales reliably across tasks and languages with increasing model size. Lightly shaded lines represent individual datasets or languages, showing that performance is more varied than the smooth trends in aggregate performance. Large V2 is distinguished with a dashed orange line since it includes several changes that are not present for the smaller models in this analysis.
…4.2. Dataset Scaling At 680,000 hours of labeled audio, the Whisper dataset is one of the largest ever created in supervised speech recognition. Exactly how important is the raw dataset size to Whisper’s performance? To study this, we trained a series of medium-sized models on sub-sampled versions of the dataset which are 0.5%, 1%, 2%, 4%, and 8% of the full dataset size and compared their performance with the same medium-sized model trained on the whole dataset.
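The sub-sampled dataset sizes in Table 6 are exactly the stated fractions of 681,070 hours, rounded. A minimal sketch of such fixed-fraction sub-sampling (the paper's actual sampling procedure is not described, so treat this as an assumption):

```python
import random

def subsample(examples, fraction, seed=0):
    """Draw a fixed-fraction random subset of the training set.
    Illustrative only; the paper does not specify its sampling scheme."""
    rng = random.Random(seed)
    k = max(1, round(len(examples) * fraction))
    return rng.sample(examples, k)

full = list(range(681_070))  # one entry per hour of audio, as a stand-in
for frac in (0.005, 0.01, 0.02, 0.04, 0.08):
    print(frac, len(subsample(full, frac)))  # reproduces the Table 6 sizes
```

Fixing the seed keeps the nested subsets comparable across runs, which matters when attributing performance differences to dataset size alone.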
Table 6: Performance improves with increasing dataset size. English speech recognition performance refers to an average over 12 datasets while the Multilingual speech recognition reports performance on the overlapping subset of languages in Fleurs and X → English translation reports average BLEU on CoVoST2. Dataset size reported in hours.
| Dataset size | English WER (↓) | Multilingual WER (↓) | X → English BLEU (↑) |
|---|---|---|---|
| 3,405 | 30.5 | 92.4 | 0.2 |
| 6,811 | 19.6 | 72.7 | 1.7 |
| 13,621 | 14.4 | 56.6 | 7.9 |
| 27,243 | 12.3 | 45.0 | 13.9 |
| 54,486 | 10.9 | 36.4 | 19.2 |
| 681,070 | 9.9 | 29.2 | 24.8 |
All increases in the dataset size result in improved performance on all tasks, although we see large variability in improvement rates across tasks and sizes. Performance improves rapidly on English speech recognition from 3,000 to 13,000 hours and then slows down noticeably between 13,000 and 54,000 hours. Using the full dataset, which corresponds to another 12.5× increase in size, results in only a further 1 point drop in WER.
This mirrors the diminishing returns observed with model size scaling for English speech recognition and could similarly be explained by saturation effects when approaching human-level performance. Improvements in WER follow a power-law trend for multilingual speech recognition till 54,000 hours and then deviate from this trend, improving only a further 7 points when increasing to the full dataset size. For X → English translation, performance is practically zero when training on 7,000 hours of audio or less, and then follows a roughly log-linear improvement trend till 54,000 hours before also showing diminishing returns when further scaling to the full dataset size.
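The power-law claim for multilingual WER can be checked directly against the Table 6 numbers. A minimal least-squares fit in log-log space (our own sketch, not the paper's analysis) shows the extrapolated trend predicts a WER around 15 at full scale, well below the observed 29.2, quantifying the deviation:

```python
import math

# (hours, multilingual WER) for the sub-sampled runs in Table 6.
points = [(3405, 92.4), (6811, 72.7), (13621, 56.6), (27243, 45.0), (54486, 36.4)]

# Least-squares fit of WER ~ a * hours**b in log-log space.
xs = [math.log(h) for h, _ in points]
ys = [math.log(w) for _, w in points]
n = len(points)
mx, my = sum(xs) / n, sum(ys) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = math.exp(my - b * mx)

# Extrapolate the fitted power law to the full 681,070-hour dataset.
predicted_full = a * 681_070 ** b
print(f"b = {b:.3f}, extrapolated WER at full scale = {predicted_full:.1f}")
```

The gap between the extrapolated and observed full-scale WER is exactly the "deviation from trend" the text describes, and is what motivates the under-training hypothesis in the next paragraph.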
[cf. Chinchilla] The general trend across tasks of diminishing returns when moving from 54,000 hours to our full dataset size of 680,000 hours could suggest that the current best Whisper models are under-trained relative to dataset size and performance could be further improved by a combination of longer training and larger models. It could also suggest that we are nearing the end of performance improvements from dataset size scaling for speech recognition. Further analysis is needed to characterize “scaling laws” for speech recognition in order to decide between these explanations.
[The “scissors cross” in scaling generalist models:]
Figure 9: Multitask and multilingual transfer improves with scale.
For small models, performance on English speech recognition degrades when trained jointly in a multitask and multilingual setup. However, multilingual and multitask models benefit more from scale and eventually outperform models trained on English data only. 95% bootstrap estimate confidence intervals are shown.