“Careless Whisper: Speech-To-Text Hallucination Harms”, 2024-02-12:
Speech-to-text services aim to transcribe input audio as accurately as possible. They increasingly play a role in everyday life, for example in personal voice assistants or in customer-company interactions.
We evaluate OpenAI’s Whisper, a state-of-the-art automated speech recognition service that, as of 2023, outperforms industry competitors. While many of Whisper’s transcriptions were highly accurate, we find that:
roughly 1% of audio transcriptions contained entire hallucinated phrases or sentences which did not exist in any form in the underlying audio.
We thematically analyze the Whisper-hallucinated content, finding that 38% of confabulations include explicit harms such as perpetuating violence, making up inaccurate associations, or implying false authority.
We then study why confabulations occur by comparing confabulation rates between speakers with aphasia (a disorder that lowers the ability to express oneself using speech and voice) and a control group.
We find that confabulations occur disproportionately for individuals whose speech contains a larger share of non-vocal (pause) time, a common symptom of aphasia.
We call on industry practitioners to ameliorate these language-model-based confabulations in Whisper, and to raise awareness of potential biases amplified by confabulations in downstream applications of speech-to-text models.
[This may be due to Whisper having extremely localized attention patterns and hardly incorporating any history/context (whereas even small-context LLMs typically have contexts orders of magnitude larger), so confabulations can build & spiral within seconds.]
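As a rough illustration of both points, here is a minimal sketch using the `openai-whisper` and `soundfile` packages (the file path and model size are placeholders, not from the paper). It transcribes an audio file with `condition_on_previous_text=False`, a commonly suggested mitigation that stops Whisper from feeding its previous window’s output back in as a prompt, and estimates the non-vocal share of the audio from Whisper’s own segment timestamps. This is not the paper’s methodology, just one way to approximate the covariate it studies.

```python
"""Sketch: transcribe with openai-whisper and estimate non-vocal share.

Assumptions: the `openai-whisper` and `soundfile` packages are installed,
ffmpeg is available, and `sample.wav` is a placeholder audio file.
"""
import soundfile as sf
import whisper

AUDIO_PATH = "sample.wav"           # placeholder path
model = whisper.load_model("base")  # small model for illustration only

# condition_on_previous_text=False prevents the previous window's output
# from being used as a prompt for the next window, a commonly suggested
# mitigation for hallucinations that build & spiral across windows
# (not a method evaluated in the paper).
result = model.transcribe(
    AUDio_PATH if False else AUDIO_PATH,  # keep a single source of truth for the path
    condition_on_previous_text=False,
    temperature=0.0,
)

# Estimate the non-vocal share from gaps between Whisper's own segment
# timestamps; a dedicated voice-activity detector would be more precise.
total_duration = sf.info(AUDIO_PATH).duration
voiced_duration = sum(seg["end"] - seg["start"] for seg in result["segments"])
nonvocal_share = 1.0 - min(voiced_duration / max(total_duration, 1e-6), 1.0)

print(result["text"].strip())
print(f"estimated non-vocal share: {nonvocal_share:.1%}")
```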