WhisperX: Time-Accurate Speech Transcription of Long-Form Audio

03/01/2023
by Max Bain, et al.

Large-scale, weakly-supervised speech recognition models, such as Whisper, have demonstrated impressive results on speech recognition across domains and languages. However, their application to long audio transcription via buffered or sliding-window approaches is prone to drifting, hallucination, and repetition, and it prohibits batched transcription because these approaches are sequential. Further, the timestamps corresponding to each utterance are prone to inaccuracies, and word-level timestamps are not available out-of-the-box. To overcome these challenges, we present WhisperX, a time-accurate speech recognition system with word-level timestamps utilising voice activity detection and forced phoneme alignment. In doing so, we demonstrate state-of-the-art performance on long-form transcription and word segmentation benchmarks. Additionally, we show that pre-segmenting audio with our proposed VAD Cut & Merge strategy improves transcription quality and enables a twelve-fold transcription speedup via batched inference.
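
As a rough illustration of the VAD-based cut-and-merge pre-segmentation mentioned in the abstract, the Python sketch below turns a list of detected speech regions into chunks that could be transcribed independently and therefore batched. The abstract does not specify the algorithm, so everything here is an assumption made for illustration: the function names, the uniform cutting rule for over-long regions, and the 30-second cap (chosen to match Whisper's 30-second input window). It is not the authors' implementation.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec) of one speech region


def cut_segments(segments: List[Segment], max_len: float = 30.0) -> List[Segment]:
    """Split any speech region longer than max_len into pieces of at most max_len.

    Assumption: pieces are cut at fixed intervals; a real system might
    instead cut where the VAD is least confident that speech is present.
    """
    out: List[Segment] = []
    for start, end in segments:
        while end - start > max_len:
            out.append((start, start + max_len))
            start += max_len
        out.append((start, end))
    return out


def merge_segments(segments: List[Segment], max_len: float = 30.0) -> List[Segment]:
    """Greedily merge neighbouring regions while the merged span stays within max_len."""
    merged: List[Segment] = []
    for start, end in segments:
        if merged and end - merged[-1][0] <= max_len:
            # Extend the current chunk to cover this region (and the gap before it).
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


def vad_cut_merge(segments: List[Segment], max_len: float = 30.0) -> List[Segment]:
    """Hypothetical helper: cut over-long regions, then merge short neighbours."""
    return merge_segments(cut_segments(sorted(segments), max_len), max_len)


if __name__ == "__main__":
    # Made-up speech regions in seconds, as a VAD model might report them.
    speech = [(0.0, 5.0), (6.0, 12.0), (40.0, 75.0)]
    for chunk in vad_cut_merge(speech):
        print(chunk)
```

Because each resulting chunk is bounded in length and does not depend on the transcription of the previous chunk, the chunks can be padded to a common size and passed to the recognition model as a batch rather than one after another.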


research
07/18/2023

OxfordVGG Submission to the EGO4D AV Transcription Challenge

This report presents the technical details of our submission on the EGO4...
research
05/10/2021

What shall we do with an hour of data? Speech recognition for the un- and under-served languages of Common Voice

This technical report describes the methods and results of a three-week ...
research
09/03/2018

LRS3-TED: a large-scale dataset for visual speech recognition

This paper introduces a new multi-modal dataset for visual and audio-vis...
research
03/10/2018

Speech Recognition: Keyword Spotting Through Image Recognition

The problem of identifying voice commands has always been a challenge du...
research
10/08/2021

Input Length Matters: An Empirical Study Of RNN-T And MWER Training For Long-form Telephony Speech Recognition

End-to-end models have achieved state-of-the-art results on several auto...
research
05/16/2020

Large scale weakly and semi-supervised learning for low-resource video ASR

Many semi- and weakly-supervised approaches have been investigated for o...
research
05/17/2019

The Audio Auditor: Participant-Level Membership Inference in Internet of Things Voice Services

Voice interfaces and assistants implemented by various services have bec...
