Word Order Does Not Matter For Speech Recognition
In this paper, we study the training of automatic speech recognition systems in a weakly supervised setting where the order of the words in the transcript labels of the audio training data is not known. We train a word-level acoustic model that aggregates the output distributions of all frames using a LogSumExp operation and uses a cross-entropy loss to match the ground-truth word distribution. Using the pseudo-labels generated by this model on the training set, we then train a letter-based acoustic model with the Connectionist Temporal Classification (CTC) loss. Our system achieves a 2.3% word error rate on LibriSpeech, closely matching the supervised baseline's performance.
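As a concrete illustration of the first stage, here is a minimal PyTorch sketch of an order-free word-level loss: per-frame posteriors over the word vocabulary are pooled across time with LogSumExp and matched against the transcript's bag-of-words distribution via cross-entropy. The names (`frame_logits`, `target_words`) and the exact pooling and normalization details are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def order_free_word_loss(frame_logits: torch.Tensor,
                         target_words: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between a LogSumExp-pooled utterance-level word
    distribution and the order-free (bag-of-words) transcript distribution.

    frame_logits: (T, V) unnormalized per-frame scores over a word vocabulary.
    target_words: (N,) ids of the words in the transcript; order is ignored.
    """
    T, V = frame_logits.shape
    # Per-frame log-posteriors over the word vocabulary.
    log_probs = F.log_softmax(frame_logits, dim=-1)              # (T, V)
    # Pool over time with LogSumExp, then renormalize so the pooled
    # scores form a single utterance-level log-distribution.
    pooled = torch.logsumexp(log_probs, dim=0)                   # (V,)
    pooled_log_probs = pooled - torch.logsumexp(pooled, dim=0)
    # Empirical word distribution of the transcript (order-free).
    target_dist = torch.zeros(V).scatter_add_(
        0, target_words, torch.ones(target_words.numel()))
    target_dist = target_dist / target_dist.sum()
    # Cross-entropy against the pooled model distribution.
    return -(target_dist * pooled_log_probs).sum()

# Illustrative usage: 100 frames, 5,000-word vocabulary, 12-word transcript.
frame_logits = torch.randn(100, 5000, requires_grad=True)
target_words = torch.randint(0, 5000, (12,))
loss = order_free_word_loss(frame_logits, target_words)
loss.backward()
```

Note that nothing in this loss depends on where in the utterance a word occurs, which is what makes training possible without knowing the word order.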
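The second stage is conventional: decode the word-level model on the training audio to obtain pseudo-labels, then train a letter-based model with the standard CTC loss. Below is a minimal sketch using PyTorch's built-in `nn.CTCLoss`; the shapes and random tensors stand in for a real acoustic model and real pseudo-labels.

```python
import torch
import torch.nn as nn

# Stage two: letter-level CTC training on pseudo-labels obtained by
# decoding the word-level model on the training set.
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

T, B, C, S = 200, 4, 29, 30   # frames, batch, letters (+ blank), label length
log_probs = torch.randn(T, B, C, requires_grad=True).log_softmax(-1)
pseudo_labels = torch.randint(1, C, (B, S))          # stand-in letter ids
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), S, dtype=torch.long)

loss = ctc(log_probs, pseudo_labels, input_lengths, target_lengths)
loss.backward()
```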