Word Order Does Not Matter For Speech Recognition

10/12/2021
by Vineel Pratap, et al.

In this paper, we study the training of an automatic speech recognition system in a weakly supervised setting where the order of words in the transcript labels of the audio training data is not known. We train a word-level acoustic model which aggregates the distribution of all output frames using a LogSumExp operation and uses a cross-entropy loss to match the ground-truth word distribution. Using the pseudo-labels generated from this model on the training set, we then train a letter-based acoustic model using the Connectionist Temporal Classification loss. Our system achieves a 2.3% word error rate on LibriSpeech, which closely matches the supervised baseline's performance.
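The core idea of the first stage can be illustrated with a small sketch: per-frame logits over the word vocabulary are pooled across time with LogSumExp into a single utterance-level word distribution, which is then matched to the bag-of-words target with cross-entropy. The function name, tensor shapes, and normalization choices below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumed shapes/names): pool per-frame word logits with
# LogSumExp and match the unordered transcript's word distribution.
import torch
import torch.nn.functional as F

def word_level_loss(frame_logits: torch.Tensor, word_counts: torch.Tensor) -> torch.Tensor:
    """frame_logits: (T, V) acoustic-model outputs over the word vocabulary.
    word_counts:  (V,) counts of each word in the (unordered) transcript."""
    # Aggregate over frames: one score per word for the whole utterance.
    utterance_logits = torch.logsumexp(frame_logits, dim=0)   # (V,)
    log_probs = F.log_softmax(utterance_logits, dim=-1)       # (V,)
    # Normalize counts into the ground-truth word distribution.
    target = word_counts / word_counts.sum()                   # (V,)
    # Cross-entropy between predicted and target word distributions.
    return -(target * log_probs).sum()

# Toy usage: 100 frames, 5000-word vocabulary, transcript with words 7 (twice) and 42.
logits = torch.randn(100, 5000)
counts = torch.zeros(5000)
counts[7], counts[42] = 2.0, 1.0
loss = word_level_loss(logits, counts)
```

Because only the aggregated distribution is supervised, word order never enters the loss; the second stage then recovers a conventional letter-based CTC model from the pseudo-labels this model produces.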
