
Cycle-consistency training for end-to-end speech recognition

This paper presents a method to train end-to-end automatic speech recognition (ASR) models using unpaired data. Although the end-to-end approach can eliminate the need for expert knowledge such as pronunciation dictionaries to build ASR systems, it still requires a large amount of paired data, i.e., speech utterances and their transcriptions. Cycle-consistency losses have been recently proposed as a way to mitigate the problem of limited paired data. These approaches compose a reverse operation with a given transformation, e.g., text-to-speech (TTS) with ASR, to build a loss that only requires unsupervised data, speech in this example. Applying cycle consistency to ASR models is not trivial since fundamental information, such as speaker traits, is lost in the intermediate text bottleneck. To solve this problem, this work presents a loss that is based on the speech encoder state sequence instead of the raw speech signal. This is achieved by training a Text-To-Encoder model and defining a loss based on the encoder reconstruction error. Experimental results on the LibriSpeech corpus show that the proposed cycle-consistency training reduced the word error rate by 14.7% from an initial model trained with 100-hour paired data, using an additional 360 hours of audio data without transcriptions. We also investigate the use of text-only data mainly for language modeling to further improve the performance in the unpaired data training scenario.





1 Introduction

In recent years, automatic speech recognition (ASR) technology has been widely used as an effective user interface for various devices such as car navigation systems, smartphones, and smart speakers. The recognition accuracy has dramatically improved with the help of deep learning techniques [1], and reliability of speech interfaces has been greatly enhanced. However, building ASR systems is very costly and time consuming. Current systems typically have a module-based architecture including an acoustic model, a pronunciation dictionary, and a language model, which rely on phonetically-designed phone units and word-level pronunciations using linguistic assumptions. To build a language model, text preprocessing such as tokenization for some languages that do not explicitly have word boundaries is also required. Consequently, it is not easy for non-experts to develop ASR systems for new applications, especially for new languages.

End-to-end ASR has the goal of simplifying the module-based architecture into a single-network architecture within a deep learning framework, in order to address these issues [2, 3, 4, 5, 6]. End-to-end ASR methods typically rely only on paired acoustic and language data, without the need for extra linguistic knowledge, and train the model with a single algorithm. Therefore, this approach makes it feasible to build ASR systems without expert knowledge. However, even with the end-to-end ASR framework, a large amount of training data is still crucial to assure high recognition accuracy, where the data need to be paired acoustic and language data of many utterances spoken by various speakers [7]. Nowadays, it is easy to collect audio and text data independently from the world wide web, but difficult to find paired data in different languages. Transcribing existing audio data or recording texts spoken by various speakers is also very expensive.

There are several approaches that tackle the problem of limited paired data in the literature  [8, 9, 10, 11, 12]. In particular, cycle consistency has recently been introduced in machine translation (MT) [13] and image transformation [14], and enables one to optimize deep networks using unpaired data. The idea is based on the assumption that if one model that converts input data to output data and another model that reconstructs the input data from the output data are both trained successfully, the input data and its reconstruction should be similar. For example, suppose an English-to-French MT system translates an English sentence to a French sentence, and then a French-to-English MT system back-translates the French sentence to an English sentence. In this case, we can train the English-to-French system so that the difference between the English sentence and its back-translation becomes smaller, for which we only need English sentences. The French-to-English MT system can also be trained in the same manner using only French sentences.

Applying the concept of cycle consistency to ASR is quite challenging. As is the case in MT, the output of ASR is a discrete distribution over the set of all possible sentences. It is therefore not possible to build an end-to-end differentiable loss that back-propagates error through the most probable sentence in this step. Since the set of possible sentences is exponentially large in the size of the sentence, it is not possible to exactly marginalize over all possible sentences either.

Furthermore, unlike in MT and image transformation, in ASR, the input and output domains are very different and do not contain the same information. The output text does not include speaker and prosody information, which is eliminated through feature extraction and decoding. Hence, the speech reconstructed by the TTS system does not have the original speaker and prosody information and can result in a strong mismatch.

Previous approaches related to cycle consistency in end-to-end ASR [9, 12] circumvent these problems by avoiding back-propagating the error beyond the discrete steps and adding a speaker network to transfer the information not present in the text. This requires, however, alternating between speech-to-speech and text-to-text losses to be able to update both ASR and TTS. Therefore this method is not strictly cycle-consistency training, as used in MT and image transformation, since the gradients are not cycled both through ASR and TTS simultaneously. In addition, it also requires training an additional speaker network.

In this work, we propose an alternative approach that uses an end-to-end differentiable loss in the cycle-consistency manner, which no longer requires alternating between losses or training an additional speaker network. The idea rests on the following two principles.

  1. Encoder-state-level cycle consistency:
    We use ASR encoder state sequences for computing the cycle consistency instead of waveform or spectral features. This uses a standard Tacotron2 end-to-end TTS model [15] modified to reconstruct the encoder state sequence instead of speech. We call this a text-to-encoder (TTE) model [8], which we introduced in our prior work on data augmentation. This approach reduces the mismatch between the original and the reconstruction by avoiding the problem of missing para-linguistic information.

  2. Expected end-to-end loss:
    We use an expected loss approximated with a sampling-based method. In other words, we sample multiple sentences from the ASR model, generate an encoder state sequence for each, and compute the consistency loss for each sentence by comparing each encoder state sequence with the original. Then, the mean loss can be used to backpropagate the error to the ASR model via the REINFORCE algorithm [16]. This allows us to update the ASR system when the TTE is used to compute the loss, unlike [9].

The proposed approach therefore allows training with unpaired data, even if only speech is available. Furthermore, since error is backpropagated into the ASR system as well, additional unsupervised losses can be used, such as language models. We demonstrate the efficacy of the proposed method in a semi-supervised training condition on the LibriSpeech corpus.

2 Cycle-consistency training for ASR

2.1 Basic concept

The proposed method consists of an ASR encoder-decoder, a TTE encoder-decoder, and consistency loss computation as shown in Fig. 1. In this framework, we need only audio data for backpropagation. First, an input audio feature sequence is fed to the ASR encoder, and an encoder state sequence is obtained. The ASR attention decoder converts the state sequence into a character sequence. In a second stage, the TTE encoder receives the character sequence and encodes it. The TTE attention decoder generates an ASR encoder state sequence. Finally, the cycle-consistency loss is computed by comparing the original state sequence and the generated one, and then the backpropagation is performed into the ASR encoder-decoder.

Figure 1: Cycle-consistency training for ASR.

2.2 Attention-based ASR model

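The ASR model we used is an attention-based encoder-decoder. This model directly estimates the posterior $p(C|X)$, where $X = \{x_t \in \mathbb{R}^{D} \mid t = 1, \dots, T\}$ represents a sequence of input $D$-dimensional feature vectors, and $C = \{c_l \in \mathcal{U} \mid l = 1, \dots, L\}$ represents a sequence of output characters in the label set $\mathcal{U}$. The posterior $p(C|X)$ is factorized with a probabilistic chain rule as follows:

$$p(C|X) = \prod_{l=1}^{L} p(c_l \mid c_{1:l-1}, X), \tag{1}$$

where $c_{1:l-1}$ represents the subsequence $\{c_1, \dots, c_{l-1}\}$, and $p(c_l \mid c_{1:l-1}, X)$ is calculated as follows:

$$h_t = \mathrm{Encoder}(X), \tag{2}$$
$$a_{lt} = \mathrm{Attention}(q_{l-1}, h_t, a_{l-1}), \tag{3}$$
$$r_l = \sum_{t} a_{lt} h_t, \tag{4}$$
$$q_l = \mathrm{Decoder}(r_l, q_{l-1}, c_{l-1}), \tag{5}$$
$$p(c_l \mid c_{1:l-1}, X) = \mathrm{Softmax}(\mathrm{LinB}(q_l)), \tag{6}$$

where $a_{lt}$ represents an attention weight, $a_l$ the corresponding attention weight vector (sequence of attention weights $a_{lt}$), $h_t$ and $q_l$ the hidden states of the encoder and decoder networks, respectively, $r_l$ a character-wise hidden vector, which is a weighted summarization of the hidden vectors $h_t$ using the attention weight vector $a_l$, and $\mathrm{LinB}(\cdot)$ represents a linear layer with a trainable matrix and bias parameters.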
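As a rough illustration of the weighted summarization described above, the following Python sketch computes a character-wise hidden vector as an attention-weighted sum of encoder hidden states. The function names, the toy states, and the uniform scores are assumptions made for illustration; the model in this paper obtains its weights from a location-aware attention network, which is not reproduced here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(weights, encoder_states):
    """Weighted sum of encoder states, one weight per encoder frame."""
    dim = len(encoder_states[0])
    return [
        sum(w * h[d] for w, h in zip(weights, encoder_states))
        for d in range(dim)
    ]

# Toy example: three encoder frames with two-dimensional hidden states.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical attention scores (uniform); a trained attention network
# would produce frame-dependent scores instead.
weights = softmax([0.0, 0.0, 0.0])
r = context_vector(weights, encoder_states)
```

With uniform weights the result is simply the average of the encoder states; sharper weights would make the vector follow the frames that the decoder attends to.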

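All of the above networks are optimized using back-propagation to minimize the following objective function:

$$\mathcal{L}_{\mathrm{ASR}} = -\log p(C|X) = -\sum_{l=1}^{L} \log p(c_l \mid c^{*}_{1:l-1}, X), \tag{7}$$

where $c^{*}_{1:l-1}$ represents the ground truth for the previous characters, i.e., the networks are trained in a teacher-forcing manner.

In the inference stage, the character sequence $\hat{C}$ is predicted as

$$\hat{C} = \arg\max_{C \in \mathcal{U}^{+}} \log p(C|X). \tag{8}$$
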
2.3 Tacotron2-based TTE model

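For the TTE model, we use the Tacotron2 architecture, which has demonstrated superior performance in the field of text-to-speech synthesis [15]. In our framework, the network predicts the ASR encoder state and the end-of-sequence probability at each frame from a sequence of input characters as follows:

$$g_l = \mathrm{Encoder}(C), \tag{9}$$
$$b_{tl} = \mathrm{Attention}(v_{t-1}, g_l, b_{t-1}), \tag{10}$$
$$s_t = \sum_{l} b_{tl} g_l, \tag{11}$$
$$u_{t-1} = \mathrm{Prenet}(\hat{h}^{b}_{t-1}), \tag{12}$$
$$v_t = \mathrm{Decoder}(s_t, v_{t-1}, u_{t-1}), \tag{13}$$
$$\hat{h}^{b}_t = \tanh(\mathrm{LinB}(v_t)), \tag{14}$$
$$d_t = \mathrm{Postnet}(\hat{h}^{b}_{1:T}), \tag{15}$$
$$\hat{h}^{a}_t = \tanh(\hat{h}^{b}_t + d_t), \tag{16}$$
$$\hat{e}_t = \mathrm{Sigmoid}(\mathrm{LinB}(v_t)), \tag{17}$$

where $\mathrm{Prenet}(\cdot)$ is a shallow feed-forward network to convert the network outputs before feedback to the decoder, $\mathrm{Postnet}(\cdot)$ is a convolutional neural network to refine the network outputs, and $\hat{h}^{b}_t$ and $\hat{h}^{a}_t$ represent predicted hidden states of the ASR encoder before and after refinement by Postnet. Note that the indices $t$ and $l$ of the encoder and decoder states are reversed in comparison to the ASR formulation in Eqs. (2)-(6), and that we use an additional activation function $\tanh(\cdot)$ in Eqs. (14) and (16) to avoid mismatching of the ranges of the outputs, in contrast to the original Tacotron2 architecture [15].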

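All of the networks are jointly optimized to minimize the following objective function:

$$\mathcal{L}_{\mathrm{TTE}} = \mathrm{MSE}(\hat{h}^{a}_t, h_t) + \mathrm{MSE}(\hat{h}^{b}_t, h_t) + \mathrm{L1}(\hat{h}^{a}_t, h_t) + \mathrm{L1}(\hat{h}^{b}_t, h_t) - \frac{1}{T} \sum_{t=1}^{T} \left( e_t \ln \hat{e}_t + (1 - e_t) \ln (1 - \hat{e}_t) \right), \tag{18}$$

where $\mathrm{MSE}(\cdot)$ represents mean square error, $\mathrm{L1}(\cdot)$ represents an L1 norm, and the last two terms represent the binary cross entropy for the end-of-sequence probability. Here $h_t$ is the target ASR encoder state, $\hat{h}^{b}_t$ and $\hat{h}^{a}_t$ are its predictions before and after Postnet, and $e_t$ and $\hat{e}_t$ are the target and predicted end-of-sequence probabilities.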
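The structure of this objective can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, states are plain nested lists, and averaging each term over all frames and dimensions is an assumption about the normalization.

```python
import math

def mse(pred, target):
    """Mean squared error over all frames and dimensions."""
    n = sum(len(frame) for frame in target)
    return sum((p - t) ** 2
               for pf, tf in zip(pred, target)
               for p, t in zip(pf, tf)) / n

def l1(pred, target):
    """Mean absolute error over all frames and dimensions."""
    n = sum(len(frame) for frame in target)
    return sum(abs(p - t)
               for pf, tf in zip(pred, target)
               for p, t in zip(pf, tf)) / n

def bce(probs, labels, eps=1e-12):
    """Mean binary cross entropy for end-of-sequence probabilities."""
    return -sum(e * math.log(max(p, eps)) + (1 - e) * math.log(max(1 - p, eps))
                for p, e in zip(probs, labels)) / len(labels)

def tte_loss(h_before, h_after, h_target, eos_probs, eos_labels):
    """Sum of MSE and L1 terms (before/after refinement) plus the BCE term."""
    return (mse(h_after, h_target) + mse(h_before, h_target)
            + l1(h_after, h_target) + l1(h_before, h_target)
            + bce(eos_probs, eos_labels))

# Toy example: two frames of two-dimensional target encoder states.
h_target = [[0.5, -0.5], [0.0, 1.0]]
h_before = [[0.5, -0.5], [0.0, 0.0]]   # one element off by 1.0
h_after = [[0.5, -0.5], [0.0, 1.0]]    # perfect after refinement
loss = tte_loss(h_before, h_after, h_target, [0.5, 0.5], [0, 1])
```

In this toy case only the pre-refinement prediction contributes to the regression terms (0.25 each for MSE and L1), and the uninformative end-of-sequence probabilities add ln 2.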

2.4 Cycle-consistency training

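In this work, we use the TTE loss in Eq. (18) to measure the cycle consistency, where we compute the TTE loss using the ASR encoder state sequence and the TTE-generated sequence from the ASR character sequence. However, the argmax function in Eq. (8) to output the character sequence is not differentiable, and the consistency loss cannot be propagated through TTE to ASR directly. To solve this problem, we introduce the expected loss. The expected loss can be computed with a sampling technique as

$$\mathcal{L}_{\mathrm{cyc}} = \mathbb{E}_{C|X} \left[ \mathcal{L}_{\mathrm{TTE}}(\hat{H}(C), H(X)) \right] \approx \frac{1}{N} \sum_{n=1}^{N} \mathcal{L}_{\mathrm{TTE}}(\hat{H}(C^{n}), H(X)), \quad C^{n} \sim p(C|X), \tag{19}$$

where $\hat{H}(C)$ denotes the state sequence predicted by the TTE model for a given character sequence $C$, and $H(X)$ denotes the original state sequence given by the ASR encoder for the input feature sequence $X$. $N$ represents the number of samples drawn from the ASR model.

For back-propagation, we utilize the policy gradient technique in the REINFORCE algorithm [16], where the gradient can be computed as

$$\nabla \mathcal{L}_{\mathrm{cyc}} \approx \frac{1}{N} \sum_{n=1}^{N} \left( \mathcal{L}_{\mathrm{TTE}}(\hat{H}(C^{n}), H(X)) - B(X) \right) \nabla \log p(C^{n}|X) \tag{20}$$

over the samples $C^{n}$, where $B(X)$ is the baseline value [16]. We used the mean value of $\mathcal{L}_{\mathrm{TTE}}(\hat{H}(C^{n}), H(X))$ over the $N$ samples for $B(X)$ in this work.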
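To make the sampling-based estimate concrete, the sketch below applies the same policy-gradient idea to a toy problem. It is only an illustration under strong simplifications: the "ASR model" is a softmax over three hypothetical candidate transcriptions parameterized directly by logits, and each candidate has a fixed, made-up consistency loss standing in for the TTE loss. The name `reinforce_step` and all numbers are assumptions.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, loss_fn, num_samples, rng):
    """Estimate the expected loss and its gradient w.r.t. the logits."""
    probs = softmax(logits)
    samples = rng.choices(range(len(logits)), weights=probs, k=num_samples)
    losses = [loss_fn(c) for c in samples]
    expected_loss = sum(losses) / num_samples
    baseline = expected_loss  # mean loss over the samples, as in the text
    grad = [0.0] * len(logits)
    for c, loss in zip(samples, losses):
        for k in range(len(logits)):
            # Gradient of log softmax: indicator(k == c) - probs[k].
            dlogp = (1.0 if k == c else 0.0) - probs[k]
            grad[k] += (loss - baseline) * dlogp / num_samples
    return expected_loss, grad

# Hypothetical per-candidate consistency losses (lower is better).
toy_losses = [0.2, 1.0, 3.0]
rng = random.Random(0)
expected_loss, grad = reinforce_step(
    [0.0, 0.0, 0.0], lambda c: toy_losses[c], num_samples=5, rng=rng)
```

A gradient-descent update on the logits with this estimate raises the probability of sampled candidates whose loss is below the baseline and lowers it for those above. Subtracting the mean loss leaves the estimator's direction unchanged in expectation but typically reduces its variance.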

3 Related work

There is some prior work on unpaired data training for end-to-end ASR. Our prior work on back-translation-style data augmentation [8] focused on the use of text-only data. We introduced a TTE model and used the synthesized encoder state sequences to train the ASR decoder without audio information. However, in this paper, we focus on the use of audio-only data and take a different approach.

In [11], the model consists of a text-to-text auto-encoder and a speech-to-text encoder-decoder which share the speech and text encodings and can be trained jointly using paired and unpaired data. Audio-only data is basically used to enhance the speech encodings, but not used to reduce recognition errors unlike our cycle-consistency approach. Furthermore, the text encoder is much simpler than our TTE model. In our work, the TTE model can hopefully generate better speech encodings to compute the consistency loss.

The speech chain model [9] is the most similar architecture to ours. As described in Section 1, the ASR model is trained with synthesized speech and the TTS model is trained with ASR hypotheses for unpaired data. Therefore, the models are not tightly connected with each other, i.e., one model cannot be updated directly with the help of the other model to reduce the recognition or synthesis errors. Our approach can utilize the other model in the loss function to reduce the errors. We also employ a TTE model, which benefits from the reduction of speaker variations in the loss function and of computational complexity.

4 Experiments

4.1 Conditions

We conducted several experiments using the LibriSpeech corpus [17], which consists of two sets of clean speech data (100 hours + 360 hours), and other (noisy) speech data (500 hours) for training. We used 100 hours of the clean speech data to train the initial ASR model and the text-to-encoder (TTE) model, and the audio of 360 hours of clean speech data to retrain the ASR model using the cycle-consistency loss. We used five hours of clean development data as a validation set, and five hours of clean test data as an evaluation set.

We extracted an acoustic feature vector consisting of an 80-dimensional log mel-filter bank and three-dimensional pitch features using the open-source speech recognition toolkit Kaldi [18]. The ASR encoder had an eight-layered bidirectional long short-term memory with 320 cells including projection layers (BLSTMP) [19], and the ASR decoder had a one-layered LSTM with 300 cells. In the second and third layers from the bottom of the ASR encoder, sub-sampling was performed to reduce the utterance length from $T$ down to $T/4$. The ASR attention network used location-aware attention [4]. For decoding, we used a beam search algorithm with a beam size of 20. We manually set the minimum and maximum lengths of the output sequence to 0.2 and 0.8 times the length of the subsampled input sequence, respectively.

The architecture of the TTE model followed the original Tacotron2 settings [15]. The input characters were converted into 512-dimensional character embeddings. The TTE encoder consisted of a three-layered 1D convolutional neural network (CNN) containing 512 filters with size 5, a batch normalization, and rectified linear unit (ReLU) activation function, and a one-layered BLSTM with 512 units (256 units for forward processing, the rest for backward processing). Although the attention mechanism of the TTE model was based on location-aware attention [4], we additionally accumulated the attention weight feedback to the next step to accelerate attention learning. The TTE decoder consisted of a two-layered LSTM with 1024 units. Prenet was a two-layered feed-forward network with 256 units and ReLU activation. Postnet was a five-layered CNN containing 512 filters with size 5, a batch normalization, and tanh activation function except in the final layer. Dropout [20] with a probability of 0.5 was applied to all of the convolution and Prenet layers. Zoneout [21] with a probability of 0.1 was applied to the decoder LSTM. During generation, we applied dropout to Prenet in the same manner as in [15], and set the threshold value of the end-of-sequence probability at 0.75 to avoid cutting off the end of the input sequence.

In cycle-consistency training, five sequences of characters were drawn from the ASR model for each utterance, where each character was drawn repeatedly from the Softmax distribution of ASR until it encountered the end-of-sequence label ‘<eos>’. During training, we also used the 100-hour paired data to regularize the model parameters in a teacher-forcing manner, i.e., the parameters were updated alternately by cross-entropy loss with paired data and the cycle-consistency loss with unpaired data.
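The sampling procedure described above can be sketched as follows. This is a minimal illustration, not ESPnet code: `toy_model` is a hypothetical stand-in for the ASR decoder softmax that ignores the audio, and the `max_len` safeguard is an added assumption to guarantee termination.

```python
import random

EOS = "<eos>"

def sample_sequence(next_char_probs, rng, max_len=50):
    """Draw characters one at a time until <eos> is sampled."""
    chars = []
    while len(chars) < max_len:
        dist = next_char_probs(chars)          # dict: symbol -> probability
        symbols = list(dist)
        c = rng.choices(symbols, weights=[dist[s] for s in symbols], k=1)[0]
        if c == EOS:
            break
        chars.append(c)
    return "".join(chars)

def toy_model(prefix):
    """Hypothetical next-character distribution given the sampled prefix."""
    if len(prefix) >= 3:
        return {EOS: 1.0}                      # force termination after 3 chars
    return {"a": 0.5, "b": 0.3, EOS: 0.2}

rng = random.Random(0)
# Five sampled hypotheses per utterance, matching the setup described above.
samples = [sample_sequence(toy_model, rng) for _ in range(5)]
```

Unlike beam search, each hypothesis here is an independent draw from the model distribution, which is what the sampling-based expected loss requires.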

All the models were trained using the end-to-end speech processing toolkit ESPnet [22] with a single GPU (Titan Xp). Character error rate (CER) and word error rate (WER) were used as evaluation metrics.

Figure 2: Learning curve.

Table 1: ASR performance using different training methods (CER / WER [%]).

Method                    Validation      Evaluation
Baseline                  11.2 / 24.9     11.1 / 25.2
Cycle-consistency loss    19.5 / 21.5     19.4 / 21.5
CE loss (1 best)          47.8 / 86.8     48.8 / 89.3
CE loss (5 samples)       13.3 / 28.2     12.3 / 27.7
Oracle                    14.7 / 11.4     14.6 / 11.8

Table 2: ASR performance with LM shallow fusion (CER / WER [%]).

Method                    Validation      Evaluation
Baseline + LM             11.9 / 22.6     11.9 / 22.9
Cycle consistency + LM    10.2 / 19.6     19.9 / 19.5

4.2 Results

First, we show the changes of the consistency loss for training data and the validation accuracy for development data in Fig. 2, where the accuracy was computed based on the prediction with ground truth history. The consistency loss successfully decreased as the number of epochs increased. Although the validation accuracy did not improve smoothly, it reached a better value than that for the first epoch. We chose the 6th-epoch model for the following ASR experiments.

Table 1 shows the ASR performance using different training methods. Compared with the baseline result given by the initial ASR model, we can confirm that our proposed cycle-consistency training reduced the word error rate from 25.2% (see Footnote 1) to 21.5%, a relative reduction of 14.7%. Thus, the results demonstrate that the proposed method works for ASR training with unpaired data. To verify the effectiveness of our approach, we further examined more straightforward methods, in which we simply used cross-entropy (CE) loss for unpaired data, where the target was chosen as the one-best ASR hypothesis or sampled in the same manner as the cycle-consistency training. To alleviate the impact of the ASR errors, we weighted the CE loss by 0.1 for unpaired data while we did not down-weight the paired data. However, the error rates increased significantly in the 1-best condition. Even in the 5-sample condition, we could not obtain better performance than the baseline. We also conducted additional experiments under an oracle condition, where the 360-hour paired data were used together with the 100-hour data using the standard CE loss. These error rates can be considered the best performance attainable within this framework, i.e., an upper bound on accuracy. We can see that there is still a big gap to the upper bound and further challenges need to be overcome to reach this goal.

Footnote 1: Our baseline WER is worse than that reported in [17] for the 100-hour training setup. This is because we did not use any pronunciation lexicon or word-based language model for end-to-end ASR to evaluate the basic performance. Such end-to-end systems typically underperform conventional DNN/HMM systems with an n-gram language model when using this size of training data.
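The relative reduction quoted above follows directly from the two evaluation-set WERs. A short check, with a hypothetical helper name:

```python
def relative_reduction(before, after):
    """Relative error reduction in percent."""
    return 100.0 * (before - after) / before

# Evaluation-set WER: baseline 25.2%, cycle-consistency training 21.5%.
wer_reduction = relative_reduction(25.2, 21.5)
```

Rounded to one decimal place, this gives the 14.7% figure reported in the text.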

Finally, we combined the ASR model with a character-based language model (LM) in a shallow fusion technique [23]. An LSTM-based LM was trained using text-only data from the 500-hour noisy set excluding audio data, and used for decoding. As shown in Table 2, the use of text-only data yielded further improvement reaching 19.5% WER (an 8% error reduction), which is the best number we have achieved so far for this unpaired data setup.
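Shallow fusion combines the two models only at decoding time by adding a weighted LM log-probability to the ASR log-probability of each candidate. The sketch below shows this scoring for a single decoding step; the probabilities and the LM weight of 0.5 are hypothetical values chosen for illustration, since the weight used in the experiments is not stated here.

```python
import math

def shallow_fusion_scores(asr_probs, lm_probs, lm_weight):
    """Log-linear combination of ASR and LM next-character probabilities."""
    return {
        c: math.log(asr_probs[c]) + lm_weight * math.log(lm_probs[c])
        for c in asr_probs
    }

# Hypothetical next-character distributions from the two models.
asr_probs = {"a": 0.6, "b": 0.4}
lm_probs = {"a": 0.1, "b": 0.9}

fused = shallow_fusion_scores(asr_probs, lm_probs, lm_weight=0.5)
best_with_lm = max(fused, key=fused.get)

asr_only = shallow_fusion_scores(asr_probs, lm_probs, lm_weight=0.0)
best_without_lm = max(asr_only, key=asr_only.get)
```

In this toy step the ASR model alone prefers "a", while the fused score prefers "b" because the LM considers "a" unlikely. In beam search the same combined score would be accumulated over the characters of each hypothesis.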

5 Conclusion

In this paper, we proposed a novel method to train end-to-end automatic speech recognition (ASR) models using unpaired data. The method employs an attention-based ASR model and a Tacotron2-based text-to-encoder (TTE) model to compute a cycle-consistency loss using audio data only. Experimental results on the LibriSpeech corpus demonstrated that the proposed cycle-consistency training reduced the word error rate by 14.7% from an initial model trained with 100-hour paired data, using an additional 360 hours of audio-only data without transcriptions. We also investigated the use of text-only data from 500-hour utterances for language modeling, and obtained a further error reduction of 8%. Accordingly, we achieved 22.7% error reduction in total for this unpaired data setup. Future work includes joint training of the ASR and TTE models using both sides of the cycle-consistency loss, and the use of additional loss functions to further improve training.