In recent years, automatic speech recognition (ASR) technology has been widely used as an effective user interface for various devices such as car navigation systems, smartphones, and smart speakers. Recognition accuracy has dramatically improved with the help of deep learning techniques, and the reliability of speech interfaces has been greatly enhanced. However, building ASR systems is very costly and time consuming. Current systems typically have a module-based architecture comprising an acoustic model, a pronunciation dictionary, and a language model, which rely on phonetically designed phone units and word-level pronunciations based on linguistic assumptions. To build a language model, text preprocessing such as tokenization is also required for languages that do not explicitly mark word boundaries. Consequently, it is not easy for non-experts to develop ASR systems for new applications, especially in new languages.
End-to-end ASR aims to simplify this module-based architecture into a single-network architecture within a deep learning framework in order to address these issues [2, 3, 4, 5, 6]. End-to-end ASR methods typically rely only on paired acoustic and language data, without the need for extra linguistic knowledge, and train the model with a single algorithm. This approach therefore makes it feasible to build ASR systems without expert knowledge. However, even with the end-to-end framework, a large amount of training data remains crucial to assure high recognition accuracy, and the data need to be paired acoustic and language data covering many utterances spoken by various speakers. Nowadays, it is easy to collect audio and text data independently from the world wide web, but difficult to find paired data in different languages. Transcribing existing audio data or recording texts read by various speakers is also very expensive.
There are several approaches that tackle the problem of limited paired data in the literature [8, 9, 10, 11, 12]. In particular, cycle consistency has recently been introduced in machine translation (MT) and image transformation, and enables one to optimize deep networks using unpaired data. The idea is based on the assumption that if one model converts input data to output data and another model reconstructs the input data from the output data, and both are trained successfully, then the input data and its reconstruction should be similar. For example, suppose an English-to-French MT system translates an English sentence to a French sentence, and a French-to-English MT system then back-translates the French sentence to an English sentence. In this case, we can train the English-to-French system so that the difference between the English sentence and its back-translation becomes smaller, for which we only need English sentences. The French-to-English MT system can be trained in the same manner using only French sentences.
Applying the concept of cycle consistency to ASR is quite challenging. As in MT, the output of ASR is a discrete distribution over the set of all possible sentences. It is therefore not possible to build an end-to-end differentiable loss that back-propagates error through the most probable sentence at this step. Since the set of possible sentences grows exponentially with sentence length, it is not possible to exactly marginalize over all possible sentences either.
Furthermore, unlike in MT and image transformation, in ASR the input and output domains are very different and do not contain the same information. The output text does not include speaker and prosody information, which is eliminated through feature extraction and decoding. Hence, the speech reconstructed by a TTS system does not have the original speaker and prosody information, which can result in a strong mismatch.
Previous approaches related to cycle consistency in end-to-end ASR [9, 12] circumvent these problems by avoiding back-propagating the error beyond the discrete steps and by adding a speaker network to transfer the information not present in the text. This requires, however, alternating between speech-to-speech and text-to-text losses to be able to update both ASR and TTS. Therefore, this method is not strictly cycle-consistency training as used in MT and image transformation, since the gradients are not cycled through ASR and TTS simultaneously. In addition, it requires training an additional speaker network.
In this work, we propose an alternative approach that uses an end-to-end differentiable loss in a cycle-consistent manner, no longer requiring alternation between losses or the training of an additional speaker network. The idea rests on the following two principles.
Encoder-state-level cycle consistency:
We use ASR encoder state sequences for computing the cycle consistency instead of waveform or spectral features. For this, a standard end-to-end Tacotron2 TTS model is modified to reconstruct the encoder state sequence instead of speech. We call this a text-to-encoder (TTE) model, which we introduced in our prior work on data augmentation. This approach reduces the mismatch between the original and the reconstruction by avoiding the problem of missing para-linguistic information.
Expected end-to-end loss:
We use an expected loss approximated with a sampling-based method. In other words, we sample multiple sentences from the ASR model, generate an encoder state sequence for each, and compute the consistency loss for each sentence by comparing its encoder state sequence with the original. The mean loss can then be used to backpropagate the error to the ASR model via the REINFORCE algorithm. This allows us to update the ASR model when the TTE is used to compute the loss, unlike previous approaches.
The proposed approach therefore allows training with unpaired data, even if only speech is available. Furthermore, since the error is backpropagated into the ASR system as well, additional unsupervised losses such as language models can be used. We demonstrate the efficacy of the proposed method in a semi-supervised training condition on the LibriSpeech corpus.
2 Cycle-consistency training for ASR
2.1 Basic concept
The proposed method consists of an ASR encoder-decoder, a TTE encoder-decoder, and a consistency loss computation, as shown in Fig. 1. In this framework, we need only audio data for backpropagation. First, an input audio feature sequence is fed to the ASR encoder to obtain an encoder state sequence. The ASR attention decoder then converts the state sequence into a character sequence. In a second stage, the TTE encoder receives and encodes the character sequence, and the TTE attention decoder generates an ASR encoder state sequence from it. Finally, the cycle-consistency loss is computed by comparing the original state sequence with the generated one, and the error is backpropagated into the ASR encoder-decoder.
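The data flow above can be summarized as a short sketch. All four components below are toy stand-ins (the actual models are deep networks trained in ESPnet); the function names `asr_encode`, `asr_decode`, and `tte_generate` and the one-dimensional "states" are illustrative assumptions, not the paper's API.

```python
# Sketch of one cycle-consistency evaluation on audio-only data.

def asr_encode(features):
    # ASR encoder: audio feature sequence -> encoder state sequence H (toy: mean per frame)
    return [[sum(f) / len(f)] for f in features]

def asr_decode(states):
    # ASR attention decoder: states -> character sequence (toy greedy rule)
    return "".join("a" if s[0] > 0 else "b" for s in states)

def tte_generate(chars):
    # Text-to-encoder (TTE) model: characters -> reconstructed encoder states
    return [[1.0] if c == "a" else [-1.0] for c in chars]

def cycle_consistency_loss(features):
    H = asr_encode(features)       # 1) encode audio
    C = asr_decode(H)              # 2) recognize characters
    H_hat = tte_generate(C)        # 3) reconstruct encoder states from text
    # 4) compare original and reconstructed state sequences (MSE here)
    n = sum(len(h) for h in H)
    return sum((a - b) ** 2
               for h, g in zip(H, H_hat)
               for a, b in zip(h, g)) / n

loss = cycle_consistency_loss([[1.0, 3.0], [-2.0, 0.0]])
```

Note that no text transcriptions enter this computation; only the audio features and the two models are needed.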
2.2 Attention-based ASR model
The ASR model we used is an attention-based encoder-decoder. This model directly estimates the posterior $p(C|X)$, where $X = \{\mathbf{x}_t \in \mathbb{R}^D \mid t = 1, \dots, T\}$ represents a sequence of $D$-dimensional input feature vectors, and $C = \{c_l \in \mathcal{U} \mid l = 1, \dots, L\}$ represents a sequence of output characters in the label set $\mathcal{U}$. The posterior $p(C|X)$ is factorized with a probabilistic chain rule as follows:

$p(C|X) = \prod_{l=1}^{L} p(c_l \mid c_{1:l-1}, X),$   (1)
where $c_{1:l-1}$ represents the subsequence $\{c_1, \dots, c_{l-1}\}$, and $p(c_l \mid c_{1:l-1}, X)$ is calculated as follows:

$\mathbf{h}_t = \mathrm{Encoder}(X),$   (2)
$a_{lt} = \mathrm{Attention}(\mathbf{q}_{l-1}, \mathbf{a}_{l-1}, \mathbf{h}_t),$   (3)
$\mathbf{r}_l = \sum_{t=1}^{T} a_{lt} \mathbf{h}_t,$   (4)
$\mathbf{q}_l = \mathrm{Decoder}(\mathbf{r}_l, \mathbf{q}_{l-1}, c_{l-1}),$   (5)
$p(c_l \mid c_{1:l-1}, X) = \mathrm{Softmax}(\mathrm{LinearLayer}(\mathbf{q}_l)),$   (6)

where $a_{lt}$ represents an attention weight, $\mathbf{a}_l$ the corresponding attention weight vector (the sequence of attention weights $\{a_{lt}\}_{t=1}^{T}$), $\mathbf{h}_t$ and $\mathbf{q}_l$ the hidden states of the encoder and decoder networks, respectively, and $\mathbf{r}_l$ a character-wise hidden vector, which is a weighted summarization of the hidden vectors $\{\mathbf{h}_t\}$ using the attention weight vector $\mathbf{a}_l$. $\mathrm{LinearLayer}(\cdot)$ represents a linear layer with a trainable matrix and bias parameters.
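One decoder step of Eqs. (3)-(4) can be made concrete with tiny fixed vectors. The dot-product scoring below is a simplification standing in for the full location-aware attention, and all dimensions and values are illustrative assumptions.

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scores
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention_step(enc_states, query):
    # a_lt: attention weight per encoder frame t given the decoder query q_{l-1}
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in enc_states]
    a = softmax(scores)
    # r_l: weighted sum of encoder states (the character-wise hidden vector)
    dim = len(enc_states[0])
    r = [sum(a[t] * enc_states[t][i] for t in range(len(enc_states)))
         for i in range(dim)]
    return a, r

H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three toy encoder states
a, r = attention_step(H, query=[1.0, 0.0])
```

Frames 1 and 3 score equally against this query, so they share most of the attention mass, and $\mathbf{r}_l$ is dominated by their first dimension.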
All of the above networks are optimized using back-propagation to minimize the following objective function:

$\mathcal{L}_{\mathrm{ASR}} = -\log p(C^* \mid X) = -\sum_{l=1}^{L} \log p(c_l^* \mid c_{1:l-1}^*, X),$   (7)

where $c^*_{1:l-1}$ represents the ground truth for the previous characters, i.e., the networks are trained in a teacher-forcing manner.
In the inference stage, the character sequence $\hat{C}$ is predicted as

$\hat{C} = \arg\max_{C \in \mathcal{U}^*} p(C \mid X).$   (8)
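Since the argmax in Eq. (8) cannot be computed exactly, it is approximated in practice with beam search. The sketch below uses a fixed toy bigram table in place of the decoder network's per-step distribution; the table values and beam size are illustrative assumptions.

```python
import math

PROBS = {  # hypothetical p(next symbol | previous symbol)
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"b": 0.7, "<eos>": 0.3},
    "b": {"a": 0.2, "<eos>": 0.8},
}

def beam_search(beam_size=2, max_len=5):
    beams = [(["<s>"], 0.0)]          # (hypothesis, log-probability)
    finished = []
    for _ in range(max_len):
        cands = []
        for hyp, lp in beams:
            for c, p in PROBS[hyp[-1]].items():
                new = (hyp + [c], lp + math.log(p))
                # completed hypotheses go to 'finished', others compete for the beam
                (finished if c == "<eos>" else cands).append(new)
        beams = sorted(cands, key=lambda x: -x[1])[:beam_size]
        if not beams:
            break
    return max(finished, key=lambda x: x[1])

hyp, logp = beam_search()
```

Here the best finished hypothesis is "a b" with probability 0.6 × 0.7 × 0.8, even though "b" alone looks attractive after the first step, which is exactly why a beam rather than greedy decoding is used.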
2.3 Tacotron2-based TTE model
For the TTE model, we use the Tacotron2 architecture, which has demonstrated superior performance in the field of text-to-speech synthesis. In our framework, the network predicts the ASR encoder state $\hat{\mathbf{h}}_t$ and the end-of-sequence probability $\hat{p}_t$ at each frame $t$ from a sequence of input characters $C$ as follows:

$\mathbf{h}'_l = \mathrm{Encoder}(C),$   (9)
$a'_{tl} = \mathrm{Attention}(\mathbf{q}'_{t-1}, \mathbf{a}'_{t-1}, \mathbf{h}'_l),$   (10)
$\mathbf{r}'_t = \sum_{l=1}^{L} a'_{tl} \mathbf{h}'_l,$   (11)
$\mathbf{v}_t = \mathrm{Prenet}(\hat{\mathbf{h}}^{\mathrm{post}}_{t-1}),$   (12)
$\mathbf{q}'_t = \mathrm{Decoder}(\mathbf{r}'_t, \mathbf{q}'_{t-1}, \mathbf{v}_t),$   (13)
$\hat{\mathbf{h}}_t = \tanh(\mathrm{LinearLayer}(\mathbf{q}'_t)),$   (14)
$\hat{p}_t = \mathrm{Sigmoid}(\mathrm{LinearLayer}(\mathbf{q}'_t)),$   (15)
$\hat{\mathbf{h}}^{\mathrm{post}}_t = \hat{\mathbf{h}}_t + \tanh(\mathrm{Postnet}(\hat{\mathbf{h}}_t)),$   (16)

where $\mathrm{Prenet}(\cdot)$ is a shallow feed-forward network to convert the network outputs before feedback to the decoder, $\mathrm{Postnet}(\cdot)$ is a convolutional neural network to refine the network outputs, and $\hat{\mathbf{h}}_t$ and $\hat{\mathbf{h}}^{\mathrm{post}}_t$ represent predicted hidden states of the ASR encoder before and after refinement by Postnet. Note that the indices $t$ and $l$ of the encoder and decoder states are reversed in comparison to the ASR formulation in Eqs. (2)-(6), and that we use an additional $\tanh$ activation function in Eqs. (14) and (16) to avoid mismatching of the ranges of the outputs, in contrast to the original Tacotron2 architecture.
All of the networks are jointly optimized to minimize the following objective function:

$\mathcal{L}_{\mathrm{TTE}} = \mathrm{MSE}(\hat{H}^{\mathrm{post}}, H) + \mathrm{MSE}(\hat{H}, H) + \mathrm{L1}(\hat{H}^{\mathrm{post}}, H) + \mathrm{L1}(\hat{H}, H) - p \log \hat{p} - (1 - p) \log(1 - \hat{p}),$   (18)

where $\mathrm{MSE}(\cdot)$ represents the mean square error, $\mathrm{L1}(\cdot)$ represents an L1 norm, and the last two terms represent the binary cross entropy for the end-of-sequence probability.
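The TTE objective can be sketched numerically as follows. Sequences are flattened to plain lists of floats for clarity; the tiny tensors, the `eps` clamp, and the per-element averaging are illustrative assumptions about an otherwise standard implementation of these losses.

```python
import math

def mse(a, b):
    # mean square error between predicted and reference states
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def l1(a, b):
    # mean absolute (L1) error
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def bce(p_hat, p):
    # binary cross entropy for the end-of-sequence probabilities
    eps = 1e-7
    return -sum(t * math.log(q + eps) + (1 - t) * math.log(1 - q + eps)
                for q, t in zip(p_hat, p)) / len(p)

def tte_loss(h_pre, h_post, h_ref, s_hat, s_ref):
    # states before (h_pre) and after (h_post) Postnet, both compared to h_ref
    return (mse(h_pre, h_ref) + mse(h_post, h_ref)
            + l1(h_pre, h_ref) + l1(h_post, h_ref)
            + bce(s_hat, s_ref))

loss = tte_loss([0.1, 0.2], [0.0, 0.25], [0.0, 0.2],
                [0.1, 0.9], [0.0, 1.0])
```

Both the pre- and post-Postnet predictions are penalized against the same reference, mirroring the two MSE and two L1 terms in the objective.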
2.4 Cycle-consistency training
In this work, we use the TTE loss in Eq. (18) to measure the cycle consistency, where we compute the TTE loss between the ASR encoder state sequence and the sequence generated by the TTE model from the ASR character sequence. However, the argmax function in Eq. (8) used to output the character sequence is not differentiable, so the consistency loss cannot be propagated through the TTE model to the ASR model directly. To solve this problem, we introduce an expected loss, which can be approximated with a sampling technique as

$\mathcal{L}_{\mathrm{cycle}} = \mathbb{E}_{p(C|X)}\left[\mathcal{L}_{\mathrm{TTE}}(\hat{H}(C), H)\right] \approx \frac{1}{N} \sum_{n=1}^{N} \mathcal{L}_{\mathrm{TTE}}(\hat{H}(C^{(n)}), H), \quad C^{(n)} \sim p(C|X),$   (19)

where $\hat{H}(C^{(n)})$ denotes the state sequence predicted by the TTE model for a given character sequence $C^{(n)}$ sampled from the ASR model, $H$ denotes the original state sequence given by the ASR encoder for the input feature sequence $X$, and $N$ represents the number of samples drawn from the ASR model.
3 Related work
There is some prior work on unpaired data training for end-to-end ASR. Our prior work on back-translation-style data augmentation focused on the use of text-only data. We introduced a TTE model and used the synthesized encoder state sequences to train the ASR decoder without audio information. In this paper, by contrast, we focus on the use of audio-only data and take a different approach.
In another approach, the model consists of a text-to-text auto-encoder and a speech-to-text encoder-decoder that share the speech and text encodings and can be trained jointly using paired and unpaired data. Audio-only data is essentially used to enhance the speech encodings, but, unlike in our cycle-consistency approach, it is not used to directly reduce recognition errors. Furthermore, the text encoder is much simpler than our TTE model, which we expect to generate better speech encodings for computing the consistency loss.
In the speech chain framework, the ASR model is trained with synthesized speech and the TTS model is trained with ASR hypotheses for unpaired data. The models are therefore not tightly connected with each other, i.e., one model cannot be updated directly with the help of the other to reduce recognition or synthesis errors. Our approach can utilize the other model in the loss function to reduce the errors. We also employ a TTE model, which benefits from the reduction of speaker variations in the loss function and from lower computational complexity.
4 Experiments

We conducted several experiments using the LibriSpeech corpus, which consists of two sets of clean speech data (100 hours + 360 hours) and other (noisy) speech data (500 hours) for training. We used 100 hours of the clean speech data to train the initial ASR model and the text-to-encoder (TTE) model, and the audio of the 360 hours of clean speech data to retrain the ASR model using the cycle-consistency loss. We used five hours of clean development data as a validation set, and five hours of clean test data as an evaluation set.
We extracted acoustic feature vectors consisting of an 80-dimensional log mel-filter bank and three-dimensional pitch features using the open-source speech recognition toolkit Kaldi. The ASR encoder was an eight-layer bidirectional long short-term memory network with 320 cells including projection layers (BLSTMP), and the ASR decoder was a one-layer LSTM with 300 cells. In the second and third layers from the bottom of the ASR encoder, sub-sampling was performed to reduce the utterance length from $T$ down to $T/4$. The ASR attention network used location-aware attention. For decoding, we used a beam search algorithm with a beam size of 20. We manually set the minimum and maximum lengths of the output sequence to 0.2 and 0.8 times the length of the subsampled input sequence, respectively.
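The effect of sub-sampling in two encoder layers can be illustrated directly; dropping every other frame twice shortens a length-T sequence to roughly T/4. The factor-2 stride per layer is an assumption consistent with the overall 4x reduction.

```python
# Frame sub-sampling as applied inside the BLSTMP encoder stack.

def subsample(frames, factor=2):
    # keep every `factor`-th frame
    return frames[::factor]

frames = list(range(16))   # a toy 16-frame utterance
once = subsample(frames)   # after the first sub-sampling layer: 8 frames
twice = subsample(once)    # after the second layer: 4 frames (T -> T/4)
```

This is why the decoding length bounds above are expressed relative to the subsampled input length rather than the raw frame count.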
The architecture of the TTE model followed the original Tacotron2 settings. The input characters were converted into 512-dimensional character embeddings. The TTE encoder consisted of a three-layer 1D convolutional neural network (CNN) containing 512 filters of size 5 with batch normalization and rectified linear unit (ReLU) activation, followed by a one-layer BLSTM with 512 units (256 units for forward processing, the rest for backward processing). Although the attention mechanism of the TTE model was based on location-aware attention, we additionally accumulated the attention weights fed back to the next step to accelerate attention learning. The TTE decoder consisted of a two-layer LSTM with 1024 units. Prenet was a two-layer feed-forward network with 256 units and ReLU activation. Postnet was a five-layer CNN containing 512 filters of size 5 with batch normalization and tanh activation, except in the final layer. Dropout with a probability of 0.5 was applied to all convolution and Prenet layers. Zoneout with a probability of 0.1 was applied to the decoder LSTM. During generation, we applied dropout to Prenet in the same manner as in the original Tacotron2 recipe, and set the threshold of the end-of-sequence probability to 0.75 to prevent cutting off the end of the input sequence.
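The cumulative attention feedback mentioned above replaces the previous step's attention weights with their running sum across decoder steps, which tends to encourage monotonic progress through the input. A minimal sketch, with hand-picked toy weights:

```python
# Running sum of attention weights fed back to the attention mechanism.

def cumulate(prev_cum, new_weights):
    # element-wise accumulation over decoder steps
    return [c + w for c, w in zip(prev_cum, new_weights)]

cum = [0.0, 0.0, 0.0]   # one slot per encoder position
for step_weights in ([0.9, 0.1, 0.0], [0.1, 0.8, 0.1]):
    cum = cumulate(cum, step_weights)
```

Positions already covered accumulate large values, giving the attention network a simple signal of which inputs have been attended to.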
In cycle-consistency training, five character sequences were drawn from the ASR model for each utterance, where each character was drawn repeatedly from the softmax distribution of the ASR model until the end-of-sequence label ‘<eos>’ was encountered. During training, we also used the 100-hour paired data to regularize the model parameters in a teacher-forcing manner, i.e., the parameters were updated alternately by the cross-entropy loss with paired data and the cycle-consistency loss with unpaired data.
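The per-character sampling procedure can be sketched as follows. The fixed three-symbol vocabulary and probabilities stand in for the ASR softmax output, which in reality changes at every step; the length cap is a safety net that the real decoder also effectively needs.

```python
import random

random.seed(1)

VOCAB = ["a", "b", "<eos>"]
PROBS = [0.5, 0.3, 0.2]   # hypothetical softmax output, reused at every step

def sample_until_eos(max_len=20):
    out = []
    for _ in range(max_len):
        c = random.choices(VOCAB, weights=PROBS)[0]  # draw one symbol
        if c == "<eos>":
            break            # stop at the end-of-sequence label
        out.append(c)
    return out

seq = sample_until_eos()
```

Repeating this five times per utterance yields the N = 5 hypotheses used in the expected-loss estimate.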
All models were trained using the end-to-end speech processing toolkit ESPnet with a single GPU (Titan Xp). Character error rate (CER) and word error rate (WER) were used as evaluation metrics.
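Both metrics are edit distances normalized by the reference length, computed over characters for CER and over whitespace-separated words for WER. A standard dynamic-programming implementation:

```python
# Levenshtein distance and the derived CER / WER metrics.

def edit_distance(ref, hyp):
    # d[i][j]: edit distance between ref[:i] and hyp[:j]
    d = [[i + j if i * j == 0 else 0 for j in range(len(hyp) + 1)]
         for i in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                         # deletion
                          d[i][j - 1] + 1,                         # insertion
                          d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))  # substitution
    return d[-1][-1]

def wer(ref, hyp):
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref, hyp):
    return edit_distance(list(ref), list(hyp)) / len(ref)
```

Note that both rates can exceed 100% when the hypothesis contains many insertions, which is visible in the 1-best CE condition reported below.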
Table 1: ASR performance with different training methods (CER / WER [%]).

| Method | Validation set | Evaluation set |
|---|---|---|
| Baseline | 11.2 / 24.9 | 11.1 / 25.2 |
| Cycle-consistency loss | 9.5 / 21.5 | 9.4 / 21.5 |
| CE loss (1 best) | 47.8 / 86.8 | 48.8 / 89.3 |
| CE loss (5 samples) | 13.3 / 28.2 | 12.3 / 27.7 |
| Oracle | 4.7 / 11.4 | 4.6 / 11.8 |
Table 2: ASR performance with language model fusion (CER / WER [%]).

| Method | Validation set | Evaluation set |
|---|---|---|
| Baseline + LM | 11.9 / 22.6 | 11.9 / 22.9 |
| Cycle consistency + LM | 10.2 / 19.6 | 9.9 / 19.5 |
First, we show the evolution of the consistency loss on the training data and the validation accuracy on the development data in Fig. 2, where the accuracy was computed based on prediction with the ground-truth history. The consistency loss decreased steadily as the number of epochs increased. Although the validation accuracy did not improve smoothly, it reached a better value than that of the first epoch. We chose the 6th-epoch model for the following ASR experiments.
Table 1 shows the ASR performance using different training methods.
Compared with the baseline result given by the initial ASR model, we can confirm that our proposed cycle-consistency training reduced the word error rate from 25.2%¹ to 21.5%, a relative reduction of 14.7%. The results thus demonstrate that the proposed method works for ASR training with unpaired data. To verify the effectiveness of our approach, we further examined more straightforward methods, in which we simply used a cross-entropy (CE) loss for unpaired data, with the target chosen as the one-best ASR hypothesis or sampled in the same manner as in cycle-consistency training. To alleviate the impact of ASR errors, we weighted the CE loss by 0.1 for unpaired data, while we did not down-weight the paired data. However, the error rates increased significantly in the 1-best condition, and even in the 5-sample condition we could not obtain better performance than the baseline. We also conducted additional experiments under an oracle condition, where the 360-hour paired data were used together with the 100-hour data using the standard CE loss. These error rates can be considered the upper bound of this framework. We can see that there is still a big gap to this upper bound, and further challenges need to be overcome to close it.

¹ Our baseline WER is worse than previously reported for the 100-hour training setup. This is because we did not use any pronunciation lexicon or word-based language model for end-to-end ASR, in order to evaluate the basic performance. Such end-to-end systems typically underperform conventional DNN/HMM systems with an n-gram language model when using this amount of training data.
Finally, we combined the ASR model with a character-based language model (LM) using a shallow fusion technique. An LSTM-based LM was trained using text-only data from the 500-hour noisy set, excluding the audio data, and used for decoding. As shown in Table 2, the use of text-only data yielded a further improvement, reaching 19.5% WER (an 8% error reduction), which is the best number we have achieved so far for this unpaired data setup.
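Shallow fusion simply interpolates the ASR and LM log-probabilities at each decoding step, score(c) = log p_ASR(c|·) + β · log p_LM(c|·). The distributions and the weight β below are illustrative, not the values used in the experiments.

```python
import math

def fused_scores(asr_probs, lm_probs, beta=0.3):
    # combined per-symbol score used during beam search
    return {c: math.log(asr_probs[c]) + beta * math.log(lm_probs[c])
            for c in asr_probs}

asr = {"a": 0.5, "b": 0.4, "c": 0.1}   # toy ASR softmax output
lm = {"a": 0.2, "b": 0.7, "c": 0.1}    # toy LM prediction
scores = fused_scores(asr, lm)
best = max(scores, key=scores.get)
```

In this toy case the LM flips the decision from "a" to "b", illustrating how text-only data can correct ASR hypotheses without any paired audio.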
5 Conclusions

In this paper, we proposed a novel method to train end-to-end automatic speech recognition (ASR) models using unpaired data. The method employs an attention-based ASR model and a Tacotron2-based text-to-encoder (TTE) model to compute a cycle-consistency loss using audio data only. Experimental results on the LibriSpeech corpus demonstrated that the proposed cycle-consistency training reduced the word error rate by 14.7% relative to an initial model trained with 100 hours of paired data, using an additional 360 hours of audio-only data without transcriptions. We also investigated the use of text-only data from the 500-hour set for language modeling, and obtained a further error reduction of 8%. Accordingly, we achieved a 22.7% error reduction in total for this unpaired data setup. Future work includes joint training of the ASR and TTE models using both sides of the cycle-consistency loss, and the use of additional loss functions to further improve training.
References

-  Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
-  Alex Graves and Navdeep Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in International Conference on Machine Learning (ICML), 2014, pp. 1764–1772.
-  Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton, “Speech recognition with deep recurrent neural networks,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 6645–6649.
-  Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio, “Attention-based models for speech recognition,” in Advances in Neural Information Processing Systems (NIPS), 2015, pp. 577–585.
-  Suyoun Kim, Takaaki Hori, and Shinji Watanabe, “Joint CTC-attention based end-to-end speech recognition using multi-task learning,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 4835–4839.
-  Takaaki Hori, Shinji Watanabe, and John R. Hershey, “Joint CTC/attention decoding for end-to-end speech recognition,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL): Human Language Technologies: long papers, 2017.
-  Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, et al., “Deep speech 2: End-to-end speech recognition in English and Mandarin,” arXiv preprint arXiv:1512.02595, 2015.
-  Tomoki Hayashi, Shinji Watanabe, Yu Zhang, Tomoki Toda, Takaaki Hori, Ramon Astudillo, and Kazuya Takeda, “Back-translation-style data augmentation for end-to-end ASR,” arXiv preprint arXiv:1807.10893, 2018.
-  Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura, “Listening while speaking: Speech chain by deep learning,” in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2017, pp. 301–308.
-  Adithya Renduchintala, Shuoyang Ding, Matthew Wiesner, and Shinji Watanabe, “Multi-modal data augmentation for end-to-end ASR,” in Proc. Interspeech 2018, 2018, pp. 2394–2398.
-  Shigeki Karita, Shinji Watanabe, Tomoharu Iwata, Atsunori Ogawa, and Marc Delcroix, “Semi-supervised end-to-end speech recognition,” Proc. Interspeech 2018, pp. 2–6, 2018.
-  Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura, “Machine speech chain with one-shot speaker adaptation,” in Proc. Interspeech 2018, 2018, pp. 887–891.
-  Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma, “Dual learning for machine translation,” in Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds., pp. 820–828. Curran Associates, Inc., 2016.
-  Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” arXiv preprint arXiv:1703.10593, 2017.
-  Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al., “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” arXiv preprint arXiv:1712.05884, 2017.
-  Ronald J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine Learning, vol. 8, no. 3–4, pp. 229–256, 1992.
-  Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, “LibriSpeech: An ASR corpus based on public domain audio books,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210.
-  Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely, “The Kaldi speech recognition toolkit,” in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Dec. 2011.
-  Haşim Sak, Andrew Senior, and Françoise Beaufays, “Long short-term memory recurrent neural network architectures for large scale acoustic modeling,” in Fifteenth Annual Conference of the International Speech Communication Association (INTERSPEECH), 2014.
-  Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
-  David Krueger, Tegan Maharaj, János Kramár, Mohammad Pezeshki, Nicolas Ballas, Nan Rosemary Ke, Anirudh Goyal, Yoshua Bengio, Aaron Courville, and Chris Pal, “Zoneout: Regularizing RNNs by randomly preserving hidden activations,” arXiv preprint arXiv:1606.01305, 2016.
-  Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, et al., “ESPnet: End-to-end speech processing toolkit,” arXiv preprint arXiv:1804.00015, 2018.
-  Takaaki Hori, Shinji Watanabe, Yu Zhang, and William Chan, “Advances in joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM,” in INTERSPEECH, 2017.