Automatic speech recognition (ASR) systems have steadily improved their performance over the years. The introduction of neural networks into the area of speech recognition led to various improvements. Hybrid approaches [1, 2] replaced traditional Gaussian mixture models by learning a function between the input speech features and hidden Markov model states in a discriminative fashion. However, these approaches are composed of several independently optimized modules, i.e., an acoustic model, a pronunciation model, and a language model. As they are not optimized jointly, useful information cannot be shared between them. Furthermore, specific knowledge is necessary for each module to retrieve the optimal result.
Recently, sequence-to-sequence (Seq2Seq) models have gained popularity in the community [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18], since they fuse all aforementioned modules into a single end-to-end model, which directly outputs characters. Works like [10, 6] have already shown that Seq2Seq models can be superior to hybrid systems if enough data is available. Seq2Seq models can be categorized into approaches based on connectionist temporal classification (CTC) [3, 4], on transducers [16, 17, 18], and on attention [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15].
In CTC, a recurrent neural network (RNN) learns alignments between unsegmented input speech features and a transcript. The basic idea is to assume conditional independence of the outputs and marginalize over all possible alignments. For ASR, this assumption is not valid, as consecutive outputs are highly correlated. Transducer models relax the conditional independence assumption and add another RNN to learn the dependencies between all previous input speech features and the output. Attention models also combine two RNNs with an additional attention network. One RNN acts as an encoder to transform the input data into a robust feature space. The attention model creates a glimpse given the last hidden layer of the encoder, the previous time-step attention vector, and the previous time-step decoder output. The decoder RNN then utilizes the glimpse and the previous decoder output to generate characters.
In our work, we propose a novel regularization technique that utilizes an additional decoder to improve attention models. This newly added decoder is optimized on time-reversed labels. Since we primarily focus on improving the training process, we utilize the decoder only during the optimization phase and discard it later during inference. Thus, the network architecture of a basic attention model is not changed during decoding.
A closely related approach also adds a second decoder that is trained on time-reversed target labels and acts as a regularizer during optimization. That work focused mainly on the advantage of using the additional information to improve the beam search in decoding. The authors applied a constant scalar value, which attached a more significant weight to the loss function of the standard L2R decoder. Furthermore, they trained their models on Japanese words, whereby label and time-reversed label sequences were of equal length. Another comparable work has been published in the domain of speech synthesis. There, a second R2L decoder was also utilized, both losses were combined, and another regularizing function for the L2R and R2L decoder outputs was added. Similarly, they trained only on equal sequence lengths. In the English language, however, byte pair encoding (BPE) of the target transcripts seems superior [10, 6]. As encoding a time-reversed transcript produces unequal sequence lengths between L2R and R2L decoders, regularization of these sequences is challenging. To the best of our knowledge, an in-depth study on how to solve this problem and leverage the newly added decoder during the optimization process has not been done for attention models. Our contributions are the following:
2 Proposed Method
2.1 Attention Model
The standard attentional Seq2Seq model contains three major components: the encoder, the attention module, and the decoder. Let $x = (x_1, \ldots, x_T)$ be a given input sequence of speech features and let $y = (y_1, \ldots, y_U)$ be the target output sequence of length $U$. The encoder transforms the input sequence into a latent space:

$h = \mathrm{Encoder}(x),$

where $h = (h_1, \ldots, h_T)$ encodes essential aspects of the input sequence, i.e., characteristics of the speech signal. The resulting hidden encoder states $h$ and the hidden decoder state $s_{u-1}$ are fed into the attention module to predict proper alignments between the input and output sequences:

$\alpha_u = \mathrm{Attend}(s_{u-1}, \alpha_{u-1}, h),$

where $\alpha_u$ are the attention weights, built on the output of a scoring function:

$e_{u,t} = \mathrm{Score}(s_{u-1}, \alpha_{u-1}, h_t).$

Depending on the task, there are several ways to implement scoring functions. We choose content-based and location-aware attention for scoring. Based on the attention weights $\alpha_u$, a context vector $c_u$ is created to summarize all information in the hidden states of the encoder for the current prediction:

$c_u = \sum_{t=1}^{T} \alpha_{u,t} h_t.$

The decoder generates the output distribution using the context vector $c_u$ and the decoder hidden state $s_u$:

$P(y_u \mid y_{<u}, x) = \mathrm{Decoder}(c_u, s_u, y_{u-1}),$

with $y_{u-1}$ being the predicted target label of the previous prediction step. The resulting model is optimized by a cross-entropy loss $\mathcal{L}_{\mathrm{CE}}$.
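To make the flow above concrete, the following is a minimal NumPy sketch of additive content-based scoring and the context-vector computation. The parameter names `W`, `V`, and `w` are illustrative; location-aware attention would additionally incorporate the previous attention weights via a convolution.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(enc_states, dec_state, W, V, w):
    # Additive (content-based) scoring: e_t = w^T tanh(W s + V h_t)
    scores = np.array([w @ np.tanh(W @ dec_state + V @ h) for h in enc_states])
    alpha = softmax(scores)        # attention weights over the T encoder frames
    context = alpha @ enc_states   # context vector: weighted sum of hidden states
    return context, alpha
```

For a decoder state `s` and encoder states of shape `(T, d)`, `attend` returns a context vector of shape `(d,)` and weights that sum to one.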
2.2 Adding a Backward Decoder
For a traditional attention model, the character distribution is generated by a single L2R decoder. This distribution is conditioned only on the past and thus has no information about the future context. For this reason, we extend the model by adding a second R2L decoder, which is trained on time-reversed output labels. The reverse distribution contains information that is beneficial for the L2R decoder, since the latter has no access to future labels. The R2L decoder contains an individual attention network, which includes the same scoring mechanism as the L2R decoder. The decoders learn to create the posteriors $P(y \mid x; \theta_{\rightarrow})$ for the L2R and $P(\tilde{y} \mid x; \theta_{\leftarrow})$ for the R2L case, respectively. Thus, $\theta_{\rightarrow}$ represents the attention and decoder parameters for the target labels, which are typically time encoded (e.g., cat), and $\theta_{\leftarrow}$ are the attention and decoder parameters of the time-reversed target labels (e.g., tac).
In an ideal case, the posteriors of both decoders should satisfy the following condition:

$P(y \mid x; \theta_{\rightarrow}) = P(\tilde{y} \mid x; \theta_{\leftarrow}),$

as both networks receive the same amount of information. However, the decoders depend on different contexts, i.e., the L2R on past context and the R2L on future context, which results in a similar but not equal training criterion.
2.3 Regularization for Equal Sequence Lengths
If we apply characters as target values for training the attention model, we are dealing with equal output sequence lengths, since there is no difference in length between the forward and reverse encoding of a word. Therefore, we extend the loss with a regularization term to retrieve the global loss $\mathcal{L}$:

$\mathcal{L} = \lambda \mathcal{L}_{\rightarrow} + (1 - \lambda) \mathcal{L}_{\leftarrow} + \gamma \mathcal{R},$

where $\lambda$ defines a weighting factor for the decoder losses $\mathcal{L}_{\rightarrow}$ and $\mathcal{L}_{\leftarrow}$, and $\mathcal{R}$ is a regularizer term weighted by $\gamma$. We apply the $L_2$ distance between the decoder outputs as regularization. Thus, $\mathcal{R}$ is defined as:

$\mathcal{R} = \lVert P(y \mid x; \theta_{\rightarrow}) - P(\tilde{y} \mid x; \theta_{\leftarrow}) \rVert_2.$

The regularization term forces the network to minimize the distance between the outputs of the L2R and R2L decoders. Therefore, the L2R network gets access to outputs that are based on future context information and can utilize this knowledge to increase the overall performance. Note that this kind of regularization is only feasible because we are dealing with equal sequence lengths, which makes it simple to compute $\mathcal{R}$.
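A sketch of this combined loss for the equal-length (character) case is shown below. It assumes the R2L outputs have already been flipped back into left-to-right order so that the step-wise distance is well defined; the weights `lam` and `gamma` are illustrative placeholders, not values from our experiments.

```python
import numpy as np

def cross_entropy(p, targets):
    # Mean negative log-likelihood of the target label indices
    return -np.mean(np.log(p[np.arange(len(targets)), targets]))

def dual_decoder_loss(p_l2r, p_r2l_flipped, y, lam=0.7, gamma=0.1):
    # p_l2r, p_r2l_flipped: (U, V) per-step output distributions of the two
    # decoders over a vocabulary of size V; y: (U,) target label indices.
    loss_l2r = cross_entropy(p_l2r, y)
    loss_r2l = cross_entropy(p_r2l_flipped, y)
    reg = np.sum((p_l2r - p_r2l_flipped) ** 2)   # L2 distance between outputs
    return lam * loss_l2r + (1 - lam) * loss_r2l + gamma * reg
```

Setting `gamma=0` recovers the plain weighted dual-decoder loss; a positive `gamma` penalizes disagreement between the two decoders.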
2.4 Regularization for Unequal Sequence Lengths
We can extend the approach above by applying BPE units instead of characters. However, in contrast to characters, we then face the problem of unequal sequence lengths $U_{\rightarrow} \neq U_{\leftarrow}$ for the L2R and R2L decoders. Since the time-reversed characters are encoded differently, the regularization proposed in Equation 9 is not feasible. We resolve this issue by utilizing a differentiable version of the dynamic time warping (DTW) algorithm as a distance measurement between two temporal sequences of arbitrary lengths, the so-called soft-DTW algorithm. By defining a soft version of the min operator with a softening parameter $\tau$:

$\min{}^{\tau}(a_1, \ldots, a_n) = -\tau \log \sum_{i=1}^{n} e^{-a_i / \tau},$

we can rewrite the soft-DTW loss as a regularization term similar to the one above:

$\mathcal{R}_{\mathrm{DTW}} = \min{}^{\tau}_{A \in \mathcal{A}} \langle A, \Delta \rangle.$

Here, $\langle \cdot, \cdot \rangle$ is the inner product of two matrices, $A$ is an alignment matrix of the set $\mathcal{A} \subset \{0, 1\}^{U_{\rightarrow} \times U_{\leftarrow}}$ of binary matrices that contain paths from $(1, 1)$ to $(U_{\rightarrow}, U_{\leftarrow})$ using only $\downarrow$, $\rightarrow$, and $\searrow$ moves through the matrix, and $\Delta$ is a cost matrix defined by a distance function (e.g., the Euclidean distance). Based on the inner product, we retrieve an alignment cost for all possible alignments between the outputs of the two decoders. Since we force the network to also minimize $\mathcal{R}_{\mathrm{DTW}}$, it has to learn a good match between the different sequence lengths of the L2R and R2L decoders.
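The soft-DTW value can be computed with the classic quadratic-time DTW recursion where the hard minimum is replaced by the soft minimum. The sketch below uses a squared Euclidean cost; the parameter name `tau` matches the softening parameter above.

```python
import numpy as np

def soft_min(vals, tau):
    # Differentiable soft minimum: -tau * log(sum_i exp(-v_i / tau)),
    # computed in a numerically stable way.
    vals = np.asarray(vals, dtype=float)
    m = vals.min()
    return m - tau * np.log(np.exp(-(vals - m) / tau).sum())

def soft_dtw(a, b, tau=1.0):
    # a: (n, d), b: (m, d) -- sequences of arbitrary (unequal) lengths.
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.sum((a[i - 1] - b[j - 1]) ** 2)  # squared Euclidean
            D[i, j] = cost + soft_min(
                [D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]], tau)
    return D[n, m]
```

As `tau` approaches zero, `soft_dtw` approaches the hard DTW distance; larger `tau` gives a smoother, everywhere-differentiable objective.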
3.1 Training Details
We conduct our experiments on two datasets. TEDLIUMv2 has approximately 200 h of training data with a 150 k lexicon. In order to verify our approach on a larger scale, we also utilize the larger dataset LibriSpeech, which contains 960 h of training data with a given 200 k lexicon.
We preprocess both datasets by extracting 80-dimensional log Mel features using Kaldi and adding the corresponding pitch features, which results in an 83-dimensional feature vector. Furthermore, we apply characters and BPE units as target labels. The characters are directly extracted from the datasets, whereas the BPE units are created by SentencePiece, a language-independent sub-word tokenizer. For all experiments, we select 100 BPE units, which seems sufficient for our approach. Moreover, we do not utilize any dropout layers, augmentation techniques, or language models, as we focus our evaluation on the additional decoder and how to deploy it in the training stage.
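The unequal-length problem of Section 2.4 is easy to see with a toy greedy merge procedure. The merge table below is invented for illustration and is not the SentencePiece algorithm; it only demonstrates that sub-word encoding a reversed transcript can produce a different number of tokens.

```python
def bpe_encode(text, merges):
    # Toy greedy BPE: apply each merge rule left-to-right over the token list.
    tokens = list(text)
    for a, b in merges:
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                out.append(a + b)   # merge the adjacent pair into one token
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

merges = [("l", "o"), ("h", "e")]
fwd = bpe_encode("hello", merges)        # 3 tokens
bwd = bpe_encode("hello"[::-1], merges)  # 5 tokens: no merge rule applies
```

Since `len(fwd) != len(bwd)`, a step-wise L2 regularizer between the two decoder output sequences is undefined, which motivates the soft-DTW regularizer.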
The encoder starts with two convolutional blocks, where every block is composed of two 2D convolutional layers and a 2D max-pooling layer. The first block contains convolutional layers with 64 filters, whereas the second block contains convolutional layers with 128 filters. All convolutions have a stride length of one, and each max-pooling layer has a stride length of two. On top of the convolutional blocks, we finalize the encoder with four bidirectional long short-term memory projected (BLSTMP) layers. Every BLSTMP layer has 1024 cells and a projection layer of size 1024. The output of the encoder is then utilized by the L2R and R2L attention networks, which create the context vectors for the decoders.
Each decoder is a single LSTM network with 1024 cells. As our resources are limited and end-to-end training is considered challenging, we perform a three-stage training scheme. In the first stage, we train a standard attention network with an L2R decoder. Then, we apply the pretrained encoder, freeze its weights, and train the R2L model. Finally, we combine both networks into one model to receive the final architecture similar to Figure 2. In all stages, we optimize the network with Adadelta. If we do not observe any improvement of the accuracy on the validation set, we decay the optimizer's $\epsilon$ and increment a patience counter by one. We apply early stopping if the patience counter exceeds three. The batch size is set to 30 for all training steps.
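The decay-and-patience schedule can be sketched as follows. The initial `eps`, the decay factor, and resetting the counter on improvement are assumptions for illustration; the exact values are not restated here.

```python
def training_control(val_accuracies, eps=1e-8, decay=0.5, max_patience=3):
    # Walk through per-epoch validation accuracies; decay Adadelta's eps
    # whenever accuracy fails to improve, and stop early once the patience
    # counter exceeds max_patience.
    best, patience = float("-inf"), 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best:
            best, patience = acc, 0        # improvement: reset the counter
        else:
            eps *= decay                   # no improvement: decay eps
            patience += 1
            if patience > max_patience:
                return epoch, eps          # early stop
    return len(val_accuracies) - 1, eps
```

With `max_patience=3`, training stops on the fourth consecutive epoch without improvement.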
Depending on the target labels in the third training stage, i.e., characters or BPE units, we deploy two different techniques to regularize the L2R decoder. For characters, forward and backward sequences have equal lengths. Thus, we add a regularizer identical to Equation 8 and scale it separately for the smaller and the bigger dataset. For BPE units, on the other hand, we utilize the soft-DTW from Equation 11 as a regularizer, since it represents a distance measurement between the unequal sequence lengths that we want to minimize. Here, we use the same softening parameter and regularization scale for both datasets. Besides the added regularizations for characters and BPE units, we regularize the L2R network further by assigning a larger weight to its loss in all experiments. Thereby, we ensure that the overall training is focused on the L2R decoder network. Thus, the R2L decoder and the regularization techniques only support the L2R decoder to further improve its performance. Later, in the decoding phase, we remove the R2L network, since it is only necessary during the training stages. As a result, we are not changing or adding complexity to the final model during decoding.
Dual Decoder Reg WERs (%): 15.68, 15.94, 16.75, 17.42 (TEDLIUMv2) | 7.24, 19.96, 7.02, 20.95, 7.17, 20.01, 7.33, 20.63 (LibriSpeech)
3.2 Benchmark Details
We evaluate our approach on five different setups:
Forward: The model is trained with a standard L2R decoder, which is the baseline for all experiments.
Backward: The model is trained on time-reversed target labels, which results in an R2L decoder.
Backward Fixed: Similar to the Backward experiment; however, we take the pretrained encoder from the L2R model and freeze its weights during training.
Dual Decoder: Both decoders are trained jointly, where the two losses are combined by a constant weighting.
Dual Decoder Reg: Similar to the Dual Decoder setup, but extended with our proposed regularization terms.
For the smaller dataset TEDLIUMv2, we observe a clear difference in WERs between the Forward and the Backward setup. Ideally, the performance of these setups should be equal, as both networks receive the same amount of information. However, we observe an absolute difference of 1% WER for all evaluation sets, except for the BPE test set. One explanation for this variation may be that the Backward setup is more complex. Since the dataset contains only around 220 h of training data, the number of reverse training samples may not be sufficient. On the bigger dataset LibriSpeech, the first two setups obtain nearly the same WER with only a minor difference. This dataset contains nearly five times the data of the smaller dataset, and therefore the network in the Backward setup receives enough reverse training examples. It seems that the amount of data is crucial for the R2L decoder to satisfy Equation 7.
In the Backward Fixed setup, we can verify the strong dependency of the decoder on the high-level feature representation created by the encoder. Although we do not change the information of the target labels by reversing them, the fixed encoder from the Forward setup learned distinct high-level features, which are based on past context. We observe this as a decline of the WERs on both datasets. Even though the BLSTMPs in the encoder network receive the complete feature sequence in the input space, they generate high-level features based on past label context, since they do not have access to future labels. As a result, the R2L model with a fixed encoder from the Forward setup is worse than the R2L model with a trainable encoder.
In the Dual Decoder setup, we follow the idea of applying the R2L model as a regularizer of the L2R network. Interestingly, the R2L decoder is not able to effectively support the L2R decoder. We recognize only a slight improvement of the WER, which is not consistent across both datasets. Therefore, a simple weighting of the losses during training is not sufficient to enhance the L2R decoder. One reason might be that the L2R decoder receives only implicit information from the R2L decoder through the weighted losses, which appears not to be valuable for its optimization.
To induce valuable information, we add our proposed regularization terms in the last setup, Dual Decoder Reg. The overall network is forced to explicitly minimize the added regularization terms, so the L2R decoder can directly utilize information from the R2L decoder to improve its predictions. We achieve the overall best WERs with this setup. For the TEDLIUMv2 dataset, we observe an average relative improvement of 7.2% for the char and 4.4% for the BPE units. For the LibriSpeech dataset, we achieve an average relative improvement of 4.9% for the char and 5.1% for the BPE units.
In our experiments, we do not observe a clear advantage of either utilizing chars or BPE units as target values, since the performance on the evaluation sets is not consistent.
Our work presents a novel way to integrate a second decoder into attention models during the training phase. The proposed regularization terms allow the standard L2R model to utilize future context information from the R2L decoder, which is usually not available during optimization. We solved the issue of regularizing unequal sequence lengths, which arise when applying BPE units as target values, by adding a soft version of the DTW algorithm. Our method outperforms conventional attention models independent of the dataset size. Our regularization technique is simple to integrate into a conventional training scheme, does not change the overall complexity of the standard model, and only adds optimization time.
For future work, we want to investigate if this regularization can also be applied to transformer-based models and how it influences the final performance.
-  H. A. Bourlard and N. Morgan, Connectionist speech recognition: a hybrid approach. Springer Science & Business Media, 2012, vol. 247.
-  A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks,” in Proceedings of the 23rd international conference on Machine learning. ACM, 2006, pp. 369–376.
-  A. Graves and N. Jaitly, “Towards End-to-End Speech Recognition with Recurrent Neural Networks,” in International conference on machine learning, 2014, pp. 1764–1772.
-  I. Sutskever, O. Vinyals, and Q. Le, “Sequence to sequence learning with neural networks,” Advances in NIPS, 2014.
-  Z. Tüske, K. Audhkhasi, and G. Saon, “Advancing sequence-to-sequence based speech recognition,” Proc. Interspeech 2019, pp. 3780–3784, 2019.
-  C. Weng, J. Cui, G. Wang, J. Wang, C. Yu, D. Su, and D. Yu, “Improving Attention Based Sequence-to-Sequence Models for End-to-End English Conversational Speech Recognition,” in Interspeech, 2018, pp. 761–765.
-  A. Graves, “Generating sequences with recurrent neural networks,” arXiv preprint arXiv:1308.0850, 2013.
-  W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 4960–4964.
-  C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina et al., “State-of-the-art speech recognition with sequence-to-sequence models,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4774–4778.
-  J. Chorowski, D. Bahdanau, K. Cho, and Y. Bengio, “End-to-end continuous speech recognition using attention-based recurrent nn: First results,” arXiv preprint arXiv:1412.1602, 2014.
-  D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, “End-to-end attention-based large vocabulary speech recognition,” in 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2016, pp. 4945–4949.
-  L. Lu, X. Zhang, and S. Renals, “On Training the Recurrent Neural Network Encoder-Decoder for Large Vocabulary End-To-End Speech Recognition,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 5060–5064.
-  M.-T. Luong, H. Pham, and C. D. Manning, “Effective Approaches to Attention-Based Neural Machine Translation,” arXiv preprint arXiv:1508.04025, 2015.
-  D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
-  A. Graves, “Sequence Transduction with Recurrent Neural Networks,” arXiv preprint arXiv:1211.3711, 2012.
-  A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in 2013 IEEE international conference on acoustics, speech and signal processing. IEEE, 2013, pp. 6645–6649.
-  H. Sak, M. Shannon, K. Rao, and F. Beaufays, “Recurrent Neural Aligner: An Encoder-Decoder Neural Network Model for Sequence to Sequence Mapping,” in Interspeech, 2017, pp. 1298–1302.
-  M. Mimura, S. Sakai, and T. Kawahara, “Forward-Backward Attention Decoder,” in Interspeech, 2018, pp. 2232–2236.
-  Y. Zheng, X. Wang, L. He, S. Pan, F. K. Soong, Z. Wen, and J. Tao, “Forward-Backward Decoding for Regularizing End-to-End TTS,” arXiv preprint arXiv:1907.09006, 2019.
-  J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in Advances in neural information processing systems, 2015, pp. 577–585.
-  S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
-  M. Cuturi and M. Blondel, “Soft-DTW: a differentiable loss function for time-series,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR.org, 2017, pp. 894–903.
-  A. Rousseau, P. Deléglise, and Y. Esteve, “Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks,” in LREC, 2014, pp. 3935–3939.
-  V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an ASR corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210.
-  D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The Kaldi speech recognition toolkit,” in IEEE 2011 workshop on automatic speech recognition and understanding, no. CONF. IEEE Signal Processing Society, 2011.
-  T. Kudo and J. Richardson, “Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing,” arXiv preprint arXiv:1808.06226, 2018.
-  S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. E. Y. Soplin, J. Heymann, M. Wiesner, N. Chen et al., “Espnet: End-to-end speech processing toolkit,” arXiv preprint arXiv:1804.00015, 2018.
-  T. Hori, S. Watanabe, Y. Zhang, and W. Chan, “Advances in joint ctc-attention based end-to-end speech recognition with a deep cnn encoder and rnn-lm,” arXiv preprint arXiv:1706.02737, 2017.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  H. Sak, A. W. Senior, and F. Beaufays, “Long short-term memory recurrent neural network architectures for large scale acoustic modeling,” 2014.
-  M. D. Zeiler, “Adadelta: an adaptive learning rate method,” arXiv preprint arXiv:1212.5701, 2012.