1 Introduction
Automatic speech recognition (ASR) systems have improved steadily over the years. The introduction of neural networks into the area of speech recognition led to various improvements. Hybrid approaches [1, 2] replaced traditional Gaussian mixture models by learning a function between the input speech features and hidden Markov model states in a discriminative fashion. However, these approaches are composed of several independently optimized modules, i.e., an acoustic model, a pronunciation model, and a language model. As they are not optimized jointly, useful information cannot be shared between them. Furthermore, specific expert knowledge is necessary for each module to retrieve the optimal result. Recently, sequence-to-sequence (Seq2Seq) models have been gaining popularity in the community [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18], since they fuse all aforementioned modules into a single end-to-end model, which directly outputs characters. Works like [10, 6] have already shown that Seq2Seq models can be superior to hybrid systems if enough data is available [10]. Seq2Seq models can be categorized into approaches based on connectionist temporal classification (CTC) [3, 4], on transducers [16, 17, 18], and on attention [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15].
In CTC, a recurrent neural network (RNN) learns alignments between unsegmented input speech features and a transcript. The basic idea is to assume conditional independence of the outputs and marginalize over all possible alignments [3]. For ASR, this assumption is not valid, as consecutive outputs are highly correlated. Transducer models relax the conditional independence assumption and add another RNN to learn the dependencies between all previous input speech features and the output [17]. Attention models also combine two RNNs with an additional attention network. One RNN acts as an encoder to transform the input data into a robust feature space. The attention network creates a glimpse given the last hidden layer of the encoder, the previous time-step attention vector, and the previous time-step decoder output. The decoder RNN then utilizes the glimpse and the previous decoder output to generate characters [15].

In our work, we propose a novel regularization technique that utilizes an additional decoder to improve attention models. This newly added decoder is optimized on time-reversed labels. Since we primarily focus on improving the training process, we utilize the decoder only during the optimization phase and discard it later in inference. Thus, the network architecture of a basic attention model is not changed during decoding.
A recent study demonstrated that it is beneficial to add a right-to-left (R2L) decoder to a conventional left-to-right (L2R) decoder [19]. The R2L decoder is trained on time-reversed target labels and acts as a regularizer during optimization. Their work focused mainly on the advantage of using the additional information to improve the beam search during decoding. They applied a constant scalar value, which placed a larger weight on the loss function of the standard L2R decoder. Furthermore, they trained their models on Japanese words, whereby label and time-reversed label sequences have equal lengths. Another comparable work has been published in the domain of speech synthesis. In [20], a second R2L decoder is also utilized; both losses are combined and another regularizing function is added for the L2R and R2L decoder outputs. Similar to [19], they trained only on equal sequence lengths. In the English language, however, byte pair encodings (BPE) for encoding the target transcripts seem superior [10, 6]. As encoding a time-reversed transcript produces unequal sequence lengths between the L2R and R2L decoders, regularization of these sequences is challenging. To the best of our knowledge, an in-depth study on how to solve this problem and on leveraging the newly added decoder during the optimization process has not been done for attention models. Our contributions are the following: we integrate a second, time-reversed R2L decoder into a standard attention model as a regularizer that is discarded at inference time, we propose a soft-DTW-based regularization that handles the unequal sequence lengths arising from BPE targets, and we evaluate both techniques on the TEDLIUMv2 and LibriSpeech datasets.

2 Proposed Method
2.1 Attention Model
The standard attentional Seq2Seq model contains three major components: the encoder, the attention module, and the decoder. Let $\mathbf{x} = (x_1, \dots, x_T)$ be a given input sequence of speech features and let $\mathbf{y} = (y_1, \dots, y_L)$ be the target output sequence of length $L$. The encoder transforms the input sequence into a latent space:

$\mathbf{h} = \mathrm{Encoder}(\mathbf{x})$  (1)
where $\mathbf{h} = (h_1, \dots, h_T)$ encodes essential aspects of the input sequence, i.e., characteristics of the speech signal. The resulting hidden encoder states $\mathbf{h}$ and the hidden decoder state $s_{t-1}$ are fed into the attention module to predict proper alignments between the input and output sequences:

$a_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{T} \exp(e_{t,j})}$  (2)

where $a_{t,i}$ are the attention weights and $e_{t,i}$ is the output of a scoring function:

$e_{t,i} = \mathrm{Score}(s_{t-1}, h_i, a_{t-1})$  (3)
Depending on the task, there are several ways to implement scoring functions. We choose the content-based and location-aware attention from [21] for scoring. Based on the attention weights $a_{t,i}$, a context vector $c_t$ is created to summarize all information in the hidden states of the encoder for the current prediction:

$c_t = \sum_{i=1}^{T} a_{t,i} h_i$  (4)
The decoder generates the output distribution using the context vector $c_t$ and the decoder hidden state $s_t$:

$P(y_t \mid \mathbf{x}, y_{<t}) = \mathrm{Decoder}(c_t, s_t)$  (5)

where $\mathrm{Decoder}$ contains a recurrency, usually a long short-term memory (LSTM) [22]:

$s_t = \mathrm{LSTM}(s_{t-1}, y_{t-1}, c_{t-1})$  (6)

with $y_{t-1}$ being the predicted target label of the previous prediction step. The resulting model is optimized by the cross-entropy loss $\mathcal{L}_{\mathrm{CE}}$.
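One attention step as in Equations (2)–(4) can be sketched as follows; this is a minimal NumPy illustration, not the ESPnet implementation, and a simple dot-product score stands in for the content-based and location-aware scoring of [21]:

```python
import numpy as np

def attention_step(h, s_prev):
    """One decoder time step: score, normalize, summarize (Eqs. 2-4).

    h:      encoder states, shape (T, d)
    s_prev: previous decoder hidden state, shape (d,)
    """
    # Eq. (3): scoring function (dot-product stand-in for [21]).
    e = h @ s_prev                      # (T,)
    # Eq. (2): softmax over encoder positions gives attention weights.
    a = np.exp(e - e.max())
    a /= a.sum()
    # Eq. (4): context vector as weighted sum of encoder states.
    c = a @ h                           # (d,)
    return a, c

rng = np.random.default_rng(0)
h = rng.standard_normal((6, 4))   # T=6 encoder states of dimension 4
s = rng.standard_normal(4)
a, c = attention_step(h, s)
print(a.sum())                    # weights sum to 1 (up to float error)
```

The context vector `c` would then be fed together with the decoder state into the output layer of Equation (5).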
2.2 Adding a Backward Decoder
For a traditional attention model, the output character distribution is generated by a single L2R decoder. This distribution depends only on the past and thus has no information about the future context. For this reason, we extend the model by adding a second R2L decoder, which is trained on time-reversed output labels. The reverse distribution contains information that is beneficial for the L2R decoder, since the latter has no access to future labels. The R2L decoder contains an individual attention network with the same scoring mechanism as the L2R decoder. The decoders learn to create the posteriors $P(\mathbf{y} \mid \mathbf{x}; \theta_{\mathrm{L2R}})$ for the L2R and $P(\overleftarrow{\mathbf{y}} \mid \mathbf{x}; \theta_{\mathrm{R2L}})$ for the R2L case, respectively. Thus, $\theta_{\mathrm{L2R}}$ represents the attention and decoder parameters for target labels in their usual temporal order (e.g., cat) and $\theta_{\mathrm{R2L}}$ are the attention and decoder parameters for the time-reversed target labels (e.g., tac).
In an ideal case, the posteriors of both decoders should satisfy the following condition:

$P(\mathbf{y} \mid \mathbf{x}; \theta_{\mathrm{L2R}}) = P(\overleftarrow{\mathbf{y}} \mid \mathbf{x}; \theta_{\mathrm{R2L}})$  (7)

as both networks receive the same amount of information. However, the decoders depend on different contexts, i.e., the L2R decoder on past context and the R2L decoder on future context, which results in a similar but not equal training criterion.
2.3 Regularization for Equal Sequence Lengths
If we apply characters as target values for training the attention model, we are dealing with equal output sequence lengths, since there is no difference between the forward and reverse encoding of a word. Therefore, similar to [20], we extend the loss with a regularization term to retrieve the global loss $\mathcal{L}$:

$\mathcal{L} = \lambda \mathcal{L}_{\mathrm{L2R}} + (1 - \lambda)\, \mathcal{L}_{\mathrm{R2L}} + \gamma \mathcal{L}_{\mathrm{reg}}$  (8)

where $\lambda$ defines a weighting factor for the losses and $\mathcal{L}_{\mathrm{reg}}$ is a regularizer term weighted by $\gamma$. We apply the $L_2$ distance between the decoder outputs $\mathbf{y}^{\mathrm{L2R}}$ and the time-reversed R2L outputs $\overleftarrow{\mathbf{y}}^{\mathrm{R2L}}$ as regularization. Thus, $\mathcal{L}_{\mathrm{reg}}$ is defined as:

$\mathcal{L}_{\mathrm{reg}} = \lVert \mathbf{y}^{\mathrm{L2R}} - \overleftarrow{\mathbf{y}}^{\mathrm{R2L}} \rVert_2$  (9)
The regularization term forces the network to minimize the distance between the outputs of the L2R and R2L decoders. Therefore, the L2R network gets access to outputs that are based on future context information and can utilize this knowledge to increase the overall performance. Note that this kind of regularization is only feasible because we are dealing with equal sequence lengths, which makes it simple to compute the element-wise distance.
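For equal-length outputs, the combined loss of Equation (8) with the distance regularizer of Equation (9) can be sketched as below; the weights `lam` and `gamma` are illustrative placeholders, not the values used in our experiments:

```python
import numpy as np

def global_loss(loss_l2r, loss_r2l, y_l2r, y_r2l, lam=0.9, gamma=0.1):
    """Eq. (8): weighted decoder losses plus a distance regularizer.

    y_l2r: L2R decoder outputs, shape (L, vocab)
    y_r2l: R2L decoder outputs in reversed time order, shape (L, vocab)
    """
    # Eq. (9): L2 distance after flipping the R2L outputs back into
    # forward time order (only valid for equal sequence lengths).
    reg = np.linalg.norm(y_l2r - y_r2l[::-1])
    return lam * loss_l2r + (1 - lam) * loss_r2l + gamma * reg

y_fwd = np.eye(4)            # toy L2R output distributions
y_bwd = y_fwd[::-1].copy()   # a perfectly "mirrored" R2L output
print(global_loss(1.0, 1.2, y_fwd, y_bwd))  # regularizer term is 0 here
```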
2.4 Regularization for Unequal Sequence Lengths
We can extend the approach above by applying BPE units instead of characters. However, in contrast to characters, we face the problem of unequal sequence lengths $L$ for the L2R and $M$ for the R2L decoder with $L \neq M$. Since the time-reversed characters are encoded differently, the proposed regularization in Equation 9 is not feasible. We resolve this issue by utilizing a differentiable version of the dynamic time warping (DTW) algorithm [23] as a distance measurement between two temporal sequences of arbitrary lengths, the so-called soft-DTW algorithm. By defining a soft version of the min operator with a softening parameter $\gamma$:
$\min{}^{\gamma}\{a_1, \dots, a_n\} = -\gamma \log \sum_{i=1}^{n} e^{-a_i/\gamma}$  (10)

we can write the soft-DTW loss as a regularization term similar to the one above:

$\mathcal{L}_{\mathrm{reg}} = \mathrm{dtw}_{\gamma}(\mathbf{y}^{\mathrm{L2R}}, \mathbf{y}^{\mathrm{R2L}}) = \min{}^{\gamma}_{A \in \mathcal{A}_{L,M}} \langle A, \Delta(\mathbf{y}^{\mathrm{L2R}}, \mathbf{y}^{\mathrm{R2L}}) \rangle$  (11)
Here, $\langle \cdot, \cdot \rangle$ is the inner product of two matrices, $A$ is an alignment matrix from the set $\mathcal{A}_{L,M}$ of binary matrices that contain monotonic paths from $(1,1)$ to $(L,M)$ using only →, ↓, and ↘ moves, and $\Delta(\mathbf{y}^{\mathrm{L2R}}, \mathbf{y}^{\mathrm{R2L}})$ is a cost matrix defined by a distance function (e.g., the Euclidean distance). Based on the inner product, we retrieve an alignment cost for all possible alignments between $\mathbf{y}^{\mathrm{L2R}}$ and $\mathbf{y}^{\mathrm{R2L}}$. Since we force the network to also minimize $\mathcal{L}_{\mathrm{reg}}$, it has to learn a good match between the different sequence lengths of the L2R and R2L decoders.
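A minimal NumPy sketch of the soft-DTW recurrence may help; it follows the dynamic program of [23] with the soft-min of Equation (10) and squared Euclidean costs, and is meant as an illustration rather than the differentiable training-time implementation:

```python
import numpy as np

def softmin(values, gamma):
    """Eq. (10): soft version of the min operator."""
    v = np.asarray(values) / -gamma
    m = v.max()
    return -gamma * (m + np.log(np.exp(v - m).sum()))

def soft_dtw(a, b, gamma=1.0):
    """Soft-DTW cost between sequences a (L, d) and b (M, d)."""
    L, M = len(a), len(b)
    # Squared Euclidean cost matrix Delta of Eq. (11).
    delta = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    r = np.full((L + 1, M + 1), np.inf)
    r[0, 0] = 0.0
    for i in range(1, L + 1):
        for j in range(1, M + 1):
            # Soft-min over the three allowed moves (right, down, diagonal).
            r[i, j] = delta[i - 1, j - 1] + softmin(
                [r[i - 1, j], r[i, j - 1], r[i - 1, j - 1]], gamma)
    return r[L, M]

a = np.array([[0.0], [1.0], [2.0]])
b = np.array([[0.0], [0.9], [1.1], [2.0]])  # different length than a
print(soft_dtw(a, b, gamma=0.01))
```

As $\gamma \to 0$ the soft-min approaches the hard min, and the value converges to the classic DTW alignment cost; unlike classic DTW, however, the soft version is differentiable everywhere, which is what makes it usable as a training loss.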
3 Experiments
3.1 Training Details
All our experiments are evaluated on the public TEDLIUMv2 [24] and LibriSpeech [25] datasets. TEDLIUMv2 has approximately 200 h of training data with a 150 k lexicon. In order to verify our approach on a larger scale, we also utilize the larger LibriSpeech dataset, which contains 960 h of training data with a given 200 k lexicon.

We preprocess both datasets by extracting 80-dimensional log-Mel features using Kaldi [26] and adding the corresponding pitch features, which results in an 83-dimensional feature vector. Furthermore, we apply characters and BPE units as target labels. The characters are directly extracted from the datasets, whereas the BPE units are created by SentencePiece [27], a language-independent subword tokenizer. For all experiments, we select 100 BPE units, which seems sufficient [6] for our approach. Moreover, we do not utilize any dropout layers, augmentation techniques, or language models, as we focus our evaluation on the additional decoder and how to deploy it in the training stage.
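The length mismatch that motivates Section 2.4 is easy to see with a toy subword vocabulary (the pieces below are hypothetical, not the actual SentencePiece model): reversing a character sequence preserves its length, but the subword segmentation of the reversed text can differ.

```python
def segment(word, pieces):
    """Greedy longest-match segmentation with a toy subword vocabulary."""
    out, i = [], 0
    while i < len(word):
        for k in range(len(word) - i, 0, -1):   # try longest match first
            if word[i:i + k] in pieces:
                out.append(word[i:i + k])
                i += k
                break
        else:
            raise ValueError("unknown symbol")
    return out

pieces = {"ca", "t", "a", "c"}   # hypothetical subword inventory
fwd = segment("cat", pieces)     # ['ca', 't']      -> 2 units
bwd = segment("tac", pieces)     # ['t', 'a', 'c']  -> 3 units
print(len(fwd), len(bwd))        # chars: 3 == 3, subwords: 2 != 3
```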
The proposed architecture is created in the ESPnet toolkit [28] and is depicted in Figure 2. We strictly follow the encoder structure from [29]. The encoder consists of two VGG blocks [30], where every block is composed of two 2D convolutional layers and a 2D max-pooling layer. The first block contains convolutional layers with 64 filters, whereas the second block contains convolutional layers with 128 filters. All convolutions use the same filter size and a stride of one; each max-pooling layer uses a stride of two. On top, we finalize the encoder with four bidirectional long short-term memory projected (BLSTMP) [31] layers. Every BLSTMP layer has 1024 cells and a projection layer of size 1024. The output of the encoder is then utilized by the L2R and R2L attention networks, which create the context vectors for the decoders.

Each decoder is a single LSTM network with 1024 cells. As our resources are limited and end-to-end training is considered challenging, we perform a three-stage training scheme inspired by [20]. In the first stage, we train a standard attentional network with an L2R decoder. Then, we apply the pretrained encoder, freeze its weights, and train the R2L model. Finally, we combine both networks into one model to receive the final architecture depicted in Figure 2. In all stages, we optimize the network with Adadelta [32]. If we do not observe any improvement of the accuracy on the validation set, we decay the learning rate and increment a patience counter by one. We stop training early if the patience counter exceeds three. The batch size is set to 30 for all training stages.
Depending on the target labels in the third training stage, i.e., characters or BPE units, we deploy two different techniques to regularize the L2R decoder. For characters, forward and backward sequences have equal lengths. Thus, we add a regularizer as in Equation 8 and scale it with a weight chosen separately for the smaller and the bigger dataset. For BPE units, on the other hand, we utilize the soft-DTW from Equation 11 as a regularizer, since it represents a distance measurement between the unequal sequence lengths of the two decoders, which we want to minimize. Here, we fix the softening parameter and use the same regularization weight for both datasets. Besides the added regularization for characters and BPE units, we further weight the decoder losses in favor of the L2R decoder in all experiments. Thereby, we ensure that the overall training is focused on the L2R decoder network; the R2L decoder and the regularization techniques only support the L2R decoder to further improve its performance. Later, in the decoding phase, we remove the R2L network, since it is only necessary during the training stages. As a result, we do not change or add complexity to the final model during decoding.
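The patience-based schedule described above can be sketched as follows; the decay factor here is an illustrative placeholder, since the exact value is not stated in this excerpt:

```python
def train_with_patience(val_accuracies, lr=1.0, decay=0.5, max_patience=3):
    """Decay the learning rate on stagnating validation accuracy and
    stop once the patience counter exceeds max_patience."""
    best, patience = float("-inf"), 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best:
            best, patience = acc, 0        # improvement: reset patience
        else:
            lr *= decay                    # no improvement: decay lr ...
            patience += 1                  # ... and increment patience
            if patience > max_patience:
                return epoch, lr           # early stopping
    return len(val_accuracies) - 1, lr

# Simulated validation accuracies: one early improvement, then stagnation.
stop_epoch, final_lr = train_with_patience(
    [0.60, 0.65, 0.65, 0.64, 0.65, 0.63])
print(stop_epoch, final_lr)  # -> 5 0.0625
```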
Table 1: Word error rates (%) for all setups on TEDLIUMv2 [24] and LibriSpeech [25].

TEDLIUMv2 [24]

                    char            BPE
Methods             dev    test     dev    test
Forward             16.77  17.32    17.83  18.00
Backward            18.12  18.47    18.57  17.99
Backward Fixed      23.34  23.77    25.55  25.01
Dual Decoder        16.47  17.12    17.70  18.08
Dual Decoder Reg    15.68  15.94    16.75  17.42

LibriSpeech [25]

                    char                                         BPE
Methods             dev-clean  dev-other  test-clean  test-other  dev-clean  dev-other  test-clean  test-other
Forward             7.69       20.67      7.72        21.63       7.59       20.98      7.67        21.92
Backward            7.60       20.78      7.54        21.83       7.53       20.94      7.60        21.71
Backward Fixed      11.39      28.36      11.75       28.53       12.07      28.63      12.39       29.06
Dual Decoder        7.29       20.99      7.60        22.00       7.46       21.29      7.70        22.01
Dual Decoder Reg    7.24       19.96      7.02        20.95       7.17       20.01      7.33        20.63
3.2 Benchmark Details
We evaluate our approach on five different setups:

Forward: The model is trained with a standard L2R decoder, which is the baseline for all experiments.

Backward: The model is trained on time-reversed target labels, which results in an R2L decoder.

Backward Fixed: Similar to the Backward setup; however, we take the pretrained encoder from the L2R model and freeze its weights during training.

Dual Decoder: The L2R and R2L decoders are trained jointly with a weighted loss, following [19].

Dual Decoder Reg: Similar to the Dual Decoder setup, with the proposed regularization terms of Sections 2.3 and 2.4 added.

Instead of performing the forward and backward beam search as in [19], we only apply a forward beam search deploying the L2R decoder with a beam size of 20.
3.3 Results
In Table 1, we present the results of our approach applying characters and BPE units on the TEDLIUMv2 [24] and LibriSpeech [25] datasets.

For the smaller TEDLIUMv2 dataset, we observe a clear difference in WERs between the Forward and the Backward setup. Ideally, the performance of these setups should be equal, as both networks receive the same amount of information. However, we observe an absolute difference of about 1% WER on all evaluation sets, except for the BPE test set. One explanation for this variation may be that the Backward task is more complex to learn. Since the dataset contains only around 200 h of training data, the number of time-reversed training samples may not be sufficient. On the bigger LibriSpeech dataset, the first two setups obtain nearly the same WER with only minor differences. This dataset contains nearly five times the data of the smaller one and therefore, the network in the Backward setup receives enough time-reversed training examples. The amount of data thus seems crucial for the R2L decoder to satisfy Equation 7.
The Backward Fixed setup verifies the strong dependency of the decoder on the high-level representation of features created by the encoder. Although we do not change the information content of the target labels by reversing them, the fixed encoder from the Forward setup has learned distinct high-level features that are based on past context. We observe this as a substantial increase in WER on both datasets. Even though the BLSTMPs in the encoder network receive the complete feature sequence in the input space, they generate high-level features based on past label context, since they were trained without access to future labels. As a result, the R2L model applying a fixed encoder from the Forward setup is worse than the R2L model with a trainable encoder.
In the Dual Decoder setup, we follow the idea of [19] and apply the R2L model as a regularizer of the L2R network. Interestingly, the R2L decoder is not able to effectively support the L2R decoder. We recognize only a slight improvement of the WER, which is not consistent across both datasets. Therefore, a simple weighting of the losses during training is not sufficient to enhance the L2R decoder. One reason might be that the L2R decoder receives only implicit information from the R2L decoder through the weighted losses, which appears to be of little value for its optimization.
To inject valuable information, we add our proposed regularization terms in the final Dual Decoder Reg setup. The overall network is forced to explicitly minimize the added regularization terms, so the L2R decoder can directly utilize information from the R2L decoder to improve its predictions. We achieve the overall best WERs in this setup. On the TEDLIUMv2 dataset, we observe an average relative improvement over the baseline of 7.2% for characters and 4.4% for BPE units. On the LibriSpeech dataset, we achieve an average relative improvement of 4.9% for characters and 5.1% for BPE units.
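As a sanity check, the reported character-level TEDLIUMv2 number can be reproduced from Table 1, assuming the relative improvements on the dev and test sets are averaged:

```python
# Forward baseline vs. Dual Decoder Reg, TEDLIUMv2 char WERs from Table 1.
baseline = {"dev": 16.77, "test": 17.32}
regularized = {"dev": 15.68, "test": 15.94}

rel = [100 * (baseline[s] - regularized[s]) / baseline[s]
       for s in ("dev", "test")]
avg = sum(rel) / len(rel)
print(round(avg, 1))  # -> 7.2
```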
In our experiments, we do not observe a clear advantage of utilizing either characters or BPE units as target values, since the performance on the evaluation sets is not consistent.
4 Conclusion
Our work presents a novel way to integrate a second decoder into attention models during the training phase. The proposed regularization terms help the standard L2R model utilize future context information from the R2L decoder, which is usually not available during optimization. We solved the issue of regularizing unequal sequence lengths, which arises when applying BPE units as target values, by adding a soft version of the DTW algorithm. Our method outperforms conventional attention models independent of the dataset size. Our regularization technique is simple to integrate into a conventional training scheme, does not change the complexity of the standard model at inference, and only adds optimization time.
For future work, we want to investigate whether this regularization can also be applied to transformer-based models and how it influences the final performance.
References

[1] T. Robinson, “An Application of Recurrent Nets to Phone Probability Estimation,” IEEE Transactions on Neural Networks, vol. 5, no. 2, 1994.
[2] H. A. Bourlard and N. Morgan, Connectionist Speech Recognition: A Hybrid Approach. Springer Science & Business Media, 2012, vol. 247.
[3] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks,” in Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006, pp. 369–376.
[4] A. Graves and N. Jaitly, “Towards End-to-End Speech Recognition with Recurrent Neural Networks,” in International Conference on Machine Learning, 2014, pp. 1764–1772.
[5] I. Sutskever, O. Vinyals, and Q. Le, “Sequence to Sequence Learning with Neural Networks,” Advances in NIPS, 2014.
[6] Z. Tüske, K. Audhkhasi, and G. Saon, “Advancing Sequence-to-Sequence Based Speech Recognition,” Proc. Interspeech 2019, pp. 3780–3784, 2019.
[7] C. Weng, J. Cui, G. Wang, J. Wang, C. Yu, D. Su, and D. Yu, “Improving Attention Based Sequence-to-Sequence Models for End-to-End English Conversational Speech Recognition,” in Interspeech, 2018, pp. 761–765.
[8] A. Graves, “Generating Sequences with Recurrent Neural Networks,” arXiv preprint arXiv:1308.0850, 2013.
[9] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 4960–4964.
[10] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina et al., “State-of-the-Art Speech Recognition with Sequence-to-Sequence Models,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4774–4778.
[11] J. Chorowski, D. Bahdanau, K. Cho, and Y. Bengio, “End-to-End Continuous Speech Recognition Using Attention-Based Recurrent NN: First Results,” arXiv preprint arXiv:1412.1602, 2014.
[12] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, “End-to-End Attention-Based Large Vocabulary Speech Recognition,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 4945–4949.
[13] L. Lu, X. Zhang, and S. Renals, “On Training the Recurrent Neural Network Encoder-Decoder for Large Vocabulary End-to-End Speech Recognition,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 5060–5064.
[14] M.-T. Luong, H. Pham, and C. D. Manning, “Effective Approaches to Attention-Based Neural Machine Translation,” arXiv preprint arXiv:1508.04025, 2015.
[15] D. Bahdanau, K. Cho, and Y. Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate,” arXiv preprint arXiv:1409.0473, 2014.
[16] A. Graves, “Sequence Transduction with Recurrent Neural Networks,” arXiv preprint arXiv:1211.3711, 2012.
[17] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech Recognition with Deep Recurrent Neural Networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013, pp. 6645–6649.
[18] H. Sak, M. Shannon, K. Rao, and F. Beaufays, “Recurrent Neural Aligner: An Encoder-Decoder Neural Network Model for Sequence to Sequence Mapping,” in Interspeech, 2017, pp. 1298–1302.
[19] M. Mimura, S. Sakai, and T. Kawahara, “Forward-Backward Attention Decoder,” in Interspeech, 2018, pp. 2232–2236.
[20] Y. Zheng, X. Wang, L. He, S. Pan, F. K. Soong, Z. Wen, and J. Tao, “Forward-Backward Decoding for Regularizing End-to-End TTS,” arXiv preprint arXiv:1907.09006, 2019.
[21] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-Based Models for Speech Recognition,” in Advances in Neural Information Processing Systems, 2015, pp. 577–585.
[22] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[23] M. Cuturi and M. Blondel, “Soft-DTW: A Differentiable Loss Function for Time-Series,” in Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2017, pp. 894–903.
[24] A. Rousseau, P. Deléglise, and Y. Estève, “Enhancing the TED-LIUM Corpus with Selected Data for Language Modeling and More TED Talks,” in LREC, 2014, pp. 3935–3939.
[25] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR Corpus Based on Public Domain Audio Books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210.
[26] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The Kaldi Speech Recognition Toolkit,” in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011.
[27] T. Kudo and J. Richardson, “SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing,” arXiv preprint arXiv:1808.06226, 2018.
[28] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. E. Y. Soplin, J. Heymann, M. Wiesner, N. Chen et al., “ESPnet: End-to-End Speech Processing Toolkit,” arXiv preprint arXiv:1804.00015, 2018.
[29] T. Hori, S. Watanabe, Y. Zhang, and W. Chan, “Advances in Joint CTC-Attention Based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM,” arXiv preprint arXiv:1706.02737, 2017.
[30] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” arXiv preprint arXiv:1409.1556, 2014.
[31] H. Sak, A. W. Senior, and F. Beaufays, “Long Short-Term Memory Recurrent Neural Network Architectures for Large Scale Acoustic Modeling,” 2014.
[32] M. D. Zeiler, “ADADELTA: An Adaptive Learning Rate Method,” arXiv preprint arXiv:1212.5701, 2012.