Regularized Forward-Backward Decoder for Attention Models

06/15/2020 ∙ by Tobias Watzel, et al. ∙ Technische Universität München

Nowadays, attention models are one of the popular candidates for speech recognition. So far, many studies mainly focus on the encoder structure or the attention module to enhance the performance of these models. However, they mostly ignore the decoder. In this paper, we propose a novel regularization technique incorporating a second decoder during the training phase. This decoder is optimized on time-reversed target labels beforehand and supports the standard decoder during training by adding knowledge from future context. Since it is only added during training, we are not changing the basic structure of the network or adding complexity during decoding. We evaluate our approach on the smaller TEDLIUMv2 and the larger LibriSpeech dataset, achieving consistent improvements on both of them.




1 Introduction

Automatic speech recognition (ASR) systems have increased their performance steadily over the years. The introduction of neural networks into the area of speech recognition led to various improvements. Hybrid approaches [1, 2] replaced traditional Gaussian mixture models by learning a function between the input speech features and hidden Markov model states in a discriminative fashion. However, these approaches are composed of several independently optimized modules, i.e., an acoustic model, a pronunciation model, and a language model. As they are not optimized jointly, useful information cannot be shared between them. Furthermore, specific knowledge is necessary for each module to retrieve the optimal result.

Recently, sequence-to-sequence (Seq2Seq) models have been gaining popularity in the community [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18] since they fuse all aforementioned modules into a single end-to-end model, which directly outputs characters. Works like [10, 6] have already shown that Seq2Seq models can be superior to hybrid systems [10] if enough data is available. Seq2Seq models can be categorized into approaches based on connectionist temporal classification (CTC) [3, 4], on transducers [16, 17, 18] and on attention [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15].

In CTC, a recurrent neural network (RNN) learns alignments between unlabeled input speech features and a transcript. The basic idea is to assume the conditional independence of the outputs and marginalize over all possible alignments [3]. For ASR, this assumption is not valid, as consecutive outputs are highly correlated. Transducer models relax the conditional independence and add another RNN to learn the dependencies between all previous input speech features and the output [17]. Attention models also combine two RNNs with an additional attention network. One RNN acts as an encoder to transform the input data into a robust feature space. The attention model creates a glimpse given the last hidden layer of the encoder, the previous time-step attention vector and the previous time-step decoder output. The decoder RNN then utilizes the glimpse and the previous decoder output to generate chars [15].

In our work, we propose a novel regularization technique by utilizing an additional decoder to improve attention models. This newly added decoder is optimized on time-reversed labels. Since we primarily focus on improving the training process, we utilize the decoder only during the optimization phase and discard it at inference time. Thus, the network architecture of a basic attention model is not changed during decoding.

Figure 1: Overview of the proposed architecture. The encoder processes input features and passes its output to the forward and backward attention modules and decoders. The newly added regularizer minimizes the distance between the outputs of the forward and backward decoders.

A recent study demonstrated that it is beneficial to add a right-to-left (R2L) decoder to a conventional left-to-right (L2R) decoder [19]. The R2L decoder is trained on time-reversed target labels and acts as a regularizer during optimization. Their work focused mainly on the advantage of using the additional information to improve the beam search in decoding. They applied a constant scalar value, which attached a more significant weight on the loss function of the standard L2R decoder. Furthermore, they trained their models on Japanese words whereby label and time-reversed label sequences were of equal length. Another comparable work has been published in the domain of speech synthesis. In [20], they also utilized a second R2L decoder, combined both losses and added another regularizing function for the L2R and R2L decoder outputs. Similar to [19], they trained only on equal sequence lengths. In the English language, however, byte pair encodings (BPE) for encoding the target transcripts seem superior [10, 6]. As encoding a time-reversed transcript produces unequal sequence lengths between L2R and R2L decoders, regularization of these sequences is challenging. To the best of our knowledge, an in-depth study on how to solve this problem and leverage the newly added decoder during the optimization process has not been done for attention models. Our contributions are the following:

  • We introduce an optimization scheme inspired by [20] for attention models in ASR and utilize the added decoder during the training.

  • We propose two novel regularization terms for equal and unequal output sequence lengths and demonstrate their superiority over conventional attention models.

Figure 2: Our approach for integrating a second decoder into the training scheme. The L2R and R2L models are pretrained and share a common encoder. We force the overall model to minimize the regularization term, which integrates future sequence information of the R2L model into the L2R model to improve its prediction.

2 Proposed Method

2.1 Attention Model

The standard attentional Seq2Seq model contains three major components: the encoder, the attention module and the decoder. Let $X = (x_1, \dots, x_T)$ be a given input sequence of speech features and let $y = (y_1, \dots, y_K)$ be the target output sequence of length $K$. The encoder transforms the input sequence into a latent space:

$$h = \mathrm{Encoder}(X), \tag{1}$$

where $h = (h_1, \dots, h_{T'})$ encodes essential aspects of the input sequence, i.e., characteristics of the speech signal. The resulting hidden encoder states $h$ and the hidden decoder state $s_{k-1}$ are fed into the attention module to predict proper alignments between the input and output sequences:

$$\alpha_{k,t} = \frac{\exp(e_{k,t})}{\sum_{\tau=1}^{T'} \exp(e_{k,\tau})}, \tag{2}$$

where $\alpha_{k,t}$ are the attention weights and $e_{k,t}$ is the output of a scoring function:

$$e_{k,t} = \mathrm{Score}(s_{k-1}, h_t, \alpha_{k-1}). \tag{3}$$

Depending on the task, there are several ways to implement scoring functions. We choose the content-based and location-aware attention from [21] for scoring. Based on the attention weights $\alpha_{k,t}$, a context vector $c_k$ is created to summarize all information in the hidden states of the encoder for the current prediction:

$$c_k = \sum_{t=1}^{T'} \alpha_{k,t} h_t. \tag{4}$$

The decoder generates the output distribution $P(y_k \mid y_{<k}, X)$ using the context vector $c_k$ and the decoder hidden state $s_k$:

$$P(y_k \mid y_{<k}, X) = \mathrm{Generate}(c_k, s_k), \tag{5}$$

where $s_k$ is a recurrency, usually a long short-term memory (LSTM) [22]:

$$s_k = \mathrm{Recurrency}(s_{k-1}, c_k, y_{k-1}), \tag{6}$$

with $y_{k-1}$ being the predicted target label of the previous prediction step. The resulting model is optimized by the cross-entropy loss $\mathcal{L}_{\text{CE}}$.
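The softmax normalization of the scores and the weighted sum that forms the context vector can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: a simple dot-product score stands in for the content-based, location-aware scoring of [21], and the names `attention_step`, `enc_states` and `dec_state` are hypothetical.

```python
import math

def attention_step(enc_states, dec_state):
    """Return (weights, context) for one decoder step.

    enc_states: list of encoder state vectors (lists of floats).
    dec_state:  previous decoder state vector of the same dimension.
    Assumption: a dot-product score replaces the location-aware scoring.
    """
    # One scalar score per encoder frame.
    scores = [sum(h_i * s_i for h_i, s_i in zip(h, dec_state)) for h in enc_states]
    # Softmax over the frames; subtracting the maximum keeps exp() stable.
    top = max(scores)
    exps = [math.exp(e - top) for e in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of the encoder states.
    dim = len(enc_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, enc_states)) for d in range(dim)]
    return weights, context

# Toy input: three encoder frames of dimension two.
enc_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dec_state = [2.0, 0.0]
weights, context = attention_step(enc_states, dec_state)
```

In this toy input, the first and third frames receive the same score and therefore the same attention weight, while the second frame contributes least to the context vector.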

2.2 Adding a Backward Decoder

For a traditional attention model, the char distribution $P(y_k \mid y_{<k}, X)$ is generated by a single L2R decoder. This distribution depends on the past and thus has no information about the future context. For this reason, we extend the model by adding a second R2L decoder, which is trained on time-reversed output labels to generate $P(y_k \mid y_{>k}, X)$. The reverse distribution contains beneficial information for the L2R decoder, since the L2R decoder has no access to future labels. The R2L decoder contains an individual attention network, which uses the same scoring mechanism as the L2R decoder. The decoders learn to create the posterior $P(\overrightarrow{y} \mid X; \overrightarrow{\theta})$ for the L2R and $P(\overleftarrow{y} \mid X; \overleftarrow{\theta})$ for the R2L case, respectively. Thus, $\overrightarrow{\theta}$ represents the attention and decoder parameters for target labels in their usual time order (e.g., cat) and $\overleftarrow{\theta}$ represents the attention and decoder parameters for the time-reversed target labels (e.g., tac).

In an ideal case, the posteriors of both decoders should satisfy the following condition:

$$P(\overrightarrow{y} \mid X; \overrightarrow{\theta}) = P(\overleftarrow{y} \mid X; \overleftarrow{\theta}), \tag{7}$$

as both networks receive the same amount of information. However, the decoders depend on a different context, i.e., the L2R on past context and the R2L on future context, which results in a similar but not equal training criterion.

2.3 Regularization for Equal Sequence Lengths

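If we apply chars as target values for training the attention model, we are dealing with equal output sequence lengths since there is no difference between the forward and reverse encoding of a word. Therefore, we extend the loss similar to [20] with a regularization term to retrieve the global loss $\mathcal{L}$:

$$\mathcal{L} = \lambda \overrightarrow{\mathcal{L}}_{\text{CE}} + (1 - \lambda) \overleftarrow{\mathcal{L}}_{\text{CE}} + \eta \mathcal{L}_{\text{reg}}, \tag{8}$$

where $\overrightarrow{\mathcal{L}}_{\text{CE}}$ and $\overleftarrow{\mathcal{L}}_{\text{CE}}$ are the cross-entropy losses of the L2R and R2L decoders, $\lambda$ defines a weighting factor for the losses, and $\mathcal{L}_{\text{reg}}$ is a regularizer term weighted by $\eta$. We apply the $L_2$ distance between the decoder outputs $\overrightarrow{d}_k$ and $\overleftarrow{d}_k$ with $k = 1, \dots, K$ as regularization. Thus, $\mathcal{L}_{\text{reg}}$ is defined as:

$$\mathcal{L}_{\text{reg}} = \sum_{k=1}^{K} \lVert \overrightarrow{d}_k - \overleftarrow{d}_k \rVert_2. \tag{9}$$

The regularization term forces the network to minimize the distance between outputs of the L2R and R2L decoders. Therefore, the L2R network gets access to outputs that are based on future context information to utilize its knowledge and increase the overall performance. Note that this kind of regularization is only feasible as we are dealing with equal sequence lengths, which makes it simple to create $\mathcal{L}_{\text{reg}}$.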
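As a rough illustration of how the weighted losses and the distance-based regularizer combine, consider the following plain-Python sketch. It rests on several assumptions that the text does not spell out: the two cross-entropy losses are weighted by a factor and its complement, the distance is an L2 distance summed over the label positions, and the R2L outputs are flipped back into label order before comparison. The function names and all numeric values are hypothetical and are not the paper's settings.

```python
import math

def l2_regularizer(fwd_outputs, bwd_outputs):
    """Sum of per-step L2 distances between two equal-length output sequences.

    fwd_outputs: L2R decoder outputs, one vector per label, in label order.
    bwd_outputs: R2L decoder outputs, produced in time-reversed label order.
    Assumption: the R2L sequence is flipped so that step k of both
    sequences refers to the same label.
    """
    if len(fwd_outputs) != len(bwd_outputs):
        raise ValueError("L2 regularization needs equal sequence lengths")
    aligned_bwd = bwd_outputs[::-1]
    return sum(math.dist(f, b) for f, b in zip(fwd_outputs, aligned_bwd))

def global_loss(ce_fwd, ce_bwd, reg, lam, eta):
    """Weighted L2R and R2L cross-entropy losses plus the weighted regularizer."""
    return lam * ce_fwd + (1.0 - lam) * ce_bwd + eta * reg

# Toy decoder outputs for the labels c-a-t (L2R) and t-a-c (R2L).
fwd = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bwd = [[1.0, 1.0], [0.0, 1.0], [4.0, 4.0]]
reg = l2_regularizer(fwd, bwd)

# Illustrative loss values and weights only.
loss = global_loss(ce_fwd=2.0, ce_bwd=3.0, reg=reg, lam=0.9, eta=0.1)
```

Only the outputs for the first label disagree in this toy example, so the regularizer reduces to a single distance. Passing sequences of different lengths raises an error, which mirrors the restriction to equal sequence lengths.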

2.4 Regularization for Unequal Sequence Lengths

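We can extend the approach above by applying BPE units instead of chars. However, in contrast to chars, we face the problem of obtaining unequal sequence lengths $K$ for L2R and $N$ for R2L decoders with $K \neq N$. Since the time-reversed chars are encoded differently, the proposed regularization in Equation 9 is not feasible. We resolve this issue by utilizing a differentiable version of the dynamic time warping (DTW) algorithm [23] as a distance measurement between two temporal sequences of arbitrary lengths, the so-called soft-DTW algorithm. By defining a soft version of the min operator with a softening parameter $\gamma \geq 0$:

$$\min{}^{\gamma}\{a_1, \dots, a_n\} = \begin{cases} \min_{i \leq n} a_i, & \gamma = 0, \\ -\gamma \log \sum_{i=1}^{n} e^{-a_i / \gamma}, & \gamma > 0, \end{cases} \tag{10}$$

we can rewrite the soft-DTW loss as a regularization term similar to the one above:

$$\mathcal{L}_{\text{reg}} = \min{}^{\gamma}\{\langle A, \Delta(\overrightarrow{D}, \overleftarrow{D}) \rangle,\ A \in \mathcal{A}_{K,N}\}. \tag{11}$$

Here, $\overrightarrow{D} = (\overrightarrow{d}_1, \dots, \overrightarrow{d}_K)$ and $\overleftarrow{D} = (\overleftarrow{d}_1, \dots, \overleftarrow{d}_N)$ are the output sequences of the L2R and R2L decoders, $\langle \cdot, \cdot \rangle$ is the inner product of two matrices, and $A$ is an alignment matrix of a set $\mathcal{A}_{K,N}$, which are binary matrices that contain paths from $(1, 1)$ to $(K, N)$ by only applying $\downarrow$, $\rightarrow$ and $\searrow$ moves through this matrix. The cost matrix $\Delta(\overrightarrow{D}, \overleftarrow{D})$ is defined by a distance function (e.g., Euclidean distance). Based on the inner product, we retrieve an alignment cost for all possible alignments between $\overrightarrow{D}$ and $\overleftarrow{D}$. Since we force the network to also minimize $\mathcal{L}_{\text{reg}}$, it has to learn a good match between the different sequence lengths of the L2R and R2L decoders.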
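The soft minimum over all alignment costs does not need to enumerate the alignments; it can be computed with the standard soft-DTW dynamic-programming recursion, in which each cell adds its local cost to the soft minimum of its three predecessors. The sketch below is a minimal plain-Python version under stated assumptions: it uses the Euclidean distance as the local cost, computes only the forward value (no gradient), and the names `soft_min` and `soft_dtw` are hypothetical.

```python
import math

def soft_min(values, gamma):
    """Soft minimum: -gamma * log(sum(exp(-v / gamma))); hard min if gamma == 0."""
    if gamma == 0.0:
        return min(values)
    lowest = min(values)
    if math.isinf(lowest):
        return lowest
    # Shift by the minimum so the exponentials cannot overflow.
    total = sum(math.exp(-(v - lowest) / gamma) for v in values)
    return lowest - gamma * math.log(total)

def soft_dtw(seq_a, seq_b, gamma):
    """Soft-DTW value between two sequences of vectors of arbitrary lengths."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j]: soft alignment cost of the first i and first j elements.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(seq_a[i - 1], seq_b[j - 1])
            # Predecessors reached by a down, right, or diagonal move.
            prev = (cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            cost[i][j] = local + soft_min(prev, gamma)
    return cost[n][m]

fwd = [[0.0], [1.0], [2.0]]   # three toy L2R outputs
bwd = [[0.0], [2.0]]          # two toy R2L outputs
hard = soft_dtw(fwd, bwd, gamma=0.0)   # classic DTW cost
soft = soft_dtw(fwd, bwd, gamma=1.0)   # smoothed, differentiable variant
```

With the softening parameter set to zero the recursion reduces to classic DTW, and for positive values the result can only be smaller, because the soft minimum never exceeds the hard minimum.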

3 Experiments

3.1 Training Details

All our experiments are evaluated on the public TEDLIUMv2 [24] and LibriSpeech [25] datasets. TEDLIUMv2 has approximately 200 h of training data with a 150 k lexicon. In order to verify our approach on a larger scale, we also utilize the larger dataset LibriSpeech, which contains 960 h of training data with a given 200 k lexicon.

We preprocess both datasets by extracting 80-dimensional log Mel features using Kaldi [26] and adding the corresponding pitch features, which results in an 83-dimensional feature vector. Furthermore, we apply chars and BPE units as target labels. The chars are directly extracted from the datasets, whereas the BPE units are created by SentencePiece [27], which is a language-independent sub-word tokenizer. For all experiments, we select 100 BPE units, which seems sufficient [6] for our approach. Moreover, we do not utilize any dropout layers, augmentation techniques, or language models, as we focus our evaluation on the additional decoder and how to deploy it in the training stage.

The proposed architecture is created in the ESPnet toolkit [28] and is depicted in Figure 2. We strictly follow the encoder structure from [29]. The encoder consists of two VGG blocks [30], where every block is composed of two 2D convolutional layers and a 2D max-pooling layer. The first block contains convolutional layers with 64 filters, whereas the second block contains convolutional layers with 128 filters. All convolutions share the same filter size and have a stride length of one. Each max-pooling layer has the same kernel size and a stride length of two. On top of the network, we finalize the encoder with four bidirectional long short-term memory projected (BLSTMP) [31] layers. Every BLSTMP layer has 1024 cells and a projection layer of size 1024. The output of the encoder is then utilized by the L2R and R2L attention networks, which create the context vectors for the decoders.

Each decoder is a single LSTM network with 1024 cells. As our resources are limited and end-to-end training is considered challenging, we perform a three-stage training scheme inspired by [20]. In the first stage, we train a standard attentional network with an L2R decoder. Then, we apply the pretrained encoder, freeze its weights and train the R2L model. Finally, we combine both networks into one model to obtain the final architecture similar to Figure 2. In all stages, we optimize the network with Adadelta [32], starting from an initial $\epsilon$. If we do not observe any improvement of the accuracy on the validation set, we decay $\epsilon$ by a constant factor and increment a patience counter by one. We apply early stopping of the training if the patience counter exceeds three. The batch size is set to 30 for all training steps.
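The decay and early-stopping rule described above can be sketched as a small loop over recorded validation accuracies. This is an illustration only: the function name `run_schedule`, the starting value and the decay factor are hypothetical placeholders rather than the paper's settings, and the sketch assumes that the patience counter is never reset, since the text does not mention a reset.

```python
def run_schedule(val_accuracies, eps, decay, max_patience=3):
    """Replay validation accuracies and return (epochs_run, final_eps).

    eps and decay are hypothetical placeholder values for the optimizer's
    epsilon and its decay factor.
    """
    best = float("-inf")
    patience = 0
    epochs_run = 0
    for acc in val_accuracies:
        epochs_run += 1
        if acc > best:
            best = acc                # validation accuracy improved
        else:
            eps *= decay              # no improvement: shrink epsilon
            patience += 1
            if patience > max_patience:
                break                 # early stopping
    return epochs_run, eps

# Toy accuracy curve with four epochs that bring no improvement.
accuracies = [0.70, 0.72, 0.71, 0.72, 0.73, 0.73, 0.72, 0.70]
epochs_run, final_eps = run_schedule(accuracies, eps=1.0, decay=0.5)
```

In this toy run, training stops at the seventh epoch, once the fourth epoch without improvement pushes the counter above three, and the last recorded accuracy is never evaluated.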

Depending on the target labels in the third training stage, i.e., chars or BPE units, we deploy two different techniques to regularize the L2R decoder. For chars, forward sequences and backward sequences have equal lengths. Thus, we add the regularizer as in Equation 8 and scale it with a different value of the regularization weight $\eta$ for the smaller and for the bigger dataset. For BPE units, on the other hand, we utilize the soft-DTW from Equation 11 as a regularizer since it represents a distance measurement between the L2R and R2L output sequences of unequal lengths, which we want to minimize. Here, we fix the softening parameter $\gamma$ and scale the regularization with the same weight $\eta$ for both datasets. Besides the added regularizations for chars and BPE units, we regularize the L2R network further by applying the same loss weighting factor $\lambda$ in all the experiments. Thereby, we ensure that the overall training is focused on the L2R decoder network. Thus, the R2L decoder and the regularization techniques only support the L2R decoder to further improve its performance. Later in the decoding phase, we remove the R2L network since it is only necessary during the training stages. As a result, we are not changing or adding complexity to the final model during decoding.

Methods          | TEDLIUMv2 [24]              | LibriSpeech [25]
                 | char         | BPE          | char                                         | BPE
                 | dev    test  | dev    test  | dev-clean  dev-other  test-clean  test-other | dev-clean  dev-other  test-clean  test-other
Forward          | 16.77  17.32 | 17.83  18.00 | 7.69       20.67      7.72        21.63      | 7.59       20.98      7.67        21.92
Backward         | 18.12  18.47 | 18.57  17.99 | 7.60       20.78      7.54        21.83      | 7.53       20.94      7.60        21.71
Backward Fixed   | 23.34  23.77 | 25.55  25.01 | 11.39      28.36      11.75       28.53      | 12.07      28.63      12.39       29.06
Dual Decoder     | 16.47  17.12 | 17.70  18.08 | 7.29       20.99      7.60        22.00      | 7.46       21.29      7.70        22.01
Dual Decoder Reg | 15.68  15.94 | 16.75  17.42 | 7.24       19.96      7.02        20.95      | 7.17       20.01      7.33        20.63

Table 1: Evaluation of our approach on TEDLIUMv2 and LibriSpeech with the resulting WERs for all five setups

3.2 Benchmark Details

We evaluate our approach on five different setups:

  1. Forward: The model is trained with a standard L2R decoder, which is the baseline for all experiments.

  2. Backward: The model is trained on time-reversed target labels, which results in a R2L decoder.

  3. Backward Fixed: Similar to the Backward experiment, except that we take the pretrained encoder from the L2R model and freeze its weights during training.

  4. Dual Decoder: The model consists of a shared encoder from Forward and the pretrained L2R and R2L decoders from the Forward and Backward Fixed setups. The combined model is trained with the weighted cross-entropy losses of both decoders but without the additional regularization term, to investigate solely the effect of the R2L decoder as regularization.

  5. Dual Decoder Reg: The model builds on the Dual Decoder setup. We additionally include the $L_2$ distance [20] for chars and the soft-DTW loss [23] for BPE units as target labels.

Instead of performing the forward and backward beam search as in [19], we only apply a forward beam search deploying the L2R decoder with a beam size of 20.

3.3 Results

In Table 1, we present the results of our approach applying chars and BPE units for the TEDLIUMv2 [24] and LibriSpeech [25] datasets.

For the smaller dataset TEDLIUMv2, we observe a clear difference in WERs between the Forward and the Backward setup. Ideally, the performance of these setups should be equal, as both networks receive the same amount of information. However, we observe an absolute difference of about 1% WER for all evaluation sets, except for the test BPE set. One explanation for this variation may be that the Backward setup is more complex. Since the dataset contains only around 220 h of training data, the number of reverse training samples might not be sufficient. In the bigger dataset LibriSpeech, the first two setups obtain nearly the same WER with only a minor difference. This dataset contains nearly five times the data of the smaller dataset and therefore, the network in the Backward setup receives enough reverse training examples. It seems that the amount of data is crucial for the R2L decoder to satisfy Equation 7.

In the Backward Fixed setup, we can verify the strong dependency of the decoder on the high-level representation of features created by the encoder. Although we do not change the information of the target labels by reversing them, the fixed encoder from the Forward setup learned distinct, high-level features, which are based on past context. We observe this as a degradation of the WERs on both datasets. Even though the utilized BLSTMPs in the encoder network receive the complete feature sequence in the input space, they generate high-level features based on past label context, since they do not have access to future labels. As a result, the R2L model applying a fixed encoder from the Forward setup performs worse than the R2L model with a trainable encoder.

In the Dual Decoder setup, we follow the idea of [19] to apply the R2L model as a regularizer of the L2R network. Interestingly, the R2L decoder is not able to effectively support the L2R decoder. We observe only a slight improvement of the WER, which is not consistent across both datasets. Therefore, a simple weighting of the losses during training is not sufficient to enhance the L2R decoder. One reason might be that the L2R decoder receives only implicit information from the R2L decoder by weighting the losses, which appears to be of limited value for the optimization of the L2R decoder.

To induce valuable information, we add our proposed regularization terms in the last setup, Dual Decoder Reg. The overall network is forced to minimize the added regularization terms explicitly. The L2R decoder can directly utilize information of the R2L decoder to improve its predictions. We obtain the overall best WER for the last setup. For the TEDLIUMv2 dataset, we observe an average relative improvement of 7.2% for the char and 4.4% for the BPE units. For the LibriSpeech dataset, we obtain an average relative improvement of 4.9% for the char and 5.1% for the BPE units.
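As a worked example of how such a figure can be derived from Table 1, the sketch below computes the relative WER reduction of the Dual Decoder Reg setup over the Forward baseline for the TEDLIUMv2 char results. It assumes that the average is the plain mean of the per-set relative reductions, which the text does not state explicitly; the function names are hypothetical.

```python
def relative_improvement(baseline_wer, new_wer):
    """Relative WER reduction in percent."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

def average_relative_improvement(baseline, new):
    """Mean of the per-set relative WER reductions (assumed averaging scheme)."""
    gains = [relative_improvement(b, n) for b, n in zip(baseline, new)]
    return sum(gains) / len(gains)

# TEDLIUMv2, chars: (dev, test) WERs taken from Table 1.
forward_char = [16.77, 17.32]
dual_reg_char = [15.68, 15.94]
tedlium_char_gain = average_relative_improvement(forward_char, dual_reg_char)
```

Under this averaging assumption, the dev and test reductions of roughly 6.5% and 8.0% average to about 7.2%.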

In our experiments, we do not observe a clear advantage of either utilizing chars or BPE units as target values, since the performance on the evaluation sets is not consistent.

4 Conclusion

Our work presents a novel way to integrate a second decoder for attention models during the training phase. The proposed regularization terms support the standard L2R model to utilize future context information from the R2L decoder, which is usually not available during optimization. We solved the issue of regularizing unequal sequence lengths, which arise when applying BPE units as target values, by adding a soft version of the DTW algorithm. Our method outperforms conventional attention models independent of the dataset size. Our regularization technique is simple to integrate into a conventional training scheme, does not change the overall complexity of the standard model, and only adds optimization time.

For future work, we want to investigate if this regularization can also be applied to transformer-based models and how it influences the final performance.