CTC-synchronous Training for Monotonic Attention Model

05/10/2020 ∙ by Hirofumi Inaguma, et al. ∙ Kyoto University

Monotonic chunkwise attention (MoChA) has been studied for online streaming automatic speech recognition (ASR) based on a sequence-to-sequence framework. In contrast to connectionist temporal classification (CTC), backward probabilities cannot be leveraged in the alignment marginalization process during training due to left-to-right dependency in the decoder. This results in the error propagation of alignments to subsequent token generation. To address this problem, we propose CTC-synchronous training (CTC-ST), in which MoChA uses CTC alignments to learn optimal monotonic alignments. Reference CTC alignments are extracted from a CTC branch sharing the same encoder. The entire model is jointly optimized so that the expected boundaries from MoChA are synchronized with the alignments. Experimental evaluations on the TEDLIUM release-2 and Librispeech corpora show that the proposed method significantly improves recognition, especially for long utterances. We also show that CTC-ST can bring out the full potential of SpecAugment for MoChA.




1 Introduction

Streaming automatic speech recognition (ASR) is a core technology used in simultaneous interpretation such as live captioning, simultaneous translation, and dialogue systems. Recently, the performance of end-to-end (E2E) ASR systems has been nearing that of hybrid systems with much more simplified architectures [google_sota_asr, karita2019comparative]. Therefore, building effective streaming E2E-ASR systems is an important step towards making E2E models applicable in the real world.

The streaming E2E models proposed thus far can be categorized as time-synchronous or label-synchronous models. Time-synchronous models include connectionist temporal classification (CTC) [ctc_graves], recurrent neural network transducer (RNN-T) [rnn_transducer], and recurrent neural aligner (RNA) [recurrent_neural_aligner]. These models can make predictions from the input stream frame-by-frame, so they can easily satisfy the demands of streaming ASR. Label-synchronous models based on an attention-based sequence-to-sequence (S2S) framework [chorowski2015attention] outperform time-synchronous models on several benchmarks in offline scenarios [s2s_comparison_google, s2s_comparison_baidu, huang2019exploring]. However, label-synchronous models are not suitable for streaming ASR because the initial token cannot be generated until all of the speech frames in an utterance are encoded. To make label-synchronous models streamable, several variants have been studied, such as hard monotonic attention [monotonic_attention], monotonic chunkwise attention (MoChA) [mocha], triggered attention [moritz2019triggered_icassp2019], adaptive computation steps [adaptive_computation_steps], and continuous integrate-and-fire [cif]. In this work, we focus on MoChA as a streaming attention-based S2S model since it can be trained efficiently and shows promising results for various ASR tasks [kim2020attention, online_hybrid_ctc_attention, adaptive_mocha, inaguma2020streaming, online_hybrid_ctc_attention_taslp2020].

Figure 1: Visualization of decision boundaries (yellow dots) of LC-BLSTM-40+40 - MoChA (top, T5) and the proposed model with CTC-ST (bottom, T6). Reference: "we might be putting lids and casting shadows on their power wouldn't we want to open doors for them instead".

The monotonic attention mechanism [monotonic_attention] enables online linear-time decoding during inference by introducing a discrete binary variable. To train the model using standard backpropagation, the alignment probabilities are marginalized over all possible paths during training. MoChA is an extension of the monotonic attention model and introduces additional soft attention over a small chunk. This lessens the strict monotonic constraints of input-output alignments and leads to more accurate recognition [mocha]. However, the results of training MoChA have often been reported to be unstable [online_hybrid_ctc_attention, adaptive_mocha, inaguma2020streaming, online_hybrid_ctc_attention_taslp2020]. This is because the attention scores are not globally normalized over all encoder memories, which results in poorly scaled attention scores in the early training stage. Consequently, the gap in attention behaviors widens between training and test time. A possible solution to this problem is to regularize the model so that the total attention scores over all output timesteps sum up to the output sequence length. This was originally proposed in [cif, inaguma2020streaming] and referred to as quantity loss. Although [inaguma2020streaming] reported that quantity loss is not effective for MoChA for large-scale data (3.4k hours), we have found that it is very effective when the training data size is small (1k hours). However, apart from the scaling issue, alignments in MoChA are susceptible to noise from previous timesteps because the monotonic attention scores are calculated only with the forward algorithm due to left-to-right dependency in the autoregressive decoder. As a result, the backward algorithm cannot be used to estimate accurate alignments, unlike in CTC. This leads to significant error propagation to subsequent token generation, especially for long utterances.

In this work, we propose CTC-synchronous training (CTC-ST) to learn reliable monotonic alignments by using CTC alignments. Since CTC is optimized with the forward-backward algorithm, the resulting alignments are more reliable than those from MoChA. Moreover, the CTC posterior probabilities tend to peak in spikes [ctc_graves], so they can be regarded as decision boundaries for token generation and a good reference for MoChA to learn the appropriate timing to generate output tokens. We extract the reference boundaries from the CTC model with forced alignment, and the expected decision boundaries from MoChA are optimized to be close to the corresponding CTC boundaries. Since we jointly optimize the MoChA and CTC branches by sharing the same encoder, the entire model can be trained in an end-to-end fashion. We also propose a curriculum training strategy to stabilize the training.

We conducted experimental evaluations on the TEDLIUM release 2 and Librispeech corpora to demonstrate that the proposed CTC-ST significantly improves recognition. We also investigate the reliability of CTC alignments in the proposed CTC-ST by combining it with SpecAugment [specaugment].

2 Streaming attention-based S2S

2.1 Latency-controlled BLSTM (LC-BLSTM)

To enable low-latency feature encoding while maintaining bidirectionality, the latency-controlled bidirectional long short-term memory (LC-BLSTM) encoder has been introduced to restrict future frames to small chunks ($N_c$ frames) [latency_controlled_blstm, xue2017improving, online_hybrid_ctc_attention, adaptive_mocha]. LC-BLSTM consists of forward and backward LSTMs similar to the BLSTM but processes a small chunk by sliding a window of size $N_c$ without overlap between adjacent chunks. The backward LSTM processes the later $N_r$ frames as future contexts in addition to the $N_c$ frames in the current chunk. Thus, the total latency at each chunk is $N_c + N_r$ frames. The backward LSTM state is reset at every chunk while the previous state of the forward LSTM is carried over to the next chunk as the initial state. We refer to this as LC-BLSTM-$N_c$+$N_r$ in the later experiments.
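The chunking scheme can be illustrated with a minimal sketch. It assumes a chunk size `n_c` and a future-context size `n_r` measured in encoder input frames; the function name and the end-exclusive index convention are hypothetical and not taken from the authors' implementation.

```python
def lc_blstm_windows(num_frames, n_c, n_r):
    """Return (chunk_start, chunk_end, context_end) triples, end-exclusive.

    The forward LSTM would consume frames [chunk_start, chunk_end) and carry
    its state to the next chunk. The backward LSTM would be reset per chunk
    and run over [chunk_start, context_end), i.e. the chunk plus look-ahead.
    """
    windows = []
    for chunk_start in range(0, num_frames, n_c):
        chunk_end = min(chunk_start + n_c, num_frames)
        context_end = min(chunk_end + n_r, num_frames)
        windows.append((chunk_start, chunk_end, context_end))
    return windows


# A 100-frame utterance with 40-frame chunks and 40 look-ahead frames.
windows = lc_blstm_windows(num_frames=100, n_c=40, n_r=40)
for chunk_start, chunk_end, context_end in windows:
    print(chunk_start, chunk_end, context_end)
```

Chunks do not overlap, while the look-ahead region of one chunk coincides with the frames of the following chunk, so the last chunks of an utterance see a truncated future context.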

2.2 Monotonic chunkwise attention (MoChA)

To enable online linear-time decoding for S2S models during inference, the monotonic attention mechanism was proposed, in which discrete binary decision processes were introduced [monotonic_attention]. Unlike the global attention mechanism [chorowski2015attention], all attention scores are assigned to a single memory at each output timestep without global score normalization. As hard attention is not differentiable, the expected alignment probabilities $\alpha_{i,j}$ are marginalized over all possible paths during training by introducing a selection probability $p_{i,j}$. $p_{i,j}$ is a function of the monotonic energy activation $e^{\rm mono}_{i,j}$, which is parameterized with the $i$-th decoder state $s_i$ and the $j$-th encoder state $h_j$ as follows:

$$\alpha_{i,j} = p_{i,j} \left( (1 - p_{i,j-1}) \frac{\alpha_{i,j-1}}{p_{i,j-1}} + \alpha_{i-1,j} \right), \qquad (1)$$

$$p_{i,j} = \sigma(e^{\rm mono}_{i,j}), \quad e^{\rm mono}_{i,j} = g \frac{v^{\top}}{\|v\|} {\rm ReLU}(W_h h_j + W_s s_i + b) + r, \qquad (2)$$

where $g$, $v$, $W_h$, $W_s$, $b$, and $r$ are learnable parameters.
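The marginalization can be illustrated with a minimal sketch, assuming the standard expected-alignment recurrence of monotonic attention [monotonic_attention]. Writing q_j for the probability mass that reaches frame j without having stopped earlier, q_j = (1 - p_{j-1}) q_{j-1} + alpha_prev[j] and alpha[j] = p_j q_j, which avoids the division by p_{j-1}. The function and variable names are hypothetical, and plain Python lists stand in for tensors.

```python
def monotonic_alignment(p_row, prev_alpha):
    """Expected alignment for one output step.

    p_row[j]      : selection probability at frame j for the current token
    prev_alpha[j] : expected alignment of the previous token
    """
    alpha = []
    q = 0.0  # mass that arrives at frame j without having been selected yet
    for j, p in enumerate(p_row):
        if j == 0:
            q = prev_alpha[0]
        else:
            q = (1.0 - p_row[j - 1]) * q + prev_alpha[j]
        alpha.append(p * q)
    return alpha


# First output step: the previous alignment is a one-hot on frame 0.
prev_alpha = [1.0, 0.0, 0.0, 0.0]
p_row = [0.1, 0.6, 0.9, 0.5]
alpha = monotonic_alignment(p_row, prev_alpha)
print(alpha, sum(alpha))
```

In this toy example the row sums to less than one, which is the lack of normalization discussed for MoChA training. Because each step consumes the previous alignment, any error in `prev_alpha` is carried forward to later tokens.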

Monotonic chunkwise attention (MoChA) [mocha] is an extension of the above method that introduces additional soft chunkwise attention to loosen the strict input-output alignment of hard attention. The soft attention scores over a small chunk of $w$ frames are calculated from each boundary $j$:

$$\beta_{i,j} = \sum_{k=j}^{j+w-1} \left( \alpha_{i,k} \frac{\exp(u_{i,j})}{\sum_{l=k-w+1}^{k} \exp(u_{i,l})} \right), \qquad (3)$$

where $\beta_{i,j}$ is a chunkwise attention score over each frame and $u_{i,j}$ is the chunk energy activation formulated similarly to $e^{\rm mono}_{i,j}$ in Eq. (2) without weight normalization and the offset $r$. The context vector $c_i$ is calculated as a weighted sum of encoder memories, $c_i = \sum_{j} \beta_{i,j} h_j$, and the subsequent token generation processes are the same as in the global S2S model. Both $\alpha_{i,j}$ and $\beta_{i,j}$ can be calculated in parallel with the cumulative sum/product and the moving sum operations.
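The chunkwise scores can be sketched with a direct, non-parallel loop, assuming the MoChA formulation in [mocha]: each candidate boundary k spreads its mass alpha[k] over the w frames ending at k with a softmax over the chunk energies. The names are hypothetical, and this version trades the cumulative-sum and moving-sum tricks for readability.

```python
import math


def chunkwise_attention(alpha, u, w):
    """Distribute each boundary's mass over its chunk of up to w frames.

    alpha[k] : expected monotonic alignment (mass of a boundary at frame k)
    u[j]     : chunk energy at frame j
    w        : chunk size in frames
    """
    num_frames = len(alpha)
    exp_u = [math.exp(x) for x in u]
    beta = [0.0] * num_frames
    for k in range(num_frames):
        lo = max(0, k - w + 1)          # chunk covers frames lo..k
        denom = sum(exp_u[lo:k + 1])    # softmax normalizer inside the chunk
        for j in range(lo, k + 1):
            beta[j] += alpha[k] * exp_u[j] / denom
    return beta


alpha = [0.1, 0.54, 0.324, 0.018]
u = [0.0, 1.0, -1.0, 0.5]
beta = chunkwise_attention(alpha, u, w=2)
print(beta)
```

The total mass is preserved (the sum of `beta` equals the sum of `alpha`), and with `w=1` the chunkwise scores reduce to the monotonic alignment itself.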

At test time, each token is generated once $p_{i,j}$ exceeds a threshold of 0.5. The next token boundary is determined further to the right than the current boundary. For more details on the decoding algorithm, refer to [monotonic_attention, mocha].
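The test-time decision rule can be sketched as a left-to-right scan. This is a simplified illustration under two assumptions that are not taken from the source: the selection probabilities are given as a precomputed matrix (in an actual decoder each row depends on the previously generated tokens), and the scan resumes one frame after the previous boundary, which is one reading of "further to the right". The function name is hypothetical.

```python
def decode_boundaries(p, threshold=0.5):
    """Pick one boundary frame per output step by thresholding p[i][j]."""
    boundaries = []
    start = 0
    num_frames = len(p[0]) if p else 0
    for p_row in p:
        boundary = None
        for j in range(start, num_frames):
            if p_row[j] > threshold:
                boundary = j
                break
        if boundary is None:
            break  # no frame was selected, so decoding stops here
        boundaries.append(boundary)
        start = boundary + 1  # assumption: strictly to the right
    return boundaries


p = [
    [0.1, 0.7, 0.2, 0.1, 0.0],
    [0.9, 0.9, 0.3, 0.8, 0.1],
    [0.0, 0.0, 0.0, 0.0, 0.6],
]
boundaries = decode_boundaries(p)
print(boundaries)
```

Each frame is visited at most once across all output steps, which is why decoding is linear in the input length.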

2.3 Quantity regularization

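Since MoChA does not normalize the monotonic attention scores $\alpha_{i,j}$ across all encoder memories, there is no guarantee that $\sum_{j} \alpha_{i,j} = 1$ is satisfied during training. This results in poorly scaled selection probabilities and widens the gap in behaviors between training and testing because $\alpha_{i,j}$ can attenuate quickly during marginalization. To address this drawback, we use a simple regularization term $\mathcal{L}_{\rm qua}$, i.e., quantity loss [cif, inaguma2020streaming], to encourage the summation of attention scores over all output timesteps to be close to the reference sequence length $L$: $\mathcal{L}_{\rm qua} = |L - \sum_{i} \sum_{j} \alpha_{i,j}|$. Scaling $\alpha_{i,j}$ properly is expected to encourage the discreteness of the attention scores during training, which leads to better estimation of $\beta_{i,j}$ in Eq. (3). The objective function is designed as an interpolation of the negative log-likelihood $\mathcal{L}_{\rm mocha}$, CTC loss $\mathcal{L}_{\rm ctc}$, and quantity loss $\mathcal{L}_{\rm qua}$ as follows:

$$\mathcal{L}_{\rm total} = (1 - \lambda_{\rm ctc}) \mathcal{L}_{\rm mocha} + \lambda_{\rm ctc} \mathcal{L}_{\rm ctc} + \lambda_{\rm qua} \mathcal{L}_{\rm qua}, \qquad (4)$$

where $\lambda_{\rm ctc}$ ($0 \le \lambda_{\rm ctc} \le 1$) and $\lambda_{\rm qua}$ ($\ge 0$) are tunable hyperparameters. We perform joint optimization with the CTC objective by sharing the encoder sub-network to encourage monotonicity of the input-output alignment.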
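The regularizer and the interpolated objective can be sketched as follows. The quantity loss is taken as the absolute gap between the reference length and the total attention mass, as described above. The (1 - lambda_ctc) / lambda_ctc weighting is an assumption borrowed from common joint CTC-attention training, and all names and numeric values are hypothetical.

```python
def quantity_loss(alpha, ref_len):
    """|L - sum of all expected alignment scores| for one utterance."""
    total_mass = sum(sum(row) for row in alpha)
    return abs(ref_len - total_mass)


def total_loss(nll, ctc, qua, lambda_ctc, lambda_qua):
    """Interpolate the decoder NLL with the CTC loss and add the regularizer."""
    return (1.0 - lambda_ctc) * nll + lambda_ctc * ctc + lambda_qua * qua


# Two output tokens whose alignment rows are both under-scaled.
alpha = [
    [0.1, 0.54, 0.324, 0.018],
    [0.0, 0.2, 0.3, 0.1],
]
qua = quantity_loss(alpha, ref_len=2)
loss = total_loss(nll=2.0, ctc=3.0, qua=qua, lambda_ctc=0.3, lambda_qua=2.0)
print(qua, loss)
```

The loss reaches zero when every row carries a total mass of one on average, which pushes the selection probabilities away from the poorly scaled regime.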


3 CTC-synchronous training (CTC-ST)

The monotonic attention mechanism in MoChA depends entirely on past alignments because of the left-to-right dependency in Eq. (1). Although the discreteness of the attention scores can be facilitated with the quantity regularization described in Section 2.3, alignment errors in the middle and later steps cannot be recovered, which leads to significant error propagation of $\alpha_{i,j}$ to the later tokens as the output sequence becomes longer. Unlike CTC, the decoder is autoregressive, so it is difficult to apply the backward algorithm during the alignment marginalization process. Consequently, the decision boundaries tend to shift to the right side (future) from the corresponding reference acoustic boundaries [adaptive_mocha, kim2020attention, inaguma2020streaming].

On the other hand, since CTC marginalizes all possible alignments with the forward-backward algorithm during training, its decision boundaries are more reliable and tend to shift to the left compared to those of MoChA. We indeed observed a delay of a few frames between the decision boundaries from the initial MoChA model and the corresponding CTC spikes (see Figure 1). Furthermore, the well-trained CTC posterior probability distributions tend to peak in sharp spikes [ctc_graves]. Therefore, the decision boundaries from CTC are expected to serve as an effective guide for MoChA to learn accurate alignments. In other words, MoChA can correct error propagation from past decision boundaries with the help of the CTC alignments.

In this work, we propose CTC-synchronous training (CTC-ST) to provide reliable CTC alignments as references to MoChA for learning robust monotonic alignments. The MoChA model is trained to mimic the CTC model so that it generates similar decision boundaries. Both the MoChA and CTC branches are jointly optimized by sharing the same encoder, and the reference decision boundaries are extracted from the CTC branch. Thus, we can train the model in an end-to-end fashion, unlike in [inaguma2020streaming]. Synchronizing both decision boundaries can be regarded as explicit interaction between MoChA and CTC on the decoder side.

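We use the most probable CTC path $\pi$ of length $T$ after forced alignment with the forward-backward algorithm [moritz2019triggered_icassp2019] and use the time indices of non-blank labels in $\pi$ as the reference boundary positions $b^{\rm ctc} = (b^{\rm ctc}_1, \ldots, b^{\rm ctc}_L)$. Note that the leftmost index is used when non-blank labels are repeated without any interleaving blank labels. For the end-of-sentence token, the last input index (i.e., $T$) is used. $\pi$ is generated on-the-fly with the current model parameters at each training step and is updated as the training continues. The objective function of CTC-ST is defined as follows:

$$\mathcal{L}_{\rm sync} = \frac{1}{L} \sum_{i=1}^{L} \left| b^{\rm ctc}_i - b^{\rm mocha}_i \right|,$$

where $b^{\rm mocha}_i = \sum_{j=1}^{T} j \, \alpha_{i,j}$ is the expected decision boundary of MoChA for the $i$-th token during training. The total objective function in Eq. (4) is modified accordingly as follows:

$$\mathcal{L}_{\rm total} = (1 - \lambda_{\rm ctc}) \mathcal{L}_{\rm mocha} + \lambda_{\rm ctc} \mathcal{L}_{\rm ctc} + \lambda_{\rm qua} \mathcal{L}_{\rm qua} + \lambda_{\rm sync} \mathcal{L}_{\rm sync}, \qquad (5)$$

where $\lambda_{\rm sync}$ ($\ge 0$) is a tunable hyperparameter, set to 1.0 in this work. Unless otherwise noted, $\lambda_{\rm qua}$ is set to 0 when using CTC-ST.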
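The boundary extraction and the synchronization term can be sketched in a few lines. The extraction follows the description above (leftmost frame of each run of a non-blank label, last frame for the end-of-sentence token). The loss is written as a mean absolute difference between CTC boundaries and expected MoChA boundaries, which is an assumption about the exact distance used; 0-based frame indices, the blank id 0, and all names are likewise hypothetical.

```python
def ctc_boundaries(path, blank=0, add_eos=True):
    """Leftmost frame index of every non-blank label run in a CTC path."""
    boundaries = [
        t for t, label in enumerate(path)
        if label != blank and (t == 0 or label != path[t - 1])
    ]
    if add_eos:
        boundaries.append(len(path) - 1)  # end-of-sentence uses the last frame
    return boundaries


def expected_boundaries(alpha):
    """Expected frame index of each token under its alignment row."""
    return [sum(j * a for j, a in enumerate(row)) for row in alpha]


def sync_loss(ctc_bounds, mocha_bounds):
    """Mean absolute gap between reference and expected boundaries."""
    assert len(ctc_bounds) == len(mocha_bounds)
    gaps = [abs(c - m) for c, m in zip(ctc_bounds, mocha_bounds)]
    return sum(gaps) / len(gaps)


# Forced-alignment path for the label sequence (3, 5, 5) over 10 frames.
path = [0, 3, 3, 0, 0, 5, 0, 5, 5, 0]
ctc_bounds = ctc_boundaries(path)

# Toy alignment rows: each dict maps frame index -> attention mass.
rows = [{2: 1.0}, {5: 0.5, 6: 0.5}, {8: 1.0}, {9: 1.0}]
alpha = [[row.get(j, 0.0) for j in range(len(path))] for row in rows]
mocha_bounds = expected_boundaries(alpha)
loss = sync_loss(ctc_bounds, mocha_bounds)
print(ctc_bounds, mocha_bounds, loss)
```

In the toy example every MoChA boundary lies at or to the right of its CTC counterpart, mirroring the delay the text describes. Note that the expected boundary is only meaningful when each alignment row sums to roughly one, which motivates the curriculum described next.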

3.1 Curriculum learning strategy

Since CTC-ST is designed to bring the decision boundaries from MoChA and CTC closer, their alignment probabilities must be peaky to minimize the CTC-ST objective $\mathcal{L}_{\rm sync}$. In the early training stage, however, attention scores tend to be diffused over several frames and are not normalized to sum up to one. Applying CTC-ST from scratch is therefore ineffective, so we use a curriculum learning strategy instead. First, the MoChA model with a standard BLSTM encoder is trained from random initialization with Eq. (4) until convergence (stage-1). Next, after loading the model parameters from stage-1, future contexts are restricted for the LC-BLSTM encoder and the parameters are optimized with CTC-ST by Eq. (5) (stage-2). Although we mainly use the LC-BLSTM encoder, this two-stage training can also be applied to the MoChA model with the unidirectional LSTM encoder by using the same encoder in both stages.

3.2 Combination with SpecAugment

SpecAugment has been shown to greatly enhance the decoder in the S2S model by performing on-the-fly data augmentation [specaugment]. However, since it introduces time and frequency masks into the input log-mel spectrogram, the recurrence in Eq. (1) can easily collapse after the masked region. In our experiments, MoChA did not benefit from SpecAugment as much as the global attention model did. In contrast, CTC-ST can recover the attention scores after the masked region with CTC spikes since CTC is formulated on the conditional independence assumption per frame. Therefore, CTC-ST helps MoChA learn monotonic alignments that withstand noisy inputs. We will also analyze the impact of mask size.
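The masking operation can be sketched as follows, assuming the frequency and time masking policy of [specaugment] without time warping: each mask width is drawn uniformly up to a maximum (`F` for frequency bins, `T` for frames) and the masked region is set to zero. Zero-filling assumes mean-normalized features; the function name, the list-of-lists layout, and the seed are hypothetical.

```python
import random


def spec_augment(spec, F, T, num_freq_masks=2, num_time_masks=2, rng=None):
    """Return a masked copy of spec, laid out as spec[frame][mel_bin]."""
    rng = rng or random.Random()
    num_frames, num_bins = len(spec), len(spec[0])
    out = [row[:] for row in spec]  # leave the input untouched
    for _ in range(num_freq_masks):
        width = rng.randint(0, min(F, num_bins))
        start = rng.randint(0, num_bins - width)
        for row in out:
            row[start:start + width] = [0.0] * width
    for _ in range(num_time_masks):
        width = rng.randint(0, min(T, num_frames))
        start = rng.randint(0, num_frames - width)
        for t in range(start, start + width):
            out[t] = [0.0] * num_bins  # the whole frame is masked
    return out


# 200 frames of 80-dimensional features, all ones for easy inspection.
spec = [[1.0] * 80 for _ in range(200)]
masked = spec_augment(spec, F=27, T=100, rng=random.Random(0))
num_masked_frames = sum(1 for row in masked if not any(row))
print(num_masked_frames)
```

A time mask removes every feature in a run of consecutive frames, so a purely forward recurrence has no evidence to place boundaries inside that run, whereas a per-frame CTC reference is still available on both sides of it.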

| | Model | WER (%) |
|---|---|---|
| Offline | LSTM - Global attention | 11.9 |
| | BLSTM - Global attention (T1) | 9.5 |
| | + LC-BLSTM-40+20 (seed: T1) | 10.1 |
| | + LC-BLSTM-40+40 (seed: T1) | 9.7 |
| | BLSTM - MoChA | 12.6 |
| | + Quantity regularization (T2) | 9.8 |
| | + CTC-ST | 10.2 |
| Streaming | LSTM - MoChA (T3) | 15.0 |
| | + CTC-ST (T4, seed: T3) | 13.2 |
| | LC-BLSTM-40+20 - MoChA (seed: T2) | 12.2 |
| | + CTC-ST | 10.5 |
| | LC-BLSTM-40+40 - MoChA (T5, seed: T2) | 11.3 |
| | + CTC-ST (T6) | 9.9 |

Table 1: TEDLIUM2 results. Quantity regularization is used.

4 Experiments

4.1 Experimental setup

We used the TEDLIUM release 2 (210 hours, lecture) [tedlium] and Librispeech (960 hours, reading) [librispeech] corpora for experimental evaluations. We extracted 80-channel log-mel filterbank coefficients computed with a 25-ms window size. The windows were shifted every 10 ms and zero-normalized for each training set using Kaldi [kaldi]. We performed 3-fold speed perturbation [speed_perturbation] on the TEDLIUM2 corpus with factors of 0.9, 1.0, and 1.1. We removed utterances having more than 1600 frames for GPU memory efficiency. Since (1) TEDLIUM2 has longer utterances in the test set (up to about 40 seconds), (2) the domain is conversation, and (3) the data size is much smaller, it can be regarded as a more challenging corpus for MoChA than Librispeech.

The encoders were composed of two CNN blocks followed by five layers of (LC-)BLSTM [lstm]. Each CNN block was composed of two layers of CNN having a filter followed by a max-pooling layer with a stride of , which resulted in -fold frame rate reduction. We set the number of cells in each (LC-)BLSTM layer to per direction. We summed up the LSTM outputs in forward and backward directions at each layer to reduce the input dimension of the subsequent (LC-)BLSTM layer [tuske2019advancing]. The memory cells were doubled when using the unidirectional LSTM encoder. The decoder was a single layer of unidirectional LSTM with -dimensional memory cells. For offline models, we used the location-based attention [chorowski2015attention]. We set the chunk size of MoChA to . The offset $r$ in Eq. (2) was initialized with . We used a vocabulary of 10k units based on the Byte Pair Encoding (BPE) algorithm [sennrich2015neural].

Optimization was performed using Adam [adam] with learning rate and it was exponentially decayed. We used dropout and label smoothing [label_smoothing] with probabilities and , respectively. was set to . We set to and on the TEDLIUM2 and Librispeech corpora, respectively. We used a -layer LSTM language model (LM) with memory cells for beam search decoding with a beam width of [shallow_fusion]. Scores were normalized by the number of tokens for MoChA. CTC scores were not leveraged during inference.

We implemented the models with PyTorch [pytorch]. Detailed hyperparameter settings during training and decoding are available at https://github.com/hirofumi0810/neural_sp.

4.2 Results

Table 1 shows the results for TEDLIUM2. For offline models, MoChA (T2) approached the performance of the global attention model (T1). Quantity regularization was essential for attaining suitable performance for the baseline MoChA. CTC-ST also improved the performance by a large margin, but it was less effective than quantity regularization when applied from scratch. This is because the scale of attention scores was not adequate in the early training stage. For streaming models, our proposed CTC-ST with the curriculum learning strategy significantly improved the performance of the unidirectional LSTM, LC-BLSTM-40+20, and LC-BLSTM-40+40 MoChA models by 12.0, 13.9, and 12.4% relative, respectively (Table 1). Note that quantity regularization was not combined with CTC-ST. Although a larger future context was helpful for boosting performance, the effectiveness of CTC-ST was orthogonal.
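The size of the streaming gains can be recomputed from the WERs in Table 1 as relative reductions. The short sketch below does this arithmetic; the helper name is hypothetical and the rounding to one decimal place is a presentation choice.

```python
def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# (baseline MoChA WER, WER with CTC-ST) taken from Table 1.
pairs = {
    "LSTM": (15.0, 13.2),
    "LC-BLSTM-40+20": (12.2, 10.5),
    "LC-BLSTM-40+40": (11.3, 9.9),
}
reductions = {
    name: round(relative_reduction(base, new), 1)
    for name, (base, new) in pairs.items()
}
print(reductions)
```

The three encoders see reductions of similar size, which is consistent with the observation that the benefit of CTC-ST is largely independent of the amount of future context.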

| | Model | Quantity regularization | CTC-ST | WER (%) |
|---|---|---|---|---|
| Streaming | LC-BLSTM-40+40 | | | 12.3 |
| | + Curriculum learning from T2 | | | 16.9 |

Table 2: Results of curriculum learning on the TEDLIUM2
| | Model | F | T | WER (%) |
|---|---|---|---|---|
| Offline | Transformer [karita2019comparative] | 30 | 40 | 8.1 |
| | BLSTM - Global attention [zeyer2019comparison] | N/A | N/A | 8.8 |
| | BLSTM - Global attention | - | - | 9.5 |
| | | 27 | 100 | 8.1 |
| Streaming | LC-BLSTM-40+40 - MoChA (seed: T2) | - | - | 11.3 |
| | | 27 | 100 | 12.8 |
| | | 13 | 50 | 11.2 |
| | + CTC-ST | - | - | 9.9 |
| | | 27 | 100 | 9.0 |
| | | 13 | 50 | 9.0 |

Table 3: Results with SpecAugment on the TEDLIUM2
Figure 2: WER distributions on the TEDLIUM2
| | Model | WER (%), clean | WER (%), other |
|---|---|---|---|
| Offline | BLSTM - Global attention [karita2019comparative] | 3.3 | 10.8 |
| | BLSTM - Global attention (L1) | 3.1 | 9.5 |
| | + SpecAugment (seed: L1) | 2.8 | 7.6 |
| | BLSTM - MoChA | 3.6 | 10.5 |
| | + Quantity regularization (L2) | 3.3 | 10.0 |
| Streaming | Transformer - CIF [cif] | 3.3 | 9.7 |
| | Transformer - Triggered attention [moritz2020streaming_icassp2020] | 2.9 | 7.5 |
| | PTDLSTM - Triggered attention [moritz2019streaming_asru2019] | 5.9 | 16.8 |
| | LSTM - MoChA + MWER [kim2020attention] | 5.6 | 15.6 |
| | LSTM - MoChA + {char, BPE}-CTC [garg2019improved] | 4.4 | 15.2 |
| | LC-BLSTM - sMoChA [online_hybrid_ctc_attention] | 6.0 | 16.7 |
| | LC-BLSTM - MTA [online_hybrid_ctc_attention_taslp2020] | 4.2 | 12.3 |
| | LSTM - MoChA (L3) | 5.3 | 14.5 |
| | + CTC-ST (seed: L3) | 4.7 | 13.6 |
| | LC-BLSTM-40+40 - MoChA (seed: L2) | 4.1 | 11.2 |
| | + SpecAugment (F=27, T=100) | 5.0 | 9.7 |
| | + SpecAugment (F=13, T=50) | 4.0 | 9.5 |
| | + CTC-ST | 3.9 | 11.2 |
| | ++ SpecAugment (F=27, T=100) | 3.6 | 9.2 |
| | ++ SpecAugment (F=13, T=50) | 3.6 | 9.4 |

Table 4: Librispeech results. Quantity regularization is used. SpecAugment is used.

Next, we investigated the effectiveness of the regularization terms and the curriculum learning strategy for the streaming LC-BLSTM-40+40 MoChA model, shown in Table 2. We used T2 as a seed model except for the first row and optimized the model with either CTC-ST, quantity regularization, or both. Curriculum learning was highly effective, and CTC-ST (T6) significantly outperformed the case using quantity regularization (T5). Combining the two did not lead to any further improvements, although it was more effective than the model with quantity regularization only. This is likely because CTC-ST and quantity regularization discretize attention scores similarly in the monotonic attention mechanism.

The combination of CTC-ST and SpecAugment is shown in Table 3. We used two time masks with time mask parameter $T$ and two frequency masks with frequency mask parameter $F$ in [specaugment]. We applied SpecAugment to MoChA in stage-2 only because training with SpecAugment from scratch did not converge. SpecAugment did not improve the performance of the MoChA models without the proposed regularization methods because the attention scores in Eq. (1) can easily collapse, as mentioned in Section 3.2. In contrast, with CTC-ST, SpecAugment led to an additional 9.1% relative improvement (9.9 to 9.0 in Table 3). Moreover, CTC-ST was robust to input mask size.

We plotted the WERs as a function of input lengths in Figure 2. The largest CTC-ST gains came from long utterances. The offline global attention model T1 did not have difficulty in recognizing long utterances, whereas the initial LC-BLSTM MoChA model T5 did. The proposed CTC-ST mitigated this problem (T6). The decision boundaries from the MoChA and CTC branches extracted from T5 and T6 are visualized in Figure 1. We found that the gap between the two boundaries was reduced, and the CTC spikes slightly shifted to the left.

Finally, the results for Librispeech are shown in Table 4. When using the unidirectional LSTM encoder for MoChA, we obtained 11.3% and 6.2% relative improvements with CTC-ST on the test-clean and test-other sets, respectively (L3 versus + CTC-ST in Table 4). As the training data size is much larger and input sequence lengths are shorter than in TEDLIUM2, the gains from quantity regularization for the BLSTM encoder and CTC-ST for the LC-BLSTM encoder were smaller, although both were still beneficial. CTC-ST was effective for obtaining gains from SpecAugment and led to additional improvements of % and % on the test-clean and test-other sets, respectively. The model parameter size was M and fixed through all experiments. Our optimal model requires only ms (= ms () + ms () + ms (CNN)) latency and the decoding complexity is linear.

5 Related work

CTC alignments are used in triggered attention for streaming E2E-ASR [moritz2019triggered_icassp2019, moritz2019streaming_asru2019, moritz2020streaming_icassp2020], in which encoder memories are truncated with every CTC spike and the global attention mechanism is applied subsequently. Our work differs in that (1) we do not use CTC spikes in the test phase, and (2) the decoding complexity of MoChA is linear while that of the triggered attention is quadratic but streamable.

Regarding manipulating monotonic attention, [inaguma2020streaming] uses the external framewise alignments extracted from the hybrid ASR system to reduce latency for token generation in MoChA. In contrast, our proposed CTC-ST does not require any external framewise alignments and is designed to improve recognition.

6 Conclusions

We proposed CTC-synchronous training (CTC-ST) to provide the monotonic attention mechanism in MoChA with reliable alignments extracted from CTC. By jointly training both sub-networks with the shared encoder and generating CTC alignments simultaneously, we enabled effective interaction between MoChA and CTC. Experimental evaluations revealed that CTC-ST significantly improved the performance of MoChA and greatly reduced the gap from the offline models. Further gains with SpecAugment were obtained when CTC-ST was applied, thus verifying its robustness to noisy alignments.