Recent progress in end-to-end (E2E) automatic speech recognition (ASR) models has narrowed the gap with state-of-the-art hybrid systems [google_sota_asr]. To make E2E models applicable to simultaneous interpretation in lecture and meeting domains, online streaming processing is necessary. For E2E models, connectionist temporal classification (CTC) [ctc_graves] and recurrent neural network transducer (RNN-T) [rnn_transducer] have been the dominant approaches and have reached the level of real applications [he2019streaming, sainath2020streaming]. Meanwhile, attention-based encoder-decoder (AED) models [chorowski2015attention, las] have demonstrated powerful modeling capability in offline tasks [s2s_comparison_google, s2s_comparison_baidu, rwth_end2end] and a number of streaming models have been investigated for RNN-based models [hou2017gaussian, tjandra2017local, lawson2018learning, adaptive_computation_steps, moritz2019triggered_icassp2019, hard_monotonic_attention, mocha].
Recently, the Transformer architecture [vaswani2017attention], based on self-attention and multihead attention, has been shown to outperform its RNN counterparts in various domains [karita2019comparative, zeyer2019comparison], and several streaming models have been proposed, such as triggered attention [moritz2020streaming_icassp2020], continuous-integrate-and-fire (CIF) [cif], hard monotonic attention [tsunoo2019towards, miao2020transformer], and other variants [tian2019synchronous]. Triggered attention truncates encoder outputs by using CTC spikes and performs an attention mechanism over all past frames. CIF learns acoustic boundaries explicitly and extracts context vectors from the segmented region. Therefore, these models have adaptive segmentation policies relying on acoustic cues only.
On the other hand, hard monotonic attention detects token boundaries on the decoder side by using lexical information as well. Thus, it is more flexible for modeling non-monotonic alignments and has been investigated in simultaneous machine translation (MT) [arivazhagan2019monotonic]. Recently, hard monotonic attention was extended to the Transformer architecture, named monotonic multihead attention (MMA), by replacing each encoder-decoder attention head in the decoder with a monotonic attention (MA) head [ma2019monotonic]. Unlike a single MA head used in RNN-based models, each MA head can extract source contexts at a different pace and learn complex alignments between input and output sequences. Concurrently, similar methods have been investigated for Transformer-based streaming ASR [tsunoo2019towards, miao2020transformer]. Miao et al. [miao2020transformer] simplified the MMA framework by equipping each decoder layer with a single MA head to truncate encoder outputs as in triggered attention and perform attention over all past frames. Tsunoo et al. [tsunoo2019towards] also investigated the MMA framework but resorted to using all past frames to obtain decent performance. However, looking back to the beginning of the input frames lessens the advantage of linear-time decoding with hard monotonic attention as the input length gets longer.
In this work, we investigate the MMA framework using restricted input context for the streaming ASR task. To perform streaming recognition with the MMA framework, every MA head must learn alignments properly, because the next token is not generated until all heads detect the corresponding token boundaries. If some heads fail to detect the boundaries until seeing the encoder output of the final frame, the next token generation is delayed accordingly. However, with a naïve implementation, we found that proper monotonic alignments are learnt in dominant MA heads only. To prevent this, we propose HeadDrop, in which a subset of heads is entirely masked out at random as a regularization during training to encourage all heads to learn alignments properly. Moreover, we propose to prune redundant MA heads in lower decoder layers to further improve consensus among heads on token boundary detection. Chunkwise attention [mocha] on top of each MA head is further extended to its multihead counterpart to extract useful representations and compensate for the limited context size. Finally, we propose head-synchronous beam search decoding to guarantee streamable inference.
Experimental evaluations on the Librispeech corpus show that our proposed methods effectively encourage MA heads to learn alignments properly, which improves ASR performance. Our optimal model enables stable streaming inference on other corpora as well without architecture modification.
2 Transformer ASR architecture
Our Transformer architecture consists of front-end CNN blocks followed by stacked encoder layers, and decoder layers [karita2019comparative]. A CNN block has two CNN layers with a filter followed by a ReLU activation with a channel size. Frame rate is reduced by a factor of after every CNN block. Each encoder layer is composed of a self-attention (SAN) sub-layer followed by a position-wise feed-forward network (FFN) sub-layer, wrapped by residual connections and layer normalization [ba2016layer]. A key component of SAN sub-layers is a multihead attention (MHA) mechanism, in which key, value, and query matrices are split into portions with a dimension $d_k$ after linear transformations and each head performs a scaled-dot attention mechanism: $\mathrm{softmax}(QK^\top/\sqrt{d_k})V$, where $K$, $Q$, and $V$ represent key, query, and value matrices on each head, respectively. The outputs from all heads are concatenated along the feature dimension followed by a linear transformation. An FFN sub-layer is composed of two linear layers with the inner dimension $d_{\mathrm{ff}}$, with a ReLU activation between them.
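As a minimal sketch of the multihead attention mechanism described above, the following pure-Python example splits already-projected query, key, and value matrices into heads, applies scaled-dot attention per head, and concatenates the results. The function names are hypothetical, and the input and output linear transformations are omitted for brevity, so this is an illustration under those assumptions rather than the implementation used in this work.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_attention(Q, K, V):
    """Q: [Tq][d], K: [Tk][d], V: [Tk][dv] -> context vectors [Tq][dv]."""
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out

def split_heads(X, n_heads):
    """Split the feature dimension of X into n_heads equal portions."""
    d = len(X[0]) // n_heads
    return [[row[h * d:(h + 1) * d] for row in X] for h in range(n_heads)]

def multihead_attention(Q, K, V, n_heads):
    """Per-head scaled-dot attention, concatenated along the feature axis."""
    heads = [scaled_dot_attention(q, k, v)
             for q, k, v in zip(split_heads(Q, n_heads),
                                split_heads(K, n_heads),
                                split_heads(V, n_heads))]
    return [sum((h[t] for h in heads), []) for t in range(len(Q))]
```

With a single key-value pair, every head attends to that pair with weight one, so the output reproduces the value vector.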
Each decoder layer differs from the encoder layer in that (1) an additional encoder-decoder attention sub-layer is inserted between the SAN and FFN sub-layers, and (2) causal masking is performed to prevent the decoder from attending to future tokens. We adopt three 1D-convolutional layers for positional embeddings [mohamed2019transformers]. The entire network is optimized by minimizing the negative log-likelihood and CTC loss with an interpolation weight [karita2019comparative].
3 Monotonic multihead attention (MMA)
3.1 Hard monotonic attention
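Hard monotonic attention was originally proposed for online linear-time decoding with RNN-based AED models [hard_monotonic_attention]. At output step $i$, the decoder scans encoder outputs from left to right and stops at an index $j$ (token boundary) to attend the corresponding single encoder output $h_j$. The decoder has options to stop at the current index or move forward to the next index. The next boundary is determined by resuming scanning from the previous boundary. As hard attention is not differentiable, the expected alignments $\alpha_{i,j}$ are marginalized over all possible paths during training as follows:
$$\alpha_{i,j} = p_{i,j} \sum_{k=1}^{j} \Big( \alpha_{i-1,k} \prod_{l=k}^{j-1} (1 - p_{i,l}) \Big), \tag{1}$$
$$p_{i,j} = \sigma(e^{\mathrm{mono}}_{i,j}), \quad e^{\mathrm{mono}}_{i,j} = \mathrm{MonotonicEnergy}(s_i, h_j), \tag{2}$$
where $p_{i,j}$ is a selection probability, $\sigma$ is the logistic sigmoid, and a monotonic energy function $\mathrm{MonotonicEnergy}$ takes the $i$-th decoder state $s_i$ and $j$-th encoder output $h_j$ as inputs. Whenever $p_{i,j} \geq 0.5$ is satisfied at test time, the binary decision $z_{i,j}$ is activated (i.e., set to 1).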
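The two regimes of hard monotonic attention can be sketched as follows, assuming the standard formulation of [hard_monotonic_attention]: a recursion that computes the expected alignment for one output step during training, and a left-to-right scan that stops at the first frame whose selection probability reaches a threshold at test time. The function names, the 0-based indexing, and the threshold default of 0.5 are assumptions made for this illustration.

```python
def expected_alignment(p_row, prev_alpha):
    """Expected alignment for one output step.

    p_row[j]      : selection probability at encoder index j for this step
    prev_alpha[j] : expected alignment of the previous output step
    Uses q[j] = (1 - p[j-1]) * q[j-1] + prev_alpha[j] and alpha[j] = p[j] * q[j],
    which unrolls to the sum-over-paths form of the expected alignment.
    """
    alpha, q = [], 0.0
    for j, p in enumerate(p_row):
        carried = (1.0 - p_row[j - 1]) * q if j > 0 else 0.0
        q = carried + prev_alpha[j]
        alpha.append(p * q)
    return alpha

def scan_boundary(p_row, start, threshold=0.5):
    """Test-time scan: first index >= start with p >= threshold, else None."""
    for j in range(start, len(p_row)):
        if p_row[j] >= threshold:
            return j
    return None
```

For `p_row = [0.5, 0.5]` and `prev_alpha = [1.0, 0.0]`, the expected alignment is `[0.5, 0.25]`, which sums to less than one: the alignment weights are not normalized over the input.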
3.2 Monotonic chunkwise attention (MoChA)
To relax strict input-output alignment by using the surrounding contexts, monotonic chunkwise attention (MoChA) introduces an additional soft attention mechanism on top of hard monotonic attention [mocha]. Given the boundary $j$, chunkwise attention is performed over a fixed window of width $w$ from there:
$$\beta_{i,j} = \sum_{k=j}^{j+w-1} \Bigg( \frac{\alpha_{i,k} \exp(u_{i,j})}{\sum_{l=k-w+1}^{k} \exp(u_{i,l})} \Bigg), \tag{3}$$
where $u_{i,j}$ is a chunk energy parameterized similarly to the monotonic energy in Eq. (2) using separate parameters. $\alpha_{i,j}$ in Eq. (3) is a continuous value during training, but is a binary value according to $z_{i,j}$ at test time.
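At test time the alignment is one-hot at the detected boundary, so chunkwise attention reduces to a softmax over the chunk energies inside a fixed window. The sketch below assumes, as in [mocha], that the window covers the frames ending at the boundary; the function name and the clipping at the first frame are assumptions for this illustration.

```python
import math

def chunkwise_weights(chunk_energy, boundary, width):
    """Soft attention weights over the `width` frames ending at `boundary`.

    chunk_energy : list of chunk energies, one per encoder frame
    boundary     : 0-based index selected by the monotonic attention head
    width        : chunk window size
    Frames outside the window receive zero weight.
    """
    start = max(0, boundary - width + 1)
    window = chunk_energy[start:boundary + 1]
    m = max(window)
    exps = [math.exp(u - m) for u in window]
    total = sum(exps)
    beta = [0.0] * len(chunk_energy)
    for offset, e in enumerate(exps):
        beta[start + offset] = e / total
    return beta
```

With equal energies, a boundary at index 3, and a window of 2, the weight is split evenly between frames 2 and 3.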
3.3 Monotonic multihead attention (MMA)
To keep the expressive power of the Transformer with the multihead attention mechanism while enabling online linear-time decoding, monotonic multihead attention (MMA) was proposed as an extension of hard monotonic attention [ma2019monotonic]. Each encoder-decoder attention head in the decoder is replaced with a monotonic attention (MA) head in Eq. (1) by defining the monotonic energy function in Eq. (2) as follows:
$$e^{\mathrm{mono}}_{i,j} = \frac{(W^Q s_i)^\top (W^K h_j)}{\sqrt{d_k}} + r, \tag{4}$$
where $W^K$ and $W^Q$ are parameter matrices applied to the encoder output $h_j$ and the decoder state $s_i$, $d_k$ is the head dimension, and $r$ is a learnable offset parameter (initialized with in this work). Unlike the case of a single MA head in Section 3.1, each MA head can attend to input speech at a different pace because the heads decide when to activate independently of each other at each output step. The side effect is that all heads must be activated to generate a token. Otherwise, activated heads must wait for the remaining non-activated ones to activate. This is a problem in the streaming scenario.
Furthermore, unlike previous works [ma2019monotonic, tsunoo2019towards] having one chunkwise attention (CA) head on each MA head, we extend it to the multi-head version having $H_{ca}$ heads per MA head to extract useful representations with multiple views from each boundary (chunkwise multihead attention). Assuming each decoder layer has $H_{ma}$ MA heads, the total number of CA heads is $H_{ma} \times H_{ca}$ at the layer. The chunk energy for each CA head is designed as in Eq. (4) without $r$.
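The waiting behaviour of independent MA heads can be illustrated with a small sketch: each head scans its own selection probabilities, and a token can only be emitted once the slowest head has activated. The function names, the list-based representation, and the 0.5 threshold are hypothetical choices for this example.

```python
def head_boundaries(p_heads, starts, threshold=0.5):
    """Scan each MA head independently from its previous boundary.

    p_heads[h][j] : selection probability of head h at encoder index j
    starts[h]     : index at which head h resumes scanning
    Returns one boundary per head, or None if the head never activates.
    """
    bounds = []
    for p_row, start in zip(p_heads, starts):
        hit = next((j for j in range(start, len(p_row))
                    if p_row[j] >= threshold), None)
        bounds.append(hit)
    return bounds

def emission_frame(bounds):
    """Frame at which the token can be emitted: the slowest head's boundary.

    Returns None when some head has not activated, i.e. the decoder must
    keep waiting for more input.
    """
    if any(b is None for b in bounds):
        return None
    return max(bounds)
```

For two heads that fire at frames 0 and 2, the token is emitted at frame 2; if one head never fires, no token can be emitted, which is the failure mode addressed in the following sections.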
4 Enhancing monotonic alignments
In Transformer models, there exist many attention heads and residual connections, so it is unclear whether all heads contribute to the final predictions. Michel et al. [michel2019sixteen] reported that most heads can be pruned at test time without significant performance degradation in standard MT and BERT [devlin2018bert] architectures. They also revealed that important heads are determined in the early training stage. Concurrently, Voita et al. [voita2019analyzing] reported similar observations by automatically pruning a part of the heads with an $L_0$ penalty [louizos2017learning]. In our preliminary experiments, we also observed that not all MA heads learn alignments properly in the MMA-based ASR models and that monotonic alignments are learnt only by dominant heads in upper decoder layers. Since $\alpha_{i,j}$ in Eq. (1) are not normalized over inputs so as to sum to one during training, context vectors from heads which do not learn alignments are more likely to become zero vectors at test time. This is a severe problem because (1) it leads to a mismatch between training and testing conditions, and (2) the subsequent tokens cannot be generated until all heads are activated. To alleviate this problem, we propose the following methods.
4.1 HeadDrop
We first propose a regularization method to encourage each MA head to equally contribute to the target task. During training, we stochastically zero out all elements in each MA head (i.e., ) with a fixed probability to force the other heads to learn alignments. The decision of dropping each head is independent of other heads regardless of the depth of the decoder. The output of an MMA function is normalized by dividing it by , where is the number of non-masked MA heads. We name this HeadDrop, inspired by dropout [dropout] and DropConnect [wan2013regularization].
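A minimal sketch of HeadDrop on the per-head context vectors of one decoder layer is given below. Each head is dropped independently, and the surviving heads are rescaled; the rescaling used here, dividing by the fraction of non-masked heads, is an assumption for illustration, as are the function name and the handling of the case where every head is dropped.

```python
import random

def head_drop(head_outputs, p_drop, rng):
    """Zero out whole MA heads at random during training.

    head_outputs : list of per-head context vectors (lists of floats)
    p_drop       : probability of dropping each head independently
    rng          : random.Random instance (passed in for reproducibility)
    Returns (masked_and_rescaled_outputs, keep_mask).
    """
    n_heads = len(head_outputs)
    keep = [rng.random() >= p_drop for _ in range(n_heads)]
    n_kept = sum(keep)
    if n_kept == 0:
        # Assumption: if every head is dropped, return all-zero outputs.
        return [[0.0] * len(h) for h in head_outputs], keep
    # Assumption: divide by the fraction of non-masked heads.
    scale = n_heads / n_kept
    out = [[x * scale for x in h] if k else [0.0] * len(h)
           for h, k in zip(head_outputs, keep)]
    return out, keep
```

At test time no head is dropped, which corresponds to `p_drop = 0.0` and leaves the outputs unchanged.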
4.2 Pruning monotonic attention heads in lower layers
Although HeadDrop is effective for improving the contribution of each MA head, we found that some MA heads in the lower decoder layers still do not learn alignments properly. Therefore, we propose to prune such redundant heads because they are harmful for streaming decoding. We remove the MMA function in the first decoder layers from the bottom () during both training and test time (see Figure 1). These layers have SAN and FFN sub-layers only and serve as a pseudo language model (LM). The total number of effective MA heads is . This method is also based on findings in [voita2019analyzing] that the lower layers of the Transformer decoder are mostly responsible for language modeling. Another advantage of pruning redundant heads is that inference speed is improved, which is effective for streaming ASR.
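The effect of pruning on the number of effective MA heads can be checked with simple arithmetic: only decoder layers that keep the MMA function contribute heads. In the sketch below the function name is hypothetical, and the figure of six decoder layers is inferred from Table 1 (24 heads in total at 4 heads per layer without pruning) rather than stated in the text.

```python
def effective_ma_heads(heads_per_layer, n_decoder_layers, n_pruned_layers):
    """Number of MA heads left after removing MMA from the lowest layers."""
    if not 0 <= n_pruned_layers <= n_decoder_layers:
        raise ValueError("n_pruned_layers must be in [0, n_decoder_layers]")
    return heads_per_layer * (n_decoder_layers - n_pruned_layers)

# Head counts for 0 to 4 pruned layers, assuming 4 heads per layer and
# 6 decoder layers (inferred from Table 1).
counts = [effective_ma_heads(4, 6, pruned) for pruned in range(5)]
```

Under these assumptions `counts` reproduces the totals listed for A1 to A5 in Table 1.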
5 Head-synchronous beam search decoding
During beam search in the MMA framework, failure of boundary detection in some MA heads in some beam candidates at an output step easily prevents the decoder from continuing streaming inference. This is because other candidates must wait for the hypothesis pruning until all heads in all candidates are activated at each output step. To continue streaming inference, we propose head-synchronous beam search decoding (Algorithm 1). The idea is to force non-activated heads to activate after a small fixed delay. If a head at the $l$-th layer cannot detect a boundary for $\epsilon$ frames after the leftmost boundary detected by other heads in the same layer, the boundary of such a head is set to the rightmost boundary among already detected boundaries at the current output step (line 16). Therefore, latency between the fastest (rightmost) and slowest (leftmost) boundary positions in the same layer is less than $\epsilon$ frames. We note that the decisions of boundary detection at the $l$-th layer are dependent on outputs from the $(l-1)$-th layer, and at least one head must be activated at each layer to generate a token. In the actual implementation, we search boundaries of all heads in a layer in parallel, so the loop in line 9 can be ignored. Moreover, we perform batch processing over multiple hypotheses in the beam and cache previous decoder states for efficiency. Note that head synchronization is not performed during training to maintain the divergence of boundary positions. Thus, synchronization can have the ensemble effect for boundary detection.
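The synchronization rule for the heads of a single layer can be sketched as follows. This is a simplified illustration, not Algorithm 1: the function name and arguments are hypothetical, beam candidates and batching are ignored, and the exact comparison used to decide that the delay has elapsed is an assumption.

```python
def synchronize_heads(bounds, current_frame, max_wait):
    """Force late heads in one layer to activate after a fixed delay.

    bounds        : per-head boundary index, or None if not yet detected
    current_frame : index of the newest encoder frame seen so far
    max_wait      : number of frames a head may lag behind the leftmost
                    detected boundary before being forced to activate
    """
    detected = [b for b in bounds if b is not None]
    if not detected:
        # At least one head must activate on its own; nothing to force yet.
        return list(bounds)
    leftmost, rightmost = min(detected), max(detected)
    if current_frame - leftmost < max_wait:
        return list(bounds)  # still within the allowed delay
    # Late heads adopt the rightmost boundary detected so far.
    return [rightmost if b is None else b for b in bounds]
```

With `bounds = [10, None, 12, None]` and `max_wait = 8`, nothing changes at frame 17, while at frame 18 the two late heads are assigned boundary 12.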
6 Experimental evaluations
6.1 Experimental setup
We used the 960-hour Librispeech dataset [librispeech] for experimental evaluations. We extracted 80-channel log-mel filterbank coefficients computed with a 25-ms window size shifted every 10 ms using Kaldi [kaldi]. We used a 10k vocabulary based on the Byte Pair Encoding (BPE) algorithm [sennrich2015neural]. For Transformer model configurations, we used (, , , , , , , , ) (, , , , , , , , ) for the baseline MMA models. The Adam [adam] optimizer was used for training with the Noam learning rate schedule [vaswani2017attention]. Warmup steps and a learning rate constant were set to and , respectively. We averaged model parameters at the last epochs for evaluation. Dropout and label smoothing [label_smoothing] were used with probabilities and , respectively. We set to . We used a 4-layer LSTM LM with memory cells for decoding with a beam width of . We used decoding hyperparameters (, ) (, ). Code is available at https://github.com/hirofumi0810/neural_sp.
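For readers unfamiliar with the Noam learning rate schedule [vaswani2017attention], the sketch below shows its usual form: a linear warmup followed by inverse-square-root decay, scaled by a constant. The function name is hypothetical, and the values used in the example (model dimension 256, 25000 warmup steps, constant 5.0) are placeholders chosen for illustration, not the settings of this work.

```python
def noam_lr(step, d_model, warmup_steps, factor):
    """Noam schedule: linear warmup, then inverse-square-root decay."""
    step = max(step, 1)  # avoid division by zero at step 0
    return factor * d_model ** -0.5 * min(step ** -0.5,
                                          step * warmup_steps ** -1.5)

# Placeholder values for illustration only.
peak = noam_lr(25000, d_model=256, warmup_steps=25000, factor=5.0)
early = noam_lr(1000, d_model=256, warmup_steps=25000, factor=5.0)
late = noam_lr(100000, d_model=256, warmup_steps=25000, factor=5.0)
```

The learning rate peaks at the end of warmup, so `early < peak` and `late < peak`.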
6.2 Evaluation measure of boundary detection
To assess consensus among heads for token boundary detection, we propose metrics to evaluate (1) how well each MA head learns alignments (boundary coverage) and (2) how often the model satisfies the streamable condition (streamability). These metrics are needed because, even if a better word error rate (WER) is obtained, the model cannot continue streaming inference if some heads do not learn alignments well; this aspect is not evaluated in [miao2020transformer, tsunoo2019towards].
6.2.1 Boundary coverage
During beam search, we count the total number of boundaries ( such that ) up to the -th output step averaged over all MA heads, , for every candidate in the -th utterance:
The boundary coverage is defined as the ratio of to the corresponding hypothesis length of the best candidate and averaged over utterances in the evaluation set as follows:
6.2.2 Streamability
The streamability is defined as the ratio of utterances satisfying a condition where over all candidates up to the -th output step (i.e., until generation of the best hypothesis is completed) as follows:
where is the delta function and is a hypothesis set at -th output step of the -th utterance. indicates that the model failed streaming recognition somewhere in the -th utterance, i.e., waited for seeing the encoder output of the last frame. However, we note that it does not mean the model leverages additional context.
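The two metrics can be sketched in simplified form as follows. This is an illustration under stated assumptions rather than the exact definitions: the data layout (per-head boundary counts and hypothesis length for the best candidate, and one boolean per utterance indicating whether all heads of all candidates activated before the last frame) and the function names are hypothetical.

```python
def boundary_coverage(utterances):
    """Average ratio of detected boundaries to hypothesis length, in percent.

    utterances : list of (per_head_boundary_counts, hypothesis_length)
                 for the best candidate of each utterance
    """
    ratios = []
    for counts, hyp_len in utterances:
        avg_boundaries = sum(counts) / len(counts)  # average over MA heads
        ratios.append(avg_boundaries / hyp_len)
    return 100.0 * sum(ratios) / len(ratios)

def streamability(streamable_flags):
    """Percentage of utterances decoded without waiting for the last frame.

    streamable_flags : one bool per utterance, True if every head in every
                       candidate activated at every output step
    """
    return 100.0 * sum(1 for ok in streamable_flags if ok) / len(streamable_flags)
```

For example, two utterances with per-head counts `[10, 10]` and `[5, 10]` against hypotheses of length 10 give a boundary coverage of 87.5, and three streamable utterances out of four give a streamability of 75.0.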
6.3 Offline ASR results
HeadDrop and pruning MA heads in lower layers improve WER and streamability: Table 1 shows the results for offline MMA models on the Librispeech dev-clean/other sets. Boundary coverage and streamability were averaged over the two sets. A naïve implementation, A1, performed very poorly. By pruning MA heads in lower layers, with an increasing number of pruned layers, WER was significantly reduced, but the boundary coverage was not so high (A5). The proposed HeadDrop also significantly improved WER, and as the number of pruned layers increased, the boundary coverage was drastically improved to almost 100%. We can conclude that MA heads in lower layers are not necessarily important. This is probably because (1) the modalities of the input and output sequences are different in the ASR task and (2) hidden states in lower decoder layers tend to represent the previous tokens and play the role of an LM, and thus are not suitable for alignment.
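Table 1: Results of offline MMA models on the Librispeech dev-clean/other sets (HD: HeadDrop).
|ID||Pruned lower layers||MA heads per layer||Total MA heads||HD||WER [%] (dev-clean / dev-other)||Boundary coverage [%]||Streamability [%]|
|A1||0||4||24||-||8.6 / 16.5||67.40||0.0|
|A2||1||4||20||-||7.3 / 16.3||79.02||0.0|
|A3||2||4||16||-||6.0 / 14.3||84.81||0.0|
|A4||3||4||12||-||5.2 / 12.6||83.84||0.0|
|A5||4||4||8||-||3.9 / 11.1||93.85||0.9|
|B1||0||4||24||✓||3.8 / 11.1||60.32||0.0|
|B2||1||4||20||✓||4.3 / 11.5||74.16||0.0|
|B3||2||4||16||✓||4.0 / 11.1||98.83||3.7|
|B4||3||4||12||✓||4.6 / 11.3||99.38||6.5|
|B5||4||4||8||✓||4.1 / 11.3||99.50||15.8|
|C1||0||1||6||✓||5.0 / 11.7||99.37||15.7|
|C2||0||2||12||✓||3.8 / 10.9||70.06||0.0|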
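Table 2: Results with head-synchronous beam search decoding on the Librispeech dev-clean/other sets.
|ID||Pruned lower layers||Chunk window size||CA heads per MA head||WER [%] (dev-clean / dev-other)||Boundary coverage [%]||Streamability [%]|
|B3||2||4||1||4.1 / 11.0||99.76||21.9|
|B4||3||4||1||4.2 / 10.9||99.81||25.4|
|B5||4||4||1||4.1 / 11.0||99.78||40.6|
|D1||2||16||1||3.3 / 9.9||99.82||37.5|
|D2||3||16||1||3.7 / 10.6||99.84||36.7|
|D3||4||16||1||3.5 / 10.3||99.96||60.4|
|E1||2||16||2||3.3 / 10.0||99.82||40.6|
|E2||3||16||2||3.6 / 10.3||99.93||51.3|
|E3||4||16||2||3.6 / 10.6||99.93||50.3|
|E4||2||16||4||3.3 / 9.8||99.93||77.9|
|E5||3||16||4||3.4 / 9.9||99.94||84.6|
|E6||4||16||4||3.7 / 10.2||99.94||63.3|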
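Table 3: Results of streaming MMA models on the Librispeech test sets, TEDLIUM2, and AISHELL-1.
|Type||Model||Librispeech (test-clean)||Librispeech (test-other)||TEDLIUM2||AISHELL-1|
| ||+ data augmentation||2.8||7.0||-||-|
| ||++ large model||2.5||6.1||-||-|
|Streaming||Triggered attention [moritz2020streaming]||2.8||7.2||-||-|
|Streaming||MMA () [tsunoo2019towards]||-||-||-||9.7|
|Streaming||MMA (narrow chunk)||3.5||11.1||11.0||6.9|
|Streaming||MMA (wide chunk)||3.3||10.7||10.2||6.1|
|Streaming||+ data augmentation||3.2||8.5||-||-|
|Streaming||++ large model||2.7||7.1||-||-|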
Multiple MA heads in each layer are necessary: Moreover, we examined the effect of the number of MA heads in each layer (C1, C2). C1, with only one MA head per layer, showed very high boundary coverage but substantially degraded WER. This result shows that a single head focuses on learning the alignment, but multiple MA heads are needed to improve accuracy. C2, with two heads, shows an intermediate tendency. This confirms that having multiple MA heads in each layer is effective. In addition, the cause of the alignment issue is the placement of MA heads in the decoder, not the total number of heads.
Head-synchronous beam search decoding improves streamability: Next, the results with head-synchronous beam search decoding are shown in Table 2. The synchronization delay $\epsilon$ is set to 8 in all models. Head-synchronous decoding effectively improved both boundary coverage and streamability. We found that, with a standard beam search, if a head cannot detect the boundary around the corresponding actual acoustic boundary, it tends to stop twice around the next acoustic boundary to compensate for the failure. Head-synchronous decoding alleviated this mismatch of boundary positions and led to a small WER improvement except for B3 on the dev-clean set.
Chunkwise multihead attention is effective: Furthermore, we increased the window size and the number of heads in chunkwise attention, both of which further improved WER. With four pruned layers, E3 and E6 did not benefit from a larger number of CA heads. Increasing the window size beyond 16 was not effective.
Here, what does the rest % for streamability in E5 account for? We found that the last few tokens corresponding to the tail part of input frames were predicted after head pointers on upper layers reached the last encoder output. For these % utterances, E5 was able to continue streaming decoding until % of the input frames on average. Since the tail part is mostly silence, this does not affect streaming recognition. In our manual analysis, we observed that MA heads in the same layer move forward at a similar pace, and the pace gets faster in upper layers (examples are available at https://hirofumi0810.github.io/demo/enhancing_mma_asr). This is because decoder hidden states are dependent on the output from the lower layer. Considering streamability performance, we will use the E5 setting for streaming experiments in the next section.
6.4 Streaming ASR results
Finally, we present the results of streaming MMA models for the Librispeech test sets in Table 3. We also included results on TEDLIUM2 [tedlium2] and AISHELL-1 [aishell] to check whether the optimal configuration tuned on Librispeech can work on other corpora as well. We adopted the chunk hopping mechanism [dong2019self] for the online encoder. Following [cif, miao2020transformer], we set the left/current (hop)/right chunk sizes to (narrow chunk) and (wide chunk) [ms]. We used speed perturbation [speed_perturbation] and SpecAugment [specaugment] for data augmentation, but speed perturbation was applied by default for TEDLIUM2 and AISHELL-1. For large models, we used (, , ) (, , ) and the other hyperparameters were kept unchanged. Head-synchronous decoding was used for all MMA models. We used CTC scores during inference only for standard Transformer models. Our streaming MMA models achieved results comparable to the previous works and standard Transformer models without looking back to the first input frame. In particular, our model outperformed the MMA model with [tsunoo2019towards] by a large margin. Increasing the model size was also effective. The streamabilities of the streaming MMA models on TEDLIUM2 and AISHELL-1 were 80.0% and 81.5%, respectively. This confirms that the E5 setting generalizes to other corpora without architecture modification.
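The chunk hopping mechanism used for the online encoder can be sketched as follows: the input is cut into non-overlapping current chunks, and each is encoded together with a limited amount of left and right context. The function name and the frame-based sizes in the example are hypothetical and chosen only for illustration; they are not the chunk sizes used in these experiments.

```python
def chunk_hopping(n_frames, left, current, right):
    """Split a frame sequence into chunks with left/right context.

    Returns a list of (context_start, chunk_start, chunk_end, context_end)
    tuples with end indices exclusive. The encoder sees frames
    [context_start, context_end) but only keeps the outputs for
    [chunk_start, chunk_end).
    """
    chunks = []
    for start in range(0, n_frames, current):
        end = min(start + current, n_frames)
        chunks.append((max(0, start - left), start, end,
                       min(n_frames, end + right)))
    return chunks

# Example with placeholder sizes: 10 frames, 2 left, 4 current, 1 right.
example = chunk_hopping(10, left=2, current=4, right=1)
```

Because each chunk only looks ahead by the right context, the encoder latency is bounded by the current and right chunk sizes regardless of the utterance length.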
7 Conclusion
We tackled the alignment issue in monotonic multihead attention (MMA) for online streaming ASR with HeadDrop regularization and head pruning in lower decoder layers. We also stabilized streamable inference by head-synchronous decoding. Our future work includes investigation of adaptive policies for head pruning and regularization methods to make the most of the MA heads instead of discarding them. Minimum latency training as done in MoChA [inaguma2020streaming] is another interesting direction.