Recent research on automatic speech recognition (ASR) has focused on end-to-end (E2E) frameworks [1–11], which directly map incoming speech signals to character [12] or word targets [13, 14]. The E2E frameworks include encoder–decoder networks [1–5], connectionist temporal classification (CTC) [6, 7], and the recurrent neural network transducer (RNN-T) [8–11]. Among these frameworks, RNN-T-based approaches have shown promising results in terms of word error rate (WER) and decoding speed. Most studies on RNN-T [4, 15, 16] have been conducted within a single domain, such as LibriSpeech [17]; however, production-level ASR requires robust performance across different domains. It is therefore important to address the domain mismatch between training and inference.
In [18], the domain mismatch problem for RNN-T was investigated in depth, and two causes were identified. First, the encoder network in the RNN-T suffers from overfitting to the training domain, referred to as the in-domain. Second, the RNN-T is vulnerable to long-form utterances during inference because it is generally trained on short training segments. These two problems cause high deletion errors when decoding out-domain long-form utterances. To overcome them, multiple regularization methods (e.g., variational weight noise [19]) and dynamic overlapping inference (DOI), which splits long-form utterances into several overlapping segments, were proposed. Although these methods improved the WER for out-domain long-form utterances, the investigation was conducted only on LSTM-based encoder networks. Furthermore, the long segment length for DOI ( s) was not investigated, although the LSTM was designed to model long-term context information [20].
In [21], a Conformer-based encoder network was proposed for the RNN-T. The Conformer can effectively model local-global context information through its convolution and self-attention layers [22]. However, the Conformer in [21] was restricted to local self-attention instead of full-context self-attention in order to improve its generalization to long-form utterances.
In this study, we propose a generalization strategy for RNN-T with a Conformer-based encoder network. The main contributions of this study are as follows. (i) Sparse self-attention layers that exploit both local and global connections are designed for the Conformer: global connections that are robust to the domain mismatch problem are identified by pruning most of the redundant global connections while conserving the global connections that the model considers important. (ii) A state reset method for the prediction network is proposed to cope with long-form utterances by re-initializing the LSTM states of the prediction network when silence is detected during the decoding phase. In the experimental evaluations, we found that combining local and sparse global connections outperformed the local connections alone, and that the state reset method gave a further improvement on the out-domain test set.
II. Proposed Methods
The RNN-T consists of encoder, prediction, and joint networks. It transcribes $\mathbf{x} = (x_1, \dots, x_T)$ into $\mathbf{y} = (y_1, \dots, y_U)$, where $x_t$ is an acoustic frame and $y_u$ is an output token (e.g., a word-piece unit). To deal with the length difference between $\mathbf{x}$ and $\mathbf{y}$, the RNN-T adopts the blank token $\varnothing$, which enables it to decide whether to produce an output token at each step of the $T$-step decoding procedure. Further details of the RNN-T can be found in . In this study, a Conformer, an LSTM, and a feed-forward network are used for the encoder, prediction, and joint networks, respectively. The Conformer is a state-of-the-art architecture for the RNN-T's encoder network because it has a powerful local-global context modeling ability via its self-attention and convolution layers [22].
II-A. Sparse Self-Attention Layer for Encoder
The vanilla self-attention in the Conformer has full connectivity between queries and keys, and most of these connections are redundant [23–25]. Thus, pruning some connections (i.e., injecting sparsity into the self-attention layer) can improve the generalization ability of Conformer-based encoder networks. Improving the encoder network's generalization ability is crucial because the RNN-T's encoder network overfits the in-domain data more easily than the other RNN-T components (the prediction and joint networks). This leads to excessive deletion errors on out-domain data, specifically for long-form utterances [18]. In [21], only local self-attention was used to generalize the attention connectivity to long-form utterances. This is an appropriate sparsity pattern because a speech frame is highly correlated with its adjacent frames.
However, one of the strengths of the self-attention layer is that it can jointly learn both local and global patterns from the input features [26]. Furthermore, the Conformer-based encoder network can learn both linguistic and acoustic information because the RNN-T is trained in an end-to-end manner. Therefore, restricting the Conformer to local information only, as in [21], can limit its ability to model the linguistic information inherent in the speech signal, because some global information is important for linguistic modeling .
To validate whether the Conformer uses global information, we examine the attention behavior of a trained Conformer, as shown in Fig. 1. A minute out-domain utterance was used in this experiment. We found that unlimited attention drastically deteriorates the performance with high deletion errors, specifically for out-domain long-form utterances. Therefore, we empirically restrict our self-attention layers to attend only within the past and future 24 s in both the training and inference phases. In Fig. 1, all four layers pay high attention to local information, whereas the higher layers ( and ) also attend to some global information with a sparse attention pattern.
Based on the investigation in Fig. 1, we assume that using both local connections and a few important sparse global connections is more appropriate than using only local connections or a full connection, because it leverages the Conformer's local-global context modeling ability while maintaining its generalization ability. We therefore apply the proposed sparse self-attention layer only in the inference phase and train with full connections: if the sparse self-attention layer were applied in the training phase, the attention patterns observed in Fig. 1 could not be guaranteed, which has also been confirmed by .
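Given an input sequence $X = (x_1, \dots, x_n)$, the proposed sparse self-attention layer is constructed as follows:

$$\mathrm{Attend}(X, S) = \big(a(x_i, S_i)\big)_{i \in \{1, \dots, n\}}, \qquad a(x_i, S_i) = \mathrm{softmax}\!\left(\frac{(W_q x_i)\, K_{S_i}^{\top}}{\sqrt{d}}\right) V_{S_i},$$

$$K_{S_i} = (W_k x_j)_{j \in S_i}, \qquad V_{S_i} = (W_v x_j)_{j \in S_i},$$

where $S$ is the attention mask and $S_i$ is the set of indices of the input vectors attended by $x_i$; $W_q$, $W_k$, and $W_v$ are the weight matrices that transform a given $x_i$ into a query, key, and value; and $d$ is the inner dimension of the queries and keys. To leverage both the local and global connections, $S_i$ is designed as:

$$S_i = S_i^{l} \cup S_i^{g},$$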
where and are local and global masks, respectively. These are defined as and ;
is a hyperparameter, and is the averaged attention score, which is calculated as:
In the case of multi-head attention, we consider three types of global masks with the same local mask () for the -head attention mask to investigate the generalization ability according to the degree of sparsity: , , and , where is the number of attention heads. Regarding the three global attention types, will have the highest sparsity, whereas have the lowest. The final attention was performed as follows:
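$$\mathrm{attention}(X) = W_p \big(\mathrm{Attend}(X, S^{(h)})\big)_{h \in \{1, \dots, H\}},$$

where $W_p$ denotes the post-attention weight matrix, $S^{(h)}$ is the attention mask of the $h$-th head, and $H$ is the number of attention heads.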
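As a rough illustration of how a local mask and a sparse global mask can be combined, the following pure-Python sketch builds both index sets for a single query and restricts scaled dot-product attention to their union. The window form of the local mask, the top-k selection rule, and the names `local_mask`, `top_k`, and `masked_attention` are assumptions made for this example; they are not the definitions used in this study. Intersecting the per-head selections keeps only the connections that every head rates highly (the sparsest choice), whereas taking their union gives the densest one.

```python
import math


def local_mask(i, n, w):
    """Indices within w positions of query i (assumed symmetric window)."""
    return set(range(max(0, i - w), min(n, i + w + 1)))


def top_k(scores, k):
    """Indices of the k largest attention scores (assumed selection rule)."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return set(order[:k])


def masked_attention(query, keys, values, allowed):
    """Scaled dot-product attention of one query over the allowed keys only."""
    idx = sorted(allowed)
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, keys[j])) / scale for j in idx]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    weights = {j: e / total for j, e in zip(idx, exps)}
    dim = len(values[0])
    output = [sum(weights[j] * values[j][c] for j in idx) for c in range(dim)]
    return output, weights


# Toy attention scores of query 0 over six keys, one row per head.
head_scores = [
    [0.40, 0.20, 0.02, 0.03, 0.30, 0.05],
    [0.35, 0.10, 0.05, 0.15, 0.30, 0.05],
]
per_head = [top_k(row, 3) for row in head_scores]
shared_global = set.intersection(*per_head)  # kept by every head: sparsest
any_global = set.union(*per_head)            # kept by any head: densest

local = local_mask(0, 6, 1)
sparse_mask = local | shared_global
dense_mask = local | any_global

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [1.0, 0.0], [0.0, 0.0]]
values = [[float(j)] for j in range(6)]
output, weights = masked_attention([1.0, 0.0], keys, values, sparse_mask)
print(sorted(sparse_mask), sorted(dense_mask))  # [0, 1, 4] [0, 1, 3, 4]
```

In this toy case, key 4 lies outside the local window but is kept because both heads rank it among their top three, while keys 1 and 3 are each selected by one head only and therefore appear only in the denser union mask (key 1 is also retained by the local window).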
II-B. State Reset at the Silence for Prediction Network
The prediction network can also be vulnerable to out-domain long-form utterances in the inference phase because unseen linguistic context can accumulate excessively in the prediction network. To alleviate this problem, it can be useful to reset the prediction network's states at silent audio inputs during the inference phase. Thus, we propose a state reset method at the silence (SRS), as shown in Fig. 2. Given the encoder network outputs $h_1, \dots, h_N$, each $h_i$ is used for the beam search together with the previous hypotheses. CheckBlankToken outputs 'true' if the last tokens of all current hypotheses are blank tokens. If blank is 'false', reset is set to zero and i is increased by one for the next beam search. If blank is 'true', reset is increased by one. If reset is higher than a predefined threshold hyperparameter, the states of the prediction network are reset to zeros. We define consecutive blank tokens during the beam search as silence; thus, this threshold is related to the silence length. These procedures are repeated until i is equal to N.
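The control flow of the SRS can be sketched as below. This is only an illustration under stated assumptions: `beam_search_step` and `zero_state` are hypothetical callables standing in for the real beam search and the LSTM state initializer, the blank token id is assumed to be 0, and the counter is left running after a reset because the description above does not say otherwise. The toy step function simply treats each encoder output as the emitted token.

```python
BLANK = 0  # assumed id of the blank token


def check_blank_token(last_tokens):
    """True if the last token of every hypothesis is blank."""
    return all(tok == BLANK for tok in last_tokens)


def decode_with_srs(encoder_outputs, beam_search_step, zero_state, threshold):
    """Frame-by-frame decoding that zeroes the prediction state at silence."""
    state = zero_state()
    hyps = [[]]
    reset = 0            # number of consecutive all-blank frames
    reset_frames = []    # frame indices at which the state was reset
    for i, h in enumerate(encoder_outputs):
        hyps, state, last_tokens = beam_search_step(h, hyps, state)
        if check_blank_token(last_tokens):
            reset += 1
            if reset > threshold:
                state = zero_state()
                reset_frames.append(i)
        else:
            reset = 0
    return hyps, state, reset_frames


def toy_step(frame, hyps, state):
    """Hypothetical stand-in: the frame value itself is the emitted token."""
    if frame != BLANK:
        hyps = [hyp + [frame] for hyp in hyps]
        state = state + [frame]  # the toy state remembers emitted tokens
    last_tokens = [frame for _ in hyps]
    return hyps, state, last_tokens


hyps, state, reset_frames = decode_with_srs(
    [5, 0, 0, 0, 0, 7], toy_step, list, threshold=2
)
print(hyps, state, reset_frames)  # [[5, 7]] [7] [3, 4]
```

With a threshold of 2, the third consecutive blank frame triggers the first reset, so the context accumulated from token 5 is discarded before token 7 is emitted.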
II-C. Segment Methods for Long-Form Utterances
Out-domain long-form utterances whose lengths fall outside the range seen in the training phase can cause deletion errors during the inference phase [18]. We also observed this deletion problem with the Conformer RNN-T. To address it, we adopted two types of segmentation methods. The first is dynamic overlapping inference (DOI), which splits a long-form utterance into several segments of uniform length. The segments partially overlap to alleviate the discontinuity between them. Because the DOI guarantees the segment length, its outputs are easy for the RNN-T to generalize to in terms of length. However, the DOI can degrade the long-term context if it splits an utterance in the middle, even though there is an overlapped region between the segments. As our self-attention layer exploits some global information, we want to conserve the long-term context. Consequently, we also adopt end-point detection (EPD) as another segment method. The EPD is more likely to keep an utterance intact, but it cannot guarantee the utterance length: the utterances it detects can be longer than the desired length.
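The splitting step of the DOI can be illustrated with a short sketch. The default values follow the 16 s segment and 2 s overlap on each side reported in the experimental setup; the function name `doi_segments` and the clamping at the utterance boundaries are assumptions for this example, and the merging of hypotheses from overlapping regions is not covered.

```python
def doi_segments(duration, segment=16.0, overlap=2.0):
    """Return (start, end) windows in seconds covering the whole utterance.

    Each window is a `segment`-long core extended by `overlap` seconds on
    both sides and clamped to the utterance boundaries.
    """
    windows = []
    start = 0.0
    while start < duration:
        end = min(start + segment, duration)
        windows.append((max(0.0, start - overlap), min(duration, end + overlap)))
        start = end
    return windows


print(doi_segments(50.0))
# [(0.0, 18.0), (14.0, 34.0), (30.0, 50.0), (46.0, 50.0)]
print(doi_segments(10.0))
# [(0.0, 10.0)]
```

Interior windows reach the full 20 s, while an utterance shorter than one segment is returned as a single unsplit window.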
III. Experiments and Results
III-A. Experimental Setup
As a training set, we used our in-house dataset, which consists of 10M Korean speech utterances totaling 15k h. Most of them are related to the voice search domain of our voice assistant service. The training utterances are generally short: the and percentile lengths are 2.5 s and 5.9 s, respectively. Additional noise was mixed into the training set so that the overall SNR was between 5 dB and 20 dB. From the training set, approximately 1 h and 24 h of mutually exclusive utterances were randomly extracted for the validation and in-domain test sets, respectively. The remaining utterances were used for training. The length of the utterances in the in-domain test set ranged from 1 s to 10 s. For the out-domain test set, we collected approximately 24 h of broadcast videos (e.g., news, documentaries, and variety shows), and each video was between 3 min and 50 min long. All the datasets were anonymized and hand-transcribed.
The model implementation was based on the ESPnet end-to-end speech processing toolkit [27]. As input features, we used globally normalized 80-dimensional log Mel-filter bank coefficients and 3-dimensional pitch features (83 dimensions in total), computed with a 25 ms window shifted every 10 ms. The input features were first processed by a convolution subsampling layer, i.e., a two-layer convolutional neural network with 256 channels, a stride of two, and a kernel size of three, before being forwarded to the encoder network. The Conformer-based encoder network consists of 12 self-attention layers, and each layer has 1024 hidden units. We used a one-layer LSTM with 640 cells as the prediction network and a joint network with 640 hidden units. As output labels, 256 word pieces based on Jamo (the Korean alphabet) were used. The other model specifications and the training strategy can be found in [28].
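To make the effect of the convolution subsampling layer concrete, the sketch below computes the encoder frame count and frame shift for this front end. It assumes that the two convolutions are unpadded along the time axis, which the text does not state; the helper names are hypothetical.

```python
def conv_out_length(n, kernel=3, stride=2):
    """Output length of an unpadded convolution along the time axis."""
    return (n - kernel) // stride + 1


def subsampled_length(n_frames, layers=2):
    """Frame count after the two-layer convolution subsampling."""
    for _ in range(layers):
        n_frames = conv_out_length(n_frames)
    return n_frames


FRAME_SHIFT_MS = 10                          # input feature shift
encoder_shift_ms = FRAME_SHIFT_MS * 2 ** 2   # two stride-2 layers

frames_10s = subsampled_length(10_000 // FRAME_SHIFT_MS)
attention_range_frames = 24_000 // encoder_shift_ms

print(encoder_shift_ms)        # 40
print(frames_10s)              # 249
print(attention_range_frames)  # 600
```

Under this assumption, each encoder frame covers roughly 40 ms, so a 10 s utterance yields about 250 encoder frames and the 24 s attention range on each side corresponds to about 600 encoder frames.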
For the segment methods, the DOI and a voice activity detection [29]-based EPD were used. The DOI uses 16 s segments with a 2 s overlap before and after each segment; thus, the length of each utterance from the DOI is 20 s, similar to the setup in [18]. After applying the EPD, the utterance lengths in the in- and out-domain test sets were in the range of 0.4–2.48 s and 0.6–196.8 s, respectively. For the inference phase, we used a beam search in . The hyperparameters and were set to 40 and 15, respectively, as determined from our validation set. The character error rate (CER) was used as the evaluation metric.
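For reference, the CER is commonly computed as the character-level edit distance between the reference and the hypothesis divided by the reference length. The sketch below follows that common definition; whether spaces are counted and how the text is normalized are not specified here, so the example compares the raw strings as given.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate in percent, relative to the reference length."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)


print(cer("안녕하세요", "안녕하세"))  # 20.0 (one deleted character out of five)
```

A deletion-heavy hypothesis, as observed for long-form out-domain utterances, raises the CER through the deletion term of the distance.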
III-B. Experimental Results and Discussion
Table I compares three types of sparse global masks, (), (), and (), with a local mask (LM). The baseline uses full connections without masks. has the highest sparsity among the three types. On the in-domain test set, the baseline showed the lowest CER, whereas it exhibited the highest CER on the out-domain test set. Applying the LM to the baseline substantially improved the CER on the out-domain test set while degrading it on the in-domain test set. This implies that most of the global connections in the self-attention layers were fit to the training domain, whereas the local connections generalized relatively well to other domains. The addition of and to the LM resulted in a lower CER on the out-domain test set than the baseline, while keeping the in-domain CER similar to the baseline. However, both and showed higher CER than the LM on the out-domain test set, implying that and still have some global connections that overfit the training domain. Comparing and the LM, we can claim that some global context information captured by is valid in other domains. This indicates that the global connections that concurrently show high attention scores in all attention heads generalize, because using these connections improves the CER on both the in- and out-domain test sets. We refer to as the SGM for the rest of this study.
Note for Table I: The numbers in bold indicate the best result. EPD was used as the segment method.
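Table II: CER (%) by test set and segment method.

| Model | In-domain, DOI | In-domain, EPD | Out-domain, DOI | Out-domain, EPD |
|---|---|---|---|---|
| + SGM + SRS | 6.92 | 6.97 | 13.27 | 12.94 |
| + SGM(T) + SRS | 7.11 | 7.50 | 14.47 | 14.51 |

The numbers in bold indicate the best result.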
Table II compares the proposed methods (SGM and SRS) with the baseline and the LM for each segment method (DOI and EPD). LM(T) and SGM(T) denote that the LM and SGM are also applied in the training phase. When the SGM was added to the LM, the CER was lower than with the LM alone for all segment methods and test sets. To leverage the global context information, it is desirable to use an intact utterance without splitting. In the in-domain test set, none of the utterances were affected by the DOI because all of them were shorter than 20 s, whereas most of the utterances in the out-domain test set were split by the DOI. Thus, the SGM with EPD showed a lower CER on the out-domain utterances because the EPD split them less than the DOI did.
Applying the SRS improved the CER in all cases on the out-domain test set; thus, LM+SGM+SRS with EPD achieved 24.6% and 6.5% relative CER reductions compared to the baseline and the LM, respectively, while degrading the CER on the in-domain test set. This implies that the long-term linguistic context modeling of the prediction network overfits to the in-domain data; thus, breaking the long-term context into shorter contexts with the SRS can be helpful for generalization. Overall, LM(T) and LM(T)+SGM(T) showed worse CER, implying that restricting the context information via masks during training is ineffective.
In Table II, we observe that LM+SGM+SRS with DOI achieves only a 1.7% relative CER reduction compared to the LM. We assume that a DOI length of 20 s (DOI-20, a 16 s segment with a 2 s overlap on each side) is too short for the SGM to leverage the global context information. To validate this assumption, we performed experiments with various DOI lengths (20, 28, 38, and 48 s), as shown in Table III. In this experiment, the 2 s overlap was maintained, and only the segment length was increased. DOI-48 is the maximum length because our self-attention layers only consider the past and future 24 s, as described for Fig. 1. In Table III, LM+SGM+SRS shows further CER reduction as the DOI length increases; LM+SGM+SRS with DOI-48 achieves 18.4% and 5.0% relative CER reductions compared to the baseline and the LM, respectively. This result implies that the SGM and SRS are effective when the utterance length is sufficiently long.
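Table III: CER (%) on the out-domain test set for different DOI lengths.

| Model | DOI-20 | DOI-28 | DOI-38 | DOI-48 |
|---|---|---|---|---|
| + SGM + SRS | 13.27 | 13.38 | 12.97 | 12.82 |

The numbers in bold indicate the best result.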
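The relative CER reductions quoted in this section follow the usual formula, (reference − new) / reference. The sketch below applies it to the LM+SGM+SRS row of Table III and derives the segment length implied by each DOI length when the 2 s overlap on each side is kept; the function and variable names are illustrative only.

```python
def relative_reduction(reference_cer, new_cer):
    """Relative CER reduction in percent with respect to the reference."""
    return 100.0 * (reference_cer - new_cer) / reference_cer


# CER (%) of LM+SGM+SRS for each DOI length in seconds (Table III).
doi_cer = {20: 13.27, 28: 13.38, 38: 12.97, 48: 12.82}

OVERLAP_S = 2  # overlap kept on each side of a segment
segment_s = {total: total - 2 * OVERLAP_S for total in doi_cer}
gain = relative_reduction(doi_cer[20], doi_cer[48])

print(segment_s)       # {20: 16, 28: 24, 38: 34, 48: 44}
print(round(gain, 2))  # 3.39
```

Going from DOI-20 to DOI-48 therefore lowers the CER of LM+SGM+SRS by about 3.4% relative.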
In Fig. 3, we investigate the effect of the LM and LM+SGM as a function of the training epochs. All the masks were applied only during the inference phase. The baseline (no mask) showed the lowest CER throughout training on the in-domain test set and the highest CER on the out-domain test set. Further, the baseline exhibited a high CER variation on the out-domain test set throughout training, while showing a low CER variation on the in-domain test set. This indicates that the baseline suffers from a generalization problem across domains. In contrast, both the LM and LM+SGM showed stable CER decreases during training on both the in- and out-domain test sets. In the earlier training epochs (), there was no obvious CER difference between the LM and LM+SGM. In the later training epochs (), however, the CER difference between the LM and LM+SGM was consistently maintained on both the in- and out-domain test sets, implying that the model learns the local context information first and the global context information later.
- [1] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in Proc. ICASSP, May 2016, pp. 4960–4964.
- [2] J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-Based Models for Speech Recognition,” in Proc. NIPS, 2015, pp. 577–585.
- [3] S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi, “Hybrid CTC/attention architecture for end-to-end speech recognition,” IEEE J. Sel. Topics Signal Process., vol. 11, no. 8, pp. 1240–1253, 2017.
- [4] C.-C. Chiu et al., “State-of-the-art speech recognition with sequence-to-sequence models,” in Proc. ICASSP, 2018.
- [5] C.-C. Chiu and C. Raffel, “Monotonic chunkwise attention,” in Proc. ICLR, 2018.
- [6] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks,” in Proc. ICML, 2006, pp. 369–376.
- [7] A. Graves and N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in Proc. ICML, 2014, pp. 1764–1772.
- [8] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in Proc. ICASSP, 2013.
- [9] K. Rao, H. Sak, and R. Prabhavalkar, “Exploring Architectures, Data and Units for Streaming End-to-End Speech Recognition with RNN-Transducer,” in Proc. ASRU, 2017, pp. 193–199.
- [10] T. N. Sainath et al., “A streaming on-device end-to-end model surpassing server-side conventional model quality and latency,” in Proc. ICASSP, 2020.
- [11] J. Kim, Y. Lee, and E. Kim, “Accelerating RNN Transducer Inference via Adaptive Expansion Search,” IEEE Signal Process. Lett., vol. 27, pp. 2019–2023, Nov. 2020.
- [12] D. Amodei et al., “Deep Speech 2: End-to-End Speech Recognition in English and Mandarin,” in Proc. ICML, 2016, pp. 173–182.
- [13] H. Soltau, H. Liao, and H. Sak, “Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition,” in Proc. Interspeech, 2017, pp. 3707–3711.
- [14] K. Audhkhasi et al., “Direct Acoustics-to-Word Models for English Conversational Speech Recognition,” in Proc. Interspeech, 2017, pp. 959–963.
- [15] D. S. Park et al., “SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition,” in Proc. Interspeech, 2019.
- [16] C. Lüscher et al., “RWTH ASR systems for LibriSpeech: Hybrid vs attention,” in Proc. Interspeech, 2019.
- [17] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “LibriSpeech: an ASR corpus based on public domain audio books,” in Proc. ICASSP, Apr. 2015, pp. 5206–5210.
- [18] C.-C. Chiu et al., “RNN-T Models Fail to Generalize to Out-of-Domain Audio: Causes and Solutions,” in Proc. SLT, 2021, pp. 873–880.
- [19] A. Graves, “Practical variational inference for neural networks,” in Proc. NIPS, 2011, pp. 2348–2356.
- [20] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, Nov. 1997.
- [21] B. Li et al., “A Better and Faster end-to-end Model for Streaming ASR,” in Proc. ICASSP, 2021, pp. 5634–5638.
- [22] A. Gulati et al., “Conformer: Convolution-augmented Transformer for Speech Recognition,” in Proc. Interspeech, 2020.
- [23] R. Child, S. Gray, A. Radford, and I. Sutskever, “Generating Long Sequences with Sparse Transformers,” arXiv preprint arXiv:1904.10509, 2019.
- [24] G. Zhao et al., “Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection,” arXiv preprint arXiv:1912.11637, 2019.
- [25] B. Cui et al., “Fine-tune BERT with sparse self-attention mechanism,” in Proc. EMNLP-IJCNLP, 2019, pp. 3548–3553.
- [26] A. Vaswani et al., “Attention is all you need,” in Proc. NIPS, 2017, pp. 5998–6008.
- [27] S. Watanabe et al., “ESPnet: End-to-End Speech Processing Toolkit,” in Proc. Interspeech, Sep. 2018, pp. 2207–2211.
- [28] P. Guo et al., “Recent Developments on ESPnet Toolkit Boosted by Conformer,” arXiv preprint arXiv:2010.13956, 2020.
- [29] J. Kim and M. Hahn, “Voice Activity Detection Using an Adaptive Context Attention Model,” IEEE Signal Process. Lett., vol. 25, no. 8, pp. 1181–1185, Aug. 2018.