Streaming Attention-Based Models with Augmented Memory for End-to-End Speech Recognition

11/03/2020, by Ching-Feng Yeh, et al.

Attention-based models have been gaining popularity recently for their strong performance demonstrated in fields such as machine translation and automatic speech recognition. One major challenge of attention-based models is the need for access to the full sequence and the computational cost that grows quadratically with the sequence length. These characteristics pose challenges, especially for low-latency scenarios, where the system is often required to be streaming. In this paper, we build a compact and streaming speech recognition system on top of the end-to-end neural transducer architecture with attention-based modules augmented with convolution. The proposed system equips the end-to-end models with the streaming capability and reduces the large footprint of the streaming attention-based model using augmented memory. On the LibriSpeech dataset, our proposed system achieves word error rates of 2.7 on test-clean and 5.8 on test-other, which are to our knowledge among the lowest reported for streaming approaches so far.


1 Introduction

Sequence modeling is the core of speech recognition. For both more conventional “hybrid” systems [21, 24] and more recently popular “end-to-end” systems [2, 9], neural encoders are used to extract high-level embeddings as the representation of input sequences.

Recurrent neural networks (RNNs) are naturally effective as encoders for speech signals given the recurrent connection between embeddings at different time steps. Among variants of RNNs, long short-term memory (LSTM) [10] models are especially well known for their gated mechanism for capturing long- and short-term changes. However, the recurrent nature of RNNs also presents challenges. For example, RNNs connect the previous state h_{t-1} with the current state h_t recurrently. This way, the carried-over information is condensed into fixed-sized vectors, and the direct connections between distant frames are limited, thus making capturing long contexts difficult. Besides, the computation of h_t can only begin once the inference of h_{t-1} is finished; this makes the parallel computation of RNNs difficult.

In contrast to RNNs, attention-based models allow the computation to be fully parallel and explicitly connect the input tokens at any two positions directly. They have demonstrated significant success in fields such as machine translation [20] and speech recognition [6, 21, 24, 25], and are often referred to as “Transformer” models.

Although attention-based models have achieved several milestones in sequence modeling, there remain challenges to be tackled. In addition to accuracy, latency is another important metric for quality evaluation of speech recognition systems. For applications requiring low latency, such as live captioning or messaging, the system often needs to be “streaming”, i.e., transcribing received segments as they arrive rather than waiting until the end of the utterance. Streaming is a major challenge for attention-based models, as the extraction of the embedding at any position depends on all input tokens, i.e., access to the full input sequence is needed for inference to begin. Among various approaches proposed for enabling streaming [22, 25, 3, 17, 23], attention with augmented memory [22] demonstrated several advantages including accuracy, parameter efficiency and scalability. In addition, while being strong on long-context modeling, attention-based models lack explicit modeling of local patterns, as shown by the improvements from approaches such as positional encoding [3] for NLP tasks. Speech signals pose additional challenges, such as being continuous rather than discrete and typically having far more acoustic frames than word tokens. Approaches including weak-attention suppression [19] and convolution-augmented attention [6] take these characteristics into consideration and further improve attention-based models for speech recognition.

Each of the related works mentioned above has its own benefits and limitations. Specifically, the Conformer-Transducer [6] is accurate and compact but streaming is a challenge, while attention with augmented memory [22] and weak-attention suppression [19] were evaluated with hybrid systems, which commonly have a larger footprint. In this work, we combine these approaches to produce an integrated system that is accurate, compact, and streaming.

Figure 2: Streaming Attention with Augmented Memory

2 Conformer-Transducer

Neural transducer (RNN-T) [5] has demonstrated strong performance in the field of speech recognition and gained significant research interest [6, 9, 25, 23, 8]. Compared with the traditional hybrid framework, the neural transducer aims to model the transformation from speech signal to word tokens directly; therefore, the model becomes simpler, requires less human intervention, and is more compact in terms of system size. Compared with other end-to-end approaches, such as Transformer [20, 13] and Listen-Attend-and-Spell (LAS) [1], the major benefit of the neural transducer is its streaming design, which is crucial for low-latency scenarios.

Inside the neural transducer, the “encoder” (similar to the “acoustic model” in the hybrid framework) can be constructed with different neural network components, from the more traditional LSTM [9] to the recently popular attention-based modules [21, 25, 23]. Attention-based models are strong and effective in long-term temporal modeling. They are widely adopted in recent works that produced new milestones in the field [21, 24, 25] and are often referred to as “Transformer” models. While attention-based modules are strong on long-term modeling, they often struggle with localized and sequential patterns, which are particularly important given that the acoustic frames in speech are highly correlated. Techniques such as positional encoding [3] have addressed the issue, but the limitation remains. On the other hand, convolutional modules are naturally effective in extracting localized information, and recent works have shown improvements from the heterogeneous combination of attention-based modules and convolutional modules, i.e., the Conformer architecture [6]. In addition, convolutional modules have also demonstrated high parameter efficiency and a good tradeoff between accuracy and system size [8, 11].

A neural transducer whose encoder is constructed with the Conformer architecture (Conformer-Transducer) is accurate and effective for speech recognition. Similar to other attention-based models, the major limitation of the Conformer is the need for access to the full sequence in its attention modules, which blocks direct application to low-latency scenarios and prevents fully utilizing the streaming potential of the neural transducer. In this work, we adopted the Conformer-Transducer as a strong baseline system and applied (1) augmented memory and (2) weak-attention suppression to enable streaming while remaining competitive in accuracy with non-streaming versions.

Attribute          Transformer          Conformer
Code               (S)      (M)         (S)      (M)
Parameters (M)     10.9     30.5        10.3     27.9
Num. Layers        16       16          16       16
Layer Dim.         160      288         144      256
Attention Heads    4        4           4        4
Conv. Kernels      -        -           32       32

Table 1: Transformer/Conformer Model Hyper-parameters.

The model architecture of the encoder of the Conformer-Transducer in this work is illustrated in Fig. 1. Fig. 1(a) refers to the acoustic encoder component, where the acoustic frames first go through two VGG layers, each with subsampling factor 2, for reducing the sequence length and adding inter-frame correlation information [13]. Next, a linear layer applies a projection on the embedding dimension before the following stack of Conformer modules. The decomposition of the Conformer module is shown in Fig. 1(b), where a multi-head attention module followed by a convolutional module forms the core component; these are encapsulated by two macaron-like feed-forward modules and followed by a post layer norm. The feed-forward modules, multi-head attention modules and convolutional modules in Fig. 1(b) are further decomposed in Fig. 1(c), Fig. 1(d) and Fig. 1(e) respectively, following the design in [6]. The hyper-parameters for models within several target size constraints are listed in Table 1, where for the Transformer modules we increase the input dimensions to match the size of the corresponding Conformer modules for a fair comparison. We focused on compact models in this work and included only models with sizes around 10M and 30M parameters, coded as (S) and (M), respectively.
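To make the composition concrete, below is a minimal PyTorch sketch of one Conformer module as decomposed in Fig. 1(b)-(e): two macaron-style (half-weighted) feed-forward modules wrapped around a multi-head attention module and a convolution module, followed by a post layer norm. The default dimensions follow the Conformer (S) row of Table 1; the class names, the use of nn.MultiheadAttention, and the omission of relative positional encoding are simplifications of ours rather than the exact fairseq implementation.

```python
# Minimal sketch of one Conformer module (Fig. 1(b)-(e)); illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvModule(nn.Module):
    """Conformer convolution module: pointwise conv + GLU, depthwise conv,
    batch norm, Swish, and a final pointwise conv."""

    def __init__(self, dim, kernel_size=32, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.pointwise1 = nn.Conv1d(dim, 2 * dim, kernel_size=1)
        self.depthwise = nn.Conv1d(dim, dim, kernel_size, padding="same", groups=dim)
        self.bn = nn.BatchNorm1d(dim)
        self.pointwise2 = nn.Conv1d(dim, dim, kernel_size=1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                          # x: (batch, time, dim)
        y = self.norm(x).transpose(1, 2)           # -> (batch, dim, time) for Conv1d
        y = F.glu(self.pointwise1(y), dim=1)       # pointwise expansion + GLU
        y = F.silu(self.bn(self.depthwise(y)))     # depthwise conv, batch norm, Swish
        y = self.dropout(self.pointwise2(y))
        return y.transpose(1, 2)


class ConformerBlock(nn.Module):
    """Two macaron feed-forward modules around multi-head attention and a
    convolution module, followed by a post layer norm."""

    def __init__(self, dim=144, heads=4, kernel_size=32, ffn_mult=4, dropout=0.1):
        super().__init__()

        def ffn():
            return nn.Sequential(
                nn.LayerNorm(dim), nn.Linear(dim, ffn_mult * dim), nn.SiLU(),
                nn.Dropout(dropout), nn.Linear(ffn_mult * dim, dim), nn.Dropout(dropout))

        self.ffn1, self.ffn2 = ffn(), ffn()
        self.mha_norm = nn.LayerNorm(dim)
        self.mha = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.conv = ConvModule(dim, kernel_size, dropout)
        self.final_norm = nn.LayerNorm(dim)

    def forward(self, x):                          # x: (batch, time, dim)
        x = x + 0.5 * self.ffn1(x)                 # first (half-weighted) feed-forward
        y = self.mha_norm(x)
        x = x + self.mha(y, y, y, need_weights=False)[0]
        x = x + self.conv(x)                       # convolution module
        x = x + 0.5 * self.ffn2(x)                 # second (half-weighted) feed-forward
        return self.final_norm(x)                  # post layer norm
```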

3 Attention with Augmented Memory

The main challenge of attention-based models to be streaming is the need for access to the full input sequence for inference. To enable streaming, attention with augmented memory [22] was proposed, tackling the challenge from two aspects. First, for block-wise inference [4], the full input sequence is broken down into fix-sized segments with overlapping left context frames and right context frames. By having the lengths of the segments fixed, the computational cost in attention modules becomes constant for individual segments and linear for the entire sequence, which also enables the processing of long sequences. Second, to carry over temporal dependency across segments, an augmented memory bank is used, where segment-wise information is summarized and stored in the form of embeddings in slots in the augmented memory bank. The augmented memory bank demonstrated efficient and effective long-term modeling for speech signals [22] compared with other streaming approaches for attention-based models such as Transformer XL [3] and time-restricted self-attention [17].
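As a concrete illustration of the block-wise processing, the sketch below splits a sequence of acoustic frames into fixed-size central segments, each padded with overlapping left and right context frames, so that the attention cost per segment stays constant. The helper name is ours; the default 16/32/8 left/center/right sizes follow the configuration reported in Section 5.

```python
# Illustrative helper for block-wise processing: split a (time, dim) sequence
# into fixed-size central segments with overlapping left/right context frames.
import torch


def split_into_segments(frames, left=16, center=32, right=8):
    """frames: (time, dim) tensor -> list of (L_n, C_n, R_n) tensors."""
    segments = []
    for start in range(0, frames.size(0), center):
        c = frames[start:start + center]                   # central body C_n
        l = frames[max(0, start - left):start]             # left context L_n
        r = frames[start + center:start + center + right]  # right context R_n
        segments.append((l, c, r))
    return segments


if __name__ == "__main__":
    x = torch.randn(100, 80)   # 100 frames of 80-dim features
    for l, c, r in split_into_segments(x)[:3]:
        print(l.shape[0], c.shape[0], r.shape[0])   # context/body lengths per segment
```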

Fig. 2 illustrates how the attention module works with augmented memory, with the formulation in equations (1)-(5). The input sequence is broken down into segments of fixed length, padded with overlapping left/right contexts. For the n-th segment, L_n, C_n and R_n represent the left context, the central body and the right context of the segment, respectively, in the form of stacked embeddings, i.e., matrices. A summary embedding s_n is then computed by pooling over the central body C_n, providing the subsequent multi-head attention with additional information on segment-wise normalization. We followed the original augmented memory work [22] and used average pooling in this work.

    Q_n = W_Q [L_n, C_n, R_n, s_n]                    (1)
    K_n = W_K [M_{n-1}, L_n, C_n, R_n]                (2)
    V_n = W_V [M_{n-1}, L_n, C_n, R_n]                (3)
    m_n = Dropout( Σ_i α_i v_i )                      (4)
    M_n = [M_{n-1}, m_n]                              (5)

The query, key and value for multi-head attention are formed from these elements. L_n, C_n, R_n and s_n are concatenated and go through the projection W_Q to form the query Q_n. The concatenation of the memory bank M_{n-1}, L_n, C_n and R_n goes through the projections W_K and W_V separately to form the key K_n and the value V_n. The formulation in equations (1), (2) and (3) makes Q_n contain only information of the current segment, while K_n and V_n contain the information carried over in the augmented memory. This allows the attention module to model the attention from the current segment on the carried-over history and to generate output embeddings accordingly. To carry over the information of the current segment to the next, the attention probabilities α_i of the summary embedding s_n on the embeddings v_i in the value V_n are used as weights to aggregate those embeddings into the new memory slot m_n, as in equation (4), where dropout is used as the regularization function. Finally, the new memory slot m_n is appended to the existing memory bank M_{n-1} to form M_n, as in equation (5), to be used for the next segment. Note that the output embeddings generated also cover the left context L_n, the central body C_n and the right context R_n; they are propagated through the attention layers and only stripped out at the output of the acoustic encoder. This guarantees that the contexts stay constant with a growing number of layers, which makes the approach scalable and the latency controllable.
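The following is a minimal, single-head and unbatched sketch of the attention step for one segment, following equations (1)-(5) as written above. The variable names, the plain matrix projections, and the exact placement of dropout are assumptions of ours for illustration; the actual model is multi-headed, batched, and embedded inside the Conformer encoder.

```python
# Single-head, unbatched sketch of augmented-memory attention for one segment,
# following equations (1)-(5) above. Names and dropout placement are illustrative.
import torch
import torch.nn.functional as F


def augmented_memory_attention(L, C, R, memory, W_q, W_k, W_v, dropout_p=0.1):
    """L, C, R: (len, dim) context/body embeddings; memory: (slots, dim) bank M_{n-1}."""
    s = C.mean(dim=0, keepdim=True)                 # summary embedding s_n (average pooling over C_n)
    query_in = torch.cat([L, C, R, s], dim=0)       # eq. (1): query sees only the current segment
    kv_in = torch.cat([memory, L, C, R], dim=0)     # eqs. (2)-(3): key/value include the memory bank
    Q, K, V = query_in @ W_q, kv_in @ W_k, kv_in @ W_v
    att = F.softmax(Q @ K.t() / K.size(-1) ** 0.5, dim=-1)
    out = att @ V                                   # outputs for the [L_n, C_n, R_n, s_n] positions
    # eq. (4): the summary row's attention weights aggregate V into the new slot m_n
    m_new = F.dropout(att[-1:] @ V, p=dropout_p)
    new_memory = torch.cat([memory, m_new], dim=0)  # eq. (5): M_n = [M_{n-1}, m_n]
    return out[:-1], new_memory                     # drop the summary position from the outputs


if __name__ == "__main__":
    dim = 8
    W_q, W_k, W_v = (torch.randn(dim, dim) for _ in range(3))
    L, C, R = torch.randn(16, dim), torch.randn(32, dim), torch.randn(8, dim)
    M = torch.zeros(0, dim)                         # empty memory bank before the first segment
    out, M = augmented_memory_attention(L, C, R, M, W_q, W_k, W_v)
    print(out.shape, M.shape)                       # (56, 8) outputs and a 1-slot memory bank
```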

4 Weak-Attention Suppression (WAS)

The attention module models the distribution of attention probabilities between embeddings in parallel, enabling better long-term temporal modeling, with additional mechanisms such as positional encodings [3] to handle ambiguity between embeddings at different positions. Different from NLP tasks, speech signals come in continuous form and are typically significantly longer. For example, a 3-second utterance may contain 10 word tokens but 300 acoustic frames. This increases the difficulty of attention modeling due to the high correlation between acoustic frames [19], as the attention probability may be diluted among similar embeddings, while a localized and sparse distribution can be more robust. Weak-attention suppression (WAS) [19] was proposed to improve the attention probability distribution by forcing the model to dynamically skip the embeddings with low attention probabilities.

The suppression is a two-stage process: (1) threshold determination and (2) attention weight modification. For threshold determination, a dynamic threshold θ_i is selected based on the attention distribution for the i-th position of the query Q:

    θ_i = μ_i − γ σ_i                                 (6)

where μ_i and σ_i are the mean and standard deviation of the attention probabilities at the i-th position of Q, and γ is a user-specified hyper-parameter to control the level of suppression. For attention weight modification, the attention weights from position i to those positions j whose attention probabilities are lower than θ_i are set to negative infinity:

    w_{i,j} = −∞   if  α_{i,j} < θ_i                  (7)

where α_{i,j} and w_{i,j} are the attention probability and the attention weight from position i to position j. With the modified attention weights, the produced attention probabilities for suppressed positions become 0 while the entire distribution is re-normalized.
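A short sketch of the two-stage suppression applied to a matrix of raw attention weights: the per-query threshold of equation (6) is computed from the mean and standard deviation of the attention probabilities, and positions below it are masked to negative infinity as in equation (7) before re-normalization. The function name and the default gamma are illustrative choices of ours, not the setting used in the experiments.

```python
# Sketch of weak-attention suppression on raw attention weights, following
# eqs. (6)-(7). The default gamma is for demonstration only.
import torch
import torch.nn.functional as F


def weak_attention_suppression(weights, gamma=0.5):
    """weights: (num_queries, num_keys) raw attention weights w_{i,j}."""
    probs = F.softmax(weights, dim=-1)                   # attention probabilities alpha_{i,j}
    mean = probs.mean(dim=-1, keepdim=True)              # mu_i
    std = probs.std(dim=-1, keepdim=True)                # sigma_i
    threshold = mean - gamma * std                       # eq. (6): theta_i = mu_i - gamma * sigma_i
    masked = weights.masked_fill(probs < threshold, float("-inf"))  # eq. (7)
    return F.softmax(masked, dim=-1)                     # suppressed positions get probability 0


if __name__ == "__main__":
    w = torch.randn(4, 10)                               # 4 queries over 10 key positions
    print(weak_attention_suppression(w).sum(dim=-1))     # each row still sums to 1
```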

System                     Model                Size (M)   WER w/o LM            WER w/ LM
                                                           test-{clean,other}    test-{clean,other}
Hybrid                     Transformer [21]     80         -                     2.3 / 4.9
                           Transformer [24]     124        -                     2.1 / 4.2
Transducer                 Transformer [25]     139        2.4 / 5.6             2.0 / 4.6
                           Conformer (S) [6]    10.3       2.7 / 6.3             2.1 / 5.0
                           Conformer (M) [6]    30.7       2.3 / 5.0             2.0 / 4.3
                           Conformer (L) [6]    118.8      2.1 / 4.3             1.9 / 3.9
Transducer                 Transformer (S)      10.9       14.4 / 18.5           13.2 / 16.2
(Our Re-implementation)    Transformer (M)      30.5       9.4 / 11.4            8.8 / 10.3
                           Conformer (S)        10.3       3.0 / 6.8             2.4 / 5.4
                           Conformer (M)        27.9       2.5 / 5.5             2.2 / 4.7

Table 2: Comparison on Systems and Architectures on LibriSpeech [15] (Non-streaming).

5 Experiments

The models were trained and evaluated using an in-house extension of the PyTorch-based fairseq [14] toolkit. In all experiments, 80-dimensional log Mel-filter bank features with a 10ms frame shift and a 25ms frame width were used as input features. Speed perturbation and SpecAugment [16] were applied to enhance the robustness and overall accuracy of the systems.

For transducer systems, we built sentence piece models [12] with 1024 targets across experiments to ensure identical system sizes. Byte pair encoding (BPE) [18] was adopted as the segmentation algorithm. We focused on identifying the impact of the encoder (or the “acoustic encoder”) in the neural transducer and used identical architectures for the predictor (or the “label encoder”) and the joiner (or the “joint network”). For the predictor, the tokens are first represented by 256-dimensional embeddings before going through a single LSTM layer with 320 hidden nodes, followed by a linear projection to 640-dimensional features before the joiner. For the joiner, the combined embeddings from the encoder and the predictor first go through a tanh activation and then another linear projection to the target number of sentence pieces (1024). Relative positional encoding [3] was applied in both Transformer and Conformer models in the experiments. For models with augmented memory, we applied 16 frames as the left context, 32 frames as the central body, and 8 frames as the right context, which translates to a 320ms right context (excluding computational delay) in block-wise processing given the subsampling factor of 4 from the VGG layers. The same weak-attention suppression (WAS) hyper-parameter γ was applied for all models involved.
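For reference, the sketch below shows a predictor and joiner of the form described above: 256-dimensional token embeddings into a single 320-unit LSTM layer, a projection to 640 dimensions, and a joiner that combines encoder and predictor embeddings through a tanh activation and a final projection to the 1024 sentence-piece targets. The module names are ours, and combining the two embeddings by addition is an assumption for illustration rather than the exact fairseq implementation.

```python
# Sketch of the transducer predictor ("label encoder") and joiner described above.
# Module names are ours; combining encoder/predictor embeddings by addition is an
# assumption for illustration.
import torch
import torch.nn as nn


class Predictor(nn.Module):
    def __init__(self, vocab_size=1024, embed_dim=256, hidden=320, out_dim=640):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, num_layers=1, batch_first=True)
        self.proj = nn.Linear(hidden, out_dim)

    def forward(self, tokens, state=None):          # tokens: (batch, num_tokens)
        y, state = self.lstm(self.embed(tokens), state)
        return self.proj(y), state                  # (batch, num_tokens, out_dim)


class Joiner(nn.Module):
    def __init__(self, dim=640, vocab_size=1024):
        super().__init__()
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, enc, pred):                   # enc: (B, T, D), pred: (B, U, D)
        joint = torch.tanh(enc.unsqueeze(2) + pred.unsqueeze(1))   # (B, T, U, D)
        return self.out(joint)                      # logits over the 1024 sentence pieces


if __name__ == "__main__":
    predictor, joiner = Predictor(), Joiner()
    enc = torch.randn(2, 50, 640)                   # encoder output: 2 utterances, 50 frames
    pred, _ = predictor(torch.randint(0, 1024, (2, 7)))
    print(joiner(enc, pred).shape)                  # torch.Size([2, 50, 7, 1024])
```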

5.1 LibriSpeech: Description and Setup

The LibriSpeech dataset [15] contains about 960 hours of read speech for training and an additional 800M-word-token text-only corpus for building language models. Word error rates on the test-{clean,other} sets are reported. For hybrid systems, the standard 4-gram language model with a 200K vocabulary was used for all first-pass decoding, which contains roughly 144M n-gram probabilities, excluding the backoff weights. For transducer systems, the transcript of the train set was combined with the 800M text-only set to build LSTM-based neural network language models (NNLMs), similar to [6]. We experimented with NNLMs with a fixed number of layers but different numbers of hidden nodes per layer (3x512, 3x1024, 3x2048, and 3x4096) to evaluate the tradeoff between WER and parameter efficiency, while the largest model (3x4096) is used for demonstrating the best-case WERs for individual systems. During beam search, we interpolate NNLM probabilities with the raw probabilities of the hypotheses with a constant weight of 0.25 in an on-the-fly manner, referred to as “shallow fusion” [7], which is itself streaming and applicable to both streaming and non-streaming models.
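As a small illustration of the shallow fusion scoring described above, each partial hypothesis's transducer log-probability is interpolated with the NNLM log-probability using the constant weight of 0.25 during beam search; the function name and the example scores below are ours.

```python
# Minimal sketch of shallow-fusion scoring during beam search: the transducer
# log-probability of each partial hypothesis is interpolated with the NNLM
# log-probability using a constant weight (0.25 in this work).
def shallow_fusion_score(transducer_logprob, nnlm_logprob, lm_weight=0.25):
    """Return the fused score used to rank beam-search hypotheses."""
    return transducer_logprob + lm_weight * nnlm_logprob


# Example: re-rank two partial hypotheses with and without the NNLM.
hyps = [("the cat sat", -4.1, -2.0), ("the cats at", -3.9, -5.5)]
ranked = sorted(hyps, key=lambda h: shallow_fusion_score(h[1], h[2]), reverse=True)
print([h[0] for h in ranked])   # fusion flips the ranking toward the more fluent hypothesis
```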

5.2 Results on System and Architecture

Table 2 presents the comparison of model frameworks and architectures. All attention modules in these models are non-streaming, i.e., they need access to the full sequence for inference. Note that the system size for hybrid systems here includes only the acoustic model and excludes both the n-gram LM and the NNLM, while the “transducer” system size includes all components (encoder, predictor, and joiner [23]). With attention modules as the core of the acoustic model, hybrid systems achieved impressive accuracy on LibriSpeech together with both n-gram and neural language models, at the cost of high latency (non-streaming). From the results of [25], transducer-based models proved to be similarly strong on WERs while improving parameter efficiency by removing external language models (both the n-gram and neural ones). For example, the hybrid system in [24] gives 2.1 and 4.2 on test-{clean,other} with a 124M acoustic model, a 144M n-gram LM and a 351M neural LM, while the transducer system in [25] gives 2.4 and 5.6 with a 139M model alone. This demonstrates that parameter efficiency (or the overall system size) is one of the major advantages of transducer systems.

In addition, comparing the Conformer models in [6] with the Transformer models in [25], we see that the introduction of convolutional modules is significantly effective both in terms of accuracy and parameter efficiency, as seen from the 30.7M Conformer and the 139M Transformer results. Although we could not fully reproduce the results in [6], we observed similar trends by comparing Transformer and Conformer models on similar scales in our re-implementation, listed as “Transducer (Our Re-implementation)”. The Transformer models for comparison here follow the same architecture as the Conformer models illustrated in Fig. 1, with the convolutional modules removed and the dimensions increased to compensate for the model sizes, as listed in Table 1. It is evident that within the same range of model size, the Conformer models significantly outperform the Transformer models, demonstrating the benefit of the heterogeneous combination of attention-based and convolutional modules.

System                     Model                            Size (M)   Streaming   WER test-{clean,other}
Hybrid                     Transformer [24] (+NNLM)         124        No          2.1 / 4.2
                           Transformer [22]                 80         No          2.6 / 5.6
                           Transformer [22] + Aug. Mem.                Yes         2.8 / 6.7
Transducer                 Conformer (S) [6] (+NNLM)        10.3       No          2.1 / 5.0
                           Conformer (M) [6] (+NNLM)        30.7       No          2.0 / 4.3
Transducer                 Conformer (S)                    10.3       No          3.0 / 6.8
(Our Re-implementation)       + NNLM                                   No          2.4 / 5.4
                           Conformer (S) + Aug. Mem.        10.3       Yes         3.6 / 8.3
                              + WAS                                    Yes         3.4 / 8.0
                              + NNLM                                   Yes         3.1 / 6.6
                           Conformer (M)                    27.9       No          2.5 / 5.5
                              + NNLM                                   No          2.2 / 4.7
                           Conformer (M) + Aug. Mem.        27.9       Yes         3.1 / 6.5
                              + WAS                                    Yes         3.0 / 6.4
                              + NNLM                                   Yes         2.7 / 5.8

Table 3: Comparison on Streaming and Non-streaming Models on LibriSpeech [15].

5.3 Results on Streaming Models

Table 3 presents the comparison between streaming and non-streaming models. For hybrid systems, introducing augmented memory to the attention modules enables streaming and improves latency at the cost of increased WERs. Similar trade-offs were observed for transducer systems, for example, the degradation from 2.5/5.5 to 3.1/6.5 with the 27.9M Conformer model. We further applied weak-attention suppression (WAS) on top of the results with augmented memory. Although the magnitude of the relative improvement observed was less significant compared with the hybrid system results [22, 19], possibly because the convolutional modules in the Conformer cover overlapping functionality, weak-attention suppression showed consistent improvements and reduced the gap from non-streaming models. On top of that, shallow fusion with an NNLM further improves the results over the streaming hybrid system [22], from 2.8/6.7 to 2.7/5.8. In this comparison, both the hybrid and the transducer systems benefit from the 800M text-only corpus, and the language models involved are applied in a streaming fashion during beam search, so both remain low-latency.

5.4 Results on Shallow Fusion with NNLMs

From Table 3, it is clear that an external NNLM consistently helps on top of the transducer models alone. However, NNLMs also contribute significantly to the overall system size. For example, the NNLM used in Table 3 has 3 LSTM layers with 4096 hidden nodes each, translating to 345.3M parameters, which is quite disproportionate to the size of the transducer models (10.3M or 27.9M). For further investigation, we evaluated NNLMs of different sizes and plotted the results in Fig. 3 to demonstrate the trade-off between WERs and system sizes. The number of LSTM layers in the NNLM was kept the same, while the number of hidden nodes varies. The architectures evaluated include 3x512 (6.8M), 3x1024 (24.4M), 3x2048 (92.6M) and 3x4096 (345.3M). WERs are plotted against the combined system size of model and NNLM to show the parameter efficiency across models and NNLMs. We see that NNLMs of all sizes help, but the effectiveness does not grow linearly with the number of parameters. Since the computational cost and latency grow with the number of parameters in practice, this indicates that, in pursuit of compact and low-latency systems, the trade-off between word error rates and NNLM size is worth further investigation.

Figure 3: Comparison on NNLMs of Different Sizes.

From Fig. 3, we see that the improvement from NNLMs saturates more quickly for the larger 27.9M Conformer models, both streaming and non-streaming, compared with the smaller 10.3M ones. This shows that it is more efficient to allocate more parameters to the model than to the NNLM, given a constant budget on the overall system size. In addition, from the comparison between the streaming and non-streaming trends, the improvement from NNLMs appears consistent and is not entangled with whether the model is streaming or not. From the trade-offs shown in Fig. 3, a 27.9M Conformer model with a 3x1024 (24.4M) NNLM (52.3M combined) can achieve similar word error rates, with less than 5% relative degradation, compared with the results with a 3x4096 (345.3M) NNLM, at only about 15% of the combined size.

6 Conclusion

In this work, we investigated several cutting-edge technologies in speech recognition and analyzed each one's strengths and challenges. The Conformer-Transducer offers high accuracy, but streaming remains challenging. With augmented memory, the model propagates the contextual dependency from previous segments in the attention modules, which enables streaming. Weak-attention suppression focuses on localized pattern modeling to improve the attention modules. By integrating all three technologies, we combine their complementary benefits to compensate for each one's limitations. Specifically, the transducer framework offers a compact system size, the convolution-augmented attention modules and weak-attention suppression ensure high accuracy, and augmented memory unblocks streaming and reduces latency. On the test-{clean,other} sets of the widely used LibriSpeech dataset, with a budget of 320ms lookahead, we observed word error rates of 3.0/6.4 for the 27.9M-parameter system alone, and 2.7/5.8 for the same system with an external neural language model, both of which are, to our knowledge, the lowest error rates reported so far among systems with a similar range of latency and size. The integrated system demonstrates several major benefits, including high accuracy, high parameter efficiency, and low latency.

References

  • [1] W. Chan, N. Jaitly, Q. Le, and O. Vinyals (2016) Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4960–4964. Cited by: §2.
  • [2] C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina, et al. (2018) State-of-the-art speech recognition with sequence-to-sequence models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4774–4778. Cited by: §1.
  • [3] Z. Dai, Z. Yang, Y. Yang, J. G. Carbonell, Q. Le, and R. Salakhutdinov (2019) Transformer-xl: attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2978–2988. Cited by: §1, §2, §3, §4, §5.
  • [4] L. Dong, F. Wang, and B. Xu (2019) Self-attention aligner: a latency-control end-to-end model for asr using self-attention network and chunk-hopping. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5656–5660. Cited by: §3.
  • [5] A. Graves (2012) Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711. Cited by: Streaming Attention-Based Models with Augmented Memory for End-to-End Speech Recognition, §2.
  • [6] A. Gulati, J. Qin, C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, et al. (2020) Conformer: convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100. Cited by: Streaming Attention-Based Models with Augmented Memory for End-to-End Speech Recognition, §1, §1, §1, §2, §2, §2, Table 2, §5.1, §5.2, Table 3.
  • [7] C. Gulcehre, O. Firat, K. Xu, K. Cho, L. Barrault, H. Lin, F. Bougares, H. Schwenk, and Y. Bengio (2015) On using monolingual corpora in neural machine translation. arXiv preprint arXiv:1503.03535. Cited by: §5.1.
  • [8] W. Han, Z. Zhang, Y. Zhang, J. Yu, C. Chiu, J. Qin, A. Gulati, R. Pang, and Y. Wu (2020) ContextNet: improving convolutional neural networks for automatic speech recognition with global context. arXiv preprint arXiv:2005.03191. Cited by: §2, §2.
  • [9] Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. Wu, R. Pang, et al. (2019) Streaming end-to-end speech recognition for mobile devices. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6381–6385. Cited by: §1, §2, §2.
  • [10] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation. Cited by: §1.
  • [11] S. Kriman, S. Beliaev, B. Ginsburg, J. Huang, O. Kuchaiev, V. Lavrukhin, R. Leary, J. Li, and Y. Zhang (2020) Quartznet: deep automatic speech recognition with 1d time-channel separable convolutions. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), Cited by: §2.
  • [12] T. Kudo and J. Richardson (2018) Sentencepiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226. Cited by: §5.
  • [13] A. Mohamed, D. Okhonko, and L. Zettlemoyer (2019) Transformers with convolutional context for asr. arXiv preprint arXiv:1904.11660. Cited by: §2, §2.
  • [14] M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli (2019) Fairseq: a fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations, Cited by: §5.
  • [15] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur (2015) Librispeech: an asr corpus based on public domain audio books. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), Cited by: Streaming Attention-Based Models with Augmented Memory for End-to-End Speech Recognition, Table 2, §5.1, Table 3.
  • [16] D. S. Park, W. Chan, Y. Zhang, C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le (2019-09) SpecAugment: a simple data augmentation method for automatic speech recognition. Interspeech 2019. External Links: Link, Document Cited by: §5.
  • [17] D. Povey, H. Hadian, P. Ghahremani, K. Li, and S. Khudanpur (2018) A time-restricted self-attention layer for asr. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), Cited by: §1, §3.
  • [18] R. Sennrich, B. Haddow, and A. Birch (2016) Neural machine translation of rare words with subword units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). External Links: Link, Document Cited by: §5.
  • [19] Y. Shi, Y. Wang, C. Wu, C. Fuegen, F. Zhang, D. Le, C. Yeh, and M. L. Seltzer (2020) Weak-attention suppression for transformer based speech recognition. arXiv preprint arXiv:2005.09137. Cited by: Streaming Attention-Based Models with Augmented Memory for End-to-End Speech Recognition, §1, §1, §4, §5.3.
  • [20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: Streaming Attention-Based Models with Augmented Memory for End-to-End Speech Recognition, §1, §2.
  • [21] Y. Wang, A. Mohamed, D. Le, and et al. (2020-05) Transformer-based acoustic modeling for hybrid speech recognition. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). External Links: ISBN 9781509066315, Link, Document Cited by: §1, §1, §2, Table 2.
  • [22] C. Wu, Y. Wang, Y. Shi, C. Yeh, and F. Zhang (2020) Streaming transformer-based acoustic models using self-attention with augmented memory. arXiv preprint arXiv:2005.08042. Cited by: Streaming Attention-Based Models with Augmented Memory for End-to-End Speech Recognition, §1, §1, §3, §3, §5.3, Table 3.
  • [23] C. Yeh, J. Mahadeokar, K. Kalgaonkar, Y. Wang, D. Le, M. Jain, K. Schubert, C. Fuegen, and M. L. Seltzer (2019) Transformer-transducer: end-to-end speech recognition with self-attention. arXiv preprint arXiv:1910.12977. Cited by: §1, §2, §2, §5.2.
  • [24] F. Zhang, Y. Wang, X. Zhang, C. Liu, Y. Saraf, and G. Zweig (2020) Faster, simpler and more accurate hybrid asr systems using wordpieces. In Proc. Interspeech, Cited by: §1, §1, §2, Table 2, §5.2, Table 3.
  • [25] Q. Zhang, H. Lu, H. Sak, A. Tripathi, E. McDermott, S. Koo, and S. Kumar (2020) Transformer transducer: a streamable speech recognition model with transformer encoders and rnn-t loss. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7829–7833. Cited by: §1, §1, §2, §2, Table 2, §5.2, §5.2.