1 Introduction
There has been significant progress in automatic speech recognition (ASR) technologies over the past few years due to the adoption of deep neural networks [1]. Conventionally, speech recognition systems involve individual components that explicitly model different levels of signal transformation: acoustic models for audio to acoustic units, a pronunciation model for acoustic units to words, and a language model for words to sentences. This framework is often referred to as the "traditional" hybrid system. The individual components in a hybrid system can be optimized separately. For example, CD-DNN-HMM [1] focuses on maximizing the likelihood between acoustic signals and acoustic models with frame-level alignments. For language modeling, both statistical n-gram models [2] and, more recently, neural-network-based models [3] aim to model purely the connection between word tokens.
Hybrid systems have achieved significant success [4] but also present challenges. For example, a hybrid system requires more human intervention in the building process, including the design of acoustic units, the vocabulary, the pronunciation model and more. In addition, an accurate hybrid system often comes at the cost of higher computational complexity and memory consumption, increasing the difficulty of deploying hybrid systems in resource-limited scenarios such as on-device speech recognition. Given these challenges, interest in end-to-end approaches for speech recognition has surged recently [5, 6, 7, 8, 9, 10, 11, 12]. Different from hybrid systems, end-to-end approaches aim to model the transformation from audio signal to word tokens directly, so the model becomes simpler and requires less human intervention. In addition to the simplicity of the training process, end-to-end systems have also demonstrated promising recognition accuracy [11]. Among the many end-to-end approaches, the recurrent neural network transducer (RNN-T) [5, 6] offers promising potential in footprint, accuracy and efficiency. In this work, we explore options for further improvements based on RNN-T.
Recurrent neural networks (RNNs) such as long short-term memory (LSTM)
[13] networks are good at sequence modeling and are widely adopted for speech recognition. RNNs rely on the recurrent connection from the previous state $h_{t-1}$ to the current state $h_t$ to propagate contextual information. This recurrent connection is effective but also presents challenges. For example, since $h_t$ depends on $h_{t-1}$, RNNs are difficult to compute in parallel. In addition, $h_t$ is usually of fixed dimension, which means all historical information is condensed into a fixed-length vector, making it difficult to capture long contexts. The attention mechanism [14, 15] was introduced recently as an alternative for sequence modeling. Compared with RNNs, the attention mechanism is non-recurrent and can easily be computed in parallel. In addition, the attention mechanism can "attend" to longer contexts explicitly. With the attention mechanism, the Transformer model [14] achieved state-of-the-art performance in many sequence-to-sequence tasks [15, 16].
In this paper, we explore options for applying Transformer networks in the neural transducer framework. VGG networks [17] with causal convolution are adopted to incorporate contextual information into the Transformer networks and to reduce the frame rate for efficient inference. In addition, we use truncated self-attention to enable streaming inference and reduce computational complexity.
2 Neural Transducer (RNN-T)
By nature, speech recognition is a sequence-to-sequence (audio-to-text) task in which the lengths of the input and output sequences can vary. As an end-to-end approach, connectionist temporal classification (CTC) [9] was introduced before RNN-T to model such sequence-to-sequence transformation. Given an input sequence $\mathbf{x} = (x_1, \ldots, x_T)$, where $x_t$ is an acoustic feature vector and $T$ is the input sequence length, and an output sequence $\mathbf{y} = (y_1, \ldots, y_U)$, where $y_u$ represent output symbols and $U$ is the output sequence length, CTC introduces an additional "blank" label $\emptyset$ and models the posterior probability of $\mathbf{y}$ given $\mathbf{x}$ by:

$$P(\mathbf{y} \mid \mathbf{x}) = \sum_{\hat{\mathbf{y}} \in \mathcal{B}^{-1}(\mathbf{y})} \prod_{t=1}^{T} P(\hat{y}_t \mid \mathbf{x}) \qquad (1)$$

where $\hat{\mathbf{y}} = (\hat{y}_1, \ldots, \hat{y}_T)$ correspond to any possible paths such that removing $\emptyset$ and repeated consecutive symbols from $\hat{\mathbf{y}}$ (the mapping $\mathcal{B}$) yields $\mathbf{y}$.
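As a concrete illustration (not code from the paper), the collapse mapping $\mathcal{B}$ can be sketched in a few lines; the function name and the choice of `"-"` as the blank symbol are our own:

```python
def collapse_path(path, blank="-"):
    """Apply the CTC mapping B: merge repeated consecutive symbols, drop blanks."""
    out = []
    prev = None
    for sym in path:
        # a symbol is emitted only if it differs from its predecessor and is not blank
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return out
```

For example, both paths `aa-ab` and `a-ab-` collapse to the output sequence `aab`, while the repeated `aa` without an intervening blank merges into a single `a`.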
The formulation of CTC assumes that symbols in the output sequence are conditionally independent of one another given the input sequence. The RNN-T model improves upon CTC by making the output symbol distribution at each step dependent on both the input sequence and the previous non-blank output symbols in the history:

$$P(\mathbf{y} \mid \mathbf{x}) = \sum_{\hat{\mathbf{y}} \in \mathcal{B}^{-1}(\mathbf{y})} \prod_{i} P(\hat{y}_i \mid \mathbf{x}, y_1, \ldots, y_{u_i - 1}) \qquad (2)$$

where $\hat{\mathbf{y}}$ correspond to any possible paths such that removing $\emptyset$ and repeated consecutive symbols from $\hat{\mathbf{y}}$ yields $\mathbf{y}$, and $y_1, \ldots, y_{u_i - 1}$ are the non-blank symbols emitted before $\hat{y}_i$. By explicitly conditioning the current output on the history, RNN-T outperforms CTC when no external language model is present [6, 7]. RNN-T can be implemented in the encoder-decoder framework, as illustrated in Fig. 1. The encoder encodes the input acoustic sequence $\mathbf{x}$ into $\mathbf{h}^{enc} = (h^{enc}_1, \ldots, h^{enc}_{T'})$ with potential subsampling ($T' \leq T$). The decoder contains a predictor that encodes the previous non-blank output symbol $y_{u-1}$ into $h^{pred}_u$ for the logits $z_{t,u}$ to condition on. It is worth noting that the input to the predictor is updated only when the most probable symbol is non-blank, so the conditioning encoding changes only when non-blank output symbols are observed. From the illustration, we see that RNN-T incorporates a language model over output symbols internally in the decoder.
There are many architectures that can be used as encoders and predictors. The functionality of these blocks is to take a sequence and find a higher-order representation. Recurrent neural networks (RNNs) such as LSTM [13] have been successfully used for such functionality. In this paper, we explore the Transformer [14, 15] as an alternative for sequence encoding in RNN-T. Since the Transformer is not recurrent in nature, we refer to the architecture illustrated in Fig. 1 simply as a "neural transducer" [18] for the rest of the paper.
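To make the update rule concrete, here is a hedged sketch of greedy neural-transducer decoding; `step_fn` is a hypothetical stand-in for the combined predictor and joiner, and, as described above, the predictor input is advanced only on non-blank emissions:

```python
import numpy as np

def greedy_transducer_decode(enc, step_fn, blank=0, max_symbols=10):
    """Greedy neural-transducer decoding sketch.

    enc: sequence of encoder frames.
    step_fn(frame, prev_token) -> logits over the vocabulary (blank at `blank`).
    """
    hyp, prev = [], blank  # the blank id stands in for a start-of-sequence token
    for frame in enc:
        for _ in range(max_symbols):   # several symbols may be emitted per frame
            logits = step_fn(frame, prev)
            tok = int(np.argmax(logits))
            if tok == blank:           # blank: advance to the next frame
                break
            hyp.append(tok)            # non-blank: emit and update predictor input
            prev = tok
    return hyp
```

With a toy `step_fn` that favors a frame's label only when it differs from the previous emission, the loop emits each new label once and otherwise predicts blank.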
3 Transformer
The attention mechanism [19] is one of the core ideas of the Transformer [15]. It was proposed to model correlation between contextual signals and has produced state-of-the-art performance in many domains including machine translation [15] and natural language processing [14]. Similar to RNNs, the attention mechanism aims to encode the input sequence into a higher-level representation, by formulating the encoding function in terms of the relationship between queries $Q$, keys $K$ and values $V$, and describing the similarities between them with:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V \qquad (3)$$

where $Q \in \mathbb{R}^{T_q \times d_k}$, $K \in \mathbb{R}^{T_k \times d_k}$ and $V \in \mathbb{R}^{T_k \times d_v}$. This mechanism becomes "self-attention" when $Q = K = V = \mathbf{x}$. A self-attention block encodes the input $\mathbf{x}$ into a higher-level representation $\mathbf{y}$, just like RNNs but without recurrence. Compared with RNNs, where $h_t$ depends on $h_{t-1}$, self-attention has no recurrent connections between time steps in the encoding, so it can generate encodings efficiently in parallel. In addition, compared with RNNs, where contexts are condensed into fixed-length states for the next time step to condition on, self-attention "pays attention" to all available contexts to better model the context within the input sequence.
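A minimal numpy sketch of Eq. (3) for the self-attention case ($Q = K = V = \mathbf{x}$), omitting the learned projections for clarity:

```python
import numpy as np

def self_attention(x, d_k=None):
    """Scaled dot-product self-attention (Eq. 3) with Q = K = V = x."""
    q, k, v = x, x, x                    # self-attention: projections omitted
    d_k = d_k or x.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)      # pairwise similarities, shape (T, T)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over key positions
    return weights @ v                   # weighted sum of values, shape (T, d)
```

Every output row is a convex combination of all input rows, which is why each time step can "attend" to the entire sequence, and why all rows can be computed in parallel with a single matrix product.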
3.1 Multi-Head Self-Attention
The attention mechanism can be further extended to multi-head attention, in which 1) the dimensions of the input sequences are split into multiple chunks with multiple projections, 2) each chunk goes through an independent attention mechanism, and 3) the encodings from all chunks are concatenated and then projected to produce the output encodings, as described with:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}, \quad \mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V}) \qquad (4)$$

where $h$ is the number of heads, $d$ is the dimension of the input sequence, $\mathrm{head}_i$ is the encoding generated by head $i$, $W_i^{Q}, W_i^{K} \in \mathbb{R}^{d \times d_k}$, $W_i^{V} \in \mathbb{R}^{d \times d_v}$ and $W^{O} \in \mathbb{R}^{h d_v \times d}$, with $d_k = d_v = d / h$. Multi-head attention integrates encodings generated from multiple subspaces into higher-dimensional representations [15].
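A self-contained sketch of Eq. (4); random matrices stand in for the learned projections $W_i^{Q}, W_i^{K}, W_i^{V}, W^{O}$, so this illustrates the dataflow (split into subspaces, attend, concatenate, project) rather than a trained model:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(x, n_heads, rng=None):
    """Multi-head self-attention (Eq. 4): split d into n_heads subspaces,
    attend independently in each, concatenate, then project back to d."""
    rng = rng or np.random.default_rng(0)
    T, d = x.shape
    assert d % n_heads == 0
    d_h = d // n_heads                       # per-head dimension d_k = d_v = d / h
    heads = []
    for _ in range(n_heads):
        # per-head projections (random stand-ins for learned weights)
        Wq, Wk, Wv = (rng.normal(size=(d, d_h)) for _ in range(3))
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        att = softmax(q @ k.T / np.sqrt(d_h)) @ v    # (T, d_h)
        heads.append(att)
    Wo = rng.normal(size=(d, d))                     # output projection W^O
    return np.concatenate(heads, axis=-1) @ Wo       # (T, d)
```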
3.2 Transformer Encoder
The Transformer [14] is also a sequence-to-sequence model. The architecture of the Transformer encoder contains three main blocks: 1) an attention block, 2) a feed-forward block and 3) layer normalization [20], as shown in Fig. 2(a). The attention block contains the core multi-head self-attention component. The feed-forward block projects the input dimension $d$ to another feature space $d_{ff}$ and then back to $d$ (usually $d_{ff} > d$) for learning feature representations. The final layer normalization and the additional components in the first two blocks, including layer norm and dropout, are added to stabilize model training and prevent overfitting. Furthermore, we use VGGNets to incorporate positional information into the Transformer, as illustrated in Fig. 2(b). More details are given in Section 4.1.
4 Transformer-Transducer
Given the success of the Transformer, we explore options for applying the Transformer in the neural transducer. For further improvement, we propose 1) using causal convolution for context modeling and frame rate reduction and 2) using truncated self-attention to reduce the computational complexity and enable streaming for the Transformer.
4.1 Context Modeling with Causal Convolution
The Transformer relies on multi-head self-attention to model contextual information. However, the attention mechanism is neither recurrent nor convolutional, and therefore risks losing the order or positional information in the input sequence [21, 15], which can harm performance, especially for language modeling. A simple way to incorporate positional information into the Transformer is to add positional encodings [15], but convolutional approaches [8] have demonstrated superior performance. In this paper we adopt the convolutional approach of [8] with modifications.
Convolutional networks model contexts by using kernels to convolve blocks of features. If we treat the input sequence (for example, acoustic features) as a two-dimensional image, then in common practice a centered kernel of (odd) width $k$ covers the frames from $x_{t-(k-1)/2}$ to $x_{t+(k-1)/2}$ to produce the convolved output $y_t$. The convolution therefore needs "future" information to generate the encoding for the current time step. For acoustic modeling this introduces additional look-ahead and latency, and introducing future information is impractical for language modeling, since the next symbol is unknown during inference.
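The causal variant discussed next can be made concrete with a small one-dimensional sketch; `causal_conv1d` is an illustrative helper (not the paper's implementation) that realizes causality by padding only on the history side:

```python
import numpy as np

def causal_conv1d(x, kernel):
    """Causal 1-D convolution: left-pad by k-1 so the output at time t
    depends only on x[t-k+1 .. t], never on future frames."""
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])   # history-only padding
    # y[t] = sum_i kernel[i] * x[t - i]
    return np.array([padded[t:t + k] @ kernel[::-1] for t in range(len(x))])
```

Because no future frame enters the window, changing a later input sample leaves all earlier outputs untouched, which is exactly the property needed for streaming and for language modeling.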
To prevent future information from leaking into the computation at the current time step, we use causal convolution, in which all required context is pushed into the history, as illustrated in Fig. 3(a). With causal convolution, a kernel of width $k$ covers the frames from $x_{t-k+1}$ to $x_t$ to produce the convolved output $y_t$, ensuring the convolution is purely "causal". Similar to [8], we also adopt the VGGNet [17] structure, as illustrated in Fig. 3(b), where two two-dimensional convolution layers are stacked sequentially, followed by a two-dimensional max-pooling layer. We use layers of the causal VGGNet to incorporate positional information and propagate it to the succeeding Transformer encoder layers. We refer to this network as the "VGG-Transformer" and illustrate the architecture used for the encoder in the neural transducer in Fig. 3(b), where the first two VGGNet layers incorporate positional information and reduce the frame rate for efficient inference, followed by a linear layer for dimension reduction and multiple Transformer encoder layers for generating higher-level representations.

4.2 Truncated Self-Attention
Unlimited self-attention attends to the whole input sequence and poses two issues: 1) streaming inference is impossible and 2) computational complexity is high. As illustrated in Fig. 4(a), for unlimited self-attention the output $y_t$ at time step $t$ depends on the entire input sequence $(x_1, \ldots, x_T)$, meaning inference can only begin after the final length $T$ is known. In addition, $y_t$ depends on the similarity pairs $(x_t, x_\tau)$ for all $\tau \in \{1, \ldots, T\}$, giving $O(T^2)$ complexity for computing $(y_1, \ldots, y_T)$. These issues are critical for self-attention to work in scenarios demanding low latency and low computation, such as on-device speech recognition [6].
To reduce both the latency and the computational cost, we replace unlimited self-attention with truncated self-attention, as illustrated in Fig. 4(b). Similar to time-delayed neural networks (TDNN) [22, 23], we limit the contexts available to self-attention so that the output at time $t$ depends only on $(x_{t-L}, \ldots, x_{t+R})$, where $L$ and $R$ are the left and right context sizes. Compared with unlimited self-attention, truncated self-attention is both streamable and computationally efficient. The look-ahead is the right context $R$, and the computational complexity reduces from $O(T^2)$ to $O(T(L+R+1))$. However, truncation also brings potential performance degradation, which we investigate further in the experiments.
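In practice, truncation can be implemented as a boolean mask over the attention score matrix; a minimal sketch (names are our own):

```python
import numpy as np

def truncated_attention_mask(T, left, right):
    """Boolean (T, T) mask: query position t may attend to key positions
    in [t - left, t + right].  left = right = inf recovers unlimited
    self-attention; right = 0 makes the layer purely causal."""
    t = np.arange(T)
    diff = t[None, :] - t[:, None]   # key index minus query index
    return (diff >= -left) & (diff <= right)
```

Disallowed positions are then typically set to a large negative value in the score matrix before the softmax, so they receive zero attention weight.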
5 Experiments
5.1 Corpus and Setup
We use the publicly available, widely used LibriSpeech corpus [24] for experiments. LibriSpeech comes with 960 hours of read speech data for training and four sets, {dev, test}-{clean, other}, for fine-tuning and evaluation. The clean sets contain high-quality utterances, whereas the other sets are more acoustically challenging. We use the dev-{clean, other} sets to fine-tune parameters for beam search and report results on the test-{clean, other} sets. We extract 80-dimensional log Mel-filter bank features every 10 ms as acoustic features and normalize them with the global mean computed from the training set. We also apply SpecAugment [25] with policy "LD" for data distortion. A sentence piece model [26] with 256 symbols is trained on the transcriptions of the training set and serves as the output symbol set. For each model, we use a learnable embedding layer to convert symbols into 128-dimensional vectors just before the predictor. The experiments are done using PyTorch [27] and Fairseq [28]. All models are trained on 32 GPUs with the distributed data parallel (DDP) mechanism. We use standard beam search with a beam size of 10 for decoding. The decoded sentence pieces are then concatenated into hypotheses and compared with the ground truth transcriptions for word error rate (WER) evaluation.

5.2 Model Architectures and Details
We compare architectures with roughly the same total number of parameters. For the encoder in the neural transducer, we evaluate the following options: 1) BLSTM 4x640: bidirectional LSTM with 4 layers of 640 hidden units in each direction, 2) LSTM 5x1024: LSTM with 5 layers of 1024 hidden units and 3) Transformer 12x: VGG-Transformer with 2 layers of VGGNets and 12 Transformer encoder layers. Each VGGNet layer contains 2 layers of two-dimensional convolution with 64 kernels of size 3x3. Each Transformer encoder layer takes 512-dimensional inputs, with 8 heads for multi-head self-attention and a feed-forward dimension of 2048. For efficient inference, all encoders generate output encodings every 60 ms. For LSTM/BLSTM this is achieved with the low frame rate approach [29], in which every three consecutive frames are stacked and subsampled to form a new frame, and a subsampling factor of 2 is applied to the output of the second LSTM/BLSTM layer [6]. For the VGG-Transformer we set the max-pooling on the time dimension to 3 for the first VGGNet and 2 for the second VGGNet, as illustrated in Fig. 2(b).
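The low-frame-rate stacking described above can be sketched as follows; `stack_frames` is an illustrative helper, with stacking and stride factors as parameters:

```python
import numpy as np

def stack_frames(feats, stack=3, stride=3):
    """Low-frame-rate stacking: concatenate every `stack` consecutive frames
    and subsample with `stride`, e.g. three 10 ms frames -> one 30 ms frame."""
    T, d = feats.shape
    out = []
    for t in range(0, T - stack + 1, stride):
        out.append(feats[t:t + stack].reshape(-1))   # (stack * d,) stacked frame
    return np.stack(out)
```

With `stack=3` and `stride=3`, a sequence of 10 ms frames becomes a sequence one third as long with three times the feature dimension; a further 2x subsampling inside the network then yields the 60 ms output rate.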
For the predictor in the neural transducer, we evaluate the following options: 1) LSTM 2x700: LSTM with 2 layers of 700 hidden units and 2) Transformer 6x: VGG-Transformer with 1 layer of VGGNet and 6 Transformer encoder layers. Both the VGGNet layer and the Transformer encoder layers share the same configuration as in the encoder case, with the exception that max-pooling is removed from the VGGNet. In addition, the right context for these Transformer encoder layers is 0, to prevent future information leakage.
For the joiner in the neural transducer, the outputs from the encoder and the predictor are joined with:

$$z_{t,u} = W_{out} \, \phi\!\left(W_{enc} h^{enc}_t + W_{pred} h^{pred}_u\right) \qquad (5)$$

where $W_{enc}$ and $W_{pred}$ project $h^{enc}_t$ and $h^{pred}_u$ to a common feature space of dimension $d_{joint}$, $\phi$ is an activation function and $W_{out}$ generates the logits $z_{t,u}$. We use the same choices for these components consistently across all experiments.

5.3 Results on Transformer/LSTM Combinations
We experimented with combinations of Transformer and LSTM networks for the neural transducer. The results are summarized in Table 1. For the encoder, we use LSTM 5x1024 as the streamable baseline, BLSTM 4x640 as the non-streamable baseline and Transformer 12x as the novel replacement for the two. For the predictor, we use the LSTM 2x700 and Transformer 6x options described in Section 5.2.
Table 1: WER (%) on the LibriSpeech test sets for combinations of encoder and predictor architectures.

encoder | predictor | # params | test-clean | test-other
(1) LSTM 5x1024 | LSTM 2x700 | 50.5 M | 12.31 | 23.16
(2) BLSTM 4x640 | LSTM 2x700 | 48.3 M | 6.85 | 16.90
(3) Transformer 12x | LSTM 2x700 | 45.7 M | 6.08 | 13.89
(4) LSTM 5x1024 | Transformer 6x | 67.1 M | 15.76 | 26.67
(5) BLSTM 4x640 | Transformer 6x | 64.9 M | 7.20 | 16.67
(6) Transformer 12x | Transformer 6x | 62.3 M | 7.11 | 15.62
From Table 1, given the same predictor configuration, we see that it is difficult for the LSTM encoder to perform well under the constraint on the number of parameters. The bidirectional LSTM (BLSTM) encoder, however, recovers the performance while remaining compact in size, at the cost of being non-streamable. The VGG-Transformer with unlimited self-attention significantly outperforms the BLSTM as the encoder and is also non-streamable. For the predictor, across all encoder configurations the LSTM network still gives better results than the VGG-Transformer and is smaller in size. As a result, we keep LSTM 2x700 as the predictor for the experiments in Section 5.4. It is worth noting that the VGG-Transformer loses the advantage of parallel computation when used as the predictor, since during beam search each hypothesis is extended by only one token per search step.
5.4 Results on Truncated Self-Attention
We evaluated the impact of the truncated self-attention contexts on recognition accuracy for the VGG-Transformer. As summarized in Section 5.3, the VGG-Transformer performs well as the encoder but not as the predictor, so we keep LSTM 2x700 as the predictor for the truncated self-attention experiments. The results are summarized in Table 2, where the left context $L$ and right context $R$ are applied per layer in the VGG-Transformer and aggregate through the layers.
Table 2: WER (%) on the LibriSpeech test sets for different per-layer left/right contexts in truncated self-attention.

Model Architecture | left context L | right context R | test-clean | test-other
(1) LSTM 5x1024 + LSTM 2x700 | inf | 0 | 12.31 | 23.16
(2) BLSTM 4x640 + LSTM 2x700 | inf | inf | 6.85 | 16.90
(3) Transformer 12x + LSTM 2x700 | inf | inf | 6.08 | 13.89
(4) Transformer 12x + LSTM 2x700 | inf | 0 | 12.32 | 23.08
(5) Transformer 12x + LSTM 2x700 | inf | 1 | 6.99 | 16.88
(6) Transformer 12x + LSTM 2x700 | inf | 2 | 6.47 | 15.79
(7) Transformer 12x + LSTM 2x700 | inf | 4 | 6.14 | 14.86
(8) Transformer 12x + LSTM 2x700 | inf | 8 | 5.99 | 14.17
(9) Transformer 12x + LSTM 2x700 | 4 | 4 | 6.84 | 17.38
(10) Transformer 12x + LSTM 2x700 | 8 | 4 | 6.69 | 16.79
(11) Transformer 12x + LSTM 2x700 | 16 | 4 | 6.57 | 15.92
(12) Transformer 12x + LSTM 2x700 | 32 | 4 | 6.37 | 15.30
Since the right context introduces algorithmic latency and has a major impact on recognition accuracy, to find optimal parameters for truncated self-attention we first search for the right context $R$ while keeping the left context $L$ unlimited, and then reduce the left context given the selected right context. From Table 2 we see that both $L$ and $R$ have a significant impact on performance, especially when $R = 0$ and the VGG-Transformer becomes purely causal. However, as $R$ increases, the WERs gradually recover and come close to the case of unlimited self-attention at $R = 8$. With a limited right context, the VGG-Transformer becomes streamable but is still $O(T^2)$ in computational complexity due to the unlimited left context. To keep reasonable performance while minimizing latency, we selected a right context of $R = 4$ and evaluated different left contexts. Similar to the right context, we see that the WER is also sensitive to the left context. With $L = 32$, the VGG-Transformer with truncated self-attention gives better WERs than both the LSTM and BLSTM baselines, losing only 4.7% on test-clean and 10.1% on test-other relative to the case of unlimited self-attention, while the system becomes streamable and efficient with $O(T)$ computational complexity.
6 Conclusion
In this paper, we explore options for using Transformer networks in the neural transducer for end-to-end speech recognition. The Transformer network uses self-attention for sequence modeling and can be computed in parallel. With causal convolution and truncated self-attention, the neural transducer with the proposed VGG-Transformer as the encoder achieves 6.37% WER on the test-clean set and 15.30% WER on the test-other set of the public LibriSpeech corpus, with a small footprint of 45.7 M parameters for the entire system. The proposed Transformer-Transducer is accurate, streamable, compact and efficient, and therefore a promising option for resource-limited scenarios such as on-device speech recognition.
References
 [1] George E Dahl, Dong Yu, Li Deng, and Alex Acero, “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30–42, 2011.
 [2] Peter F Brown, Peter V Desouza, Robert L Mercer, Vincent J Della Pietra, and Jenifer C Lai, “Class-based n-gram models of natural language,” Computational Linguistics, vol. 18, no. 4, pp. 467–479, 1992.
 [3] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černockỳ, and Sanjeev Khudanpur, “Recurrent neural network based language model,” in Eleventh annual conference of the international speech communication association, 2010.
 [4] Wayne Xiong, Jasha Droppo, Xuedong Huang, Frank Seide, Mike Seltzer, Andreas Stolcke, Dong Yu, and Geoffrey Zweig, “Achieving human parity in conversational speech recognition,” arXiv preprint arXiv:1610.05256, 2016.
 [5] Alex Graves, “Sequence transduction with recurrent neural networks,” arXiv preprint arXiv:1211.3711, 2012.
 [6] Yanzhang He, Tara N Sainath, Rohit Prabhavalkar, Ian McGraw, Raziel Alvarez, Ding Zhao, David Rybach, Anjuli Kannan, Yonghui Wu, Ruoming Pang, et al., “Streaming end-to-end speech recognition for mobile devices,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6381–6385.
 [7] Kanishka Rao, Haşim Sak, and Rohit Prabhavalkar, “Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer,” in 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2017, pp. 193–199.
 [8] Abdelrahman Mohamed, Dmytro Okhonko, and Luke Zettlemoyer, “Transformers with convolutional context for ASR,” arXiv preprint arXiv:1904.11660, 2019.

 [9] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006, pp. 369–376.
 [10] Alex Graves and Navdeep Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in International Conference on Machine Learning, 2014, pp. 1764–1772.
 [11] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 4960–4964.
 [12] Linhao Dong, Feng Wang, and Bo Xu, “Self-attention aligner: A latency-control end-to-end model for ASR using self-attention network and chunk-hopping,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5656–5660.
 [13] Sepp Hochreiter and Jürgen Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
 [14] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
 [15] Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio, “Attentionbased models for speech recognition,” in Advances in neural information processing systems, 2015, pp. 577–585.
 [16] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
 [17] Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
 [18] Eric Battenberg, Jitong Chen, Rewon Child, Adam Coates, Yashesh Gaur Yi Li, Hairong Liu, Sanjeev Satheesh, Anuroop Sriram, and Zhenyao Zhu, “Exploring neural transducers for endtoend speech recognition,” in 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2017, pp. 206–213.
 [19] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
 [20] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
 [21] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin, “Convolutional sequence to sequence learning,” in Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2017, pp. 1243–1252.
 [22] Alexander Waibel, Toshiyuki Hanazawa, Geoffrey Hinton, Kiyohiro Shikano, and Kevin J Lang, “Phoneme recognition using time-delay neural networks,” Backpropagation: Theory, Architectures and Applications, pp. 35–61, 1995.
 [23] Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur, “A time delay neural network architecture for efficient modeling of long temporal contexts,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
 [24] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, “LibriSpeech: an ASR corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210.
 [25] Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le, “SpecAugment: A simple data augmentation method for automatic speech recognition,” arXiv preprint arXiv:1904.08779, 2019.
 [26] Taku Kudo and John Richardson, “SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing,” arXiv preprint arXiv:1808.06226, 2018.
 [27] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer, “Automatic differentiation in pytorch,” in NIPSW, 2017.
 [28] Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli, “fairseq: A fast, extensible toolkit for sequence modeling,” in Proceedings of NAACL-HLT 2019: Demonstrations, 2019.
 [29] Golan Pundak and Tara N Sainath, “Lower frame rate neural network acoustic models,” Interspeech 2016, pp. 22–26, 2016.