In the past few years, models employing self-attention have been achieving state-of-the-art results in many tasks such as machine translation, language modeling, and language understanding [1, 2]. In particular, large Transformer-based language models have brought gains in speech recognition tasks when used for second-pass re-scoring and in first-pass shallow fusion [3]. As typically used in sequence-to-sequence transduction tasks [4, 5, 6, 7, 8], Transformer-based models attend over encoder features using decoder features, implying that the decoding has to be done in a label-synchronous way, thereby posing a challenge for streaming speech recognition applications. An additional challenge for streaming speech recognition with these models is that the number of computations for self-attention increases quadratically with the input sequence length. For streaming to be computationally practical, it is highly desirable that the time it takes to process each frame remains constant relative to the length of the input.
For streaming speech recognition models, recurrent neural networks (RNNs) have been the de facto choice, since they can model the temporal dependencies in the audio features effectively [9] while maintaining a constant computational requirement for each frame. Streamable end-to-end modeling architectures such as the Recurrent Neural Network Transducer (RNN-T) [10, 11, 12], Recurrent Neural Aligner (RNA) [13], and Neural Transducer [14] utilize an encoder-decoder framework where both the encoder and the decoder are layers of RNNs that generate features from audio and labels, respectively. In particular, the RNN-T and RNA models are trained to learn alignments between the acoustic encoder features and the label encoder features, and so lend themselves naturally to frame-synchronous decoding.
Previously, several optimizations were performed to enable running RNN-T on device, and extensive architecture and output unit searches were conducted for RNN-T [11]. In this paper, we explore the possibility of replacing the RNN-based audio and label encoders in the conventional RNN-T architecture with Transformer encoders. With a view to preserving model streamability, we show that Transformer-based models can be trained with self-attention restricted to a fixed number of past input frames and previous labels. This results in some degradation of performance (compared to attending to all past input frames and labels), but the model then satisfies a constant computational requirement for processing each frame, making it suitable for streaming. Given the simple architecture and parallelizable nature of self-attention computations, we observe large improvements in training time and training resource utilization compared to RNN-T models that employ RNNs.
The RNN-T architecture¹ (as depicted in Figure 1) is a neural network architecture that can be trained end-to-end with the RNN-T loss to map input sequences (e.g. audio feature vectors) to target sequences (e.g. phonemes, graphemes). Given an input sequence of real-valued vectors $\mathbf{x} = (x_1, \ldots, x_T)$ of length $T$, the RNN-T model tries to predict the target sequence of labels $\mathbf{y} = (y_1, \ldots, y_U)$ of length $U$.

¹We use “RNN-T architecture” or “RNN-T model” interchangeably in this paper to refer to the neural network architecture described in Eq. (3) and Eq. (4), and “RNN-T loss”, defined in Eq. (5), to refer to the loss used to train this architecture.
Unlike a typical attention-based sequence-to-sequence model, which attends over the entire input for every prediction in the output sequence, the RNN-T model produces a probability distribution over the label space at every time step, where the output label space includes an additional null (blank) label indicating the lack of output for that time step, similar to the Connectionist Temporal Classification (CTC) framework [15]. Unlike CTC, however, this label distribution is also conditioned on the previous label history.
The RNN-T model defines a conditional distribution $P(\mathbf{z}|\mathbf{x})$ over all possible alignments, where $\mathbf{z} = ((z_1, t_1), \ldots, (z_{\bar{U}}, t_{\bar{U}}))$ is a sequence of pairs of length $\bar{U}$, and $(z_i, t_i)$ represents an alignment between output label $z_i$ and the encoded feature at time $t_i$. The labels $z_i$ can optionally be blank labels (null predictions). Removing the blank labels gives the actual output label sequence $\mathbf{y}$, of length $U \le \bar{U}$.
We can marginalize $P(\mathbf{z}|\mathbf{x})$ over all possible alignments $\mathbf{z}$ to obtain the probability of the target label sequence $\mathbf{y}$ given the input sequence $\mathbf{x}$:

$$P(\mathbf{y}|\mathbf{x}) = \sum_{\mathbf{z} \in \mathcal{Z}(\mathbf{y}, T)} P(\mathbf{z}|\mathbf{x}), \quad (1)$$

where $\mathcal{Z}(\mathbf{y}, T)$ is the set of valid alignments of length $T$ for the label sequence $\mathbf{y}$.
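To make the marginalization in Eq. (1) concrete, the following toy sketch (illustrative only, not the paper's implementation) enumerates the set of valid alignments for a short target sequence by brute force; the blank symbol and the vocabulary here are assumptions of the example.

```python
import itertools

BLANK = "_"

def remove_blanks(z):
    # Removing the blank labels from an alignment recovers the label sequence y.
    return tuple(s for s in z if s != BLANK)

def valid_alignments(y, T, vocab):
    # Z(y, T) in Eq. (1): all length-T alignments (one symbol per frame)
    # whose non-blank subsequence equals the target y.
    return [z for z in itertools.product(vocab + (BLANK,), repeat=T)
            if remove_blanks(z) == tuple(y)]

aligns = valid_alignments("ab", T=4, vocab=("a", "b"))
print(len(aligns))  # 6: the ways to place 'a' before 'b' among 4 frames
```

Summing a model's probability over exactly this set of alignments is what Eq. (1) expresses; the forward algorithm described later computes the same sum efficiently.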
2 Transformer Transducer
2.1 RNN-T Architecture and Loss
In this paper, we adopt the monotonic RNN-T loss [16], which performs similarly to the original RNN-T loss [10] but constrains the model to output exactly one label per frame, resulting in a strictly monotonic alignment between audio and labels. For simplicity, in the following paragraphs, we use the term “RNN-T loss” to refer to the monotonic RNN-T loss.
With the monotonicity constraint, the probability of an alignment $\mathbf{z}$ can be factorized as

$$P(\mathbf{z}|\mathbf{x}) = \prod_{i=1}^{\bar{U}} P(z_i \mid \mathbf{x}, t_i, \mathrm{Labels}(z_{1:i-1})), \quad (2)$$

where $\mathrm{Labels}(z_{1:i-1})$ is the sequence of non-blank labels in $z_{1:i-1}$. The RNN-T architecture parameterizes $P(z_i \mid \mathbf{x}, t_i, \mathrm{Labels}(z_{1:i-1}))$
with an audio encoder, a label encoder, and a joint network. The encoders are two neural networks that encode the input sequence and the target output sequence, respectively. Previous work
has employed Long Short-Term Memory models (LSTMs) as the encoders, giving the RNN-T its name. However, this framework is not restricted to RNNs. In this paper, we are particularly interested in replacing the LSTM encoders with Transformers [1, 2]. In the following, we refer to this new architecture as the Transformer Transducer (T-T). As in the original RNN-T model, the joint network combines the audio encoder output at $t_i$ and the label encoder output given the previous non-blank output label sequence $\mathrm{Labels}(z_{1:i-1})$ as follows:

$$\mathrm{Joint} = \mathrm{Linear}(\mathrm{AudioEncoder}(\mathbf{x})_{t_i}) + \mathrm{Linear}(\mathrm{LabelEncoder}(\mathrm{Labels}(z_{1:i-1}))), \quad (3)$$

$$P(z_i \mid \mathbf{x}, t_i, \mathrm{Labels}(z_{1:i-1})) = \mathrm{Softmax}(\mathrm{Linear}(\tanh(\mathrm{Joint}))), \quad (4)$$
where each $\mathrm{Linear}$ function is a different single-layer feed-forward neural network, $\mathrm{AudioEncoder}(\mathbf{x})_{t_i}$ is the audio encoder output at time $t_i$, and $\mathrm{LabelEncoder}(\mathrm{Labels}(z_{1:i-1}))$ is the label encoder output given the previous non-blank label sequence.
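As a rough illustration of Eqs. (3) and (4), the joint network can be sketched in NumPy as below; all dimensions, the vocabulary size, and the initialization are placeholders for this sketch, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_audio, d_label, d_joint, vocab = 512, 512, 640, 29   # assumed sizes

W_a = rng.standard_normal((d_audio, d_joint)) * 0.02   # Linear(audio encoder output)
W_l = rng.standard_normal((d_label, d_joint)) * 0.02   # Linear(label encoder output)
W_o = rng.standard_normal((d_joint, vocab)) * 0.02     # output projection

def joint(audio_t, label_u):
    # Eq. (3): additive combination of the two projected encoder outputs.
    h = np.tanh(audio_t @ W_a + label_u @ W_l)
    # Eq. (4): softmax over the label space (including blank).
    logits = h @ W_o
    p = np.exp(logits - logits.max())
    return p / p.sum()

probs = joint(rng.standard_normal(d_audio), rng.standard_normal(d_label))
```

The key property is that the distribution for a given frame depends only on one audio encoder state and one label encoder state, which is what allows the forward algorithm below to share computation across alignments.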
Computing Eq. (1) by naively summing over all valid alignments is intractable. Therefore, we define the forward variable $\alpha(t, u)$ as the sum of probabilities for all paths ending at time-frame $t$ and label position $u$. We then use the forward algorithm [10, 17] to compute the last alpha variable $\alpha(T, U)$, which corresponds to $P(\mathbf{y}|\mathbf{x})$ as defined in Eq. (1). Efficient computation of $\alpha(T, U)$ using the forward algorithm is enabled by the fact that the local probability estimate (Eq. (4)) at any given label position and any given time-frame does not depend on the alignment path. The training loss for the model is then the sum of the negative log probabilities defined in Eq. (1) over all the training examples,

$$\mathcal{L} = -\sum_{i=1}^{N} \log P(\mathbf{y}_i \mid \mathbf{x}_i), \quad (5)$$

where $T_i$ and $U_i$ are the lengths of the input sequence $\mathbf{x}_i$ and the output target label sequence $\mathbf{y}_i$ of the $i$-th training example, respectively.
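The forward recursion described above can be sketched as a small dynamic program. The sketch below assumes the monotonic variant (exactly one symbol, label or blank, per frame) and takes precomputed log-probabilities as inputs; it is illustrative, not the paper's implementation.

```python
import numpy as np

def monotonic_rnnt_neg_log_likelihood(log_p_blank, log_p_label):
    """-log P(y | x) via the forward algorithm.

    log_p_blank[t, u]: log-prob of emitting blank at frame t after u labels.
    log_p_label[t, u]: log-prob of emitting the (u+1)-th target label at frame t.
    """
    T, U1 = log_p_blank.shape
    U = U1 - 1
    alpha = np.full((T + 1, U + 1), -np.inf)
    alpha[0, 0] = 0.0                       # empty prefix before any frame
    for t in range(1, T + 1):
        for u in range(0, min(U, t) + 1):
            stay = alpha[t - 1, u] + log_p_blank[t - 1, u]       # emit blank
            emit = (alpha[t - 1, u - 1] + log_p_label[t - 1, u - 1]
                    if u > 0 else -np.inf)                       # emit label u
            alpha[t, u] = np.logaddexp(stay, emit)
    return -alpha[T, U]                     # per-example term of Eq. (5)
```

For a toy case with T = 2, U = 1 and all local probabilities equal to 0.5, the two valid alignments each have probability 0.25, so the loss is $-\log 0.5 = \log 2$, which the recursion reproduces.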
The Transformer [1] is composed of a stack of multiple identical layers. Each layer has two sub-layers, a multi-headed attention layer and a feed-forward layer. Our multi-headed attention layer first projects the input to queries $Q$, keys $K$, and values $V$ for all the attention heads. The attention mechanism is applied separately for each head, and it provides a flexible way to control the context that the model uses: for example, we can mask the attention scores so that each frame attends only to the left of the current frame, producing output conditioned only on the previous state history. The weight-averaged values for all heads are concatenated and passed to a dense layer. We then employ a residual connection and layer normalization (LayerNorm) [18] on the output of the dense layer to form the final output of the multi-headed attention sub-layer (i.e. $\mathrm{LayerNorm}(x + \mathrm{AttentionLayer}(x))$, where $x$ is the input to the multi-headed attention sub-layer). We also apply dropout to the normalized attention probabilities and to the output of the dense layer to prevent overfitting. Our feed-forward sub-layer has two dense layers, with ReLU as the activation for the first. Again, dropout is applied to both dense layers for regularization, and a residual connection and LayerNorm are applied to the output of the second dense layer (i.e. $\mathrm{LayerNorm}(x + \mathrm{FeedForwardLayer}(x))$, where $x$ is the input to the feed-forward sub-layer). See Figure 2 for more details.
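The multi-headed attention sub-layer described above (scaled dot-product attention per head, concatenation, a dense layer, then residual plus LayerNorm) can be sketched as follows; dropout is omitted and all sizes are illustrative.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(scores):
    e = np.exp(scores - scores.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def mha_sublayer(x, Wq, Wk, Wv, Wo, n_heads, mask=None):
    T, d = x.shape
    dh = d // n_heads
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    heads = []
    for h in range(n_heads):
        s = slice(h * dh, (h + 1) * dh)
        scores = q[:, s] @ k[:, s].T / np.sqrt(dh)
        if mask is not None:
            # Disallowed positions get a large negative score before softmax.
            scores = np.where(mask, scores, -1e9)
        heads.append(softmax(scores) @ v[:, s])   # weight-averaged values
    out = np.concatenate(heads, axis=-1) @ Wo     # dense layer after concat
    return layer_norm(x + out)                    # residual + LayerNorm

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal((5, d))
W = [rng.standard_normal((d, d)) * 0.1 for _ in range(4)]
causal = np.tril(np.ones((5, 5), dtype=bool))     # left-context-only mask
y = mha_sublayer(x, *W, n_heads=2, mask=causal)
```

Passing a lower-triangular mask, as in the example, yields exactly the "attend only to the previous state history" behavior discussed in the text.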
Note that the LabelEncoder states do not attend to the AudioEncoder states, in contrast to the encoder-decoder attention in [1]. As discussed in the Introduction, such cross-attention poses a challenge for streaming applications. Instead, we implement the AudioEncoder and LabelEncoder of Eq. (3), which are LSTMs in conventional RNN-T architectures [10, 12, 11], using the Transformers described above. In tandem with the RNN-T architecture described in the previous section, the attention mechanism here only operates within the AudioEncoder or within the LabelEncoder, contrary to the standard practice for Transformer-based systems. In addition, to model sequential order, we use the relative positional encoding proposed in [2]. With relative positional encoding, the positional information only affects the attention scores instead of being added to the inputs. This allows us to reuse previously computed states, rather than recomputing all previous states and taking the last one in an overlapping inference manner, when the number of frames or labels that the AudioEncoder or LabelEncoder has processed is larger than the maximum length used during training (which would again be intractable for streaming applications). More specifically, the complexity of running one-step inference to obtain activations at time $t$ is $O(t)$: the cost of attending to $t$ previous states plus the feed-forward computation for the current step, when using relative positional encoding. With absolute positional encoding, by contrast, the encoding added to the input must be shifted by one whenever $t$ exceeds the maximum length used during training, which precludes reuse of the computed states and makes the per-step complexity $O(t^2)$. However, even with the complexity reduced from $O(t^2)$ to $O(t)$ by relative positional encoding, latency still grows over time. One intuitive solution is to limit the model to attend only to a moving window of states, making the one-step inference complexity constant.
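The moving-window idea can be sketched as a one-step inference routine that keeps a cache of previous states and attends over at most a fixed window of them, so the per-frame cost stays bounded; the projections and window size below are placeholders, and relative positional terms are omitted for brevity.

```python
import numpy as np

def one_step(cache, x_t, Wq, Wk, Wv, window=10):
    # Append the new frame's state, then attend over at most `window`
    # cached states: per-step cost is bounded by the window size.
    cache.append(x_t)
    ctx = np.stack(cache[-window:])
    q = x_t @ Wq
    scores = (ctx @ Wk) @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())
    w = w / w.sum()
    return w @ (ctx @ Wv)

rng = np.random.default_rng(1)
d = 4
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
cache = []
outs = [one_step(cache, rng.standard_normal(d), Wq, Wk, Wv, window=3)
        for _ in range(6)]
```

Because positional information does not enter the cached states in this scheme, old entries never need to be recomputed as the window slides forward.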
Note that training or inference with attention over a limited context is not possible for Transformer-based models that have attention from the decoder to the encoder, since such a setup is itself trying to learn the alignment. In contrast, the separation of the AudioEncoder and LabelEncoder, together with the fact that the alignment is handled by a separate forward-backward process within the RNN-T architecture, makes it possible to train with attention over an explicitly specified, limited context.
Table 1: Transformer encoder layer parameters.
  Input feature/embedding size: 512
  Dense layer 1: 2048
  Dense layer 2: 1024
  Number of attention heads: 4
3 Experiments and Results
3.1 Data

We evaluated the proposed model using the publicly available LibriSpeech ASR corpus [20]. The LibriSpeech dataset consists of 970 hours of audio data with corresponding text transcripts (around 10M word tokens) and an additional 800M-word-token text-only dataset. The paired audio/transcript data was used to train the T-T models and an LSTM-based baseline. The full 810M word tokens of text data were used for standalone language model (LM) training. We extracted 128-channel logmel energy values from a 32 ms window, stacked every 4 frames, and sub-sampled every 3 frames, to produce a 512-dimensional acoustic feature vector with a stride of 30 ms. SpecAugment-style feature augmentation [19] was applied during model training to prevent overfitting and to improve generalization, using only frequency masking and time masking.
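The frame stacking and subsampling described above can be sketched as follows, with shapes assumed from the text (128-dim logmel frames, a stack of 4, subsampling by 3 for a 30 ms output stride); this is an illustrative sketch, not the paper's pipeline code.

```python
import numpy as np

def stack_and_subsample(logmel, stack=4, subsample=3):
    # Concatenate each window of `stack` consecutive frames into one vector,
    # then keep every `subsample`-th stacked frame.
    n = logmel.shape[0] - stack + 1
    stacked = np.stack([logmel[i:i + stack].reshape(-1) for i in range(n)])
    return stacked[::subsample]

feats = stack_and_subsample(np.zeros((100, 128)))
print(feats.shape)  # (33, 512): 512-dim vectors at a 3x coarser frame rate
```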
3.2 Transformer Transducer
Our Transformer Transducer model architecture has 15 audio encoder layers and 2 label encoder layers. Every layer is identical for both the audio and label encoders; the details of each computation layer are shown in Figure 2 and Table 1. We also add dropout on each attention layer and between Transformer layers, to keep the model from overfitting quickly (it can overfit the relatively small training data within a few hours). We train this model to output grapheme units in all our experiments. We found that the Transformer Transducer models trained much faster than an LSTM-based RNN-T model with a similar number of parameters.
We first compared the performance of Transformer Transducer (T-T) models with full attention on audio to an RNN-T model using a bidirectional LSTM audio encoder. As shown in Table 2, the T-T model significantly outperforms the LSTM-based RNN-T baseline. We also observed that T-T models can achieve recognition accuracy competitive with existing wordpiece-based end-to-end models of similar model size. To compare with systems using shallow fusion [15, 21] with separately trained LMs, we also trained a Transformer-based LM with the same architecture as the label encoder used in T-T, using the full 810M word token dataset. This Transformer LM (6 layers; 57M parameters) was evaluated by its perplexity on the dev-clean set; the use of dropout, and of larger models, did not improve either perplexity or WER. Shallow fusion was then performed using that LM with both the trained T-T system and the trained bidirectional LSTM-based RNN-T baseline, with scaling factors on the LM output and on the non-blank symbol sequence length tuned on the LibriSpeech dev sets. The results are shown in Table 2 in the “With LM” column. The shallow fusion result for the T-T system is competitive with corresponding results for top-performing existing systems.
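As a hedged sketch of the decoding-time combination just described, shallow fusion interpolates the transducer's log-probability for a candidate label with a separately trained LM's log-probability, plus a reward on non-blank symbol sequence length; the weight values below are placeholders, not the tuned scaling factors.

```python
def fused_score(log_p_tt, log_p_lm, lm_weight=0.3, length_bonus=0.5):
    # Beam-search score for extending a hypothesis with one non-blank label:
    # transducer score + weighted LM score + length reward.
    return log_p_tt + lm_weight * log_p_lm + length_bonus
```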
Next, we ran training and decoding experiments using T-T models with limited attention windows over audio and text, with a view to building online streaming speech recognition systems with low latency. Similarly to the use of unidirectional RNN audio encoders in online models, where activations for time $t$ are computed conditioned only on audio frames before $t$, here we constrain the AudioEncoder to attend only to the left of the current frame by masking the attention scores to the right of the current frame. In order to make one-step inference for the AudioEncoder tractable (i.e. to have constant time complexity), we further limit its attention to a fixed window of previous states by again masking the attention scores. Due to limited computational resources, we used the same mask for all Transformer layers, but the use of different contexts (masks) for different layers is worth exploring. The results are shown in Table 3, where -1 means that the model uses the entire left or right context, and positive values give the number of states the model attends to, to the left or right of the current frame. As expected, using the entire audio history gives the lowest WER, but a wide limited state history preserves good performance: conditioning the model on the 10 previous states is only around 0.1% absolute worse than full left-context attention.
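The attention masks used in these experiments can be sketched as a boolean matrix, with -1 meaning unlimited context as in the tables; this is an illustrative sketch of the masking scheme, not the paper's code.

```python
import numpy as np

def context_mask(T, left, right):
    # mask[i, j] is True when position i may attend to position j.
    # left/right give the window sizes; -1 means unlimited context.
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    ok_left = (j >= i - left) if left >= 0 else np.ones((T, T), bool)
    ok_right = (j <= i + right) if right >= 0 else np.ones((T, T), bool)
    return ok_left & ok_right

m = context_mask(5, left=2, right=0)   # streaming: 2 past states, no future
```

Applying such a mask at every layer is all that distinguishes the streaming configurations in Tables 3 through 6 from the full-attention model.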
Similarly, we explored the use of limited right context to allow the model to see some future audio frames, in the hope of bridging the gap between a left-context-only T-T model (left = -1, right = 0) and a full-attention T-T model (left = -1, right = -1). Since we apply the same mask for every layer, the latency introduced by the right context is aggregated over all the layers. For example, as in Figure 3, for a 3-layer Transformer with one frame of right context to produce an output, it actually needs to wait for the frame 3 steps ahead to arrive, which is 90 ms of latency in our case. As Table 4 shows, with a single frame of future context per layer (450 ms of total latency), performance improves by more than 16% relative over the left-attention-only baseline, and with a right context of 6 frames per layer (2.7 s of latency), it is only around 10% relative worse than the full-attention T-T.
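Since the same right-context mask is applied at every layer, the total look-ahead latency is simply the product of the number of layers, the per-layer right context, and the frame stride (30 ms here):

```python
def lookahead_latency_ms(n_layers, right_context, frame_stride_ms=30):
    # Each layer's look-ahead compounds on top of the layers below it.
    return n_layers * right_context * frame_stride_ms

print(lookahead_latency_ms(3, 1))    # 90 ms: the Figure 3 example
print(lookahead_latency_ms(15, 1))   # 450 ms: 15 layers, 1 frame each
print(lookahead_latency_ms(15, 6))   # 2700 ms (~2.7 s)
```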
In addition, we evaluated how the left context used in the T-T LabelEncoder affects performance. In Table 5, we show that constraining each layer to use only one previous label state yields the best accuracy, and even outperforms the baseline, which uses the full label history. We hypothesize that this is because the models overfit more when provided with the full label history. We see a similar trend when limiting left label states while using a full-attention T-T audio encoder. Finally, Table 6 reports the results when using a limited left context of 10 frames (which reduces the time complexity of one-step inference to a constant) together with look-ahead to future frames, as a way of bridging the gap between the performance of left-only-attention and full-attention models.
In this paper, we presented the Transformer Transducer model, embedding Transformer-based self-attention for audio and label encoding within the RNN-T architecture, resulting in an end-to-end model that can be optimized using a loss function that efficiently marginalizes over all possible alignments and that is well suited to time-synchronous decoding. This model is competitive with the best reported standard attention-based models trained using cross-entropy loss, and can easily be used for streaming speech recognition by limiting the audio and label context used in self-attention. Transformer Transducer models train significantly faster than LSTM-based RNN-T models, and they allow us to trade off recognition accuracy and latency in a flexible manner.
-  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
-  Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov, “Transformer-xl: Attentive language models beyond a fixed-length context,” arXiv preprint arXiv:1901.02860, 2019.
-  Kazuki Irie, Albert Zeyer, Ralf Schlüter, and Hermann Ney, “Language modeling with deep transformers,” arXiv preprint arXiv:1905.04226, 2019.
-  Linhao Dong, Shuang Xu, and Bo Xu, “Speech-transformer: A no-recurrence sequence-to-sequence model for speech recognition,” in Proc. ICASSP, 2018, pp. 5884–5888.
-  Ngoc-Quan Pham, Thai-Son Nguyen, Jan Niehues, Markus Müller, and Alex Waibel, “Very deep self-attention networks for end-to-end speech recognition,” CoRR, vol. abs/1904.13377, 2019.
-  Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F. Wong, and Lidia S. Chao, “Learning deep transformer models for machine translation,” CoRR, vol. abs/1906.01787, 2019.
-  Shiyu Zhou, Linhao Dong, Shuang Xu, and Bo Xu, “Syllable-based sequence-to-sequence speech recognition with the transformer in mandarin chinese,” in Proc. Interspeech, 2018, pp. 791–795.
-  Abdelrahman Mohamed, Dmytro Okhonko, and Luke Zettlemoyer, “Transformers with convolutional context for ASR,” CoRR, vol. abs/1904.11660, 2019.
-  Haşim Sak, Andrew Senior, and Francoise Beaufays, “Long Short-Term Memory Recurrent Neural Network Architectures for Large Scale Acoustic Modeling,” in INTERSPEECH 2014, 2014.
-  Alex Graves, “Sequence transduction with recurrent neural networks,” arXiv preprint arXiv:1211.3711, 2012.
-  Kanishka Rao, Haşim Sak, and Rohit Prabhavalkar, “Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer,” in 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2017, pp. 193–199.
-  Yanzhang (Ryan) He, Rohit Prabhavalkar, Kanishka Rao, Wei Li, Anton Bakhtin, and Ian McGraw, “Streaming small-footprint keyword spotting using sequence-to-sequence models,” in Automatic Speech Recognition and Understanding (ASRU), 2017 IEEE Workshop on, 2017.
-  Haşim Sak, Matt Shannon, Kanishka Rao, and Françoise Beaufays, “Recurrent neural aligner: An encoder-decoder neural network model for sequence to sequence mapping,” in Proc. Interspeech 2017, 2017, pp. 1298–1302.
-  Navdeep Jaitly, David Sussillo, Quoc V Le, Oriol Vinyals, Ilya Sutskever, and Samy Bengio, “A neural transducer,” arXiv preprint arXiv:1511.04868, 2015.
-  Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in Proc. ICML, 2006.
-  Anshuman Tripathi, Han Lu, Hasim Sak, and Hagen Soltau, “Monotonic Recurrent Neural Network Transducer and Decoding Strategies,” in Proc. ASRU, 2019.
-  L. R. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, PTR Prentice-Hall, Inc., Englewood Cliffs, New Jersey 07632, 1993.
-  Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
-  Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le, “Specaugment: A simple data augmentation method for automatic speech recognition,” arXiv preprint arXiv:1904.08779, 2019.
-  Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210.
-  Jan Chorowski and Navdeep Jaitly, “Towards better decoding and language model integration in sequence to sequence models,” in Proc. Interspeech 2017, 2017, pp. 523–527.