Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss

02/07/2020 ∙ by Qian Zhang, et al. ∙ Google

In this paper we present an end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system. Transformer computation blocks based on self-attention are used to encode both audio and label sequences independently. The activations from both audio and label encoders are combined with a feed-forward layer to compute a probability distribution over the label space for every combination of acoustic frame position and label history. This is similar to the Recurrent Neural Network Transducer (RNN-T) model, which uses RNNs for information encoding instead of Transformer encoders. The model is trained with a monotonic RNN-T loss well-suited to frame-synchronous, streaming decoding. We present results on the LibriSpeech dataset showing that limiting the left context for self-attention in the Transformer layers makes decoding computationally tractable for streaming, with only a slight degradation in accuracy. We also show that the full attention version of our model achieves competitive performance compared to existing LibriSpeech benchmarks for attention-based models trained with cross-entropy loss. Our results also show that we can bridge the gap between full attention and limited attention versions of our model by attending to a limited number of future frames.




1 Introduction

In the past few years, models employing self-attention [1] have achieved state-of-the-art results in many tasks such as machine translation, language modeling, and language understanding [1, 2]. In particular, large Transformer-based language models have brought gains in speech recognition tasks when used for second-pass re-scoring and in first-pass shallow fusion [3]. As typically used in sequence-to-sequence transduction tasks [4, 5, 6, 7, 8], Transformer-based models attend over encoder features using decoder features, implying that the decoding has to be done in a label-synchronous way, thereby posing a challenge for streaming speech recognition applications. An additional challenge for streaming speech recognition with these models is that the number of computations for self-attention increases quadratically with input sequence size. For streaming to be computationally practical, it is highly desirable that the time it takes to process each frame remains constant relative to the length of the input.

For streaming speech recognition models, recurrent neural networks (RNNs) have been the de facto choice since they can model the temporal dependencies in the audio features effectively [9] while maintaining a constant computational requirement for each frame. Streamable end-to-end modeling architectures such as the Recurrent Neural Network Transducer (RNN-T) [10, 11, 12], Recurrent Neural Aligner (RNA) [13], and Neural Transducer [14] utilize an encoder-decoder based framework where both encoder and decoder are layers of RNNs that generate features from audio and labels respectively. In particular, the RNN-T and RNA models are trained to learn alignments between the acoustic encoder features and the label encoder features, and so lend themselves naturally to frame-synchronous decoding.

Previously, several optimizations were made to enable running RNN-T on device [12]. Additionally, an extensive search over architectures and output units was carried out for RNN-T [11]. In this paper we explore the possibility of replacing RNN-based audio and label encoders in the conventional RNN-T architecture with Transformer encoders. With a view to preserving model streamability, we show that Transformer-based models can be trained with self-attention on a fixed number of past input frames and previous labels. This results in a degradation of performance (compared to attending to all past input frames and labels), but the model then satisfies a constant computational requirement for processing each frame, making it suitable for streaming. Given the simple architecture and parallelizable nature of self-attention computations, we observe large improvements in training time and training resource utilization compared to RNN-T models that employ RNNs.

Figure 1: RNN/Transformer Transducer architecture.

The RNN-T architecture¹ (as depicted in Figure 1) is a neural network architecture that can be trained end-to-end with the RNN-T loss to map input sequences (e.g. audio feature vectors) to target sequences (e.g. phonemes, graphemes). Given an input sequence of real-valued vectors of length $T$, $\mathbf{x} = (x_1, x_2, \ldots, x_T)$, the RNN-T model tries to predict the target sequence of labels $\mathbf{y} = (y_1, y_2, \ldots, y_U)$ of length $U$.

¹ We use "RNN-T architecture" or "RNN-T model" interchangeably in this paper to refer to the neural network architecture described in Eq. (3) and Eq. (4), and "RNN-T loss", defined in Eq. (5), to refer to the loss used to train this architecture.

Unlike a typical attention-based sequence-to-sequence model, which attends over the entire input for every prediction in the output sequence, the RNN-T model gives a probability distribution over the label space at every time step, and the output label space includes an additional null label to indicate the lack of output for that time step — similar to the Connectionist Temporal Classification (CTC) framework [15]. But unlike CTC, this label distribution is also conditioned on the previous label history.

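The RNN-T model defines a conditional distribution $P(\mathbf{z} \mid \mathbf{x})$ over all the possible alignments, where

$$\mathbf{z} = [(z_1, t_1), (z_2, t_2), \ldots, (z_{\bar{U}}, t_{\bar{U}})]$$

is a sequence of $(z_i, t_i)$ pairs of length $\bar{U}$, and $(z_i, t_i)$ represents an alignment between output label $z_i$ and the encoded feature at time $t_i$. The labels $z_i$ can optionally be blank labels (null predictions). Removing the blank labels gives the actual output label sequence $\mathbf{y}$, of length $U$.

We can marginalize $P(\mathbf{z} \mid \mathbf{x})$ over all possible alignments $\mathbf{z}$ to obtain the probability of the target label sequence $\mathbf{y}$ given the input sequence $\mathbf{x}$,

$$P(\mathbf{y} \mid \mathbf{x}) = \sum_{\mathbf{z} \in \mathcal{Z}(\mathbf{y}, T)} P(\mathbf{z} \mid \mathbf{x}), \tag{1}$$

where $\mathcal{Z}(\mathbf{y}, T)$ is the set of valid alignments of length $T$ for the label sequence.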

2 Transformer Transducer

2.1 RNN-T Architecture and Loss

In this paper, we adopt the monotonic RNN-T loss [16], which performs similarly to the original RNN-T loss [10] but constrains the model to output one label per frame, resulting in a strictly monotonic audio and label alignment. For simplicity, in the following paragraphs, we use the term “RNN-T loss” to refer to the monotonic RNN-T loss.

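With the monotonicity constraint, the probability of an alignment can be factorized as

$$P(\mathbf{z} \mid \mathbf{x}) = \prod_{i} P(z_i \mid \mathbf{x}, t_i, \mathrm{Labels}(z_{1:(i-1)})), \tag{2}$$

where $\mathrm{Labels}(z_{1:(i-1)})$ is the sequence of non-blank labels in $z_{1:(i-1)}$. The RNN-T architecture parameterizes $P(\mathbf{z} \mid \mathbf{x})$ with an audio encoder, a label encoder, and a joint network. The encoders are two neural networks that encode the input sequence and the target output sequence, respectively. Previous work has employed Long Short-Term Memory models (LSTMs) as the encoders, giving the RNN-T its name. However, this framework is not restricted to RNNs. In this paper, we are particularly interested in replacing the LSTM encoders with Transformers [1, 2]. In the following, we refer to this new architecture as the Transformer Transducer (T-T). As in the original RNN-T model, the joint network combines the audio encoder output at $t_i$ and the label encoder output given the previous non-blank output label sequence $\mathrm{Labels}(z_{1:(i-1)})$ using a feed-forward neural network with a softmax layer, inducing a distribution over the labels. The model defines $P(z_i \mid \mathbf{x}, t_i, \mathrm{Labels}(z_{1:(i-1)}))$ as follows:

$$\mathrm{Joint} = \mathrm{Linear}(\mathrm{AudioEncoder}_{t_i}(\mathbf{x})) + \mathrm{Linear}(\mathrm{LabelEncoder}(\mathrm{Labels}(z_{1:(i-1)}))) \tag{3}$$

$$P(z_i \mid \mathbf{x}, t_i, \mathrm{Labels}(z_{1:(i-1)})) = \mathrm{Softmax}(\mathrm{Linear}(\tanh(\mathrm{Joint}))) \tag{4}$$

where each $\mathrm{Linear}$ function is a different single-layer feed-forward neural network, $\mathrm{AudioEncoder}_{t_i}(\mathbf{x})$ is the audio encoder output at time $t_i$, and $\mathrm{LabelEncoder}(\mathrm{Labels}(z_{1:(i-1)}))$ is the label encoder output given the previous non-blank label sequence.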
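As a rough illustration of how the joint network turns one audio encoder output and one label encoder output into a distribution over labels, the sketch below uses toy dimensions and random weights. The helper names (`linear`, `random_layer`, `joint_distribution`), the sizes, the choice of index 0 for the blank label, and the tanh nonlinearity between the two projections are assumptions of this sketch, not details confirmed by the text above.

```python
import math
import random

def linear(weights, bias, vec):
    """Single-layer feed-forward projection: weights @ vec + bias."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def random_layer(rng, out_dim, in_dim):
    """Hypothetical random initialisation, used only to make the sketch runnable."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim

def joint_distribution(audio_vec, label_vec, audio_proj, label_proj, output_proj):
    """Sum two separate projections, apply tanh, project to the label space, normalise."""
    joint = [a + l for a, l in zip(linear(*audio_proj, audio_vec),
                                   linear(*label_proj, label_vec))]
    hidden = [math.tanh(j) for j in joint]
    return softmax(linear(*output_proj, hidden))

rng = random.Random(0)
ENC_DIM, JOINT_DIM, NUM_LABELS = 8, 6, 5   # toy sizes; index 0 plays the role of blank
audio_proj = random_layer(rng, JOINT_DIM, ENC_DIM)
label_proj = random_layer(rng, JOINT_DIM, ENC_DIM)
output_proj = random_layer(rng, NUM_LABELS, JOINT_DIM)

audio_vec = [rng.gauss(0, 1) for _ in range(ENC_DIM)]   # stands in for one audio encoder output
label_vec = [rng.gauss(0, 1) for _ in range(ENC_DIM)]   # stands in for one label encoder output
probs = joint_distribution(audio_vec, label_vec, audio_proj, label_proj, output_proj)
print(probs)
```

Because the two encoder outputs only meet in this small network, the distribution can be evaluated for any pairing of a frame position with a label history without re-running either encoder.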

Computing Eq. (1) by naively summing over all valid alignments is computationally intractable. Therefore, we define the forward variable $\alpha(t, u)$ as the sum of probabilities for all paths ending at time-frame $t$ and label position $u$. We then use the forward algorithm [10, 17] to compute the last alpha variable $\alpha(T, U)$, which corresponds to $P(\mathbf{y} \mid \mathbf{x})$ defined in Eq. (1). Efficient computation of $P(\mathbf{y} \mid \mathbf{x})$ using the forward algorithm is enabled by the fact that the local probability estimate (Eq. (4)) at any given label position and any given time-frame is not dependent on the alignment [10]. The training loss for the model is then the sum of the negative log probabilities defined in Eq. (1) over all the training examples,

$$\mathrm{loss} = -\sum_{i} \log P(\mathbf{y}_i \mid \mathbf{x}_i) = -\sum_{i} \log \alpha(T_i, U_i), \tag{5}$$

where $T_i$ and $U_i$ are the lengths of the input sequence and the output target label sequence of the $i$-th training example, respectively.

2.2 Transformer

The Transformer [1] is composed of a stack of multiple identical layers. Each layer has two sub-layers, a multi-headed attention layer and a feed-forward layer. Our multi-headed attention layer first projects the input to Query, Key, and Value for all the heads [2]. The attention mechanism is applied separately for different attention heads. The attention mechanism provides a flexible way to control the context that the model uses. For example, we can mask the attention scores to the right of the current frame to produce output conditioned only on the previous state history. The weight-averaged Values for all heads are concatenated and passed to a dense layer. We then employ a residual connection and LayerNorm [18] on the output of the dense layer to form the final output of the multi-headed attention sub-layer (i.e. $\mathrm{LayerNorm}(x + \mathrm{AttentionLayer}(x))$, where $x$ is the input to the multi-headed attention sub-layer). We also apply dropout on the normalized attention probabilities and to the output of the dense layer to prevent overfitting. Our feed-forward sub-layer has two dense layers, and we use ReLU as the activation for the first dense layer. Again, dropout is applied to both dense layers for regularization, and a residual connection and LayerNorm are applied to the output of the second dense layer (i.e. $\mathrm{LayerNorm}(x + \mathrm{FeedForwardLayer}(x))$, where $x$ is the input to the feed-forward sub-layer). See Figure 2 for more details.

Note that LabelEncoder states do not attend to AudioEncoder states, in contrast to the architecture in [1]. As discussed in the Introduction, doing so poses a challenge for streaming applications. Instead, we implement AudioEncoder and LabelEncoder in Eq. (3), which are LSTMs in conventional RNN-T architectures [10, 11, 12], using the Transformers described above. In tandem with the RNN-T architecture described in the previous section, the attention mechanism here only operates within AudioEncoder or LabelEncoder, contrary to the standard practice for Transformer-based systems. In addition, so as to model sequential order, we use the relative positional encoding proposed in [2]. With relative positional encoding, the encoding only affects the attention score instead of the Values being summed. This allows us to reuse previously computed states rather than recomputing all previous states and getting the last state in an overlapping inference manner when the number of frames or labels that AudioEncoder or LabelEncoder has processed is larger than the maximum length used during training (which would again be intractable for streaming applications). More specifically, the complexity of running one-step inference to get activations at time $t$ is $O(t)$, which is the computation cost of attending to $t$ states and of the feed-forward process for the current step when using relative positional encoding. On the other hand, with absolute positional encoding, the encoding added to the input should be shifted by one when $t$ is larger than the maximum length used during training, which precludes re-use of the states, and makes the complexity $O(t^2)$. However, even if we can reduce the complexity from $O(t^2)$ to $O(t)$ with relative positional encoding, there is still the issue of latency growing over time. One intuitive solution is to limit the model to attend to a moving window of $W$ states, making the one-step inference complexity constant.

Note that training or inference with attention to limited context is not possible for Transformer-based models that have attention from the decoder to the encoder, as such a setup is itself trying to learn the alignment. In contrast, the separation of AudioEncoder and LabelEncoder, and the fact that the alignment is handled by a separate forward-backward process within the RNN-T architecture, make it possible to train with attention over an explicitly specified, limited context.
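The effect of a moving attention window on per-step cost can be illustrated with a small mask-building sketch. The function names are hypothetical, and the sketch assumes that a position always attends to itself and that -1 denotes unlimited context on that side, following the convention used in the result tables.

```python
def context_mask(num_states, left, right):
    """mask[i][j] is True when position i may attend to position j.

    left / right give the number of states visible on each side of position i;
    -1 means the whole left or right context.
    """
    mask = []
    for i in range(num_states):
        lo = 0 if left < 0 else max(0, i - left)
        hi = num_states - 1 if right < 0 else min(num_states - 1, i + right)
        mask.append([lo <= j <= hi for j in range(num_states)])
    return mask

def attended_counts(mask):
    """Number of states each position attends to, a proxy for per-step attention cost."""
    return [sum(row) for row in mask]

full_left = context_mask(8, left=-1, right=0)   # all past states, no look-ahead
windowed = context_mask(8, left=2, right=1)     # moving window with one future state
print(attended_counts(full_left))   # [1, 2, 3, 4, 5, 6, 7, 8]
print(attended_counts(windowed))    # [2, 3, 4, 4, 4, 4, 4, 3]
```

With full left context the count grows with the position, while with a window it is bounded by left + 1 + right regardless of how long the utterance has been running.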

Figure 2: Transformer encoder architecture.

| Parameter | Value |
|---|---|
| Input feature/embedding size | 512 |
| Dense layer 1 | 2048 |
| Dense layer 2 | 1024 |
| Number of attention heads | 4 |
| Head dimension | 128 |
| Dropout ratio | 0.3 |

Table 1: Transformer encoder parameter setup.

| Model | Param size | No LM, test-clean WER (%) | No LM, test-other WER (%) | With LM, test-clean WER (%) | With LM, test-other WER (%) |
|---|---|---|---|---|---|
| Best LAS [19] | 361M | 2.8 | 6.8 | 2.5 | 5.8 |
| LAS [19] | 184M | 3.4 | 9.2 | 2.7 | 7.3 |
| BiLSTM RNN-T | 130M | 4.8 | 11.8 | 3.5 | 8.4 |
| FullAttn T-T | 118M | 3.5 | 9.4 | 2.6 | 6.9 |

Table 2: Comparison of WERs for LAS, RNN-T and Transformer Transducer models on LibriSpeech test sets.

3 Experiments and Results

3.1 Data

We evaluated the proposed model using the publicly available LibriSpeech ASR corpus [20]. The LibriSpeech dataset consists of 970 hours of audio data with corresponding text transcripts (around 10M word tokens) and an additional 800M word token text-only dataset. The paired audio/transcript dataset was used to train T-T models and an LSTM-based baseline. The full 810M word token text dataset was used for standalone language model (LM) training. We extracted 128-channel logmel energy values from a 32 ms window, stacked every 4 frames, and sub-sampled every 3 frames, to produce a 512-dimensional acoustic feature vector with a stride of 30 ms. Feature augmentation [19] was applied during model training to prevent overfitting and to improve generalization, using only frequency masking and time masking.
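The arithmetic behind the 512-dimensional features and the 30 ms stride can be checked with a short sketch. It assumes a 10 ms hop between the original log-mel frames (implied by the 30 ms stride after sub-sampling by 3) and stacks each frame with the frames that follow it; the function name and the dummy feature values are illustrative only.

```python
def stack_and_subsample(frames, stack=4, stride=3):
    """Concatenate `stack` consecutive frames, then keep every `stride`-th stacked frame."""
    stacked = [
        [value for frame in frames[i:i + stack] for value in frame]
        for i in range(0, len(frames) - stack + 1)
    ]
    return stacked[::stride]

FRAME_SHIFT_MS = 10    # assumed hop between log-mel frames
NUM_CHANNELS = 128     # log-mel channels per frame

# One second of dummy features: frame t is filled with the value t.
frames = [[float(t)] * NUM_CHANNELS for t in range(100)]
features = stack_and_subsample(frames)

feature_dim = len(features[0])          # 128 channels x 4 stacked frames
stride_ms = FRAME_SHIFT_MS * 3          # one output vector every 3 input frames
print(len(features), feature_dim, stride_ms)
```

Each output vector covers 4 input frames but a new one is produced only every 3 frames, so consecutive vectors overlap by one frame.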

3.2 Transformer Transducer

Our Transformer Transducer model architecture has 15 audio and 2 label encoder layers. Every layer is identical for both audio and label encoders. The details of each computation layer are shown in Figure 2 and Table 1. We also add dropout layers on each attention layer and between each Transformer layer, to prevent the model from quickly overfitting (it can overfit on the small training data within a few hours). We train this model to output grapheme units in all our experiments. We found that the Transformer Transducer models trained much faster than an LSTM-based RNN-T model with a similar number of parameters (about a day compared to multiple days).

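| Audio mask, left | Audio mask, right | Label mask, left | Test-clean WER (%) | Test-other WER (%) |
|---|---|---|---|---|
| -1 | 0 | -1 | 6.1 | 15.7 |
| 10 | 0 | -1 | 6.2 | 16.5 |
| 6 | 0 | -1 | 6.8 | 17.8 |
| 2 | 0 | -1 | 8.6 | 21.7 |

Table 3: Limited left context per layer for audio encoder.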

3.3 Results

We first compared the performance of Transformer Transducer (T-T) models with full attention on audio to an RNN-T model using a bidirectional LSTM audio encoder. As shown in Table 2, the T-T model significantly outperforms the LSTM-based RNN-T baseline. We also observed that T-T models can achieve recognition accuracy competitive with existing wordpiece-based end-to-end models of similar size. To compare with systems using shallow fusion [15, 21] with separately trained LMs, we also trained a Transformer-based LM with the same architecture as the label encoder used in T-T, using the full 810M word token dataset. We measured the perplexity of this Transformer LM (6 layers; 57M parameters) on the dev-clean set; the use of dropout, and of larger models, did not improve either perplexity or WER. Shallow fusion was then performed using that LM with both the trained T-T system and the trained bidirectional LSTM-based RNN-T baseline, with scaling factors on the LM output and on the non-blank symbol sequence length tuned on the LibriSpeech dev sets. The results are shown in Table 2 in the "With LM" columns. The shallow fusion result for the T-T system is competitive with corresponding results for top-performing existing systems.
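One common way to combine the scores in shallow fusion is a weighted sum in the log domain, with one scaling factor on the LM score and another on the number of non-blank symbols. The sketch below illustrates that form only; the exact combination used in the experiments is not given in the text, and the hypotheses, scores, and weights here are made-up values chosen to show how the ranking can change.

```python
def fused_score(tt_log_prob, lm_log_prob, num_symbols, lm_weight, length_weight):
    """Log-domain score: model score plus scaled LM score plus scaled symbol count."""
    return tt_log_prob + lm_weight * lm_log_prob + length_weight * num_symbols

def best_hypothesis(hypotheses, lm_weight, length_weight):
    """Pick the hypothesis text with the highest fused score."""
    def score(hyp):
        text, tt_log_prob, lm_log_prob = hyp
        # Grapheme output: every character of the text counts as one non-blank symbol.
        return fused_score(tt_log_prob, lm_log_prob, len(text), lm_weight, length_weight)
    return max(hypotheses, key=score)[0]

# (text, transducer log-probability, LM log-probability); all numbers are invented.
hypotheses = [
    ("i scream", -4.0, -9.0),
    ("ice cream", -4.2, -6.0),
]

without_lm = best_hypothesis(hypotheses, lm_weight=0.0, length_weight=0.0)
with_lm = best_hypothesis(hypotheses, lm_weight=0.5, length_weight=0.1)
print(without_lm, "|", with_lm)
```

In practice the two weights would be tuned on held-out data, as the dev-set tuning described above suggests.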

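Figure 3: Transformer context masking for a single output position (left=2, right=1).

| Audio mask, left | Audio mask, right | Label mask, left | Test-clean WER (%) | Test-other WER (%) |
|---|---|---|---|---|
| -1 | -1 | -1 | 3.5 | 9.4 |
| -1 | 10 | -1 | 3.9 | 10.5 |
| -1 | 6 | -1 | 4.1 | 10.9 |
| -1 | 2 | -1 | 4.8 | 12.6 |
| -1 | 1 | -1 | 5.1 | 13.4 |
| -1 | 0 | -1 | 6.1 | 15.7 |

Table 4: Limited right context per layer for audio encoder.

| Audio mask, left | Audio mask, right | Label mask, left | Test-clean WER (%) | Test-other WER (%) |
|---|---|---|---|---|
| -1 | 0 | -1 | 6.1 | 15.7 |
| -1 | 0 | 1 | 5.9 | 14.9 |
| -1 | 0 | 2 | 6.0 | 15.2 |
| -1 | 0 | 3 | 6.0 | 15.2 |
| -1 | 0 | 4 | 6.1 | 15.4 |

Table 5: Limited left context per layer for label encoder.

| Audio mask, left | Audio mask, right | Label mask, left | Test-clean WER (%) | Test-other WER (%) |
|---|---|---|---|---|
| -1 | -1 | -1 | 3.5 | 9.4 |
| 10 | 10 | 2 | 4.2 | 11.8 |
| 10 | 2 | 2 | 4.8 | 13.4 |
| -1 | 0 | -1 | 6.1 | 15.7 |

Table 6: Results for limiting audio and label context for streaming.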

Next, we ran training and decoding experiments using T-T models with limited attention windows over audio and text, with a view to building online streaming speech recognition systems with low latency. Similarly to the use of unidirectional RNN audio encoders in online models, where activations for time $t$ are computed with conditioning only on audio frames before $t$, here we constrain the AudioEncoder to attend to the left of the current frame by masking the attention scores to the right of the current frame. In order to make one-step inference for the AudioEncoder tractable (i.e. to have constant time complexity), we further limit the attention for the AudioEncoder to a fixed window of previous states by again masking the attention score. Due to limited computation resources, we used the same mask for different Transformer layers, but the use of different contexts (masks) for different layers is worth exploring. The results are shown in Table 3, where -1 means that the model uses the entire left or right context, and non-negative values represent the number of states the model uses to the left or right of the current frame. As we can see, using the entire audio history gives the lowest WER, but a wide limited state history preserves good performance. Conditioning the model on the 10 previous states is only 0.1% absolute worse on test-clean (6.2% vs. 6.1%) and 0.8% absolute worse on test-other (16.5% vs. 15.7%) compared to using full left context attention.

Similarly, we explored the use of limited right context to allow the model to see some future audio frames, in the hope of bridging the gap between a left-context-only T-T model (left = -1, right = 0) and a full attention T-T model (left = -1, right = -1). Since we apply the same mask for every layer, the latency introduced by using right context is aggregated over all the layers. For example, in Figure 3, to produce the output at position $t$ from a 3-layer Transformer with one frame of right context, the model needs to wait for input frame $t + 3$ to arrive, which is 90 ms of latency in our case. As we can see from Table 4, with a single frame of future context per layer (450 ms of latency), the WER improves by about 16% relative on test-clean (5.1% vs. 6.1%) and about 15% relative on test-other (13.4% vs. 15.7%) over the left-attention-only baseline. With a right context of 6 frames per layer (2.7 sec of latency), the model is about 16 to 17% relative worse than the full attention T-T (4.1% vs. 3.5% on test-clean, 10.9% vs. 9.4% on test-other).
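The latency figures follow from multiplying the per-layer look-ahead by the number of layers and the 30 ms feature stride, and the relative WER changes follow directly from Table 4. The helper names below are illustrative; the layer counts, stride, and WER values are taken from the text and tables.

```python
def lookahead_latency_ms(num_layers, right_context, frame_stride_ms=30):
    """Look-ahead accumulates across layers when every layer uses the same mask."""
    return num_layers * right_context * frame_stride_ms

def relative_wer_change(wer, reference_wer):
    """Relative change in percent; negative means `wer` is better than the reference."""
    return 100.0 * (wer - reference_wer) / reference_wer

print(lookahead_latency_ms(3, 1))    # 90: the 3-layer example with one future frame
print(lookahead_latency_ms(15, 1))   # 450: 15 audio layers, one future frame each
print(lookahead_latency_ms(15, 6))   # 2700: 15 audio layers, six future frames each

# Table 4, test-clean: right=1 against right=0, then right=6 against full attention.
one_frame_vs_left_only = relative_wer_change(5.1, 6.1)
six_frames_vs_full = relative_wer_change(4.1, 3.5)
print(round(one_frame_vs_left_only, 1), round(six_frames_vs_full, 1))
```

Because the look-ahead compounds with depth, using different masks in different layers (for example, look-ahead only in a few layers) would be one way to reduce the total latency for a given amount of future context.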

In addition, we evaluated how the left context used in the T-T LabelEncoder affects performance. In Table 5, we show that constraining each layer to only use one previous label state yields the best accuracy (5.9% / 14.9% WER on test-clean / test-other), and this even outperforms the baseline, which uses the full label history (6.1% / 15.7%). We hypothesize that this is because the models overfit more when provided with the full label history. We see a similar trend when limiting left label states while using a full attention T-T audio encoder. Finally, Table 6 reports the results when using a limited left context of 10 frames, which reduces the time complexity for one-step inference to a constant, with look-ahead to future frames, as a way of bridging the gap between the performance of left-only attention and full attention models.

4 Conclusions

In this paper, we presented the Transformer Transducer model, embedding Transformer-based self-attention for audio and label encoding within the RNN-T architecture, resulting in an end-to-end model that can be optimized using a loss function that efficiently marginalizes over all possible alignments and that is well-suited to time-synchronous decoding. This model is competitive with the best reported standard attention-based models trained using cross-entropy loss, and can easily be used for streaming speech recognition by limiting the audio and label context used in self-attention. Transformer Transducer models train significantly faster than LSTM-based RNN-T models, and they allow us to trade off recognition accuracy against latency in a flexible manner.