Recently, the speech research community has seen a significant trend of moving from deep neural network based hybrid modeling [DNN4ASR-hinton2012] to end-to-end (E2E) modeling [miao2015eesen, chan2016listen, prabhavalkar2017comparison, battenberg2017exploring, chiu2018state, rao2017exploring, he2019streaming] for automatic speech recognition (ASR). Whereas hybrid models require disjoint optimization of separate constituent models such as the acoustic and language models, E2E ASR systems directly translate an input speech sequence into an output token sequence (characters, sub-words, or even words) using a single network.
Some widely used contemporary E2E approaches for sequence-to-sequence transduction are: (a) Connectionist Temporal Classification (CTC) [graves2006connectionist, Graves-E2EASR], (b) recurrent neural network Transducer (RNN-T) [Graves-RNNSeqTransduction], and (c) Attention-based Encoder-Decoder (AED) [Attention-bahdanau2014, Attention-speech-chorowski2015, chan2016listen]. Among these three approaches, CTC was the earliest and can map the input speech signal to target labels without requiring any external alignments. However, it suffers from the conditional frame-independence assumption. RNN-T extends CTC modeling by changing the objective function and the model architecture to remove the frame-independence assumption. Because of its streaming nature, RNN-T has received a lot of attention for industrial applications and has also replaced traditional hybrid models in some cases [he2019streaming, Sainath19, Li2019RNNT, jain2019rnn].
AED is a general family of models that was initially proposed for machine translation [bahdanau2014neural] but has shown success in other domains as well, including ASR [Attention-bahdanau2014, Attention-speech-chorowski2015, chan2016listen]. These models are not streaming in nature by default, but several studies work in that direction, such as monotonic chunkwise attention [chiu2017monotonic] and triggered attention [moritz2019triggered]. The early AED models used RNNs as the building block for their encoder and decoder modules. We refer to them as RNN-AED in this study. More recently, the Transformer architecture with self-attention [vaswani2017attention] has also become prevalent and is being used as a fundamental building block for encoder and decoder modules [dong2018speech, zhou2018syllable, karita2019comparative]. We refer to such a model as Transformer-AED in this paper.
Given the fast-evolving landscape of E2E technology, it is timely to compare the most popular and promising E2E technologies for ASR in order to shape future research directions. This paper compares the currently most promising E2E technologies, namely RNN-T, RNN-AED, and Transformer-AED, in both non-streaming and streaming modes. All models are trained with 65 thousand hours of Microsoft anonymized training data. As E2E models are data hungry, it is better to compare their power with such a large amount of training data. To the best of our knowledge, no such detailed comparison exists. In a recent work [Sainath19], the streaming RNN-T model was compared with the non-streaming RNN-AED. In [chiu2019comparison], streaming RNN-AED was compared with streaming RNN-T for long-form speech recognition. In [karita2019comparative], RNN-AED and Transformer-AED were compared in a non-streaming mode, with training data of up to 960 hours. As industrial applications usually require the ASR service to run in a streaming mode, we put additional effort into developing these E2E models in a streaming mode. While it has been shown in [sainath2020streaming] that combining RNN-T and RNN-AED in a two-pass decoding configuration can surpass an industry-grade state-of-the-art hybrid model, this study shows that a single streaming E2E model, either RNN-T or Transformer-AED, can also surpass a state-of-the-art hybrid model [li2020high, li2019improving].
In addition to performing a detailed comparison of these promising E2E models for the first time, the other contributions of this paper are: 1) we propose a multi-layer context modeling scheme that exploits future context with significant gains; 2) we show that cross entropy (CE) initialization is much more effective than CTC initialization for boosting RNN-T models; 3) for streaming Transformer-AED, we show that chunk-based future context integration is more effective than the lookahead method; 4) we release our Transformer-related code with reproducible results on Librispeech at [Wang2020Transformer] to facilitate future research in E2E ASR.
2 Popular End-to-End Models
In this section, we give a brief introduction to the currently popular E2E models: RNN-T, RNN-AED, and Transformer-AED. These models have an acoustic encoder, which generates a high-level representation of the speech, and a decoder, which autoregressively generates output tokens in the linguistic domain. While the acoustic encoders can be the same, the decoders of RNN-T and AED are different. In RNN-T, the generation of the next label is conditioned only on the label outputs at previous steps, while the decoder of AED conditions the next output on the acoustics as well. More importantly, RNN-T works in a frame-synchronized way while AED works in a label-synchronized fashion.
2.1 RNN transducer
The encoder network converts the acoustic feature $x_t$ into a high-level representation $h_t^{enc}$. The decoder in RNN-T, called the prediction network, produces a high-level representation $h_u^{pre}$ by consuming the previous non-blank target $y_{u-1}$. Here $u$ denotes the output label index. The joint network is a feed-forward network that combines the encoder network output $h_t^{enc}$ and the prediction network output $h_u^{pre}$ to generate the joint matrix $h_{t,u}$. Here $t$ denotes the time index. This joint matrix is used to calculate the softmax output.
The encoder and prediction networks are usually realized using RNNs with LSTM [Hochreiter1997long] units. When the encoder is a unidirectional LSTM-RNN as in Eq. (1), RNN-T works in streaming mode by default.
$$h_t^{enc} = \mathrm{LSTM}(x_t, h_{t-1}^{enc}) \tag{1}$$
However, when the underlying LSTM-RNN encoder is a bi-directional model as in Eq. (2), it is a non-streaming E2E model.
$$h_t^{enc} = [\mathrm{LSTM}(x_t, h_{t-1}^{enc}), \mathrm{LSTM}(x_t, h_{t+1}^{enc})] \tag{2}$$
When implemented with LSTM-RNN, the prediction network formulation is
$$h_u^{pre} = \mathrm{LSTM}(y_{u-1}, h_{u-1}^{pre}) \tag{3}$$
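As a rough illustration of how the joint network pairs every encoder frame with every prediction-network state, the sketch below builds the time-by-label lattice of output distributions in plain Python. The additive combination followed by tanh, the output projection `w_out`, and all dimensions are assumptions made for illustration; the text only states that the joint network is feed-forward.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def joint_lattice(enc_out, pred_out, w_out):
    """Combine every encoder frame with every prediction-network state.

    enc_out:  T vectors of size D (one per time index t)
    pred_out: U vectors of size D (one per label index u)
    w_out:    V rows of size D (hypothetical output projection)
    Returns a T x U grid of probability vectors of size V.
    """
    lattice = []
    for h_enc in enc_out:
        row = []
        for h_pre in pred_out:
            # Assumed joint: element-wise sum, tanh, then a linear projection.
            hidden = [math.tanh(a + b) for a, b in zip(h_enc, h_pre)]
            logits = [sum(w * h for w, h in zip(w_row, hidden)) for w_row in w_out]
            row.append(softmax(logits))
        lattice.append(row)
    return lattice

enc_out = [[0.1, -0.2], [0.4, 0.3], [0.0, 0.5]]   # T = 3 frames
pred_out = [[0.0, 0.0], [0.2, -0.1]]              # U = 2 label positions
w_out = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]      # V = 3 tokens, blank included
lattice = joint_lattice(enc_out, pred_out, w_out)
```

Each cell of `lattice` is a distribution over the output tokens for one (time, label) pair, which is the grid the RNN-T loss and decoder operate on.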
2.2 Attention-based Encoder-Decoder
While RNN-T has received more attention from the industry due to its streaming nature, the Attention-based Encoder-Decoder (AED) models attract more research from academia because of their powerful attention structure. RNN-AED and Transformer-AED differ in the realization of the encoder and decoder, which use LSTM-RNN and Transformer, respectively.
The context vector is obtained by a weighted combination of the encoder output. It is supposed to contain the acoustic information necessary to emit the next token, and it is calculated with the help of the attention mechanism [Attention-bahdanau2014, bahdanau2016end].
Even though RNNs can capture long-term dependencies, Transformer [vaswani2017attention] based models can do so more effectively because the attention mechanism sees all context directly. Specifically, the encoder is composed of a stack of Transformer blocks, where each block has a multi-head self-attention layer and a feed-forward layer. Suppose that the input of a Transformer block can be linearly transformed to $Q$, $K$, and $V$. Then, the output of a multi-head self-attention layer is
$$\mathrm{Multihead}(Q, K, V) = [H_1, \ldots, H_{d_{head}}] W^{head}, \quad H_i = \mathrm{softmax}\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i,$$
where $Q_i = Q W_i^{Q}$, $K_i = K W_i^{K}$, and $V_i = V W_i^{V}$. Here $d_{head}$ is the number of attention heads and $d_k$ is the dimension of the feature vector for each head. This output is fed to the feed-forward layer. Residual connections [RESNET-he2015] and layer normalization [ba2016layer] are indispensable when we connect different layers and blocks. In addition to the two layers in an encoder block, the Transformer decoder has a third layer that performs multi-head attention over the output of the encoder. This is similar to the attention mechanism in RNN-AED.
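The multi-head self-attention computation can be sketched in plain Python as follows. This is a minimal illustration of scaled dot-product attention in the style of [vaswani2017attention]; it assumes the queries, keys, and values are already projected, splits the feature dimension evenly across heads, and omits the learned per-head and output projections for brevity. The function names are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(q, k, v):
    """Compute softmax(q k^T / sqrt(d_k)) v for one head (lists of rows)."""
    d_k = len(q[0])
    out = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k)
                  for k_row in k]
        weights = softmax(scores)
        out.append([sum(w * v_row[j] for w, v_row in zip(weights, v))
                    for j in range(len(v[0]))])
    return out

def multi_head_attention(q, k, v, num_heads):
    """Slice the feature dimension into heads, attend per head, concatenate."""
    d_model = len(q[0])
    assert d_model % num_heads == 0, "feature size must divide evenly into heads"
    d_k = d_model // num_heads
    heads = []
    for i in range(num_heads):
        cols = slice(i * d_k, (i + 1) * d_k)
        heads.append(attention_head([r[cols] for r in q],
                                    [r[cols] for r in k],
                                    [r[cols] for r in v]))
    # Concatenate the per-head outputs for every position.
    return [sum((head[t] for head in heads), []) for t in range(len(q))]

# Self-attention: queries, keys, and values all come from the same input.
x = [[1.0, 0.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0, 0.0]]
out = multi_head_attention(x, x, x, num_heads=2)
```

With 8 heads of dimension 64, as used later in the experiments, the concatenated output would have 512 features per position.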
3 Our Models
3.1 Model building block
The encoder and decoder of E2E models are constructed as a stack of the building blocks described in this section. For the models using LSTM-RNN, we explore two structures. The first one, LSTM_cuDNN, directly calls the Nvidia cuDNN library [chetlur2014cudnn] for the LSTM implementation. We build every block by concatenating a cuDNN LSTM layer, a linear projection layer to reduce the model size, and a layer normalization layer. Calling the Nvidia cuDNN implementation enables fast experiments when comparing different models.
The second structure, LSTM_Custom, puts layer normalization and projection layer inside LSTM, as it was indicated in [he2019streaming] that they are important for better RNN-T model training. Hence, we only use this structure for RNN-T by customizing the LSTM function. The detailed formulations are in [Li2019RNNT]. However, this slows down the model training speed by 50%.
For the Transformer-AED models, we remove the position embedding part [wang2019semantic] and use a VGG-like convolution module [simonyan2014very] to pre-process the speech feature before the Transformer blocks. The layer normalization is put before multi-head attention layer (Pre-LN), which makes the gradients well-behaved at the early stage in training.
3.2 Non-streaming models
We achieve non-streaming behavior in RNN-T by adding bidirectionality in the encoder. The encoder of this RNN-T is composed of multiple blocks of bi-directional LSTM_cuDNN as described in Section 3.1. The prediction network is realized with multiple uni-directional blocks of LSTM_cuDNN.
Similar to RNN-T, the non-streaming RNN-AED investigated in this study also uses multiple blocks of bi-directional LSTM_cuDNN in the encoder and uni-directional LSTM_cuDNN in the decoder. This decoder works together with a location-aware softmax attention [asr_location_aware_chorowski]. No multi-task training or joint-decoding with CTC is used for RNN-AED.
Following [karita2019comparative], the Transformer-AED model uses the multi-task training and the joint decoding of CTC/attention. The training objective function is
The log-likelihood of the next subword in the joint decoding is formulated as
In practice, we first use the attention model to select the top-k candidates and then re-rank them with Eq. (7).
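The two-step procedure of selecting candidates with the attention model and then re-ranking them with a joint score can be sketched as below. The linear interpolation of log-probabilities and the name `ctc_weight` are assumptions borrowed from the common hybrid CTC/attention formulation; the exact form and placement of the weight in Eq. (7) may differ.

```python
def rerank_candidates(att_logp, ctc_logp, k, ctc_weight):
    """Select top-k tokens by attention score, then re-rank by a joint score.

    att_logp, ctc_logp: dicts mapping token -> log-probability of the next subword
    ctc_weight: assumed interpolation weight on the CTC score
    """
    # Step 1: the attention model proposes its k best tokens.
    top_k = sorted(att_logp, key=att_logp.get, reverse=True)[:k]
    # Step 2: only those candidates are re-scored with the joint score.
    scored = {tok: (1.0 - ctc_weight) * att_logp[tok] + ctc_weight * ctc_logp[tok]
              for tok in top_k}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)

# Toy scores: the attention model prefers "a", CTC strongly prefers "b".
att = {"a": -0.2, "b": -1.8, "c": -3.0, "d": -5.0}
ctc = {"a": -6.0, "b": -0.3, "c": -0.1, "d": -0.05}
ranked = rerank_candidates(att, ctc, k=2, ctc_weight=0.3)
```

In this toy case the CTC evidence flips the order of the two surviving candidates, while tokens outside the attention top-k are never scored by CTC.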
3.3 Streaming models
The streaming RNN-T model has a uni-directional encoder. While we can directly incorporate a standard LSTM as the building block with either LSTM_cuDNN or LSTM_Custom as described in Section 3.1, incorporating future context into the encoder structure can significantly improve the ASR accuracy, as shown in [Li2019RNNT]. However, unlike [Li2019RNNT], which explores future context frames together with the layer trajectory structure, in this study we propose to use only context modeling in order to save model parameters. Future context at layer $l$ is modeled as
$$\zeta_t^{l} = \sum_{\delta=0}^{\tau} q_{\delta}^{l} \odot g_{t+\delta}^{l} \tag{8}$$
Because $\odot$ is the element-wise product, Eq. (8) increases the number of model parameters only very slightly. It transfers a lower layer vector $g_t^{l}$ together with its future vectors $g_{t+\delta}^{l}$ into a new vector $\zeta_t^{l}$, where $\delta$ is the future frame index and $\tau$ is the number of lookahead frames. We modify the LSTM_cuDNN and LSTM_Custom blocks with context modeling as follows.
LSTM_cuDNN_Context: the block is constructed with a Nvidia cuDNN LSTM layer, followed by a linear projection layer, then the context modeling layer, and finally a layer normalization layer.
LSTM_Custom_Context: the block is constructed with the layer normalized LSTM layer with projection, and then followed by the context modeling layer.
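A minimal sketch of the context modeling layer is given below, assuming it is an element-wise weighted sum over the current frame and a fixed number of future frames, as the description of Eq. (8) suggests. Treating frames beyond the end of the utterance as zeros is an assumption, and the function name is hypothetical.

```python
def context_layer(g, q):
    """Element-wise weighted sum over the current and future frames.

    g: list of T vectors (outputs of the layer below)
    q: list of tau + 1 weight vectors, one per frame offset 0..tau
    Returns T vectors, each the sum over offsets of q[offset] * g[t + offset].
    """
    num_frames, dim = len(g), len(g[0])
    out = []
    for t in range(num_frames):
        zeta = [0.0] * dim
        for offset, weights in enumerate(q):
            if t + offset >= num_frames:
                break  # assumed: frames past the utterance end contribute zero
            for i in range(dim):
                zeta[i] += weights[i] * g[t + offset][i]
        out.append(zeta)
    return out

g = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # T = 3 frames, 2 features
q = [[1.0, 1.0], [0.5, 0.0]]               # one frame of lookahead
zeta = context_layer(g, q)

# Parameter cost for 4 lookahead frames on a 640-dimensional block output:
extra_params = (4 + 1) * 640
```

Because the weights are vectors rather than matrices, the cost grows linearly with the feature dimension, which is why the layer adds so few parameters compared with an LSTM layer.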
A similar concept of context modeling was applied to RNNs in [wang2016lookahead] as the lookahead convolution layer. However, it was applied only to the top layer of a multi-layer RNN. In contrast, in this study we apply context modeling to every block of LSTM_cuDNN or LSTM_Custom, and also investigate its effectiveness in the context of E2E modeling. For RNN-T, we also investigate initializing the encoder with either CTC [rao2017exploring] or CE training [Hu2020].
RNN-AED models use blocks of LSTM_cuDNN_Context as the encoder. Experiments with LSTM_Custom_Context will be part of a future study. The streaming mechanism we have chosen for this study is Monotonic Chunkwise Attention (MoChA) [mocha]. MoChA consists of a monotonic attention mechanism [monotonic_attention], which scans the encoder output in left-to-right order and selects a particular encoder state when it decides to trigger the decoder. This selection is made by sampling from a parameterized Bernoulli random variable. Once a trigger point is detected, MoChA also uses an additional lookback window and applies a regular softmax attention over it. Note that the sampling operation precludes the use of standard backpropagation. Therefore, we train with respect to the expected values of the context vectors. Please refer to [mocha] for more details.
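The scan-then-lookback behavior described above can be sketched for a single decoder step at inference time. This is only an illustration: replacing the Bernoulli sample with a 0.5 threshold, the input names `p_choose` and `energies`, and the window size are assumptions, and the expectation-based training procedure is not shown.

```python
import math

def mocha_step(p_choose, energies, start, chunk_size):
    """One decoder step of hard monotonic chunkwise attention.

    p_choose:   selection probability for each encoder frame at this step
    energies:   chunk-attention energy for each encoder frame at this step
    start:      frame index at which the previous step stopped
    chunk_size: number of frames in the lookback window
    Returns (selected frame index or None, dict frame index -> attention weight).
    """
    for j in range(start, len(p_choose)):
        if p_choose[j] >= 0.5:  # assumed: threshold instead of sampling
            first = max(0, j - chunk_size + 1)
            window = range(first, j + 1)
            peak = max(energies[i] for i in window)
            exps = {i: math.exp(energies[i] - peak) for i in window}
            total = sum(exps.values())
            # Regular softmax attention restricted to the lookback window.
            return j, {i: e / total for i, e in exps.items()}
    return None, {}  # no trigger: the decoder emits nothing for this step

p_choose = [0.1, 0.2, 0.7, 0.9]
energies = [0.0, 1.0, 1.0, 2.0]
index, weights = mocha_step(p_choose, energies, start=0, chunk_size=2)
```

The next decoder step would resume its scan from `index`, which is what keeps the attention monotonic and therefore usable in a streaming setting.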
To enable the streaming scenario in Transformer-AED models, we borrow the idea of triggered attention (TA) [moritz2019triggered], where the CTC conducts frame-synchronized decoding to select the top-k candidates for each frame, and the attention model is then leveraged to jointly re-rank the candidates using Eq. (7) once a new subword is triggered by the CTC. Since the Transformer encoder is deeper than the LSTM encoder, the lookahead method may not be the best solution. We therefore compare the chunk-based method with the lookahead-based method. The former segments the entire input into several fixed-length chunks and feeds them into the model chunk by chunk, while the latter is exactly the same as the method used in RNN-T and RNN-AED. For the chunk-based encoder, the decoder can see up to the end of a chunk. For the lookahead-based encoder, we set a fixed window size for the decoder.
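The difference between the two ways of exposing future context can be illustrated with self-attention masks, where entry `[t][s]` indicates whether frame `t` may attend to frame `s`. The sketch assumes unrestricted left context, which the text does not specify, and both function names are hypothetical.

```python
def chunk_mask(num_frames, chunk_size):
    """Chunk-based: each frame sees everything up to the end of its own chunk."""
    mask = []
    for t in range(num_frames):
        last = min(num_frames - 1, (t // chunk_size + 1) * chunk_size - 1)
        mask.append([s <= last for s in range(num_frames)])
    return mask

def lookahead_mask(num_frames, lookahead):
    """Lookahead-based: each frame sees a fixed number of future frames."""
    return [[s <= min(num_frames - 1, t + lookahead) for s in range(num_frames)]
            for t in range(num_frames)]

chunked = chunk_mask(num_frames=6, chunk_size=3)
sliding = lookahead_mask(num_frames=6, lookahead=1)
```

With stacked blocks, the right context of the lookahead mask accumulates layer by layer, whereas the chunk mask never looks past the chunk boundary, so every block can use the full right context available inside the chunk.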
4 Experiments
In this section, we evaluate the effectiveness of all models by training them with 65 thousand (K) hours of transcribed Microsoft data. The test sets cover 13 application scenarios such as Cortana and far-field speech, containing a total of 1.8 million (M) words. We report the word error rate (WER) averaged over all test scenarios. All the training and test data are anonymized with personally identifiable information removed.
For fair comparison, all E2E models built for this study have around 87 M parameters. The input feature is 80-dimension log Mel filter bank with a stride of 10 milliseconds (ms). Three of them are stacked together to form a 240-dimension super-frame. This is fed to the encoder networks for RNN-T and RNN-AED, while Transformer-AED directly consumes the 10 ms feature. All E2E models use the same 4 K word piece units as the output target.
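The super-frame construction can be sketched as follows: three consecutive 80-dimensional frames are concatenated into one 240-dimensional vector, so the effective frame shift becomes 30 ms. Dropping a trailing remainder of fewer than three frames is an assumption, and the function name is hypothetical.

```python
def stack_frames(frames, n=3):
    """Concatenate every n consecutive frames into one super-frame (no overlap)."""
    usable = len(frames) - len(frames) % n  # assumed: drop the trailing remainder
    return [sum(frames[i:i + n], []) for i in range(0, usable, n)]

# Seven dummy 80-dimensional frames at a 10 ms stride.
frames = [[float(t)] * 80 for t in range(7)]
super_frames = stack_frames(frames)
super_frame_shift_ms = 10 * 3
```

Here the seven input frames yield two super-frames, and the seventh frame is discarded under the stated assumption.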
4.1 Non-streaming E2E models
As described in Section 3.1, the non-streaming RNN-T model uses bi-directional LSTMs from the Nvidia cuDNN library in its encoder. The LSTM memory cell size is 780. The LSTM outputs from the forward and backward directions are concatenated to a total dimension of 1560 and then linearly projected to dimension 780, followed by a layer normalization layer. There are 6 such stacked blocks in total. The prediction network has 2 stacked blocks, each of which contains a uni-directional cuDNN LSTM with a memory cell size of 1280, followed by a linear projection layer that reduces the dimension to 640 and then a layer normalization layer.
The non-streaming RNN-AED model uses exactly the same encoder and decoder structures as the non-streaming RNN-T model. Similar to [bahdanau2016end], a location-aware attention mechanism is used. In addition to the encoder and decoder hidden states, this mechanism also takes alignments from previous decoder step as inputs. The attention dimension is 512.
The Transformer-AED model has 18 Transformer blocks in the encoder and 6 Transformer blocks in the decoder. Before the Transformer blocks in the encoder, we use a 4-layer VGG network with a total stride of 4 to pre-process the speech feature. The number of attention heads is 8 and the attention dimension of each head is 64. The dimension of the feed-forward layer in the Transformer blocks is 2048. The combination weights for joint training and joint decoding are both 0.3.
As shown in Table 1, the non-streaming AED models have a clear advantage over the non-streaming RNN-T model due to the power of attention modeling. Transformer-AED improves over RNN-AED with a 2.7% relative WER reduction.
4.2 Surpassing hybrid model with streaming E2E models
In [li2020high] we reported results from our best hybrid model, called the contextual layer trajectory LSTM (cltLSTM) [li2019improving]. The cltLSTM was trained with a three-stage optimization process. This model was able to obtain a 16.2% relative WER reduction over the CE baseline. Introducing 24 frames of total future context further yields an 18.7% relative WER reduction. The encoder latency is only 480 ms (24 × 20 ms; the stride per frame is 20 ms due to frame skipping [Miao16]). Hence, this cltLSTM model (Table 2) presents a very challenging streaming hybrid model to beat. This model has 65 M parameters and is decoded with a 5-gigabyte 5-gram decoding graph.
We list the results for all streaming E2E models in Table 2. The baseline RNN-T implementation uses unidirectional cuDNN LSTMs in both the encoder and the decoder. The encoder has 6 stacked blocks of LSTM_cuDNN. Each block has a unidirectional cuDNN LSTM with 1280 memory cells, whose output is projected to 640 dimensions and followed by layer normalization. The prediction and joint networks are the same as in the non-streaming RNN-T model. This RNN-T model obtains 12.16% test WER. The second RNN-T model inserts the context modeling layer (Eq. (8)) after the linear projection layer in each block. The context modeling has 4 frames of lookahead at each block, and therefore the encoder has $4 \times 6 = 24$ frames of lookahead. Because the frame shift is 30 ms, the total encoder lookahead is 720 ms. The lookahead brings a large WER improvement, obtaining 10.65% WER, which is a 12.4% relative WER reduction from the first RNN-T model without any lookahead. We also followed the lookahead convolution proposed in [wang2016lookahead] by using 24 frames of lookahead only on the topmost RNN block. This model gives 11.19% WER, showing that our proposed context modeling, which allocates lookahead frames equally to each block, is better than lookahead convolution [wang2016lookahead], which puts all lookahead frames on the top layer only.
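The lookahead and relative-improvement figures quoted above follow from simple arithmetic, reproduced here as a small check. The helper names are hypothetical; the inputs are the values stated in the text.

```python
def encoder_lookahead_ms(frames_per_block, num_blocks, frame_shift_ms):
    """Total encoder lookahead when every block looks ahead by the same amount."""
    return frames_per_block * num_blocks * frame_shift_ms

def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# Context modeling: 4 frames per block, 6 blocks, 30 ms frame shift.
per_block_total = encoder_lookahead_ms(4, 6, 30)
# Lookahead convolution: all 24 frames on the top block only.
top_only_total = encoder_lookahead_ms(24, 1, 30)
# Baseline RNN-T (12.16%) versus RNN-T with context modeling (10.65%).
gain = round(relative_wer_reduction(12.16, 10.65), 1)
```

Both lookahead configurations have the same 720 ms budget, so the WER difference between them reflects where the future frames are placed rather than how many are used.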
Next, we look at the impact of encoder initialization for RNN-T. As shown in Table 2, the CTC initialization of the RNN-T encoder does not help much, while the CE initialization significantly reduces the WER to 9.80%. This is an 8.0% relative WER reduction from the randomly initialized model. The CTC initialization makes the encoder emit token spikes together with many blanks, while the CE initialization enables the encoder to learn time alignment. Given the gain from CE initialization, we believe the encoder of RNN-T functions more like the acoustic model in a hybrid model. Note that CE pre-training needs time alignments, which are hard to obtain for word piece units because many of them do not have a phoneme realization. However, the time alignment for words is still accurate. We therefore make an approximation and obtain the alignment for a word piece by simply segmenting the duration of its word equally among its constituent word pieces.
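The alignment approximation for word pieces can be sketched as below: the frame span of a word is divided equally among its word pieces. The integer rounding of the boundaries, the half-open frame interval, and the example word pieces are assumptions for illustration.

```python
def split_word_alignment(start_frame, end_frame, pieces):
    """Divide the frames [start_frame, end_frame) equally among word pieces.

    Returns a list of (piece, piece_start, piece_end) tuples.
    """
    count = len(pieces)
    duration = end_frame - start_frame
    # Assumed rounding: boundaries are floored to whole frames.
    bounds = [start_frame + (duration * i) // count for i in range(count + 1)]
    return [(piece, bounds[i], bounds[i + 1]) for i, piece in enumerate(pieces)]

# A word aligned to frames 10..21 and split into two hypothetical word pieces.
alignment = split_word_alignment(10, 22, ["_stream", "ing"])
```

Each frame inside a piece's span would then receive that piece as its CE training target.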
For the last RNN-T model, we put the projection layer and layer normalization inside the LSTM cell (LSTM_Custom), and then insert the context modeling layer after it. Putting the projection layer inside allows us to use a larger number of memory cells while keeping a model size similar to that of the LSTM_cuDNN setup. This LSTM has 2048 memory cells and the projection layer reduces the output size to 640. This model finally gives 9.27% WER, which is slightly better than our best hybrid model.
| streaming models | WER (%) | encoder lookahead |
|---|---|---|
| RNN-T: cuDNN+convolution [wang2016lookahead] | 11.19 | 720 ms |
| RNN-T: cuDNN+Context+CTC init. | 10.62 | 720 ms |
| RNN-T: cuDNN+Context+CE init. | 9.80 | 720 ms |
| RNN-T: Custom+Context+CE init. | 9.27 | 720 ms |
| Transformer-AED: Lookahead method | 10.26 | 720 ms |
| Transformer-AED: Chunk-based method | 9.16 | 720 ms |
With the same encoder architecture as the cuDNN RNN-T, the MoChA-based streaming RNN-AED model gives impressive results. Unlike RNN-T, it does not need any initialization and is still able to slightly outperform it in an apples-to-apples comparison (9.61% vs 9.80%). To the best of our knowledge, this is the first time a streaming RNN-AED has outperformed RNN-T on a large-scale task. Note that our previous study did not observe an accuracy improvement for RNN-AED with CE initialization [Hirofumi2020streaming]. We will investigate whether RNN-AED can also benefit from the customized LSTM function in a future study.
The architecture of the streaming Transformer-AED model is the same as the non-streaming one. For the lookahead context-modeling method, each encoder block looks ahead 1 frame. Considering that the total stride of the VGG network is 4 and the frame shift is 10 ms, the 18 encoder blocks give the encoder a latency of $18 \times 4 \times 10 = 720$ ms. The decoder of the lookahead method introduces an extra 240 ms latency. The chunk-based method considers future context with a fixed chunk. The latency of each frame depends on its position within the chunk, resulting in a 720 ms average latency without extra decoder latency. The chunk-based method obtains 9.16% WER, significantly outperforming the lookahead method, mainly because the bottom Transformer blocks of the lookahead approach cannot enjoy the full advantage provided by the right context.
5 Conclusions
This work presents the first large-scale comparative study of three popular E2E models (RNN-T, RNN-AED, and Transformer-AED). The models are compared in both streaming and non-streaming modes. All models are trained with 65K hours of Microsoft’s internal anonymized data. We observe that with the same encoder structure, AED is better than RNN-T for both non-streaming and streaming models. With customized LSTM and CE initialization for encoder, the RNN-T model becomes better than RNN-AED. Among all models, Transformer-AED obtained the best WERs in both streaming and non-streaming modes.
In this study, both streaming RNN-T and Transformer-AED outperformed a highly-optimized hybrid model. There are several significant factors contributing to this success. For streaming RNN-T, the proposed context modeling reduces the WER by 12.4% relative from the one without any lookahead. The CE initialization for RNN-T improves over the random initialization baseline by 8.0% relative WER reduction. This shows pretraining is helpful even on a large scale task. To utilize future context for streaming Transformer-AED, we show that the chunk-based method is better than the lookahead method by 10.7% relative.