There has been growing interest in building on-device streaming speech recognition models, which provide recognition results instantly as words are being spoken [rnnt]. Such models make predictions based on partial context under strict latency requirements [streaming-e2e, pre-rescore, latency_mocha_msr]. As a result, streaming models tend to be less accurate than non-streaming models, which have access to the entire utterance.
Previous work has shown that this issue can be alleviated by combining streaming models with a second-pass rescoring model [two-pass], where the rescoring model uses the Listen, Attend, and Spell (LAS) architecture [las]. LAS has access to the full context of the utterance and therefore provides better quality than streaming models [chiu2018state]. From the user's perspective, such a two-pass speech model exhibits the advantages of both streaming and non-streaming models: words are recognized as they are spoken, and the final results have high accuracy.
The canonical architecture of the LSTM-based LAS model, however, is designed for beam search and is not efficient as a second-pass rescoring model. The LSTM [lstm] layers process hypothesis tokens sequentially, with temporal dependencies between timesteps. For second-pass rescoring, on the other hand, all hypothesis tokens are already available, so a more efficient design would be to rescore all tokens in parallel.
In recent years there has been growing success in applying the Transformer [transformer] to machine translation, language modeling [t5], and speech recognition [karita2019comparative, qian_tt, fb_tt, speech-xformer]. The Transformer applies self-attention to capture the sequential relations among input features and therefore has no recurrence constraint. This allows the Transformer to compute self-attention in parallel and significantly increases computational efficiency. The Transformer architecture proposed in [transformer] consists of an encoder and a decoder, where each decoder layer has an additional cross-attention that summarizes the encoder output based on the self-attention output.
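To make the contrast with recurrence concrete, here is a minimal single-head scaled dot-product self-attention in plain Python (identity query/key/value projections for brevity; this is an illustrative sketch, not the paper's model). Note that each output position is computed from the full input independently of the others, with no step-to-step dependency.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(x):
    """x: list of d-dimensional vectors. Identity projections for brevity."""
    d = len(x[0])
    out = []
    for q in x:  # each position is an independent piece of work
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out

y = self_attention([[1.0, 0.0], [0.0, 1.0]])
# Each output row is a convex combination of the input rows.
assert abs(sum(y[0]) - 1.0) < 1e-9
```

Because the loop body for each position depends only on the full input `x`, the positions can be computed in parallel, unlike an LSTM step that must wait for the previous hidden state.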
In this work, we address the sequential dependency issue of the original LSTM-based rescoring model with the Transformer. Specifically, the paper proposes to use a Transformer as the second-pass rescorer for parallel rescoring of hypothesis tokens. Unlike beam search, where the Transformer decoder still has to run autoregressively, the rescoring scenario allows parallel processing of the full hypothesis sequence. Such parallelism reduces the length of temporal dependency paths from N to 1, where N corresponds to the hypothesis length. This allows the Transformer rescorer to utilize on-device computation capacity much more efficiently. We further improve the inference speed of the Transformer rescorer by reducing the number of cross-attention layers in the decoder. The Transformer rescorer improves the Word Error Rate (WER) of Google's voice search query test set compared with LSTM rescoring, and on Librispeech [Panayotov2015] it improves the WER on both test clean and test other compared with LSTM rescoring. The second-pass latency at the th percentile, benchmarked on a Google Pixel4 phone on CPUs, is also reduced compared with LSTM rescoring.
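The dependency-path reduction can be sketched in a few lines of toy Python (the `token_logprob` function is a stand-in for a decoder step, not the paper's model): beam search must score token by token, while rescoring a known hypothesis can score every position independently.

```python
def token_logprob(prefix, token):
    """Stand-in for a decoder step; returns a fake log-probability."""
    return -0.1 * (len(prefix) + token % 3)

def score_autoregressive(hyp):
    """Beam-search style: N sequential steps, each depending on the prefix."""
    total, steps = 0.0, 0
    for i, tok in enumerate(hyp):
        total += token_logprob(hyp[:i], tok)
        steps += 1  # temporal dependency path of length N
    return total, steps

def score_parallel(hyp):
    """Rescoring: all prefixes are known up front, so every position can be
    scored independently (e.g., across threads); the path length is 1."""
    per_position = [token_logprob(hyp[:i], tok) for i, tok in enumerate(hyp)]
    return sum(per_position), 1

hyp = [5, 2, 9, 4]
seq_score, seq_depth = score_autoregressive(hyp)
par_score, par_depth = score_parallel(hyp)
assert abs(seq_score - par_score) < 1e-9  # identical scores, shorter path
```

The two functions produce the same hypothesis score; only the length of the sequential dependency chain differs.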
2 Transformer Rescorer
2.1 Two-Pass Model
A two-pass model consists of a first-pass model and a second-pass model. Here we use RNN-T [Graves2012, Graves2013] as the first-pass model and a Transformer as the second-pass model. Specifically, our Transformer-based two-pass model, as shown in Figure 1, consists of four components: the RNN-T encoder, the RNN-T decoder, an additional encoder, and a Transformer decoder as the rescorer. The input acoustic frames are denoted as x = (x_1, ..., x_T), where x_t are stacked log-mel filterbank energies and T is the number of frames in x. In the first pass, each acoustic frame x_t is passed through the RNN-T encoder, consisting of multi-layer LSTMs [lstm], to get the encoder output. The RNN-T decoder takes the acoustic features from the RNN-T encoder and generates hypotheses in a streaming fashion, denoted as y = (y_1, ..., y_L), where L is the label sequence length. Here y is a sequence of word-piece tokens [vocabulary]. In the second pass, the full output of the RNN-T encoder is passed to a small additional encoder to generate e, which is then passed to the Transformer decoder. The additional encoder is added because it has been found useful for adapting the encoder output to be more suitable for the second-pass model [streaming-e2e]. The RNN-T model structure and the additional encoder are exactly the same as in [streaming-e2e]. During training, the Transformer decoder computes the probability of the output label sequence y given the full audio sequence x. More details about rescorer training are given in Section 2.3. During decoding, the Transformer decoder rescores multiple top hypotheses from the RNN-T.
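The two-pass flow described above can be sketched with toy stand-ins (all function names and scoring rules here are illustrative, not the paper's models): the first pass streams hypotheses, and the second pass re-ranks the top hypotheses using the full utterance context.

```python
def rnnt_first_pass(frames):
    """Stand-in streaming pass: returns top hypotheses with first-pass scores."""
    hyp_a = [f % 4 for f in frames]
    hyp_b = [(f + 1) % 4 for f in frames]
    return [(hyp_a, -1.0), (hyp_b, -1.5)]

def additional_encoder(encoder_output):
    """Stand-in for the small adapter between the two passes."""
    return [2 * e for e in encoder_output]

def toy_rescorer(adapted, hyp):
    """Stand-in full-context scorer: higher is better."""
    return -sum(abs(t - a) for t, a in zip(hyp, adapted))

def two_pass_decode(frames, encoder_output):
    """Stream hypotheses in pass 1, then pick the best after rescoring."""
    hyps = rnnt_first_pass(frames)
    adapted = additional_encoder(encoder_output)
    return max(hyps, key=lambda pair: toy_rescorer(adapted, pair[0]))[0]

best = two_pass_decode([1, 2, 3], [1.0, 2.0, 3.0])
```

In the real system the rescorer is the Transformer decoder and the first-pass scores come from the RNN-T beam; the sketch only shows the data flow between the four components.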
2.2 Transformer Rescorer Architecture
The architecture of our Transformer rescorer is based on the conventional Transformer decoder [transformer] with some cross-attention layers removed. A conventional Transformer decoder layer contains both self-attention and cross-attention, where the query of the cross-attention originates from the output of the self-attention. In the Transformer rescorer, we improve efficiency by removing the cross-attention from some decoder layers and interleaving those layers with conventional decoder layers. A decoder layer without cross-attention shares the same architecture as a conventional Transformer encoder layer [transformer]. The architecture of the resulting rescorer is illustrated in Figure 2, where layers without cross-attention are annotated as self-decoder. The Transformer rescorer takes the RNN-T's hypothesis as input and feeds the tokens to the self-attention layers, while the cross-attention layers attend to the encoder output to summarize the acoustic signals. Our rescorer model consists of Transformer layers, each with attention model dimension and feed-forward dimension . Both cross-attention and self-attention layers use multi-headed attention with heads. The rescorer model has parameters.
Our design of keeping only two cross-attention layers in the rescorer is based on observing the attention mechanism of the Transformer decoder. In the first Transformer decoder layer, the self-attention conditions only on the hypothesis tokens, so the resulting cross-attention generates its query solely from language modeling information. The lack of acoustic information when generating the attention query inherently limits the effectiveness of the first cross-attention. After the first cross-attention layer, the output of the first decoder layer contains acoustic information, and the following decoder layers can condition on both acoustic and language modeling information to generate effective cross-attention queries. Thus, it is critical to have a second cross-attention layer in the decoder. On the other hand, cross-attention layers beyond the second do not introduce an additional modality and have diminishing returns in terms of model quality. As a comparison, the cross-attention of the LAS model conditions on both the previous attention context and the text tokens, and therefore requires only one cross-attention in the decoder. We demonstrate these properties with an ablation study in Section 3.
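The interleaved stack can be described as a simple layer configuration; the sketch below is illustrative only (a four-layer stack with cross-attention on the first two layers is an assumed placement for the example, not necessarily the paper's final configuration).

```python
def build_rescorer_layers(num_layers, cross_attention_at):
    """Build a decoder stack where only the layers whose index is in
    `cross_attention_at` carry a cross-attention block; the rest are
    self-decoder layers (self-attention + feed-forward only)."""
    return [
        {
            "self_attention": True,
            "cross_attention": i in cross_attention_at,
            "feed_forward": True,
        }
        for i in range(num_layers)
    ]

# Assumed placement for illustration: cross-attention on layers 0 and 1.
stack = build_rescorer_layers(num_layers=4, cross_attention_at={0, 1})
num_cross = sum(layer["cross_attention"] for layer in stack)
assert num_cross == 2  # only two cross-attention blocks remain
```

Dropping a cross-attention block removes its projection matrices, which is where the parameter and FLOPs savings in Section 4 come from.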
2.3 Rescorer Training
As with the LAS rescorer training described in [two-pass], the Transformer rescorer is trained after the first-pass model. During second-pass training, the RNN-T encoder and RNN-T decoder are frozen. The additional encoder and the Transformer rescorer are trained in two stages: cross-entropy (CE) and minimum word error rate (MWER) training [mwer]. During CE training, the frozen RNN-T encoder generates the acoustic features for the additional encoder, and the Transformer rescorer is trained to predict the groundtruth sequence from the full audio context provided by the additional encoder and the prefix of the label sequence y_{<l} = (y_1, ..., y_{l-1}), where y_l is the label to predict. During MWER training, the Transformer rescorer is trained to re-rank the hypotheses generated by the RNN-T, which bridges the gap from CE training to inference [two-pass]. More specifically, given acoustic input x, the groundtruth transcript y*, the probability P(y_i | x) computed by the rescorer model for any given target sequence y_i, and a set of hypotheses H = {y_1, ..., y_b} where b is the beam size, the MWER loss is defined as

L_MWER = sum_{y_i in H} P_hat(y_i | x) [W(y_i, y*) - W_hat]

where P_hat(y_i | x) = P(y_i | x) / sum_{y_j in H} P(y_j | x) represents the conditional probability the Transformer rescorer assigns to hypothesis y_i among all hypotheses in H, W(y_i, y*) is the number of word errors of y_i, and W_hat is the average number of word errors among H. In our MWER training we use the N-best approximation approach for calculating the expected word errors [mwer].
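The N-best MWER loss above can be computed directly from the rescorer's log-probabilities and per-hypothesis word-error counts; a plain-Python sketch (function and argument names are illustrative):

```python
import math

def mwer_loss(log_probs, word_errors):
    """N-best MWER loss.

    log_probs[i]:   rescorer log P(y_i | x) for hypothesis i.
    word_errors[i]: W(y_i, y*), the word-error count of hypothesis i.
    """
    # P_hat(y_i | x): probability renormalized over the hypothesis set H.
    m = max(log_probs)
    exps = [math.exp(lp - m) for lp in log_probs]
    z = sum(exps)
    p_hat = [e / z for e in exps]
    # W_hat: average word errors over H.
    w_bar = sum(word_errors) / len(word_errors)
    # Expected relative word errors under the renormalized distribution.
    return sum(p * (w - w_bar) for p, w in zip(p_hat, word_errors))

# Putting all probability mass on the lower-error hypothesis gives a
# negative (better) loss; mass on the worse hypothesis gives a positive one.
good = mwer_loss([-10.0, 0.0], [2, 0])
bad = mwer_loss([0.0, -10.0], [2, 0])
assert good < 0.0 < bad
```

Subtracting the average W_hat centers the objective, so hypotheses better than the list average are rewarded and worse ones penalized, which is what drives re-ranking.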
3 Quality Experiments
3.1 Experiment Setup
We conduct experiments on the Librispeech [Panayotov2015] dataset and a large-scale internal dataset. We use SpecAugment [park2019specaugment] with the same configuration as described in [largespecaugment] during training. Similar to [streaming-e2e], we apply a constant learning rate, maintain an exponential moving average (EMA) [ema] of the weights during training, and use the EMA weights for evaluation. Both the LSTM and Transformer rescorers are trained with CE and MWER. The N-best size for MWER training matches the rescoring behavior during evaluation, where the top hypotheses from the RNN-T are used for rescoring. The prediction targets are word pieces [vocabulary] derived from a large corpus of text transcripts. All models are implemented in TensorFlow [tf] using the Lingvo [lingvo] toolkit and trained on Tensor Processing Unit (TPU) slices.
3.2 Librispeech Experiment
In this experiment, the models are trained on the Librispeech 960h training set and evaluated on the clean and noisy test sets without an external language model. To maintain low-latency streaming recognition, the first-pass RNN-T models in all compared systems use a uni-directional LSTM encoder with zero right-context frames. As shown in Table 1, both the LSTM rescorer and the Transformer rescorer significantly improve the WER on the clean and noisy test sets compared to the RNN-T-only model, alleviating the limited-context problem of the first-pass model while still maintaining low-latency streaming recognition. The Transformer rescorer further improves the WER slightly over the LSTM rescorer, and also significantly reduces the second-pass latency, which is studied in detail in Section 4.
Table 1: WER on the Librispeech test sets.
| Model | Test clean | Test other |
3.3 Large Scale Experiment on Voice Search
We perform a large-scale experiment on an internal task, Google Voice Search, and show that the proposed Transformer rescorer is also effective there. In this experiment, the models are trained on a multi-domain training set as described in [multidomain]. These multi-domain utterances span the domains of search, farfield, telephony, and YouTube. The test set includes voice-search utterances (VS) extracted from Google traffic. All datasets are anonymized and hand-transcribed. The transcription of YouTube utterances is done in a semi-supervised fashion [semi1, semi2]. Following [li2020towards, chang2019joint, streaming-e2e], we train the first-pass RNN-T to also emit the end-of-sentence decision to reduce endpointing latency, allowing second-pass rescoring to start early.
As shown in Table 2, the Transformer rescorer improves the WER on the VS test set compared with the LSTM rescorer, where both are trained with CE and MWER. The Transformer rescorer also achieves a relative WER improvement over the first-pass model.
Table 2: WER on the Voice Search test set.
| Transformer rescorer CE | |
| Transformer rescorer MWER | |
3.4 Full Context Rescoring
An additional capability of the Transformer rescorer is to utilize the full hypothesis when rescoring each target token. The original LSTM-based rescorer scores each target token conditioned only on the tokens before it; specifically, it learns a conditional probability P(y_l | x, y_{<l}) for each prediction target y_l, where y denotes the hypothesis tokens from the RNN-T and x denotes the acoustic features. A conventional Transformer decoder uses causal self-attention and also learns P(y_l | x, y_{<l}). We explored extending the self-attention to also access the future label context, so that the model learns to score target tokens with P(y_l | x, y_{<l}, y_{>l}). During CE training, using the groundtruth sequence as the full context makes the training target trivial. Thus we randomly swap different proportions of the groundtruth tokens fed to the self-attention layer with alternative tokens sampled from the word-piece vocabulary. Sentinel tokens such as SOS, EOS, UNKNOWN, and the RNN-T blank symbol are excluded from the sampled random tokens. The prediction targets remain the original groundtruth sequence. During MWER training, the RNN-T hypothesis is used as the decoder input to match the inference scenario. In this experiment, the best random-swap proportion achieves only the same WER on the voice search task as causal self-attention. Thus, we report results with causal self-attention for the experiments throughout the paper.
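The random-swap augmentation can be sketched as follows; the sentinel token ids and vocabulary size here are assumed for illustration only, and `swap_tokens` is an illustrative name, not the paper's implementation.

```python
import random

SENTINELS = {0, 1, 2, 3}   # assumed ids for SOS, EOS, UNKNOWN, RNN-T blank
VOCAB_SIZE = 4096          # assumed word-piece vocabulary size

def swap_tokens(tokens, proportion, rng):
    """Replace roughly `proportion` of the tokens fed to self-attention
    with random word-piece tokens, never sampling a sentinel."""
    candidates = [t for t in range(VOCAB_SIZE) if t not in SENTINELS]
    out = list(tokens)
    for i in range(len(out)):
        if rng.random() < proportion:
            out[i] = rng.choice(candidates)
    return out

rng = random.Random(0)
inputs = swap_tokens([10, 11, 12, 13], proportion=0.5, rng=rng)
targets = [10, 11, 12, 13]  # prediction targets stay the ground truth
assert all(t not in SENTINELS for t in inputs)
```

The decoder input is corrupted while the CE targets are untouched, so the model cannot trivially copy the future context it is allowed to attend to.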
4 Latency Optimizations
In this section, we measure the additional latency introduced by the second-pass rescorer on a Google Pixel4 phone on CPUs. For efficient on-device execution, all models are converted to the TensorFlow Lite format with post-training dynamic-range quantization using the TensorFlow Lite Converter [tflite_quanitzation]. Matrix multiplications are executed in 8-bit with little accuracy loss. The benchmark suite consists of 89 utterances with voice action queries. The LSTM rescorer latency baseline is fully optimized and is measured with the lattice rescoring with batching described in [streaming-e2e].
4.1 Effect of Cross-Attention Layers
We investigate the impact of the number of cross-attention layers on quality and latency. As shown in Table 3, we start with a single cross-attention layer in the decoder and gradually add more. We observe a noticeable quality improvement at first, which then quickly diminishes: two cross-attentions improve WER over one, but no further improvement is realized by adding more. In addition, when two cross-attentions are used, we find that the choice of which layers carry them also affects WER. In the end, by selectively applying cross-attention, we achieve a latency reduction (Table 4) and a parameter-size reduction without compromising quality.
Table 3: WER with different cross-attention layer placements.
| Cross-attention layers | WER |
| 1st & 2nd | |
| 1st & 3rd | |
| All 4 layers | |
4.2 Parallelism in Transformer Rescoring
As illustrated in Figure 3, with the hypothesis labels ready from the first-pass decoder output, the Transformer rescorer can finish its computation in a single batched step, as opposed to a series of sequential steps as in the LSTM rescorer, and can therefore better leverage multi-threading during inference. The batch size for the Transformer rescorer corresponds to the number of hypotheses times the number of tokens per hypothesis. Taking the utterance at the th percentile latency as an example, with the top hypotheses used, the resulting batch is large. This large batch provides better parallelism and as a result benefits more from using multiple threads, which reduces latency (Table 4). This multi-threading benefit is not observed in the LSTM-based rescorer. This might be due to (1) the limited parallelism in the LSTM, where batching is done within each inference step with a relatively small batch size, and (2) the extra overhead that multi-threading within each inference step can introduce due to context switches across inference steps and layers.
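The batching scheme can be sketched as flattening every (hypothesis, token position) pair into one work list; the function name and tuple layout are illustrative, but the shape of the computation matches the description above: with B hypotheses of length U, all B*U positions are scored in one step instead of U sequential steps.

```python
def make_rescoring_batch(hypotheses):
    """Flatten (hypothesis, position) pairs into one batch of work items.

    Each item carries the known prefix and the target token at that
    position, so every item can be scored independently and in parallel."""
    batch = []
    for h, hyp in enumerate(hypotheses):
        for pos in range(len(hyp)):
            batch.append((h, pos, hyp[:pos], hyp[pos]))
    return batch

hyps = [[1, 2, 3]] * 8              # e.g. top-8 hypotheses, 3 tokens each
batch = make_rescoring_batch(hyps)
assert len(batch) == 8 * 3          # one work item per (hypothesis, token)
```

A large flat batch like this keeps multiple CPU threads busy with uniform work items, which is why the Transformer rescorer benefits from a second thread while the step-at-a-time LSTM rescorer does not.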
4.3 Latency Measurements and Distributions
An overall breakdown of the latency optimizations is shown in Table 4. The Transformer rescorer achieves a latency reduction compared to the LSTM rescorer, measured on the utterance at the th percentile latency of the LSTM rescorer.
The initial latency of the Transformer rescorer, with cross-attention in every decoder layer, improves once only two cross-attentions are kept. Compared with the LSTM baseline, this latency improvement comes from the reduced FLOPs: the Transformer rescorers with fewer cross-attentions require fewer FLOPs than the LSTM. Using two threads reduces the latency further for the Transformer rescorer, while the LSTM rescorer does not benefit from multi-threading.
Table 4: Second-pass latency optimizations.
| Initial latency (cross attention in all layers) | |
| Parallelism in two threads | |
We also compare the latency distribution over the full benchmark suite, shown in Figure 4. The Transformer rescorer is consistently faster than the LSTM rescorer at almost every latency percentile.
5 Conclusions
In this work we present a Transformer rescorer for a two-pass speech recognition model. Our proposed Transformer rescorer substantially reduces the on-device computation latency of the second-pass model by taking advantage of the parallelism of the Transformer decoder and by reducing the number of cross-attention layers. On a Google Voice Search task, as well as on the Librispeech test clean and test other sets, the Transformer rescorer achieves lower WER than an LSTM rescorer.
Acknowledgements
We thank the TF-Lite team for their help in getting the Transformer model running on device, especially T.J. Alumbaugh, Jared Duke, Jian Li, Feng Liu and Renjie Liu. We are also grateful for insightful discussions with Shuo-yiin Chang, Ian McGraw, Tara Sainath and Yonghui Wu.