Streaming parallel transducer beam search with fast-slow cascaded encoders

by Jay Mahadeokar, et al.

Streaming ASR with strict latency constraints is required in many speech recognition applications. To achieve the required latency, streaming ASR models sacrifice accuracy compared to non-streaming ASR models because they lack future input context. Previous research has shown that streaming and non-streaming ASR for RNN Transducers can be unified by cascading causal and non-causal encoders. This work improves upon the cascaded encoders framework by leveraging two streaming non-causal encoders with variable input context sizes that can produce outputs at different audio intervals (e.g. fast and slow). We propose a novel parallel time-synchronous beam search algorithm for transducers that decodes from fast-slow encoders, where the slow encoder corrects the mistakes generated from the fast encoder. The proposed algorithm achieves up to 20% WER reduction with a slight increase in token emission delays on the public Librispeech dataset and in-house datasets. We also explore techniques to reduce the computation by distributing processing between the fast and slow encoders. Lastly, we explore sharing the parameters in the fast encoder to reduce the memory footprint. This enables low latency processing on edge devices with low computation cost and a low memory footprint.





1 Introduction

A responsive user experience is critical for voice-based virtual assistant applications. The latency of speech recognition determines the perceived system responsiveness for both voice commands (time from speech to action) and dictation (feeling of "snappiness"). Low latency requires streaming ASR, where incoming speech is processed incrementally based on partial context (whereas non-streaming ASR, e.g. sequence-to-sequence models, runs only after observing the whole utterance).

In the family of End-to-End (E2E) ASR models [Graves2012, he2019rnnt, Rao2017, Graves2014, Chan2016, ds2, Miao2016], where acoustic model, pronunciation, and language model are combined into a single neural network, the recurrent neural network transducer, or RNN-T [Graves2012, he2019rnnt, Rao2017], intrinsically supports streaming. In [jiahui_2021_fastemit, jay_2020_arrnnt], it is proposed to improve the RNN-T's token emission latency by sequence-level emission regularization and alignment restrictions, respectively.

In [tara_2019_two_pass], a non-streaming E2E LAS model [Chan2016] is applied for second-pass rescoring to compensate for the accuracy loss from the RNN-T's limited context. However, [chiu_2019_long_form] shows that LAS-type models suffer from accuracy loss for long-form speech utterances compared to a non-streaming encoder-based RNN-T. Inspired by "universal ASR" [yu2020universal], the idea of cascaded encoders (causal and non-causal) with RNN-Ts is introduced in [cascadedNarayanan], where a non-streaming encoder is trained directly on the output of the streaming encoder instead of on the input acoustic features, which allows the non-streaming decoder to use fewer layers than a fully non-streaming model. [li2021better] builds on this work by using a two-pass beam search. The first pass uses only the causal encoder, while during the second pass, additional non-causal layers utilize both the left and the right context of the first-pass encoder outputs as the input to a shared RNN-T decoder.

In other related work, [hu2020deliberation] proposes to attend to both acoustics and first-pass hypotheses ("deliberation network"). [yyshi_dynamic_2021] proposes to apply a subset of an encoder to the beginning of an utterance and the full encoder to the rest of the utterance. [wang2021deliberation] improves the align-refine approach introduced in [chi2020align] by using a cascaded encoder that captures more audio context before refinement, and by using alignment augmentation, which enforces learning label dependency.

However, the non-streaming encoder incurs a non-trivial increase in user-perceived latency and memory footprint for long-form speech applications (e.g., dictation and messaging). We propose to improve the cascaded encoder framework such that both encoders are streaming and non-causal, i.e., both encoders use look-ahead context. The proposed framework is called fast-slow cascaded encoders. The fast encoder produces outputs more frequently, while the slow encoder takes as input multiple segments output by the fast encoder and produces results with longer delays. We propose a novel streaming parallel beam search that leverages both the fast and the slow encoders with a shared search space. The fast encoder beam search produces timely partial results from fast encoder outputs to improve token emission delays. Whenever the slow encoder outputs are available, the slow encoder beam search updates the partial output results and, at the same time, updates the candidates considered by the fast encoder beam search.

Running parallel beam search with fast-slow encoders has real-time factor and memory implications. We carefully analyze these run-time constraints and propose techniques to improve them by distributing parameters across fast-slow encoders and using smaller beam sizes for fast encoders. Similar to [dehghani2018universal, li2019improving, dabre2019recurrent, sho_layer_sharing_2021] we also explore sharing of parameters across layers to reduce the memory footprint.

2 Methodology

This section describes the model architecture, training, and decoding procedures for the proposed model. Similar to the cascaded encoder work [cascadedNarayanan], the proposed method focuses on the encoder in the RNN-T framework [Graves2012].

2.1 Streaming Fast Slow Cascaded Encoders

Figure 1 illustrates the streaming cascaded encoder. Unlike [cascadedNarayanan], where the causal encoder and the non-causal encoder stochastically use different training samples within a minibatch, in this work both encoders leverage the same training data. Rather than cascading a non-streaming encoder with a causal encoder as in [cascadedNarayanan], both encoders in our framework are streaming non-causal encoders.

As shown in Figure 1, given the audio feature input $x$, the cascaded encoder generates two representations, one from the fast encoder and the other from the slow encoder. The same joiner and predictor in the RNN-T model take the representations and then give the logits for the audio and transcript pairs in training. The losses $\mathcal{L}_{\text{fast}}$ and $\mathcal{L}_{\text{slow}}$ are for the fast encoder and the slow encoder, respectively. In training, the final loss is

$$\mathcal{L} = \lambda \, \mathcal{L}_{\text{fast}} + (1 - \lambda) \, \mathcal{L}_{\text{slow}},$$

where $\lambda \in [0, 1]$. The weighted loss not only passes gradients back to both the fast and the slow encoder but also stabilizes the training, especially for deep encoder structures [Andros2019].

Both the fast encoder and the slow encoder use a stack of Emformer [emformer] layers that apply the block processing method to support streaming ASR. The block processing method segments the whole input sequence $X$ into multiple blocks. Each block is padded with the corresponding right context (lookahead context). The Emformer stores the self-attention keys and values of the history context in its states to save computation. In Figure 2, the fast encoder uses segment size 4 and right context 1. The slow encoder uses the same right context size but double the segment size. Given the current block $C_i$, the right context $R_i$, and the history context state $S_i$, the fast encoder gives the following outputs for the current block and the right context:

$$(\hat{C}_i, \hat{R}_i, S_{i+1}) = \mathrm{FastEncoder}(C_i, R_i, S_i)$$

Note that $\hat{C}_i$ is the result of the attention of $C_i$ with the $i$-th segment context, and $\hat{R}_i$ uses the attention of $R_i$ with the $i$-th segment context. Figure 2 shows that the slow encoder takes as input the outputs of two fast encoder forward passes together with the right context output.
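The block layout described above can be sketched in a few lines of Python. This is only an illustration of how frames are grouped, assuming frame indices start at 0 and that the right context is simply truncated at the end of the utterance; `split_into_blocks` is a hypothetical helper, not part of Emformer or of the authors' code.

```python
from typing import List, Tuple


def split_into_blocks(
    num_frames: int, segment_size: int, right_context: int
) -> List[Tuple[range, range]]:
    """Return (block, lookahead) frame-index ranges for block processing."""
    blocks = []
    for start in range(0, num_frames, segment_size):
        end = min(start + segment_size, num_frames)
        # Lookahead frames follow the block; truncated at the utterance end (assumption).
        lookahead_end = min(end + right_context, num_frames)
        blocks.append((range(start, end), range(end, lookahead_end)))
    return blocks


# Settings from Figure 2: segment size 4 (fast) and 8 (slow), right context 1.
fast_blocks = split_into_blocks(num_frames=16, segment_size=4, right_context=1)
slow_blocks = split_into_blocks(num_frames=16, segment_size=8, right_context=1)

for block, lookahead in fast_blocks:
    print("fast block", list(block), "lookahead", list(lookahead))
```

With these settings, every slow block spans exactly two fast blocks, which is why the slow encoder can run once for every two fast encoder calls.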

Figure 1: Illustration for streaming cascaded encoder using fast-slow encoders. The joiner and predictor are shared for both fast encoder and slow encoder.

2.2 Parallel Beam Search

For RNN Transducers, [Graves2012] outlines a beam search algorithm. Assume that the algorithm uses a search space $\mathcal{H}$. During beam search, we iterate over audio time-steps $t$ and search for the $b$ (beam size) most probable ASR hypotheses using sets $A$ and $B$. $B$ contains the current best hypotheses at time $t$, while the set $A$ stores the most probable candidates for step $t+1$. We reuse the beam search algorithm and extend it for fast-slow cascaded encoders as outlined in Algorithm 1.

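  Input: features x_{1:T}; segment sizes N_f, N_s; beam sizes b_f, b_s
  B_f ← {∅}; B_s ← {∅}; c ← []
  for t = N_f to T by N_f do
     (y_f, S_f) ← FastEncoder(x_{t-N_f+1:t}, S_f)
     B_f ← BeamSearch(y_f, B_f, b_f); c ← concat(c, y_f)
     if t mod N_s = 0 or t = T then
        (y_s, S_s) ← SlowEncoder(c, S_s)
        B_s ← BeamSearch(y_s, B_s, b_s); B_f ← B_s; c ← []
     end if
  end for
  return y with highest Pr(y) in B_s
Algorithm 1 Parallel beam search for cascaded encoders.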
Figure 2: Illustration of parallel beam search with cascaded encoders. Both the fast encoder and the slow encoder use the same right context size of 1. The fast encoder and the slow encoder use segment sizes 4 and 8, respectively.

We maintain two sets, $B_f$ and $B_s$, that represent the best hypotheses generated using the fast and slow encoder outputs, respectively. Both the fast and the slow encoders use Emformer layers, which use states to store the keys and values for the history left context. Let $S_f$ and $S_s$ denote these states. Let $N_f$, $N_s$ denote the segment sizes and $b_f$, $b_s$ denote the beam sizes used for the fast and slow encoder beam searches. We iterate over the audio time-steps in intervals of $N_f$ and run the fast encoder. The fast encoder outputs $y_f$ and updates the state $S_f$. $y_f$ is used to call beam search to update $B_f$, which contains the current partial hypotheses output by ASR. At the same time, $y_f$ is concatenated with the previously cached vector $c$, which forms the input to the slow encoder. We skip details of the right context in Algorithm 1 for simplicity.

When we have processed $N_s$ time-steps, the slow encoder is called with $c$ to produce the slow encoder output $y_s$ and state $S_s$. $y_s$ is then used to update $B_s$ using beam search, which shares the search space $\mathcal{H}$. A shared search space is crucial for an efficient run-time implementation. We then update the set $B_f$ with $B_s$, which typically corrects the outputs from $B_f$, and discard the existing fast hypotheses. In the end, we return $y^*$, the most probable hypothesis in $B_s$. Figure 2 illustrates this with an example of 4 fast encoder calls and 2 slow encoder calls.
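The control flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the encoders and the beam search are passed in as stand-in callables, encoder states and right context are omitted, and all names (`parallel_beam_search`, `toy_fast`, `toy_slow`, `toy_beam`) are hypothetical. The toy beam search only appends one token per encoder frame so that the call pattern can be observed; it is not a real RNN-T search.

```python
from typing import Callable, List, Sequence, Tuple

Hyp = Tuple[Tuple[str, ...], float]  # (tokens, log-probability)


def parallel_beam_search(
    frames: Sequence[float],
    fast_encoder: Callable[[Sequence[float]], List[float]],
    slow_encoder: Callable[[List[float]], List[float]],
    beam_search: Callable[[List[float], List[Hyp], int], List[Hyp]],
    fast_segment: int,
    slow_segment: int,
    fast_beam: int,
    slow_beam: int,
) -> Hyp:
    fast_hyps: List[Hyp] = [((), 0.0)]
    slow_hyps: List[Hyp] = [((), 0.0)]
    cache: List[float] = []  # fast encoder outputs waiting for the slow encoder
    num_frames = len(frames)
    for start in range(0, num_frames, fast_segment):
        end = min(start + fast_segment, num_frames)
        fast_out = fast_encoder(frames[start:end])
        # Timely partial results come from the fast encoder outputs.
        fast_hyps = beam_search(fast_out, fast_hyps, fast_beam)
        cache.extend(fast_out)
        if end % slow_segment == 0 or end == num_frames:
            slow_out = slow_encoder(cache)
            slow_hyps = beam_search(slow_out, slow_hyps, slow_beam)
            fast_hyps = list(slow_hyps)  # slow results replace the fast hypotheses
            cache = []
    return max(slow_hyps, key=lambda hyp: hyp[1])


calls = {"fast": 0, "slow": 0}


def toy_fast(block: Sequence[float]) -> List[float]:
    calls["fast"] += 1
    return [float(value) for value in block]


def toy_slow(cached: List[float]) -> List[float]:
    calls["slow"] += 1
    return [value + 0.5 for value in cached]


def toy_beam(enc_out: List[float], hyps: List[Hyp], beam: int) -> List[Hyp]:
    # Extend every hypothesis by one token per encoder frame, then prune to the beam.
    extended = [
        (tokens + tuple(f"{value:g}" for value in enc_out), score - len(enc_out))
        for tokens, score in hyps
    ]
    return sorted(extended, key=lambda hyp: hyp[1], reverse=True)[:beam]


best = parallel_beam_search(list(range(16)), toy_fast, toy_slow, toy_beam, 4, 8, 2, 10)
print(calls, best)
```

With 16 frames, a fast segment of 4 and a slow segment of 8, the sketch makes 4 fast encoder calls and 2 slow encoder calls, matching the example in Figure 2. The returned hypothesis always comes from the slow beam, since the fast hypotheses are overwritten after every slow pass.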

3 Experimental Setup

3.1 Datasets

3.1.1 Librispeech

The Librispeech [panayotov2015librispeech] corpus contains 970 hours of labeled speech. We extract 80-channel filterbank features computed from a 25 ms window with a stride of 10 ms. We apply spectrum augmentation (SpecAugment [park2019specaugment]) with mask parameter F = 27 and ten time masks with maximum time-mask ratio p = 0.05, as well as speed perturbation.

3.1.2 Large-Scale In-House Data

Our in-house training set combines two sources. The first consists of 20K hours of English video data publicly shared by Facebook users; all videos are completely de-identified before transcription. The second contains 20K hours of manually transcribed de-identified English data with no user-identifiable information (UII) in the voice assistant domain. All utterances are morphed when researchers manually access them to further de-identify the speaker. Note that the data are not morphed during training. We further augment the data with speed perturbation, simulated room impulse response, and background noise, resulting in 83M utterances (145K hours).

We consider three in-house evaluation sets:

VA1 – 10.2K hand-transcribed de-identified short-form utterances (less than five words on average) in the voice assistant domain, collected from internal volunteer participants. The participants consist of households that have consented to have their Portal voice activity reviewed and analyzed.

VA2 – 44.2K hand-transcribed de-identified short-form utterances in the voice assistant domain, collected by a third-party data vendor via Oculus devices.

Q&A – 5.7K hand-transcribed de-identified medium-length utterances (more than 13 words on average) collected by crowd-sourced workers via mobile devices. The utterances consist of free-form questions directed toward a voice assistant.

3.2 Evaluation Metrics

To measure the model’s performance and analyze trade-offs, we track the following metrics:

Accuracy: We use word-error-rate (WER) to measure model accuracy on evaluation sets.

Emission Delay (ED): The emission delay (or finalization delay), as defined in [jay_2020_arrnnt], is the audio duration between the time when the user finished speaking the ASR token and the time when the ASR token was surfaced as part of the 1-best partial hypothesis. It is also referred to as emission latency in [yu2021dualmode]. We track the average (ED Avg) and P99 (ED P99) token emission delays.

Correction rate: Our proposed technique uses a slow encoder to correct mistakes made by the fast encoder. Let $\mathrm{WER}_{\text{fast}}$ be the word error rate if we use the fast encoder's output and $\mathrm{WER}_{\text{slow}}$ be the word error rate when using the slow encoder's output. We define the correction rate (CR) as $\mathrm{CR} = \mathrm{WER}_{\text{fast}} - \mathrm{WER}_{\text{slow}}$.

Real Time Factor: To measure the impact of parallel beam search on run-time and compute, we use the Real Time Factor (RTF) measured on an actual Android device.
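As a worked illustration of these metrics, the sketch below computes the relative WER reduction (WERR) used later in the results, the correction rate, and the two emission delay statistics. The function names are hypothetical, and the nearest-rank percentile is an assumption, since the paper does not state how P99 is computed. The WER values 8.96 and 7.15 are the test-other results of B1 and C4 reported in Table 1; the other inputs are made-up numbers for illustration only.

```python
import math
from typing import Sequence, Tuple


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction (WERR) in percent."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


def correction_rate(fast_wer: float, slow_wer: float) -> float:
    """CR: absolute WER difference between fast and slow encoder outputs."""
    return fast_wer - slow_wer


def emission_delay_stats(delays_ms: Sequence[float]) -> Tuple[float, float]:
    """Return (average, P99) of per-token emission delays in milliseconds."""
    ordered = sorted(delays_ms)
    average = sum(ordered) / len(ordered)
    # Nearest-rank percentile (assumption; other definitions interpolate).
    rank = max(1, math.ceil(0.99 * len(ordered)))
    return average, ordered[rank - 1]


# B1 vs. C4 on test-other (values from Table 1).
print(f"WERR: {relative_wer_reduction(8.96, 7.15):.1f}%")

# Made-up inputs, only to show the call pattern.
print(correction_rate(fast_wer=10.0, slow_wer=8.0))
print(emission_delay_stats([100.0, 200.0, 300.0, 400.0]))
```

The first call reproduces the roughly 20% WERR quoted for B1 versus C4.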

3.3 Model setup

We use an RNN-T model architecture that has Emformer [shi2021emformer] layers as encoders. We use a stacked time reduction layer with a stride of 4, which converts 80-dimensional input features into 320-dimensional features that are input to the encoder. The predictor consists of 3 LSTM layers with layer normalization and 512 hidden units. Both the encoder and the predictor project to embeddings of 1024 dimensions, which are input to the joint layer. The joint layer consists of a simple DNN layer and a softmax layer, predicting over a word-piece vocabulary of size 5k.
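The time reduction step can be illustrated with a short sketch: stacking every 4 consecutive 80-dimensional frames yields one 320-dimensional frame, so with a 10 ms feature stride each stacked frame covers 40 ms of audio. `stack_frames` is a hypothetical helper, and dropping an incomplete tail (rather than padding it) is an assumption made here for simplicity.

```python
from typing import List


def stack_frames(features: List[List[float]], stride: int = 4) -> List[List[float]]:
    """Concatenate every `stride` consecutive frames into one wider frame."""
    usable = len(features) - len(features) % stride  # drop the incomplete tail
    return [
        [value for frame in features[i:i + stride] for value in frame]
        for i in range(0, usable, stride)
    ]


frames = [[float(t)] * 80 for t in range(10)]  # 10 frames of 80-dim features
stacked = stack_frames(frames, stride=4)
print(len(stacked), len(stacked[0]))
```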

For Librispeech experiments, we train our models for 120 epochs. We use an Adam optimizer and a tri-stage LR scheduler with a base learning rate of 0.001, a warmup of 10K iterations, and forced annealing after 60K iterations. Experiments on in-house data follow a similar model architecture and training hyperparameters. Models are trained for 15 epochs on the large-scale training data.

4 Results

4.1 Optimizing WER and latency

4.1.1 Effects of varying slow encoder context

In Table 1, we train baselines B1 to B5 as 20-layer models with context sizes varying from 160 to 6400 ms. As expected, with increased model context, we see improved WERs and increased emission delays. We train models using streaming cascaded encoders (C1 to C4) with 15 fast layers and a fixed context of 160 ms, while the five slow layers are trained with varying contexts. Using streaming parallel beam search, we achieve up to 20% relative WER reduction (WERR) (B1 vs. C4). As shown by the CR metric, since we correct 2.63% of words with C4, the ED P99 degrades from 560 to 800 ms, with minimal effect on ED Avg.

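| Model | Layers | Context (ms) | Test-clean WER | Test-other WER | ED Avg (ms) | ED P99 (ms) | CR |
|---|---|---|---|---|---|---|---|
| B1 | 20 full | 160 | 3.46 | 8.96 | 335 | 560 | N/A |
| B2 | 20 full | | 3.15 | 8.10 | 651 | 1160 | N/A |
| B3 | 20 full | | 3.17 | 7.63 | 971 | 1880 | N/A |
| B4 | 20 full | | 3.11 | 7.23 | 1612 | 3360 | N/A |
| B5 | 20 full | 6400 | 3.10 | 6.99 | 2754 | 6276 | N/A |
| C1 | 15 fast + 5 slow | | 3.22 | 8.24 | 336 | 600 | 1.38 |
| C2 | 15 fast + 5 slow | | 3.11 | 7.83 | 329 | 600 | 1.74 |
| C3 | 15 fast + 5 slow | 160 / 3200 | 2.99 | 7.28 | 329 | 600 | 2.39 |
| C4 | 15 fast + 5 slow | | 2.91 | 7.15 | 346 | 800 | 2.63 |

Table 1: Baseline models trained with different context sizes, compared with cascaded models that use a fixed fast encoder context of 160 ms and a varying slow encoder context. CR and ED are computed on test-other.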

4.1.2 Further improving latency

In this section, we explore further improving the token emission latency of the model. [jay_2020_arrnnt] shows that emission latency can be controlled by restricting the optimized paths, which also reduces compute and improves training throughput. Fast-emit [yu2021fastemit] introduces regularization to force timely token emissions. We empirically verify that applying fast-emit regularization on a restricted set of paths gives the best of both worlds in terms of lower latency and optimal throughput. All experiments in Table 2 use 15 fast and 5 slow encoder layers and AR-RNNT left and right restrictions of 0 ms and 600 ms. Using a larger fast-emit λ, we reduce ED Avg from 329 to 174 ms with some degradation in test-other WER. We also explore reducing the context of the fast encoder to improve token emission delay further, as shown in experiments L4 and L5.

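| Model | Context (ms) | Fast-emit λ | Test-clean WER | Test-other WER | ED Avg (ms) | ED P99 (ms) |
|---|---|---|---|---|---|---|
| L1 | 160 / 3200 | 0.0 | 2.99 | 7.28 | 329 | 600 |
| L2 | 160 / 3200 | 0.001 | 3.02 | 7.47 | 299 | 600 |
| L3 | 160 / 3200 | 0.01 | 3.07 | 7.56 | 174 | 600 |
| L4 | 80 / 3200 | 0.0 | 3.01 | 7.6 | 295 | 520 |
| L5 | 80 / 3200 | 0.01 | 3.09 | 7.72 | 135 | 560 |

Table 2: Experiments with varying fast-emit lambda and smaller fast encoder context. ED is computed using test-other.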

4.2 Optimizing Runtime

4.2.1 Distributing layers between fast / slow encoder

The parallel beam search using the fast-slow encoders impacts the model's real-time factor (RTF). Experiments in Table 3 look into techniques to improve the RTF of the proposed model by analyzing the effects of distributing layers between the fast and slow encoders and of reducing the beam size of the fast encoder search. Models R1 to R4 are trained using 160 ms and 800 ms context sizes for the fast and slow encoders, respectively. We observe that since the slow encoder has a larger context than the fast encoder, it incurs less compute due to overlapping right context, and its execution can be batched across time-steps more efficiently. Combining this with a reduced beam size for the fast encoder beam search, we see improvements to RTF (0.55 to 0.48 for B1 to R3). Configuration R3 provides the best trade-off in terms of WER (8.13), P99 ED (600 ms), and RTF (0.48). Note that, similar to Table 1, Avg ED is mostly unchanged for R1 to R4 compared to B1. Further optimization of the runtime implementation and other techniques, such as applying an additional time reduction layer within the slow encoder, can further improve RTF, which we plan to explore as future work.


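| Model | Layers (fast + slow) | Test-other WER | Beam 10 RTF | Beam 10 CR | Beam 10 ED P99 (ms) | Beam 2 RTF | Beam 2 CR | Beam 2 ED P99 (ms) |
|---|---|---|---|---|---|---|---|---|
| B1 | 20 | 8.96 | 0.44 | | 560 | | | |
| R1 | 3 + 17 | 8.2 | 0.55 | 15.6 | 1120 | 0.42 | 23.3 | 1120 |
| R2 | 7 + 13 | 8.54 | 0.55 | 4.6 | 880 | 0.45 | 8.49 | 960 |
| R3 | 13 + 7 | 8.13 | 0.56 | 2.16 | 600 | 0.48 | 4.84 | 600 |
| R4 | 17 + 3 | 8.45 | 0.55 | 0.81 | 560 | 0.5 | 2.42 | 600 |

Table 3: Experiments analyzing WER vs. RTF trade-offs when distributing the layers between the fast and the slow encoders and using a smaller beam for fast encoder outputs.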

4.2.2 Sharing parameters for memory reduction

Memory optimizations are critical for on-the-edge applications. We explore sharing the parameters of layers in the fast encoder to further reduce memory consumption, similar to [li2019improving]. Our intuition is that since the slow encoder has access to extra future context, the same parameters could be utilized to further correct the mistakes made by the fast encoder.

We explored different layer sharing schemes in the fast-slow cascaded framework, such as sharing layers between the fast and the slow encoders, sharing layers in the slow encoder, and sharing layers in the fast encoder. We found that sharing layers in the fast encoder gave better performance. In Table 4, we only list the results from layer sharing in the fast encoder, specifically, sharing 13 contiguous layers from the 2nd layer to the 14th layer.

In Table 4, using layer sharing on top of baseline models (P1) shows significant WER reduction compared with the baseline of the same model size (B2). Similar to the trend in Table 1, leveraging 800 ms context in the slow encoder (Q1) gets more than 5% relative WER reduction over layer sharing on baseline models (P1). Further extending the slow encoder context to 6.4 s on top of the layer sharing, the 41 million parameter model even outperforms the 79 million parameter model, with a relative WER reduction of 13% on test-other and 9% on test-clean.
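To make the layer sharing scheme concrete, the sketch below maps each of the 20 layer positions to the parameter set it would use when layers 2 to 14 are shared. It assumes the shared layers all reuse a single parameter set, which is one plausible reading of the description and not a confirmed detail of the authors' implementation; `build_layer_to_param_index` is a hypothetical helper.

```python
from typing import List


def build_layer_to_param_index(
    num_layers: int, share_first: int, share_last: int
) -> List[int]:
    """Map each 1-based layer position to the index of its parameter set."""
    mapping: List[int] = []
    next_index = 0
    for layer in range(1, num_layers + 1):
        if share_first < layer <= share_last:
            # Reuse the parameters of the previous layer in the shared range.
            mapping.append(mapping[-1])
        else:
            mapping.append(next_index)
            next_index += 1
    return mapping


mapping = build_layer_to_param_index(num_layers=20, share_first=2, share_last=14)
print(mapping)
print("unique parameter sets:", len(set(mapping)))
```

Under this assumption, the 20-layer stack holds 8 distinct parameter sets, which is consistent with the shared models in Table 4 having the same 41M parameter count as the 8-layer baseline.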




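| Model | Layers | Parameters | Shared layers | Test-clean WER | Test-other WER |
|---|---|---|---|---|---|
| B1 | 20 full | 79M | - | 3.46 | 8.96 |
| B2 | 8 full | 41M | - | 4.32 | 11.05 |
| P1 | | 41M | 2-14 | 3.82 | 9.50 |
| Q1 | 15 fast + 5 slow | 41M | 2-14 | 3.55 | 9.04 |
| | 15 fast + 5 slow | 41M | 2-14 | 3.16 | 7.86 |

Table 4: Experiments comparing baseline models trained with different numbers of parameters, with layer sharing, and with layer sharing in the fast-slow cascaded encoder.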

4.3 In-house dataset

This section evaluates the most promising configurations on the in-house dataset. The first experiment in Table 5 (P1) trains a baseline model using 20 layers. Similar to Table 1, experiments P2 to P5 report results for the proposed technique with 15 fast and 5 slow layers while varying the slow encoder context. We see significant gains on the VA2 and Q&A domains, which consist of longer-form utterances than the VA1 dataset. We see 14.7% and 10.7% WERR on the Q&A and VA2 datasets when comparing P1 vs. P5, with minimal change to Avg ED or P99 ED (not shown in the table).





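| Model | Layers | VA1 WER | Q&A WER | VA2 WER | ED Avg (ms) | CR |
|---|---|---|---|---|---|---|
| P1 | 20 full | 4.71 | 7.51 | 13.35 | 388 | |
| P2 | 15 fast + 5 slow | 4.65 | 7.16 | 12.72 | 372 | 1.5 |
| P3 | 15 fast + 5 slow | 4.57 | 6.82 | 12.13 | 371 | 1.88 |
| P4 | 15 fast + 5 slow | 4.66 | 6.66 | 12.05 | 381 | 1.95 |
| P5 | 15 fast + 5 slow | 4.72 | 6.4 | 11.92 | 410 | 2.29 |

Table 5: Experiments on the in-house dataset with different context sizes for the slow encoder. ED and CR are computed on the Q&A dataset.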

5 Conclusion

We proposed a framework that uses streaming parallel transducer beam search with fast-slow cascaded encoders. We show that the proposed technique achieves 15 to 20% WER reduction on Librispeech and in-house datasets, with trivial degradation to average token emission delays. We empirically show that additional techniques can improve the model's runtime and memory footprint. In future work, we will further explore optimizing the runtime by subsampling in the slow encoder.

Acknowledgement: We would like to thank Frank Seide for careful review and feedback on the paper.