are widely used for streaming automatic speech recognition due to their superior performance and compactness. A sequence transducer model has an encoder to capture context information from the acoustic signal, a predictor to model grammatical, syntactic, and semantic information, and a joiner to combine the two. The work in [25, 22] showed that replacing the LSTM encoder with a self-attention-based transformer [17] yields state-of-the-art accuracy on public benchmark datasets, which is consistent with the trend of applying transformers in various automatic speech recognition scenarios [2, 6, 16, 26, 18, 24, 12, 19, 15, 4, 20, 21].
A wide range of methods have been proposed to improve the transformer model. One popular variant for speech recognition is Conformer [4], which adds depth-wise separable convolutions and the macaron network structure [9] to the transformer. The work in [1, 23] replaced the depth-wise convolution in Conformer with a causal convolution to support streaming scenarios. In , non-causal convolution is used to support streaming speech recognition; however, sequential block processing is required to avoid inconsistency between training and decoding. Sequential block processing segments the input sequence into multiple blocks and trains the model on each block in turn. It is slow to train in low-latency scenarios and does not scale to large datasets.
The multi-head self-attention [17] in the transformer uses several heads that compute attention separately; the attention outputs are concatenated at the end. The work in [14] proposed talking-heads attention, which breaks the separation among heads by inserting two additional lightweight learnable linear projections that transfer information across them.
The Emformer [15] and the augmented memory transformer  support streaming speech recognition through block processing, in which a whole utterance is segmented into multiple blocks. Self-attention is computed on the current block together with its surrounding left context and lookahead context. An augmented memory scheme stores information from previous blocks, which explicitly introduces compact long-form context while keeping computation and runtime memory consumption limited at inference. The attention output from the mean of the current block is used as a memory slot for future blocks.
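The block processing described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `segment_blocks` and the choice to express contexts as frame-index ranges are assumptions made here.

```python
def segment_blocks(num_frames, block_size, left_context, right_context):
    """Return, for each block, the frame index ranges (start, end) of its
    left context, center block, and lookahead (right) context.

    All sizes are in frames; ranges are half-open and clipped to the utterance.
    """
    blocks = []
    for start in range(0, num_frames, block_size):
        end = min(start + block_size, num_frames)
        blocks.append({
            "left": (max(0, start - left_context), start),
            "center": (start, end),
            "right": (end, min(end + right_context, num_frames)),
        })
    return blocks


# Example: 10 frames, blocks of 4 frames, 8 frames of left context, 1 frame of lookahead.
for block in segment_blocks(10, 4, 8, 1):
    print(block)
```

Self-attention for a block would then be restricted to the frames in these three ranges, plus any memory slots carried over from earlier blocks.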
In this work, we advance the Emformer [15] model in the following aspects. First, we adopt an architecture similar to Conformer but use non-causal convolution to support streaming. In comparison with , this work enables parallel block processing with non-causal convolution, achieving a training speed similar to that of the baseline Emformer model. Second, the attention is replaced with the talking-heads attention scheme. Third, we further simplify the augmented memory extraction process, similar to [13], in a scheme referred to as context compression: rather than using the self-attention output from the mean of each block, we directly use the linear interpolation of each block as memory. We evaluate this novel variant of the streaming transformer on a large-scale speech recognition task.
The rest of this paper is organized as follows. In Section 2, we present the methods to advance the Emformer model. Section 3 demonstrates and analyzes the experimental results, followed by a conclusion in Section 4.
2 Methods to advance Emformer
Fig. 1(a) illustrates the forward logic in one Emformer layer. To support streaming speech recognition, Emformer applies parallel block processing to segment an input sequence into multiple non-overlapping blocks , where denotes the index of the current block, and denotes the layer's index. To reduce the boundary effect, where the rightmost vector in has no lookahead context information, a right contextual block , is concatenated with to form a contextual block . At the -th block, the -th Emformer layer takes and a bank of memory vectors as input, and produces and as output, where is fed to the next layer and is inserted into the memory bank to generate , which is carried over to the next block and next layer.
The modified attention mechanism in Emformer attends to the memory bank and yields a new memory slot at each block:
where and are the key and value copies from previous blocks. and are the attention outputs for and respectively; is the mean of center block ; is the attention operation defined in [17] with , and being the query, key and value, respectively. specifies the number of slots in the augmented memory; the most recent slots are used.
are passed to a point-wise feed-forward network (FFN) with layer normalization and residual connection to generate the output of this Emformer layer, i.e.,
, where FFN is a two-layer feed-forward network with ReLU.
2.1 Streaming Non-causal Convolution
The convolution layer in Fig. 1(b) has a structure similar to that of [4], except that layer normalization, rather than batch normalization, is applied right after the depth-wise convolution. In our experiments, layer normalization gives better performance than batch normalization.
The work in [20, 21] uses sequential block processing, in which training and streaming decoding perform the forward logic in the same way. The receptive field of the self-attention and convolution is limited by the block size and the surrounding context size, so using a non-causal convolution in this setting is trivial. However, sequential block processing is slow in training because it does not exploit the parallel computation capacity of GPUs. In low-latency situations where the block size is tiny, it is impractical.
To use the lookahead context in streaming speech recognition, Emformer [15] uses the right-context-hard-copy method in training. This method copies each block 's lookahead context , concatenates the copies, and places them at the beginning of the input sequence. It is essential for avoiding the lookahead context leaking issue in training, in which higher transformer layers see a larger lookahead context than the bottom layer when multiple transformer layers are stacked on top of one another.
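The copy-and-prepend step can be sketched as below. This is an illustrative sketch under stated assumptions: the function name `hard_copy_right_context` is hypothetical, frames are plain list items, and the final block is assumed to have no lookahead. The attention masking that ties each copied lookahead to its own block is not shown.

```python
def hard_copy_right_context(frames, block_size, right_context):
    """Prepend a copy of every block's lookahead frames to the input sequence.

    frames: list of per-frame feature vectors (any objects).
    Returns the copied lookahead frames of all blocks, in block order,
    followed by the original sequence.
    """
    copied = []
    # A block that ends at index `end` looks ahead at frames[end:end + right_context].
    for end in range(block_size, len(frames), block_size):
        copied.extend(frames[end:end + right_context])
    return copied + list(frames)


# Example: 10 frames, block size 4, one lookahead frame per block.
print(hard_copy_right_context(list(range(10)), 4, 1))
```

Because each block attends only to its own copied lookahead, stacking layers does not widen the effective lookahead beyond the configured right context.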
Fig. 2 shows the forward logic of the non-causal convolution operation in Emformer. The output of the attention operation is first split into two parts: the right context and the center block . The same depth-wise convolution is then applied to both parts. For the center block, the convolution is applied directly, as shown in Eq. (14). The right context part is reshaped, padded, convolved, and finally reshaped back to its original shape. In the padding operation, each right context block is padded with its corresponding center block.
where is the kernel size used in depth-wise convolution. The padding in Eq. (15) is the ending feature vectors from center block .
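A minimal single-channel sketch of the right-context padding is given below. Several details are assumptions made for illustration and are not stated in the text: the left padding length is taken to be `(k - 1) // 2` frames from the end of the center block, the right side is zero-padded, and the kernel size is odd. The helper names are hypothetical.

```python
def conv1d_valid(frames, kernel):
    """Unpadded 1-D convolution (cross-correlation) for one channel."""
    k = len(kernel)
    return [
        sum(kernel[j] * frames[t + j] for j in range(k))
        for t in range(len(frames) - k + 1)
    ]


def conv_right_context(center, right, kernel):
    """Convolve one right-context block, padding it on the left with the
    ending frames of its center block (assumed length: (k - 1) // 2) and on
    the right with zeros (an assumption)."""
    k = len(kernel)
    assert k % 2 == 1, "this sketch assumes an odd kernel size"
    half = (k - 1) // 2
    assert len(center) >= half, "center block too short to supply the padding"
    left_pad = center[len(center) - half:]
    padded = left_pad + list(right) + [0.0] * half
    # Output has the same length as the right-context block.
    return conv1d_valid(padded, kernel)


# Example: kernel size 3, center block of 4 frames, 1 lookahead frame.
print(conv_right_context([1.0, 2.0, 3.0, 4.0], [5.0], [1.0, 1.0, 1.0]))
```

In a depth-wise convolution each channel is filtered independently, so the same procedure would be repeated per channel with that channel's kernel.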
2.2 Talking-heads Attention
Self-attention forms the foundation of transformers. Assume a set of tokens that is packed into a matrix form and consider and its corresponding keys and queries, respectively. Here denotes the length of tokens and is the dimension of each token. Self-attention aggregates information across different tokens and transforms as follows,
Multi-head attention assembles multiple standard self-attention blocks for better representation learning,
where denotes the number of heads and , and represent the queries, keys and values from different heads, respectively.
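For reference, the standard scaled dot-product and multi-head formulations from the transformer literature can be written as below. The symbol names ($X$, $W_Q$, $W_K$, $W_V$, $d$, $H$) are assumptions chosen for illustration and may not match this paper's own notation; within each head, the scaling would use the per-head dimension $d / H$ in place of $d$.

```latex
\begin{align}
Q &= X W_Q, \qquad K = X W_K, \qquad V = X W_V, \\
\mathrm{Attn}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V, \\
\mathrm{MultiHead}(X) &= \bigl[\mathrm{Attn}(Q_1, K_1, V_1), \ldots, \mathrm{Attn}(Q_H, K_H, V_H)\bigr].
\end{align}
```

Here the softmax is taken over the key dimension, and the bracket denotes concatenation of the per-head outputs along the feature dimension.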
One potential drawback of multi-head attention is that different heads are trained independently without coordination. Talking-heads attention [14] improves on multi-head attention by allowing information fusion among different attention heads. Assume the attention weights learned by different heads in multi-head attention (). Talking-heads attention introduces two additional linear layers, immediately before and after the softmax, and computes the new self-attention weights as follows,
Here and are trainable parameters, and is applied on the second dimension. In practice, the two linear projections introduced by talking-heads attention are computationally efficient as the number of heads used is often small.
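The two head-mixing projections can be sketched in plain Python as follows. This is an illustration under assumptions, not the authors' code: the attention logits are stored as a list of per-head matrices, the softmax is taken over the key axis, and the names `mix_heads`, `proj_logits`, and `proj_weights` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of attention logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def mix_heads(per_head, proj):
    """Mix matrices across the head axis: out[g] = sum_h proj[g][h] * per_head[h]."""
    rows, cols = len(per_head[0]), len(per_head[0][0])
    return [
        [[sum(proj[g][h] * per_head[h][i][j] for h in range(len(per_head)))
          for j in range(cols)]
         for i in range(rows)]
        for g in range(len(proj))
    ]


def talking_heads_weights(logits, proj_logits, proj_weights):
    """logits: H matrices of shape (queries x keys), one per head.
    proj_logits mixes heads before the softmax, proj_weights after it."""
    mixed = mix_heads(logits, proj_logits)
    probs = [[softmax(row) for row in head] for head in mixed]
    return mix_heads(probs, proj_weights)


# With identity projections the result is ordinary per-head softmax attention.
identity = [[1.0, 0.0], [0.0, 1.0]]
logits = [[[0.0, 0.0], [1.0, 1.0]], [[0.0, math.log(3.0)], [2.0, 2.0]]]
print(talking_heads_weights(logits, identity, identity))
```

Each projection is only an H x H matrix applied across heads, which is why the added cost stays small when the number of heads is small.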
2.3 Context Compression
The augmented memory is designed to introduce long-form information into the attention. As shown in Eq. (6-7), this information is introduced via the queries of the previous segments in the previous layer. Taking the memory from the previous layer rather than from the same layer removes the auto-regressive dependency between segments, which would otherwise make block processing in training inefficient. However, one potential issue of this design is the representation mismatch between successive layers. In the attention operation, the augmented memory slots from the previous layer and the frames from the current layer are treated equally as keys and values, which relies on the two layers having similar representations. Otherwise, the long-form information introduced can be misleading.
To address the potential mismatch between memory slots and frames, we put forward the context compression strategy in this paper. The context compression directly introduces compact memory to the key and value in the attention, not to the query. It is formalized as follows,
where the operation stands for a function that compresses a segment into a single vector, e.g., linear interpolation or average pooling; this work uses linear interpolation. In contrast to Eq. (6), an offset term is introduced in Eq. (19) to prevent overlap between the short-form left context and the long-form compressed context. For instance, on a model with a segment size of 4 and a left context of 8, we set an offset of 2 to skip the interval covered by the left context. According to Eq. (18), the context compression operates on the input of each layer, which avoids auto-regression between successive segments; thus the whole sequence can be trained in parallel, taking full advantage of GPU computing resources.
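The compression and offset logic can be sketched as follows. The details are assumptions for illustration: linear interpolation to a single point is taken to sample the segment at its center position, so an even-length segment yields the mean of its two middle frames, and the function names are hypothetical.

```python
import math


def compress_segment(segment, mode="linear"):
    """Compress a segment (list of frame vectors) into a single vector."""
    n, dim = len(segment), len(segment[0])
    if mode == "average":
        return [sum(frame[d] for frame in segment) / n for d in range(dim)]
    # Linear interpolation down to one point, sampled at the segment center
    # (assumed convention).
    pos = (n - 1) / 2
    lo, hi = math.floor(pos), math.ceil(pos)
    w = pos - lo
    return [(1 - w) * segment[lo][d] + w * segment[hi][d] for d in range(dim)]


def compressed_memory(segments, block_index, num_slots, offset):
    """Compressed vectors of the `num_slots` segments that end `offset`
    segments before the current block, i.e. just before the interval already
    covered by the regular left context."""
    end = max(block_index - offset, 0)
    start = max(end - num_slots, 0)
    return [compress_segment(seg) for seg in segments[start:end]]


# Segment size 4, left context 8 frames (2 segments), so offset = 2.
segments = [[[float(4 * s + t)] for t in range(4)] for s in range(6)]
print(compressed_memory(segments, block_index=5, num_slots=2, offset=2))
```

For block 5, segments 3 and 4 are already visible as regular left context, so the compressed memory is drawn from segments 1 and 2 only. Since the compression depends only on the layer input, it can be computed for all blocks at once.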
3 Experiments
3.1 Data
Our training data is a large-scale speech recognition dataset covering two scenarios: assistant and open-domain dictation. The assistant scenario consists of three parts. The first is 13K hours of recordings collected from third-party vendors via crowd-sourced volunteers responding to artificial prompts on mobile devices; the content varies from voice assistant commands to simulated conversations between people. The second is 1.3K hours of voice commands from production. The last is 4K hours of speech for calling names and phone numbers generated by an in-house TTS model. The open-domain dictation scenario has 18K hours of human-transcribed data from videos and 2M hours of unlabeled videos transcribed by a high-quality in-house model. The data was augmented with various distortion methods: speed perturbation [7], simulated reverberation, SpecAugment [11], and randomly sampled additive background noise extracted from videos.
In evaluation, we use the assi, call, and dict datasets. The assi and call sets contain 13.6K manually transcribed utterances from in-house volunteer employees, and each utterance starts with a wake word. The dict set is 8 hours of open-domain dictation from crowd-sourced workers recorded via mobile devices.
3.2 Experiment Setting
The input features are 80-dim log Mel filter bank features at a 10ms frame rate. The network's input is a 640-dim superframe consisting of 8 consecutive frames, which corresponds to a downsampling factor of 8 and an 80ms frame rate. This paper explores models with 32M parameters and 73M parameters. In the 32M parameter baseline model, a projection layer maps the superframe to a 320-dim vector. The encoder consists of 21 Emformer layers. Each layer uses four heads for self-attention, and its FFN-block dimension is 1280. The predictor consists of a 256-dim embedding layer with 4096 sentence pieces [8], 1 LSTM layer with 512 hidden nodes, and a linear projection layer with 1024 output nodes. The 32M parameter baseline uses a left context of 640ms (10 slots). For the block size and right context, two settings are investigated: a block size of 320ms (4 slots) with a right context of 80ms (1 slot), and a block size of 400ms (5 slots) with no right context (0 slots). In the 73M parameter model, the superframe is mapped to a 512-dim vector. The encoder has 20 Emformer layers. Each layer has an 8-head self-attention and a 2048-dim FFN block. Its predictor has the same layer configuration as the 32M baseline, except that it has 3 LSTM layers. The left context is set to 2.4s, i.e., 30 slots. In training, SpecAugment [11] without time warping and a dropout of 0.1 are used on the 73M parameter model. The 32M parameter models underfit the large amount of training data, and their best performance is obtained by using neither scheme.
For our proposed models, we first investigate the non-causal convolution. A kernel size of seven is used for the depth-wise convolution operations. In the 32M parameter model, the superframe is projected to a 256-dim vector; the encoder consists of 18 layers, each containing a 4-head self-attention and a 1024-dim FFN block. In the 73M parameter model, the superframe is projected to a 384-dim vector; the encoder consists of 20 layers, each containing an 8-head self-attention and a 1456-dim FFN block. Other settings are the same as the baselines. The block size and right context are fixed at 320ms and 80ms, respectively. For the context compression scheme, we use a regular left context of 8 slots, i.e., 640ms. The compressed left context is set to 2 slots, which also covers 640ms; it uses an offset of 2 in Eq. (19) to skip the 640ms interval covered by the regular left context. In total, ten slots are used, but they cover a history of 1280ms.
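The slot arithmetic for this configuration can be checked with a short calculation. The constant and function names are hypothetical; the values (10ms frames, 8-frame superframes, 4-slot blocks, 8 regular and 2 compressed slots) are taken from the setup described here.

```python
FRAME_MS = 10          # frame rate of the log Mel features
FRAMES_PER_SLOT = 8    # frames stacked into one superframe (slot)
FEATURE_DIM = 80       # log Mel filter bank dimension

SLOT_MS = FRAME_MS * FRAMES_PER_SLOT             # duration covered by one slot
SUPERFRAME_DIM = FEATURE_DIM * FRAMES_PER_SLOT   # input dimension of the network


def history_ms(left_slots, compressed_slots, block_slots):
    """Return (regular, compressed, total) history in milliseconds.

    Each regular slot covers one superframe; each compressed slot
    summarizes one whole block of `block_slots` superframes.
    """
    regular = left_slots * SLOT_MS
    compressed = compressed_slots * block_slots * SLOT_MS
    return regular, compressed, regular + compressed


print(SLOT_MS, SUPERFRAME_DIM)                                        # 80 640
print(history_ms(left_slots=8, compressed_slots=2, block_slots=4))    # (640, 640, 1280)
```

The ten slots (8 regular plus 2 compressed) therefore span 1280ms of history, since each compressed slot stands in for a full 320ms block.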
In all the experiments, alignment-restricted RNN-T [10] is used. All models are trained on 32 Nvidia V100 GPUs. We evaluate accuracy by word error rate (WER) and latency by real-time factor (RTF) and speech engine perceived latency (SPL). SPL measures the time from when the speech engine receives the last word of the user's utterance to when it transcribes that word and receives the endpoint signal.
3.3 Improvement from Non-causal Convolution
Table 1 gives the WER, RTF, and SPL results for models with 32M and 73M parameters. The results show that, when the overall context size (the sum of block size and lookahead context size) is kept the same, using lookahead context improves WER over not using it, especially in the open-domain dictation scenario. We also observe that the convolution and macaron structure improve on the baseline with the same context configuration. Table 1 also shows that directly applying causal convolution with a 400ms block size does not improve on the baseline model, which uses a 320ms block size with 80ms lookahead context, for either the 32M or the 73M model.
Using lookahead context adds computation to the encoder of the transducer model, as the forward logic duplicates computation for the lookahead context. For the 73M model, using a right context of 80ms shows a relative RTF increase. However, lookahead context provides more accurate ASR results and yields slightly better speech engine perceived latency (SPL).
3.4 Improvement from Talking-heads Attention and Context Compression
Table 2 shows the impact of applying talking-heads attention and context compression on top of Emformer with non-causal convolutions. For the model with 32M parameters, talking-heads attention yields , , and relative WER reductions on open-domain dictation, assistant general queries, and assistant calling queries, respectively. Using two slots of context compression outperforms the model with only regular left context. Combining non-causal convolution, talking-heads attention, and context compression in the 32M model reduces the WER by 5.1%, 14.5%, and 8.4% relative on the open-domain dictation, assistant general, and assistant calling test sets, while maintaining SPL and RTF similar to those of the Emformer baseline. For the model with 73M parameters, talking-heads attention and context compression obtain WER on par with the Emformer with non-causal convolution. Note that the 73M parameter baseline uses 30 slots of left context, while context compression uses 8 slots of left context and 2 slots of memory, which slightly improves the RTF and SPL. However, the 73M model already has much stronger model capacity than the 32M model, so the lightweight optimizations of talking-heads attention and context compression do not yield an obvious improvement.
| Model | dict WER | assi WER | call WER | SPL | RTF |
|---|---|---|---|---|---|
| 73M Emformer (baseline) | 15.49 | 3.98 | 5.81 | 599 | 0.30 |
| + Talk heads | 14.69 | 3.64 | 5.79 | 574 | 0.30 |
| + Context Compression | 14.69 | 3.66 | 5.77 | 554 | 0.29 |
| 32M Emformer (baseline) | 17.09 | 5.05 | 6.68 | 588 | 0.22 |
| + Talk heads | 16.25 | 4.48 | 6.35 | 620 | 0.23 |
| + Context Compression | 16.22 | 4.32 | 6.12 | 589 | 0.24 |
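The relative WER reductions quoted for the 32M model can be reproduced from the table rows with a short calculation. The function name is hypothetical, and the column order (dictation, assistant general, assistant calling) is inferred from the fact that it reproduces the reported 5.1%, 14.5%, and 8.4%.

```python
def relative_reduction(baseline, improved):
    """Relative reduction in percent, rounded to one decimal place."""
    return round(100.0 * (baseline - improved) / baseline, 1)


# 32M Emformer baseline vs. the final "+ Context Compression" row.
baseline_32m = {"dict": 17.09, "assi": 5.05, "call": 6.68}
proposed_32m = {"dict": 16.22, "assi": 4.32, "call": 6.12}

for name in baseline_32m:
    print(name, relative_reduction(baseline_32m[name], proposed_32m[name]))
```

The same function applied to the 73M rows gives the corresponding reductions for the larger model.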
4 Conclusions
In this work, we proposed using non-causal convolution, talking-heads attention, and context compression to improve the streaming transformer transducer for speech recognition. We apply non-causal convolution with lookahead context in the streaming transformer by separating the forward logic for the center block and the lookahead context. The talking-heads attention coordinates the training of different heads in self-attention. The context compression keeps the representations of the long-form and short-form history similar, providing a compact way of introducing long-form information. The experiments on 32M parameter and 73M parameter models show that the proposed model outperforms the Emformer baseline on open-domain dictation, assistant general, and assistant calling scenarios while maintaining comparable RTF and latency.
References
[1] (2021) Developing Real-Time Streaming Transformer Transducer for Speech Recognition on Large-Scale Dataset. In Proc. ICASSP, pp. 5904–5908.
[2] (2018) Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition. In Proc. ICASSP.
[3] Sequence Transduction with Recurrent Neural Networks. arXiv preprint arXiv:1211.3711.
[4] (2020) Conformer: Convolution-augmented transformer for speech recognition. In Proc. INTERSPEECH.
[5] (2019) Streaming End-to-end Speech Recognition for Mobile Devices. In Proc. ICASSP.
[6] (2019) A Comparative Study on Transformer vs RNN in Speech Applications. arXiv preprint arXiv:1909.06317.
[7] (2015) Audio augmentation for speech recognition. In Proc. INTERSPEECH.
[8] (2018) SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proc. EMNLP.
[9] (2019) Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View. arXiv preprint arXiv:1906.02762.
[10] Alignment restricted streaming recurrent neural network transducer. In Proc. SLT.
[11] (2019) Specaugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779.
[12] (2018) A time-restricted self-attention layer for asr. In Proc. ICASSP.
[13] (2019) Compressive Transformers for Long-Range Sequence Modelling. arXiv preprint arXiv:1911.05507.
[14] (2020) Talking-Heads Attention. arXiv preprint arXiv:2003.02436.
[15] (2021) Emformer: Efficient Memory Transformer Based Acoustic Model For Low Latency Streaming Speech Recognition. In Proc. ICASSP.
[16] (2018) Self-attentional acoustic models. arXiv preprint arXiv:1803.09519.
[17] (2017) Attention is all you need. In Proc. NIPS.
[18] (2020) Low Latency End-to-End Streaming Speech Recognition with a Scout Network. arXiv preprint arXiv:2003.10369.
[19] (2019) Transformer-Based Acoustic Modeling for Hybrid Speech Recognition. In Proc. ICASSP.
[20] (2020) Streaming Transformer-based Acoustic Modeling Using Self-attention with Augmented Memory. In Proc. INTERSPEECH.
[21] (2020) Streaming attention-based models with augmented memory for end-to-end speech recognition. In Proc. SLT.
[22] (2019) Transformer-Transducer: End-to-End Speech Recognition with Self-Attention. arXiv preprint arXiv:1910.12977.
[23] (2021) Fastemit: Low-Latency Streaming Asr With Sequence-Level Emission Regularization. In Proc. ICASSP, Vol. 53.
[24] (2020) Fast, Simpler and More Accurate Hybrid ASR Systems Using Wordpieces. In Proc. INTERSPEECH.
[25] (2020) Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss. In Proc. ICASSP.
[26] (2018) Syllable-based sequence-to-sequence speech recognition with the transformer in mandarin Chinese. arXiv preprint arXiv:1804.10752.