1 Introduction
Voice-assisted devices are now commonly equipped with multiple microphones for far-field speech recognition in noisy environments [15, 36]. By combining the spectral and spatial information of target and interference signals captured by different microphones, beamforming approaches [31, 41, 22, 21, 37, 28] have been shown to substantially improve the recognition accuracy of automatic speech recognition (ASR) systems [2, 21, 28]. The beamformer has thus become a standard module, typically placed before the ASR front-end and acoustic model. The delay-and-sum and super-directive beamformers [8, 18] are among the most popular beamforming methods for ASR; the latter is characterized both by its higher directivity and by its lack of robustness to imperfect microphone arrays [6].
With the great success of deep neural networks, neural beamformers have attracted significant interest and are becoming state-of-the-art components of end-to-end all-neural ASR systems [17, 11, 30, 4, 5, 23, 24, 42, 27, 25]. Neural beamforming methods are generally categorized into fixed beamforming (FBF) [23, 25] and adaptive beamforming (ABF) [17, 11, 30, 4, 5, 24, 27] methods, depending on whether the beamforming weights are fixed or vary with the input signals at inference time.
While neural beamforming approaches are attractive for their model capacity and their direct access to the downstream ASR loss when optimizing the beamforming weights, their performance is still hindered by stagewise training. For example, the neural mask estimators in ABF methods [17, 11] usually must be pre-trained on synthetic data, where the target speech and noise labels are well defined. The mismatch in these statistics between synthetic and real-world data, however, can cause noise to leak into the target speech statistics [10] and degrade fine-tuning with the cascaded acoustic models.
To bypass stagewise optimization and leverage the core ability of transformer networks [35], i.e., attention over multiple modalities, a single integrated multi-channel transformer network was proposed [3], with both channel-wise and cross-channel attention layers for joint beamforming and acoustic modeling. Despite its effectiveness, this model is hard to apply to streaming cases such as on-device speech recognition [16], which demands low latency and low computation. First, it relies on an attention mechanism (encoder-decoder attention) over the full encoder outputs to learn alignments between input and output sequences [1]. Second, the input audio is encoded bidirectionally, which requires a full utterance as input. Third, the cost of the attention computation grows quadratically with the length of the input sequence. Finally, the model size of the multi-channel transformer grows with the number of microphones and the number of time frames [3], because affine transformations are used to aggregate multi-channel embeddings in the cross-channel attention layers. For these reasons, it is unsuitable for on-device ASR systems with limited memory.
There are many streamable approaches to alignment learning, such as connectionist temporal classification (CTC) [12], the transducer [13], monotonic chunkwise attention (MoChA) [7], and triggered attention [29], all of which can be integrated with the transformer [35, 9, 26, 38, 14]. In this work, we focus on the transducer because of its outstanding performance over traditional hybrid models for streaming speech recognition [16, 32]. Several research efforts have combined the transformer with the transducer for single-channel speech recognition [34, 43, 44, 19], but to the best of our knowledge, this is the first time the transducer has been integrated with a multi-channel transformer.
In addition to achieving streamable alignment learning, we make the encoders streamable by limiting the future context (right context) and previous context (left context) in both channel-wise and cross-channel attention computations for multi-channel audio encoding, and by constraining the previous context in self-attention for output sequence embedding. For cross-channel attention, we also propose two simple combiners, the average and the concatenation of multiple channels, to create keys and values. In this way, the model size does not grow as the number of microphones or the input sequence length increases.
On a far-field in-house dataset, we show that the proposed multi-channel transformer transducer outperforms single channel and stagewise neural beamformers cascaded with transformer transducers by and WERR respectively. Moreover, our model performs better than the multi-channel transformer [3] up to WERR and is times faster in terms of inference speed (TP50). Finally, we reduce the computational cost of both the multi-channel audio encoder and the label encoder for the streaming case by limiting both the left and right context in attention computations. The performance gap between the causal-attention and full-attention versions of our model can be bridged by attending to a limited number of future frames.
2 Multi-Channel Transformer Transducer
We consider multi-channel audio sequences in which every channel has the same number of frames, together with a transcription label sequence whose elements are drawn from a predefined set of token labels. As depicted in Fig. 1 (a), the transducer model first encodes the acoustic sequences with a multi-channel audio encoder network (Fig. 1 (b)) to produce a sequence of encoder output states. For each encoder state, the model predicts either a label or a blank symbol with a joint network. If the model predicts a blank symbol, which indicates that there is no token label for that time step, the model proceeds to the next encoder state. Unlike CTC [12], the transducer model uses not only the encoder output at the current time step but also the history of previous non-blank labels as inputs to predict the next output. The previously predicted labels are encoded with a label encoder, as shown in Fig. 1 (c).
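The label-or-blank loop described above can be sketched as greedy decoding. This is a minimal illustration under stated assumptions, not the authors' implementation: `BLANK`, `greedy_transducer_decode`, `joint_argmax`, `max_labels_per_frame`, and `toy_joint` are hypothetical names, and the joint network is replaced by a toy callable.

```python
from typing import Callable, List, Sequence

BLANK = 0  # hypothetical id reserved for the blank symbol


def greedy_transducer_decode(
    encoder_states: Sequence[int],
    joint_argmax: Callable[[int, List[int]], int],
    max_labels_per_frame: int = 5,
) -> List[int]:
    """Greedy transducer decoding over a sequence of encoder states."""
    history: List[int] = []  # previously predicted non-blank labels
    for state in encoder_states:
        # Stay on the same encoder state while labels are emitted;
        # the cap is an assumed safeguard against endless emission.
        for _ in range(max_labels_per_frame):
            token = joint_argmax(state, history)
            if token == BLANK:
                break  # blank: move on to the next encoder state
            history.append(token)
    return history


def toy_joint(state: int, history: List[int]) -> int:
    """Toy stand-in for the joint network: emit the frame's token once."""
    if state != BLANK and (not history or history[-1] != state):
        return state
    return BLANK


decoded = greedy_transducer_decode([0, 3, 3, 0, 5], toy_joint)
print(decoded)
```

In a real system `joint_argmax` would combine the audio encoder state with the label encoder output for `history`; here the toy function only demonstrates the control flow.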
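The transducer model defines a conditional distribution of the label sequence given the multi-channel input by summing the probabilities of all possible alignment paths. An alignment path is a sequence of blank symbols and labels that yields the label sequence once all blank symbols are removed, and the label history begins with the start-of-sentence symbol.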
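For reference, the standard transducer factorization from the literature [13] can be written as follows. The notation is an assumption made for this sketch and may differ from the authors' symbols: $\mathbf{X}$ is the multi-channel input, $\mathbf{y}$ the label sequence of length $U$, $T$ the number of encoder states, $\mathbf{a}$ an alignment path, $\mathcal{A}(\mathbf{X}, \mathbf{y})$ the set of paths that reduce to $\mathbf{y}$ after removing blanks, $h_{t_i}$ the encoder state used at step $i$, $u_{i-1}$ the number of non-blank labels emitted before step $i$, and $y_0$ the start-of-sentence symbol.

```latex
P(\mathbf{y} \mid \mathbf{X})
  = \sum_{\mathbf{a} \in \mathcal{A}(\mathbf{X}, \mathbf{y})}
    \prod_{i=1}^{T+U} P\left(a_i \mid h_{t_i},\, y_{0:u_{i-1}}\right)
```

Each path has length $T+U$ because it contains one blank per encoder state and one entry per label.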
2.2 Multi-Channel Audio Encoder
Previous work on the transducer framework [13, 34, 43, 44, 19] relied only on single-channel input. To handle multi-channel inputs, we build our audio encoder on the multi-channel transformer network [3]. As shown in Fig. 1 (b), it contains two main blocks: channel-wise self-attention layers and cross-channel attention layers.
Channel-wise Self-Attention Layer (CSA): We start by projecting the source channel features (log-STFT magnitude and phase features are used in this work) to a dense embedding space for more discriminative representations. The embedded features plus the positional encoding [35] are then fed into a set of learnable weight parameters to create the query (Q), key (K), and value (V). Similar to [3], the transformed features Q and K are used to compute the correlation across time steps within a channel via multi-head attention (MHA) [35]. The resulting attention matrix is then used to reweight the features of V at each time step, followed by a feed-forward network that produces the self-attention outputs.
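The correlation-and-reweighting step within a single channel can be sketched as scaled dot-product attention. This is a simplified single-head illustration in plain Python: the learned projections, multiple heads, positional encoding, and feed-forward network are omitted, and the names `softmax` and `self_attention` are chosen here for illustration.

```python
import math
from typing import List

Matrix = List[List[float]]  # rows are time steps, columns are embedding dims


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Single-head scaled dot-product attention within one channel."""
    dim = len(q[0])
    outputs: Matrix = []
    for q_t in q:
        # Correlation of this time step with every time step of the channel.
        scores = [
            sum(a * b for a, b in zip(q_t, k_s)) / math.sqrt(dim) for k_s in k
        ]
        weights = softmax(scores)
        # Reweight the value features with the attention weights.
        outputs.append(
            [
                sum(w * v_s[j] for w, v_s in zip(weights, v))
                for j in range(len(v[0]))
            ]
        )
    return outputs


q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 1.0], [1.0, 1.0]]
v = [[0.0, 2.0], [4.0, 6.0]]
attended = self_attention(q, k, v)
```

With identical key rows, every query attends uniformly, so each output row is the mean of the value rows.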
Cross-Channel Attention Layer (CCA): Given the self-attended outputs per channel, the cross-channel attention layers aim to learn the contextual relationship across channels, both within and across time steps. Inspired by [3], when one channel is used to create Q, the other channels are passed through a combiner to create K and V. Unlike [3], which takes the sum of the channel encodings after applying affine transformations (Affine), we investigate two simple combiners: (1) Avg: average the other channels at each position along the time and embedding axes, which can be seen as the symmetric-weight case of the Affine combiner in [3]; (2) Concat: concatenate the other channels along the time axis. With this adaptation, the number of model parameters does not grow with the number of microphones or time frames as it does in [3]. Finally, the cross-channel attention outputs are fused by a simple average.
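The two combiners can be sketched as follows to make the resulting shapes explicit. This is an illustration under assumptions: each channel is represented as a list of frames (time steps) of embedding vectors, Avg is read as an element-wise average across the other channels, and the function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]  # frames x embedding dims for one channel


def other_channels(channels: List[Matrix], index: int) -> List[Matrix]:
    """All channels except the one used to create the query."""
    return channels[:index] + channels[index + 1:]


def avg_combiner(channels: List[Matrix]) -> Matrix:
    """Element-wise average across channels; the shape stays frames x dims."""
    count = len(channels)
    frames, dims = len(channels[0]), len(channels[0][0])
    return [
        [sum(ch[t][j] for ch in channels) / count for j in range(dims)]
        for t in range(frames)
    ]


def concat_combiner(channels: List[Matrix]) -> Matrix:
    """Concatenate channels along the time axis; frames add up, dims stay."""
    return [frame for ch in channels for frame in ch]


channels = [
    [[9.0, 9.0], [9.0, 9.0]],  # channel 0 creates the query
    [[1.0, 2.0], [3.0, 4.0]],
    [[3.0, 4.0], [5.0, 6.0]],
]
others = other_channels(channels, 0)
avg_kv = avg_combiner(others)        # 2 frames x 2 dims
concat_kv = concat_combiner(others)  # 4 frames x 2 dims
```

Neither combiner has learnable parameters, which is why the parameter count is independent of the number of channels and frames in this sketch.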
2.3 Label Encoder and Joint Network
We use the transformer network to build the label encoder, as illustrated in Fig. 1 (c). An embedding layer converts the previously predicted non-blank labels into vector representations. Several linear layers then project the embedding vectors to create Q, K, and V, followed by masked MHA computations. The attention scores from future frames are always masked out to ensure causality. Note that the label encoder outputs do not attend to the multi-channel audio encoder outputs, in contrast to the architecture in [3]. As discussed in Sec. 1, doing so poses a challenge for streaming applications. Instead, we use a joint network, which is a fully-connected feed-forward neural network with a single hidden layer and a nonlinear activation function. We concatenate the outputs of the multi-channel audio encoder and the label encoder as inputs to the joint network.
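The joint network described above can be sketched as a concatenation followed by one hidden layer and an output layer. This is a minimal illustration with loudly stated assumptions: the hidden activation is taken to be tanh purely for the example, the output softmax over labels plus blank is an assumption, and all names and weight shapes are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Vector, weights: Matrix, bias: Vector) -> Vector:
    """Affine map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(row: Vector) -> Vector:
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def joint_network(
    audio_state: Vector,
    label_state: Vector,
    w_hidden: Matrix,
    b_hidden: Vector,
    w_out: Matrix,
    b_out: Vector,
) -> Vector:
    """Concatenate encoder outputs, apply one hidden layer, return probabilities."""
    joint_input = list(audio_state) + list(label_state)  # concatenation
    # tanh is an assumed activation for this sketch.
    hidden = [math.tanh(h) for h in linear(joint_input, w_hidden, b_hidden)]
    return softmax(linear(hidden, w_out, b_out))  # labels plus blank


# Toy sizes: 2-dim audio state, 2-dim label state, 3 hidden units, 5 outputs.
w_hidden = [[0.0] * 4 for _ in range(3)]
w_out = [[0.0] * 3 for _ in range(5)]
probs = joint_network([0.5, -0.5], [1.0, 2.0], w_hidden, [0.0] * 3, w_out, [0.0] * 5)
```

With all-zero weights the output is uniform over the five toy outputs, which is a convenient sanity check for the wiring.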
2.4 Limiting History and Future Contexts in Attention
Attending to the whole input acoustic sequence in attention computations (i.e., full attention) not only prevents streaming inference but also incurs high computational complexity when computing encoder outputs, since the attention cost grows quadratically with the sequence length. To reduce the computational cost and latency, we limit the number of left (history) frames and right (future) frames that the multi-channel encoder attends to when computing each encoder output. We also limit the number of left history frames that the label encoder attends to. However, this can come with a drop in performance, as investigated in the experiments.
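Limiting the left and right context amounts to masking the attention matrix. The following is a minimal sketch, assuming a boolean mask where `True` means "may attend"; the function name and the use of `None` for unlimited context ("inf") are conventions chosen here.

```python
from typing import List, Optional


def context_mask(
    num_frames: int,
    left: Optional[int] = None,
    right: Optional[int] = None,
) -> List[List[bool]]:
    """mask[t][s] is True if frame t may attend to frame s.

    left / right bound the number of history / future frames;
    None means the context is unlimited on that side.
    """
    mask = []
    for t in range(num_frames):
        low = 0 if left is None else max(0, t - left)
        high = num_frames - 1 if right is None else min(num_frames - 1, t + right)
        mask.append([low <= s <= high for s in range(num_frames)])
    return mask


causal = context_mask(4, left=1, right=0)      # no look-ahead, 1 history frame
lookahead = context_mask(5, left=2, right=1)   # 1 future frame of latency
full = context_mask(3)                         # full attention
```

When both sides are bounded, each row allows at most `left + right + 1` frames, so the per-frame attention cost no longer depends on the utterance length. A mask with `right=0` corresponds to the left-only (causal) setting.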
3 Experiments
3.1 Dataset
To evaluate our multi-channel transformer transducer (MCTT), we conduct a series of ASR experiments using over 2,200 hours of speech utterances from our in-house de-identified far-field dataset. The training set, the validation set (used for model hyper-parameter selection), and the test set contain 2,000 hours, 24 hours, and 233 hours respectively. The device-directed speech data was captured using a smart speaker with 7 microphones and a 63 mm aperture. The evaluation set has abundant annotations, including the estimated SNR levels and test-clean (no background speech) as well as test-other (with background speech) splits. In this dataset, 2 microphone signals of aperture distance and the super-directive beamformed signal by [8] using 7 microphone signals are employed through all the experiments.
3.2 Baselines
Following prior work, one of the baselines is single channel + Transformer Transducer (SC-TT): we feed each of the two raw channels individually into the transformer transducer for training and testing, and pick the best-performing one. In addition, we compare to three stagewise beamforming methods cascaded with transformer transducer (TT) models. The beamforming methods are the super-directive beamformer (SDBF) [8], the neural beamformer (NBF) [23], and the neural mask-based beamformer (NMBF) [17]. We denote the stagewise methods as SDBF-TT, NBF-TT, and NMBF-TT respectively. Note that SDBF-TT uses 7 microphone signals for beamforming, as mentioned in Sec. 3.1, while NBF-TT, NMBF-TT, and the proposed MCTT all take only 2 microphone signals as inputs. We also compare our method to the multi-channel transformer network (MCT) [3], which is a single integrated multi-channel model.
3.3 Experimental Setup and Evaluation Metric
We use 12 audio encoder layers and 6 label encoder layers for SC-TT, SDBF-TT, NBF-TT, and NMBF-TT (4 label encoder layers for MCT and MCTT), with 512 neurons, so that all models have a comparable number of parameters (18 million), except for NMBF-TT (25.39 million) because of its additional mask estimator. Following prior work, we use log-STFT squared magnitude and phase features [40, 39] as inputs to our method, extracted every 10 ms with a window size of 25 ms from the audio samples. The same settings are applied to feature extraction for the baselines. The Adam optimizer [20] and a subword tokenizer [33] with tokens are used. The results of all experiments are reported as relative word error rate reduction (WERR). The higher the WERR, the better.
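As a reminder of how the metric behaves, relative word error rate reduction is commonly computed against a baseline WER as below. The formula is the conventional definition and an assumption about the exact computation used here; the function name and the numbers in the example are hypothetical and are not results from this work.

```python
def werr(baseline_wer: float, new_wer: float) -> float:
    """Relative word error rate reduction in percent.

    Positive values mean the new system makes fewer errors than the baseline.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs, for illustration only.
improvement = werr(baseline_wer=10.0, new_wer=9.0)
regression = werr(baseline_wer=10.0, new_wer=11.0)
```

A negative value indicates a degradation relative to the baseline, which is how reduced-context models can show negative WERR against a full-attention reference.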
3.4 Comparisons to Stagewise Multi-channel Models
We first compare the performance of MCTT with 2 channels and the Avg combiner (MCTT-2) to the stagewise beamforming plus transformer transducer models, all with a full-attention audio encoder. The results are illustrated in Fig. 2. As shown in Fig. 2 (a), MCTT-2 outperforms SC-TT by 7.1% and the neural beamformer + acoustic models (NBF-TT and NMBF-TT) by 6% on average. MCTT-2 also performs better than SDBF-TT by 2.48%, even though it only considers 2 raw channels (2 chs). We further investigate whether the super-directive beamformed signal is complementary to the other 2 channels by taking it as a third channel and feeding all three to MCTT (denoted as MCTT-3). As can be seen in Fig. 2 (a), it provides 4% more improvement (WERR) on average over all baselines compared to MCTT-2. In Fig. 2 (b), we further compare the methods at different SNR levels. Again, we observe that MCTT-2 and MCTT-3 achieve consistent improvements over SC-TT, compared to the other methods, across different SNRs.
3.5 Comparisons to Multi-channel Transformer
Next, we compare the proposed MCTT to MCT [3] with 2 channels and with 3 channels (2 raw channels plus the super-directive beamformed signal) as inputs, using different combiners. The models are denoted as MCT-2, MCT-3, MCTT-2, and MCTT-3 respectively. Note that the combiner introduced in Sec. 2.2 is not needed in the 2-channel case, so its effect is only reported for the 3-channel case. We observe in Table 1 that MCTT-2 outperforms MCT-2, especially on the test-clean split. Both MCT-3 and MCTT-3 with the Avg combiner perform better than MCT-3 with the Affine combiner, and MCTT-3 performs best. In addition, the Avg combiner is more effective than the Concat combiner.
We further evaluate inference speed by measuring decoding time over 10,000 utterances on a machine with Intel Xeon® Platinum 8175M processors, using 1 CPU per method to process one utterance at a time with greedy search decoding. The top-percentile values, TP50 (median), TP90, and TP99, of the wall clock times (WCT) are shown in Table 2. Most of the inference time of MCT is spent on the encoder-decoder attention, while MCTT does not have this issue and achieves times faster inference speed in terms of TP50.
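Percentile latencies of this kind can be computed from a list of per-utterance wall clock times. The sketch below uses the nearest-rank definition, which is an assumption (the exact percentile definition used for Table 2 is not stated), and the function name and timings are hypothetical.

```python
from typing import List, Sequence


def top_percentile(times: Sequence[float], percent: int) -> float:
    """Nearest-rank percentile: smallest value covering `percent` of samples."""
    if not times:
        raise ValueError("no timings given")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    ordered = sorted(times)
    # Integer ceiling of percent * n / 100 avoids floating-point rounding.
    rank = max(1, -(-percent * len(ordered) // 100))
    return ordered[rank - 1]


# Hypothetical wall clock times in milliseconds, for illustration only.
timings: List[float] = [float(ms) for ms in range(1, 101)]
tp50 = top_percentile(timings, 50)
tp90 = top_percentile(timings, 90)
tp99 = top_percentile(timings, 99)
```

TP50 summarizes the typical utterance, while TP90 and TP99 expose the slow tail, which matters for on-device latency budgets.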
Tables 3, 4, and 5 share the column headers: | MC Audio Mask | Label Mask | WERR (%) |
3.6 Results of Limiting Contexts in Attention Computation
Finally, we ran training and decoding experiments using MCTT with limited attention windows over audio and text labels, with a view to building streaming multi-channel (MC) speech recognition systems with low latency and low computation cost. In Tables 3, 4, and 5, "inf" means that all of the left or right context is used. MC Audio Mask and Label Mask indicate the range of audio and label frames considered in the attention of the multi-channel audio encoder and the label encoder respectively.
We start by evaluating how the left context of the label encoder affects performance. In Table 3, we show that constraining each layer to use only 4 previous label frames yields accuracy similar to that of the model using all previous frames per layer ( WERR in average when MC audio mask R=inf). When the right context of the MC audio encoder is constrained to 10, the WERR differences are also small; the maximum WERR difference is (-%-(-)%) when compared to using all previous frames per layer. This indicates that a very limited left context for the label encoder is sufficient for MCTT.
We then fix the left context of the label encoder to 20 and constrain the MC audio encoder to attend only to frames to the left of the current frame (so that no latency is introduced). As shown in Table 4, the WERs drastically degrade by and in test-clean and test-other splits compared to MCTT with a full-attention MC audio encoder. By allowing the model to see some future frames (e.g. ), we can bring down the WER degradation to for both splits.
Table 5 reports the results when limiting both the left and right contexts of the MC audio encoder. Doing so not only reduces the latency but also makes the time complexity of one-step inference constant. We limit the left context of the MC audio encoder to 20 and 10 respectively, and then increase the right context from 0 to 20. As can be seen in both cases, with a look-ahead of a few future frames (e.g. ), the WER gap to the model based on the full-attention audio encoder was narrowed down to and respectively in test-other split.
4 Conclusion
We propose a novel speech recognition model, the Multi-Channel Transformer Transducer, which is capable of leveraging multi-channel inputs in an end-to-end fashion and is applicable to streaming decoding for speech recognition. We show that the proposed MCTT outperforms its stagewise counterparts and significantly reduces inference time compared to the multi-channel transformer [3]. Furthermore, by limiting the left context and adding a look-ahead of a few future frames, we can not only reduce the computation cost but also bridge the gap between the performance of left-only attention and full-attention models.
References
- [1] (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §1.
- [2] (2015) The third 'CHiME' speech separation and recognition challenge: dataset, task and baselines. In ASRU. Cited by: §1.
- [3] (2021) End-to-end multi-channel transformer for speech recognition. In ICASSP. Cited by: §1, §1, §2.2, §2.2, §2.2, §2.3, Table 1, §3.2, §3.3, §3.5, Table 2, §4.
- [4] (2019) MIMO-speech: end-to-end multi-channel multi-speaker speech recognition. In ASRU. Cited by: §1.
- [5] (2020) End-to-end multi-speaker speech recognition with transformer. In ICASSP. Cited by: §1.
- [6] (2021) On the robustness of the superdirective beamformer. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, pp. 838–849. Cited by: §1.
- [7] (2018) Monotonic chunkwise attention. In ICLR. Cited by: §1.
- [8] (2007) Superdirective beamforming robust against microphone mismatch. IEEE Transactions on Audio, Speech, and Language Processing 15 (2), pp. 617–631. Cited by: §1, §3.1, §3.2.
- [9] (2018) Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition. In ICASSP. Cited by: §1.
- [10] (2019) Unsupervised training of neural mask-based beamforming. arXiv preprint arXiv:1904.01578. Cited by: §1.
- [11] (2016) Improved MVDR beamforming using single-channel mask prediction networks. In Interspeech. Cited by: §1, §1.
- [12] Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376. Cited by: §1, §2.1.
- [13] (2012) Sequence transduction with recurrent neural networks. ICML workshop. Cited by: §1, §2.1, §2.2.
- [14] (2020) Conformer: convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100. Cited by: §1.
- [15] (2020) Far-field automatic speech recognition. Proceedings of the IEEE. Cited by: §1.
- [16] (2019) Streaming end-to-end speech recognition for mobile devices. In ICASSP. Cited by: §1, §1.
- [17] (2016) Neural network based spectral mask estimation for acoustic beamforming. In ICASSP. Cited by: §1, §1, §3.2, §3.3.
- [18] (2010) Clustered blind beamforming from ad-hoc microphone arrays. TASLP 19 (4), pp. 661–676. Cited by: §1.
- [19] (2020) Conv-transformer transducer: low latency, low frame rate, streamable end-to-end speech recognition. In Interspeech. Cited by: §1, §2.2.
- [20] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.3.
- [21] (2016) A summary of the REVERB challenge: state-of-the-art and remaining challenges in reverberant speech processing research. EURASIP Journal on Advances in Signal Processing 2016 (1), pp. 1–19. Cited by: §1.
- [22] (2012) Microphone array processing for distant speech recognition: from close-talking microphones to far-field sensors. IEEE Signal Processing Magazine 29 (6), pp. 127–140. Cited by: §1.
- [23] (2019) Multi-geometry spatial acoustic modeling for distant speech recognition. In ICASSP. Cited by: §1, §3.2.
- [24] (2016) Neural network adaptive beamforming for robust multichannel speech recognition. In Interspeech. Cited by: §1.
- [25] (2014) Using neural network front-ends on far field multiple microphones based speech recognition. In ICASSP. Cited by: §1.
- [26] (2020) Exploring transformers for large-scale speech recognition. arXiv preprint arXiv:2005.09684. Cited by: §1.
- [27] Deep long short-term memory adaptive beamforming networks for multichannel robust speech recognition. In ICASSP. Cited by: §1.
- [28] (2016) The RWTH/UPB/FORTH system combination for the 4th CHiME challenge evaluation. In CHiME-4 workshop. Cited by: §1.
- [29] (2019) Triggered attention for end-to-end speech recognition. In ICASSP. Cited by: §1.
- [30] (2017) Multichannel end-to-end speech recognition. arXiv preprint arXiv:1703.04783. Cited by: §1.
- [31] (2001) Speech recognition with microphone arrays. In Microphone Arrays, pp. 331–353. Cited by: §1.
- [32] (2020) A streaming on-device end-to-end model surpassing server-side conventional model quality and latency. In ICASSP. Cited by: §1.
- [33] (2016) Neural machine translation of rare words with subword units. In ACL. Cited by: §3.3.
- [34] (2019) Self-attention transducers for end-to-end speech recognition. In Interspeech. Cited by: §1, §2.2.
- [35] (2017) Attention is all you need. In NeurIPS. Cited by: §1, §1, §2.2.
- [36] (2018) Audio source separation and speech enhancement. John Wiley & Sons. Cited by: §1.
- [37] (2012) Techniques for noise robustness in automatic speech recognition. John Wiley & Sons. Cited by: §1.
- [38] (2020) Transformer-based acoustic modeling for hybrid speech recognition. In ICASSP. Cited by: §1.
- [39] (2018) Multi-channel deep clustering: discriminative spectral and spatial embeddings for speaker-independent speech separation. In ICASSP. Cited by: §3.3.
- [40] Combining spectral and spatial features for deep learning based blind speaker separation. TASLP 27 (2), pp. 457–468. Cited by: §3.3.
- [41] (2009) Distant speech recognition. John Wiley & Sons. Cited by: §1.
- [42] (2016) Deep beamforming networks for multi-channel speech recognition. In ICASSP. Cited by: §1.
- [43] (2019) Transformer-transducer: end-to-end speech recognition with self-attention. arXiv preprint arXiv:1910.12977. Cited by: §1, §2.2.
- [44] (2020) Transformer transducer: a streamable speech recognition model with transformer encoders and RNN-T loss. In ICASSP. Cited by: §1, §2.2.