1 Introduction
Deep learning-based beamforming systems, sometimes called neural beamformers, have been an active research topic in recent years [22, 18]. A general pipeline in the design of many recent neural beamformers is to first perform pre-separation on each channel independently, and then apply conventional beamforming techniques such as minimum variance distortionless response (MVDR) beamforming or multichannel Wiener filtering (MWF) based on the pre-separation outputs [6, 3, 23, 16, 26, 5, 18]. As those conventional beamforming techniques are typically defined as optimization problems invariant to the permutation and number of the microphones, this pipeline serves as a universal solution to multi-channel speech separation tasks in various configurations. However, as the pre-separation stage is typically trained independently and the estimation of the beamforming filters is a deterministic operation on the pre-separation outputs, such systems may generate unreliable outputs when the pre-separation stage fails.
Another pipeline for neural beamformers is to directly estimate the beamforming filters in either the time domain or the frequency domain [19, 21, 15, 8, 12]. Without the use of conventional filter estimation operations in optimization-based beamformers, this pipeline allows for end-to-end estimation of the beamforming filters in a fully-trainable fashion. However, such systems typically assume knowledge of the number of microphones, since a standard network layer can only generate a fixed-size output. Moreover, as the fixed-size output typically consists of the beamforming filters for all channels, it implicitly determines the permutation or indexing of the microphones during the assignment of the sub-parts of the output to the different channels. As a consequence, permuting the channel indexes while maintaining their locations might generate completely different outputs and lead to inconsistent performance.

A recently proposed system, the filter-and-sum network (FaSNet) [12], attempts to address the disadvantages of both types of pipelines. FaSNet directly estimates the time-domain beamforming filters without specifying the number or permutation of the microphones. With a two-stage design, the first stage applies pre-separation on a selected reference microphone by estimating its beamforming filters, and the second stage estimates the beamforming filters for all remaining microphones based on pairwise cross-channel features between the pre-separation output and each of the remaining microphones. The filters from both stages are convolved with their corresponding channel waveforms and summed together to form the beamformed output. This is equivalent to replacing the filter estimation operation in conventional beamformers with a pairwise end-to-end filter estimation module and jointly training the two stages. The filter estimation in the second stage is invariant to the permutation and number of the microphones thanks to the use of pairwise features.
Experimental results on a fixed-geometry array configuration have shown that FaSNet achieves better performance than conventional mask-based neural beamformers on multi-channel speech separation and dereverberation tasks [12], indicating the potential of the model.
Although FaSNet overcomes the shortcomings of both pipelines, it also weakens their strengths. It still suffers from the problem that the performance of the pre-separation stage greatly affects the filter estimation in the second stage, and the use of pairwise features prevents it from utilizing the information from all microphones to make a global decision during filter estimation. These flaws might cause unstable and unreliable performance, especially in ad-hoc array configurations, where the acoustic properties of different microphones' signals may differ significantly.
To allow the model to shed the weaknesses and preserve the advantages of both pipelines, we propose transform-average-concatenate (TAC), a simple method for microphone permutation and number invariant processing that fully utilizes the information from all microphones. A TAC module first transforms each channel's feature with a submodule shared by all channels; the outputs are then averaged as a global-pooling stage and passed to another submodule for extra nonlinearity. The corresponding output is then concatenated with each of the outputs of the first transformation submodule and passed to a third submodule to generate channel-dependent outputs. It is easy to see that, with parameter sharing at the transform and concatenate stages and the permutation-invariant property of the average stage, TAC guarantees channel permutation and number invariant processing and is always able to make global decisions. In Section 4 we compare multiple model configurations with and without TAC and show that this design improves the FaSNet performance in both ad-hoc and fixed-geometry array settings.
2 Transform-average-concatenate processing
2.1 Transform-average-concatenate (TAC)
We consider a C-channel microphone array with an arbitrary geometry, where C can vary between 2 and a predefined maximum number K. Each channel i is represented by a sequential feature Z_i ∈ ℝ^{T×N}, where T denotes the sequence length and N denotes an arbitrary feature dimension. For simplicity we assume one-dimensional features, i.e. N = 1, although the proposed method can be easily extended to higher dimensions.

A TAC module first transforms each channel's feature with a shared submodule. Although any neural network architecture can be applied, here we simply use a fully-connected (FC) layer with parametric rectified linear unit (PReLU) activation at each time step:

    h_i^t = P(z_i^t),  i = 1, …, C    (1)

where z_i^t is the t-th time step in Z_i, P(·) is the mapping function defined by the FC layer, and h_i^t denotes the output for channel i at time step t. All features at time step t are then averaged as a global-pooling stage, and passed to another FC layer with PReLU activation:

    h̄^t = G( (1/C) Σ_{i=1}^{C} h_i^t )    (2)

where G(·) is the mapping function defined by this FC layer and h̄^t is the output at time step t. h̄^t is then concatenated with h_i^t at each channel and passed to a third FC layer with PReLU activation to generate the channel-specific output ĥ_i^t:

    ĥ_i^t = Q([h_i^t; h̄^t])    (3)

where Q(·) is the mapping function defined by this FC layer and [x; y] denotes the concatenation of vectors x and y. A residual connection is then added between the original input z_i^t and ĥ_i^t to form the output of the TAC module:

    ẑ_i^t = z_i^t + ĥ_i^t    (4)
TAC is closely related to recent progress on permutation invariant functions and functions defined on sets [25]. Permutation invariant neural architectures have been widely investigated in problems such as relational reasoning [20], point-cloud analysis [11] and graph neural networks [24]. The transform and average stages correspond to the general idea of parameter sharing and pooling in a family of permutation invariant functions [25], while the concatenate stage is applied because, in the problem setting of beamforming, the dimension of the output should match that of the input. The concatenate stage also allows the use of residual connections, which enables the TAC module to be inserted into any deep architecture without increasing the optimization difficulty.
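The transform, average, and concatenate steps above can be sketched in a few lines of NumPy. The layer sizes, random initialization, and single-slope PReLU below are illustrative assumptions, not the paper's configuration; the point is that weight sharing plus mean pooling makes the module permutation-equivariant by construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def prelu(x, a=0.25):
    # Parametric ReLU with a single shared slope for simplicity.
    return np.where(x > 0, x, a * x)

class TAC:
    """Minimal transform-average-concatenate module (sketch).

    The three FC layers are shared across channels, so the module is
    invariant to channel permutation and channel count by construction.
    """
    def __init__(self, n_in, n_hidden):
        self.P = 0.1 * rng.standard_normal((n_in, n_hidden))      # transform
        self.G = 0.1 * rng.standard_normal((n_hidden, n_hidden))  # average
        self.Q = 0.1 * rng.standard_normal((2 * n_hidden, n_in))  # concatenate

    def __call__(self, z):
        # z: (C, T, N) -- C channels, T time steps, N feature dims.
        h = prelu(z @ self.P)               # transform each channel (shared weights)
        g = prelu(h.mean(axis=0) @ self.G)  # average-pool over channels, then FC
        g = np.broadcast_to(g, h.shape)     # hand the global feature to every channel
        out = prelu(np.concatenate([h, g], axis=-1) @ self.Q)
        return z + out                      # residual connection (Eq. 4)

tac = TAC(n_in=8, n_hidden=16)
z = rng.standard_normal((4, 10, 8))  # 4-channel toy input
perm = [2, 0, 3, 1]
# Permuting the input channels permutes the output identically.
assert np.allclose(tac(z)[perm], tac(z[perm]))
```

The same `tac` object can be applied to inputs with any number of channels, since no weight dimension depends on C.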
2.2 Filter-and-sum network (FaSNet) with TAC
2.2.1 FaSNet recap
The filter-and-sum network (FaSNet) is a time-domain filter-and-sum neural beamformer that directly estimates the beamforming filters with a two-stage design. It first splits the input signal x_i of channel i into frames of L samples with a hop size of L/2 samples:

    y_i^t = x_i[tL/2 : tL/2 + L],  t = 0, 1, 2, …    (5)

where t is the frame index. Each frame is then concatenated with a context window of W samples in both future and past, resulting in a context frame of L + 2W samples:

    ŷ_i^t = x_i[tL/2 − W : tL/2 + L + W]    (6)

We drop the frame index t in the following discussion where there is no ambiguity. C beamforming filters of length 2W + 1, {h_i^s}_{i=1}^{C}, are estimated from {ŷ_i}_{i=1}^{C} for each of the S target sources, and the waveforms of the sources are obtained by a time-domain filter-and-sum operation:

    ȳ^s = Σ_{i=1}^{C} ŷ_i ⊛ h_i^s    (7)

where ȳ^s is the beamformed output for source s, and ⊛ represents the convolution operation. All ȳ^s are then converted to waveforms through the overlap-and-add method.
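The framing, filter-and-sum, and overlap-add steps can be sketched in NumPy as follows. The toy sizes, the 50% hop, and all function names are our assumptions for illustration; a `valid` convolution of an (L + 2W)-sample context frame with a (2W + 1)-tap filter yields an L-sample beamformed frame.

```python
import numpy as np

L, W, hop = 8, 4, 4  # frame, context, and hop sizes (toy values; hop = L/2 assumed)

def context_frames(x, L, W, hop):
    # Context frame t covers samples [t*hop - W, t*hop + L + W) of the
    # zero-padded signal, i.e. the center frame plus W samples on each side.
    xp = np.pad(x, (W, W))
    T = (len(x) - L) // hop + 1
    return np.stack([xp[t * hop : t * hop + L + 2 * W] for t in range(T)])

def filter_and_sum(ctx, filt):
    # ctx: (C, T, L + 2W), filt: (C, T, 2W + 1).  Each context frame is
    # convolved ('valid') with its per-frame filter and the results are
    # summed over channels, as in the filter-and-sum operation above.
    C, T, _ = ctx.shape
    out = np.zeros((T, ctx.shape[-1] - filt.shape[-1] + 1))
    for i in range(C):
        for t in range(T):
            out[t] += np.convolve(ctx[i, t], filt[i, t], mode="valid")
    return out

def overlap_add(frames, hop):
    # Reconstruct a waveform from 50%-overlapped frames.
    T, L = frames.shape
    y = np.zeros(hop * (T - 1) + L)
    for t in range(T):
        y[t * hop : t * hop + L] += frames[t]
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 64))  # 3-channel toy input
ctx = np.stack([context_frames(xi, L, W, hop) for xi in x])
filt = rng.standard_normal(ctx.shape[:2] + (2 * W + 1,))
y = overlap_add(filter_and_sum(ctx, filt), hop)
```

With these sizes the context frames have shape (3, 15, 16) and the reconstructed waveform has the original 64 samples.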
With its two-stage design, FaSNet first estimates the beamforming filters for a selected reference microphone, which we denote as microphone 1 without loss of generality. A cross-channel feature of length 2W + 1, defined as the normalized cross-correlation (NCC) feature, is calculated between y_1 and each of the ŷ_i:

    f_i[k] = (y_1^T ŷ_i[k : k + L]) / (‖y_1‖₂ ‖ŷ_i[k : k + L]‖₂),  k = 0, …, 2W    (8)

It is easy to see that f_i[k] is defined as the cosine similarity between the center frame y_1 at the reference microphone and the context frame ŷ_i at microphone i (including the reference microphone itself). A linear layer is also applied on ŷ_i to create a D-dimensional embedding e_i, as with the encoder in [14]:

    e_i = ŷ_i V    (9)

where V is the weight matrix.
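The NCC feature is a sliding cosine similarity between the center frame and every L-sample window of a context frame. A small NumPy sketch (function and variable names are ours):

```python
import numpy as np

def ncc(center, context, eps=1e-8):
    # center: (L,) reference-channel center frame.
    # context: (L + 2W,) context frame of some channel.
    # Returns the 2W + 1 cosine similarities between the center frame
    # and every L-sample window of the context frame.
    L = len(center)
    lags = len(context) - L + 1
    out = np.empty(lags)
    cn = np.linalg.norm(center) + eps
    for k in range(lags):
        seg = context[k : k + L]
        out[k] = center @ seg / (cn * (np.linalg.norm(seg) + eps))
    return out

# Toy check: a center frame matched against its own context peaks at lag W.
x = np.arange(1.0, 25.0)
L, W = 8, 3
center = x[5 : 5 + L]
context = x[5 - W : 5 + L + W]
f = ncc(center, context)
```

Since the context window contains the center frame itself at offset W, the similarity at lag W is exactly 1 when the feature is computed against the reference channel.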
e_1 is then concatenated with the mean of all f_i and passed to a neural network to calculate the beamforming filters {h_1^s}_{s=1}^{S}. The mean-pooling operation applied on {f_i} guarantees that the cross-channel feature is invariant to the microphone permutation. The beamforming filters are then convolved with ŷ_1 to generate the pre-separation results p^s:

    p^s = ŷ_1 ⊛ h_1^s    (10)
The second stage of FaSNet estimates source s's beamforming filter for each of the remaining channels based on p^s. Similarly, the NCC feature g_i^s between p^s and ŷ_i and the channel embedding e_i can be calculated in the same way. Another neural network then takes the concatenation of g_i^s and e_i as input and generates a single beamforming filter h_i^s for source s. All h_i^s are convolved with their corresponding context windows ŷ_i, and summed with the pre-separation output to form the final beamformed output:

    ȳ^s = p^s + Σ_{i=2}^{C} ŷ_i ⊛ h_i^s    (11)
2.2.2 FaSNet variants with TAC
The most straightforward way to apply TAC in FaSNet is to replace the pairwise filter estimation in the second stage with a global operation, allowing the filters for each of the sources to be jointly estimated across all remaining microphones. For each block in the neural networks for filter estimation, e.g. each temporal convolutional network (TCN) block in [14] or each dual-path RNN (DPRNN) block in [13], the TAC architecture proposed in Section 2.1 is added at the output of the block. Figure 1 (A) and (B) compare the flowcharts of the original and modified two-stage FaSNet models.
However, with the two-stage design the pre-separation results at the reference microphone still cannot benefit from the TAC operation. We thus propose a single-stage architecture where the filters for all channels, including the reference channel, are jointly estimated. Figure 1 (C) and (D) show the single-stage FaSNet models without and with TAC, respectively. For single-stage models, the NCC feature in equation (8) is used for each channel without mean-pooling.
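The resulting design alternates channel-independent modeling with cross-channel TAC steps. The toy NumPy sketch below illustrates why such a stack stays permutation-equivariant end to end; the elementwise tanh standing in for a DPRNN block and the simplified mean-based TAC step are illustrative assumptions, not the actual layers.

```python
import numpy as np

def tac_like(z):
    # Stand-in for a TAC module: feed the channel mean back to each channel,
    # the simplest permutation-invariant cross-channel operation.
    return z + z.mean(axis=0, keepdims=True)

def separator(z, n_blocks=3):
    # z: (C, T, N).  Each block models every channel independently with
    # shared weights (tanh here as a placeholder for a DPRNN block), then
    # a TAC-style step exchanges information across channels.
    for _ in range(n_blocks):
        z = np.tanh(z)     # per-channel modeling, weights shared over channels
        z = tac_like(z)    # cross-channel communication
    return z

z = np.random.default_rng(0).standard_normal((4, 10, 8))
perm = [3, 1, 0, 2]
# The whole stack commutes with any channel permutation.
assert np.allclose(separator(z)[perm], separator(z[perm]))
```

Because neither step depends on the channel count, the same stack also accepts any number of channels, which is what the single-stage model needs for ad-hoc arrays.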
3 Experimental procedures


Table 1: SI-SNRi (dB) on the ad-hoc array configuration, reported as 2 / 4 / 6 microphones and grouped by speaker overlap ratio.

| Model | # of param. | <25% | 25–50% | 50–75% | >75% | Average |
| --- | --- | --- | --- | --- | --- | --- |
| TasNet-filter | 2.9M | 16.2 / 16.6 / 16.7 | 11.2 / 10.6 / 10.8 | 7.3 / 7.1 / 6.8 | 3.8 / 4.8 / 4.2 | 9.6 / 9.9 / 9.6 |
| +NCC ave. | 2.9M | 16.8 / 17.4 / 17.2 | 11.8 / 10.7 / 11.2 | 7.4 / 7.0 / 7.0 | 4.0 / 5.1 / 4.4 | 10.2 / 10.2 / 9.9 |
| +NCC ave.+4ms | 2.9M | 17.4 / 17.9 / 17.9 | 12.5 / 12.1 / 12.1 | 8.1 / 8.2 / 7.9 | 5.0 / 5.7 / 5.5 | 10.7 / 11.1 / 10.8 |
| FaSNet | 3.0M | 15.5 / 17.9 / 17.1 | 10.9 / 10.8 / 11.1 | 6.8 / 6.9 / 6.9 | 3.2 / 5.0 / 4.6 | 9.1 / 10.3 / 9.9 |
| +TAC | 3.0M | 15.6 / 17.5 / 18.5 | 11.3 / 10.8 / 11.4 | 6.8 / 7.0 / 7.0 | 3.4 / 4.4 / 4.3 | 9.3 / 10.1 / 10.2 |
| +joint | 2.9M | 16.4 / 18.3 / 17.6 | 11.9 / 11.9 / 11.8 | 8.1 / 8.0 / 7.9 | 4.7 / 5.6 / 5.3 | 10.3 / 11.0 / 10.6 |
| +TAC+joint | 2.9M | 18.4 / 20.1 / 20.8 | 13.6 / 14.5 / 14.7 | 9.3 / 10.2 / 10.6 | 6.0 / 8.3 / 8.4 | 11.8 / 13.4 / 13.6 |
| +TAC+joint+4ms | 2.9M | 18.5 / 20.5 / 21.3 | 13.8 / 14.6 / 15.2 | 9.9 / 10.8 / 11.2 | 6.7 / 8.9 / 9.2 | 12.2 / 13.8 / 14.2 |



Table 2: SI-SNRi (dB) on the fixed-geometry 6-mic circular array, grouped by speaker angle and by overlap ratio.

| Model | # of param. | <15° | 15–45° | 45–90° | >90° | <25% | 25–50% | 50–75% | >75% | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TasNet-filter | 2.9M | 9.6 | 9.7 | 9.7 | 9.7 | 16.1 | 11.3 | 7.0 | 4.3 | 9.7 |
| +NCC concat. | 3.1M | 9.2 | 9.0 | 8.9 | 8.9 | 15.4 | 10.6 | 6.6 | 3.5 | 9.0 |
| +NCC ave. | 2.9M | 10.2 | 10.0 | 9.9 | 9.8 | 15.8 | 11.5 | 7.9 | 4.8 | 10.0 |
| +NCC ave.+4ms | 2.9M | 10.9 | 11.0 | 10.8 | 10.9 | 17.4 | 12.5 | 8.3 | 5.5 | 10.9 |
| FaSNet | 3.0M | 11.1 | 11.8 | 12.4 | 12.9 | 18.1 | 13.2 | 9.6 | 7.3 | 12.0 |
| +TAC+joint | 2.9M | 11.8 | 13.2 | 14.2 | 14.8 | 19.3 | 14.6 | 11.2 | 8.9 | 13.5 |
| +TAC+joint+4ms | 2.9M | 12.4 | 13.8 | 14.6 | 15.2 | 19.9 | 15.3 | 11.5 | 9.4 | 14.0 |

3.1 Dataset
We evaluate our approach on the task of multi-channel two-speaker noisy speech separation with both ad-hoc and fixed-geometry microphone arrays. We create a multi-channel noisy reverberant dataset with 20000, 5000 and 3000 4-second long utterances for the training, validation and test sets, respectively, from the Librispeech dataset [17]. Two speakers are randomly selected from the 100-hour Librispeech subset and convolved with room impulse responses generated by the image method [1] using the gpuRIR toolbox [2]. The length and width of the room are randomly sampled between 3 and 10 meters, and the height is randomly sampled between 2.5 and 4 meters. The reverberation time (T60) is randomly sampled between 0.1 and 0.5 seconds. An overlap ratio between the two speakers is then sampled between 0 and 100% such that the average overlap ratio across the dataset is 50%. The two reverberant speech signals are then shifted accordingly and summed at a random SNR between −5 and 5 dB. The resultant mixture is further corrupted by a random noise signal sampled from [7]. The noise is repeated if its length is smaller than 4 seconds, and the SNR between the mixture and the noise is randomly sampled between 10 and 20 dB. All microphone, speaker and noise locations in the ad-hoc array dataset are randomly sampled to be at least 0.5 m away from the room walls. In the fixed-geometry array dataset, the microphone center is first sampled and then 6 microphones are evenly distributed around a circle with a diameter of 10 cm. The speaker locations are then sampled such that the average speaker angle with respect to the microphone center is uniformly distributed between 0 and 180 degrees. The noise location is sampled without further constraints. The ad-hoc array dataset contains utterances with 2 to 6 microphones, with an equal number of utterances for each microphone count.
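The SNR sampling above amounts to rescaling one signal relative to the other before summation. A minimal sketch (function name and shapes are ours, not from the released generation code):

```python
import numpy as np

def mix_at_snr(ref, interf, snr_db):
    # Scale `interf` so that 10*log10(P_ref / P_interf) equals snr_db,
    # then sum the two signals; returns the mixture and the scaled interferer.
    p_ref = np.mean(ref ** 2)
    p_int = np.mean(interf ** 2)
    scale = np.sqrt(p_ref / (p_int * 10 ** (snr_db / 10)))
    return ref + scale * interf, scale * interf

rng = np.random.default_rng(1)
s1 = rng.standard_normal(16000)
s2 = rng.standard_normal(16000)
mixture, s2_scaled = mix_at_snr(s1, s2, 5.0)
```

The same routine covers both the speaker-vs-speaker SNR (sampled in [−5, 5] dB) and the mixture-vs-noise SNR (sampled in [10, 20] dB).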
3.2 Model configurations
We compare multiple models in both array configurations. For single-channel models, we use the first stage of the original FaSNet as a modification to the time-domain audio separation network [14], where the separation is done by estimating filters for each context frame in the mixture instead of masking matrices on a learned front-end. We refer to this model as TasNet-filter. For adding NCC features to the models, we apply three strategies: (1) no NCC feature (pure single-channel processing), (2) concatenating the mean-pooled NCC features (i.e. the first stage in FaSNet), and (3) concatenating all NCC features according to the microphone indexes (similar to [4], only applicable to the fixed-geometry array). For multi-channel models, we use the four variants of FaSNet introduced in Section 2.2.2. We use DPRNN blocks [13] as shown in Figure 1 in all models, as it has been shown that DPRNN outperforms the previously proposed temporal convolutional network (TCN) with a significantly smaller model size [13]. All models are trained to maximize scale-invariant SNR (SI-SNR) [10] with utterance-level permutation invariant training (uPIT) [9]. The training target is always the reverberant clean speech signals. We report SI-SNR improvement (SI-SNRi) as the separation performance metric.
The context window size W is always set to 16 ms (i.e. 256 samples at a 16 kHz sample rate) and by default we set L = W. We also investigate a smaller L of 4 ms while keeping W at 16 ms, which allows us to examine the effect of L with the dimensions of both the beamforming filters and the NCC features unchanged (both 2W + 1). The details of the dataset generation as well as the model configurations are available online (https://github.com/yluo42/TAC).
4 Results and discussions
Table 1 shows the experimental results on the ad-hoc array configuration. We only report the results for 2, 4 and 6 microphones due to space limitations. For the TasNet-based models, a minor performance improvement can be achieved with the averaged NCC features; however, increasing the number of microphones does not necessarily improve the performance. For the two-stage FaSNet models, the performance is even worse than TasNet with the NCC feature, even with TAC applied at the second stage. As TasNet with the averaged NCC feature is equivalent to the first stage of FaSNet, this observation indicates that the two-stage design cannot perform reliable beamforming at the second stage. On the other hand, single-stage FaSNet without TAC already outperforms both TasNet and the two-stage FaSNet models, and adding TAC results in the best performance for all microphone counts. This shows that the pre-separation stage is unnecessary in this framework, and TAC is able to estimate filters by fully using all available information. Moreover, single-stage FaSNet with TAC is the only model achieving better performance with more microphones, proving the importance of TAC in this configuration.
Although TAC is designed for the ad-hoc array configuration, where the permutation and the number of microphones are unknown, we also investigate whether improvements can be achieved in a fixed-geometry array configuration. Table 2 shows the experimental results with the 6-mic circular array described earlier. We notice that TasNet with all NCC features concatenated leads to even worse performance than the pure single-channel model, indicating that we might need to rethink the appropriateness of feature concatenation in such frameworks. The original FaSNet performs significantly better than all TasNet models, which matches the observation in [12]. However, single-stage FaSNet with TAC still greatly outperforms the original FaSNet across all conditions, showing that TAC is also helpful for fixed-geometry arrays. A possible explanation is that TAC is able to learn geometry-dependent information even without explicit geometry-related features.
In the original FaSNet, it was reported that a smaller center window size led to significantly worse performance due to the lack of frequency resolution. Here we argue that the worse performance was actually due to the lack of global processing in filter estimation. In the last row of both tables we observe better performance for single-stage FaSNet with TAC with a 4 ms window. This strengthens our argument and further demonstrates the effectiveness of TAC across various model configurations.
5 Conclusion
We proposed transform-average-concatenate (TAC), a simple method for end-to-end microphone permutation and number invariant multi-channel speech separation. A TAC module first transforms each input channel's feature with a submodule, averages the outputs and passes the result to another submodule, and finally concatenates the output of the second stage with each of the outputs of the first stage and passes them to a third submodule. The first and third submodules are shared across all channels. TAC can be viewed as a function defined on sets that is invariant to the permutation and number of set elements, and is guaranteed to use the full information within the set to make global decisions. We showed how TAC can be inserted seamlessly into the filter-and-sum network (FaSNet), a recently proposed end-to-end multi-channel speech separation model, to greatly improve the separation performance in both ad-hoc and fixed-geometry configurations. We hope TAC can shed light on model designs for other multi-channel processing problems.
References
 [1] (1979) Image method for efficiently simulating small-room acoustics. The Journal of the Acoustical Society of America 65 (4), pp. 943–950. Cited by: §3.1.
 [2] (2018) gpuRIR: a Python library for room impulse response simulation with GPU acceleration. arXiv preprint arXiv:1810.11359. Cited by: §3.1.
 [3] (2016) Improved MVDR beamforming using single-channel mask prediction networks. In Proc. Interspeech, pp. 1981–1985. Cited by: §1.
 [4] (2019) End-to-end multi-channel speech separation. arXiv preprint arXiv:1905.06286. Cited by: §3.2.
 [5] (2018) Performance of mask-based statistical beamforming in a smart home scenario. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6722–6726. Cited by: §1.
 [6] (2016) Neural network based spectral mask estimation for acoustic beamforming. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pp. 196–200. Cited by: §1.
 [7] 100 Nonspeech Sounds. Note: http://web.cse.ohio-state.edu/pnl/corpus/HuNonspeech/HuCorpus.html Cited by: §3.1.
 [8] (2018) Estimation of MVDR beamforming weights based on deep neural network. In Audio Engineering Society Convention 145. Cited by: §1.

 [9] (2017) Multi-talker speech separation with utterance-level permutation invariant training of deep recurrent neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing 25 (10), pp. 1901–1913. Cited by: §3.2.
 [10] (2019) SDR – half-baked or well done?. In ICASSP 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 626–630. Cited by: §3.2.

 [11] (2018) SO-Net: self-organizing network for point cloud analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9397–9406. Cited by: §2.1.
 [12] (2019) FaSNet: low-latency adaptive beamforming for multi-microphone audio processing. arXiv preprint arXiv:1909.13387. Cited by: §1, §4.
 [13] (2019) Dual-path RNN: efficient long sequence modeling for time-domain single-channel speech separation. arXiv preprint arXiv:1910.06379. Cited by: §2.2.2, §3.2.
 [14] (2019) Conv-TasNet: surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (8), pp. 1256–1266. Cited by: §2.2.1, §2.2.2, §3.2.

 [15] (2017) Deep long short-term memory adaptive beamforming networks for multichannel robust speech recognition. arXiv preprint arXiv:1711.08016. Cited by: §1.
 [16] (2017) Unified architecture for multichannel end-to-end speech recognition with neural beamforming. IEEE Journal of Selected Topics in Signal Processing 11 (8), pp. 1274–1288. Cited by: §1.
 [17] (2015) LibriSpeech: an ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210. Cited by: §3.1.
 [18] (2018) Deep learning based speech beamforming. arXiv preprint arXiv:1802.05383. Cited by: §1.

 [19] (2017) Multichannel signal processing with deep neural networks for automatic speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 25 (5), pp. 965–979. Cited by: §1.
 [20] (2017) A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems, pp. 4967–4976. Cited by: §2.1.
 [21] (2016) Beamforming networks using spatial covariance features for far-field speech recognition. In Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2016 Asia-Pacific, pp. 1–6. Cited by: §1.
 [22] (2016) A study of learning based beamforming methods for speech recognition. In CHiME 2016 Workshop, pp. 26–31. Cited by: §1.
 [23] (2017) On time-frequency mask estimation for MVDR beamforming with application in robust speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pp. 3246–3250. Cited by: §1.
 [24] (2018) How powerful are graph neural networks?. arXiv preprint arXiv:1810.00826. Cited by: §2.1.
 [25] (2017) Deep sets. In Advances in Neural Information Processing Systems, pp. 3391–3401. Cited by: §2.1.
 [26] (2017) A speech enhancement algorithm by iterating single- and multi-microphone processing and its application to robust ASR. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pp. 276–280. Cited by: §1.