A general pipeline in the design of many recent neural beamformers is to first perform pre-separation on each channel independently, and then apply conventional beamforming techniques such as minimum variance distortionless response (MVDR) beamforming or multi-channel Wiener filtering (MWF) based on the pre-separation outputs [6, 3, 23, 16, 26, 5, 18]. As those conventional beamforming techniques are typically defined as optimization problems that are invariant to the permutation and number of the microphones, this pipeline serves as a universal solution to multi-channel speech separation tasks in various configurations. However, as the pre-separation stage is typically trained independently and the estimation of the beamforming filters is a deterministic operation that is not optimized jointly with it, such systems may generate unreliable outputs when the pre-separation stage fails.
Another pipeline for neural beamformers is to directly estimate the beamforming filters in either the time domain or the frequency domain [19, 21, 15, 8, 12]. Without the conventional filter estimation operations used in optimization-based beamformers, this pipeline allows for end-to-end estimation of beamforming filters in a fully trainable fashion. However, such systems typically assume knowledge of the number of microphones, since a standard network layer can only generate a fixed-size output. Moreover, as the fixed-size output typically consists of the beamforming filters for all channels, it implicitly determines the permutation or indexing of the microphones when sub-parts of the output are assigned to different channels. As a consequence, permuting the channel indices while keeping the microphone locations unchanged might generate completely different outputs and lead to inconsistent performance.
A recently proposed system, the filter-and-sum network (FaSNet) [12], attempts to address the disadvantages of both types of pipelines. FaSNet directly estimates the time-domain beamforming filters without specifying the number or permutation of the microphones. With a two-stage design, the first stage applies pre-separation on a selected reference microphone by estimating its beamforming filters, and the second stage estimates the beamforming filters for all remaining microphones based on pair-wise cross-channel features between the pre-separation output and each of the remaining microphones. The filters from both stages are convolved with their corresponding channel waveforms and summed together to form the beamformed output. This is equivalent to replacing the filter estimation operation in conventional beamformers with a pair-wise end-to-end filter estimation module and jointly training the two stages. The filter estimation in the second stage is invariant to the permutation and number of the microphones due to the use of pair-wise features. Experimental results on a fixed geometry array configuration have shown that FaSNet was able to achieve better performance than conventional mask-based neural beamformers in multi-channel speech separation and dereverberation tasks [12], indicating the potential of the model.
Although FaSNet overcomes the shortcomings of both pipelines, it also weakens their strengths. It still suffers from the problem that the performance of the pre-separation stage greatly affects the filter estimation at the second stage, and the use of pair-wise features prevents it from utilizing the information from all microphones to make a global decision during filter estimation. These flaws might cause unstable and unreliable performance, especially in ad-hoc array configurations, where the acoustic properties of different microphones' signals may differ significantly.
To remove the weaknesses while preserving the advantages of both pipelines, we propose transform-average-concatenate (TAC), a simple method for microphone permutation and number invariant processing that fully utilizes the information from all microphones. A TAC module first transforms each channel's feature with a sub-module shared by all channels; the outputs are then averaged as a global-pooling stage and passed to another sub-module for extra nonlinearity. The corresponding output is then concatenated with each of the outputs of the first transformation sub-module and passed to a third sub-module for generating channel-dependent outputs. It is easy to see that, with parameter sharing at the transform and concatenate stages and the permutation-invariant property of the average stage, TAC guarantees channel permutation and number invariant processing and is always able to make global decisions. In Section 4 we will compare multiple model configurations with and without TAC and show that this design improves the FaSNet performance in both ad-hoc and fixed geometry array settings.
2 Transform-average-concatenate processing
2.1 Transform-average-concatenate (TAC)
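We consider an $N$-channel microphone array with an arbitrary geometry, where $N$ can vary between 2 and a pre-defined maximum number $N_m$. Each channel is represented by a sequential feature $\mathbf{Z}_i \in \mathbb{R}^{T \times *}$, $i = 1, \ldots, N$, where $T$ denotes the sequence length and $*$ denotes arbitrary feature dimensions. For simplicity we assume one-dimensional features, i.e. $\mathbf{Z}_i \in \mathbb{R}^{T \times D}$, although the proposed method can be easily extended to higher dimensions.
A TAC module first transforms each channel's feature with a shared sub-module. Although any neural network architecture can be applied, here we simply use a fully-connected (FC) layer with parametric rectified linear unit (PReLU) activation at each time step:
$$\mathbf{f}_{i,j} = P(\mathbf{z}_{i,j}), \quad j = 1, \ldots, T$$
where $\mathbf{z}_{i,j} \in \mathbb{R}^{1 \times D}$ is the $j$-th time step in $\mathbf{Z}_i$, $P(\cdot)$ is the mapping function defined by the FC layer, and $\mathbf{f}_{i,j}$ denotes the output for channel $i$ at time step $j$. All features $\mathbf{f}_{i,j}$, $i = 1, \ldots, N$, at time step $j$ are then averaged as a global-pooling stage, and passed to another FC layer with PReLU activation:
$$\hat{\mathbf{f}}_j = Q\left(\frac{1}{N} \sum_{i=1}^{N} \mathbf{f}_{i,j}\right)$$
where $Q(\cdot)$ is the mapping function defined by this FC layer and $\hat{\mathbf{f}}_j$ is the output at time step $j$. $\hat{\mathbf{f}}_j$ is then concatenated with $\mathbf{f}_{i,j}$ at each channel and passed to a third FC layer with PReLU activation to generate the channel-specific output $\mathbf{g}_{i,j} \in \mathbb{R}^{1 \times D}$:
$$\mathbf{g}_{i,j} = R([\mathbf{f}_{i,j}; \hat{\mathbf{f}}_j])$$
where $R(\cdot)$ is the mapping function defined by this FC layer and $[\mathbf{x}; \mathbf{y}]$ denotes the concatenation of vectors $\mathbf{x}$ and $\mathbf{y}$. A residual connection is then added between the original input $\mathbf{z}_{i,j}$ and $\mathbf{g}_{i,j}$ to form the output of the TAC module:
$$\hat{\mathbf{z}}_{i,j} = \mathbf{z}_{i,j} + \mathbf{g}_{i,j}$$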
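The three stages described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the layer sizes, the random weights, the fixed PReLU slope of 0.25 and the helper names (`make_fc`, `fc_prelu`, `tac`) are all hypothetical, and a real model would use trainable layers in a deep learning framework.

```python
import random

def prelu(v, alpha=0.25):
    """PReLU with one slope alpha (fixed here; learnable in a real model)."""
    return [x if x > 0 else alpha * x for x in v]

def make_fc(in_dim, out_dim, rng):
    """Random weights and zero bias for one fully-connected layer (illustrative only)."""
    weight = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim

def fc_prelu(layer, v):
    """Apply an FC layer followed by PReLU to one feature vector."""
    weight, bias = layer
    return prelu([sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weight, bias)])

def tac(z, p_layer, q_layer, r_layer):
    """Transform-average-concatenate over z[channel][time][feature]."""
    n_ch, n_steps = len(z), len(z[0])
    # Transform: the same layer P is applied to every channel and time step.
    f = [[fc_prelu(p_layer, z[i][j]) for j in range(n_steps)] for i in range(n_ch)]
    out = [[None] * n_steps for _ in range(n_ch)]
    for j in range(n_steps):
        # Average: mean over channels at this time step, followed by layer Q.
        mean = [sum(f[i][j][k] for i in range(n_ch)) / n_ch for k in range(len(f[0][j]))]
        f_hat = fc_prelu(q_layer, mean)
        for i in range(n_ch):
            # Concatenate: shared layer R on [f_ij; f_hat_j], then the residual connection.
            g = fc_prelu(r_layer, f[i][j] + f_hat)
            out[i][j] = [a + b for a, b in zip(z[i][j], g)]
    return out

rng = random.Random(0)
D, HID = 4, 6  # input and hidden feature sizes (arbitrary choices)
P, Q, R = make_fc(D, HID, rng), make_fc(HID, HID, rng), make_fc(2 * HID, D, rng)
z = [[[rng.gauss(0, 1) for _ in range(D)] for _ in range(5)] for _ in range(3)]
out = tac(z, P, Q, R)  # 3 channels, 5 time steps, D features each
```

Because `P` and `R` are shared and the mean does not depend on channel order, reordering the input channels only reorders the outputs in the same way, and the same parameters can be applied to any number of channels.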
TAC is closely related to recent progress in permutation invariant functions and functions defined on sets [25]. Permutation invariant neural architectures have been widely investigated in problems such as relational reasoning [20], point-cloud analysis [11] and graph neural networks [24]. The transform and average stages correspond to the general idea of parameter sharing and pooling in a family of permutation invariant functions [25], while the concatenate stage is needed because, in the problem setting of beamforming, the dimension of the outputs should match that of the inputs. The concatenate stage also allows the use of residual connections, which enables the TAC module to be inserted into any deep architecture without increasing the optimization difficulty.
2.2 Filter-and-sum network (FaSNet) with TAC
2.2.1 FaSNet recap
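Filter-and-sum network (FaSNet) is a time-domain filter-and-sum neural beamformer that directly estimates the beamforming filters with a two-stage design. It first splits the input signal $\mathbf{x}^i$ of each microphone $i = 1, \ldots, N$ into frames of $L$ samples with a hop size of $H$ samples:
$$\mathbf{x}^i_t = \mathbf{x}^i[tH : tH + L - 1] \in \mathbb{R}^{1 \times L}$$
where $t$ is the frame index. Each frame is then concatenated with a context window of $W$ samples in both future and past, resulting in a context frame of $L + 2W$ samples:
$$\hat{\mathbf{x}}^i_t = \mathbf{x}^i[tH - W : tH + L + W - 1] \in \mathbb{R}^{1 \times (L + 2W)}$$
We drop the frame index $t$ in the following discussions where there is no ambiguity. $C$ beamforming filters of length $2W + 1$, $\{\mathbf{h}^i_c\}_{c=1}^{C}$, are estimated from $\hat{\mathbf{x}}^i$ for the $C$ target sources, and the waveforms of the sources are obtained by the time-domain filter-and-sum operation:
$$\mathbf{y}_c = \sum_{i=1}^{N} \hat{\mathbf{x}}^i \circledast \mathbf{h}^i_c$$
where $\mathbf{y}_c \in \mathbb{R}^{1 \times L}$ is the beamformed output for source $c$, and $\circledast$ represents the convolution operation. All $\mathbf{y}_c$ are then converted to waveforms through the overlap-and-add method.
With a two-stage design, FaSNet first estimates the beamforming filters for a selected reference microphone, which we denote as microphone 1 without loss of generality. A cross-channel feature of length $2W + 1$, defined as the normalized cross-correlation (NCC) feature, is calculated between $\mathbf{x}^1$ and each of the $\hat{\mathbf{x}}^i$:
$$\hat{\mathbf{x}}^i_j = \hat{\mathbf{x}}^i[j : j + L - 1], \qquad d^i_j = \frac{\mathbf{x}^1 (\hat{\mathbf{x}}^i_j)^\top}{\|\mathbf{x}^1\|_2 \, \|\hat{\mathbf{x}}^i_j\|_2}, \qquad j = 1, \ldots, 2W + 1$$
It is easy to see that $\mathbf{d}^i = [d^i_1, \ldots, d^i_{2W+1}]$ is the cosine similarity between the center frame at the reference microphone and each length-$L$ window of the context frame at microphone $i$ (including the reference microphone itself). A linear layer is also applied on $\hat{\mathbf{x}}^1$ to create a $K$-dimensional embedding $\mathbf{R}^1 \in \mathbb{R}^{1 \times K}$, as with the encoder in [14]:
$$\mathbf{R}^1 = \hat{\mathbf{x}}^1 \mathbf{U}$$
where $\mathbf{U} \in \mathbb{R}^{(L + 2W) \times K}$ is the weight matrix. $\mathbf{R}^1$ is then concatenated with the mean of all $\mathbf{d}^i$ and passed to a neural network to calculate the $C$ beamforming filters $\mathbf{h}^1_c$. The mean-pooling operation applied on $\mathbf{d}^i$ guarantees that the cross-channel feature is invariant to microphone permutations. The beamforming filters are then convolved with $\hat{\mathbf{x}}^1$ to generate the pre-separation results $\mathbf{y}^1_c$:
$$\mathbf{y}^1_c = \hat{\mathbf{x}}^1 \circledast \mathbf{h}^1_c, \quad c = 1, \ldots, C$$
The second stage of FaSNet estimates source $c$'s beamforming filter for each of the remaining channels based on $\mathbf{y}^1_c$. Similarly, the NCC feature $\mathbf{d}^i_c$ between $\mathbf{y}^1_c$ and $\hat{\mathbf{x}}^i$ and the channel embedding $\mathbf{R}^i$ can be calculated in the same way. Another neural network then takes the concatenation of $\mathbf{R}^i$ and $\mathbf{d}^i_c$ as input and generates a single beamforming filter $\mathbf{h}^i_c$ for source $c$. All $\mathbf{h}^i_c$ are convolved with their corresponding window $\hat{\mathbf{x}}^i$, and summed with the pre-separation output $\mathbf{y}^1_c$ to form the final beamforming output:
$$\mathbf{y}_c = \mathbf{y}^1_c + \sum_{i=2}^{N} \hat{\mathbf{x}}^i \circledast \mathbf{h}^i_c$$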
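To make the framing, the NCC feature and the filter-and-sum step more concrete, the following plain-Python sketch applies them to a toy two-microphone signal. It is an illustration under assumptions rather than FaSNet itself: the function names, the zero padding at the signal edges, the toy sizes and the hand-written delta filters (standing in for network outputs) are all hypothetical. The filtering is written as a sliding dot product without flipping the filter, which differs from textbook convolution only by a reversal of the taps.

```python
import math

def context_frames(x, frame_len, hop, context):
    """Split x into center frames and context frames (zero-padded at the edges)."""
    padded = [0.0] * context + list(x) + [0.0] * context
    centers, contexts = [], []
    for start in range(0, len(x) - frame_len + 1, hop):
        centers.append(list(x[start:start + frame_len]))
        contexts.append(padded[start:start + frame_len + 2 * context])
    return centers, contexts

def ncc(center, context_frame, eps=1e-8):
    """Cosine similarity between the center frame and every same-length window."""
    n = len(center)
    c_norm = math.sqrt(sum(v * v for v in center))
    feats = []
    for j in range(len(context_frame) - n + 1):
        win = context_frame[j:j + n]
        w_norm = math.sqrt(sum(v * v for v in win))
        feats.append(sum(a * b for a, b in zip(center, win)) / (c_norm * w_norm + eps))
    return feats

def filter_frame(context_frame, filt):
    """Valid-mode sliding dot product of one context frame with one filter."""
    n_out = len(context_frame) - len(filt) + 1
    return [sum(context_frame[n + k] * filt[k] for k in range(len(filt)))
            for n in range(n_out)]

L, H, W = 4, 2, 3                      # toy frame, hop and context sizes
x_ref = [math.sin(0.3 * n) for n in range(16)]
x_other = [0.0, 0.0] + x_ref[:-2]      # second microphone: reference delayed by 2 samples
centers, ref_ctx = context_frames(x_ref, L, H, W)
_, other_ctx = context_frames(x_other, L, H, W)

t = 3                                  # pick one frame away from the edges
d_self = ncc(centers[t], ref_ctx[t])   # peaks at lag index W (zero delay)
d_other = ncc(centers[t], other_ctx[t])  # peak shifts by the 2-sample delay

# Hand-written filters that align both channels and average them.
h_ref = [0.0] * (2 * W + 1)
h_ref[W] = 0.5
h_other = [0.0] * (2 * W + 1)
h_other[W + 2] = 0.5
y = [a + b for a, b in zip(filter_frame(ref_ctx[t], h_ref),
                           filter_frame(other_ctx[t], h_other))]
```

Each NCC vector has `2 * W + 1` entries, and filtering a context frame of `L + 2 * W` samples with a filter of `2 * W + 1` taps returns exactly `L` samples, which is why the filters and the NCC features share the same length.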
2.2.2 FaSNet variants with TAC
The most straightforward way to apply TAC in FaSNet is to replace the pair-wise filter estimation in the second stage with a global operation, allowing the filters for each of the sources to be jointly estimated across all remaining microphones. For each block in the neural networks for filter estimation, e.g. each temporal convolution network (TCN) in [14] or each dual-path RNN (DPRNN) block in [13], the TAC architecture proposed in Section 2.1 is added at the output of the block. Figure 1 (A) and (B) compare the flowcharts of the original and modified two-stage FaSNet models.
However, the pre-separation results at the reference microphone still cannot benefit from the TAC operation with the two-stage design. We thus propose a single-stage architecture where the filters for all channels, including the reference channel, are jointly estimated. Figure 1 (C) and (D) show the single-stage FaSNet models without and with TAC, respectively. For single-stage models, each channel uses its own NCC feature, without mean-pooling.
3 Experimental procedures
Table 1: SI-SNRi (dB) on the ad-hoc array configuration. Each cell lists the results for 2 / 4 / 6 microphones; the four columns under "Overlap ratio" are the overlap-ratio ranges.
| Model | # of param. | # of mics | Overlap ratio | | | | Average |
|---|---|---|---|---|---|---|---|
| TasNet-filter | 2.9M | 2 / 4 / 6 | 16.2 / 16.6 / 16.7 | 11.2 / 10.6 / 10.8 | 7.3 / 7.1 / 6.8 | 3.8 / 4.8 / 4.2 | 9.6 / 9.9 / 9.6 |
| +NCC ave. | 2.9M | 2 / 4 / 6 | 16.8 / 17.4 / 17.2 | 11.8 / 10.7 / 11.2 | 7.4 / 7.0 / 7.0 | 4.0 / 5.1 / 4.4 | 10.2 / 10.2 / 9.9 |
| +NCC ave.+4ms | 2.9M | 2 / 4 / 6 | 17.4 / 17.9 / 17.9 | 12.5 / 12.1 / 12.1 | 8.1 / 8.2 / 7.9 | 5.0 / 5.7 / 5.5 | 10.7 / 11.1 / 10.8 |
| FaSNet | 3.0M | 2 / 4 / 6 | 15.5 / 17.9 / 17.1 | 10.9 / 10.8 / 11.1 | 6.8 / 6.9 / 6.9 | 3.2 / 5.0 / 4.6 | 9.1 / 10.3 / 9.9 |
| +TAC | 3.0M | 2 / 4 / 6 | 15.6 / 17.5 / 18.5 | 11.3 / 10.8 / 11.4 | 6.8 / 7.0 / 7.0 | 3.4 / 4.4 / 4.3 | 9.3 / 10.1 / 10.2 |
| +joint | 2.9M | 2 / 4 / 6 | 16.4 / 18.3 / 17.6 | 11.9 / 11.9 / 11.8 | 8.1 / 8.0 / 7.9 | 4.7 / 5.6 / 5.3 | 10.3 / 11.0 / 10.6 |
| +TAC+joint | 2.9M | 2 / 4 / 6 | 18.4 / 20.1 / 20.8 | 13.6 / 14.5 / 14.7 | 9.3 / 10.2 / 10.6 | 6.0 / 8.3 / 8.4 | 11.8 / 13.4 / 13.6 |
| +TAC+joint+4ms | 2.9M | 2 / 4 / 6 | 18.5 / 20.5 / 21.3 | 13.8 / 14.6 / 15.2 | 9.9 / 10.8 / 11.2 | 6.7 / 8.9 / 9.2 | 12.2 / 13.8 / 14.2 |

Table 2: SI-SNRi (dB) on the 6-mic fixed geometry array configuration.
| Model | # of param. | Speaker angle | Overlap ratio | Average |
3.1 Dataset
We evaluate our approach on the task of multi-channel two-speaker noisy speech separation with both ad-hoc and fixed geometry microphone arrays. We create a multi-channel noisy reverberant dataset with 20000, 5000 and 3000 4-second long utterances from the Librispeech dataset [17]. Two speakers are randomly selected from the 100-hour Librispeech dataset and convolved with room impulse responses generated by the image method [1] using the gpuRIR toolbox [2]. The length and width of the room are randomly sampled between 3 and 10 meters, and the height is randomly sampled between 2.5 and 4 meters. The reverberation time (T60) is randomly sampled between 0.1 and 0.5 seconds. An overlap ratio between the two speakers is then sampled between 0 and 100% such that the average overlap ratio across the dataset is 50%. The two reverberant speech signals are then shifted accordingly and summed at a random SNR between -5 and 5 dB. The resultant mixture is further corrupted by a random noise signal sampled from the 100 Nonspeech Sounds corpus [7]. The noise is repeated if it is shorter than 4 seconds, and the SNR between the mixture and the noise is randomly sampled between 10 and 20 dB. All microphone, speaker and noise locations in the ad-hoc array dataset are randomly sampled to be at least 0.5 m away from the room walls. In the fixed geometry array dataset, the microphone center is first sampled and then 6 microphones are evenly distributed on a circle with a diameter of 10 cm. The speaker locations are then sampled such that the average speaker angle with respect to the microphone center is uniformly distributed between 0 and 180 degrees. The noise location is sampled without further constraints. The ad-hoc array dataset contains utterances with 2 to 6 microphones, with an equal number of utterances for each microphone configuration.
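Two of the steps above, mixing signals at a target SNR and placing microphones on a circle, are simple enough to sketch directly. The snippet below is a hypothetical illustration in plain Python, not the dataset generation script: random Gaussian sequences stand in for reverberant speech and noise, the helper names are invented for this example, and room simulation with gpuRIR is left out entirely.

```python
import math
import random

def power(sig):
    """Mean power of a signal."""
    return sum(v * v for v in sig) / len(sig)

def scale_to_snr(target, interferer, snr_db):
    """Rescale `interferer` so the target-to-interferer power ratio equals snr_db."""
    gain = math.sqrt(power(target) / (power(interferer) * 10 ** (snr_db / 10)))
    return [gain * v for v in interferer]

def tile_to_length(noise, length):
    """Repeat a short noise clip until it covers `length` samples."""
    reps = -(-length // len(noise))  # ceiling division
    return (list(noise) * reps)[:length]

def circular_array(center_xy, n_mics=6, diameter=0.10):
    """(x, y) positions of microphones evenly spaced on a circle (meters)."""
    r = diameter / 2
    return [(center_xy[0] + r * math.cos(2 * math.pi * k / n_mics),
             center_xy[1] + r * math.sin(2 * math.pi * k / n_mics))
            for k in range(n_mics)]

rng = random.Random(0)
s1 = [rng.gauss(0, 1) for _ in range(400)]   # stand-in for reverberant speaker 1
s2 = [rng.gauss(0, 2) for _ in range(400)]   # stand-in for reverberant speaker 2

speaker_snr = rng.uniform(-5, 5)             # SNR between the two speakers
s2_scaled = scale_to_snr(s1, s2, speaker_snr)
mixture = [a + b for a, b in zip(s1, s2_scaled)]

noise_snr = rng.uniform(10, 20)              # SNR between mixture and noise
noise = tile_to_length([rng.gauss(0, 1) for _ in range(150)], len(mixture))
noise_scaled = scale_to_snr(mixture, noise, noise_snr)
noisy_mixture = [m + n for m, n in zip(mixture, noise_scaled)]

mics = circular_array((2.0, 3.0))            # array center chosen arbitrarily
```

In a multi-channel setting the same gain would normally be applied to all channels of a source so that inter-channel level differences are preserved; that detail is an assumption of this sketch and is not stated in the text.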
3.2 Model configurations
We compare multiple models in both array configurations. For single-channel models, we use the first stage in the original FaSNet as a modification to the time-domain audio separation network [14], where the separation is done by estimating filters for each context frame in the mixture instead of masking matrices on a generated front-end. We refer to this model as TasNet-filter. For adding NCC features to the models, we apply three strategies: (1) no NCC feature (pure single-channel processing), (2) concatenating the mean-pooled NCC features (i.e. the first stage in FaSNet), and (3) concatenating all NCC features according to microphone indices (similar to [4], only applicable to the fixed geometry array). For multi-channel models, we use the four variants of FaSNet introduced in Section 2.2.2. We use DPRNN blocks [13] as shown in Figure 1 in all models, as it has been shown that DPRNN outperforms the previously proposed temporal convolutional network (TCN) with a significantly smaller model size [13]. All models are trained to maximize scale-invariant SNR (SI-SNR) [10] with utterance-level permutation invariant training (uPIT) [9]. The training target is always the reverberant clean speech signals. We report SI-SNR improvement (SI-SNRi) as the separation performance metric.
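The training objective mentioned above can be illustrated with a small plain-Python sketch of SI-SNR and an utterance-level permutation search. It follows the commonly used definition of SI-SNR with mean removal and is only an assumption about the details; the function names are hypothetical, and a training loss would minimize the negative of the returned score.

```python
import math
import random
from itertools import permutations

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB between an estimate and a reference."""
    est_mean, ref_mean = sum(est) / len(est), sum(ref) / len(ref)
    est = [v - est_mean for v in est]
    ref = [v - ref_mean for v in ref]
    # Project the estimate onto the reference to get the scaled target.
    scale = sum(e * r for e, r in zip(est, ref)) / (sum(r * r for r in ref) + eps)
    target = [scale * r for r in ref]
    noise = [e - t for e, t in zip(est, target)]
    ratio = sum(t * t for t in target) / (sum(n * n for n in noise) + eps)
    return 10 * math.log10(ratio + eps)

def upit_si_snr(ests, refs):
    """Best mean SI-SNR over all assignments of estimates to references."""
    best_score, best_perm = -math.inf, None
    for perm in permutations(range(len(refs))):
        score = sum(si_snr(ests[i], refs[p]) for i, p in enumerate(perm)) / len(refs)
        if score > best_score:
            best_score, best_perm = score, perm
    return best_score, best_perm

rng = random.Random(0)
refs = [[rng.gauss(0, 1) for _ in range(200)] for _ in range(2)]
# Estimates arrive in swapped order, rescaled and slightly noisy.
ests = [[0.5 * v + 0.05 * rng.gauss(0, 1) for v in refs[1]],
        [2.0 * v + 0.05 * rng.gauss(0, 1) for v in refs[0]]]
score, perm = upit_si_snr(ests, refs)  # perm[i] is the reference matched to estimate i
```

The permutation is chosen once per utterance, which is what distinguishes uPIT from frame-level assignment, and the rescaling of the estimates does not change the score because SI-SNR is invariant to the scale of the estimate.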
The context window size $W$ is always set to 16 ms (i.e. 256 samples at a 16 kHz sample rate) and by default we set . We also investigate a smaller center frame size $L$ of 4 ms while keeping $W$ at 16 ms, which allows us to investigate the effect of $L$ with the dimension of both the beamforming filters and the NCC features unchanged (both $2W + 1$). The details of the dataset generation as well as the model configurations are available online at https://github.com/yluo42/TAC.
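As a quick arithmetic check of these sizes, the sketch below converts durations to samples and derives the resulting lengths. Writing `W` for the context window on each side and `L` for the center frame is an assumption of this example, as is the filter length of `2 * W + 1`, which is the length needed for a valid-mode filter to map a context frame of `L + 2 * W` samples to `L` output samples. The 8 ms setting is purely hypothetical and is included only to show that the filter length does not depend on the frame size.

```python
def ms_to_samples(duration_ms, sample_rate=16000):
    """Convert a duration in milliseconds to a whole number of samples."""
    return round(duration_ms * sample_rate / 1000)

def fasnet_sizes(context_ms, frame_ms, sample_rate=16000):
    """Lengths implied by a context window (per side) and a center frame."""
    W = ms_to_samples(context_ms, sample_rate)
    L = ms_to_samples(frame_ms, sample_rate)
    return {
        "context_samples": W,
        "frame_samples": L,
        "context_frame_len": L + 2 * W,  # center frame plus context on both sides
        "filter_len": 2 * W + 1,         # also the length of the NCC feature
    }

sizes_4ms = fasnet_sizes(context_ms=16, frame_ms=4)
sizes_8ms = fasnet_sizes(context_ms=16, frame_ms=8)  # hypothetical frame size
```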
4 Results and discussions
Table 1 shows the experimental results on the ad-hoc array configuration. We only report the results for 2, 4 and 6 microphones due to space limits. For the TasNet-based models, a minor performance improvement can be achieved with the averaged NCC features; however, increasing the number of microphones does not necessarily improve the performance. For the two-stage FaSNet models, the performance is no better overall than that of TasNet with the averaged NCC feature (and clearly worse with 2 microphones), even with TAC applied at the second stage. As TasNet with the averaged NCC feature is equivalent to the first stage in FaSNet, this observation indicates that the two-stage design cannot perform reliable beamforming at the second stage. On the other hand, single-stage FaSNet without TAC already outperforms both TasNet and two-stage FaSNet models, and adding TAC results in the best performance for all microphone numbers. This shows that the pre-separation stage is unnecessary in this framework, and that TAC is able to estimate filters by fully using all available information. Moreover, single-stage FaSNet with TAC is the only model whose performance improves in every overlap-ratio range as the number of microphones increases, which demonstrates the importance of TAC in this configuration.
Although TAC is designed for the ad-hoc array configuration, where the permutation and the number of microphones are unknown, we also investigate whether improvements can be achieved in a fixed geometry array configuration. Table 2 shows the experimental results with the 6-mic circular array described earlier. We notice that TasNet with all NCC features concatenated leads to even worse performance than the pure single-channel model, indicating that we might need to rethink the appropriateness of feature concatenation in such frameworks. The original FaSNet has significantly better performance than all TasNet models, which matches the observation in [12]. However, the single-stage FaSNet with TAC still greatly outperforms the original FaSNet across all conditions, showing that TAC is also helpful for fixed geometry arrays. A possible explanation for this is that TAC is able to learn geometry-dependent information even without explicit geometry-related features.
In the original FaSNet, it was reported that a smaller center window size led to significantly worse performance due to the lack of frequency resolution. Here we argue that the worse performance was actually due to the lack of global processing in filter estimation. In the last row of both tables we can observe better performance for single-stage FaSNet with TAC with a 4 ms window. This strengthens our argument and further supports the effectiveness of TAC across various model configurations.
5 Conclusion
We proposed transform-average-concatenate (TAC), a simple method for end-to-end microphone permutation and number invariant multi-channel speech separation. A TAC module first transforms each input channel feature with a sub-module, then averages the outputs and passes the average to a second sub-module, and finally concatenates the output of the second stage with each of the outputs of the first stage and passes the result to a third sub-module. The first and third sub-modules are shared across all channels. TAC can be viewed as a function defined on sets that is invariant to the permutation and number of set elements, and is guaranteed to use the full information within the set to make global decisions. We showed how TAC can be inserted seamlessly into the filter-and-sum network (FaSNet), a recently proposed end-to-end multi-channel speech separation model, to greatly improve the separation performance in both ad-hoc and fixed geometry configurations. We hope TAC can shed light on model designs for other multi-channel processing problems.
References
[1] (1979) Image method for efficiently simulating small-room acoustics. The Journal of the Acoustical Society of America 65 (4), pp. 943–950. Cited by: §3.1.
[2] (2018) gpuRIR: a Python library for room impulse response simulation with GPU acceleration. arXiv preprint arXiv:1810.11359. Cited by: §3.1.
[3] (2016) Improved MVDR beamforming using single-channel mask prediction networks. In Proc. Interspeech, pp. 1981–1985. Cited by: §1.
[4] (2019) End-to-end multi-channel speech separation. arXiv preprint arXiv:1905.06286. Cited by: §3.2.
[5] (2018) Performance of mask based statistical beamforming in a smart home scenario. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6722–6726. Cited by: §1.
[6] (2016) Neural network based spectral mask estimation for acoustic beamforming. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pp. 196–200. Cited by: §1.
[7] 100 Nonspeech Sounds. Note: http://web.cse.ohio-state.edu/pnl/corpus/HuNonspeech/HuCorpus.html Cited by: §3.1.
[8] (2018) Estimation of MVDR beamforming weights based on deep neural network. In Audio Engineering Society Convention 145. Cited by: §1.
[9] Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing 25 (10), pp. 1901–1913. Cited by: §3.2.
[10] (2019) SDR–half-baked or well done?. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 626–630. Cited by: §3.2.
[11] (2018) SO-Net: self-organizing network for point cloud analysis. pp. 9397–9406. Cited by: §2.1.
[12] (2019) FaSNet: low-latency adaptive beamforming for multi-microphone audio processing. arXiv preprint arXiv:1909.13387. Cited by: §1, §1, §4.
[13] (2019) Dual-path RNN: efficient long sequence modeling for time-domain single-channel speech separation. arXiv preprint arXiv:1910.06379. Cited by: §2.2.2, §3.2.
[14] (2019) Conv-TasNet: surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (8), pp. 1256–1266. Cited by: §2.2.1, §2.2.2, §3.2.
[15] Deep long short-term memory adaptive beamforming networks for multichannel robust speech recognition. arXiv preprint arXiv:1711.08016. Cited by: §1.
[16] (2017) Unified architecture for multichannel end-to-end speech recognition with neural beamforming. IEEE Journal of Selected Topics in Signal Processing 11 (8), pp. 1274–1288. Cited by: §1.
[17] (2015) Librispeech: an ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210. Cited by: §3.1.
[18] (2018) Deep learning based speech beamforming. arXiv preprint arXiv:1802.05383. Cited by: §1.
[19] Multichannel signal processing with deep neural networks for automatic speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 25 (5), pp. 965–979. Cited by: §1.
[20] (2017) A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems, pp. 4967–4976. Cited by: §2.1.
[21] (2016) Beamforming networks using spatial covariance features for far-field speech recognition. In Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2016 Asia-Pacific, pp. 1–6. Cited by: §1.
[22] (2016) A study of learning based beamforming methods for speech recognition. In CHiME 2016 workshop, pp. 26–31. Cited by: §1.
[23] (2017) On time-frequency mask estimation for MVDR beamforming with application in robust speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pp. 3246–3250. Cited by: §1.
[24] (2018) How powerful are graph neural networks?. arXiv preprint arXiv:1810.00826. Cited by: §2.1.
[25] (2017) Deep sets. In Advances in Neural Information Processing Systems, pp. 3391–3401. Cited by: §2.1.
[26] (2017) A speech enhancement algorithm by iterating single- and multi-microphone processing and its application to robust ASR. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pp. 276–280. Cited by: §1.