We propose an end-to-end trainable approach to single-channel speech separation with an unknown number of speakers. Our approach extends the MulCat source separation backbone with additional output heads: a count-head to infer the number of speakers, and decoder-heads for reconstructing the original signals. Beyond the model, we also propose a metric for evaluating source separation with a variable number of speakers. Specifically, we address how to evaluate quality when the ground truth has more or fewer speakers than the model predicts. We evaluate our approach on the WSJ0-mix datasets, with mixtures of up to five speakers. We demonstrate that our approach outperforms the state of the art in counting the number of speakers and remains competitive in the quality of the reconstructed signals.
Source separation is the task of decomposing a mixed signal into the original signals prior to the mixing procedure. This is an important task with many downstream applications, e.g., improving the accuracy of automatic speech recognition with multiple speakers, or separating singing voices from music. Due to the recent progress in deep learning, supervised methods have received a lot of interest [3, 1, 12, 7, 6]. These works formulate source separation as a regression problem, i.e., given the mixed signal, regress the individual components. Various specialized deep-net architectures and losses have been proposed. For example, prior work has proposed a loss that is permutation invariant in the ordering of the speakers, and a dual-path RNN architecture that better captures both short- and long-term features. However, these works have focused on the setting where the number of speakers is known a priori.
Recently, several works have also considered the case with a variable number of speakers. For example, one line of work separates a variable number of speakers by training a different model for every number of speakers. At test time, an activity detector is run on the largest speaker model to determine the number of speakers, and the corresponding model is then run for source separation. Another line of work iteratively separates out one speaker at a time. While straightforward, these methods require either training multiple deep-nets or running multiple forward passes at test time, both of which scale linearly with the possible number of speakers.
To tackle the aforementioned issues, we propose to train a single model with multiple output heads: a count-head to infer the number of speakers, and multiple decoder-heads to separate the signals. These output heads share the same backbone feature extractor. Therefore, our method requires a single pass through the network at test time and can be trained end-to-end. Additionally, we propose a new metric for evaluating the separation of a variable number of speakers. In particular, our metric considers how to evaluate the quality of the reconstruction when the number of speakers differs between prediction and ground truth.
We evaluate our approach on the WSJ0-mix datasets, with mixtures of up to five speakers. Our approach surpasses all existing approaches in source counting and achieves performance similar to state-of-the-art models in source separation.
We present a single-model approach to source separation with a variable number of speakers, illustrated in the pipeline figure (fig:pipeline). In particular, we augment the standard source separation backbone with an additional count-head and decoder-heads to support prediction of a variable number of speakers in a single pass. In the following, we describe our approach in more detail.
Let $x$ denote the mixed input audio, and $Y$ denote the set of separated audio signals, one per speaker. The goal is to learn a parametric function, $F_\theta(x) \mapsto Y$, with trainable parameters $\theta$. One of the challenges is how to construct a model that handles a variable number of outputs. For example, a standard deep-net has a fixed number of output dimensions, which does not change between examples.
To mitigate this problem, we assume that the maximum number of speakers, $K$, is known. In this case, we can use a deep-net to count the number of speakers and a separate decoder-head for each possible number of speakers. This allows us to dynamically select which decoder-head to run and to output the correct number of speakers.
We propose a single end-to-end trainable deep-net to accomplish this. Our deep-net contains a count-head, which counts the number of speakers in the mixed-audio, and a list of decoder-heads to reconstruct audios for the corresponding number of speakers. These heads share input features extracted from a backbone network. In the remainder of this section, we describe the architecture details and training procedure for our method.
Our model contains a mixture encoder that transforms the waveform into an encoding, and a backbone that extracts source encodings from the mixture encoding, following prior work. Instead of using a single decoder head with a fixed number of output channels, we use a set of decoder heads, each with a different number of output channels, where each channel contains the source from one speaker. We also add a count head that chooses which decoder head to use during inference.
Encoder & Backbone: As in prior work, we use a convolution with ReLU to encode the mixture waveform, then use repeated MulCat blocks as the backbone separation network.
Count-Head: We train a speaker count head as an additional branch in parallel with the decoder heads. Given the output tensor from the backbone network, we first linearly transform the feature dimension, then apply global average pooling and ReLU. We then linearly project the result to the set of possible decoder choices, and apply softmax to the output.
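The count-head computation described above can be sketched in plain Python. This is only an illustration under stated assumptions: the backbone output is simplified to a 2-D list of frames by features, and the weight names `w1`, `b1`, `w2`, `b2` and the helper names are hypothetical, not taken from the paper's implementation.

```python
import math


def linear(vec, weight, bias):
    """Affine map; `weight` is a list of output rows, each as long as `vec`."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def count_head(features, w1, b1, w2, b2):
    """Map backbone features (frames x features) to speaker-count probabilities."""
    # 1) Linear transform of the feature dimension, applied to every frame.
    hidden = [linear(frame, w1, b1) for frame in features]
    # 2) Global average pooling over all frames.
    n_frames = len(hidden)
    pooled = [sum(column) / n_frames for column in zip(*hidden)]
    # 3) ReLU.
    activated = [max(0.0, h) for h in pooled]
    # 4) Linear projection to the decoder choices, followed by softmax.
    return softmax(linear(activated, w2, b2))


# Toy usage: 2 frames, 2 features, 3 decoder choices (all values hypothetical).
toy_features = [[1.0, 2.0], [3.0, 4.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
projection = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
toy_probs = count_head(toy_features, identity, [0.0, 0.0],
                       projection, [0.0, 0.0, 0.0])
```

In a real model the same steps would be expressed with tensor operations in a deep-learning framework; the list-based version only makes the order of operations explicit.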
Decoder-Heads: We use a list of decoders, as in prior work. For the decoder that handles $k$ speakers, given an input tensor, we apply PReLU with a channel-independent parameter, and use a 1x1 convolution to project the feature dimension so that it covers all $k$ speakers. We then divide the projected tensor into $k$ tensors, one per speaker, and transform each tensor back to audio with overlap-and-add.
To train the count-head, we formulate it as a classification task, i.e., we minimize the cross-entropy loss, $\mathcal{L}_{\text{count}}(x, Y) = -\sum_{k}^{K} \mathbb{1}_{|Y|=k} \cdot \log \hat{p}(|Y|=k)(x)$, where $\mathbb{1}$ denotes the indicator function and $\hat{p}(|Y|=k)$ denotes the predicted probability that the mixed input audio, $x$, has $k$ speakers.
Next, to train the decoder-heads, we utilize the permutation invariant loss, uPIT, on the decoder-head selected by the ground-truth number of speakers, i.e., $\mathcal{L}_{\text{decoders}}(x, Y) = \sum_{k} \mathbb{1}_{|Y|=k} \cdot \text{uPIT}(Y, \hat{Y}_k)$, where $\hat{Y}_k$ denotes the output from the decoder-head for $k$ speakers and $\text{uPIT}(Y, \hat{Y}_k) = -\max_{\pi} \sum_{n} \text{SI-SNR}(y^{\pi(n)}, \hat{y}_k^{n})$, where $\pi$ denotes a permutation on the speaker channels, and SI-SNR stands for scale-invariant signal-to-noise ratio.
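A minimal sketch of the permutation invariant loss may help make the definition concrete. It assumes the commonly used zero-mean, projection-based form of SI-SNR, since the text above defers its exact definition to a citation; the small constant `EPS` and the function names are choices made for this example only.

```python
import math
from itertools import permutations

EPS = 1e-8  # small constant to avoid division by zero (an assumption)


def si_snr(target, estimate):
    """Scale-invariant SNR in dB, using the common zero-mean definition."""
    t_mean = sum(target) / len(target)
    e_mean = sum(estimate) / len(estimate)
    t = [v - t_mean for v in target]
    e = [v - e_mean for v in estimate]
    # Project the estimate onto the target to get the "signal" part.
    scale = sum(a * b for a, b in zip(e, t)) / (sum(a * a for a in t) + EPS)
    projection = [scale * a for a in t]
    noise = [a - b for a, b in zip(e, projection)]
    signal_energy = sum(a * a for a in projection) + EPS
    noise_energy = sum(a * a for a in noise) + EPS
    return 10.0 * math.log10(signal_energy / noise_energy)


def upit_loss(targets, estimates):
    """Negative of the best summed SI-SNR over all speaker permutations."""
    n = len(targets)
    best = max(
        sum(si_snr(targets[perm[i]], estimates[i]) for i in range(n))
        for perm in permutations(range(n))
    )
    return -best


# Toy usage: the estimates are the targets in swapped order and rescaled.
toy_targets = [[1.0, -1.0, 1.0, -1.0], [1.0, 1.0, -1.0, -1.0]]
toy_estimates = [[2.0, 2.0, -2.0, -2.0], [0.5, -0.5, 0.5, -0.5]]
toy_loss = upit_loss(toy_targets, toy_estimates)
```

The exhaustive search over permutations grows factorially with the number of speakers, which is acceptable for the small speaker counts considered here.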
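Finally, we balance the two losses with a hyper-parameter $\alpha$, and train over a dataset $\mathcal{D}$ of paired mixed inputs and separated audio. The overall objective is as follows: $\min_{\theta} \sum_{(x, Y) \in \mathcal{D}} \alpha \cdot \mathcal{L}_{\text{count}}(x, Y) + (1-\alpha) \cdot \mathcal{L}_{\text{decoders}}(x, Y)$.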
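As a small illustration of the combined objective for a single training example, the sketch below adds the cross-entropy of the count-head to a precomputed decoder loss. The balancing weight (called `alpha` here) and all numeric values are hypothetical; the value used in the paper is not given in this text.

```python
import math


def count_loss(count_probs, true_index):
    """Cross-entropy for one example: -log of the probability of the true count."""
    return -math.log(count_probs[true_index])


def total_loss(count_probs, true_index, decoder_loss, alpha):
    """Weighted sum of the count loss and the decoder (uPIT) loss."""
    return (alpha * count_loss(count_probs, true_index)
            + (1.0 - alpha) * decoder_loss)


# Toy usage: three count choices, true choice at index 1, uPIT loss of -10 dB.
toy_total = total_loss([0.25, 0.5, 0.25], true_index=1,
                       decoder_loss=-10.0, alpha=0.5)
```

Only the decoder-head matching the ground-truth speaker count contributes to `decoder_loss`, which is why a single scalar is passed in.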
At test time, the ground-truth number of speakers is not available. In this case, we use the estimated number of speakers from the count-head to select which decoder-head to run. Therefore, $\hat{Y} = \hat{Y}_{\hat{c}}$, with $\hat{c} = \arg\max_{k} \hat{p}(|Y|=k)$, is the final prediction given $x$.
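The test-time selection can be sketched as follows. The decoder-heads are modeled as plain callables and the function name `predict` is hypothetical; the example only shows how the argmax of the count-head picks one head to run.

```python
def predict(count_probs, decoder_heads, features):
    """Run only the decoder-head chosen by the count-head's most likely class.

    decoder_heads[i] is assumed to be a callable for the i-th count choice.
    """
    best = max(range(len(count_probs)), key=count_probs.__getitem__)
    return decoder_heads[best](features)


# Toy usage: stand-in heads that return 2, 3, or 4 empty source channels.
toy_heads = [
    lambda feats: [[] for _ in range(2)],
    lambda feats: [[] for _ in range(3)],
    lambda feats: [[] for _ in range(4)],
]
toy_sources = predict([0.1, 0.7, 0.2], toy_heads, features=None)
```

Because only one head is evaluated, the cost at test time does not grow with the number of possible speaker counts beyond the shared backbone pass.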
Evaluating a system for source separation with a variable number of speakers remains an open question. It may seem that standard metrics, e.g., SI-SNR, are directly applicable; however, these metrics require the number of predicted signals and ground-truth signals to be identical. When the system incorrectly predicts the number of speakers, it is unclear how to compute SI-SNR.
Prior work computes a metric as follows. Let $|\hat{Y}|$ be the number of predicted speakers and $|Y|$ be the ground-truth number. In case (a), when $|\hat{Y}| > |Y|$, they compute the correlation between all audio pairs and keep $|Y|$ speakers from the prediction. In case (b), when $|\hat{Y}| < |Y|$, they duplicate the speakers with the highest correlation to the ground-truth samples. With the number of speakers matched, they compute the standard SI-SNR. We note that this choice of duplication or dropping relies on the ground-truth signal. This is not desirable, as a post-processing procedure should not depend on the ground truth.
We believe that it is more natural to add “silence” speakers, to either the ground truth or the prediction, until the number of speakers in the ground truth and the prediction is identical. Intuitively, a two-speaker mixed signal can be thought of as a three-speaker mixed signal where one of the speakers is silent. However, we run into the issue that SI-SNR is equal to negative infinity if the signal is zero.
To avoid this, instead of padding with silence, we choose a negative penalty term $P_{\text{ref}}$, defined as an approximation to the SI-SNR that would be measured if we padded with silence. We name this metric penalized-SI-SNR (P-SI-SNR).
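Given dataset $\mathcal{D}$, P-SI-SNR is defined as $\frac{1}{|\mathcal{D}|} \sum_{(x, Y) \in \mathcal{D}} \frac{1}{\max(|Y|, |\hat{Y}|)} (\text{match} + \text{pad})$, where $|\hat{Y}|$ is the number of predicted sources, and $\text{match}$ and $\text{pad}$ are defined as follows: $\text{match} = \max_{\pi} \sum_{n=1}^{\min(|Y|, |\hat{Y}|)} \text{SI-SNR}(y^{\pi(n)}, \hat{y}^{n})$ and $\text{pad} = P_{\text{ref}} \cdot \big| |Y| - |\hat{Y}| \big|$.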
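The metric above can be sketched for precomputed pairwise SI-SNR values. One assumption is made explicit here: when the speaker counts differ, the matching term is taken as the best one-to-one assignment of the smaller set into the larger set. The function names and the toy score matrix are hypothetical.

```python
from itertools import permutations


def penalized_si_snr(scores, p_ref):
    """P-SI-SNR for one example.

    scores[i][j] is the SI-SNR (dB) between ground-truth source i and
    predicted source j; p_ref is the (negative) penalty per missing or
    extra speaker.
    """
    n_true, n_pred = len(scores), len(scores[0])
    if n_true <= n_pred:
        # Assign each ground-truth source to a distinct prediction.
        match = max(
            sum(scores[i][perm[i]] for i in range(n_true))
            for perm in permutations(range(n_pred), n_true)
        )
    else:
        # Assign each prediction to a distinct ground-truth source.
        match = max(
            sum(scores[perm[j]][j] for j in range(n_pred))
            for perm in permutations(range(n_true), n_pred)
        )
    pad = p_ref * abs(n_true - n_pred)
    return (match + pad) / max(n_true, n_pred)


def dataset_p_si_snr(all_scores, p_ref):
    """Average the per-example P-SI-SNR over a dataset."""
    return sum(penalized_si_snr(s, p_ref) for s in all_scores) / len(all_scores)


# Toy usage: 2 true speakers, 3 predicted sources, penalty of -30 dB.
toy_scores = [[10.0, 0.0, -5.0], [1.0, 12.0, -3.0]]
toy_metric = penalized_si_snr(toy_scores, p_ref=-30.0)
```

In the toy case the best assignment contributes 22 dB, one extra source adds the -30 dB penalty, and the result is divided by 3, the larger of the two counts.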
We believe that our proposed metric is intuitive and naturally balances reconstruction quality against misclassifications in the number of speakers. We next discuss two possible choices of $P_{\text{ref}}$.
Measuring from data: One solution is to model the “silence” as some zero-mean noise distribution. In this case, we measure the SNR empirically based on samples from WSJ0 recordings. We cut out 0.75-second noise segments from the start of the recordings, repeat those segments to match the length of the recordings, and measure the energy ratio between the noise files and the recordings. Based on the average of our measurements, we set $P_{\text{ref}}$ to be -30 dB.
Setting $P_{\text{ref}}$ as average SI-SNR: Another intuitive way to penalize SI-SNR is to have each underestimated or overestimated speaker cancel out the positive contribution to SI-SNR of a correctly predicted speaker. Therefore, we choose $P_{\text{ref}}$ to be the negative of the average SI-SNR from oracle source counting.
We first describe our implementation details for dataset preparation, training, testing, and model architecture. Next, we provide quantitative comparisons with baselines and demonstrate that our approach achieves state-of-the-art performance in source counting and comparable performance in source separation.
Datasets: We train on WSJ0-2mix and WSJ0-3mix, in addition to WSJ0-4mix and WSJ0-5mix. We take 4-second chunks of all audio files with a 2-second overlap, and pad any chunks at the end that are longer than 2 seconds. We remove all mixtures shorter than 2 s. As each mixture has length equal to its shortest source, mixtures with more speakers are shorter and yield fewer chunks. In our training set, the 2-, 3-, 4-, and 5-speaker subsets each contain 20000 mixtures and yield 24773, 19066, 15986, and 13809 chunks, respectively.
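One possible reading of the chunking rule is sketched below: chunks of 4 seconds are taken every 2 seconds, a trailing partial chunk is zero-padded and kept only if it is longer than 2 seconds, and signals shorter than 2 seconds produce no chunks. The sample rate is left as a parameter because it is not stated here, and the function name is hypothetical.

```python
def chunk_signal(signal, sample_rate, chunk_sec=4.0, hop_sec=2.0,
                 min_tail_sec=2.0):
    """Split a 1-D signal into overlapping, fixed-length chunks."""
    chunk_len = int(chunk_sec * sample_rate)
    hop_len = int(hop_sec * sample_rate)
    min_tail = int(min_tail_sec * sample_rate)
    chunks = []
    if len(signal) < min_tail:
        return chunks  # mixtures shorter than the minimum are dropped
    for start in range(0, len(signal), hop_len):
        piece = list(signal[start:start + chunk_len])
        if len(piece) == chunk_len:
            chunks.append(piece)
            continue
        # Trailing partial chunk: keep it (zero-padded) only if long enough.
        if len(piece) > min_tail:
            chunks.append(piece + [0.0] * (chunk_len - len(piece)))
        break
    return chunks


# Toy usage with a 1 Hz "sample rate" so that lengths equal seconds.
toy_chunks = chunk_signal([1.0, 2.0, 3.0, 4.0, 5.0], sample_rate=1)
```

With this reading, a 5-second signal yields one full chunk and one padded 3-second tail, while a 6-second signal yields two full chunks and drops the 2-second tail, which is already covered by the overlap.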
For each epoch, we use weighted re-sampling with replacement to ensure that chunks for each speaker count are sampled with equal probability. We set the sampling probability of each chunk to be inversely proportional to the number of chunks with the same speaker count. We train our model using Adam with learning rate , decay of every epoch, and batch size of . In total, we train our model for 40 epochs, far fewer than the 100 epochs used in most previous papers [10, 6].
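The re-sampling scheme can be illustrated with the standard library. Each chunk receives a weight equal to the inverse of the number of chunks sharing its speaker count, so every speaker count has the same total probability mass. The function names, the seed, and the toy labels are assumptions made for this sketch.

```python
import random
from collections import Counter


def sampling_weights(speaker_counts):
    """One weight per chunk, inversely proportional to its class frequency."""
    freq = Counter(speaker_counts)
    return [1.0 / freq[c] for c in speaker_counts]


def sample_epoch(speaker_counts, n_samples, seed=0):
    """Draw chunk indices with replacement using the balanced weights."""
    rng = random.Random(seed)
    weights = sampling_weights(speaker_counts)
    return rng.choices(range(len(speaker_counts)), weights=weights, k=n_samples)


# Toy usage: six 2-speaker chunks, three 3-speaker chunks, one 4-speaker chunk.
toy_labels = [2] * 6 + [3] * 3 + [4]
toy_weights = sampling_weights(toy_labels)
toy_indices = sample_epoch(toy_labels, n_samples=5)
```

Applied to the chunk counts reported above, a 5-speaker chunk would be drawn about 24773/13809 ≈ 1.8 times as often as a 2-speaker chunk.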
Testing Procedure: Given an audio signal, we first split the full audio into chunks. We use the count head to predict which decoder head to use for each chunk, and select the decoder head with the most votes. Using the selected decoder head, we compute separated sources for each chunk, and use the overlap regions to reorder the predicted source channels in a streaming fashion. Lastly, we use overlap-and-add to recover the predicted sources for the full audio, and remove the padding from the last chunk.
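Two steps of the testing procedure lend themselves to a short sketch: the vote over per-chunk decoder choices, and the channel reordering across consecutive chunks. The similarity measure used on the overlap region is not specified above, so squared error is used here purely as an assumption; the function names are hypothetical.

```python
from collections import Counter
from itertools import permutations


def vote_decoder(chunk_choices):
    """Return the decoder choice predicted for the largest number of chunks."""
    return Counter(chunk_choices).most_common(1)[0][0]


def reorder_channels(prev_tail, curr_chunk, overlap):
    """Permute the current chunk's channels to best follow the previous chunk.

    prev_tail[c] holds the last `overlap` samples of channel c in the previous
    (already ordered) chunk; curr_chunk[c] is channel c of the current chunk.
    """
    def mismatch(perm):
        # Squared error on the overlap region (an assumed similarity measure).
        return sum(
            (a - b) ** 2
            for channel, source in enumerate(perm)
            for a, b in zip(prev_tail[channel], curr_chunk[source][:overlap])
        )

    best = min(permutations(range(len(curr_chunk))), key=mismatch)
    return [curr_chunk[source] for source in best]


# Toy usage: the current chunk has its two channels swapped.
toy_choice = vote_decoder([3, 2, 3, 4])
toy_reordered = reorder_channels(
    prev_tail=[[1.0, 1.0], [-1.0, -1.0]],
    curr_chunk=[[-1.0, -1.0, 0.0, 0.0], [1.0, 1.0, 2.0, 2.0]],
    overlap=2,
)
```

After reordering, consecutive chunks can be combined with overlap-and-add so that each output channel follows a single speaker.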
Model Architecture: For the encoder, we use a convolution with kernel size 8, stride 4, and 256 feature channels. For the backbone, we use LSTMs with hidden layer size 256. Similar to prior work, we use a multi-stage loss, but do not use a speaker ID loss, for simplicity. During training, we train both the decoder-heads and the count-head with this multi-stage loss, with one set of outputs after each pair of MulCat blocks. See the project page for more details: https://junzhejosephzhu.github.io/Multi-Decoder-DPRNN/
Comparisons with Baselines: Many of the systems for variable-speaker source separation are not publicly available; therefore, we cannot directly evaluate them with our proposed P-SI-SNR. To compare, we use the numbers reported in their papers. Since we do not have the exact SI-SNR in the case of a speaker-count mismatch, we compute an upper bound for these models using their published statistics on oracle SI-SNR and speaker counting accuracy.
For computing this upper bound, we assume that each misclassification of the speaker number is an overestimate by one, and that all the other channels have oracle SI-SNR. This is an upper bound because oracle SNR is always higher than non-oracle SNR, and the ratio of (contribution from correct channels)/penalty is greatest if the error is an overestimate by one channel. For a model with $k$ speakers, oracle SI-SNR $x$, and counting accuracy $a$, the upper bound for P-SI-SNR can be computed as $\text{P-SI-SNR} \le a \times x + (1 - a) \times \frac{k \times x + P_{\text{ref}}}{k + 1}$.
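The upper bound is simple arithmetic and can be checked with a few lines of code. The numbers in the usage example are hypothetical and are not results from any of the compared systems.

```python
def p_si_snr_upper_bound(oracle_si_snr, accuracy, num_speakers, p_ref):
    """Upper bound on P-SI-SNR from oracle SI-SNR and counting accuracy.

    Correctly counted examples score the oracle SI-SNR; miscounted examples
    are assumed to overestimate by exactly one speaker.
    """
    correct_part = accuracy * oracle_si_snr
    miscounted = (num_speakers * oracle_si_snr + p_ref) / (num_speakers + 1)
    return correct_part + (1.0 - accuracy) * miscounted


# Hypothetical example: 2 speakers, 10 dB oracle SI-SNR, 90% counting accuracy.
toy_bound = p_si_snr_upper_bound(10.0, 0.9, 2, p_ref=-30.0)
```

Here the bound is 0.9 × 10 + 0.1 × (20 − 30)/3, which is roughly 8.67 dB; with perfect counting accuracy the bound reduces to the oracle SI-SNR.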
Quantitative Results: We report quantitative comparisons for source counting performance in tab:acc, oracle SNR in tab:oracle-SNR, and our proposed P-SI-SNR in tab:PI-SI-SNR. We note that models marked with * are not directly comparable to our model, as they train a separate model for each speaker number, whereas we use a single model for all speaker counts.
As can be seen from tab:acc, in the source counting task, our model outperforms all other models, even those with fewer possible choices of speaker counts. Our approach remains competitive in source separation when evaluated using oracle SNR; see tab:oracle-SNR. Lastly, in tab:PI-SI-SNR, when $P_{\text{ref}}$ is set to -30 dB, our P-SI-SNR also outperforms all other models in the 2-speaker and 4-speaker cases, and achieves results similar to the best model in the 3-speaker case.
We present a unified approach to single-channel speech separation with an unknown number of speakers. With our proposed multi-decoder architecture and count-head, our model requires a single forward pass at test time on a single network. In our experiments, we demonstrate that our model achieves state-of-the-art performance in source counting and competitive source separation quality. Additionally, we propose a new metric for evaluating source separation with an unknown number of speakers, in which we penalize SI-SNR when the estimated number of sources is incorrect.
Singing-voice separation from monaural recordings using robust principal component analysis. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 57–60.
Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing 25 (10), pp. 1901–1913.