We introduce Wavesplit, an end-to-end speech separation system. From a single recording of mixed speech, the model infers and clusters representations of each speaker and then estimates each source signal conditioned on the inferred representations. The model is trained on the raw waveform to jointly perform the two tasks. Our model infers a set of speaker representations through clustering, which addresses the fundamental permutation problem of speech separation. Moreover, the sequence-wide speaker representations provide a more robust separation of long, challenging sequences, compared to previous approaches. We show that Wavesplit outperforms the previous state-of-the-art on clean mixtures of 2 or 3 speakers (WSJ0-2mix, WSJ0-3mix), as well as in noisy (WHAM!) and reverberated (WHAMR!) conditions. As an additional contribution, we further improve our model by introducing online data augmentation for separation.
We are replicating the model from Wavesplit (https://arxiv.org/abs/2002.08933).
Automatic speech separation aims at isolating individual speaker voices from a recording with overlapping speech. Application-wise, separation is particularly important in meeting or social event recordings. Separation can be used to improve the intelligibility of speech both for human listeners and speech recognition systems. Although prior work in source separation spans several decades (Vincent et al., 2018), it is still an active area of research (Hershey et al., 2016; Luo & Mesgarani, 2019; Luo et al., 2019). In particular, recent years have seen the emergence of supervised deep neural networks for speaker separation. Two main families of algorithms have been proposed: frequency masking approaches (Hershey et al., 2016; Le Roux et al., 2019a) and time-domain approaches (Luo & Mesgarani, 2019; Luo et al., 2019). Frequency masking approaches output a time-frequency mask for each identified speaker which can be applied to an input spectrogram to reconstruct each speaker's recording. Time-domain approaches directly predict the output audio for each speaker. Our model, Wavesplit, is a time-domain approach, which bypasses the difficulty of phase reconstruction required by time-frequency approaches (Le Roux et al., 2019a).
Our work targets open speaker separation, i.e. the test speakers have not been encountered at training time, but draws inspiration from the closed speaker setting in order to characterize desirable properties for latent speaker representations. Our work considers a joint training procedure for speaker identification and speech separation, which differs from prior research (Wang et al., 2018a). Our training objective encourages identifying instantaneous speaker representations such that (i) these representations can be grouped into individual speaker clusters and (ii) the centroids of such clusters provide a long-term speaker representation for the reconstruction of individual speaker signals.
As an additional contribution orthogonal to our model, we highlight the benefits of the dynamic creation of audio mixtures online during training. Rather than training a model on a pre-computed, finite set of mixtures, we constantly create new training examples by mixing short windows of clean signal uniformly sampled from the training set, reweighted by randomly sampled gains. This strategy brings systematic and substantial improvements in the performance of the final system, and can easily be derived from any speech dataset.
Wavesplit improves the state-of-the-art over standard separation benchmarks, both for clean (WSJ0-2mix, WSJ0-3mix) and noisy datasets (WHAM! and WHAMR!). We also highlight the accuracy of our model when the number of speakers varies and when the relative loudness between speakers is variable. Augmentation by dynamic mixing further improves these results.
The contributions of this paper span different axes: (i) we leverage training speaker labels but do not require any information about the test speakers besides the mixture recording, (ii) we aggregate information about speakers over the whole sequence, which is advantageous for long recordings, (iii) we rely on clustering for inferring speaker identity, which naturally outputs sets, i.e. order-agnostic predictions, (iv) we report state-of-the-art results on the most common speech separation benchmarks, in various conditions, (v) we analyze the empirical advantages and drawbacks of our method.
Single-channel speech separation takes a single microphone recording in which multiple speakers overlap and predicts the isolated audio of each speaker. This classical signal processing task (Roweis, 2001; Yilmaz & Rickard, 2004; Vincent et al., 2018) has witnessed fast progress in recent years thanks to supervised neural networks (Wang & Chen, 2018). Neural models can mainly be grouped into two families: spectrogram-based and time-domain approaches.
Spectrogram-based approaches rely on a time-frequency representation obtained through the short-term Fourier Transform (Williamson, 2012). Considering the input mixture spectrogram, these approaches aim to assign each time-frequency bin (TFB) to one of the sources. For that purpose, the model produces one mask per source that identifies the source with maximal energy for each TFB. Source spectrograms are estimated by multiplying the input spectrogram with the corresponding masks. Source signals are then estimated by phase reconstruction (Griffin & Jae Lim, 1984; Wang et al., 2018c). Variants of this strategy include models with soft masking, i.e. assigning TFBs to multiple speakers with different weights (Araki et al., 2004). Hershey et al. (2016) devise a neural model capable of predicting multiple masks. They propose to learn a latent representation for each TFB such that the distance between TFBs assigned to the same source is lower than the distance between TFBs from different sources. At test time, clustering these representations allows identifying which TFBs belong to each source. Alternatively, Yu et al. (2017), Kolbaek et al. (2017) and Xu et al. (2018) avoid TFB clustering by predicting multiple masks directly. The predicted masks are compared to the target masks by searching over the permutations of the source ordering. This is called Permutation Invariant Training (PIT). PIT acknowledges that the order of predictions and labels in speech separation is irrelevant, i.e. separation is a set prediction problem.
Time domain approaches avoid the degradation introduced by phase reconstruction and the latency introduced by high resolution spectrograms. Time-domain models directly predict a set of audio outputs which are compared to the audio sources by searching over the permutations of the source ordering, i.e. PIT applied to time domain signal matching. Multiple architectures have been proposed for this scheme, including convolutional (Luo & Mesgarani, 2019; Zhang et al., 2020) and recurrent networks (Luo et al., 2019).
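As a reference point for later comparisons, time-domain PIT can be sketched in a few lines: the loss is evaluated under every permutation of the predicted channels and the best one is kept. This is a minimal numpy sketch using a mean-squared-error metric for clarity (the literature typically uses negative SDR); Wavesplit itself resolves permutations differently, at the speaker-representation level.

```python
import itertools
import numpy as np

def pit_loss(est, ref):
    """Permutation Invariant Training: score the N predicted channels
    against the N reference sources under every channel permutation and
    keep the best one (O(N!) search, acceptable for small N)."""
    n = ref.shape[0]
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(n)):
        loss = np.mean((est[list(perm)] - ref) ** 2)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm
```

Because the search is factorial in the number of sources, PIT is only practical for the small source counts typical of separation benchmarks.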
Leveraging speaker representation for speech separation is another axis of research relevant to our work. Short recordings of speech are sufficient to extract discriminant speaker characteristics (Wan et al., 2018; Zeghidour et al., 2016). This property has been used for speech separation in the past. For instance, Wang et al. (2018a) extract the characteristics of a speaker from a clean enrollment sequence in order to subsequently isolate this speaker in a mixture. This method however does not apply to open-set speaker separation, where no information about the test speakers is available besides the input mixture.
Our strategy relies on a convolutional architecture composed of a speaker stack and a separation stack, see Figure 1. The speaker stack takes the input mixture and produces a set of vectors representing the speakers present in the mixture. The separation stack consumes both the input mixture and the set of speaker representations from the speaker stack. It produces a multi-channel audio output with separated speech from each speaker in the set.
The separation stack is classical and resembles previous architectures conditioned on pre-trained speaker vectors (Wang et al., 2018a), or trained with PIT (Luo & Mesgarani, 2019). The speaker stack is novel and constitutes the heart of our contribution. This stack is trained jointly with the separation stack. At training time, it relies on speaker labels to learn a vector representation per speaker such that the inter-speaker distances are large, while the intra-speaker distances are small. At the same time, this representation is also learned to allow the separation stack to reconstruct the clean signals. At test time, the speaker stack relies on clustering to identify a centroid representation per speaker.
Our strategy contrasts with prior work. In contrast with VoiceFilter (Wang et al., 2018a), the representation of all speakers is directly inferred from the mixture and we do not need an enrollment sequence for test speakers. With joint training, the speaker representation is not solely optimized for identification but also extracts information necessary for isolated speech reconstruction. In contrast with PIT (Yu et al., 2017), we condition decoding with a speaker representation valid for the whole sequence. This long-term representation yields excellent performance on long sequences, especially when the relative energy between speakers is varying, see Section 4.3. Still in contrast with PIT, we resolve the permutation ambiguity during training at the level of the speaker representation, i.e. the separation stack is conditioned with speaker vectors ordered consistently with the labels. This does not force the separation stack to choose a latent ordering and allows training this stack with different permutations of the same labels.
We consider a mixture of $N$ sources. Each single-channel source waveform is represented by a continuous vector $y_i \in \mathbb{R}^T$, with $T$ the length of the sequence (in samples). The task of source separation is to reconstruct each $y_i$ from an input $x \in \mathbb{R}^T$, which sums all the sources, $x = \sum_{i=1}^{N} y_i$.
A separation model predicts an estimate $\hat{y}_i = f_i(x)$ for each channel $i \in \{1, \dots, N\}$. These estimates should match the reference recordings up to a permutation, since the channel ordering is arbitrary. The quality of the reconstruction is usually computed as

$$\mathcal{L}(\hat{y}, y) = \min_{\sigma \in S_N} \sum_{i=1}^{N} \ell(\hat{y}_{\sigma(i)}, y_i), \qquad (1)$$

where $\ell$ denotes a single-channel reconstruction quality metric and $S_N$ denotes the space of permutations over $\{1, \dots, N\}$. The speaker separation literature typically relies on Signal-to-Distortion Ratio (SDR) to assess reconstruction quality,

$$\mathrm{SDR}(\hat{y}_i, y_i) = 10 \log_{10} \frac{\|y_i\|^2}{\|y_i - \hat{y}_i\|^2},$$
i.e. the opposite of the log squared error normalized by the energy of the reference signal. Scale-invariant SDR (SI-SDR) (Le Roux et al., 2019b) searches over rescalings of the signal to not penalize models for a wrong scale of predictions. Variants searching over richer signal transforms have also been proposed (Vincent et al., 2006).
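Both metrics can be written compactly; the numpy sketch below follows the definitions above, with SI-SDR rescaling the reference by the optimal gain before measuring distortion.

```python
import numpy as np

def sdr(est, ref):
    """Signal-to-Distortion Ratio in dB: reference energy over squared error."""
    return 10 * np.log10(np.sum(ref ** 2) / np.sum((ref - est) ** 2))

def si_sdr(est, ref):
    """Scale-invariant SDR (Le Roux et al., 2019b): project onto the
    optimally rescaled reference, so predictions are not penalized for
    being off by a global scale."""
    alpha = np.dot(est, ref) / np.dot(ref, ref)  # optimal rescaling
    target = alpha * ref
    return 10 * np.log10(np.sum(target ** 2) / np.sum((est - target) ** 2))
```

Rescaling the estimate leaves SI-SDR unchanged but shifts SDR, which is the behavior the scale-invariant variant was designed for.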
Wavesplit is a residual convolutional network with two sub-networks or stacks. The first stack transforms the input mixture into a representation of each speaker, while the second stack transforms the input mixture into multiple isolated recordings conditioned on the speaker representation.
The speaker stack produces speaker representations at each time step and then performs an aggregation over the whole sequence. Precisely, the speaker stack first maps the input $x$ into $N$ same-length sequences of latent vectors of dimension $D$, i.e.

$$h_1, \dots, h_N = f_{\mathrm{spk}}(x), \qquad h_i \in \mathbb{R}^{T \times D},$$

where $h_i^t \in \mathbb{R}^D$ denotes the $i$-th speaker vector at time step $t$. $N$ is chosen to upper bound the maximum number of simultaneous speakers targeted by the system, while $D$ is a hyper-parameter selected by cross-validation. Intuitively, $f_{\mathrm{spk}}$ produces a latent representation of each speaker at every time step. When fewer than $N$ speakers are present at a time-step, $f_{\mathrm{spk}}$ produces a latent representation of silence for the remaining outputs. It is important to note that $f_{\mathrm{spk}}$ is not required to order speakers consistently across a sequence. E.g. a given speaker Bob could be represented by the first vector at time $t$ and by the second vector at a different time $t'$. At the end of the sequence, the aggregation step groups all produced vectors by speaker and outputs $N$ summary vectors $c_1, \dots, c_N \in \mathbb{R}^D$ for the whole sequence. K-means performs this aggregation at inference (Linde et al., 1980) and returns the centroids of the identified clusters.
During training, the aggregation is derived from a speaker training objective described in Section 3.3. In the following we refer to the local vectors $h_i^t$ as the speaker vectors, and to the vectors $c_i$ as the speaker centroids.
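The inference-time aggregation can be sketched as a small k-means over all per-timestep speaker vectors. This is a numpy sketch with a deterministic farthest-point initialization (the paper does not specify the initialization scheme):

```python
import numpy as np

def aggregate_speakers(h, n_spk, n_iter=20):
    """Cluster per-timestep speaker vectors h, shape (T * N, D), into
    n_spk centroids with Lloyd's algorithm (k-means)."""
    # farthest-point initialization: deterministic, spreads the seeds out
    centroids = [h[0]]
    while len(centroids) < n_spk:
        d = np.min([((h - c) ** 2).sum(1) for c in centroids], axis=0)
        centroids.append(h[d.argmax()])
    centroids = np.stack(centroids)
    for _ in range(n_iter):
        # assign each vector to its nearest centroid, then recompute means
        d = ((h[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        for k in range(n_spk):
            if np.any(assign == k):
                centroids[k] = h[assign == k].mean(0)
    return centroids
```

The returned centroids play the role of the summary vectors $c_1, \dots, c_N$ that condition the separation stack.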
The separation stack takes the centroid vectors $c_1, \dots, c_N$ and the input signal $x$ and produces an $N$-channel output $\hat{y} = f_{\mathrm{sep}}(x, c_1, \dots, c_N)$.
We rely on a residual convolutional architecture similar to Luo & Mesgarani (2019) for both stacks. Our residual block (He et al., 2016) for the speaker stack composes a dilated convolution (Yu & Koltun, 2016), a non-linearity and layer normalization (Ba et al., 2016),

$$z^{(k+1)} = z^{(k)} + \mathrm{LN}\left(\sigma\left(\mathrm{Conv}_{d_k}(z^{(k)})\right)\right), \qquad (2)$$

where $z^{(k)}$ denotes the activations at layer $k$, $\sigma$ the non-linearity and $d_k$ the dilation factor at layer $k$.
We experimented with different non-linearities including rectified linear units (Glorot et al., 2011), parametric rectified linear units (He et al., 2015), gated linear units and gated tangent units (Dauphin et al., 2017). Parametric rectified linear units were consistently better on our validation experiments and were selected for our subsequent experiments. The last layer of the speaker stack applies Euclidean normalization to the speaker vectors.
The conditioning of the separation stack by the speaker centroids relies on FiLM, Feature-wise Linear Modulation (Perez et al., 2018). At each layer, we apply a linear transformation to the speaker centroids, and add FiLM conditioning on a block similar to Eq. (2),

$$z^{(k+1)} = z^{(k)} + \mathrm{LN}\left(a_k(s) \odot \sigma\left(\mathrm{Conv}_{d_k}(z^{(k)})\right) + b_k(s)\right),$$

where $a_k$ and $b_k$ are different linear projections of $s$, the concatenation of the speaker centroids. We learn distinct parameters for each layer for all parametric functions, i.e. $\mathrm{Conv}_{d_k}$, $\mathrm{LN}$, $a_k$ and $b_k$. Section 4.6 shows the advantage of FiLM conditioning on generalization and convergence compared to the bias-only conditioning ($a_k(s) = \mathbf{1}$) used commonly (van den Oord et al., 2016).
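A minimal sketch of the FiLM step in numpy; `Wa` and `Wb` are hypothetical weight matrices standing in for the learned linear projections:

```python
import numpy as np

def film(h, s, Wa, Wb):
    """FiLM conditioning: modulate activations h, shape (T, D), with a
    feature-wise scale and shift computed from the concatenated speaker
    centroids s, shape (K,). Wa, Wb: hypothetical (K, D) projections."""
    a = s @ Wa  # (D,) multiplicative term
    b = s @ Wb  # (D,) additive term
    return a * h + b
```

Fixing the scale `a` to all-ones recovers the bias-only conditioning the paper compares against.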
Model training follows a dual objective: (i) it learns speaker vectors which can be clustered by identity into well separated clusters; (ii) it optimizes the reconstruction of the separated signals from aggregated speaker vectors.
Wavesplit assumes the training data is annotated with speaker identities from a finite set of training speakers but does not require any speaker annotation at test time. This information is provided by most speaker separation datasets, including Wall Street Journal variants (Hershey et al., 2016; Wichern et al., 2019; Maciejewski et al., 2019), meeting recordings (Mccowan et al., 2005) and cocktail party recordings (Barker et al., 2018). We experiment with different loss functions to encourage the speaker stack outputs to have small intra-speaker and large inter-speaker distances.
The speaker vector objective leverages training speaker identities. Given an input sequence with target signals $y_1, \dots, y_N$ and corresponding speakers $s_1, \dots, s_N$, we consider different loss functions which encourage the correct identification of the speakers at each time step $t$, i.e.

$$\mathcal{L}_{\mathrm{spk}}^t = \min_{\sigma \in S_N} \sum_{i=1}^{N} \ell_{\mathrm{id}}\left(h_{\sigma(i)}^t, s_i\right),$$

where $\ell_{\mathrm{id}}$ defines a loss function between a vector of $\mathbb{R}^D$ and a speaker identity. The minimum over permutations expresses that each identity should be identified at each time step, in any arbitrary order. The best permutation $\sigma^t$ at each time-step is used to re-order the speaker vectors in an order consistent with the training labels. This allows averaging the speaker vectors originating from the same speaker at training time. This makes optimization simpler compared to work relying on k-means during training (Hershey et al., 2017). This permutation per time-step differs from PIT (Xu et al., 2018): we do not force the model to pick a single ordering over the output channels and we train the separation stack with different permutations of the same labels.
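The per-timestep reordering can be sketched as follows (numpy; squared Euclidean distance between speaker vectors and label embeddings, searched over all $N!$ permutations, with $N$ small):

```python
import itertools
import numpy as np

def align_to_labels(h_t, emb):
    """Reorder the N speaker vectors at one time step (h_t: (N, D)) into
    the permutation best matching the N label embeddings (emb: (N, D))."""
    n = h_t.shape[0]
    best_perm, best_cost = None, np.inf
    for perm in itertools.permutations(range(n)):
        cost = sum(((h_t[perm[i]] - emb[i]) ** 2).sum() for i in range(n))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return h_t[list(best_perm)]
```

After alignment, the vectors assigned to the same training speaker can simply be averaged over time.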
For $\ell_{\mathrm{id}}$, we explore three alternatives. In all cases, we maintain an embedding table $E \in \mathbb{R}^{|S| \times D}$ over the training speakers $S$. First, we consider a distance objective. This loss favors small distances between a speaker vector and the corresponding embedding. It also enforces the distance between speaker vectors at the same time-step to be larger than a margin of 1, i.e.

$$\ell_{\mathrm{dist}}\left(h_i^t, s_i\right) = \left\|h_i^t - E_{s_i}\right\|^2 + \sum_{j \neq i} \max\left(0, 1 - \left\|h_i^t - h_j^t\right\|^2\right). \qquad (3)$$
Second, we consider a local classifier objective. This classifier discriminates among the speakers present in the sequence. It relies on the log softmax function over the distance between speaker vectors and speaker embeddings, i.e.

$$\ell_{\mathrm{local}}\left(h_i^t, s_i\right) = -\log \frac{\exp\left(-d\left(h_i^t, E_{s_i}\right)\right)}{\sum_{j=1}^{N} \exp\left(-d\left(h_i^t, E_{s_j}\right)\right)}, \qquad (4)$$
where $d(u, v) = \alpha \|u - v\|^2 + \beta$ is the squared Euclidean distance rescaled with learned scalar parameters $\alpha, \beta$. Finally, the global classifier objective is similar, except that the partition function is computed over all speakers in the training set, i.e.

$$\ell_{\mathrm{global}}\left(h_i^t, s_i\right) = -\log \frac{\exp\left(-d\left(h_i^t, E_{s_i}\right)\right)}{\sum_{s \in S} \exp\left(-d\left(h_i^t, E_{s}\right)\right)}. \qquad (5)$$
The reconstruction objective aims at optimizing the separation quality, as defined in Eq. (1),

$$\mathcal{L}_{\mathrm{rec}} = \sum_{i=1}^{N} \ell_{\mathrm{rec}}\left(\hat{y}_i, y_i\right).$$

Contrasting with Permutation Invariant Training, PIT (Yu et al., 2017), this expression does not require searching over the space of permutations since the order of the centroid vectors is consistent with the order of the labels, as explained above.
For $\ell_{\mathrm{rec}}$, we use negative SDR with a clipping threshold $\tau$ to limit the influence of the high quality training predictions,

$$\ell_{\mathrm{rec}}\left(\hat{y}_i, y_i\right) = -\min\left(\mathrm{SDR}\left(\hat{y}_i, y_i\right), \tau\right).$$
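A sketch of the clipped reconstruction loss in numpy; the small epsilon is ours, added only to keep the logarithm finite when the prediction is exact:

```python
import numpy as np

def clipped_neg_sdr(est, ref, tau=30.0):
    """Negative SDR with the SDR clipped at tau dB: once a training
    prediction is good enough, it stops dominating the objective."""
    val = 10 * np.log10(np.sum(ref ** 2) / (np.sum((ref - est) ** 2) + 1e-8))
    return -min(val, tau)
```

The default `tau` here is illustrative; the paper selects the threshold per dataset by validation.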
In addition to these two losses, we consider two forms of regularization to reduce the mismatch between training and testing conditions. First, to improve generalization to new speakers, we favor well separated embeddings for the speakers of the training set. We achieve this by adding the entropy regularization from Sablayrolles et al. (2019), which penalizes small distances between the embeddings of distinct training speakers.
Second, to make the separation stack robust to noisy speaker vectors, we add Gaussian noise on these vectors during training. Section 4.6 reports an ablation that demonstrates the benefits of regularization.
Model training optimizes the weighted sum of $\mathcal{L}_{\mathrm{spk}}$ and $\mathcal{L}_{\mathrm{rec}}$ with the Adam algorithm (Kingma & Ba, 2015). We train on mini-batches of fixed-size windows. We found that the choice of window size is not critical for Wavesplit, even when being as small as 750ms, unlike most PIT-based approaches (Luo & Mesgarani, 2019; Luo et al., 2019) that require longer segments (~4s). The training set is shuffled at each epoch and a window starting point is uniformly sampled each time a sequence is visited. This sampling means that each training sequence is given the same importance regardless of its length. This strategy is consistent with the averaging of per-sequence SDR commonly used for evaluation. To prevent a potential over-fitting to the ordering of the training labels, we replicated the training data for all permutations of the source signals. Replication, windowing and shuffling are not applied to the validation or test set.
Source separation aims at isolating individual voices in a signal summing different sources. Source separation benchmarks such as WSJ0-2mix (Hershey et al., 2016) create a standard split between train, valid and test sequences and then generate specific input mixtures by sampling pairs of sequences along with corresponding gains to apply prior to summation. As an orthogonal contribution to the Wavesplit model, we evaluate the impact of dynamically creating mixtures during training. Our augmentation algorithm receives a batch of target windows and creates groups of sources at random by shuffling the data. We then sample gains for each of the mixtures and sum the sequences to obtain each artificial input. In other words, we replicate the creation recipe of the dataset indefinitely. Despite its simplicity, our experiments report a strong improvement from this method. A similar mixing augmentation scheme has been used in music source separation (Uhlich et al., 2017; Défossez et al., 2019). Our experiments also report results without augmentation to isolate the impact of Wavesplit alone.
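The augmentation loop can be sketched in a few lines (numpy; the gain range shown is a hypothetical choice, since in practice the paper replicates each dataset's own mixing recipe):

```python
import numpy as np

def dynamic_mix(windows, n_spk=2, gain_db=(-5.0, 5.0), rng=None):
    """Create one fresh training example: pick n_spk clean windows at
    random, rescale each by a random gain, and sum them.
    windows: (M, T) array of clean source windows."""
    rng = rng or np.random.RandomState()
    idx = rng.choice(len(windows), size=n_spk, replace=False)
    gains = 10.0 ** (rng.uniform(*gain_db, size=n_spk) / 20.0)  # dB -> linear
    targets = windows[idx] * gains[:, None]
    return targets.sum(axis=0), targets
```

Since fresh mixtures are sampled at every step, the model effectively never sees the same training example twice.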
Most of our experiments are performed on the speaker separation dataset (http://www.merl.com/demos/deep-clustering) built from the LDC WSJ-0 dataset (Garofolo et al., 1993) as introduced in (Hershey et al., 2016). We rely on the 8kHz version of the dataset, with 2 or 3 concurrent speakers. This setting has become the de-facto benchmark for open-speaker source separation and we compare our results to alternative methods. Table 1 reports dataset statistics. We also introduce a version of the dataset in which the number of active speakers varies from one to three within a sequence as a more realistic setting. Additionally, we perform experiments on the noised versions of the data. We rely on WHAM! with urban noise (Wichern et al., 2019) and WHAMR! with noise and reverberation (Maciejewski et al., 2019). These datasets are derived from WSJ0-2mix and have identical statistics. We further compare different variations of Wavesplit varying the optimized loss functions and architectural alternatives. Finally, we also conduct an error analysis examining a small fraction of sequences which have a strong negative impact on overall performance.
[Table 1: dataset statistics; mean sequence lengths of 5.4 / 5.5 / 5.7 sec and 4.9 / 4.9 / 5.2 sec across splits.]
Our evaluation relies on signal-to-distortion ratio (SDR) and scale-invariant signal-to-distortion ratio (SI-SDR) (Vincent et al., 2006; Le Roux et al., 2019b), as introduced in Section 3.1. SDR is measured using the standard MIR eval library (http://craffel.github.io/mir_eval/). Consistently with the literature, we measure both metrics in terms of improvement, i.e. the metric obtained using the system output minus the metric obtained by using the input mixture as the prediction.
The choice of architecture is based on preliminary validation experiments on the 2-speaker separation task of WSJ0-2mix (Hershey et al., 2016), and then used through all our experiments. We selected the same latent dimension $D$ for both stacks. For the dilated convolutions, we selected a kernel of size 3 without striding, therefore all activations have the same temporal resolution as the input signal and no upsampling is necessary to produce the output signal. The dilation factor varies with depth: in the speaker stack, the dilation grows exponentially with depth, while the separation stack borrows its dilation pattern from Wavenet (van den Oord et al., 2016). Every 10 layers the dilation is reset to 1, allowing multiple fine-to-coarse-to-fine interactions across the time axis.
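Because the kernel size is 3 with no striding, the receptive field of such a stack is easy to compute. The sketch below assumes a hypothetical dilation pattern doubling from 1 to 512 and resetting every 10 layers, consistent with the description above (the exact layer counts were lost from this copy of the text):

```python
def receptive_field(kernel, dilations):
    """Receptive field (in samples) of stacked dilated convolutions:
    each layer adds (kernel - 1) * dilation samples of context."""
    return 1 + sum((kernel - 1) * d for d in dilations)

# hypothetical 40-layer pattern: dilations 1, 2, 4, ..., 512, repeated 4 times
dilations = [2 ** (i % 10) for i in range(40)]
```

With this pattern, a 40-layer stack would cover roughly a second of context at 8kHz.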
After architecture selection, we experimented with a small grid of learning rates and speaker loss weights, selected the best-performing values, and used the same in our other experiments. For regularization, we cross-validated the weight of the distance regularization and the standard deviation of the Gaussian noise added to the speaker vectors. The clipping threshold $\tau$ on the negative SDR loss was picked per dataset: the best value differed between clean and noisy conditions. Among the speaker-loss alternatives, we found the global classifier to be the most effective, see Section 4.6.
WSJ0-2mix/3mix is the de facto benchmark for separation. Table 2 reports the results for two simultaneous speakers, while Table 4 reports the results for three simultaneous speakers. In both cases, Wavesplit outperforms alternatives and dynamic mixing further increases this advantage: for instance, Wavesplit outperforms the recent dual-path RNN (Luo et al., 2019), and dynamic augmentation widens the gap (Table 2). We also made available recordings processed by our system on an anonymous webpage (https://soundcloud.com/wavesplitdemo/sets), as well as in supplementary material.
We examine further the accuracy on WSJ0-2mix. Table 3 reports our analysis on the error distribution. Our results show that the test set exhibits a larger fraction of sentences with low SDR than the validation set. The WSJ0-2mix validation set contains the same speakers as the training data; we therefore examine whether the lower test performance is due to the representation of the test speakers. We rely on the training speaker classifier from Eq. (5) and collect the posterior distribution over training speakers for each speaker vector, averaged per cluster. When the same training identity collects more than 10% of the probability mass in both clusters, we label the sequence as "confusing speakers", since both clusters contain vectors confused with the same training identity. We observe that low SDR correlates with situations where both speakers in the test recordings are mapped to the same training identity. The training set of WSJ0-2mix has only 101 speakers. We suspect that training on datasets with a wider range of speakers would alleviate this problem; this will be studied in future work.
Additionally, we evaluate oracle SDR. This metric reports SDR when the predictions at each time step are permuted to best match the labels. The test examples with poor performance reach high oracle SDR. This means that poorly performing predictions are mostly the result of channel permutation throughout the sequence, i.e. the model reconstructs the signal properly but does not have a consistent speaker assignment between channels. This channel permutation is problematic for speaker separation applications and it will be particularly beneficial to focus on low SDR quantiles specifically in future research.
The examples in WSJ0-2mix blend two speakers without changes in the dominant speaker, i.e. the same speaker stays the loudest throughout the whole recording. To evaluate our model in more challenging conditions, we create longer sequences by concatenating WSJ0-2mix examples for the same pair of speakers. We pick the concatenated recordings such that the dominating speaker alternates between the two identities. In this setting, the speaker with the highest energy varies along time. This experiment stems from research prior to PIT (Weng et al., 2015) which suggested that average energy was a good rule for channel assignment, applicable when one channel consistently dominates the other. We suspect that PIT may implicitly rely on energy to address the channel assignment ambiguity. In this case, the change in dominating speaker may degrade accuracy for such a model and even lead to cross-channel contamination, i.e. some segments of speech are swapped across speaker channels in the prediction. Wavesplit is agnostic to loudness dependent ordering since the separation stack is trained from all permutations of speaker identities at training time.
Our comparison relies on models trained on the regular WSJ0-2mix training set and only changes test conditions. We vary the sequence length from 1 to 10 times the original length. For PIT models, we retrained Conv-TasNet with an open source implementation (https://git.io/JvOT6), and reproduced the 15.6 SDR improvement of (Luo & Mesgarani, 2019). For Dual-Path RNN we used a pre-trained model (https://git.io/JvOTV) which gives 19.1 SDR (against 19.0 in (Luo et al., 2019)). These models are among the best prior art on this dataset, yet they degrade significantly faster with longer sequences and alternating dominant speakers compared to our model, as shown in Table 5. This is particularly remarkable as Wavesplit is trained on only 1s long windows, unlike ConvTasNet and Dual-Path RNN that are both trained on 4s long windows.
Table 2: results on WSJ0-2mix.

| Model | SI-SDRi (dB) | SDRi (dB) |
|---|---|---|
| Deep Clustering (Isik et al., 2016) | 10.8 | – |
| uPIT-BLSTM-ST (Kolbaek et al., 2017) | – | 10.0 |
| Deep Attractor Net. (Chen et al., 2017) | 10.5 | – |
| Anchored Deep Attr. (Luo et al., 2018) | 10.4 | 10.8 |
| Grid LSTM PIT (Xu et al., 2018) | – | 10.2 |
| ConvLSTM-GAT (Li et al., 2018) | – | 11.0 |
| Chimera++ (Wang et al., 2018b) | 11.5 | 12.0 |
| WA-MISI-5 (Wang et al., 2018c) | 12.6 | 13.1 |
| BLSTM-TasNet (Luo & Mesgarani, 2018) | 13.2 | 13.6 |
| Conv-TasNet (Luo & Mesgarani, 2019) | 15.3 | 15.6 |
| Conv-TasNet+MBT (Lam et al., 2019) | 15.5 | 15.9 |
| DeepCASA (Liu & Wang, 2019) | 17.7 | 18.0 |
| FurcaNeXt (Zhang et al., 2020) | – | 18.4 |
| DualPathRNN (Luo et al., 2019) | 18.8 | 19.0 |
| Wavesplit + Dynamic mixing | 20.4 | 20.6 |
[Table 3: fraction of "confusing speakers" sequences: 69.2% vs 4.8%, and 56.0% vs 12.5%.]
Table 4: results on WSJ0-3mix.

| Model | SI-SDRi (dB) | SDRi (dB) |
|---|---|---|
| Deep Clustering (Isik et al., 2016) | 7.1 | – |
| uPIT-BLSTM-ST (Kolbaek et al., 2017) | – | 7.7 |
| Deep Attractor Net. (Chen et al., 2017) | 8.6 | 8.9 |
| Anchored Deep Attr. (Luo et al., 2018) | 9.1 | 9.4 |
| Conv-TasNet (Luo & Mesgarani, 2019) | 12.7 | 13.1 |
| Wavesplit + Dynamic mixing | 16.5 | 16.8 |
Results on WHAM!:

| Model | SI-SDRi (dB) | SDRi (dB) |
|---|---|---|
| Chimera++ (Wichern et al., 2019) | 9.9 | – |
| Conv-TasNet (Pariente et al., 2019) | 12.7 | – |
| Learnable fbank (Pariente et al., 2019) | 12.9 | – |
| Wavesplit + Dynamic mixing | 15.3 | 15.8 |
The evaluation protocol in the previous section considers a fixed given number of active speakers. The model is trained on recordings with the same number of speakers throughout the whole recording and tested in the same conditions. In this section, we train and test our model over sequences with at most 1, 2 or 3 speakers active at a time. We create this dataset, WSJ0-123mix, by padding WSJ0-3mix with silence. Each sequence in WSJ0-3mix is padded three times with the padding patterns shown in Figure 2.
Compared to previous settings, target windows with silence or partial silence induce large variations in the ground-truth signal norm and yield numerical instabilities when learning with the SDR loss. We therefore replace the normalization by the reference energy with a constant energy corresponding to the average norm of the reference training windows.
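A sketch of the modified loss in numpy; `const_energy` stands in for the average norm of the reference training windows mentioned above, and the small epsilon is ours, keeping the logarithm finite:

```python
import numpy as np

def stable_neg_sdr(est, ref, const_energy):
    """Negative SDR variant for possibly-silent targets: the per-window
    reference energy in the numerator is replaced by a constant, so an
    all-zero reference no longer drives the loss to infinity."""
    return -10 * np.log10(const_energy / (np.sum((ref - est) ** 2) + 1e-8))
```

With the standard normalization, a silent reference makes the numerator zero and the loss unbounded; the constant removes that failure mode while preserving the error term.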
Table 8 reports a breakdown of our results. Results on sequences with no speech overlap are excellent, highlighting that the model correctly identifies speaker turns and isolates speakers. This suggests that our model could address diarization (Anguera et al., 2012). Conversely, the sequences with three continuously active speakers show lower accuracy than our model trained on WSJ0-3mix. We suspect that our loss clipping strategy is sub-optimal on this dataset. To address this issue, we plan to explore losses and training schemes that emphasize low performing quantiles during training in future work (Kibzun & Kurbakovskiy, 1991).
WSJ0-2mix was recorded in clean and controlled conditions, and noisy variants have been introduced to represent more challenging use cases. WHAM! (Wichern et al., 2019) adds noise recorded in public areas to the mixtures. As the model should only predict clean signals, it cannot exploit the fact that predicted channels should sum to the input signal. WHAMR! (Maciejewski et al., 2019) adds the same noise, but also reverberates the clean signals. This makes the task even harder, as the model has to predict anechoic clean signals (i.e. without reverberation), and therefore solve jointly the tasks of denoising, dereverberation and source separation. Tables 6 and 7 show that our models outperform previous work by a substantial margin, even without dynamic mixing. We also adapted dynamic mixing for these datasets. For WHAM!, we also sampled a gain for the noise, and combined it with reweighted clean signals to generate noisy mixtures on the fly. We similarly remixed WHAMR!, except that we reweighted reverberated signals with noise, similarly to the original training set. On both datasets, this leads to an even larger improvement over previous work. For instance, the results on WHAMR! are comparable to the reconstruction accuracy from clean inputs (WSJ0-2mix) with models prior to Wang et al. (2018b).
[Figure 2: the three padding patterns applied to speakers A, B and C, yielding (a) 1, (b) 2, or (c) 3 active speakers at a time.]
Table 8: results on WSJ0-123mix per number of active speakers.

| Subset | SI-SDR (dB) | SDR (dB) |
|---|---|---|
| 1 active speaker | 18.6 | 19.2 |
| 2 active speakers | 15.7 | 16.3 |
| 3 active speakers | 13.3 | 13.7 |
Table 9: ablation results.

| Variant | SDRi (dB) |
|---|---|
| w/ distance loss | 20.8 |
| w/o distance reg. | 21.0 |
| w/o Gaussian noise | 20.9 |
Table 9 compares the base result obtained with the global classifier loss, Eq. (5), with the distance loss, Eq. (3). Although this type of loss is common in distance learning for clustering (Wang et al., 2018b), the global classifier reports better results. We also ran experiments with the local classifier loss, Eq. (4), which was found to yield very slow training and worse generalization.
Table 9 also reports the advantage of multiplicative FiLM conditioning compared to more classical additive conditioning (van den Oord et al., 2016). Not only are the reported SDRs better, but FiLM also allows using a higher learning rate and hence enables faster training. Table 9 also shows the benefit of regularization of the speaker representation.
We introduce Wavesplit, a neural network for speech separation. Our model decomposes the source separation problem into two jointly trained tasks, i.e. our model extracts a representation for each speaker present in the recording and performs separation conditioned on the inferred representation. Inference therefore relies on a consistent representation of each speaker throughout the sequence. This allows the model to explicitly keep track of speaker/channel assignments, limiting channel swapping, i.e. inconsistencies in such assignments.
Our approach is advantageous compared to prior work and redefines the state-of-the-art on standard benchmarks, both in clean and noisy conditions. We also observe the advantage of our model when the number of speakers varies throughout the sequence and when the relative loudness of speakers varies.
We want to extend this work by introducing an autoregressive or flow-based decoder to further enhance speech outputs, especially for noisy inputs for which the separation problem is particularly under-determined. We also see potential practical benefits in increasing the influence of low accuracy examples during training, e.g. by directly optimizing loss quantiles.
The authors are grateful to Adam Roberts, Chenjie Gu and Raphael Marinier for their advice on model implementation. They are also grateful to Jonathan Le Roux, John Hershey, Richard F. Lyon and Norman Casagrande for their help navigating speech separation prior work.
Dauphin, Y. N., Fan, A., Auli, M., and Grangier, D. Language modeling with gated convolutional networks. In International Conference on Machine Learning (ICML), 2017.

Glorot, X., Bordes, A., and Bengio, Y. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS), Fort Lauderdale, USA, pp. 315–323, 2011.

He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 2015.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.

In New Era for Robust Speech Recognition: Exploiting Deep Learning, pp. 135–164, 2017. doi: 10.1007/978-3-319-64680-0_6.

Kolbaek, M., Yu, D., Tan, Z.-H., and Jensen, J. Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(10):1901–1913, 2017.