Divide and Conquer: A Deep CASA Approach to Talker-independent Monaural Speaker Separation

04/25/2019 ∙ by Yuzhou Liu, et al. ∙ The Ohio State University

We address talker-independent monaural speaker separation from the perspectives of deep learning and computational auditory scene analysis (CASA). Specifically, we decompose the multi-speaker separation task into the stages of simultaneous grouping and sequential grouping. Simultaneous grouping is first performed in each time frame by separating the spectra of different speakers with a permutation-invariantly trained neural network. In the second stage, the frame-level separated spectra are sequentially grouped to different speakers by a clustering network. The proposed deep CASA approach optimizes frame-level separation and speaker tracking in turn, and produces excellent results for both objectives. Experimental results on the benchmark WSJ0-2mix database show that the new approach achieves state-of-the-art results with a modest model size.




I Introduction

Speech usually occurs simultaneously with interference in real acoustic environments. Interference suppression is needed in a wide variety of speech applications, including automatic speech recognition, speaker identification, and hearing aids. One particular kind of interference is the speech signal from competing speakers. Although human listeners excel at attending to a target speaker even without any spatial cues [4], speech separation remains a challenge for machines despite decades of research. In this study, we address monaural (one microphone) speaker separation, mainly in the case of two concurrent speakers, which is also known as co-channel speech separation.

A traditional approach to monaural speech separation is computational auditory scene analysis (CASA) [35], which is inspired by human auditory scene analysis (ASA) mechanisms [3]. CASA addresses speech separation in two main stages: simultaneous grouping and sequential grouping. With an acoustic mixture decomposed into a matrix of time-frequency (T-F) units, simultaneous grouping aggregates T-F units overlapping in time to short segments, each originating from the same source. In sequential grouping, segments are grouped across time into auditory streams, each corresponding to a distinct source. For example, an unsupervised speaker separation method [14] first generates T-F segments based on pitch and onset/offset analysis, and then uses clustering to sequentially group T-F segments into speakers.

Recently deep learning has been employed to address speaker separation. The general idea is to train a deep neural network (DNN) to predict T-F masks or spectra of two speakers in a mixture [7] [16] [42]. There are usually two output layers in such a DNN, one for each speaker. These studies assume that the two speakers do not change between training and testing. It has been shown that such talker-dependent training leads to significant intelligibility improvement for hearing impaired listeners [11]. However, talker-dependent training does not generalize to untrained speakers. Talker-independent speaker separation has to address the permutation problem [12] [21], i.e., how the output layers are tied to the underlying speakers. The details of the permutation problem are introduced in Section II-A.

Frame-level permutation invariant training (denoted by tPIT) [21] tackles this problem by examining all possible label permutations within each frame during training, and uses the one with the lowest frame-level loss to train the separation network. A locally optimized output-speaker pairing can thus be reached, which leads to excellent frame-level separation performance. However, the correct speaker assignment in tPIT’s output may swap frequently across frames. In other words, the frame-level optimized outputs cannot be readily streamed into underlying speakers without reorganization. To address this issue, an utterance-level PIT (uPIT) algorithm [21] is proposed to align each speaker to a fixed output layer throughout a whole training utterance. Recent uPIT improvements include new network structure [23] [41] and new training objectives [23]. TasNet [27] extends uPIT to the waveform domain using a convolutional encoder-decoder structure. FurcaNeXt [32] integrates gated activations and ensemble learning into TasNet, and reports very high performance.

Deep clustering (DC) [12] looks at the permutation problem from a different perspective. In DC, a recurrent neural network (RNN) with bi-directional long short-term memory (BLSTM) is trained to assign one embedding vector to each T-F unit of the mixture spectrogram. The Frobenius norm of the difference between the affinity matrix of embedding vectors and the affinity matrix of the ideal speaker assignment (or the ideal binary mask) is used as the training objective. DC avoids the permutation problem due to the permutation-invariant property of affinity matrices. As training unfolds, embedding vectors of T-F units dominated by the same source are drawn closer together, and embeddings of those units dominated by different sources become farther apart. Clustering these embedding vectors using the K-means algorithm assigns each T-F unit to one of the speakers in the mixture, which can be viewed as binary masking for speech separation. A concept of attractors has also been introduced to DC to enable ratio masking and real-time processing. Alternative training objectives, together with a chimera network which simultaneously estimates DC embeddings and uPIT outputs, are proposed in [37]. In [38], iterative phase reconstruction is integrated into the chimera network to alleviate phase distortions. In [39], a phase prediction network is further added to [38] to estimate the clean phase of each speaker source.

DC and PIT represent major approaches to talker-independent speaker separation. There are, however, limitations. As indicated in [21] [25], uPIT sacrifices frame-level performance for better assignments at the utterance level. The speaker tracking mechanism in uPIT works poorly for same-gender mixtures. On the other hand, DC is better at speaker tracking, but its frame-level separation is suboptimal compared to ratio masking used in tPIT.

Inspired by CASA, PIT and DC, we proposed a deep learning based two-stage method in our preliminary study [25] to perform talker-independent speaker separation. The method consists of two stages, a simultaneous grouping stage and a sequential grouping stage. In the first stage, a tPIT-BLSTM is trained to predict the spectra of the two speakers at each frame without speaker assignment. This stage separates spectral components of the two speakers at the same frame, corresponding to simultaneous grouping in CASA. In the sequential grouping stage, frame-level separated spectra and the mixture spectrogram are fed to another BLSTM to predict embedding vectors for the estimated spectra, such that the embedding vectors corresponding to the same speaker are close together, and those corresponding to different speakers are far apart. A constrained K-means algorithm is then employed to group the two spectral estimates at the same frame across time to different speakers. This stage corresponds to sequential grouping in CASA.

In this study, we adopt the same divide-and-conquer strategy but improve its realization in major ways, resulting in what we call a deep CASA approach. In the simultaneous grouping stage, we utilize a UNet [30] convolutional neural network (CNN) with densely-connected layers [15] to improve the performance of frame-level separation. A frequency mapping layer is added to deal with inconsistencies between different frequency bands. To overcome the effects of noisy phase in inverse short-time Fourier transform (STFT), we explore complex STFT objectives and time-domain objectives as the training targets. In the sequential grouping stage, we introduce a new embedding representation and weighted objective function. In addition, we leverage the latest development in temporal convolutional networks (TCNs) [2] [22] [27] [29], and use a TCN for sequential grouping, which greatly improves speaker tracking. A new dropout scheme is proposed for TCNs to overcome the overfitting problem. The evaluation results and comparisons demonstrate that the resulting system achieves better frame-level separation and speaker tracking at the same time compared to uPIT and [25].

The rest of the paper is organized as follows. Section II presents details on monaural speaker separation and permutation invariant training. The proposed algorithm, including the simultaneous and sequential grouping stages, is introduced in Section III. Section IV presents experimental results, comparisons and analysis. Conclusions and related issues are discussed in Section V.

II Monaural speaker separation and permutation invariant training

II-A Monaural Speaker Separation

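The goal of monaural speaker separation is to estimate $C$ independent speech signals $x_c(n)$, $c = 1, \dots, C$, from a single-channel recording of speech mixture $y(n)$, where $y(n) = \sum_{c=1}^{C} x_c(n)$ and $n$ indexes time. In this work, we focus on the co-channel situation where $C = 2$.

Many deep learning based speaker separation systems [7] [16] [42] address this problem in the T-F domain, where STFT is calculated using an analysis window $w(n)$ with FFT length $N$ and frame shift $R$:

$$X_c(t,f) = \sum_{n=-\infty}^{\infty} x_c(n)\, w(n - tR)\, e^{-j 2\pi n f / N} \tag{1}$$

$$Y(t,f) = \sum_{n=-\infty}^{\infty} y(n)\, w(n - tR)\, e^{-j 2\pi n f / N} \tag{2}$$

where $t$ and $f$ denote the frame and frequency, respectively. The magnitude STFT of the mixture signal $|Y(t,f)|$, together with other spectral features, is fed into a neural network to predict a T-F mask $M_c(t,f)$ for each speaker $c$. The masks are multiplied by the mixture to estimate the original sources:

$$|\hat{X}_c(t,f)| = M_c(t,f) \odot |Y(t,f)| \tag{3}$$

Here $\odot$ denotes element-wise multiplication, and $|\hat{X}_c(t,f)|$ denotes the estimated magnitude STFT of speaker $c$. An estimate of complex STFT $\hat{X}_c(t,f)$ can be obtained by coupling $|\hat{X}_c(t,f)|$ with noisy phase. In the end, separated waveforms are resynthesized using inverse STFT (iSTFT):

$$\hat{x}_c(n) = \mathrm{iSTFT}\big(\hat{X}_c(t,f)\big) \tag{4}$$

Various training targets of $M_c(t,f)$ have been explored for masking based speech separation [36]. Phase-sensitive approximation (PSA) is found to be effective as it accounts for errors introduced by the noisy phase [8] [21]. In PSA, the desired reconstructed signal is defined as $|X_c(t,f)| \odot \cos(\phi_c(t,f))$, where $\phi_c(t,f)$ is the phase difference between $X_c(t,f)$ and $Y(t,f)$. Overall, the training loss at each frame is computed as:

$$J^{\text{PSA}}_t = \sum_{c=1}^{C} \big\| M_c(t,f) \odot |Y(t,f)| - |X_c(t,f)| \odot \cos(\phi_c(t,f)) \big\|_1 \tag{5}$$

where $\|\cdot\|_1$ denotes the $\ell_1$ norm, taken over frequency.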
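The masking and PSA steps above can be illustrated with a short NumPy sketch. It is a minimal illustration rather than the authors' implementation: the function names `psa_frame_loss` and `ideal_psa_mask` are hypothetical, the array layout (speakers, frames, frequencies) is an assumption, and the loss is taken to be an $\ell_1$ norm over frequency.

```python
import numpy as np

def psa_frame_loss(mask, mix_stft, src_stft):
    """Per-frame phase-sensitive approximation (PSA) loss.

    mask:     (C, T, F) real-valued masks, one per network output
    mix_stft: (T, F) complex STFT of the mixture
    src_stft: (C, T, F) complex STFT of the clean sources
    Returns an array of shape (T,), one loss value per frame.
    """
    mix_mag = np.abs(mix_stft)[None]                      # (1, T, F)
    est_mag = mask * mix_mag                              # masked mixture magnitude
    phase_diff = np.angle(src_stft) - np.angle(mix_stft)[None]
    target = np.abs(src_stft) * np.cos(phase_diff)        # PSA target
    # l1 distance, summed over speakers and frequency (assumed norm)
    return np.abs(est_mag - target).sum(axis=(0, 2))

def ideal_psa_mask(mix_stft, src_stft, eps=1e-8):
    """Mask that reproduces the PSA target exactly (illustration only)."""
    phase_diff = np.angle(src_stft) - np.angle(mix_stft)[None]
    target = np.abs(src_stft) * np.cos(phase_diff)
    return target / np.maximum(np.abs(mix_stft)[None], eps)
```

With the ideal mask the per-frame loss is numerically zero, which gives a quick sanity check on the target definition.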

The above formulation works well only when each output layer is tied to a training target signal with similar characteristics. For instance, we may tie each output to a specific speaker, leading to talker-dependent training. We may also tie the two outputs to male and female speakers respectively, leading to gender-dependent training. However, for talker-independent training data, how to select the output-speaker pairing becomes a nontrivial problem. Consider a training set consisting of three female speakers. For speaker 1-2 mixtures, we can tie output1 to speaker1, and output2 to speaker2. For speaker 1-3 mixtures, again output1 can be tied to speaker1, and output2 tied to speaker3. However, it is hard to decide the pairing arrangement for speaker 2-3 mixtures. If output-speaker pairing is not arranged properly, conflicting gradients may be generated during training, preventing the neural network from converging. This is referred to as the permutation problem [12] [21].

II-B Permutation Invariant Training

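Frame-level PIT [21] overcomes the permutation problem by providing target speakers as a set instead of an ordered list, and output-speaker pairing $\theta_c(t)$, for a given frame $t$, is defined as the pairing that minimizes the loss function over all possible speaker permutations $\Theta$. For tPIT, the frame-level training loss in Eq. 5 is rewritten as:

$$J^{\text{tPIT-PSA}}_t = \min_{\theta(t) \in \Theta} \sum_{c=1}^{C} \big\| M_c \odot |Y| - |X_{\theta_c(t)}| \odot \cos(\phi_{\theta_c(t)}) \big\|_1 \tag{6}$$

We omit $(t,f)$ in the estimates and targets for brevity.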

tPIT does a good job in separating two speakers at the frame level [21] [25]. However, due to its locally optimized training objective, an output layer may be tied to different speakers at different frames, and the correct speaker assignment may swap frequently. If we reassign the outputs with respect to the minimum loss for each speaker, tPIT can almost perfectly reconstruct both speakers [25].
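The frame-level permutation search can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the paper's code: `tpit_frame_loss` is a hypothetical name, inputs are real-valued estimates and targets of shape (speakers, frames, frequencies), and an $\ell_1$ loss is assumed.

```python
import itertools
import numpy as np

def tpit_frame_loss(est, tgt):
    """Frame-level PIT: pick the best output-speaker pairing per frame.

    est, tgt: (C, T, F) real arrays (estimates and training targets).
    Returns (loss, pairing): loss has shape (T,), pairing has shape (T, C),
    where pairing[t, c] is the speaker paired with output c at frame t.
    """
    num_outputs = est.shape[0]
    perms = list(itertools.permutations(range(num_outputs)))
    # losses[p, t]: l1 loss at frame t when output c is paired with perms[p][c]
    losses = np.stack([
        np.abs(est - tgt[list(p)]).sum(axis=(0, 2)) for p in perms
    ])
    best = losses.argmin(axis=0)                  # best permutation per frame
    return losses.min(axis=0), np.array(perms)[best]
```

If the estimates equal the targets except that the two outputs are swapped in some frames, the returned loss is zero everywhere and `pairing` records exactly which frames were swapped. This mirrors the observation that tPIT outputs are accurate per frame but need reassignment.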

Optimal speaker assignments are not obtainable in practice as the targets are not given beforehand. To address this issue, uPIT fixes output-speaker pairing for a whole utterance, which corresponds to the pairing that provides the minimum utterance-level loss over all possible permutations.

As reported in [21] [25], uPIT considerably improves the separation performance with a default output assignment. But it has the following shortcomings. First, uPIT’s output-speaker pairing is fixed throughout a whole utterance, which prevents the frame-level loss from being optimized as in tPIT. As a result, uPIT always underperforms tPIT if their outputs are optimally reassigned. Second, uPIT addresses separation and speaker tracking simultaneously, and due to the limited modeling capacity of a neural network, uPIT does not work well for speaker tracking, especially for same-gender mixtures.
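The difference between the two criteria can be made concrete with a small sketch that computes both the utterance-level loss (one pairing for all frames) and the sum of per-frame minima for the same estimates. The function name `pit_totals`, the array layout, and the $\ell_1$ loss are assumptions made for illustration.

```python
import itertools
import numpy as np

def pit_totals(est, tgt):
    """Compare utterance-level and frame-level PIT losses.

    est, tgt: (C, T, F) real arrays.
    Returns (upit_loss, upit_perm, tpit_loss) where upit_perm is the single
    pairing chosen for the whole utterance and tpit_loss sums the per-frame
    minima over pairings.
    """
    num_outputs = est.shape[0]
    perms = list(itertools.permutations(range(num_outputs)))
    frame_losses = np.stack([
        np.abs(est - tgt[list(p)]).sum(axis=(0, 2)) for p in perms
    ])                                            # (num_perms, T)
    utt_losses = frame_losses.sum(axis=1)         # one value per pairing
    k = int(utt_losses.argmin())
    tpit_loss = float(frame_losses.min(axis=0).sum())
    return float(utt_losses[k]), perms[k], tpit_loss
```

Since the per-frame minimum can never exceed the loss of any fixed pairing, the frame-level total is always at most the utterance-level one.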

III Deep CASA approach to monaural speaker separation

We employ a divide-and-conquer idea to break down monaural speaker separation into two stages. In the simultaneous grouping stage, a tPIT based neural network separates spectral components of different speakers at the frame level. The sequential grouping stage then streams frame-level estimates belonging to the same speaker. Unlike uPIT, separation and tracking are optimized in turn in the deep CASA framework. The two stages are detailed in the following subsections.

III-A Simultaneous Grouping Stage

III-A1 Baseline system

We adopt the tPIT framework described in [25] as the baseline simultaneous grouping system. The magnitude STFT of the mixture is used as the input. BLSTM is employed as the learning machine. The system is trained using the loss function in Eq. 6. In the end, frame-level spectral estimates are passed to the second stage for sequential grouping.

III-A2 Alternative training targets for tPIT

As mentioned, the PSA training target partially accounts for STFT phase, unlike the ideal binary mask (IBM) and ideal ratio mask (IRM). However, PSA cannot completely restore the phase information in clean sources, because it uses noisy phase during iSTFT. Complex ratio masking (cRM) [40] has recently been proposed to restore clean phase. The complex ideal ratio mask (cIRM) is defined in the complex STFT domain, with real and imaginary parts. When applied to the complex STFT of the mixture, it perfectly reconstructs clean sources:

$$X_c(t,f) = \mathrm{cIRM}_c(t,f) \otimes Y(t,f) \tag{7}$$

where $\otimes$ denotes point-wise complex multiplication.

We propose to use complex ratio masking for monaural speaker separation. Instead of directly using the cIRM as the training target, we first multiply the complex mixture by the estimated complex mask to perform complex domain reconstruction:

$$\hat{X}_c(t,f) = \mathrm{cRM}_c(t,f) \otimes Y(t,f) \tag{8}$$

The reconstructed sources are then compared with clean sources to form the training objective:

$$J^{\text{tPIT-CA}}_t = \min_{\theta(t) \in \Theta} \sum_{c=1}^{C} \big\| \mathrm{cRM}_c \otimes Y - X_{\theta_c(t)} \big\|_1 \tag{9}$$

where the $\ell_1$ norm is applied to both the real and imaginary parts of the loss. We call this training objective complex approximation (CA).
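A minimal sketch of the complex approximation loss is given below, for a fixed output-speaker pairing (the permutation search is left out for brevity). The name `complex_approx_loss`, the array layout, and the choice of an $\ell_1$ penalty on real and imaginary parts are assumptions for illustration.

```python
import numpy as np

def complex_approx_loss(crm, mix_stft, src_stft):
    """Per-frame complex approximation (CA) loss for a fixed pairing.

    crm:      (C, T, F) complex ratio masks estimated by the network
    mix_stft: (T, F) complex STFT of the mixture
    src_stft: (C, T, F) complex STFT of the clean sources
    Returns an array of shape (T,).
    """
    est = crm * mix_stft[None]           # point-wise complex multiplication
    diff = est - src_stft
    # l1 penalty applied separately to the real and imaginary parts (assumed)
    per_bin = np.abs(diff.real) + np.abs(diff.imag)
    return per_bin.sum(axis=(0, 2))
```

Dividing each clean source by the mixture yields the cIRM, for which this loss is numerically zero; a mask of all ones returns the mixture itself and therefore a positive loss.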

We also consider a training objective based on time-domain signal-to-noise ratio (SNR). The proposed framework consists of two steps. First, we organize all frame-level complex estimates with respect to the minimum frame-level loss, so that each organized output corresponds to a single speaker. The frame-level loss for organization can be defined in three domains: the complex STFT, magnitude STFT and time domain. In each domain, we compare the estimates and ground-truth targets, and calculate the $\ell_1$ norm of the difference as the loss. We found the complex STFT loss to be slightly better. Second, we apply iSTFT (Eq. 4) to the organized estimates, and compute utterance-level SNR for the final time-domain estimates $\hat{x}_c(n)$:

$$J^{\text{tPIT-SNR}} = -10 \sum_{c=1}^{C} \log_{10} \frac{\sum_n x_c(n)^2}{\sum_n \big(x_c(n) - \hat{x}_c(n)\big)^2} \tag{10}$$
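The second step, computing an utterance-level SNR objective on time-domain signals, can be sketched as below. This is an assumed form: the loss is written as the negative sum of per-speaker SNRs in dB so that minimizing it maximizes SNR, and `neg_snr_loss` is a hypothetical name.

```python
import numpy as np

def neg_snr_loss(est, ref):
    """Negative utterance-level SNR, summed over speakers.

    est, ref: (C, N) time-domain estimates and clean references,
    already organized so that row c of est corresponds to row c of ref.
    """
    noise_energy = ((ref - est) ** 2).sum(axis=1)
    signal_energy = (ref ** 2).sum(axis=1)
    snr_db = 10.0 * np.log10(signal_energy / noise_energy)
    return float(-snr_db.sum())
```

As a worked check, an estimate equal to 0.9 times the reference leaves an error of 0.1 times the reference, i.e. 1% of the energy, which is 20 dB per speaker; for two speakers the loss is about -40.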


III-A3 Convolutional neural networks for simultaneous grouping

Fig. 1: Diagram of the Dense-UNet used in simultaneous grouping. Gray blocks denote dense CNN layers. DS blocks denote downsampling layers and US blocks denote upsampling layers. Skip connections are added to connect layers at the same level. The inputs, masks and outputs can be defined in either magnitude or complex STFT domain.

Motivated by the recent success of DenseNet [15] and UNet [30] in music source separation [18] [33], we propose a Dense-UNet structure for simultaneous grouping.

The proposed Dense-UNet is shown in Fig. 1, and it is based on a UNet architecture [30]. It consists of a series of convolutional layers, downsampling layers and upsampling layers. The first half of the network encodes utterance-level STFT feature maps into a higher level of abstraction. Convolutional layers and downsampling layers are alternated in this half, allowing the network to model large T-F contexts. Convolutional layers and upsampling layers are alternated in the second half to project the encoded features back to their original resolution. In this study, we use strided depthwise convolutional layers [5] as downsampling layers. Strided transpose convolutional layers are used as upsampling layers. Skip connections are added between layers at the same hierarchical level in the encoder and decoder to preserve raw information in the mixture.

Next, we replace convolutional layers in the original UNet with densely-connected CNN blocks (DenseNet) [15]. The basic idea of DenseNet is to decompose one convolutional layer with many channels into a sequence of densely connected convolutional layers with fewer channels, where each layer is connected to every other layer in a feed-forward fashion:

$$z_l = g\big([z_{l-1}, z_{l-2}, \dots, z_0]\big) \tag{11}$$

where $z_0$ denotes the input feature map, $z_l$ the output of the $l$-th layer, $[\cdot]$ concatenation, and $g(\cdot)$ the convolutional layer followed by ELU (exponential linear unit) activation [6] and layer normalization [1]. The DenseNet structure has shown excellent performance in image classification [15] and music source separation [33]. In this study, all output layers in a dense block have the same number of channels, denoted by $K$. The total number of layers in each dense block is denoted by $L$. As shown in Fig. 1, we alternate 9 dense blocks with 4 downsampling layers and 4 upsampling layers. After the last dense block, we use a CNN layer to reorganize the feature map, and then output two masks.
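The dense connectivity pattern can be sketched in a few lines of NumPy. To keep the example short, each convolutional layer is replaced by a per-bin linear map over channels (equivalent to a 1×1 convolution) and layer normalization is omitted; both simplifications, along with the names `elu` and `dense_block`, are assumptions for illustration only.

```python
import numpy as np

def elu(x):
    """Exponential linear unit with alpha = 1."""
    return np.where(x > 0, x, np.expm1(np.minimum(x, 0.0)))

def dense_block(z0, weights):
    """Densely connected block on a (T, F, channels) feature map.

    z0:      input feature map, shape (T, F, K0)
    weights: list of matrices; the l-th one has shape (K0 + l*K, K) and
             stands in for a convolutional layer (1x1 kernel assumed here).
    Returns the output of the last layer, shape (T, F, K).
    """
    feats = [z0]
    for w in weights:
        # Concatenate all earlier outputs along the channel axis,
        # most recent first: [z_{l-1}, ..., z_0].
        x = np.concatenate(feats[::-1], axis=-1)
        feats.append(elu(x @ w))
    return feats[-1]
```

The input channel count of layer `l` grows linearly with `l`, which is why the weight shapes in `weights` increase by `K` rows per layer while every layer still emits `K` channels.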

In CNNs, convolutional kernels are usually applied across the entire input field. This is reasonable in the case of visual processing, where similar patterns can appear anywhere in the visual field with translation and rotation. However, in the auditory representation of speech, patterns that occur in different frequency bands are usually different. A generic CNN kernel may result in inconsistent outputs at different frequencies. To address this problem, Takahashi and Mitsufuji [33] split the spectral input into several subbands, and train band-dependent CNNs, leading to a substantial rise in model size.

We propose a frequency mapping layer which effectively alleviates this problem with a significant reduction of parameters. The basic idea is to project inconsistent frequency outputs to an organized space using a fully-connected layer. We replace one CNN layer in each dense block with a frequency mapping layer. The input to a frequency mapping layer is a concatenation of CNN layers, with time, frequency and channel dimensions. It is first passed to a convolutional layer, followed by ELU activation and layer normalization, to reduce the number of channels. We then transpose the frequency and channel dimensions of the resulting output, so that frequency takes the place of the channel dimension. Next, the transposed feature map is fed to a convolutional layer, followed by ELU activation and layer normalization. This layer can also be viewed as a frequency-wise fully connected layer, which takes all frequency estimates as the input and reorganizes them in a different space. Finally, the result is transposed back, and the output of the frequency mapping layer is generated.
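The transpose trick behind the frequency mapping layer can be sketched as follows. The sketch assumes both convolutions use a 1×1 kernel (so they reduce to matrix products over the last axis) and leaves out layer normalization; the names `elu`, `frequency_mapping`, `w_ch` and `w_freq` are hypothetical.

```python
import numpy as np

def elu(x):
    """Exponential linear unit with alpha = 1."""
    return np.where(x > 0, x, np.expm1(np.minimum(x, 0.0)))

def frequency_mapping(z, w_ch, w_freq):
    """Frequency mapping layer on a (T, F, K_in) feature map.

    w_ch:   (K_in, K) channel-reduction weights (1x1 convolution assumed)
    w_freq: (F, F) weights of the frequency-wise fully connected layer
    Returns a (T, F, K) feature map.
    """
    z1 = elu(z @ w_ch)                 # (T, F, K): reduce channels
    z1_t = np.swapaxes(z1, 1, 2)       # (T, K, F): frequency becomes last axis
    z2_t = elu(z1_t @ w_freq)          # mix information across all frequencies
    return np.swapaxes(z2_t, 1, 2)     # back to (T, F, K)
```

Because `w_freq` is a full F×F matrix, every output frequency bin can draw on every input bin, unlike a small convolutional kernel that is shared across the frequency axis.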

III-B Sequential Grouping Stage

III-B1 Baseline system

In this stage, we group frame-level spectral estimates across time using a clustering network, which corresponds to sequential grouping in CASA. In deep clustering based speaker separation, T-F level embedding vectors estimated by BLSTM are clustered into different speakers. We extend this framework to frame-level speaker tracking.

Fig. 2 illustrates our sequential grouping. We first stack the mixture spectrogram and two spectral estimates (including real, imaginary and magnitude STFT) as the input to the system. A neural network then projects frame-level inputs to a $D$-dimensional unit-length embedding vector $V(t) \in \mathbb{R}^{D}$. The target label is a two-dimensional indicator vector, denoted by $A(t)$. During the training of tPIT, if the minimum loss is achieved when the first output is paired with speaker 1, and the second output is paired with speaker 2, we set $A(t)$ to [1 0]. Otherwise, $A(t)$ is set to [0 1]. In other words, $A(t)$ indicates the optimal output assignment of each frame. $V(t)$ and $A(t)$ can be reshaped into a $T \times D$ matrix $V$, and a $T \times 2$ matrix $A$, respectively. A permutation independent objective function [12] is:

$$J^{\text{DC}} = \big\| V V^\top - A A^\top \big\|_F^2 \tag{12}$$

where $\|\cdot\|_F$ is the Frobenius norm. Optimizing $J^{\text{DC}}$ forces $V(t)$ corresponding to the same optimal assignment to get closer during training, and otherwise to become farther apart.

Because we care more about the speaker assignment of frames where the two outputs are substantially different, a weight $w(t)$ is used during training. It is computed from $LD(t)$, which represents the frame-level loss difference (LD) between the two possible speaker assignments. $w(t)$ is large if two conditions are both satisfied: 1) the frame-level energy of the mixture is high; 2) the two frame-level outputs are quite different, so that the losses with respect to different speaker assignments are significantly different. $w(t)$ can be used to construct a diagonal matrix $W = \mathrm{diag}(w(t))$. The final weighted objective function is:

$$J^{\text{DC-W}} = \big\| W^{1/2} \big( V V^\top - A A^\top \big) W^{1/2} \big\|_F^2 \tag{13}$$

This objective function emphasizes frames where the speaker assignment plays an important role.
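A small sketch of the affinity-based objective, with optional frame weights, is shown below. The weighted form (scaling row and column `t` of the affinity difference by the square root of the frame weight) is an assumption borrowed from common weighted deep clustering formulations, and `weighted_affinity_loss` is a hypothetical name.

```python
import numpy as np

def weighted_affinity_loss(V, A, w=None):
    """Squared Frobenius distance between embedding and label affinities.

    V: (T, D) frame-level embeddings
    A: (T, 2) one-hot optimal-assignment labels
    w: optional (T,) non-negative frame weights; uniform if omitted
    """
    num_frames = V.shape[0]
    if w is None:
        w = np.ones(num_frames)
    s = np.sqrt(w)
    diff = V @ V.T - A @ A.T                       # (T, T) affinity difference
    weighted = s[:, None] * diff * s[None, :]      # scale rows and columns
    return float((weighted ** 2).sum())
```

The loss depends on `V` and `A` only through `V @ V.T` and `A @ A.T`, so swapping the two label columns (or the cluster identities) leaves it unchanged, which is the permutation-independence property the text relies on.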

During inference, the K-means algorithm is first applied to cluster the frame-level embeddings $V(t)$ into two groups. We then organize frame-level outputs according to their K-means labels. Finally, iSTFT is employed to convert complex outputs to the time domain.
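The inference procedure can be sketched with a tiny two-cluster K-means followed by a per-frame swap of the two outputs. The initialization (first embedding plus the one farthest from it), the fixed iteration count, and the names `two_means` and `organize_outputs` are choices made for this illustration, not details taken from the paper.

```python
import numpy as np

def two_means(V, num_iters=20):
    """Cluster (T, D) embeddings into two groups with plain K-means."""
    first = V[0]
    farthest = V[np.argmax(((V - first) ** 2).sum(axis=1))]
    centers = np.stack([first, farthest]).astype(float)
    labels = np.zeros(len(V), dtype=int)
    for _ in range(num_iters):
        dists = ((V[:, None, :] - centers[None]) ** 2).sum(axis=2)  # (T, 2)
        labels = dists.argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = V[labels == k].mean(axis=0)
    return labels

def organize_outputs(est, labels):
    """Swap the two frame-level outputs wherever the cluster label is 1.

    est: (2, T, F) frame-level estimates from the first stage.
    """
    organized = est.copy()
    swap = labels == 1
    organized[0, swap] = est[1, swap]
    organized[1, swap] = est[0, swap]
    return organized
```

Which cluster ends up as label 0 is arbitrary, so the organized streams are consistent over time but their order (stream 1 versus stream 2) carries no meaning.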

Fig. 2: Diagram of the sequential grouping stage. We use BLSTM or TCN as the neural network in this stage.
Fig. 3: Diagram of the TCN used in sequential grouping. Outputs from the previous stage are fed into a series of dilated convolutional blocks to predict frame-level embedding vectors. The dilation factor of each block is marked on the right. The detailed structure of a dilated convolutional block is illustrated in the large gray box. The network within the dashed box can be also used for uPIT based speaker separation.

III-B2 Temporal convolutional networks for sequential grouping

Temporal convolutional networks (TCNs) have been used as a replacement for RNNs, and have shown comparable or better performance in various tasks [2] [22] [27] [29]. In TCNs, a series of dilated convolutional layers are stacked to form a deep network, which enables very long memory. In this study, we adopt a TCN similar to TasNet [27] for sequential grouping, as illustrated in Fig. 3.

In the proposed TCN, input features are first passed to a 2-D dense CNN block, a $1 \times 1$ convolutional layer and a layer normalization module, to perform frame-level feature preprocessing. The $1 \times 1$ convolutional layer here refers to a 1-D CNN layer with a kernel size of 1. The preprocessed features are then passed to a series of dilated convolutional blocks, with an exponentially increasing dilation factor, to exploit large temporal contexts. Next, the stacked dilated convolutional blocks are repeated 3 times to further increase the receptive field. Lastly, the outputs are fed into a convolutional layer for embedding estimation.

In each dilated convolutional block, a bottleneck input is first passed to a convolutional layer, followed by PReLU (parametric rectified linear unit) activation [10] and layer normalization, to extend the number of channels. A depthwise dilated convolutional layer [5], followed by PReLU activation and layer normalization, is then employed to capture the temporal context. The temporal filter in each channel has a size of 3, and the kernel holds one depthwise separable filter per channel. We adopt non-causal filters to exploit both past and future information, with a dilation factor that increases exponentially across blocks, as in [27]. The output of this part is then passed to a convolutional layer to project the number of channels back to that of the bottleneck input. In the end, an identity residual connection combines the block input and the projected output, and forms the final output.

Overfitting is a major concern in sequence models. If not regularized properly, sequence models tend to memorize the patterns in the training data, and get trapped in local minima. To address this issue, various dropout techniques [9] [28] [31] have been proposed for RNNs. Consistent improvement has been achieved if dropout is applied to recurrent connections [28]. Meanwhile, a simple dropout scheme for TCNs is used in [2], i.e., applying dropout within each dilated convolutional block, but it does not yield satisfactory performance in our experience. Based on these findings, we design a new dropout scheme for the TCN model, denoted by dropDilation. In dropDilation, the dilated connections in depthwise dilated convolutional layers are dropped with a probability of $1 - p$, where $p$ denotes the keep rate. To be more specific, a binary mask is multiplied with each depthwise dilated convolutional kernel during training, with the mask entries for the dilated connections drawn independently from a Bernoulli distribution with parameter $p$. In dropDilation, we only drop the dilated connections while keeping the direct connections to preserve local information.
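The masking idea in dropDilation can be sketched for kernels with three temporal taps. The sketch assumes one independent mask per channel and that the middle tap is the direct connection; `drop_dilation_mask` and its arguments are hypothetical names.

```python
import numpy as np

def drop_dilation_mask(num_channels, keep_rate, rng):
    """Binary mask for depthwise dilated kernels of temporal size 3.

    The middle tap (direct connection) is always kept; the two outer taps
    (dilated connections) are kept independently with probability keep_rate.
    Returns an array of shape (num_channels, 3) with entries in {0, 1}.
    """
    mask = np.ones((num_channels, 3))
    mask[:, [0, 2]] = rng.random((num_channels, 2)) < keep_rate
    return mask

def apply_drop_dilation(kernel, keep_rate, rng):
    """Multiply a (num_channels, 3) depthwise kernel by a fresh mask."""
    return kernel * drop_dilation_mask(kernel.shape[0], keep_rate, rng)
```

During training a new mask would be drawn at every step; at inference time the kernel is used unmasked. Whether the kept taps should be rescaled by the keep rate is not stated in the text, so no rescaling is applied here.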

IV Evaluation and comparison

IV-A Experimental Setup

We use the WSJ0-2mix dataset, a monaural two-talker speaker separation dataset introduced in [12], for evaluations. WSJ0-2mix has a 30-hour training set and a 10-hour validation set generated by selecting random speaker pairs in the Wall Street Journal (WSJ0) training set si_tr_s, and mixing them at various SNRs between 0 dB and 5 dB. Evaluation is conducted on the 5-hour open-condition (OC) test set, which is similarly generated using 16 untrained speakers from the WSJ0 development set si_dt_05 and evaluation set si_et_05. All mixtures are sampled at 8 kHz. STFT with a frame length of 32 ms, a frame shift of 8 ms, and a square-root Hanning window is used throughout the system.
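The STFT settings translate into sample counts as sketched below. The sketch assumes the FFT length equals the frame length and uses NumPy's symmetric `np.hanning` window; `stft_config` is a hypothetical helper, not part of any toolkit.

```python
import numpy as np

def stft_config(sample_rate=8000, frame_ms=32, shift_ms=8):
    """Convert STFT settings from milliseconds to samples."""
    frame_len = sample_rate * frame_ms // 1000
    frame_shift = sample_rate * shift_ms // 1000
    return {
        "frame_len": frame_len,
        "frame_shift": frame_shift,
        # Number of non-redundant frequency bins for a real-valued signal,
        # assuming the FFT length equals the frame length.
        "num_bins": frame_len // 2 + 1,
        "window": np.sqrt(np.hanning(frame_len)),
    }
```

At 8 kHz this gives 256-sample frames with a 64-sample shift (75% overlap) and 129 frequency bins per frame.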

We report results in terms of signal-to-distortion ratio improvement (ΔSDR) [34], perceptual evaluation of speech quality (PESQ) [17], and extended short-time objective intelligibility (ESTOI) [19], to measure source separation performance, speech quality and speech intelligibility, respectively. We also report the final result in terms of scale-invariant signal-to-noise ratio improvement (ΔSI-SNR) [27] for a systematic comparison with other competitive systems.

IV-B Models

IV-B1 Simultaneous grouping models

Two models are evaluated for simultaneous grouping: BLSTM and Dense-UNet.

The baseline BLSTM contains 3 BLSTM layers, with 896×2 units in each layer. In each dense block of Dense-UNet, the number of channels is set to 64, the total number of dense layers is set to 5, and all CNN layers share the same kernel size and stride. The middle layer in each dense block is replaced with a frequency mapping layer. We use valid padding (a term in CNN literature referring to no input padding) for the last CNN layer in each dense block, and same padding (padding the input with zeros so that the output has the same dimension as the original input) for all other layers. The input STFT is zero-padded accordingly.

For both models, when trained with the PSA objective, the magnitude STFT of the mixture is adopted as the input, and ELU activation is applied to output layers for phase-sensitive mask estimation. If the CA or SNR objective is used for training, a stack of real and imaginary STFT is used as the input, and linear output layers are used to predict the real and imaginary parts of complex ratio masks separately.

Both networks are trained with the Adam optimization algorithm [20] and dropout regularization [13]. The initial learning rate is set to 0.0002 for BLSTM, and 0.0001 for Dense-UNet. Learning rate adjustment and early stopping are employed based on the loss on the validation set.

IV-B2 Sequential grouping models

Two models are evaluated for sequential grouping: BLSTM and TCN. Both models are trained on top of a well-tuned simultaneous grouping model.

The baseline BLSTM contains 4 BLSTM layers, with 300×2 units in each layer. In TCN, the maximum dilation factor is chosen to reach a theoretical receptive field of 8.128 s. The number of bottleneck units is selected as 256. The number of units in depthwise dilated convolutional layers is set to 512. Same padding is employed in all CNN layers. DropDilation is applied during training.
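The quoted receptive field can be related to the dilation pattern with a short calculation. The configuration used in the usage note (dilations $2^0$ to $2^6$ and four stacks of blocks, i.e. one stack plus three repeats) is an assumption chosen because it reproduces the 8.128 s figure with an 8 ms frame shift; the text itself does not confirm these values. `tcn_context_seconds` is a hypothetical helper.

```python
def tcn_context_seconds(max_exponent, num_stacks, kernel_size=3,
                        frame_shift_s=0.008):
    """Temporal span covered by stacked dilated convolutions.

    Each layer with dilation d and kernel size k widens the span by
    (k - 1) * d frames; dilations are 2**0 ... 2**max_exponent per stack.
    """
    dilations = [2 ** i for i in range(max_exponent + 1)]
    span_frames = num_stacks * (kernel_size - 1) * sum(dilations)
    return span_frames * frame_shift_s
```

For example, `tcn_context_seconds(6, 4)` gives 4 × 2 × 127 = 1016 frame shifts, or about 8.128 s.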

A 2-D dense CNN block is used in both models for frame-level feature preprocessing. The dimensionality of embedding vectors is set to 40. Both networks are trained with the Adam optimization algorithm, with an initial learning rate of 0.001 for BLSTM, and 0.00025 for TCN. Learning rate adjustment and early stopping are again adopted.

IV-B3 One stage uPIT models

To systematically evaluate the proposed methods, we train a Dense-UNet and a TCN with SNR objectives and the uPIT training criterion. Other training recipes follow those in Sections IV-B1 and IV-B2.

IV-C Results and Comparisons

| Model | Objective | # of param. | ΔSDR (dB) | PESQ | ESTOI (%) |
|---|---|---|---|---|---|
| Mixture | - | - | 0.0 | 2.02 | 56.1 |
| tPIT BLSTM | PSA | 46.3M | 13.0 | 3.13 | 86.7 |
| tPIT Dense-UNet | PSA | 4.7M | 14.7 | 3.41 | 90.5 |
| tPIT Dense-UNet | CA | 4.7M | 18.6 | 3.57 | 93.8 |
| tPIT Dense-UNet | SNR | 4.7M | 19.1 | 3.63 | 94.3 |

TABLE I: ΔSDR, PESQ and ESTOI for simultaneous grouping models with optimal output assignment on WSJ0-2mix OC.

We first evaluate the simultaneous grouping stage. Table I summarizes the performance of tPIT models with respect to different network structures and training objectives. For all models, outputs are organized with the optimal speaker assignment before evaluation. Scores of mixtures are presented in the first row. Compared to BLSTM, Dense-UNet drastically reduces the number of trainable parameters to 4.7 million, and introduces significant performance gain. The frequency mapping layers in our Dense-UNet introduce a 0.3 dB increment in SDR, 0.1 increment in PESQ, 0.8% increment in ESTOI and a parameter reduction of 0.9 million. Next, we switch from magnitude STFT to complex STFT, and change the training objective to CA. This change leads to large improvement, revealing the importance of phase information for source separation. The SNR objective further outperforms the CA objective. We thus adopt tPIT Dense-UNet trained with the SNR objective for simultaneous grouping in the following evaluations.

| Model | Output Assign. | ΔSDR (dB) | PESQ | ESTOI (%) |
|---|---|---|---|---|
| tPIT Dense-UNet | Optimal | 19.1 | 3.63 | 94.3 |
| tPIT Dense-UNet | Default | 0.0 | 1.99 | 55.8 |
| uPIT Dense-UNet | Optimal | 17.0 | 3.40 | 91.6 |
| uPIT Dense-UNet | Default | 15.2 | 3.24 | 88.9 |

TABLE II: ΔSDR, PESQ and ESTOI for tPIT and uPIT based Dense-UNet trained with SNR objectives.
Fig. 4: Speaker separation results of PIT based models in log-scale magnitude STFT. Two models, tPIT Dense-UNet and uPIT Dense-UNet, are trained with CA objectives. The complex outputs from the models are converted to log magnitude STFT for visualization. (a) A male-male test mixture. (b) Speaker1 in the mixture. (c) Speaker2 in the mixture. (d) tPIT’s output1 with default assignment. (e) tPIT’s output2 with default assignment. (f) tPIT’s output1 with optimal assignment. (g) tPIT’s output2 with optimal assignment. (h) uPIT’s output1 with default assignment. (i) uPIT’s output2 with default assignment. (j) uPIT’s output1 with optimal assignment. (k) uPIT’s output2 with optimal assignment.

Table II compares tPIT and uPIT based Dense-UNet in terms of both optimal and default output assignments. Both models are trained with SNR objectives. Thanks to the utterance-level output-speaker pairing, uPIT’s default assignment is improved by a large margin over tPIT. However, since frame-level loss is not optimized in uPIT, there is a significant gap between uPIT and tPIT with optimal assignment.

Fig. 4 illustrates the differences between tPIT and uPIT based Dense-UNet in more detail. Because SNR objectives lead to less structured outputs in the T-F domain, the models illustrated in the figure are trained with CA objectives. Speaker assignment swaps frequently in the default outputs of tPIT. However, if we organize the outputs with the optimal assignment, the outputs almost perfectly match the clean sources, as shown in the fourth row. On the other hand, the default outputs of uPIT are much closer to the clean sources than those of tPIT. However, for this same-gender mixture, uPIT makes several assignment mistakes in the default outputs, e.g., from 2s to 2.5s, and from 5s to 5.2s. If we optimally organize uPIT’s outputs, as in the last row, we can see that uPIT exhibits much worse frame-level performance than tPIT. In some frames, e.g., around 4.9s, the predicted frequency patterns are totally mixed up. These observations reveal uPIT’s limitations in both frame-level separation and speaker tracking for challenging speaker pairs.

Simul. Group. | Seq. Group. | SDR (dB) | PESQ | ESTOI (%)
tPIT Dense-UNet | BLSTM | 16.4 | 3.31 | 90.8
tPIT Dense-UNet | TCN | 17.9 | 3.49 | 92.9
uPIT Dense-UNet | - | 15.2 | 3.24 | 88.9
uPIT Dense-UNet | Optimal | 17.0 | 3.40 | 91.6
uPIT TCN | - | 13.5 | 3.06 | 85.9
uPIT TCN | Optimal | 14.9 | 3.19 | 88.1
TABLE III: Comparison of different sequential grouping methods on WSJ0-2mix OC.

Next, we evaluate different sequential grouping models in Table III. The first two models are trained on top of the tPIT Dense-UNet with the SNR objective. As shown in the table, TCN substantially outperforms BLSTM, both having around 8 million parameters. The dropDilation technique in our TCN introduces 0.5 dB SDR gain compared to conventional dropout [2].

In the last four rows of Table III, we report the results of uPIT models. The first uPIT model is trained using Dense-UNet, and it significantly underperforms both deep CASA systems. Even if the outputs are optimally reassigned, uPIT Dense-UNet still systematically underperforms deep CASA (tPIT Dense-UNet + TCN), due to its frame-level separation errors. We also train a TCN model with uPIT objectives, and it yields much worse results than uPIT Dense-UNet.

To further analyze the differences between deep CASA and uPIT, we present frame assignment error (FAE) for the best performing deep CASA system and the two uPIT based models in Table IV. FAE is defined as the percentage of incorrectly assigned frames in terms of the minimum frame-level loss. As shown in the table, uPIT Dense-UNet generates the highest FAE, because the network is not specifically designed for sequence modeling. uPIT TCN slightly outperforms uPIT Dense-UNet due to its long receptive field. However, because uPIT TCN does not handle frequency patterns as well, its overall separation performance is worse than uPIT Dense-UNet. Deep CASA cuts FAE by half compared to uPIT models. Such results demonstrate the benefits of the proposed divide-and-conquer strategy, which optimizes frame-level separation and speaker tracking in turn, and achieves better performance in both objectives.
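
As a rough illustration of the FAE metric, the following Python sketch counts mismatches between predicted and optimal frame labels, restricted to frames with significant energy. The function name, the 0/1 label encoding, and the assumption that `frame_energy` holds linear power values (so that -20 dB corresponds to a factor of 0.01) are illustrative choices, not details taken from the paper.

```python
def frame_assignment_error(predicted, optimal, frame_energy, floor_db=-20.0):
    """Percentage of significant-energy frames whose label differs from the optimal one.

    predicted, optimal: per-frame 0/1 assignment labels.
    frame_energy: per-frame energy as linear power (assumption).
    floor_db: frames below this level relative to the loudest frame are ignored.
    """
    threshold = max(frame_energy) * 10 ** (floor_db / 10.0)
    kept = [(p, o) for p, o, e in zip(predicted, optimal, frame_energy)
            if e >= threshold]
    if not kept:
        return 0.0
    wrong = sum(1 for p, o in kept if p != o)
    return 100.0 * wrong / len(kept)


# Toy example: the last frame is below the energy floor and is ignored.
predicted = [0, 1, 1, 0, 1]
optimal = [0, 1, 0, 0, 0]
frame_energy = [1.0, 0.5, 0.2, 0.05, 0.001]
fae = frame_assignment_error(predicted, optimal, frame_energy)
```

Here the optimal labels would come from the minimum frame-level loss, as in the definition above.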

Simul. Group. | Seq. Group. | Frame Assign. Errors (%)
tPIT Dense-UNet | TCN | 1.38
uPIT Dense-UNet | - | 3.49
uPIT TCN | - | 3.07
TABLE IV: Frame assignment errors for different methods for frames with significant energy (at least -20 dB relative to maximum frame-level energy).
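Model | Gender Comb. | SDR (dB) | PESQ | ESTOI (%)
tPIT Dense-UNet + TCN Assign. | Female-Male | 18.9 | 3.57 | 93.9
tPIT Dense-UNet + TCN Assign. | Female-Female | 15.7 | 3.32 | 90.5
tPIT Dense-UNet + TCN Assign. | Male-Male | 17.2 | 3.45 | 92.5
uPIT Dense-UNet | Female-Male | 17.4 | 3.42 | 91.9
uPIT Dense-UNet | Female-Female | 10.9 | 2.89 | 83.5
uPIT Dense-UNet | Male-Male | 13.6 | 3.12 | 86.7
tPIT Dense-UNet + Opt. Assign. | Female-Male | 19.4 | 3.64 | 94.4
tPIT Dense-UNet + Opt. Assign. | Female-Female | 18.8 | 3.61 | 93.9
tPIT Dense-UNet + Opt. Assign. | Male-Male | 18.7 | 3.62 | 94.3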
TABLE V: SDR, PESQ and ESTOI for deep CASA and uPIT for different gender combinations.
Fig. 5: Speaker separation results of the deep CASA system, with tPIT Dense-UNet trained with SNR objectives for simultaneous grouping and TCN for sequential grouping. The same test mixture is used as in Fig. 4. The complex outputs from the models are converted to log magnitude STFT for visualization. (a) Speaker1 in the mixture. (b) Speaker2 in the mixture. (c) tPIT’s output1 with default assignment. (d) tPIT’s output2 with default assignment. (e) Optimal assignment (black and white bars represent two different assignments). (f) K-means assignment. (g) tPIT’s output1 with K-means assignment. (h) tPIT’s output2 with K-means assignment. (i) tPIT’s output1 with K-means assignment after iSTFT and STFT. (j) tPIT’s output2 with K-means assignment after iSTFT and STFT.
Method | # of param. | SDR (dB) | SI-SNR (dB) | PESQ | ESTOI (%)
Mixture | - | 0.0 | 0.0 | 2.02 | 56.1
uPIT [21] | 94.6M | 10.0 | - | 2.84 | -
TasNet [27] | 8.8M | 15.0 | 14.6 | 3.25 | -
Wang et al. [39] | 56.6M | 15.4 | 15.2 | 3.45 | -
FurcaNeXt [32] | 51.4M | 18.4 | - | - | -
Deep CASA | 12.8M | 18.0 | 17.7 | 3.51 | 93.2
IBM | - | 13.8 | 13.4 | 3.28 | 89.1
IRM | - | 13.0 | 12.7 | 3.68 | 92.9
PSM | - | 16.7 | 16.4 | 3.98 | 96.0
TABLE VI: Number of parameters, SDR, SI-SNR, PESQ and ESTOI for various state-of-the-art systems evaluated on WSJ0-2mix OC.

Table V compares deep CASA and uPIT systems for different gender combinations. Both systems achieve better results on female-male mixtures than in same-gender conditions. The performance gap is larger for female-female mixtures, consistent with the observation in [24]. This might be due to the unbalanced gender distribution in WSJ0-2mix OC, which contains 1086 male-male mixtures, but only 394 female-female mixtures. On the other hand, the performance gap between different gender combinations is much smaller in deep CASA than in uPIT, likely because deep CASA is better at speaker tracking.

Fig. 5 illustrates the results of deep CASA. As shown in the second row, tPIT Dense-UNet trained with SNR objectives generates entirely different default outputs compared to the same model trained with CA (cf. Fig. 4). The optimal assignments alternate almost every frame, leading to striped patterns. To study the phenomenon, we analyze the overall training process of tPIT Dense-UNet trained with the SNR objective. At the beginning, the SNR objective leads to outputs similar to those of the CA objective. However, because there is 75% overlap between neighboring frames in the proposed STFT, models trained with SNR only need to make accurate predictions every other frame, with frames in between left blank. Such patterns start to occur after a few hundred training steps. The competing speaker then gradually fills in the blanks, and the striped patterns are thus formed. As shown in Fig. 5(f), the K-means labels predicted by the sequential grouping system almost perfectly match the optimal labels in speech-dominant frames. However, organizing the default outputs with respect to the K-means labels leads to magnitude STFT that is quite different from the clean sources. Residual patterns from the interfering speaker still exist in some frames. If we convert the complex outputs in Fig. 5(g) and 5(h) to the time domain, these residual patterns will be cancelled by the overlap-and-add operation in iSTFT due to their opposite phases. In the last row, we apply iSTFT and STFT in turn to the organized complex outputs, and the new results almost perfectly match the clean sources.
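
The step of organizing the default outputs according to frame-level labels can be sketched in a few lines of Python. This is only an illustration under simple assumptions: `output1` and `output2` are hypothetical lists of per-frame spectra (strings stand in for spectra here), and `labels` holds one 0/1 label per frame, where 1 means the two outputs should be swapped in that frame.

```python
def organize_by_labels(output1, output2, labels):
    """Route each frame of the two default outputs to a speaker stream.

    A label of 0 keeps the default pairing for that frame, and a label of 1
    swaps the two outputs, so each stream follows one speaker over time.
    """
    stream1, stream2 = [], []
    for frame1, frame2, label in zip(output1, output2, labels):
        if label == 0:
            stream1.append(frame1)
            stream2.append(frame2)
        else:
            stream1.append(frame2)
            stream2.append(frame1)
    return stream1, stream2


# Striped default outputs: the speakers alternate between the two output layers.
output1 = ["A0", "B1", "A2", "B3"]
output2 = ["B0", "A1", "B2", "A3"]
labels = [0, 1, 0, 1]
speaker_a, speaker_b = organize_by_labels(output1, output2, labels)
```

The sketch stops at the organized frame sequences; resynthesis through iSTFT with overlap-and-add, which is where the residual patterns cancel, is not shown.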

Simultaneous and sequential grouping are optimized in turn in the above deep CASA systems. We now consider joint optimization, where the two stages are trained together with small learning rates (1/8 of the initial learning rates) for 40 epochs. For the simultaneous grouping module, we organize the outputs using estimated K-means labels, and compare them with the clean sources to form an SNR objective. Meanwhile, the sequential grouping module is trained using the weighted objective in Eq. 13. As joint training unfolds, we observe smoother outputs. Joint optimization introduces slight but consistent improvement in all three metrics (0.1 dB SDR, 0.02 PESQ, and 0.3% ESTOI).

Finally, Table VI compares the deep CASA system with joint optimization and other state-of-the-art talker-independent methods on WSJ0-2mix OC. For all methods, we list the best reported results, and leave unreported fields blank. The numbers of parameters in different methods are estimated according to the papers. The uPIT system [21] is the basis of this study. TasNet [27] extends uPIT to the waveform domain, where a TCN is utilized for separation. We have also trained a similar uPIT TCN in this work. However, due to the different domains of signal representation, our uPIT TCN yields slightly worse results than TasNet, which suggests that better performance may be achieved by extending the deep CASA framework to the time domain. In [39], a phase prediction network is trained on top of a DC network. It yields high PESQ. FurcaNeXt [32] produces very high SDR. The deep CASA system generates slightly lower SDR, but has far fewer parameters. In addition, deep CASA yields the best results in terms of SI-SNR, PESQ and ESTOI. The last three rows present the results of the IBM, IRM and ideal phase-sensitive mask (PSM) with the STFT configuration in Section IV-A. Deep CASA systematically outperforms the ideal masks in terms of SDR and SI-SNR. However, there is still room for improvement in terms of PESQ.
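
For readers unfamiliar with SI-SNR, the following Python sketch computes it under a commonly used definition: both signals are made zero-mean, the estimate is projected onto the reference, and the energy ratio of the projection to the residual is reported in dB. This definition is an assumption here; the evaluation code behind Table VI may differ in detail.

```python
import math


def si_snr(estimate, reference):
    """Scale-invariant SNR in dB between two equal-length sample sequences."""
    n = len(reference)
    mean_est = sum(estimate) / n
    mean_ref = sum(reference) / n
    est = [x - mean_est for x in estimate]
    ref = [x - mean_ref for x in reference]

    # Project the estimate onto the reference to obtain the target component.
    scale = sum(e * r for e, r in zip(est, ref)) / sum(r * r for r in ref)
    target = [scale * r for r in ref]
    noise = [e - t for e, t in zip(est, target)]

    target_energy = sum(t * t for t in target)
    noise_energy = sum(v * v for v in noise)
    if noise_energy == 0.0:
        return math.inf  # perfect reconstruction up to scale
    return 10.0 * math.log10(target_energy / noise_energy)


# Toy signals: rescaling the estimate leaves the score unchanged.
reference = [1.0, -1.0, 1.0, -1.0]
estimate = [1.0, -1.0, 1.0, -0.5]
score = si_snr(estimate, reference)
score_rescaled = si_snr([2.0 * x for x in estimate], reference)
```

Because of the projection step, a global gain applied to the estimate does not affect the score, which is the property that distinguishes SI-SNR from plain SNR.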

V Concluding remarks

We have proposed a deep CASA approach to talker-independent monaural speaker separation. Simultaneous grouping is first conducted to separate two speakers at the frame level. Sequential grouping is then employed to stream separated frame-level spectra into two sources. The deep CASA algorithm optimizes frame-level separation and speaker tracking in turn in the two-stage framework, leading to much better performance than DC and PIT. Our contributions also include novel techniques such as complex ratio masking, SNR objectives, Dense-UNet with frequency mapping layers and TCN with dropDilation. Experimental results on the benchmark WSJ0-2mix dataset show that the proposed algorithm produces the state-of-the-art results, with a modest model size.

A major difference between our sequential grouping stage and deep clustering is that embedding operates at the T-F unit level in DC, and at the frame level in deep CASA. There are several advantages to our approach. First, DC excels at speaker tracking due to clustering, but it is not better than ratio masking for frame-level separation. Therefore, divide and conquer is a natural choice. Second, deep CASA is more flexible. Almost all DC based algorithms are built on time-frequency processing. Our sequential grouping works on frame-level outputs, which can be produced by estimating magnitude STFT, complex masks, or even time-domain signals. In addition, we reduce the computational complexity of clustering, because clustering is performed over frame-level embeddings in deep CASA rather than over every T-F unit as in DC.

Although the deep CASA algorithm is formulated for two speakers, it can be extended to three or more speakers. First, additional output layers can be added in the simultaneous grouping stage. In sequential grouping, we can employ the setup in [25] to predict one embedding vector for each frame-level spectral estimate. A constrained K-means algorithm can then be used to assign each frame-level embedding to a different speaker.


  • [1] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
  • [2] S. Bai, J. Z. Kolter, and V. Koltun, “An empirical evaluation of generic convolutional and recurrent networks for sequence modeling,” arXiv preprint arXiv:1803.01271, 2018.
  • [3] A. Bregman, Auditory scene analysis. Cambridge MA: MIT Press, 1990.
  • [4] D. S. Brungart, “Informational and energetic masking effects in the perception of two simultaneous talkers,” J. Acoust. Soc. Amer., vol. 109, pp. 1101–1109, 2001.
  • [5] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in Proc. CVPR, 2017, pp. 1251–1258.
  • [6] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by exponential linear units (ELUs),” in Proc. ICLR, 2016.
  • [7] J. Du, Y. Tu, Y. Xu, L. R. Dai, and C. H. Lee, “Speech separation of a target speaker based on deep neural networks,” in Proc. ICSP, 2014, pp. 65–68.
  • [8] H. Erdogan, J. R. Hershey, and S. Watanabe, “Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks,” in Proc. ICASSP, 2015, pp. 708–712.
  • [9] Y. Gal and Z. Ghahramani, “A theoretically grounded application of dropout in recurrent neural networks,” in Proc. NIPS, 2016, pp. 1019–1027.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proc. ICCV, 2015, pp. 1026–1034.
  • [11] E. W. Healy, M. Delfarah, J. L. Vasko, B. L. Carter, and D. L. Wang, “An algorithm to increase intelligibility for hearing impaired listeners in the presence of a competing talker,” J. Acoust. Soc. Amer., vol. 141, pp. 4230–4239, 2017.
  • [12] J. R. Hershey, Z. Chen, J. L. Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in Proc. ICASSP, 2016, pp. 31–35.
  • [13] G. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” arXiv preprint arXiv:1207.0580, 2012.
  • [14] K. Hu and D. L. Wang, “An unsupervised approach to cochannel speech separation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 21, pp. 122–131, 2013.
  • [15] G. Huang, Z. Liu, L. Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proc. CVPR, 2017, pp. 4700–4708.
  • [16] P. S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, “Deep learning for monaural speech separation,” in Proc. ICASSP, 2014, pp. 1562–1566.
  • [17] ITU-T, “Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs,” Recommendation P.862, 2001.
  • [18] A. Jansson, E. Humphrey, N. Montecchio, R. Bittner, A. Kumar, and T. Weyde, “Singing voice separation with deep U-Net convolutional networks,” in Proc. ISMIR, 2017.
  • [19] J. Jensen and C. H. Taal, “An algorithm for predicting the intelligibility of speech masked by modulated noise maskers,” IEEE Trans. Audio, Speech, Lang. Process., vol. 24, pp. 2009–2022, 2016.
  • [20] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. ICLR, 2015.
  • [21] M. Kolbæk, D. Yu, Z. H. Tan, and J. Jensen, “Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, pp. 1901–1913, 2017.
  • [22] C. Lea, R. Vidal, A. Reiter, and G. D. Hager, “Temporal convolutional networks: A unified approach to action segmentation,” in Proc. ECCV, 2016, pp. 47–54.
  • [23] C. Li, L. Zhu, S. Xu, P. Gao, and B. Xu, “CBLDNN-based speaker-independent speech separation via generative adversarial training,” in Proc. ICASSP, 2018, pp. 711–715.
  • [24] Z.-X. Li, Y. Song, L.-R. Dai, and I. McLoughlin, “Listening and grouping: an online autoregressive approach for monaural speech separation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 27, pp. 692–703, 2019.
  • [25] Y. Liu and D. L. Wang, “A CASA approach to deep learning based speaker-independent co-channel speech separation,” in Proc. ICASSP, 2018, pp. 5399–5403.
  • [26] Y. Luo, Z. Chen, and N. Mesgarani, “Speaker-independent speech separation with deep attractor network,” IEEE Trans. Audio, Speech, Lang. Process., vol. 26, pp. 787–796, 2018.
  • [27] Y. Luo and N. Mesgarani, “TasNet: Surpassing ideal time-frequency masking for speech separation,” arXiv preprint arXiv:1809.07454, 2018.
  • [28] S. Merity, N. S. Keskar, and R. Socher, “Regularizing and optimizing LSTM language models,” arXiv preprint arXiv:1708.02182, 2017.
  • [29] A. Pandey and D. L. Wang, “TCNN: temporal convolutional neural network for real-time speech enhancement in the time domain,” in Proc. ICASSP, 2019.
  • [30] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Proc. MICCAI, 2015, pp. 234–241.
  • [31] S. Semeniuta, A. Severyn, and E. Barth, “Recurrent dropout without memory loss,” arXiv preprint arXiv:1603.05118, 2016.
  • [32] Z. Shi, H. Lin, L. Liu, R. Liu, and J. Han, “FurcaNeXt: End-to-end monaural speech separation with dynamic gated dilated temporal convolutional networks,” arXiv preprint arXiv:1902.04891, 2019.
  • [33] N. Takahashi and Y. Mitsufuji, “Multi-scale multi-band DenseNets for audio source separation,” in Proc. WASPAA, 2017, pp. 21–25.
  • [34] E. Vincent, R. Gribonval, and C. Févotte, “Performance measurement in blind audio source separation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 14, pp. 1462–1469, 2006.
  • [35] D. L. Wang and G. Brown, Eds., Computational Auditory Scene Analysis: Principles, Algorithms and Applications. Wiley-IEEE Press, 2006.
  • [36] Y. Wang, A. Narayanan, and D. L. Wang, “On training targets for supervised speech separation,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, pp. 1849–1858, 2014.
  • [37] Z.-Q. Wang, J. L. Roux, and J. R. Hershey, “Alternative objective functions for deep clustering,” in Proc. ICASSP, 2018, pp. 686–690.
  • [38] Z.-Q. Wang, J. L. Roux, D. L. Wang, and J. R. Hershey, “End-to-end speech separation with unfolded iterative phase reconstruction,” in Proc. Interspeech, 2018, pp. 2708–2712.
  • [39] Z.-Q. Wang, K. Tan, and D. L. Wang, “Deep learning based phase reconstruction for speaker separation: A trigonometric perspective,” arXiv preprint arXiv:1811.09010, 2018.
  • [40] D. S. Williamson, Y. Wang, and D. L. Wang, “Complex ratio masking for monaural speech separation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 24, pp. 483–492, 2016.
  • [41] C. Xu, W. Rao, X. Xiao, E. S. Chng, and H. Li, “Single channel speech separation with constrained utterance level permutation invariant training using grid LSTM,” in Proc. ICASSP, 2018, pp. 6–10.
  • [42] X. L. Zhang and D. L. Wang, “A deep ensemble learning method for monaural speech separation,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 24, pp. 967–977, 2016.