Phasebook and Friends: Leveraging Discrete Representations for Source Separation

by Jonathan Le Roux, et al.

Deep learning based speech enhancement and source separation systems have recently reached unprecedented levels of quality, to the point that performance is reaching a new ceiling. Most systems rely on estimating the magnitude of a target source by estimating a real-valued mask to be applied to a time-frequency representation of the mixture signal. A limiting factor in such approaches is a lack of phase estimation: the phase of the mixture is most often used when reconstructing the estimated time-domain signal. Here, we propose “MagBook”, “phasebook”, and “Combook”, three new types of layers based on discrete representations that can be used to estimate complex time-frequency masks. MagBook layers extend classical sigmoidal units and a recently introduced convex softmax activation for mask-based magnitude estimation. Phasebook layers use a similar structure to give an estimate of the phase mask without suffering from phase wrapping issues. Combook layers are an alternative to the MagBook-Phasebook combination that directly estimate complex masks. We present various training and inference regimes involving these representations, and explain in particular how to include them in an end-to-end learning framework. We also present an oracle study to assess upper bounds on performance for various types of masks using discrete phase representations. We evaluate the proposed methods on the wsj0-2mix dataset, a well-studied corpus for single-channel speaker-independent speaker separation, matching the performance of state-of-the-art mask-based approaches without requiring additional phase reconstruction steps.



I Introduction

The field of speech separation and speech enhancement has witnessed dramatic improvements in performance with the recent advent of deep learning-based techniques [1, 2, 3, 4, 5, 6, 7, 8]. Most of these algorithms rely on the estimation of some sort of time-frequency (T-F) mask to be applied to the time-frequency representation of an input mixture signal, the estimated signal then being resynthesized using some inverse transform. Let us denote by $X_{t,f}$, $S_{t,f}$, and $N_{t,f}$ the complex-valued time-frequency representations of a mixture signal, a target source signal, and an interference signal, respectively, where $t$ denotes the time frame index and $f$ the frequency bin index. We also denote by

$$\theta_{t,f} = \angle S_{t,f} - \angle X_{t,f}$$

the phase difference between the mixture and the target source. The time-frequency representation is here typically taken to be the short-term Fourier transform (STFT), such that

$$X_{t,f} = S_{t,f} + N_{t,f}.$$

The goal of speech enhancement or separation can be formulated as that of recovering an estimate $\hat{S}_{t,f}$ of $S_{t,f}$ from $X_{t,f}$, and we are interested in particular in algorithms that do so by estimating a mask $\hat{M}_{t,f}$ such that $\hat{S}_{t,f} = \hat{M}_{t,f} X_{t,f}$. Note that the interference signal itself could also be a separate target, such as in the case of speaker separation.

In most cases, these time-frequency masks are real-valued, which means that they only modify the magnitude of the mixture in order to recover the target signal. Their values are also typically constrained to lie between 0 and 1, both for simplicity and because this was found to work well under the assumption that only the magnitude is modified, retaining the mixture phase for resynthesis.

Several reasons can be cited for focusing on modifying only the magnitude: (1) the noisy phase is actually the minimum mean-squared error (MMSE) estimate [9] under some simplistic statistical independence assumptions (which typically do not hold in practice); (2) combining the noisy phase with a good estimate of the magnitude is straightforward and gives somewhat satisfactory results; (3) until recently, getting a good estimate of the magnitude was already difficult enough that optimizing the phase estimate was not a priority, or in other words, phase was not the limiting factor in performance; and (4) estimating the phase of the target signal is believed to be a hard problem.

With the advent of recent deep learning algorithms, the quality of the magnitude estimates has improved significantly, to the point that the noisy phase has now become a limiting factor to the overall performance. Because the noisy phase is typically inconsistent with the estimated magnitude [10, 11], the reconstructed time-domain signal has a different magnitude spectrogram from the intended, estimated one. As an added drawback, further improving the magnitude estimate by making it closer to the true target magnitude may actually lead to worse results when pairing it with the noisy phase, in terms of performance measures such as signal-to-noise ratio (SNR). Indeed, if the noisy phase is incorrect and for example opposite to the true phase, using 0 as the estimate for the magnitude is a “better” choice than using the correct magnitude value, which may point far away in the wrong direction. Using the noisy phase is thus not only sub-optimal as a phase estimate, it likely also forces the magnitude estimation algorithms to limit their accuracy with respect to the true magnitude.

Fig. 1: Illustration of the complex mask estimates obtained when using the noisy/mixture phase. The closest point to the clean source along the line of estimates with phase equal to that of the mixture is 0, whose magnitude is very different from the true clean magnitude. The point along that line with true clean magnitude lies further from the clean source.

Together with the simplicity of using logistic sigmoid output activations, use of the mixture phase is in particular one of the reasons why mask estimation algorithms typically do not attempt to estimate mask values larger than 1. Indeed, such values are expected to occur in regions where there was a cancelling interference between the sources, and it is likely that the noisy phase is a bad estimate there; increasing the magnitude without fixing the phase is thus likely to bring the estimate further away from the target, compared to where the original mixture was in the first place. These issues are illustrated in Fig. 1, where for simplicity we only consider the case of a single T-F bin in the complex plane, and we omit the time-frequency subscripts $t,f$. The phase-sensitive filter (PSF) estimate corresponds to the orthogonal projection of the clean source on the line defined by the mixture [2]; because of the cancelling interference, the PSF estimate here lies in the direction opposite to the mixture. The truncated PSF estimate, where the mask is constrained to lie in $[0, 1]$, is thus equal to 0 here. The ideal amplitude mask (IAM) estimate, which has the correct clean magnitude, is further from the clean source than either 0 or the PSF estimate.
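The geometry of Fig. 1 can be checked numerically on a single T-F bin. The following is a minimal sketch, not taken from the paper: the source and interference values are hypothetical and chosen only so that the interference partially cancels the source, and the variable names are illustrative.

```python
import cmath

# Hypothetical single T-F bin with cancelling interference (illustrative values).
s = cmath.rect(1.0, 0.75 * cmath.pi)    # clean source
n = cmath.rect(1.2, -0.10 * cmath.pi)   # interference
x = s + n                               # mixture

ratio = s / x                           # ideal complex mask for this bin
psf_mask = ratio.real                   # |s|/|x| * cos(phase difference)
tpsf_mask = min(1.0, max(0.0, psf_mask))  # PSF mask truncated to [0, 1]
iam_mask = abs(ratio)                   # |s|/|x|

# All three estimates lie on the line defined by the mixture.
dist_psf = abs(s - psf_mask * x)
dist_tpsf = abs(s - tpsf_mask * x)
dist_iam = abs(s - iam_mask * x)

print(f"masks: PSF={psf_mask:.3f} TPSF={tpsf_mask:.3f} IAM={iam_mask:.3f}")
print(f"distance to clean source: PSF={dist_psf:.3f} "
      f"TPSF={dist_tpsf:.3f} IAM={dist_iam:.3f}")
```

With these values the PSF mask is negative, its truncated version is 0, and the IAM estimate, although it has the correct magnitude, ends up furthest from the clean source.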

By improving upon the noisy phase, we could thus unshackle magnitude estimation algorithms and allow them to attempt bolder estimates that are also more faithful to the true reference magnitude, unlocking new heights in performance. In particular, it would now be worth attempting to involve mask estimates that are allowed to go beyond 1. For example, one may consider estimating the IAM mentioned above, or a version of it truncated at some upper threshold. One may also consider estimating a discretized magnitude mask, where the discrete values are not restricted to lie in $[0, 1]$. An example of typical distributions for the magnitude and phase components of the ideal complex mask (the ratio of the clean source to the mixture), with the magnitude truncated, is shown in Fig. 2. It is clear that a significant proportion of the magnitude mask data lies strictly above 1.

We have already started exploring this regime for the magnitude, with the introduction of a convex softmax activation function which interpolates between the values $\{0, 1, 2\}$ to obtain a continuous representation of the interval $[0, 2]$ as the target interval for the magnitude mask [12]. We showed that this activation function led to significantly better performance when optimizing for best reconstruction after a phase reconstruction algorithm. This intuitively makes sense, because the reconstructed phase used to obtain the final time-domain signal is likely to better exploit a magnitude estimate more faithful to the clean magnitude, in particular at time-frequency bins where the clean magnitude is larger than the mixture magnitude due to cancelling interference.
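As an illustration of the convex softmax idea, here is a minimal sketch assuming the interpolated values are 0, 1, and 2. The function name `convex_softmax` and the example logits are hypothetical; the code only shows the softmax-weighted average, not the network around it.

```python
import math

def convex_softmax(logits, values=(0.0, 1.0, 2.0)):
    """Softmax-weighted average of fixed values; the output lies in [min, max] of values."""
    shift = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return sum(v * e / total for v, e in zip(values, exps))

print(convex_softmax([0.0, 0.0, 0.0]))    # equal weights give the middle value
print(convex_softmax([-2.0, 0.5, 3.0]))   # mass on the last value pushes the mask above 1
```

Because the output is a convex combination of the listed values, it can exceed 1, which a plain sigmoid output cannot do.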

Fig. 2: Phase and magnitude distributions of the ideal complex mask, with the magnitude truncated.

We propose here a generalization of this idea of relying on discrete values to build representations for the masks. We extend the concept of convex softmax activation for the magnitude to the combination of a magnitude codebook, or MagBook, with a softmax layer to build various magnitude representations, either discrete or continuous. Similarly, we propose to combine a phase codebook, or phasebook, with a softmax layer to build various phase representations, again either discrete or continuous. Finally, we propose an alternate representation which foregoes the factorization between magnitude and phase and combines a complex codebook, or Combook, with a softmax layer to build various complex mask representations. These representations are flexible and can be incorporated within optimization frameworks that are regression-based, classification-based, or a combination of both.

Related works:

This paper’s contributions are at the intersection of multiple directions of research: classification-based separation, discrete phase representations, complex mask estimation, phase-difference modelling, and phase reconstruction. The idea of considering separation as a classification problem was explored first using shallow methods, in particular support vector machines [13, 14, 15], and later deep neural networks [16], and was arguably at the onset of the deep learning revolution in this field. A few works have proposed to consider discrete representations of the phase for source separation, such as [17] and [18], in both cases within a generative model based on mixtures of Gaussians. Some works have attempted to incorporate phase modeling for deep-learning-based source separation, in particular with the so-called complex ratio mask [19], which does consider ranges of values that are not limited to $[0, 1]$. While the complex ratio mask used a continuous real-imaginary representation, we here focus mainly on discrete representations involving a magnitude-phase factorization or a direct modelling of the complex value (with the real and imaginary parts considered jointly). We also model not the clean phase but a phase mask, that is, a phase difference between the mixture and the clean source, or in other words a correction to be applied to the mixture phase to get closer to the clean phase. Estimating the phase difference was recently considered within an audio-visual separation framework in [20], where it is reconstructed using a convolutional network that takes the estimated magnitude and the noisy phase as input. Another, potentially complementary, way to improve the phase is to use phase reconstruction. Recent works from our team applied phase reconstruction at the output of a good magnitude estimation network [8], then trained through an unfolded iterative phase reconstruction algorithm [12]. We finally trained the time-frequency representations used within the phase reconstruction algorithm themselves [21], which is the current state-of-the-art in methods relying on time-frequency representations.

As we were finalizing this article, two related works worth mentioning were published. First, a deep-learning-based source separation algorithm, referred to as PhaseNet [22], attempts to estimate discretized values of the target source phase; the discretized values are fixed to a uniform quantization along the unit circle, and the network is trained using cross-entropy. As will become clear in this article, apart from the fact that PhaseNet attempts to estimate the target phase instead of the phase difference, the representation used corresponds to a particular setup of our framework, with a fixed uniform phasebook, cross-entropy training, and argmax-based inference. Our framework allows for much more variety in both training and inference regimes, in particular allowing fully end-to-end training, which is cumbersome with argmax inference. Second, an updated version of the TasNet algorithm [23] just established a new state-of-the-art on the wsj0-2mix dataset, surpassing our previous numbers as well as those presented in this article. The TasNet article introduced several interesting techniques that could be adopted in our framework, such as the use of convolution layers instead of recurrent ones, layer normalization schemes, and the use of SI-SDR as the objective instead of the waveform approximation loss that we consider. It is unclear how much these techniques would influence the performance of TasNet’s competing methods, and we shall consider incorporating them in our framework as future work.

II Designing masks based on discrete representations

We propose to rely on discrete values to build representations for a complex ratio mask, either via its factorization into magnitude and phase components or directly as a complex value. In particular, we propose to model the magnitude mask using a combination of a magnitude codebook, or MagBook, with a softmax layer, and to model the phase mask (i.e., the correction term between mixture phase and clean phase) using a combination of a phase codebook, or phasebook, with a softmax layer. Alternatively, we consider modelling the complex ratio mask directly using a combination of a complex codebook, or Combook, with a softmax layer; magnitude and phase are then modelled jointly.

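We consider scalar codebooks $\mathcal{C}^{(m)} = \{m^{(1)}, \dots, m^{(K_m)}\}$ for the magnitude mask, $\mathcal{C}^{(\theta)} = \{\theta^{(1)}, \dots, \theta^{(K_\theta)}\}$ for the phase mask, and $\mathcal{C}^{(c)} = \{c^{(1)}, \dots, c^{(K_c)}\}$ for the complex mask. At each time-frequency bin $(t,f)$, a network can estimate softmax probability vectors for the magnitude mask, the phase mask, or the complex mask, denoted by

$$p^{(m)}_{t,f}(O; \Phi) \in \Delta^{K_m - 1}, \tag{1}$$
$$p^{(\theta)}_{t,f}(O; \Phi) \in \Delta^{K_\theta - 1}, \tag{2}$$
$$p^{(c)}_{t,f}(O; \Phi) \in \Delta^{K_c - 1}, \tag{3}$$

where $O$ denotes the input features, $\Phi$ the network parameters, and $\Delta^{n}$ is the unit $n$-simplex. We consider several options for using these softmax layer output vectors in order to build a final output, either as probabilities, to sample a value or to select the most likely one, or as weights within some interpolation scheme:

  • select the one-best (“argmax”):

$$\hat{m}_{t,f} = m^{(\hat{k})}, \quad \hat{k} = \arg\max_k \, p^{(m)}_{t,f}[k], \tag{4}$$
$$\hat{\theta}_{t,f} = \theta^{(\hat{k})}, \quad \hat{k} = \arg\max_k \, p^{(\theta)}_{t,f}[k], \tag{5}$$
$$\hat{c}_{t,f} = c^{(\hat{k})}, \quad \hat{k} = \arg\max_k \, p^{(c)}_{t,f}[k]; \tag{6}$$

  • sample from the softmax distribution (“sampling”):

$$\hat{m}_{t,f} = m^{(k)}, \quad k \sim p^{(m)}_{t,f}, \tag{7}$$
$$\hat{\theta}_{t,f} = \theta^{(k)}, \quad k \sim p^{(\theta)}_{t,f}, \tag{8}$$
$$\hat{c}_{t,f} = c^{(k)}, \quad k \sim p^{(c)}_{t,f}; \tag{9}$$

  • interpolate over the distribution (“interpolation”):

$$\hat{m}_{t,f} = \sum_k p^{(m)}_{t,f}[k] \, m^{(k)}, \tag{10}$$
$$\hat{\theta}_{t,f} = \angle \sum_k p^{(\theta)}_{t,f}[k] \, e^{j\theta^{(k)}}, \tag{11}$$
$$\hat{c}_{t,f} = \sum_k p^{(c)}_{t,f}[k] \, c^{(k)}. \tag{12}$$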

Note that the interpolation for the phase in Eq. (11) is performed in the complex domain and that taking the angle implies a renormalization step; this interpolation is illustrated in Fig. 3. Further note that the interpolation regime for the magnitude is an extension of the classical sigmoid activation function for the case of a fixed MagBook of size 2 with elements $\{0, 1\}$ (referred to here as Uniform MagBook 2), and an extension of the convex softmax considered in [12] for the case of a fixed MagBook of size 3 with elements $\{0, 1, 2\}$ (referred to here as Uniform MagBook 3).
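The three regimes can be sketched for a single T-F bin as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the uniform phasebook is assumed to hold the angles $2\pi k/K$, and the probability vectors stand in for softmax outputs of a network.

```python
import cmath
import math
import random

def uniform_phasebook(size):
    """Angles 2*pi*k/size for k = 0..size-1 (an assumed uniform layout)."""
    return [2.0 * math.pi * k / size for k in range(size)]

def select_argmax(probs, codebook):
    """One-best regime: return the codebook value with the highest probability."""
    best = max(range(len(probs)), key=lambda k: probs[k])
    return codebook[best]

def select_sample(probs, codebook, rng):
    """Sampling regime: draw one codebook value according to the probabilities."""
    return rng.choices(codebook, weights=probs, k=1)[0]

def interpolate_magnitude(probs, magbook):
    """Interpolation regime for the magnitude: expected codebook value."""
    return sum(p * m for p, m in zip(probs, magbook))

def interpolate_phase(probs, phasebook):
    """Interpolation regime for the phase: angle of the expected unit phasor."""
    mean = sum(p * cmath.exp(1j * th) for p, th in zip(probs, phasebook))
    return cmath.phase(mean)   # taking the angle discards the modulus

phasebook = uniform_phasebook(8)
probs = [0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(interpolate_phase(probs, phasebook))            # halfway between the first two angles
print(select_sample(probs, phasebook, random.Random(0)))
```

Only the interpolation regime varies smoothly with the probabilities, which is what makes it convenient for end-to-end training.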

Fig. 3: Illustration of the phase interpolation regime for a uniform phasebook with 8 elements. Softmax probabilities are displayed via the surface of each circle.

In the following, we shall call “phasebook layer” a layer computing phase values based on the outputs of a softmax layer and a phasebook via a method such as those above, and similarly for a “MagBook layer” and a “Combook layer”.

There are multiple motivations for using such representations. For both magnitude and phase, the combination of a discrete codebook with a softmax layer leads to a very flexible framework, where one can define both discrete and continuous representations which can be involved in both classification-based and regression-based optimization frameworks. The continuous representations may lead to more accurate estimates, or be easier to include within an end-to-end training scheme. On the other hand, the discrete representations open the possibility to consider conditional probability relationships across variables combined with the chain rule, and may also avoid regression issues, for example where the estimated value is an interpolation of two values with high probability but itself has low probability. For the magnitude specifically, as mentioned above, this representation provides a way to generalize classical activations. For the phase specifically, relying on discrete values makes it possible to design simple representations that take into account phase wrapping, that is, the fact that any measure of difference between phase values should be considered modulo $2\pi$. Indeed, if the phasebook values are used as is, either via sampling or selection, there is no need to introduce a notion of proximity between various values; if the phasebook values are used within an interpolation scheme in the complex domain such as in Eq. (11), then the phase is defined by its location around the unit circle, varies continuously with the softmax probabilities, and values such as $-\pi + \epsilon$ and $\pi - \epsilon$ for small $\epsilon > 0$ can be obtained with softmax probabilities that are close to each other. This would not be the case if one for example modelled phase via a linear transformation of a logistic sigmoid function $\sigma$, such as $2\pi\sigma(z) - \pi$: then, $-\pi + \epsilon$ and $\pi - \epsilon$ would be represented internally by the network via values very far from each other. Regarding phase, note that one could use the same representation to directly model the clean phase instead of a phase difference, or on top of it and then combine the two estimates.
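The wrapping argument can be illustrated with two nearby probability vectors whose interpolated phases fall on either side of the wrapping point. This is a small sketch with hypothetical names and an assumed uniform phasebook of 8 angles; it is not the paper's implementation.

```python
import cmath
import math

# Assumed uniform phasebook with 8 angles; index 4 corresponds to the angle pi.
phasebook = [2.0 * math.pi * k / 8 for k in range(8)]

def interpolate_phase(probs, book):
    """Angle of the probability-weighted sum of unit phasors, in (-pi, pi]."""
    return cmath.phase(sum(p * cmath.exp(1j * th) for p, th in zip(probs, book)))

def wrapped_distance(a, b):
    """Absolute phase difference taken modulo 2*pi."""
    return abs(cmath.phase(cmath.exp(1j * (a - b))))

# Two probability vectors that differ only by moving 0.1 of mass to a neighbour.
probs_below = [0.0, 0.0, 0.0, 0.1, 0.9, 0.0, 0.0, 0.0]
probs_above = [0.0, 0.0, 0.0, 0.0, 0.9, 0.1, 0.0, 0.0]

phase_below = interpolate_phase(probs_below, phasebook)   # just under +pi
phase_above = interpolate_phase(probs_above, phasebook)   # just over -pi

print(phase_below, phase_above)
print(abs(phase_below - phase_above))                # large if wrapping is ignored
print(wrapped_distance(phase_below, phase_above))    # small on the unit circle
```

The two outputs are almost $2\pi$ apart as raw numbers but close on the unit circle, and the probabilities that produce them are close as well.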

III Phasebook in the argmax regime

To get an idea of the potential benefits of a better phase modeling, we first consider the argmax regime for the phase mask, in which the system attempts to select the best codebook value at each T-F bin.

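Given a phasebook $\mathcal{C}^{(\theta)} = \{\theta^{(1)}, \dots, \theta^{(K_\theta)}\}$, the goal of our system is to estimate at each T-F bin $(t,f)$ the codebook index $\hat{k}_{t,f}$ such that:

$$\hat{k}_{t,f} = \arg\min_k \left| S_{t,f} - \hat{m}_{t,f} \, e^{j\theta^{(k)}} X_{t,f} \right|, \tag{13}$$

where $S_{t,f}$ and $X_{t,f}$ are the T-F representations of the target and the mixture, and $\hat{m}_{t,f} > 0$ is some estimate for the magnitude of the mask. Expanding the squared error as $|S_{t,f}|^2 + \hat{m}_{t,f}^2 |X_{t,f}|^2 - 2 \hat{m}_{t,f} |S_{t,f}| |X_{t,f}| \cos(\theta_{t,f} - \theta^{(k)})$, with $\theta_{t,f} = \angle S_{t,f} - \angle X_{t,f}$ the phase difference between target and mixture, shows that the estimation is in fact independent of the magnitude mask value:

$$\hat{k}_{t,f} = \arg\max_k \cos\left(\theta_{t,f} - \theta^{(k)}\right). \tag{14}$$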
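The independence from the magnitude mask can be checked numerically. The sketch below compares an error-based assignment with a cosine-based one on a single bin; the function names, the bin values, and the uniform phasebook are assumptions made for illustration.

```python
import cmath
import math

def best_index_by_error(s, x, mag, phasebook):
    """Index minimizing |s - mag * exp(j*theta_k) * x| over the phasebook."""
    return min(range(len(phasebook)),
               key=lambda k: abs(s - mag * cmath.exp(1j * phasebook[k]) * x))

def best_index_by_cosine(s, x, phasebook):
    """Index maximizing cos(theta - theta_k), theta being the source/mixture phase difference."""
    theta = cmath.phase(s) - cmath.phase(x)
    return max(range(len(phasebook)), key=lambda k: math.cos(theta - phasebook[k]))

# Hypothetical bin and an assumed uniform phasebook with 8 elements.
s = cmath.rect(1.0, 0.75 * cmath.pi)
x = s + cmath.rect(1.2, -0.10 * cmath.pi)
phasebook = [2.0 * math.pi * k / 8 for k in range(8)]

indices = [best_index_by_error(s, x, mag, phasebook) for mag in (0.5, 1.0, 2.0)]
print(indices, best_index_by_cosine(s, x, phasebook))
```

For any positive magnitude value the same phasebook index is selected, matching the cosine criterion.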


III-A Codebook optimization

An important question is how to best design the phasebook. An obvious and easy choice is to use regularly spaced values. But ideally, one would like to optimize them for best performance on some training data. This can be done independently of the classification system, or together with it, optimizing both the phasebook and the classification system jointly in an end-to-end fashion. We first consider how to optimize the codebook offline in a pre-training step, for optimal performance given a magnitude estimate. That magnitude estimate may be obtained either with a pre-trained magnitude estimation network, or with an oracle mask.

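The objective function for the phasebook training is:

$$\mathcal{L}(\mathcal{C}^{(\theta)}) = \sum_{t,f} \min_k \left| S_{t,f} - \hat{m}_{t,f} \, e^{j\theta^{(k)}} X_{t,f} \right|^2, \tag{15}$$

where $S_{t,f}$ and $X_{t,f}$ denote the target and mixture T-F representations, $\hat{m}_{t,f}$ the magnitude mask estimate, and $\theta^{(k)}$ the phasebook values. It can be optimized using an EM-like algorithm. In the E-step, the optimal codebook assignments $\hat{k}_{t,f}$ are computed for each T-F bin according to Eq. (14). In the M-step, we update the phasebook to further decrease the objective function by solving

$$\theta^{(k)} \leftarrow \arg\min_\theta \sum_{(t,f): \hat{k}_{t,f} = k} \left| S_{t,f} - \hat{m}_{t,f} \, e^{j\theta} X_{t,f} \right|^2, \tag{16}$$

which can easily be shown to be equivalent to

$$\theta^{(k)} \leftarrow \arg\max_\theta \sum_{(t,f): \hat{k}_{t,f} = k} \hat{m}_{t,f} |X_{t,f}| |S_{t,f}| \cos\left(\theta_{t,f} - \theta\right), \tag{17}$$

with $\theta_{t,f} = \angle S_{t,f} - \angle X_{t,f}$, leading to the following update equation:

$$\theta^{(k)} \leftarrow \angle \sum_{(t,f): \hat{k}_{t,f} = k} \hat{m}_{t,f} |X_{t,f}| |S_{t,f}| \, e^{j\theta_{t,f}}. \tag{18}$$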
Note that a MagBook could be similarly (and jointly) optimized under an argmax regime, at each step looping in order over the updates of the MagBook values, the MagBook assignments, the phasebook assignments, and the phasebook values, the latter two as described above. Finally, optimization of a Combook under an argmax regime can be simply obtained via the k-means algorithm.
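The EM-like phasebook optimization can be sketched as a weighted circular k-means. This is an illustrative sketch rather than the authors' code: each bin is summarized by its phase difference and a non-negative weight (assumed here to be the product of the magnitude mask estimate, the mixture magnitude, and the source magnitude), and the data, names, and number of epochs are hypothetical.

```python
import cmath
import math
import random

def assign(theta, book):
    """E-step for one bin: index of the phasebook value closest on the unit circle."""
    return max(range(len(book)), key=lambda k: math.cos(theta - book[k]))

def weighted_cosine_score(thetas, weights, book):
    """Quantity that the EM-like updates increase (higher means lower squared error)."""
    return sum(w * max(math.cos(t - b) for b in book) for t, w in zip(thetas, weights))

def optimize_phasebook(thetas, weights, book, epochs=20):
    book = list(book)
    for _ in range(epochs):
        assignments = [assign(t, book) for t in thetas]            # E-step
        for k in range(len(book)):                                 # M-step
            acc = sum(w * cmath.exp(1j * t)
                      for t, w, a in zip(thetas, weights, assignments) if a == k)
            if abs(acc) > 0.0:             # keep the old value for empty clusters
                book[k] = cmath.phase(acc)  # weighted circular mean
    return book

# Synthetic phase differences clustered around 0.3 and 2.5 rad (illustrative only).
rng = random.Random(0)
thetas = [rng.gauss(0.3, 0.2) for _ in range(200)] + [rng.gauss(2.5, 0.3) for _ in range(100)]
weights = [rng.uniform(0.1, 1.0) for _ in thetas]

initial = [0.0, math.pi]                   # uniform phasebook with 2 elements
optimized = optimize_phasebook(thetas, weights, initial)
print(optimized)
```

Starting from the uniform phasebook, the two codebook values move toward the two cluster centers and the score does not decrease.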

In our experiments, we optimize the codebooks on a speech separation task using 50 randomly selected utterances from the wsj0-2mix training dataset [4]. Note that we noticed similar behaviors in terms of optimized codebook configurations and separation performance on a speech enhancement task with data from the CHiME2 training set [24]. The initial codebooks can be randomly sampled from the data, or set manually. In the latter case, the phasebooks are initialized using uniform codebooks with values that partition the unit circle into equal angular intervals, making sure that 0 is one of the elements of the codebook: $\theta^{(k)} = 2\pi k / K$ for $k = 0, \dots, K-1$, with $K$ the codebook size. We run the optimization algorithm for a fixed number of epochs, which was enough to ensure convergence. It is likely that the output of the optimization is only a local optimum, and even better codebooks could potentially be obtained by running multiple optimizations with different initializations, but we did not consider this here.

Figure 4 shows the optimized phasebooks for and a magnitude obtained using an oracle IAM magnitude mask, together with the uniform phasebooks they were initialized from.

Fig. 4: Uniform and optimized phasebooks for and an oracle IAM estimate for magnitude, where the radius of each circle is equal to .

III-B Oracle performance

We compare here the performance of various classical masks as well as truncated ratio masks with various truncation thresholds in terms of signal-to-distortion ratio (SDR), which we define here as the scale-invariant signal-to-noise ratio between the target speech and the estimate [25]. The evaluation is performed under oracle conditions (i.e., the mask values are obtained using both the mixture and the true reference signals) on a set of randomly selected files from the wsj0-2mix training set (different from those used for optimizing the phasebooks) [4]. For each mask, we report results where we combined the magnitude part of the mask with the noisy phase, the true phase (i.e., that of the reference), and quantized phases using phasebooks of various sizes, each phasebook being optimized for the particular magnitude mask it is used with, similarly to the algorithm described above. The results are shown in Fig. 5.
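For reference, the scale-invariant measure can be sketched as follows, using the common definition in which the estimate is projected onto the target before computing the energy ratio. The function name `si_sdr` and the toy signals are illustrative, and mean removal is omitted in this sketch.

```python
import math

def si_sdr(estimate, target):
    """Scale-invariant SDR in dB between two equal-length real-valued signals."""
    dot = sum(e * t for e, t in zip(estimate, target))
    target_energy = sum(t * t for t in target)
    alpha = dot / target_energy                      # optimal scaling of the target
    projection = [alpha * t for t in target]
    residual = [e - p for e, p in zip(estimate, projection)]
    signal_power = sum(p * p for p in projection)
    noise_power = sum(r * r for r in residual)
    return 10.0 * math.log10(signal_power / noise_power)

target = [1.0, 0.0, -1.0, 0.0]
estimate = [1.0, 0.1, -1.0, -0.1]
print(si_sdr(estimate, target))
print(si_sdr([3.0 * e for e in estimate], target))   # unchanged by rescaling the estimate
```

Rescaling the estimate leaves the value unchanged, which is the property that distinguishes this measure from a plain SNR.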

The classical masks we investigate are the most popular types of masks that were reviewed and whose oracle performance when paired with the noisy phase was compared in [2]. They include the ideal amplitude mask (IAM), phase sensitive filter (PSF), and its truncated version to $[0, 1]$ (TPSF), all defined in Section I, as well as the ideal binary mask (IBM: $\delta(|S| > |N|)$), ideal ratio mask (IRM: $|S| / (|S| + |N|)$), and Wiener-filter-like mask (WF: $|S|^2 / (|S|^2 + |N|^2)$), where $S$ and $N$ denote the target and the interference at a given T-F bin. All these masks are real-valued, and only modify the magnitude of the mixture signal (except for PSF, which allows a reversal of the phase).
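The oracle masks listed above can be computed for a single T-F bin as in the following sketch, which follows the usual definitions from the mask comparison literature. The function name and the example bin are hypothetical.

```python
def classical_masks(s, n):
    """Oracle real-valued masks for one T-F bin with source s and interference n."""
    x = s + n                    # mixture
    ratio = s / x                # ideal complex mask
    psf = ratio.real             # |s|/|x| * cos(phase difference)
    return {
        "IBM": 1.0 if abs(s) > abs(n) else 0.0,
        "IRM": abs(s) / (abs(s) + abs(n)),
        "WF": abs(s) ** 2 / (abs(s) ** 2 + abs(n) ** 2),
        "IAM": abs(ratio),
        "PSF": psf,
        "TPSF": min(1.0, max(0.0, psf)),
    }

masks = classical_masks(1.0 + 0.0j, 0.5j)
for name, value in masks.items():
    print(f"{name}: {value:.4f}")
```

Only the PSF can become negative, which corresponds to the phase reversal mentioned in the text.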

Fig. 5: Speech SDR for truncated true ratio mask, various classical masks with quantized phase difference, and optimal codebooks for various codebook sizes.

We first notice that, apart from the phase-sensitive masks PSF and TPSF, all masks lead to similar results when paired with the noisy phase. This confirms that the noisy phase drastically limits the performance. As soon as a slightly better estimate of the phase is considered, performance significantly increases, especially for those masks that consider magnitude ratio values above 1. For phases other than the noisy phase, we notice a very big jump in performance when allowing the truncation threshold to go from the classical value of 1 to an only slightly larger value. Interestingly, very small codebook sizes already lead to very high oracle performance. In non-oracle conditions, of course, we need to find the right balance between upper-bound performance and classification accuracy.

Fig. 6 shows results with uniform and optimized phasebooks for truncated ideal amplitude masks. Optimizing the codebooks leads in all cases to significant improvements, with typical gains around 2 to 3 dB.

Fig. 6: Influence of codebook optimization for truncated ratio masks with quantized phase, and for fully quantized masks.

IV Objective functions

We consider the above representations as layers within a deep learning model for source separation, and we need to optimize the parameters of the model under some objective function. We note that the MagBook, phasebook, and Combook themselves can be considered fixed (to uniform or pre-trained values as described in the previous section), or optimized jointly with the rest of the network, with the codebook values considered as part of the network parameters.

We present multiple objective functions for the magnitude and phase components as well as for the complex mask; in practice, these objective functions can be combined with each other within a multi-task learning framework. Note also that, for simplicity, we define here the objective functions on a single source-estimate pair, but the definitions can be straightforwardly extended to the permutation-free training scheme commonly used in speech separation [4, 5, 6].

IV-A Cross-entropy objectives

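Let $k^{(m)}_{t,f}$ denote the reference values for the magnitude mask, and $k^{(\theta)}_{t,f}$ the reference values for the phase mask, which are here the corresponding reference codebook indices. The reference indices for the phase can be obtained using Eq. (14). The reference indices for the magnitude depend on the phase mask that is expected to be used, for example a true reference phase mask as defined above or a current estimate obtained by a network. For a phase mask value $\tilde{\theta}_{t,f}$ at bin $(t,f)$, the corresponding optimal mask magnitude index is obtained as

$$k^{(m)}_{t,f} = \arg\min_k \left| S_{t,f} - m^{(k)} e^{j\tilde{\theta}_{t,f}} X_{t,f} \right|, \tag{19}$$

where $S_{t,f}$ and $X_{t,f}$ are the target and mixture T-F representations and $m^{(k)}$ are the MagBook values.

The reference indices for the complex mask are denoted as $k^{(c)}_{t,f}$ and simply obtained for each T-F bin as the index of the Combook value closest to the ratio mask $S_{t,f} / X_{t,f}$ for some distance in the complex plane.

We can now define an objective function based on the cross-entropy against the oracle codebook assignments for the softmax layer outputs $p^{(m)}_{t,f}$, $p^{(\theta)}_{t,f}$, and $p^{(c)}_{t,f}$ of the MagBook, phasebook, and Combook layers respectively as:

$$\mathcal{L}^{(m)}_{\text{CE}} = -\sum_{t,f} \log p^{(m)}_{t,f}[k^{(m)}_{t,f}], \tag{20}$$
$$\mathcal{L}^{(\theta)}_{\text{CE}} = -\sum_{t,f} \log p^{(\theta)}_{t,f}[k^{(\theta)}_{t,f}], \tag{21}$$
$$\mathcal{L}^{(c)}_{\text{CE}} = -\sum_{t,f} \log p^{(c)}_{t,f}[k^{(c)}_{t,f}]. \tag{22}$$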
If cross-entropy is used for the magnitude, the phase mask used to compute the reference magnitude can either be fixed (to 0, i.e., the mixture phase, to a reference computed offline given some phasebook values, or to an initial estimate obtained by an initial phasebook network), or updated throughout training (using the reference phase mask obtained with the current phasebook if it is being optimized as well, or with the current estimate of the phase mask obtained by the network).

When using these training objectives, either sampling or argmax inference seems most appropriate for use at test time.

IV-B Magnitude objectives in the T-F domain: MA, MSA, PSA

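All the classical objectives used to train mask inference networks that modify the magnitude can be used here, such as mask approximation (MA), magnitude spectrum approximation (MSA), and phase-sensitive spectrum approximation (PSA). Any norm can be considered to define these objective functions, with $L^1$ and (squared) $L^2$ being most commonly used. Using $L^1$ as an example, we can define:

$$\mathcal{L}_{\text{MA}} = \sum_{t,f} \left| \hat{m}_{t,f} - m^{\text{ref}}_{t,f} \right|, \tag{23}$$
$$\mathcal{L}_{\text{MSA}} = \sum_{t,f} \left| \hat{m}_{t,f} |X_{t,f}| - |S_{t,f}| \right|, \tag{24}$$
$$\mathcal{L}_{\text{PSA}} = \sum_{t,f} \left| \hat{m}_{t,f} |X_{t,f}| - |S_{t,f}| \cos(\theta_{t,f}) \right|, \tag{25}$$

where $\hat{m}_{t,f}$ is the estimated magnitude mask, $X_{t,f}$ and $S_{t,f}$ are the mixture and target T-F representations, $\theta_{t,f}$ is the oracle phase difference between mixture and target, and $m^{\text{ref}}_{t,f}$ is an oracle magnitude mask such as the IAM.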
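The three T-F domain objectives can be sketched with an $L^1$ norm as follows. This is a minimal illustration over flat lists of per-bin values; the function name, argument names, and toy values are assumptions, and no truncation of the PSA target is applied.

```python
import math

def l1_objectives(mask_est, mask_ref, mix_mag, src_mag, theta):
    """Return (MA, MSA, PSA) L1 losses summed over T-F bins given as flat lists."""
    ma = sum(abs(m - r) for m, r in zip(mask_est, mask_ref))
    msa = sum(abs(m * x - s) for m, x, s in zip(mask_est, mix_mag, src_mag))
    psa = sum(abs(m * x - s * math.cos(th))
              for m, x, s, th in zip(mask_est, mix_mag, src_mag, theta))
    return ma, msa, psa

# Two hypothetical bins: the second has a 60-degree phase difference.
ma, msa, psa = l1_objectives(
    mask_est=[1.0, 0.5], mask_ref=[1.0, 1.0],
    mix_mag=[2.0, 2.0], src_mag=[2.0, 1.0],
    theta=[0.0, math.pi / 3],
)
print(ma, msa, psa)
```

In the second bin the magnitude estimate matches the source magnitude exactly, so MSA is zero, while PSA still penalizes it because of the phase difference.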

IV-C Time-domain objectives: WA, WA-MISI

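Recently, we introduced a waveform approximation (WA) objective defined on the time-domain signal reconstructed by inverse STFT from the masked mixture [12]. We also proposed training through an unfolded phase reconstruction algorithm such as multiple input spectrogram inversion (MISI) [26], using the WA objective on the reconstructed time-domain signal after $K$ iterations.

Denoting by $s$ the reference time-domain signal, by $\hat{s}^{(0)}$ the signal obtained by inverse STFT from the masked mixture, and by $\hat{s}^{(K)}$ the signal obtained after $K$ MISI iterations, and again using $L^1$ as an example, we define:

$$\mathcal{L}_{\text{WA}} = \sum_n \left| s[n] - \hat{s}^{(0)}[n] \right|, \tag{26}$$
$$\mathcal{L}_{\text{WA-MISI-}K} = \sum_n \left| s[n] - \hat{s}^{(K)}[n] \right|. \tag{27}$$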
In the same way as we did for magnitude-only mask inference networks [12], we can train a network that estimates both a magnitude mask and a phase mask, or alternatively a complex mask, end-to-end using the above time-domain objective functions.

V Experimental validation

V-A Experimental setup

We validate the proposed algorithms on the publicly available wsj0-2mix corpus [4], which is widely used in speaker-independent speech separation works. It contains 20,000, 5,000 and 3,000 two-speaker mixtures in its 30 h training, 10 h validation, and 5 h test sets, respectively. The speakers in the validation set are seen during training, while the speakers in the test set are completely unseen. The sampling rate is 8 kHz.

For our neural networks, we follow the same basic architecture as in [12], containing four BLSTM layers, each with 600 units in each direction, followed by output layers. Dropout is applied on the output of each BLSTM layer except the last one. The networks are trained on 400-frame segments using the Adam algorithm. The window length is 32 ms and the hop size is 8 ms. The square root Hann window is employed as the analysis window and the synthesis window is designed accordingly to achieve perfect reconstruction after overlap-add. A 256-point DFT is performed to extract 129-dimensional log magnitude input features. All systems are implemented using the Chainer deep learning toolkit [27].
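The STFT settings above are consistent with each other, as the following arithmetic sketch shows. The variable names are illustrative, and the segment duration assumes no padding at the segment edges.

```python
sample_rate = 8000                              # Hz, as used for wsj0-2mix here
window_samples = sample_rate * 32 // 1000       # 32 ms window
hop_samples = sample_rate * 8 // 1000           # 8 ms hop
dft_size = 256
feature_dim = dft_size // 2 + 1                 # non-redundant bins of a real-input DFT

segment_frames = 400
segment_samples = (segment_frames - 1) * hop_samples + window_samples
segment_seconds = segment_samples / sample_rate

print(window_samples, hop_samples, feature_dim, segment_seconds)
```

The 32 ms window spans exactly 256 samples at 8 kHz, so a 256-point DFT needs no zero-padding and yields 129 frequency bins; a 400-frame training segment covers a little over 3 seconds of audio.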

V-B Chimera++ network with Phasebook-MagBook mask inference head

We build our system based on the state-of-the-art chimera++ network [12], which combines within a multi-task learning framework a deep clustering head outputting a $D$-dimensional embedding for each T-F bin, and a mask-inference head with convex softmax output which predicts a magnitude mask with values in $[0, 2]$. The chimera++ objective function is

$$\mathcal{L}_{\text{chi++}} = \alpha \mathcal{L}_{\text{DC}} + (1 - \alpha) \mathcal{L}_{\text{MI}}, \tag{28}$$

where $\mathcal{L}_{\text{MI}}$ can be any of the objective functions described in Section IV, and the weight $\alpha$ is typically set to a high value, e.g., 0.975. The loss used on the deep clustering head is the whitened k-means loss

$$\mathcal{L}_{\text{DC}} = D - \mathrm{tr}\left( (V^T V)^{-1} V^T Y (Y^T Y)^{-1} Y^T V \right), \tag{29}$$

where $V$ is the embedding matrix consisting of vertically stacked embedding vectors, and $Y$ is the label matrix consisting of vertically stacked one-hot label vectors representing which source in a mixture dominates at each T-F bin.

Fig. 7: Chimera++ network with Phasebook-MagBook mask inference head.

As we explained above, the mask-inference head with convex softmax output predicting a magnitude mask can be generalized to a MagBook layer. We now add a phasebook layer, similar to the MagBook layer, as a new head at the output of the final BLSTM layer, as illustrated in Fig. 7. The final complex mask is obtained by combining the outputs $\hat{m}_{t,f}$ and $\hat{\theta}_{t,f}$ of the MagBook and phasebook layers as

$$\hat{M}_{t,f} = \hat{m}_{t,f} \, e^{j\hat{\theta}_{t,f}}, \tag{30}$$

and then multiplied with the complex mixture $X_{t,f}$ to obtain a complex T-F representation of the target estimate:

$$\hat{S}_{t,f} = \hat{M}_{t,f} X_{t,f}. \tag{31}$$


We still refer to the branch of the network used in computing the final output as the mask-inference (MI) head, which now predicts a complex mask.
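Combining the two heads amounts to a per-bin complex multiplication. The sketch below illustrates this on flat lists of bins; the function name and the example values are hypothetical.

```python
import cmath

def apply_complex_mask(mag_mask, phase_mask, mixture):
    """Scale each mixture bin by the magnitude mask and rotate it by the phase mask."""
    return [m * cmath.exp(1j * th) * x
            for m, th, x in zip(mag_mask, phase_mask, mixture)]

mixture = [1.0 + 0.0j, 0.0 + 1.0j]
estimate = apply_complex_mask([2.0, 0.5], [cmath.pi / 2, 0.0], mixture)
print(estimate)
```

The first bin is doubled in magnitude and rotated by a quarter turn, while the second keeps the mixture phase and is only attenuated.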

V-C Training and inference regimes for phasebook

In this experiment, we start by pre-training chimera++ networks with MagBook mask-inference head, where for now we use the fixed convex softmax of [12] for the MagBook layer, referred to here as Uniform MagBook 3. For each of the MSA, PSA, and WA losses as MI objective function, we train such a network from scratch within the multi-task learning setting involving the deep clustering and MI objectives, then discard the deep clustering head and fine-tune the MI head only. We also considered a complex spectrum approximation loss function, but in preliminary experiments, it did not perform as well as the WA loss, so we did not explore it further.

We now add a phasebook mask-inference head to these networks as described in Section V-B, where we assume a fixed uniform codebook with values $2\pi k / K$ for $k = 0, \dots, K-1$, referred to as Uniform Phasebook $K$, and we consider: (1) training the phasebook layer by itself while keeping the rest of the network fixed, with the cross-entropy loss, and using the argmax regime in Eq. (5) at inference time; (2) training the phasebook layer by itself while keeping the rest of the network fixed, with the WA loss, assuming the interpolation regime in Eq. (11) is used to obtain the final phase mask value; and (3) training the whole network with the WA loss, again assuming the interpolation regime for the phase.

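| Phase estimate | Objective | Joint mag. training | Mag. pretraining: MSA | Mag. pretraining: PSA | Mag. pretraining: WA |
|---|---|---|---|---|---|
| Noisy | - | no | 10.5 | 11.1 | 11.8 |
| Uniform phasebook 8, argmax | CE | no | 10.7 | 11.1 | 11.8 |
| Uniform phasebook 8, interp. | WA | no | 11.2 | 11.1 | 12.0 |
| Uniform phasebook 8, interp. | WA | yes | 12.2 | 12.4 | 12.4 |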
TABLE I: SI-SDR (dB) performance on the wsj0-2mix test set for various training paradigms from various pre-trained magnitude estimation networks.

For this experiment, we consider a Uniform phasebook with 8 elements. Results are shown in Table I in terms of scale-invariant SDR (SI-SDR, in dB) [25] on the wsj0-2mix test set. From Table I, we see that the CE objective only provides SI-SDR improvements for networks pre-trained with the phase-unaware MSA objective, and is generally outperformed by the WA objective both with and without joint training of the magnitude. This makes intuitive sense: the MSA-based magnitude estimates are likely to be closer to the true magnitude than those obtained with PSA and WA, which try to compensate for the errors in the noisy phase. Once the phasebook layer fixes these errors, which in the CE case it learns to do without considering the interaction with the magnitude, the compensation performed by the magnitude estimate may become extraneous or even detrimental. When training the phasebook layer with the WA objective, the largest improvement is again observed for MSA. Finally, when allowing joint training of the MagBook layer, all pre-training objectives obtain their best performance, with PSA and WA obtaining slightly larger values than MSA. Overall, the WA objective with the interpolation regime appears the most robust, both for pretraining and for training networks involving MagBook and phasebook layers. We thus focus on this configuration going forward.
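For reference, the scale-invariant SDR used in Table I can be sketched as below, following the usual definition in which the reference is rescaled by its optimal projection coefficient before computing the ratio. The function name and the `eps` stabilizer are our own choices, and any mean normalization of the signals is left out of this sketch.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between two 1-D time-domain signals."""
    # Optimal scaling of the reference so that the residual is orthogonal to it.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    scaled_reference = alpha * reference
    residual = estimate - scaled_reference
    ratio = (np.sum(scaled_reference ** 2) + eps) / (np.sum(residual ** 2) + eps)
    return 10.0 * np.log10(ratio)

# Toy signals: the error is orthogonal to the reference and 20 dB below it.
reference = np.array([1.0, 0.0, 1.0, 0.0])
estimate = reference + np.array([0.0, 0.1, 0.0, 0.1])

score = si_sdr(estimate, reference)
rescaled_score = si_sdr(3.0 * estimate, reference)
```

Rescaling the estimate leaves the score essentially unchanged, which is the property that distinguishes this measure from a plain SNR.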

Fig. 8: SDR improvement (dB) for various phasebook configurations.
Fig. 9: Uniform, pre-trained, and jointly trained phasebooks for various codebook sizes; the pre-trained phasebooks are optimized assuming an oracle IAM estimate for magnitude, while the jointly trained phasebooks are optimized together with the rest of the network.
Fig. 10: Jointly trained MagBooks for different constraints on the MagBook values (no constraint, as the output of a linear layer, or non-negative constraint, as the output of a ReLU layer) and different phase models (noisy phase or uniform phasebook layer with 8 elements). Red dots represent jointly trained MagBook values, while crosses represent the fixed MagBook, i.e., Uniform MagBook 3, as a reference.

V-D Influence of the phasebook size

Figure 8 shows SDR improvements for various phasebook sizes, where phasebook values are either uniform, pre-trained offline assuming an oracle IAM magnitude, or jointly trained together with the rest of the network. In each case, both the magnitude mask and phase mask layers in the inference head are jointly fine-tuned using the WA loss function, after pre-training of a chimera++ network with WA loss on the MI head. From Fig. 8, we see that all phasebooks improve on the noisy phase SDR of 11.7 dB. We also note that phasebooks of size 8 appear to perform best, and that the uniform phasebooks perform comparably to those with learned values. Note that, since we are interpolating over phasebook values, we can theoretically achieve the desired phase difference from any codebook, assuming it is dense enough, so the difference lies mainly in how easily the network can produce softmax outputs that yield a correct estimate. We may see a different trend if we were to pick the argmax or to sample instead.

Figure 9 shows a comparison of the uniform phasebook with the pre-trained and jointly trained values for various codebook sizes. We see that both the pre-trained and jointly trained values tend to place more weight between and as a majority of the learned values cluster in this range; this matches the empirical distribution shown in Fig. 2. We also note that the jointly trained phasebooks appear to be quite redundant, especially for .

V-E MagBook

We showed in [12] that a convex softmax interpolation of fixed values for the magnitude mask leads to state-of-the-art performance when combined with an unfolded phase reconstruction algorithm. This corresponds to a Uniform MagBook 3 in our proposed framework. We here consider an extension of this case using the MagBook formulation, where we further train end-to-end the values to be interpolated jointly with the softmax layer under a waveform approximation objective.

We consider two parameterizations for the MagBook values: we let the parameters take any real value (“linear”), or we train them under a non-negativity constraint, which we implemented using a ReLU non-linearity (“ReLU”). We also consider two types of phase models: using the noisy phase as is, as in previous works, or using a phase mask obtained with a jointly trained phasebook layer with 8 elements. All networks are first pre-trained from scratch as chimera networks and then fine-tuned, each time using the WA objective on the MI head.
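The two parameterizations can be sketched as a softmax interpolation over a small set of learnable values, with an optional ReLU applied to those values. This is a minimal NumPy illustration rather than the authors' code; the names are hypothetical, and the values {0, 1, 2} used for the Uniform MagBook 3 are an assumption based on the convex softmax of [12].

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def magbook_mask(logits, raw_values, constraint="linear"):
    """Interpolate MagBook values with softmax weights.

    constraint="linear": values are used as is and may be negative.
    constraint="relu":   values are clipped at zero (non-negative MagBook).
    """
    if constraint == "relu":
        values = np.maximum(raw_values, 0.0)
    elif constraint == "linear":
        values = raw_values
    else:
        raise ValueError(f"unknown constraint: {constraint}")
    return softmax(logits) @ values

uniform_magbook = np.array([0.0, 1.0, 2.0])   # assumed Uniform MagBook 3 values
learned_raw = np.array([-1.0, 0.5, 2.0])      # hypothetical learned parameters
flat_logits = np.zeros(3)                     # equal weight on every entry

uniform_out = magbook_mask(flat_logits, uniform_magbook)
linear_out = magbook_mask(flat_logits, learned_raw, "linear")
relu_out = magbook_mask(flat_logits, learned_raw, "relu")
```

Because the output is a convex combination of the MagBook values, the mask range is set by the smallest and largest entries, which is why a linear MagBook with a negative entry can produce sign-flipped masks while a ReLU MagBook cannot.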

Figure 10 shows examples of such learned MagBooks. Interestingly, in the linear case, the network finds it best to use one or more negative magnitude elements. This is intuitive in the case of the noisy phase, where the network has an incentive to use its freedom to take negative values in order to fix the noisy phase in regions where a phase inversion is warranted. It is perhaps slightly less intuitive when a phasebook layer is involved, as one may think that the phasebook layer should take care of phase inversions where they are needed instead of relying on negative magnitude mask values. However, there is in fact no specific incentive in the objective function to favor a positive magnitude value associated with some phase over the opposite magnitude value with that phase shifted by π, assuming both these phase values can be equally well generated by the phasebook layer. In the ReLU case, the network can no longer use negative magnitudes, and tends to place multiple points close to 0. To our surprise, it appears that MagBooks obtained with the noisy phase featured slightly larger maximum values than those obtained with a phasebook layer, whereas we argued earlier that using the noisy phase should encourage the network to under-estimate the magnitude mask value. We plan to further investigate the behavior of the estimated masks in these cases by analyzing the estimated softmax probabilities and interpolated values.
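The sign ambiguity discussed above can be made explicit with a one-line identity. The symbols m (a positive magnitude mask value) and θ (a phase mask value) are our own notation for this illustration:

```latex
(-m)\, e^{j(\theta + \pi)} \;=\; (-m)\, e^{j\pi}\, e^{j\theta} \;=\; (-m)(-1)\, e^{j\theta} \;=\; m\, e^{j\theta}
```

Both factorizations yield the same complex mask value and hence the same waveform, so a loss computed on the reconstructed signal cannot distinguish between them.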

Corresponding SI-SDR results are shown in Table II. We first observe that, when used together with the noisy phase, learning the MagBook values appears slightly beneficial, especially when using the unconstrained (linear) MagBooks, perhaps indicating that the network finds it useful to allocate some MagBook values for phase inversion. However, when pairing the magnitude estimate with a better phase estimate obtained with a phasebook layer, learning the MagBook values no longer brings improvements over the Uniform MagBook 3: this is in line with the oracle results shown in Fig. 5, where truncation of the oracle IAM magnitude mask to brings significant benefits over the classical truncation to , and truncation to a maximum value greater than brings little additional gain; this is also in line with the true distribution of magnitude mask values observed in Fig. 2, with large peaks at , , and the truncation threshold of .

| Magnitude estimate | Phase estimate: Noisy | Phase estimate: Phasebook 8 |
|---|---|---|
| Uniform MagBook 3 | 11.7 | 12.4 |
| Jointly trained MagBook 3 (linear) | 11.9 | 12.2 |
| Jointly trained MagBook 4 (linear) | 12.1 | 12.2 |
| Jointly trained MagBook 6 (linear) | 12.1 | 12.4 |
| Jointly trained MagBook 3 (ReLU) | 11.8 | 12.2 |
| Jointly trained MagBook 4 (ReLU) | 11.8 | 12.3 |
| Jointly trained MagBook 6 (ReLU) | 11.9 | 12.2 |
TABLE II: SI-SDR (dB) performance on the wsj0-2mix test set for various MagBook sizes and nonlinearities.

V-F Combook

Fig. 11: Chimera++ network with Combook mask-inference head.
Fig. 12: Jointly trained Combooks of various sizes, for chimera++ training followed by mask-inference fine-tuning with the WA objective.

We have so far considered factorized representations of the complex mask as a product of a magnitude mask and a phase mask. We now consider a similar use of a discrete representation to model the complex mask, but directly using a codebook of complex values. We train Chimera++ networks where the magnitude mask estimation layer is replaced by a complex mask estimation layer consisting of a softmax layer used to interpolate values of a Combook, as illustrated in Fig. 11. The networks are trained from scratch with both deep clustering and WA objectives, then fine-tuned with WA objective only.
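A Combook layer as described above can be sketched as a softmax interpolation over a codebook of complex values. The NumPy code below is a minimal illustration with hypothetical names and an arbitrary toy codebook; it is not the trained Combook from the experiments.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def combook_mask(logits, combook):
    """Complex mask as a softmax-weighted average of complex codebook entries."""
    return softmax(logits) @ combook

# Toy Combook: three real entries (one negative) and one purely imaginary entry.
combook = np.array([-1.0 + 0.0j, 0.0 + 0.0j, 1.0 + 0.0j, 0.0 + 2.0j])
mixture_bin = 0.5 + 0.5j

# Logits strongly favoring the negative real entry give a phase inversion.
inversion_logits = np.array([50.0, 0.0, 0.0, 0.0])
inverted_bin = combook_mask(inversion_logits, combook) * mixture_bin

# Equal logits give the centroid of the codebook.
centroid = combook_mask(np.zeros(4), combook)
```

A negative real entry lets the layer invert the mixture phase without any non-real codebook value, which is consistent with small Combooks concentrating on the real axis.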

Examples of learned Combooks are shown in Fig. 12 for . We note that the Combook size should not be directly compared to the phasebook size and MagBook size of the previous sections, since the phasebook and MagBook combine to lead to complex values: in the argmax regime, setting aside one MagBook (and Combook) value which will most likely be at 0, we have phasebook values for each of the remaining MagBook values, e.g., is akin to . Interestingly, for small sizes such as and , the Combook layer does not take advantage of non-real values, focusing first on covering negative values (for phase inversion), , and positive values. This is similar to what we observe with some of the linear MagBooks in Fig. 10 that learn to allocate magnitude values for phase inversion. Only with in Fig. 12 do we start seeing non-real values. We note however that the network does not appear to be very efficient in its usage of the available values, learning seemingly redundant values, such as the cluster of points near in the far right plot of Fig. 12.

Table III compares SI-SDR results for Combooks of various sizes, in addition to the best performing MagBook and phasebook configurations. It appears that, in the current setup, the ability of the Combook layer to estimate a complex mask via a single network layer works slightly better than trying to estimate magnitude and phase via separate layers.

| Codebook | SI-SDR (dB) |
|---|---|
| Jointly trained Combook 4 | 12.1 |
| Jointly trained Combook 8 | 12.1 |
| Jointly trained Combook 12 | 12.6 |
| Jointly trained MagBook 4 w/ noisy phase | 12.1 |
| Uniform MagBook 3 w/ Uniform phasebook 8 | 12.4 |
TABLE III: SI-SDR (dB) performance on the wsj0-2mix test set for various Combook sizes. Best MagBook and phasebook results are also shown.
Fig. 13: Mask inference part of a Chimera++ network with unfolded MISI reconstruction.

V-G Training through unfolded MISI

Following [12], we now consider adding an unfolded MISI network with K iterations at the output of the MI head, as illustrated in Fig. 13, and training the full network using the WA-MISI-K loss function. In the figure, the magnitude masks are to be replaced by complex masks when phasebook or Combook layers are involved.
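To make the role of the MISI iterations concrete, here is a minimal NumPy sketch of the forward computation only (no training), under several assumptions of ours: a square-root periodic Hann window with 50% overlap for both STFT and iSTFT, signal lengths that are multiples of the hop size, and hypothetical function names. Each iteration distributes the mixture reconstruction error equally over the sources, then keeps the estimated magnitudes while updating the phases.

```python
import numpy as np

N_FFT, HOP = 256, 128
# Square-root periodic Hann window: squared windows sum to 1 at 50% overlap.
WIN = np.sqrt(0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(N_FFT) / N_FFT))

def stft(x):
    """STFT of a 1-D signal whose length is a multiple of HOP."""
    padded = np.pad(x, HOP)
    n_frames = 1 + (len(padded) - N_FFT) // HOP
    frames = np.stack([padded[i * HOP:i * HOP + N_FFT] * WIN for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)

def istft(spec, length):
    """Inverse STFT by windowed overlap-add, trimmed to `length` samples."""
    frames = np.fft.irfft(spec, n=N_FFT, axis=-1) * WIN
    out = np.zeros(HOP * (len(frames) - 1) + N_FFT)
    for i, frame in enumerate(frames):
        out[i * HOP:i * HOP + N_FFT] += frame
    return out[HOP:HOP + length]

def misi(mixture, est_specs, n_iter):
    """Run n_iter MISI iterations from initial complex estimates (sources, frames, bins)."""
    mags = np.abs(est_specs)
    specs = est_specs
    n_src = len(specs)
    for _ in range(n_iter):
        signals = np.stack([istft(s, len(mixture)) for s in specs])
        # Spread the mixture-consistency error equally over the sources.
        error = (mixture - signals.sum(axis=0)) / n_src
        new_phase = np.angle(np.stack([stft(s + error) for s in signals]))
        specs = mags * np.exp(1j * new_phase)
    return np.stack([istft(s, len(mixture)) for s in specs])

rng = np.random.default_rng(0)
sources = rng.standard_normal((2, 1024))
mixture = sources.sum(axis=0)
source_specs = np.stack([stft(s) for s in sources])

# Initial estimates: true magnitudes combined with the mixture (noisy) phase.
noisy_init = np.abs(source_specs) * np.exp(1j * np.angle(stft(mixture)))
direct = misi(mixture, noisy_init, n_iter=0)   # direct iSTFT reconstruction
refined = misi(mixture, noisy_init, n_iter=5)  # five MISI iterations
```

With `n_iter=0` the function reduces to direct iSTFT reconstruction, and a better initial phase (for instance from a phasebook or Combook layer) simply changes `est_specs`.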

Fig. 14: SI-SDR improvement (dB) for a given number of unfolded MISI iterations from a complex time-frequency domain speech estimate obtained by: (1) combining an estimated magnitude mask from a Uniform MagBook 3 layer with the noisy phase; (2) combining an estimated magnitude mask from a Uniform MagBook 3 layer with the phase mask obtained with a Uniform phasebook 8 layer; and (3) using a complex mask obtained with a Jointly trained Combook 12 layer trained jointly with the rest of the network.

Results are shown in Fig. 14 for various numbers of unfolded MISI iterations and three different types of networks: (1) the original chimera++ network using the noisy phase with a Uniform MagBook 3 layer with fixed elements, as a (state-of-the-art) baseline; (2) a chimera++ network with the same architecture and an additional phasebook layer with 8 uniformly distributed elements; and (3) a chimera++ network with a Combook layer as MI head whose elements are learned end-to-end together with the rest of the network parameters. We observe that the Combook network improves significantly over the noisy phase baseline and obtains the best performance among all methods for direct iSTFT reconstruction (i.e., 0 MISI iterations), but its performance does not improve further with more iterations. The phasebook network also improves significantly over the baseline, and converges to an SDR value similar to that of the Combook within 5 iterations (both reach 12.6 dB in Table IV). Both Combook and phasebook enable a better phase estimate which can match state-of-the-art performance without the need for the unfolded phase reconstruction required when using the noisy phase.

Table IV shows a comparison of the best proposed systems with three recently proposed approaches: the original Chimera++ network using noisy phase and MISI phase reconstruction as a post-processing only [8]; a Chimera++ network trained through unfolded MISI phase reconstruction [12], which is equivalent in our framework to a Uniform MagBook 3 with noisy phase as the initial phase; and a Chimera++ network with unfolded phase reconstruction in which the STFT and iSTFT transforms are replaced by separate (or “untied”) transforms at each layer, learned together with the rest of the network [21]. The Jointly trained Combook 12 system obtains the best performance when no MISI iteration is performed, at 12.6 dB, beating the previous state-of-the-art 12.2 dB which involves further learning a transform replacing the final iSTFT [21]. If we allow ourselves 5 MISI iterations, all proposed systems reach 12.6 dB, but they are slightly outperformed by the system which learns replacements for the STFT/iSTFT transforms, with 12.8 dB. We shall leave it to future work to combine such transform learning with our proposed systems.

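| Approach | MISI iterations | SI-SDR (dB) |
|---|---|---|
| Chimera++ [8] | 0 | 11.2 |
| Chimera++ [8] | 5 | 11.5 |
| Uniform MagBook 3 w/ noisy phase [12] | 0 | 11.8 |
| Uniform MagBook 3 w/ noisy phase [12] | 5 | 12.6 |
| Unfolded MISI with learned untied transforms [21] | 0 | 12.2 |
| Unfolded MISI with learned untied transforms [21] | 5 | 12.8 |
| Uniform MagBook 3 w/ Uniform phasebook 8 | 0 | 12.4 |
| Uniform MagBook 3 w/ Uniform phasebook 8 | 5 | 12.6 |
| Jointly trained Combook 12 | 0 | 12.6 |
| Jointly trained Combook 12 | 5 | 12.6 |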
TABLE IV: SI-SDR (dB) comparison with other recent systems on the wsj0-2mix test set.

VI Conclusion and future work

According to the above experiments, both a Combook layer and a combination of MagBook and phasebook layers can significantly improve the performance of single-channel multi-speaker speech separation, in particular reducing the need for further phase reconstruction. We have here focused mostly on end-to-end training using the waveform approximation objective, because it has led to the best results both here and in recent work [12]. The most convenient way to use this objective was to rely on an interpolation regime for the phasebook layer, but we could also consider training through the argmax regime using methods such as the Gumbel-Softmax [28], also known as the Concrete distribution [29]. This would in particular allow us to use the discrete nature of the representation to introduce conditional probability relationships between T-F bins. As an alternative to the interpolation regime, where our losses are computed on the expected outputs over the codebooks, we also plan to investigate expected loss functions that consider the expectation of the loss computed over each possible value in the codebook. Finally, while we here considered estimating the difference between the noisy and clean phase, we could also consider estimating the clean phase directly, and training the network to merge the two estimates based on the context.
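The distinction between a loss on the expected output and an expected loss can be illustrated on a single T-F bin. The sketch below is only a toy example under our own assumptions: it uses a squared error in the complex T-F domain rather than the waveform-domain objective used in the experiments, and all names and values are hypothetical.

```python
import numpy as np

def loss_on_expected_output(probs, codebook, mixture_bin, target_bin):
    """Interpolation regime: apply the expected mask, then compute the loss."""
    expected_mask = probs @ codebook
    return np.abs(expected_mask * mixture_bin - target_bin) ** 2

def expected_loss(probs, codebook, mixture_bin, target_bin):
    """Expected loss: compute the loss for each codebook value, then average."""
    per_value_loss = np.abs(codebook * mixture_bin - target_bin) ** 2
    return probs @ per_value_loss

codebook = np.array([1.0 + 0.0j, -1.0 + 0.0j])  # toy two-entry complex codebook
probs = np.array([0.5, 0.5])
mixture_bin, target_bin = 1.0 + 0.0j, 1.0 + 0.0j

interp_value = loss_on_expected_output(probs, codebook, mixture_bin, target_bin)
expect_value = expected_loss(probs, codebook, mixture_bin, target_bin)
```

For a convex loss such as this one, the loss on the expected output never exceeds the expected loss, and the two coincide when the probabilities are one-hot, so the expected loss additionally penalizes spreading probability mass over poorly matching codebook values.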


  • [1] F. J. Weninger, J. R. Hershey, J. Le Roux, and B. Schuller, “Discriminatively trained recurrent neural networks for single-channel speech separation,” in GlobalSIP Machine Learning Applications in Speech Processing Symposium, 2014.
  • [2] H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux, “Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks,” in Proc. ICASSP, Apr. 2015.
  • [3] F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J. R. Hershey, and B. Schuller, “Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR,” in Latent Variable Analysis and Signal Separation.   Springer, 2015.
  • [4] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in Proc. ICASSP, Mar. 2016.
  • [5] Y. Isik, J. Le Roux, Z. Chen, S. Watanabe, and J. R. Hershey, “Single-channel multi-speaker separation using deep clustering,” in Proc. ISCA Interspeech, Sep. 2016.
  • [6] D. Yu, M. Kolbæk, Z.-H. Tan, and J. Jensen, “Permutation invariant training of deep models for speaker-independent multi-talker speech separation,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Mar. 2017.
  • [7] M. Kolbæk, D. Yu, Z.-H. Tan, and J. Jensen, “Multitalker Speech Separation With Utterance-Level Permutation Invariant Training of Deep Recurrent Neural Networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 10, 2017.
  • [8] Z.-Q. Wang, J. Le Roux, and J. R. Hershey, “Alternative objective functions for deep clustering,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Apr. 2018.
  • [9] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator,” IEEE Transactions on acoustics, speech, and signal processing, vol. 32, no. 6, Dec. 1984.
  • [10] J. Le Roux, H. Kameoka, N. Ono, A. de Cheveigné, and S. Sagayama, “Computational auditory induction by missing-data non-negative matrix factorization,” in Proc. ISCA Workshop on Statistical and Perceptual Audition (SAPA), Sep. 2008.
  • [11] T. Gerkmann, M. Krawczyk-Becker, and J. Le Roux, “Phase processing for single-channel speech enhancement: History and recent advances,” IEEE Signal Processing Magazine, vol. 32, no. 2, Mar. 2015.
  • [12] Z.-Q. Wang, J. Le Roux, D. Wang, and J. R. Hershey, “End-to-end speech separation with unfolded iterative phase reconstruction,” in Proc. ISCA Interspeech, Sep. 2018.
  • [13] D. Wang, “On ideal binary mask as the computational goal of auditory scene analysis,” in Speech separation by humans and machines.   Springer, 2005.
  • [14] G. Hu, “Monaural speech organization and segregation,” Ph.D. dissertation, The Ohio State University, 2006.
  • [15] G. Kim, Y. Lu, Y. Hu, and P. C. Loizou, “An algorithm that improves speech intelligibility in noise for normal-hearing listeners,” J. Acoust. Soc. Am., vol. 126, no. 3, 2009.
  • [16] Y. Wang and D. Wang, “Cocktail party processing via structured prediction,” in Advances in neural information processing systems (NIPS), 2012.
  • [17] S. J. Rennie, K. Achan, B. J. Frey, and P. Aarabi, “Variational speech separation of more sources than mixtures,” in AISTATS, 2005.
  • [18] A. Liutkus, C. Rohlfing, and A. Deleforge, “Audio source separation with magnitude priors: the BEADS model,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018.
  • [19] D. S. Williamson, Y. Wang, and D. Wang, “Complex ratio masking for monaural speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 3, 2016.
  • [20] T. Afouras, J. S. Chung, and A. Zisserman, “The conversation: Deep audio-visual speech enhancement,” arXiv preprint arXiv:1804.04121, 2018.
  • [21] G. Wichern and J. Le Roux, “Phase reconstruction with learned time-frequency representations for single-channel speech separation,” in Proc. IEEE International Workshop on Acoustic Signal Enhancement (IWAENC), Sep. 2018.
  • [22] N. Takahashi, P. Agrawal, N. Goswami, and Y. Mitsufuji, “PhaseNet: Discretized phase modeling with deep neural networks for audio source separation,” Proc. Interspeech, 2018.
  • [23] Y. Luo and N. Mesgarani, “TasNet: Surpassing ideal time-frequency masking for speech separation,” arXiv preprint arXiv:1809.07454, Sep. 2018.
  • [24] E. Vincent, J. Barker, S. Watanabe, J. Le Roux, F. Nesta, and M. Matassoni, “The second ‘CHiME’ speech separation and recognition challenge: Datasets, tasks and baselines,” in Proc. of ICASSP, Vancouver, Canada, 2013.
  • [25] J. Le Roux, J. R. Hershey, S. T. Wisdom, and H. Erdogan, “SDR – half-baked or well done?” Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA, USA, Tech. Rep., 2018.
  • [26] D. Gunawan and D. Sen, “Iterative phase estimation for the synthesis of separated sources from single-channel mixtures,” in IEEE Signal Processing Letters, 2010.
  • [27] S. Tokui, K. Oono, S. Hido, and J. Clayton, “Chainer: a next-generation open source framework for deep learning,” in Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS), 2015.
  • [28] E. Jang, S. Gu, and B. Poole, “Categorical reparameterization with Gumbel-Softmax,” arXiv preprint arXiv:1611.01144, 2016.
  • [29] C. J. Maddison, A. Mnih, and Y. W. Teh, “The concrete distribution: A continuous relaxation of discrete random variables,” arXiv preprint arXiv:1611.00712, 2016.