Blind source separation (BSS) is the task of separating a mixed signal into different components, usually referred to as sources. In the context of sound processing, it can be used to separate speakers whose voices have been recorded simultaneously. A common way to address this task is to decompose the signal spectrogram by non-negative matrix factorization (NMF, Lee and Seung 2001), as proposed for example by Schmidt and Olsson (2006) as well as Sun and Mysore (2013). Denoting $x_{ft}$ the (complex) short-time Fourier transform (STFT) coefficient of the input signal at frequency bin $f$ and time frame $t$, and $V$ its magnitude spectrogram defined as $V_{ft} := |x_{ft}|$, the BSS problem can be tackled by solving the NMF problem

$$\min_{D^{(1)},\dots,D^{(k)} \ge 0,\; W^{(1)},\dots,W^{(k)} \ge 0} \; \sum_{t=1}^{n} \ell\!\left(v_t, \sum_{i=1}^{k} D^{(i)} w_t^{(i)}\right),$$

where $k$ is the number of sources, $n$ is the number of time windows, $v_t$ is the $t$-th column of $V$ and $\ell$ is a loss function. Each dictionary matrix $D^{(i)}$ and weight matrix $W^{(i)}$ are related to a single source. In a supervised setting, each source has training data and all the $D^{(i)}$s are learned in advance during a training phase. At test time, given a new signal, separated spectrograms are recovered from the $D^{(i)}$s and $W^{(i)}$s and corresponding signals can be reconstructed with suitable post-processing. Several loss functions have been considered in the literature, such as the squared Euclidean distance (Lee and Seung 2001; Schmidt and Olsson 2006), the generalized Kullback-Leibler divergence (Lee and Seung 2001; Sun and Mysore 2013) or the Itakura-Saito divergence (Févotte et al. 2009; Sawada et al. 2013).
In the present article, we propose to use optimal transport as a loss between spectrograms to perform supervised speech BSS with NMF. Optimal transport is defined as the minimum cost of moving the mass from one histogram to another. By taking into account a transportation cost between frequencies, this provides a powerful metric to compare STFT spectrograms. One of the main advantages of using optimal transport as a loss is that it can quantify the amplitude of frequency shift noise, coming for example from quantization or from the tuning of a musical instrument. Other metrics such as the Euclidean distance or Kullback-Leibler divergence, which compare spectrograms element-wise, are almost blind to this type of noise (see Figure 1). Another advantage over element-wise metrics is that optimal transport enables the use of different quantizations, i.e. frequency supports, at training and test times. Indeed, the frequencies represented on a spectrogram depend on the sampling rate of the signal and the time-windows used for its computation, both of which can change between training and test times. With optimal transport, we do not need to re-quantize the training and testing data so that they share the same frequency support: optimal transport is well-defined between spectrograms with distinct supports as long as we can define a transportation cost between frequencies. Finally, the optimal transport framework enables us to generalize the Wiener filter, a common post-processing for source separation, by using optimal transport plans, so that it can be applied to data quantized on different frequencies.
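To make the frequency-shift argument concrete, the following sketch compares the Euclidean distance and the optimal transport (Wasserstein) distance between a pure tone and a slightly shifted copy of it. The frequency grid and bin indices are illustrative choices, not values from the paper.

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Hypothetical toy spectra: a pure tone near 440 Hz vs. the same tone
# shifted to the next frequency bin (e.g. slight detuning).
freqs = np.linspace(0.0, 8000.0, 257)          # frequency support (Hz)
a = np.zeros(257); a[14] = 1.0                 # tone in bin 14 (437.5 Hz)
b = np.zeros(257); b[15] = 1.0                 # same tone, one bin higher

# Element-wise metrics see the two spectra as maximally different...
eucl = np.linalg.norm(a - b)                   # sqrt(2), regardless of shift size

# ...whereas optimal transport scales with the size of the frequency shift.
ot = wasserstein_distance(freqs, freqs, a, b)  # = |freqs[15] - freqs[14]| = 31.25
print(eucl, ot)
```

The Euclidean distance stays at $\sqrt{2}$ no matter how far the tone is shifted, while the transport cost grows with the shift, which is the behavior motivating the choice of loss.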
NMF with an optimal transport loss was first proposed by Sandler and Lindenbaum (2009). They solved this problem by using a bi-convex formulation and relied on an approximation of optimal transport based on wavelets (Shirdhonkar and Jacobs 2008). Recently, Rolet et al. (2016) proposed fast algorithms to compute NMF with an entropy-regularized optimal transport loss, which are more flexible in the sense that they do not require any assumption on the frequency quantization or on the cost function used.
Using optimal transport as a loss between spectrograms was also proposed by Flamary et al. (2016) under the name “optimal spectral transportation”. They developed a novel method for unsupervised music transcription which achieves state-of-the-art performance. Their method relies on a cost matrix designed specifically for musical instruments, allowing them to use Diracs as dictionary columns. That is, they fix
each dictionary column to a vector with a single non-zero entry and learn only the corresponding coefficients. This trivial structure of the dictionary results in efficient coefficient computation. However, this approach cannot be applied as is to speech separation, since it relies on the assumption that a musical note can be represented by its fundamental. It also requires designing the cost of moving the fundamental to its harmonics and neighboring frequencies. Because human voices are intrinsically more complex, it is necessary to learn both the dictionary and the coefficients, i.e., to solve full NMF problems.
In this paper, we extend the optimal transport NMF of Rolet et al. (2016) to the case where the columns of the input matrix are not normalized, in order to propose an algorithm suitable for spectrogram data. Normalizing all time frames so that they have the same total weight is not desirable in sound processing tasks because it would amplify noise. We define a cost between frequencies so that the optimal transport objective between spectrograms provides a relevant metric between them. We apply our NMF framework to single voice reconstruction and blind source separation and show that an optimal transport loss provides better results than the usual squared Euclidean loss. Finally, we show how to use our framework for cross-domain BSS, where the frequencies represented in the test spectrograms may differ from the ones in the dictionary. This may happen for example when train and test data are recorded with different equipment, or when the STFT is computed with different parameters.
We denote matrices in upper-case, vectors in bold lower-case and scalars in lower-case. If $M$ is a matrix, $M^\top$ is its transpose, $m_j$ is its $j$-th column and $m_{i:}$ its $i$-th row. $\mathbf{1}_n$ denotes the all-ones vector in $\mathbb{R}^n$; when the dimension can be deduced from the context we simply write $\mathbf{1}$. For two matrices $A$ and $B$ of the same size, we denote their inner product $\langle A, B \rangle := \sum_{ij} A_{ij} B_{ij}$. We denote $\Sigma_n$ the $n$-dimensional simplex: $\Sigma_n := \{ x \in \mathbb{R}_+^n : \mathbf{1}^\top x = 1 \}$.
We start by introducing optimal transport and its entropy regularization, which we will use as the loss $\ell$, as well as previous works on optimal transport NMF. For a more comprehensive overview of optimal transport from a computational perspective, see Peyré and Cuturi (2017).
2.1 Optimal Transport
Exact Optimal Transport. Let $a \in \Sigma_m$ and $b \in \Sigma_n$. The polytope of transportation matrices between $a$ and $b$ is defined as

$$U(a, b) := \left\{ T \in \mathbb{R}_+^{m \times n} \;:\; T \mathbf{1}_n = a,\; T^\top \mathbf{1}_m = b \right\}. \tag{1}$$
Given a cost matrix $C \in \mathbb{R}^{m \times n}$, the minimum transportation cost between $a$ and $b$ is defined as

$$\mathrm{OT}(a, b) := \min_{T \in U(a, b)} \langle T, C \rangle. \tag{2}$$
When $m = n$ and the cost matrix $C$ is the $p$-th power ($p \ge 1$) of a distance matrix, i.e. $C_{ij} = d(y_i, y_j)^p$ for some points $(y_i)_{i=1}^n$ in a metric space $(Y, d)$, then $\mathrm{OT}(\cdot, \cdot)^{1/p}$ is a distance on the set of vectors in $\mathbb{R}_+^n$ with the same $\ell_1$ norm (Villani 2003, Theorem 7.3). We can see the vectors $y_i$ as features, and $a$ and $b$ as the quantization weights of the data onto these features. In sound processing applications, the $y_i$ are real numbers corresponding to the frequencies of the spectrogram, and $a$ and $b$ are their corresponding magnitudes. By computing the minimal transportation cost between frequencies of two spectrograms, optimal transport exhibits variations in accordance with the frequency noise involved in the signal generative process, which results for instance from the tuning of musical instruments or the subject's condition in speech processing.
Unnormalized Optimal Transport. In this work, we wish to define optimal transport when $a$ and $b$ are non-negative but not necessarily normalized. Note that the transportation polytope is not empty as long as $a$ and $b$ sum to the same value: $U(a, b) \neq \emptyset$ if and only if $\mathbf{1}^\top a = \mathbf{1}^\top b$. Hence, we define optimal transport between possibly unnormalized vectors $a$ and $b$ as

$$\mathrm{OT}(a, b) := \begin{cases} \min_{T \in U(a, b)} \langle T, C \rangle & \text{if } \mathbf{1}^\top a = \mathbf{1}^\top b, \\ +\infty & \text{otherwise.} \end{cases} \tag{3}$$
Computing the optimal transport cost (3) amounts to solving a linear program (LP), which can be done with specialized versions of the simplex algorithm with worst-case complexity in $O(n^3 \log n)$ when $m = n$ (Orlin 1997). When considering $\mathrm{OT}$ as a loss between histograms supported on more than a few hundred bins, such computation quickly becomes intractable. Moreover, using $\mathrm{OT}$ as a loss involves differentiating it, and it is not differentiable everywhere. Hence, one would have to resort to subgradient methods. This would be prohibitively slow, since each iteration would require obtaining a subgradient at the current iterate, which in turn requires solving the LP (3).
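For illustration, the exact OT cost (3) can be computed as a dense linear program with an off-the-shelf solver; this is only feasible for small histograms, which is precisely the scalability issue discussed above. The encoding below is a straightforward sketch using SciPy.

```python
import numpy as np
from scipy.optimize import linprog

def exact_ot(a, b, C):
    """Solve the OT linear program min <T, C> s.t. T 1 = a, T^T 1 = b, T >= 0.

    Illustrative only: the dense LP has m*n variables and m+n constraints,
    which becomes intractable beyond a few hundred bins.
    """
    m, n = C.shape
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):
        A_eq[i, i * n:(i + 1) * n] = 1.0          # row sums equal a
    for j in range(n):
        A_eq[m + j, j::n] = 1.0                   # column sums equal b
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
                  bounds=(0, None), method="highs")
    return res.fun, res.x.reshape(m, n)

# Moving one unit of mass from bin 0 to bin 2 costs C[0, 2] = 4 with cost |i-j|^2.
C = np.array([[abs(i - j) ** 2 for j in range(3)] for i in range(3)], float)
cost, T = exact_ot(np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0]), C)
print(cost)  # 4.0
```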
Entropy Regularized Optimal Transport. To remedy these limitations, Cuturi (2013) proposed to add an entropy-regularization term to the optimal transport objective, thus making the loss differentiable everywhere and strictly convex. This entropy-regularized optimal transport has since been used in numerous works as a loss for diverse tasks (see for example Gramfort et al. 2015; Frogner et al. 2015; Rolet et al. 2016).
Let $\gamma > 0$. We define the (unnormalized) entropy-regularized OT between $a$ and $b$ as

$$\mathrm{OT}_\gamma(a, b) := \begin{cases} \min_{T \in U(a, b)} \langle T, C \rangle - \gamma E(T) & \text{if } \mathbf{1}^\top a = \mathbf{1}^\top b, \\ +\infty & \text{otherwise,} \end{cases} \tag{4}$$

where $E(T) := -\sum_{ij} T_{ij} (\log T_{ij} - 1)$ is the entropy of the transport plan $T$. Let us denote $\mathrm{OT}_\gamma^*(a, \cdot)$ the convex conjugate of $\mathrm{OT}_\gamma$ with respect to its second variable:

$$\mathrm{OT}_\gamma^*(a, g) := \sup_{b} \; \langle g, b \rangle - \mathrm{OT}_\gamma(a, b).$$
Cuturi and Peyré (2016) showed that its value and gradient can be computed in closed form:

$$\mathrm{OT}_\gamma^*(a, g) = \gamma \left( E(a) + \langle a, \log K\alpha \rangle \right), \qquad \nabla_g \mathrm{OT}_\gamma^*(a, g) = \alpha \odot K^\top \frac{a}{K\alpha},$$

where $K := e^{-C/\gamma}$ and $\alpha := e^{g/\gamma}$, with exponentiation, logarithm and division taken entry-wise.
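The closed form above can be transcribed directly. The sketch below (plain NumPy, our own variable names) implements the value and gradient of the conjugate and checks the gradient against finite differences; a practical implementation would work in the log domain for numerical stability.

```python
import numpy as np

def ot_conjugate(a, g, C, gamma):
    """Value and gradient of the conjugate of entropy-regularized OT with
    respect to its second argument (closed form of Cuturi & Peyré 2016)."""
    K = np.exp(-C / gamma)                          # Gibbs kernel
    alpha = np.exp(g / gamma)
    Ka = K @ alpha
    ent_a = -np.sum(a * (np.log(a) - 1.0))          # entropy of a (a > 0 assumed)
    value = gamma * (ent_a + a @ np.log(Ka))
    grad = alpha * (K.T @ (a / Ka))
    return value, grad

# Finite-difference check of the gradient on random data.
rng = np.random.default_rng(0)
m, n = 4, 5
a = rng.random(m) + 0.1
g = rng.standard_normal(n)
C = np.abs(rng.standard_normal((m, n)))
_, grad = ot_conjugate(a, g, C, gamma=0.5)
eps = 1e-6
fd = np.array([(ot_conjugate(a, g + eps * np.eye(n)[j], C, 0.5)[0]
                - ot_conjugate(a, g - eps * np.eye(n)[j], C, 0.5)[0]) / (2 * eps)
               for j in range(n)])
print(np.max(np.abs(grad - fd)))   # small: closed form matches finite differences
```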
2.2 Optimal Transport NMF
NMF can be cast as an optimization problem of the form

$$\min_{D \in \mathbb{R}_+^{m \times k},\; W \in \mathbb{R}_+^{k \times n}} \; \sum_{t=1}^{n} \ell(v_t, D w_t), \tag{5}$$

where both $D$ and $W$ are optimized at train time, and $D$ is fixed at test time. When $\ell$ is $\mathrm{OT}$, problem (5) is convex in $D$ and $W$ separately, but not jointly. It can be solved by alternating full optimization with respect to $D$ and $W$. Each resulting sub-problem is a very high dimensional linear program with many constraints (Sandler and Lindenbaum 2009), which is intractable with standard LP solvers even for short sound signals. In addition, convergence proofs of alternate minimization methods for NMF typically assume strictly convex sub-problems (see e.g. Tropp 2003; Bertsekas 1999, Prop. 2.7.1), which is not the case when using non-regularized $\mathrm{OT}$ as a loss.
To address this issue, Rolet et al. (2016) proposed to use $\mathrm{OT}_\gamma$ instead, and showed how to solve each sub-problem in the dual using fast gradient computations. Formally, they tackle problems of the form

$$\min_{D,\, W} \; \sum_{t=1}^{n} \mathrm{OT}_\gamma(v_t, D w_t) + R_1(D) + R_2(W), \tag{6}$$

where $R_1$ and $R_2$ are convex regularizers that enforce non-negativity constraints, and the columns of $D$ are restricted to $\Sigma_m$, the $m$-dimensional simplex.
It was shown that each sub-problem of (6) with either $D$ or $W$ fixed has a smooth Fenchel-Rockafellar dual, which can be solved efficiently, leading to a fast overall algorithm. However, their definition of optimal transport requires inputs and reconstructions to have an $\ell_1$ norm equal to $1$. This is achieved by normalizing the input beforehand, restricting the columns of $D$ and $W$ to the simplex, and using as regularizers negative entropies defined on the simplex.
They showed that the coefficients and dictionary can be updated according to the following duality results.
Coefficients Update. For $D$ fixed, the optimizer of (6) with respect to $W$ can be recovered from the solution of a smooth dual problem (10) through the primal-dual relationship (9). We can solve Problem (10) with accelerated gradient descent (Nesterov 1983). The value and gradient of the convex conjugate of the regularizer $R_2$, which appear in the dual objective and its gradient, can be computed in closed form.
Dictionary Update. For $W$ fixed, the optimizer of (6) with respect to $D$ is obtained analogously by solving the corresponding dual problem.
These duality results allow us to go from a constrained primal problem, for which each evaluation of the objective and its gradient requires solving an optimal transport problem, to an unconstrained dual problem whose objective and gradient can be evaluated in closed form. The primal constraints on $D$ and $W$ are enforced by the primal-dual relationships. Moreover, the use of an entropy regularization with $\gamma > 0$ makes $\mathrm{OT}_\gamma$ smooth with respect to its second variable.
We now present our approach for optimal transport BSS. First we introduce the changes to Rolet et al. (2016) that are necessary for computing optimal transport NMF on STFT spectrograms of sound data. We then define a transportation cost between frequencies. Finally we show how to reconstruct sound signals from the separated spectrograms.
3.1 Signal Separation With NMF
We use a supervised BSS setting similar to the one described in Schmidt and Olsson (2006). For each source $i$, we have access to training data $V^{(i)}$, on which we learn a dictionary $D^{(i)}$ with NMF.
Then, given the STFT spectrogram $V$ of a mixture of voices, we reconstruct separated spectrograms $\hat V^{(i)} := D^{(i)} W^{(i)}$ for $i = 1, \dots, k$, where the $W^{(i)}$s are the solutions of

$$\min_{W^{(1)}, \dots, W^{(k)} \ge 0} \; \sum_{t=1}^{n} \mathrm{OT}_\gamma\!\left(v_t, \sum_{i=1}^{k} D^{(i)} w_t^{(i)}\right).$$
The separated signals are then reconstructed from each $\hat V^{(i)}$ with the process described in Section 3.4.
In practice, at test time the dictionaries are concatenated into a single matrix $D := (D^{(1)}, \dots, D^{(k)})$, and a single matrix of coefficients $W$ is learned, which we decompose as $W = (W^{(1)\top}, \dots, W^{(k)\top})^\top$. This allows us to focus on problems of the form

$$\min_{W \ge 0} \; \sum_{t=1}^{n} \mathrm{OT}_\gamma(v_t, D w_t).$$
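The concatenation-and-split bookkeeping can be sketched as follows; dimensions and data are placeholders.

```python
import numpy as np

# Hypothetical dictionaries learned separately for two sources
# (m frequency bins, k atoms each), and coefficients fitted on a mixture.
rng = np.random.default_rng(0)
m, k, n = 6, 3, 4
D1 = rng.random((m, k)); D1 /= D1.sum(axis=0)   # columns on the simplex
D2 = rng.random((m, k)); D2 /= D2.sum(axis=0)

D = np.hstack([D1, D2])                  # concatenated dictionary (m, 2k)
W = rng.random((2 * k, n))               # stand-in for the fitted coefficients

W1, W2 = W[:k], W[k:]                    # split coefficients per source
V1_hat, V2_hat = D1 @ W1, D2 @ W2        # separated spectrograms
assert np.allclose(D @ W, V1_hat + V2_hat)   # reconstruction is additive
```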
3.2 Non-normalized Optimal Transport NMF
Normalizing the columns of the input $V$, as in Rolet et al. (2016), is not a good option in the context of signal processing: frames with low amplitude typically contain mostly noise, and normalization would amplify them.
However, our definition of optimal transport does not require inputs to be in the simplex, only to have the same $\ell_1$ norm. With this definition, the convex conjugate of $\mathrm{OT}_\gamma$ and its gradient still have the same value as in Cuturi and Peyré (2016), and we can simply relax the constraint on the columns of $W$ to non-negativity in Problem (6). We keep a simplex constraint on the columns of the dictionary so that each update is guaranteed to stay in a compact set. We use a negative entropy defined on the non-negative orthant as the coefficient matrix regularizer $R_2$, and for $R_1$ we keep the negative entropy defined on the simplex. The problem then becomes

$$\min_{D \in \Sigma_m^k,\; W \ge 0} \; \sum_{t=1}^{n} \mathrm{OT}_\gamma(v_t, D w_t) + R_1(D) + R_2(W).$$
The dictionary update is the same as in Rolet et al. (2016). However, the coefficient updates need to be modified as follows.
Coefficients Update. For $D$ fixed, the optimizer of the coefficient sub-problem is again recovered from the solution of a smooth unconstrained dual problem through a primal-dual relationship. With the negative entropy regularizer defined on the non-negative orthant, $R_2(w) := \rho \langle w, \log w - \mathbf{1} \rangle$, the conjugate appearing in the dual and its gradient can be evaluated in closed form:

$$R_2^*(g) = \rho \sum_{i} e^{g_i / \rho}, \qquad \nabla R_2^*(g) = e^{g / \rho}.$$
3.3 Cost Matrix Design
In order to compute optimal transport on spectrograms and perform NMF, we need a cost matrix $C$, which represents the cost of moving weight from frequencies in the original spectrogram to frequencies in the reconstructed spectrogram. Schmidt and Olsson (2006) use the mel scale to quantize spectrograms, relying on the fact that the perceptual difference between frequencies is smaller in the high-frequency domain than in the low-frequency domain. Following the same intuition, we propose to map frequencies to a log domain and apply a cost function in that domain. Let $f_i$ be the frequency of the $i$-th bin in an input data spectrogram, for $i = 1, \dots, m$, and let $g_j$ be the frequency of the $j$-th bin in a reconstruction spectrogram, for $j = 1, \dots, \tilde m$. We define the cost matrix as

$$C_{ij} := \left| \log(\lambda + f_i) - \log(\lambda + g_j) \right|^p,$$
with parameters $\lambda \ge 0$ and $p > 0$. Since the mel scale is a log scale, it is included in this definition for a suitable value of $\lambda$. Some illustrations of our cost matrix for different values of $\lambda$ are shown in Figure 2. They show that with our definition, moving weights locally is less costly for high frequencies than for low ones, and that this effect can be tuned by selecting $\lambda$.
Figure 3 shows the effect of the cost parameters on the learned dictionaries. A more spiked cost leads to dictionary elements that can have several spikes in the same frequency bands, whereas a flatter cost tends to produce smoother dictionary elements.
Note that with this definition and $p \ge 1$, $C$ is a distance matrix to the power $p$ when the source and target frequencies are the same. If $p = 1/2$, $C$ is the point-wise square root of a distance matrix and as such is a distance matrix itself.
The values of $\lambda$ and $p$ that yielded better results for blind source separation on the validation set were accordingly used in all our experiments.
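A direct transcription of this cost matrix; the values of `lam` and `p` below are placeholders, not the ones selected on the validation set.

```python
import numpy as np

def log_cost_matrix(src_freqs, tgt_freqs, lam=1.0, p=0.5):
    """Transportation cost between frequencies in a log domain:
    C[i, j] = |log(lam + f_i) - log(lam + g_j)|^p.
    lam and p are free parameters (placeholder values here).
    """
    lf = np.log(lam + np.asarray(src_freqs, float))[:, None]
    lg = np.log(lam + np.asarray(tgt_freqs, float))[None, :]
    return np.abs(lf - lg) ** p

# Source and target grids need not coincide (cross-domain separation).
C = log_cost_matrix(np.linspace(0, 8000, 257), np.linspace(0, 8000, 129))
print(C.shape)        # (257, 129)

# Moving weight locally is cheaper at high frequencies than at low ones:
f = np.array([100.0, 4000.0])
C2 = log_cost_matrix(f, f + 50.0)     # a 50 Hz shift at 100 Hz vs. at 4 kHz
assert C2[0, 0] > C2[1, 1]
```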
Wiener Filter. In the case where the reconstruction is in the same frequency domain as the original signal, the classical way to recover each voice in the time domain is to apply a Wiener filter. Let $x$ be the original (complex) Fourier spectrum and $\hat v^{(1)}$, $\hat v^{(2)}$ the separated magnitude spectra, whose sum approximates $|x|$. The Wiener filter builds $x^{(1)} := x \odot \frac{\hat v^{(1)}}{\hat v^{(1)} + \hat v^{(2)}}$ and $x^{(2)} := x \odot \frac{\hat v^{(2)}}{\hat v^{(1)} + \hat v^{(2)}}$, thereby applying the original spectrum's phase, before performing the inverse STFT.
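A minimal sketch of this filtering step, assuming a magnitude-ratio mask (power-ratio masks are an equally common variant):

```python
import numpy as np

def wiener_separate(X, V1, V2, eps=1e-12):
    """Soft-mask ("Wiener") filtering of a complex STFT X given two
    non-negative separated magnitude spectrograms V1, V2."""
    total = V1 + V2 + eps                 # eps avoids division by zero
    X1 = X * (V1 / total)                 # keeps the mixture's phase
    X2 = X * (V2 / total)
    return X1, X2                         # X1 + X2 == X up to eps

X = np.array([[1 + 1j, 2j], [3.0, 1 - 1j]])
V1 = np.array([[1.0, 0.0], [2.0, 1.0]])
V2 = np.array([[1.0, 4.0], [1.0, 1.0]])
X1, X2 = wiener_separate(X, V1, V2)
assert np.allclose(X1 + X2, X)            # the two masks sum to one bin-wise
```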
Generalized Filter. We propose to extend this filtering to the case where $\hat v^{(1)}$ and $\hat v^{(2)}$ are not in the same domain as $x$. This may happen for example if the test data is recorded with a different sampling frequency, or if the STFT is performed with a different time-window than the train data. In such a case, $\hat v^{(1)}$ and $\hat v^{(2)}$ are in the domain of the train data, and so are the dictionaries $D^{(1)}$ and $D^{(2)}$, but $x$ is in a different domain, and its coefficients correspond to different sound frequencies. As such, we cannot use Wiener filtering.
Instead, we propose to use the optimal transport matrices to produce separated spectra $x^{(1)}$ and $x^{(2)}$ in the same domain as $x$. Let $T$ be the optimal transport plan between the reconstruction $\hat v := \hat v^{(1)} + \hat v^{(2)}$ and the input magnitude spectrum $v := |x|$. With Wiener filtering, $v$ is decomposed into its components generated by $\hat v^{(1)}$ and $\hat v^{(2)}$. We use the same idea and separate the transport matrix into two parts $T^{(1)}$ and $T^{(2)}$.
$T^{(1)}$ (resp. $T^{(2)}$) is a transport matrix between $\hat v^{(1)}$ (resp. $\hat v^{(2)}$) and $v^{(1)}$ (resp. $v^{(2)}$), where

$$T^{(i)}_{jl} := T_{jl} \, \frac{\hat v^{(i)}_j}{\hat v^{(1)}_j + \hat v^{(2)}_j}, \qquad v^{(i)} := T^{(i)\top} \mathbf{1}.$$
Similarly to the classical Wiener filter, we have $v^{(1)} + v^{(2)} = v$, and we set $x^{(i)} := x \odot v^{(i)} / v$ before performing the inverse STFT.
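The plan-splitting step can be sketched as follows; the notation and the proportional row-wise split are our reading of the construction above, not the paper's exact code.

```python
import numpy as np

def generalized_filter(T, v1_hat, v2_hat, eps=1e-12):
    """Split a transport plan T (rows: reconstruction bins, columns: input
    bins) according to the share of each source in the reconstruction,
    then read off separated spectra on the input frequency grid."""
    share1 = v1_hat / (v1_hat + v2_hat + eps)   # per-bin proportion of source 1
    T1 = T * share1[:, None]                    # row-wise split of the plan
    T2 = T - T1
    v1 = T1.sum(axis=0)                         # mass received by each input bin
    v2 = T2.sum(axis=0)
    return v1, v2                               # v1 + v2 == column sums of T

# Tiny example: 2 reconstruction bins transported onto 3 input bins.
T = np.array([[0.2, 0.1, 0.0],
              [0.0, 0.3, 0.4]])
v1, v2 = generalized_filter(T, np.array([1.0, 0.5]), np.array([0.0, 0.5]))
assert np.allclose(v1 + v2, T.sum(axis=0))
```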
Heuristic Mapping. As an alternative to this generalized filter, we propose to simply map the reconstructed signal to the same domain as $x$ by assigning the weight of each frequency bin of a spectrogram to its closest neighbor among the frequency bins of $x$, according to the distance we defined for the cost matrix (see Section 3.3).
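A sketch of this heuristic mapping, re-using the log-domain distance of Section 3.3 (`lam` is a placeholder value):

```python
import numpy as np

def heuristic_map(V, src_freqs, tgt_freqs, lam=1.0):
    """Re-quantize a spectrogram V (bins = src_freqs) onto tgt_freqs by
    sending each bin's weight to its nearest target bin in the log domain
    used for the cost matrix."""
    d = np.abs(np.log(lam + np.asarray(src_freqs))[:, None]
               - np.log(lam + np.asarray(tgt_freqs))[None, :])
    nearest = d.argmin(axis=1)                     # closest target bin per source bin
    out = np.zeros((len(tgt_freqs),) + V.shape[1:])
    np.add.at(out, nearest, V)                     # accumulate weights
    return out

V = np.array([[1.0], [2.0], [3.0]])                # 3 source bins, 1 time frame
out = heuristic_map(V, [100.0, 200.0, 300.0], [90.0, 310.0])
assert np.isclose(out.sum(), V.sum())              # total mass is preserved
```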
Separated Signal Reconstruction. Separated sounds are reconstructed by inverse STFT after applying a Wiener filter or the generalized filter to the separated spectrograms.
In this section we present the main empirical findings of this paper. We start by describing the dataset that we used and the pre-processing we applied to it. We then show that the optimal transport loss allows us to have perceptually good reconstructions of single voices, even with few dictionary elements. Finally we show that the optimal transport loss improves upon a Euclidean loss for BSS with an NMF model, both in single-domain and cross-domain settings.
4.1 Dataset and Pre-processing
We evaluate our method on the English part of the Multi-Lingual Speech Database for Telephonometry 1994 dataset (http://www.ntt-at.com/product/speech2002/). The data consists of recordings of the voices of four male and four female speakers, each pronouncing 24 different English sentences. We split each person's audio file time-wise into train and test data. The files are re-sampled and treated as mono signals.
One of the male voices and one of the female voices are only used for hyper-parameter selection, and are not included in the results.
The signals are analysed by STFT with a Hann window. The constant coefficient is removed from the NMF analysis and added back for reconstruction in post-processing.
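As an illustration of this pre-processing, the following sketch computes an STFT with a Hann window, drops the constant (DC) bin for the NMF analysis, and restores it with the mixture phase before the inverse STFT; the sampling rate and window size are assumed values, not the paper's.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000                                   # assumed sampling rate (Hz)
signal = np.random.default_rng(0).standard_normal(fs)   # 1 s of noise

f, t, X = stft(signal, fs=fs, window="hann", nperseg=512)
V = np.abs(X[1:])            # magnitude spectrogram, constant (DC) bin removed
phase = X / np.maximum(np.abs(X), 1e-12)     # kept for reconstruction

# After separation, the DC bin and the phase are restored before inverse STFT.
X_rec = np.vstack([X[:1], V * phase[1:]])
_, out = istft(X_rec, fs=fs, window="hann", nperseg=512)
```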
Hyper-parameters are selected on validation data consisting of the first male and female voices, which are excluded from the evaluation set.
Initialization is performed by setting each dictionary column to the optimal transport barycenter of all the time frames of the training data, to which we added Gaussian noise (separately for each column). The barycenters are computed using the algorithm of Benamou et al. (2015).
4.2 NMF Audio Quality
We first show that using an optimal transport loss for NMF leads to better perceptual reconstruction of voice data. To that end, we evaluated the PEMO-Q score (Huber and Kollmeier 2006) of isolated test voices. The dictionaries are learned on the isolated voices in the train dataset, and are the same as in the following separation experiment.
Figure 4 shows the mean and standard deviation of the scores for several dictionary sizes $k$, with optimal transport and Euclidean NMF. The PEMO-Q score of optimal transport NMF is significantly higher for any value of $k$. We found empirically that other scores such as SDR or SNR tend to be better for the Euclidean NMF, even though the reconstructed voices are clearly worse when listening to them (see additional files 1 and 2). Optimal transport can reconstruct clear and intelligible voices with few dictionary elements.
4.3 Blind Source Separation
We evaluate our Blind Source Separation using the PEASS score proposed in Emiya et al. (2011), which they claim is closer to how humans would score BSS than SDR. We only consider mixtures of two voices, where the mixture is simply an addition of the sound signals.
Single-Domain Blind Source Separation. We first show that using an optimal transport NMF improves on Euclidean NMF for BSS when the spectrograms of the train and test data use the same frequencies. In this experiment, both the training and test data are processed in exactly the same way, so that the frequency quantization is identical at train and test time. For Euclidean-based BSS, we reconstruct the signal using a Wiener filter before applying the inverse STFT. For optimal transport-based source separation, we evaluate separation using either the Wiener filter or our generalized filter.
Figure 5 shows the mean and standard deviation of the PEASS scores for several dictionary sizes $k$. In all cases, optimal transport yields better results.
Figure 6 shows a comparison for each pair of mixed voices, with $k$ selected on the validation set, separately for Euclidean and for optimal transport NMF. It shows that the PEASS score is better with an optimal transport loss for almost all files. We can further see that in the case of single-domain BSS, the Wiener filter and our generalized Wiener filter yield very similar results.
Cross-Domain Blind Source Separation. In this experiment, we keep the dictionaries trained for the single-domain experiment, but we re-process the test data with a different time-window for the STFT. Although the test frequencies no longer match the training ones, we can still compute optimal transport between the spectrograms thanks to our cost matrix.
Figure 7 shows the results on the test set. The score for Euclidean NMF is computed by first mapping the test data to the same domain as the train data, using the heuristic mapping, and then performing same-domain separation. Both the heuristic mapping and the generalized filter improve upon Euclidean NMF, and they both achieve similar results. Still, the use of our generalized filter allows us to apply the exact same processing whether performing single-domain or cross-domain separation, the only difference being the cost matrix $C$, whereas the heuristic mapping requires additional post-processing and choosing rules for the mapping.
Regularization of the Transport Plan. In this work we considered entropy-regularized optimal transport as introduced by Cuturi (2013). This allows us to get an easy-to-solve dual problem, since its convex conjugate is smooth and can be computed in closed form. However, any convex regularizer would yield the same duality results and could be considered, as long as its conjugate is computable. For instance, squared $\ell_2$ norm regularization was considered in several recent works (Blondel et al. 2018; Seguy et al. 2017) and was shown to have desirable properties such as better numerical stability and sparsity of the optimal transport plan. Moreover, similarly to entropic regularization, its convex conjugate and gradient can be computed in closed form (Blondel et al. 2018).
Learning Procedure. Following the work of Rolet et al. (2016), we solved the NMF problem with an alternating minimization approach, in which at each iteration a complete optimization is performed on either the dictionary or the coefficients. While this seems to work well in our experiments, it would be interesting to compare with smaller-step approaches such as the multiplicative updates of Lee and Seung (2001). Unfortunately, such updates do not exist to our knowledge: gradient methods in the primal would be prohibitively slow, since they involve solving large optimal transport problems at each iteration.
We showed that using an optimal transport based loss can improve the performance of NMF-based models for voice reconstruction and separation tasks. We believe this is a first step towards using optimal transport as a loss for speech processing, possibly with more complicated models such as neural networks. The versatility of optimal transport, which can compare spectrograms on different frequency domains, lets us use dictionaries on sounds that are not recorded or processed in the same way as the training set. This property could also be beneficial for learning common representations (e.g. dictionaries) across different datasets.
All of the additional files are wav files.
Additional file 1 — Reconstruction with optimal transport NMF
This file contains the reconstructed signal for 6 test sentences of the male validation voice with optimal transport NMF and a dictionary of rank 5 (5 columns), where the dictionary was learnt on the training sentences of the same voice.
Additional file 2 — Reconstruction with Euclidean NMF
This file contains the reconstructed signal for 6 test sentences of the male validation voice with Euclidean NMF and a dictionary of rank 5 (5 columns), where the dictionary was learnt on the training sentences of the same voice.
The authors would like to thank Arnaud Dessein, who gave helpful insight on the cost matrix design.
- Benamou et al.  Jean-David Benamou, Guillaume Carlier, Marco Cuturi, Luca Nenna, and Gabriel Peyré. Iterative bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2):A1111–A1138, 2015.
- Bertsekas  Dimitri P Bertsekas. Nonlinear programming. Athena scientific Belmont, 1999.
- Blondel et al.  Mathieu Blondel, Vivien Seguy, and Antoine Rolet. Smooth and sparse optimal transport. In Artificial Intelligence and Statistics, 2018.
- Cuturi  Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.
- Cuturi and Peyré  Marco Cuturi and Gabriel Peyré. A smoothed dual approach for variational wasserstein problems. SIAM Journal on Imaging Sciences, 9(1):320–343, 2016.
- Emiya et al.  Valentin Emiya, Emmanuel Vincent, Niklas Harlander, and Volker Hohmann. Subjective and objective quality assessment of audio source separation. IEEE Transactions on Audio, Speech, and Language Processing, 19(7):2046–2057, 2011.
- Févotte et al.  Cédric Févotte, Nancy Bertin, and Jean-Louis Durrieu. Nonnegative matrix factorization with the itakura-saito divergence: With application to music analysis. Neural computation, 21(3):793–830, 2009.
- Flamary et al.  Rémi Flamary, Cédric Févotte, Nicolas Courty, and Valentin Emiya. Optimal spectral transportation with application to music transcription. In Advances in Neural Information Processing Systems, pages 703–711, 2016.
- Frogner et al.  Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya, and Tomaso A Poggio. Learning with a wasserstein loss. In Advances in Neural Information Processing Systems, pages 2053–2061, 2015.
- Gramfort et al.  Alexandre Gramfort, Gabriel Peyré, and Marco Cuturi. Fast optimal transport averaging of neuroimaging data. In International Conference on Information Processing in Medical Imaging, pages 261–272. Springer, 2015.
- Huber and Kollmeier  Rainer Huber and Birger Kollmeier. Pemo-q—a new method for objective audio quality assessment using a model of auditory perception. IEEE Transactions on audio, speech, and language processing, 14(6):1902–1911, 2006.
- Lee and Seung  Daniel D Lee and H Sebastian Seung. Algorithms for non-negative matrix factorization. In Advances in neural information processing systems, pages 556–562, 2001.
- Nesterov  Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27(2):372–376, 1983.
- Orlin  J.B. Orlin. A polynomial time primal network simplex algorithm for minimum cost flows. Mathematical Programming, 78(2):109–129, 1997.
- Peyré and Cuturi  Gabriel Peyré and Marco Cuturi. Computational Optimal Transport. 2017.
- Rolet et al.  Antoine Rolet, Marco Cuturi, and Gabriel Peyré. Fast dictionary learning with a smoothed wasserstein loss. In Artificial Intelligence and Statistics, pages 630–638, 2016.
- Sandler and Lindenbaum  R. Sandler and M. Lindenbaum. Nonnegative matrix factorization with earth mover’s distance metric. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1873–1880. IEEE, 2009.
- Sawada et al.  Hiroshi Sawada, Hirokazu Kameoka, Shoko Araki, and Naonori Ueda. Multichannel extensions of non-negative matrix factorization with complex-valued data. IEEE Transactions on Audio, Speech, and Language Processing, 21(5):971–982, 2013.
- Schmidt and Olsson  Mikkel N Schmidt and Rasmus Kongsgaard Olsson. Single-channel speech separation using sparse non-negative matrix factorization. In Spoken Language Proceesing, ISCA International Conference on (INTERSPEECH), 2006.
- Seguy et al.  Vivien Seguy, Bharath Bhushan Damodaran, Rémi Flamary, Nicolas Courty, Antoine Rolet, and Mathieu Blondel. Large-scale optimal transport and mapping estimation. arXiv preprint arXiv:1711.02283, 2017.
- Shirdhonkar and Jacobs  S. Shirdhonkar and D.W. Jacobs. Approximate earth mover’s distance in linear time. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.
- Sun and Mysore  Dennis L Sun and Gautham J Mysore. Universal speech models for speaker independent single channel source separation. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 141–145. IEEE, 2013.
- Tropp  Joel A Tropp. An alternating minimization algorithm for non-negative matrix approximation, 2003.
- Villani  Cédric Villani. Topics in optimal transportation. Number 58. American Mathematical Soc., 2003.