Humans are remarkably good at separating data coming from a mixture of distributions, e.g. hearing a person speaking in a crowded cocktail party. Artificial intelligence, on the other hand, is far less adept at separating mixed signals. This is an important ability as signals in nature are typically mixed, e.g. speakers are often mixed with other speakers or environmental sounds, and objects in images are typically seen alongside other objects as well as the background. Understanding mixed signals is harder than understanding pure sources, making source separation an important research topic.
Mixed signal separation appears in many scenarios corresponding to different degrees of supervision. Most previous work focused on the following settings:
Full supervision: The learner has access to a training set including samples of mixed signals $y_i \in \mathcal{Y}$ as well as the ground truth sources of the same signals, $b_i \in \mathcal{B}$ and $x_i \in \mathcal{X}$ (such that $y_i = x_i + b_i$). Having such strong supervision is very potent, allowing the learner to directly learn a mapping from the mixed signal $y_i$ to its sources $(x_i, b_i)$. Obtaining such strong supervision is typically unrealistic, as it requires manual separation of mixed signals. Consider for example a musical performance: humans are often able to separate out the sounds of the individual instruments, despite never having heard them play in isolation. The fully supervised setting does not allow the clean extraction of signals that cannot be observed in isolation, e.g. music of a street performer, car engine noises or reflections in shop windows.
Synthetic full supervision: The learner has access to a training set containing samples from the mixed signal $y \in \mathcal{Y}$ as well as samples from all source distributions, $b \in \mathcal{B}$ and $x \in \mathcal{X}$. The learner however does not have access to paired sets of the mixed and unmixed signal ground truth (that is, for any given $y$ in the training set, $b$ and $x$ are unknown). This supervision setting is more realistic than the fully supervised case, and occurs when each of the source distributions can be sampled in its pure form (e.g. we can record a violin and piano separately in a studio and can thus obtain unmixed samples of each of their distributions). It is typically solved by learning to separate synthetic mixtures $b + x$ of randomly sampled $b$ and $x$.
No supervision: The learner only has access to training samples of the mixed signal $\mathcal{Y}$ but not to the sources $\mathcal{B}$ and $\mathcal{X}$. Although this setting places the fewest requirements on the training dataset, it is a hard problem and can be poorly specified in the absence of strong assumptions and priors. It is generally necessary to make strong assumptions on the properties of the component signals (e.g. smoothness, low rank, periodicity) in order to make progress in separation. This unfortunately severely limits the applicability of such methods.
In this work we concentrate on the semi-supervised setting: unmixing of signals in the case where the mixture $\mathcal{Y}$ consists of a signal coming from an unobserved distribution $\mathcal{X}$ and another signal from an observed distribution $\mathcal{B}$ (i.e. the learner has access to a training set of clean samples $b \in \mathcal{B}$ along with different mixed samples $y \in \mathcal{Y}$). One possible way of obtaining such supervision is to label every signal sample with a label indicating whether the sample comes only from the observed distribution $\mathcal{B}$ or is a mixture of both distributions $\mathcal{B} + \mathcal{X}$. The task is to learn a parametric function able to separate the mixed signal $y \in \mathcal{Y}$ into sources $x \in \mathcal{X}$ and $b \in \mathcal{B}$ s.t. $y = b + x$. Such supervision is much more generally available than full supervision, while the separation problem becomes much simpler than when fully unsupervised.
We introduce a novel method, Neural Egg Separation (NES), consisting of i) iterative estimation of samples from the unobserved distribution $\mathcal{X}$, ii) synthesis of mixed signals from known samples of $\mathcal{B}$ and estimated samples of $\mathcal{X}$, and iii) training of a separation regressor to separate the mixed signal. Iterative refinement of the estimated samples of $\mathcal{X}$ significantly increases the accuracy of the learned masking function. As an iterative technique, NES can be initialization sensitive. We therefore introduce another method, GLO Masking (GLOM), to provide NES with a strong initialization. Our method trains two deep generators end-to-end using GLO to model the observed and unobserved sources ($\mathcal{B}$ and $\mathcal{X}$). NES is very effective when $\mathcal{X}$ and $\mathcal{B}$ are uncorrelated, whereas initialization by GLOM is most important when $\mathcal{X}$ and $\mathcal{B}$ are strongly correlated, e.g. in the separation of musical instruments. Initialization by GLOM was found to be much more effective than by adversarial methods.
Experiments are conducted across multiple domains (image, music, voice) validating the effectiveness of our method, and its superiority over current methods that use the same level of supervision. Our semi-supervised method is often competitive with the fully supervised baseline. It makes few assumptions on the nature of the component signals and requires lightweight supervision.
2 Previous Work
Source separation: Separation of mixed signals has been extensively researched. In this work, we focus on single channel separation. Unsupervised (blind) single-channel methods attempt to use coarse priors about the signals such as low rank, sparsity (e.g. RPCA (Huang et al., 2012)) or non-gaussianity. HMMs can be used as a temporal prior for longer clips (Roweis, 2001); however, here we do not assume long clips. Supervised source separation has also been extensively researched. Classic techniques often used learned dictionaries for each source, e.g. NMF (Wilson et al., 2008). Recently, neural network-based methods have gained popularity, usually learning a regression between the mixed and unmixed signals either directly (Huang et al., 2014) or by regressing the mask (Wang et al., 2014; Yu et al., 2017). Some methods were devised to exploit the temporal nature of long audio signals by using RNNs (Mimilakis et al., 2017); in this work we concentrate on separation of short audio clips and consider this line of work orthogonal. One related direction is Generative Adversarial Source Separation (Stoller et al., 2017; Subakan & Smaragdis, 2017), which uses adversarial training to match the unmixed source distributions. This is needed to deal with correlated sources, for which learning a regressor on synthetic mixtures is less effective. We present an Adversarial Masking (AM) method that tackles the semi-supervised rather than the fully supervised scenario and overcomes mixture collapse issues not present in the fully supervised case. We found that non-adversarial methods perform better for the initialization task.
The most related set of works is semi-supervised audio source separation (Smaragdis et al., 2007; Barker & Virtanen, 2014), which, like our work, attempts to separate mixtures given only samples from the distribution of one source. Typically NMF or PLCA (which is a similar algorithm with a probabilistic formulation) are used. We show experimentally that our method significantly outperforms NMF.
Disentanglement: Similarly to source separation, disentanglement also deals with separation in terms of creating a disentangled representation of a source signal, however its aim is to uncover latent factors of variation in the signal such as style and content or shape and color e.g. Denton et al. (2017); Higgins et al. (2016). Differently from disentanglement, our task is separating signals rather than the latent representation.
Generative Models: Generative models learn the distribution of a signal directly. Classical approaches include SVD for general signals and NMF (Lee & Seung, 2001) for non-negative signals. Recently several deep learning approaches have dominated generative modeling, including GAN (Goodfellow et al., 2016), VAE (Kingma & Welling, 2013) and GLO (Bojanowski et al., 2018). Adversarial training (for GANs) is rather tricky and often leads to mode-collapse. GLO is non-adversarial and allows for direct latent optimization for each source, making it more suitable than VAE and GAN.
3 Neural Egg Separation (NES)
In this section we present our method for separating a mixture of sources of known and unknown distributions. We denote the mixture samples $y_i$, the samples with the observed distribution $b_i$ and the samples from the unobserved distribution $x_i$. Our objective is to learn a parametric function $T()$ such that $b_i = T(y_i)$.
Full Supervision: In the fully supervised setting (where pairs of a mixture $y_i$ and its observed-source component $b_i$ are available) this task reduces to a standard supervised regression problem, in which a parametric function $T()$ (typically a deep neural network) is used to directly optimize:
$$\mathcal{L} = \min_T \sum_i \ell(T(y_i), b_i) \tag{1}$$
where $\ell$ is typically the Euclidean or the $L_1$ loss. In this work we use the $L_1$ loss.
Mixed-unmixed pairs are usually unavailable, but in some cases it is possible to obtain a training set which includes unrelated samples $x_j \in \mathcal{X}$ and $b_k \in \mathcal{B}$, e.g. (Wang et al., 2014; Yu et al., 2017). Methods typically randomly sample $x_j$ and $b_k$ sources and synthetically create mixtures $\tilde{y}_{jk} = x_j + b_k$. The synthetic pairs $(\tilde{y}_{jk}, b_k)$ can then be used to optimize Eq. 1. Note that in cases where $\mathcal{X}$ and $\mathcal{B}$ are correlated (e.g. vocals and instrumental accompaniment, which are temporally dependent), random synthetic mixtures of $x$ and $b$ might not be representative of $y$, and a separator trained on them may fail to generalize to real mixtures.
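The synthetic-pair construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `make_synthetic_pairs` and the array shapes are assumptions, and signals are taken to be non-negative arrays that mix additively.

```python
import numpy as np

def make_synthetic_pairs(b_samples, x_samples, rng):
    """Pair every clean sample of one source with a random sample of the other.

    Returns (mixtures, targets), usable as (input, regression target) pairs.
    """
    idx = rng.integers(0, len(x_samples), size=len(b_samples))
    mixtures = b_samples + x_samples[idx]  # additive mixing
    return mixtures, b_samples

rng = np.random.default_rng(0)
b = rng.random((8, 16))  # toy clean samples of the first source
x = rng.random((5, 16))  # toy clean samples of the second source
y_syn, target = make_synthetic_pairs(b, x, rng)
```

Because the two sources are drawn independently, any temporal or semantic dependence between them is lost, which is the limitation noted for correlated sources.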
Semi-Supervision: In many scenarios, clean samples of both mixture components are not available. Consider for example a street musical performance. Although crowd noises without street performers can be easily observed, street music without crowd noise is much harder to come by. In this case, therefore, samples from the distribution of crowd noise $\mathcal{B}$ are available, whereas samples from the distribution of the music $\mathcal{X}$ are unobserved. Samples from the distribution of the mixed signal $\mathcal{Y}$, i.e. the crowd noise mixed with the musical performance, are also available.
The example above illustrates a class of problems for which the distribution of the mixture $\mathcal{Y}$ and a single source $\mathcal{B}$ are available, but the distribution of another source $\mathcal{X}$ is unknown. In such cases, it is not possible to optimize Eq. 1 directly due to the unavailability of pairs of a mixture $y$ and its source $b$.
Neural Egg Separation: Fully-supervised optimization (as in Eq. 1) is very effective when pairs of a mixture $y$ and its source $b$ are available. We present a novel algorithm, which iteratively solves the semi-supervised task as a sequence of supervised problems without any clean training examples of $\mathcal{X}$. We name the method Neural Egg Separation (NES), as it is akin to the technique commonly used for separating egg whites and yolks.
The core idea of our method is that although no clean samples from $\mathcal{X}$ are given, it is still possible to learn to separate mixtures of observed samples $b_j$ from distribution $\mathcal{B}$ combined with some estimates $\tilde{x}_i$ of the unobserved distribution samples. Synthetic mixtures are created by randomly sampling an approximate sample $\tilde{x}_i$ from the unobserved distribution and combining it with a training sample $b_j$:
$$\tilde{y}_{ij} = b_j + \tilde{x}_i \tag{2}$$
thereby creating pairs $(\tilde{y}_{ij}, b_j)$ for supervised training. Note that the distribution of synthetic mixtures $\tilde{y}_{ij}$ might be different from the real mixture sample distribution $y_i$, but the assumption (which is empirically validated) is that it will eventually converge to the correct distribution.
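During each iteration of NES, a neural separation function $T()$ is trained on the created pairs of synthetic mixture $\tilde{y}_{ij}$ and observed sample $b_j$ by optimizing the following term:
$$\mathcal{L} = \min_T \sum_{i,j} \ell(T(\tilde{y}_{ij}), b_j) \tag{3}$$
At the end of each iteration, the separation function $T()$ can be used to approximately separate the training mixture samples $y_i$ into their sources:
$$\bar{b}_i = T(y_i), \qquad \tilde{x}_i = y_i - T(y_i) \tag{4}$$
The refined domain estimates $\tilde{x}_i$ are used for creating synthetic pairs for finetuning $T()$ in the next iteration (as in Eq. 3).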
The above method relies on having an estimate of the unobserved distribution samples as input to the first iteration. One simple scheme is to initialize the estimates of the unobserved distribution samples in the first iteration as $\tilde{x}_i = c \cdot y_i$, where $y_i$ is a training mixture and $c$ is a constant fraction (typically 0.5). Although this initialization is very naive, we show that it achieves very competitive performance in cases where the sources are independent. More advanced initializations will be discussed below.
At test time, separation is simply carried out by a single application of the trained separation function (exactly as in Eq. 4).
Our full algorithm is described in Alg. 1. For optimization, we use SGD using ADAM update with a learning rate of . In total we perform iterations, each consisting of optimization of and estimation of , epochs are used for each optimization of Eq. 3.
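The iterative procedure described above can be sketched as follows. This is an illustrative toy under stated assumptions, not the paper's implementation: the neural separator is replaced by a ridge-regression stand-in (`fit_linear_separator`, a hypothetical helper), signals are non-negative NumPy arrays, and `n_iters`, `c` and `ridge` are placeholder values.

```python
import numpy as np

def fit_linear_separator(y_syn, b_target, ridge=1e-3):
    """Toy stand-in for the neural separator: ridge regression from mixture to observed source."""
    d = y_syn.shape[1]
    w = np.linalg.solve(y_syn.T @ y_syn + ridge * np.eye(d), y_syn.T @ b_target)
    # Clip so the estimate of the observed source stays within [0, mixture].
    return lambda y: np.clip(y @ w, 0.0, y)

def neural_egg_separation(y_train, b_train, n_iters=5, c=0.5, seed=0,
                          fit=fit_linear_separator):
    rng = np.random.default_rng(seed)
    x_est = c * y_train  # naive constant-mask initialization
    separator = None
    for _ in range(n_iters):
        # Pair each observed sample with a random current estimate of the unobserved source.
        idx = rng.integers(0, len(x_est), size=len(b_train))
        y_syn = b_train + x_est[idx]
        # Supervised step: regress synthetic mixtures onto their known observed component.
        separator = fit(y_syn, b_train)
        # Refine the unobserved-source estimates on the real mixtures.
        x_est = y_train - separator(y_train)
    return separator, x_est

rng = np.random.default_rng(1)
b_train = rng.random((64, 8))                               # clean observed-source samples
y_train = rng.random((64, 8)) + 2.0 * rng.random((64, 8))  # toy mixtures of two sources
separator, x_est = neural_egg_separation(y_train, b_train)
```

Any regressor with a fit/predict interface could be passed as `fit`; the loop structure (synthesize, train, re-estimate) is the part the sketch is meant to show.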
GLO Masking: NES is very powerful in practice despite its apparent simplicity. There are some cases for which it can be improved upon. As with other synthetic mixture methods, it does not take into account correlation between $\mathcal{X}$ and $\mathcal{B}$, e.g. vocals and instrumental tracks are highly related, whereas randomly sampling pairs of vocals and instrumental tracks is likely to synthesize mixtures quite different from $\mathcal{Y}$. Another issue is finding a good initialization, which tends to affect performance more strongly when $\mathcal{X}$ and $\mathcal{B}$ are dependent.
We present our method GLO Masking (GLOM), which separates the mixture by a distributional constraint enforced via GLO generative modeling of the source signals. GLO (Bojanowski et al., 2018) learns a generator $G()$, which takes a latent code $z_b$ and attempts to reconstruct an image or a spectrogram: $b = G(z_b)$. In training, GLO learns end-to-end both the parameters of the generator $G()$ as well as a latent code $z_b$ for every training sample $b$. It trains per-sample latent codes by direct gradient descent over the values of $z_b$ (similar to word embeddings), rather than by the feedforward encoder used by autoencoders. This makes it particularly suitable for our scenario. Let us define the set of latent codes: $Z = \{z_b\}$. The optimization is therefore:
$$\mathcal{L} = \min_{Z, G} \sum_{b \in \mathcal{B}} \ell(b, G(z_b)) \tag{5}$$
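We propose GLO Masking, which jointly trains generators: $G_B()$ for $\mathcal{B}$ and $G_X()$ for $\mathcal{X}$ such that their sum results in mixture samples $y = G_B(z^B_y) + G_X(z^X_y)$. We use the supervision of the observed source $\mathcal{B}$ to train $G_B()$, while the mixture $\mathcal{Y}$ contributes residuals that supervise the training of $G_X()$. We also jointly train the latent codes for all training images: $z_b \in Z_B$ for all $b \in \mathcal{B}$, and $z^B_y \in Z^B_y$, $z^X_y \in Z^X_y$ for all $y \in \mathcal{Y}$. The optimization problem is:
$$\mathcal{L} = \min_{Z_B, Z^B_y, Z^X_y, G_B, G_X} \sum_{b \in \mathcal{B}} \ell(b, G_B(z_b)) + \lambda \sum_{y \in \mathcal{Y}} \ell(y, G_B(z^B_y) + G_X(z^X_y)) \tag{6}$$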
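As GLO is able to overfit arbitrary distributions, it was found that constraining each latent code vector to lie within the unit ball is required for generalization. Eq. 6 can either be optimized end-to-end, or the left-hand term can be optimized first to yield the observed-source codes $Z_B$ and generator $G_B()$, then the right-hand term is optimized to yield the mixture codes $Z^B_y$, $Z^X_y$ and the generator $G_X()$. Both optimization procedures yield similar performance (but separate training does not require setting the relative weight $\lambda$ of the two terms). Once $G_B()$ and $G_X()$ are trained, for a new mixture sample $y$ we infer its latent codes:
$$(z^B_y, z^X_y) = \arg\min_{z^B, z^X} \ell(y, G_B(z^B) + G_X(z^X)) \tag{7}$$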
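Our estimate for the sources is then obtained by applying the observed-source generator $G_B()$ and the unobserved-source generator $G_X()$ to the inferred codes:
$$\bar{b} = G_B(z^B_y), \qquad \bar{x} = G_X(z^X_y) \tag{8}$$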
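To make the test-time inference step concrete, here is a minimal sketch of latent-code inference with frozen generators and a unit-ball constraint. Several simplifications are assumptions made for illustration only: the generators are linear maps (`w_b`, `w_x`) rather than deep networks, the loss is squared error, and the unit-ball constraint is enforced by projection after each gradient step.

```python
import numpy as np

def project_to_unit_ball(z):
    """Scale z back onto the unit ball if its norm exceeds 1; leave it unchanged otherwise."""
    norm = np.linalg.norm(z, axis=-1, keepdims=True)
    return z / np.maximum(norm, 1.0)

def infer_latents(y, w_b, w_x, steps=200):
    """Projected gradient descent on 0.5 * ||z_b W_b + z_x W_x - y||^2 with frozen generators."""
    # Step size 1/L, where L is the largest eigenvalue of the stacked generator Gram matrix.
    lr = 1.0 / np.linalg.norm(np.vstack([w_b, w_x]), 2) ** 2
    z_b = np.zeros(w_b.shape[0])
    z_x = np.zeros(w_x.shape[0])
    for _ in range(steps):
        residual = z_b @ w_b + z_x @ w_x - y
        z_b = project_to_unit_ball(z_b - lr * (w_b @ residual))
        z_x = project_to_unit_ball(z_x - lr * (w_x @ residual))
    return z_b, z_x

rng = np.random.default_rng(0)
w_b = rng.random((3, 10))  # toy linear "generator" for the observed source
w_x = rng.random((3, 10))  # toy linear "generator" for the unobserved source
y = rng.random(10)         # a new mixture sample
z_b, z_x = infer_latents(y, w_b, w_x)
b_hat, x_hat = z_b @ w_b, z_x @ w_x  # per-source estimates from the inferred codes
```

With deep generators the same loop would use automatic differentiation for the gradients; the projection step is unchanged.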
Masking Function: In separation problems, we can exploit the special properties of the task, e.g. that the mixed signal $y$ is the sum of two positive signals $x$ and $b$. Instead of synthesizing the new sample, we can simply learn a separation mask $m()$, specifying the fraction of the signal which comes from $\mathcal{B}$. The attractive feature of the mask is that it always lies in the range $[0, 1]$ (in the case of positive additive mixtures of signals). Even a constant mask will preserve all signal gradients (at the cost of introducing spurious gradients too). Mathematically this can be written as:
$$T(y) = y \cdot m(y) \tag{9}$$
For NES (and the baseline AM described below), we implement the mapping function $T(y)$ as the product of the mixture and the masking function, $y \cdot m(y)$. In practice we find that learning a masking function yields much better results than synthesizing the signal directly (in line with other works e.g. Wang et al. (2014); Gabbay et al. (2017)).
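GLOM models each source separately and is therefore unable to learn the mask directly. Instead we refine its estimate by computing an effective mask from the element-wise ratio of the estimated sources, where $G_B(z^B_y)$ and $G_X(z^X_y)$ are the GLOM estimates of the observed and unobserved source:
$$m = \frac{G_B(z^B_y)}{G_B(z^B_y) + G_X(z^X_y)} \tag{10}$$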
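The two masking operations above, applying a mask to a mixture and deriving an effective mask from two per-source estimates, can be illustrated with a short sketch. The function names and the small `eps` guard against division by zero are assumptions added for the example.

```python
import numpy as np

def apply_mask(y, mask):
    """Split a non-negative mixture into two parts using a mask with values in [0, 1]."""
    return y * mask, y * (1.0 - mask)

def effective_mask(b_hat, x_hat, eps=1e-8):
    """Element-wise fraction of the estimated signal attributed to the first source."""
    return b_hat / (b_hat + x_hat + eps)

y = np.array([[1.0, 2.0], [0.5, 4.0]])        # toy mixture
b_hat = np.array([[0.5, 0.5], [0.25, 1.0]])   # rough estimate of the observed source
x_hat = np.array([[0.5, 1.5], [0.25, 1.0]])   # rough estimate of the unobserved source
m = effective_mask(b_hat, x_hat)
b_est, x_est = apply_mask(y, m)
```

Note that the masked estimates always sum back to the mixture, even when the per-source estimates themselves do not (as in the last entry here, where the estimates sum to 2.0 but the mixture value is 4.0).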
Initializing Neural Egg Separation by GLOM: Due to the iterative nature of NES, it can be improved by a good initialization. We therefore devise the following method: i) Train GLOM on the training set and infer the mask for each mixture. This is done on images or mel-scale spectrograms. ii) For audio: upsample the mask to the resolution of the high-resolution linear spectrogram and compute an estimate of the $\mathcal{X}$ source linear spectrogram on the training set. iii) Run NES on the observed $\mathcal{B}$ spectrograms and estimated $\mathcal{X}$ spectrograms. We find experimentally that this initialization scheme improves NES to the point of being competitive with fully-supervised training in most settings.
4 Experiments
To evaluate the performance of our method, we conducted experiments on distributions taken from multiple real-world domains: images, speech and music, in cases where the two signals are correlated and uncorrelated.
We evaluated our method against baseline methods:
Constant Mask (Const): This baseline uses the original mixture as the estimate.
Semi-supervised Non-negative Matrix Factorization (SS-NMF): This baseline method, proposed by Smaragdis et al. (2007), first trains a set of bases on the observed distribution samples by Sparse NMF (Hoyer, 2004; Kim & Park, 2007). It factorizes , with activations and bases , all matrices are non-negative. The optimization is solved using the Non-negative Least Squares solver by Kim & Park (2011). It then proceeds to train another factorization on the mixture training samples with bases, where the first bases () are fixed to those computed in the previous stage: . The separated sources are then: and . More details can be found in the appendix B.
Adversarial Masking (AM): As an additional contribution, we introduce a new semi-supervised method based on adversarial training, to improve over the shallow NMF baseline. AM trains a masking function $m()$ so that after masking, the training mixtures $y \cdot m(y)$ are indistinguishable from the distribution of source $\mathcal{B}$ under an adversarial discriminator $D()$. The loss functions (using LS-GAN (Mao et al., 2017)) are given by:
$$\mathcal{L}_D = \min_D \sum_{y \in \mathcal{Y}} D(y \cdot m(y))^2 + \sum_{b \in \mathcal{B}} (D(b) - 1)^2$$
$$\mathcal{L}_m = \min_m \sum_{y \in \mathcal{Y}} (D(y \cdot m(y)) - 1)^2$$
Differently from CycleGAN (Zhu et al., 2017) and DiscoGAN (Kim et al., 2017), AM is not bidirectional and cannot use cycle constraints. We have found that adding a magnitude prior improves performance and helps prevent collapse. To partially alleviate mode collapse, we use Spectral Norm (Miyato et al., 2018) on the discriminator.
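As a rough illustration of least-squares GAN objectives of the kind used by AM, the sketch below computes both losses from discriminator scores. It assumes the common LS-GAN target coding (0 for generated samples, 1 for real ones) and mean reduction; the function names are hypothetical and the discriminator itself is not modeled.

```python
import numpy as np

def lsgan_discriminator_loss(d_masked, d_observed):
    """Push scores of masked mixtures toward 0 and scores of clean observed samples toward 1."""
    return np.mean(d_masked ** 2) + np.mean((d_observed - 1.0) ** 2)

def lsgan_mask_loss(d_masked):
    """Push scores of masked mixtures toward 1, i.e. make them look like clean observed samples."""
    return np.mean((d_masked - 1.0) ** 2)

d_masked = np.zeros(4)   # discriminator scores on masked mixtures
d_observed = np.ones(4)  # discriminator scores on clean observed samples
d_loss = lsgan_discriminator_loss(d_masked, d_observed)
m_loss = lsgan_mask_loss(d_masked)
```

In this toy case the discriminator is perfect, so its loss is zero while the masking network's loss is at its largest for scores in [0, 1].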
We evaluated our proposed methods:
GLO Masking (GLOM): GLO Masking on mel-spectrograms or images at resolution.
Neural Egg Separation (NES): The NES method detailed in Sec. 3. Initializing estimates using a constant () mask over training samples.
Finetuning (FT): Initializing NES with the estimates obtained by GLO Masking.
To upper bound the performance of our method, we also compute a fully supervised baseline, for which paired data of the mixture $y$ and both of its sources $b$ and $x$ are available. We train a masking function with the same architecture as used by all other regression methods to directly regress synthetic mixtures to unmixed sources. This method uses more supervision than our method and is an upper bound.
More implementation details can be found in appendix A.
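For readers unfamiliar with the semi-supervised NMF baseline described above, the following sketch shows its two-stage structure: learn bases on the observed source, then factorize the mixtures with those bases held fixed while additional bases are learned. It deliberately swaps in plain Lee and Seung multiplicative updates for the sparse NMF and non-negative least squares solver cited in the text, and the matrix orientation (samples by features), function names and basis counts are assumptions for illustration.

```python
import numpy as np

def nmf(v, n_bases, n_iters=200, fixed_bases=None, seed=0, eps=1e-9):
    """Factorize non-negative v (samples x features) as h @ w with multiplicative updates.

    Rows of fixed_bases, if given, occupy the first rows of w and are never updated.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_features = v.shape
    h = rng.random((n_samples, n_bases)) + eps
    w = rng.random((n_bases, n_features)) + eps
    k_fixed = 0 if fixed_bases is None else fixed_bases.shape[0]
    if k_fixed:
        w[:k_fixed] = fixed_bases
    for _ in range(n_iters):
        h *= (v @ w.T) / (h @ w @ w.T + eps)
        w_update = w * (h.T @ v) / (h.T @ h @ w + eps)
        w[k_fixed:] = w_update[k_fixed:]  # only the free bases change
    return h, w

def ss_nmf_separate(b_train, y, k_b, k_x):
    _, w_b = nmf(b_train, k_b)                     # stage 1: bases of the observed source
    h, w = nmf(y, k_b + k_x, fixed_bases=w_b)      # stage 2: add free bases for the mixture
    b_hat = h[:, :k_b] @ w[:k_b]                   # part explained by the fixed bases
    x_hat = h[:, k_b:] @ w[k_b:]                   # part explained by the new bases
    return b_hat, x_hat

rng = np.random.default_rng(0)
b_train = rng.random((20, 6))
y = rng.random((15, 6)) + rng.random((15, 6))
b_hat, x_hat = ss_nmf_separate(b_train, y, k_b=3, k_x=2)
```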
4.1 Separating Mixed Images
In this section we evaluate the effectiveness of our method on image mixtures. We conduct experiments both on the simpler MNIST dataset and more complex Shoes and Handbags datasets.
4.1.1 MNIST
To evaluate the quality of our method on image separation, we design the following experimental protocol. We split the MNIST dataset (LeCun & Cortes, 2010) into two classes, the first consisting of the digits - and the second consisting of the digits -. We conduct experiments where one source has an observed distribution while the other source has an unobserved distribution . We use training images as the training set, while for each of the other training images, we randomly sample a image and additively combine the images to create the training set. We evaluate the performance of our method on images similarly created from the test set of and . The experiment was repeated for both directions i.e. - being while - in , as well as - being while - in .
In Tab. 1, we report our results on this task. For each experiment, the top row presents the results (PSNR and SSIM) on the test set. Due to the simplicity of the dataset, NMF achieved reasonable performance on this dataset. GLOM achieves better SSIM but worse PSNR than NMF while AM performed 1-2dB better. NES achieves much stronger performance than all other methods, achieving about 1dB worse than the fully supervised performance. Initializing NES with the masks obtained by GLOM, results in similar performance to the fully-supervised upper bound. FT from AM achieved similar performance ( and ) to FT from GLOM.
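For reference, PSNR values such as those reported here are conventionally computed from the mean squared error between an estimate and the ground truth. The sketch below assumes images scaled to the range [0, 1] (`max_value=1.0`); the evaluation code actually used for the tables may differ in such details.

```python
import numpy as np

def psnr(estimate, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two arrays of the same shape."""
    mse = np.mean((estimate - reference) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)

reference = np.zeros((28, 28))
estimate = np.full((28, 28), 0.1)  # uniform error of 0.1 everywhere
value = psnr(estimate, reference)  # MSE is 0.01, giving 20 dB
```

On this scale a 1 dB gap corresponds to roughly a 21% reduction in mean squared error.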
4.1.2 Bags and Shoes
In order to evaluate our method on more realistic images, we evaluate on separating mixtures consisting of pairs of images sampled from the Handbags (Zhu et al., 2016) and Shoes (Yu & Grauman, 2014) datasets, which are commonly used for evaluation of conditional image generation methods. To create each mixture image, we randomly sample a shoe image from the Shoes dataset and a handbag image from the Handbags dataset and sum them. For the observed distribution, we sample another different images from a single dataset. We evaluate our method both for cases when the class is Shoes and when it is Handbags.
From the results in Tab. 1, we can observe that NMF failed to preserve fine details, penalizing its performance metrics. GLOM (which used a VGG perceptual loss) performed much better, due to greater expressiveness. AM performance was similar to GLOM on this task, as the perceptual loss and stability of training of non-adversarial models helped GLOM greatly. NES performed much better than all other methods, even when initialized from a constant mask. Finetuning from GLOM, helped NES achieve stronger performance, nearly identical to the fully-supervised upper bound. It performed better than finetuning from AM which achieved and (numbers for finetuning from AM were omitted from the tables for clarity, as they were inferior to finetuning from GLOM in all experiments). Similar conclusions can be drawn from the qualitative comparison in the figure above.
4.2 Separating Speech and Environmental Noise
Separating environmental noise from speech is a long-standing problem in signal processing. Although supervision for both human speech and natural noises can generally be obtained, we use this task as a benchmark to evaluate our method's performance on audio signals where $\mathcal{X}$ and $\mathcal{B}$ are not dependent. This benchmark is a proxy for tasks for which a clean training set of $\mathcal{X}$ sounds cannot be obtained, e.g. for animal sounds in the wild: background sounds without animal noises can easily be obtained for training, but clean sounds made by the animal with no background sounds are unlikely to be available.
We obtain clean speech segments from the Oxford-BBC Lip Reading in the Wild (LRW) Dataset (Chung & Zisserman, 2016), and resample the audio to 16 kHz. Audio segments from ESC-50 (Piczak, 2015), a dataset of environmental audio recordings organized into 50 semantic classes, are used as additive noise. Noisy speech clips are created synthetically by first splitting clean speech into clips with duration of 0.5 seconds, and adding a random noise clip, such that the resulting SNR is zero. We then compute a mel-scale spectrogram with 64 bins, using STFT with window size of 25 ms, hop length of 10 ms, and FFT size of 512, resulting in an input audio feature of scalars. Finally, power-law compression is performed with , i.e. , where is the input audio feature.
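The mixing and compression steps can be sketched as below. This is a hedged illustration on synthetic noise rather than real recordings: `mix_at_snr` and `power_law_compress` are hypothetical helper names, a single 25 ms frame stands in for the full STFT and mel pipeline, and the exponent `p=0.3` is only a placeholder value, not a figure taken from the text.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db=0.0):
    """Scale the noise so that the speech-to-noise power ratio equals snr_db, then add."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise, scale * noise

def power_law_compress(spec, p):
    """Element-wise power-law compression of a non-negative spectrogram."""
    return np.power(spec, p)

rng = np.random.default_rng(0)
speech = rng.standard_normal(8000)       # stand-in for a 0.5 s clip at 16 kHz
noise = 3.0 * rng.standard_normal(8000)  # stand-in for an environmental noise clip
noisy, scaled_noise = mix_at_snr(speech, noise, snr_db=0.0)
snr = 10.0 * np.log10(np.mean(speech ** 2) / np.mean(scaled_noise ** 2))

# One 25 ms frame (400 samples at 16 kHz) with FFT size 512 gives 257 frequency bins.
frame_magnitude = np.abs(np.fft.rfft(noisy[:400], n=512))
features = power_law_compress(frame_magnitude, p=0.3)  # p is an illustrative placeholder
```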
From the results in Tab. 2, we can observe that GLOM performed better than Semi-Supervised NMF by about 1 dB. AM training performed about 2 dB better than GLOM. Due to the independence between the sources in this task, NES performed very well, even when trained from a constant mask initialization. Performance was less than 1 dB lower than the fully supervised result (while not requiring any clean speech samples). In this setting, due to the strong performance of NES, initializing NES with the speech estimates obtained by GLOM (or AM) did not yield improved performance.
4.3 Music Separation
Separating vocal music into singing voice and instrumental music, as well as instrumental music and drums, has been a standard task for the signal processing community. Here our objective is to understand the behavior of our method in settings where $\mathcal{X}$ and $\mathcal{B}$ are dependent (which makes synthesis by addition of random $\mathcal{X}$ and $\mathcal{B}$ training samples a less accurate approximation).
For this task we use the MUSDB18 Dataset (Rafii et al., 2017), which, for each music track, comprises separate signal streams of the mixture, drums, bass, the rest of the accompaniment, and the vocals. We convert the audio tracks to mono, resample to Hz, and then follow the procedure detailed in Sec. 4.2 to obtain input audio features.
From the results in Tab. 3, we can observe that NMF was the worst performer in this setting (as its simple bases do not generalize well between songs). GLOM was able to do much better than NMF and was even competitive with NES on Vocal-Instrumental separation. Due to the dependence between the two sources and low SNR, initialization proved important for NES. Constant initialization NES performed similarly to AM and GLOM. Finetuning NES from GLOM masks performed much better than all other methods and was competitive with the supervised baseline. GLOM was much better than AM initialization (that achieved and ).
GLO vs. Adversarial Masking: GLO Masking as a stand alone technique usually performed worse than Adversarial Masking. On the other hand, finetuning from GLO masks was far better than finetuning from adversarial masks. We speculate that mode collapse, inherent in adversarial training, makes the adversarial masks a lower bound on the source distribution. GLOM can result in models that are too loose (i.e. that also encode samples outside of ). But as an initialization for NES finetuning, it is better to have a model that is too loose than a model which is too tight.
Supervision is important for source separation. Completely blind source separation is not well specified and simply using general signal statistics is generally unlikely to yield competitive results. Obtaining full supervision by providing a labeled mask for training mixtures is unrealistic but even synthetic supervision in the form of a large training set of clean samples from each source distribution might be unavailable as some sounds are never observed on their own (e.g. sounds of car wheels). Our setting significantly reduces the required supervision to specifying if a certain sound sample contains or does not contain the unobserved source. Such supervision can be quite easily and inexpensively provided. For further sample efficiency increases, we hypothesize that it would be possible to label only a limited set of examples as containing the target sound and not, and to use this seed dataset to finetune a deep sound classifier to extract more examples from an unlabeled dataset. We leave this investigation to future work.
Signal-Specific Losses: To showcase the generality of our method, we chose not to encode task specific constraints. In practical applications of our method however we believe that using signal-specific constraints can increase performance. Examples of such constraints include: repetitiveness of music (Rafii & Pardo, 2011), sparsity of singing voice, smoothness of natural images.
Non-Adversarial Alternatives: The good performance of GLOM vs. AM on the vocals separation task, suggests that non-adversarial generative methods may be superior to adversarial methods for separation. This has also been observed in other mapping tasks e.g. the improved performance of NAM (Hoshen & Wolf, 2018) over DCGAN (Radford et al., 2015).
Convergence of NES: A perfect signal separation function is a stable global minimum of NES as i) the synthetic mixtures are equal to real mixtures ii) real mixtures are perfectly separated. In all NES experiments (with constant, AM or GLOM initialization), NES converged after no more than 10 iterations, typically to different local minima. It is empirically evident that NES is not guaranteed to converge to a global minimum (although it converges to good local minima). We defer formal convergence analysis of NES to future work.
In this paper we proposed a novel method—Neural Egg Separation—for separating mixtures of observed and unobserved distributions. We showed that careful initialization using GLO Masking improves results in challenging cases. Our method achieves much better performance than other methods and was usually competitive with full-supervision.
- Barker & Virtanen (2014) Tom Barker and Tuomas Virtanen. Semi-supervised non-negative tensor factorisation of modulation spectrograms for monaural speech separation. In 2014 International Joint Conference on Neural Networks (IJCNN), pp. 3556–3561. IEEE, 2014.
- Bojanowski et al. (2018) Piotr Bojanowski, Armand Joulin, David Lopez-Paz, and Arthur Szlam. Optimizing the latent space of generative networks. In ICML’18, 2018.
- Chung & Zisserman (2016) J. S. Chung and A. Zisserman. Lip reading in the wild. In Asian Conference on Computer Vision, 2016.
- Denton et al. (2017) Emily L Denton et al. Unsupervised learning of disentangled representations from video. In Advances in Neural Information Processing Systems, pp. 4414–4423, 2017.
- Gabbay et al. (2017) Aviv Gabbay, Ariel Ephrat, Tavi Halperin, and Shmuel Peleg. Seeing through noise: Visually driven speaker separation and enhancement. arXiv preprint arXiv:1708.06767, 2017.
- Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning, volume 1. MIT Press, 2016.
- Higgins et al. (2016) Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. 2016.
- Hoshen & Wolf (2018) Yedid Hoshen and Lior Wolf. Nam: Non-adversarial unsupervised domain mapping. In ECCV’18, 2018.
- Hoyer (2004) Patrik O Hoyer. Non-negative matrix factorization with sparseness constraints. Journal of machine learning research, 5(Nov):1457–1469, 2004.
- Huang et al. (2012) Po-Sen Huang, Scott Deeann Chen, Paris Smaragdis, and Mark Hasegawa-Johnson. Singing-voice separation from monaural recordings using robust principal component analysis. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pp. 57–60. IEEE, 2012.
- Huang et al. (2014) Po-Sen Huang, Minje Kim, Mark Hasegawa-Johnson, and Paris Smaragdis. Deep learning for monaural speech separation. In ICASSP, 2014.
- Kim & Park (2007) Hyunsoo Kim and Haesun Park. Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics, 23(12):1495–1502, 2007.
- Kim & Park (2011) Jingu Kim and Haesun Park. Fast nonnegative matrix factorization: An active-set-like method and comparisons. SIAM Journal on Scientific Computing, 33(6):3261–3281, 2011.
- Kim et al. (2017) Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jungkwon Lee, and Jiwon Kim. Learning to discover cross-domain relations with generative adversarial networks. arXiv preprint arXiv:1703.05192, 2017.
- Kingma & Welling (2013) Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
- LeCun & Cortes (2010) Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010.
- Lee & Seung (2001) Daniel D Lee and H Sebastian Seung. Algorithms for non-negative matrix factorization. In Advances in neural information processing systems, pp. 556–562, 2001.
- Mao et al. (2017) Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, pp. 2813–2821. IEEE, 2017.
- Mimilakis et al. (2017) Stylianos Ioannis Mimilakis, Konstantinos Drossos, João F Santos, Gerald Schuller, Tuomas Virtanen, and Yoshua Bengio. Monaural singing voice separation with skip-filtering connections and recurrent inference of time-frequency mask. arXiv preprint arXiv:1711.01437, 2017.
- Miyato et al. (2018) Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.
- Piczak (2015) Karol J. Piczak. ESC: Dataset for Environmental Sound Classification. In Proceedings of the 23rd Annual ACM Conference on Multimedia, pp. 1015–1018. ACM Press, 2015.
- Radford et al. (2015) Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
- Rafii & Pardo (2011) Zafar Rafii and Bryan Pardo. A simple music/voice separation method based on the extraction of the repeating musical structure. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pp. 221–224. IEEE, 2011.
- Rafii et al. (2017) Zafar Rafii, Antoine Liutkus, Fabian-Robert Stöter, Stylianos Ioannis Mimilakis, and Rachel Bittner. The MUSDB18 corpus for music separation, December 2017. URL https://doi.org/10.5281/zenodo.1117372.
- Roweis (2001) Sam T Roweis. One microphone source separation. In Advances in neural information processing systems, pp. 793–799, 2001.
- Smaragdis et al. (2007) Paris Smaragdis, Bhiksha Raj, and Madhusudana Shashanka. Supervised and semi-supervised separation of sounds from single-channel mixtures. In International Conference on Independent Component Analysis and Signal Separation, pp. 414–421. Springer, 2007.
- Stoller et al. (2017) Daniel Stoller, Sebastian Ewert, and Simon Dixon. Adversarial semi-supervised audio source separation applied to singing voice extraction. arXiv preprint arXiv:1711.00048, 2017.
- Subakan & Smaragdis (2017) Cem Subakan and Paris Smaragdis. Generative adversarial source separation. arXiv preprint arXiv:1710.10779, 2017.
- Wang et al. (2014) Yuxuan Wang, Arun Narayanan, and DeLiang Wang. On training targets for supervised speech separation. TASLP, 2014.
- Wilson et al. (2008) Kevin W Wilson, Bhiksha Raj, Paris Smaragdis, and Ajay Divakaran. Speech denoising using nonnegative matrix factorization with priors. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pp. 4029–4032. IEEE, 2008.
- Yu & Grauman (2014) Aron Yu and Kristen Grauman. Fine-grained visual comparisons with local learning. In CVPR, 2014.
- Yu et al. (2017) Dong Yu, Morten Kolbæk, Zheng-Hua Tan, and Jesper Jensen. Permutation invariant training of deep models for speaker-independent multi-talker speech separation. In ICASSP, 2017.
- Zhu et al. (2016) Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pp. 597–613. Springer, 2016.
- Zhu et al. (2017) Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
Appendix A Architectures
In this section we give details of the architectures used in our experiments:
A.1 Image Architectures:
GLOM Generator: GLOM trains one generator for each of the two sources; each generator produces an image from a latent vector. The architecture follows the one first employed by DCGAN (Radford et al., 2015). We used the same number of filters across the MNIST and Shoes-Bags experiments. MNIST had one fewer layer, owing to its lower resolution. Generators were followed by sigmoid layers to ensure outputs within [0, 1].
GAN Discriminator: The discriminator used by Adversarial Masking is a DCGAN discriminator with filter dimensionality of . SpectralNorm is implemented exactly as described in (Miyato et al., 2018).
Masking Network: Adversarial Masking, NES and the fully supervised baseline all use the same masking function architecture. The masking function takes a mixed image and outputs a mask that, when multiplied by the image, yields the source. The architecture is an autoencoder similar to the one used in DiscoGAN (Kim et al., 2017). MNIST has two fewer layers owing to its lower resolution. The top and bottom layers use the same number of filters, and the filter count is doubled before and halved after the autoencoder's mid-way layer.
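The masking step itself can be illustrated with a minimal NumPy sketch. It is not the network used in the experiments: the logits stand in for the output of a masking network, and the helper names `sigmoid` and `apply_mask` are hypothetical. The sketch also assumes that the mask is element-wise with values in [0, 1], and that the mixture is additive, so that the remainder of the mixture can be read as the other source.

```python
import numpy as np


def sigmoid(logits):
    """Squash raw network outputs into (0, 1) so they can act as a mask."""
    return 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))


def apply_mask(mixture, mask):
    """Element-wise masking: estimated source = mixture * mask."""
    mixture = np.asarray(mixture, dtype=float)
    mask = np.asarray(mask, dtype=float)
    if mixture.shape != mask.shape:
        raise ValueError("mask and mixture must have the same shape")
    if mask.min() < 0.0 or mask.max() > 1.0:
        raise ValueError("mask values are expected to lie in [0, 1]")
    return mixture * mask


# Toy 2x2 "image"; the logits are a stand-in for a masking network's output.
mixture = np.array([[0.8, 0.2], [0.5, 1.0]])
logits = np.array([[10.0, -10.0], [0.0, 10.0]])
mask = sigmoid(logits)

source_estimate = apply_mask(mixture, mask)
# Assumption: additive mixture, so the remainder is the other source.
residual = mixture - source_estimate
```

Because the mask lies in [0, 1], the estimated source never exceeds the mixture at any pixel, which is the main inductive bias of masking compared with direct regression of the source.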
A.2 Audio Architectures:
GLOM and AM use the same generator and discriminator architectures for audio as they do for images. They operate on mel-scale spectrograms at resolution.
Masking Network: The generator for AM operates on mel-scale audio spectrograms. It consists of convolutional and deconvolutional layers with stride and no pooling. Outputs of convolutional layers are normalized with BatchNorm and rectified with ReLU activation, except for the last layer where sigmoid is used.
In addition to the LSGAN loss, a magnitude loss is used, with relative weight of .
NES and the supervised method operate on the full linear spectrogram of dimensions , without compression. They use the same DiscoGAN architecture, which contains two additional convolutional and deconvolutional layers.
Appendix B NMF Semi-Supervised Separation:
In this section we describe our implementation of the NMF semi-supervised source separation baseline (Smaragdis et al., 2007). NMF trains a decomposition into weights and per-sample latent codes, both of which are non-negative. Regularization is important for the performance of the method. Following Hoyer (2004) and Kim & Park (2007), we use regularization to ensure sparsity of the weights. The optimization problem therefore becomes:
This problem is solved iteratively using non-negative least squares (NNLS) with the solver by Kim & Park (2011). One iteration solves the following NNLS problem:
The other iteration optimizes the following NNLS problem:
The two iterations are alternated until convergence.
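For orientation, one common way to write a sparse NMF objective and its two alternating NNLS steps is sketched below. The notation is an assumption made for illustration and may differ from the equations of this appendix: $B$ is the matrix of training samples (one per column), $W$ the weights, $Z$ the latent codes and $\lambda$ the sparsity weight. Placing an $\ell_1$ penalty on the codes is also an assumption.

```latex
\begin{aligned}
\text{objective:}\quad & \min_{W \ge 0,\; Z \ge 0}\; \lVert B - W Z \rVert_F^2 + \lambda \lVert Z \rVert_1 \\
W\text{-step:}\quad & W \leftarrow \arg\min_{W \ge 0}\; \lVert B - W Z \rVert_F^2 \\
Z\text{-step:}\quad & Z \leftarrow \arg\min_{Z \ge 0}\; \lVert B - W Z \rVert_F^2 + \lambda \lVert Z \rVert_1
\end{aligned}
```

Each step is a convex non-negative least-squares problem when the other factor is held fixed, which is why an NNLS solver can be applied in alternation even though the joint problem is non-convex.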
Following Smaragdis et al. (2007), we first train sparse NMF on the training samples. Using the weights from this stage, we proceed to train another NMF decomposition on the residuals of the mixture:
One iteration consists of NNLS optimization:
The other iteration consists of NNLS optimization of the latent codes of both sources on the mixture samples:
In the above we neglected the sparsity constraint for pedagogical reasons. It is implemented exactly as in Eq. 14.
At inference time, the optimization is equivalent to Eq. 17. After inference of the latent codes of both sources, our estimated sources are given by:
In our experiments we used dimension of and sparsity . The hyper-parameters were determined by performance on a small validation set.
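The two-stage procedure described in this appendix can be sketched as follows. This is a minimal illustration rather than our implementation, and it makes several assumptions. It uses `scipy.optimize.nnls` in place of the solver by Kim & Park (2011). It adds sparsity through a row-augmentation trick that penalizes the squared $\ell_1$ norm of each code vector. It normalizes weight columns to unit norm, and it reconstructs each source directly as weights times codes. The function names, the latent dimensions and the sparsity value are hypothetical.

```python
import numpy as np
from scipy.optimize import nnls


def nnls_columns(A, Y, sparsity=0.0):
    """Column-wise NNLS: min_{X>=0} ||A X - Y||^2 + sparsity * sum_j ||X[:, j]||_1^2."""
    if sparsity > 0.0:
        # Augmentation in the style of Kim & Park (2007): for x >= 0, an extra
        # row sqrt(sparsity) * 1^T adds sparsity * ||x||_1^2 to the squared loss.
        A = np.vstack([A, np.sqrt(sparsity) * np.ones((1, A.shape[1]))])
        Y = np.vstack([Y, np.zeros((1, Y.shape[1]))])
    return np.column_stack([nnls(A, Y[:, j])[0] for j in range(Y.shape[1])])


def unit_columns(W, eps=1e-12):
    """Rescale columns to unit norm so the penalty cannot be avoided by rescaling."""
    return W / np.maximum(np.linalg.norm(W, axis=0, keepdims=True), eps)


def fit_sparse_nmf(B, k, sparsity=0.1, iters=30, seed=0):
    """Stage 1: learn weights W_b (d x k) on clean samples B (d x n)."""
    rng = np.random.default_rng(seed)
    W = unit_columns(rng.random((B.shape[0], k)))
    for _ in range(iters):
        Z = nnls_columns(W, B, sparsity)            # code step, weights fixed
        W = unit_columns(nnls_columns(Z.T, B.T).T)  # weight step, codes fixed
    return W


def separate(Y, W_b, k_x, sparsity=0.1, iters=30, seed=0):
    """Stage 2: keep W_b fixed, learn W_x and both sets of codes on mixtures Y."""
    rng = np.random.default_rng(seed)
    k_b = W_b.shape[1]
    W_x = unit_columns(rng.random((Y.shape[0], k_x)))
    for _ in range(iters):
        Z = nnls_columns(np.hstack([W_b, W_x]), Y, sparsity)
        Z_b, Z_x = Z[:k_b], Z[k_b:]
        residual = Y - W_b @ Z_b                    # what W_b does not explain
        W_x = unit_columns(nnls_columns(Z_x.T, residual.T).T)
    Z = nnls_columns(np.hstack([W_b, W_x]), Y, sparsity)  # final code inference
    return W_b @ Z[:k_b], W_x @ Z[k_b:]


# Toy data: 8-dimensional non-negative "spectra" from two low-rank sources.
rng = np.random.default_rng(1)
W_b_true, W_x_true = rng.random((8, 3)), rng.random((8, 2))
B_train = W_b_true @ rng.random((3, 40))
b_true = W_b_true @ rng.random((3, 20))
x_true = W_x_true @ rng.random((2, 20))
Y_mix = b_true + x_true

W_b = fit_sparse_nmf(B_train, k=3)
b_hat, x_hat = separate(Y_mix, W_b, k_x=2)
```

The sketch fixes the number of iterations for brevity; in practice the alternation would be run until the objective stops improving.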
Appendix C Further qualitative analysis of GLOM and NES
We present a qualitative analysis of the results of GLOM and NES. To understand the quality of the GLO generations and the effect of the masking function, we present in Fig. 2 the results of the GLO generations given different mixtures from the Speech dataset. We also show the results after the masking operation described in Eq. 10. It can be observed that GLO captures the general features of the sources, but cannot capture fine detail exactly. The masking operation in GLOM helps it recover more fine-grained details, and results in much cleaner separations.
We also show in Fig. 2 the evolution of NES as a function of iteration for the same examples. NES(k) denotes the result of NES after k iterations. It can be seen that NES converges quite quickly, and results improve further with increasing iterations. In Fig. 3, we can observe the performance of NES on the Speech dataset in terms of SDR as a function of iteration. The results are in line with the qualitative examples presented before: NES converges quickly but makes further gains with increasing iterations.
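To make the SDR-versus-iteration curve concrete, the following minimal sketch computes a signal-to-distortion ratio for a sequence of estimates. It uses the plain energy-ratio definition, 10 log10(||s||² / ||s − ŝ||²), which is an assumption and may differ from the evaluation toolkit used for the reported numbers. The function name `sdr_db` and the toy estimates are hypothetical.

```python
import numpy as np


def sdr_db(reference, estimate, eps=1e-12):
    """Energy-ratio SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2)."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    signal_energy = np.sum(reference ** 2)
    error_energy = np.sum((reference - estimate) ** 2)
    # eps guards against division by zero for a perfect estimate.
    return 10.0 * np.log10((signal_energy + eps) / (error_energy + eps))


# Toy stand-in for estimates after successive iterations: each one is a
# scaled copy of the reference that gets closer to it.
reference = np.array([1.0, -1.0, 1.0, -1.0])
estimates_per_iteration = [reference * scale for scale in (0.5, 0.9, 0.99)]
sdr_curve = [sdr_db(reference, est) for est in estimates_per_iteration]
```

In this toy setting the curve rises as the estimate approaches the reference, mirroring the qualitative behaviour described above.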