1 Introduction
Single channel audio source separation is an important and challenging problem that has received considerable interest in the research community in recent years. Because the mixed signal carries limited information, one usually needs training data for each source, both to model that source and to improve the quality of separation. In this work, we introduce a new method for improved source separation that uses nonlinear source models trained with a deep neural network.
1.1 Related work
Many approaches have been introduced to solve the single channel source separation problem. Most of them depend strongly on training data for the source signals. The training data can be modeled using probabilistic models such as the Gaussian mixture model (GMM) [1, 2, 3], the hidden Markov model (HMM), or the factorial HMM [4, 5, 6]. These models are learned from the training data and are usually applied under the assumption that the sources appear in the mixed signal with the same energy level as in the training data. Removing this limitation requires complicated computations, as in [7, 8, 9, 10, 11, 12]. Another approach to modeling the training data is to train nonnegative dictionaries for the source signals [13, 14, 15]. This approach is more flexible and has no limitation related to energy differences between the source signals in the training and separation stages. Its main problem is that any nonnegative linear combination of the trained dictionary vectors is a valid estimate for a source signal, which may decrease the quality of separation. Modeling the training data with both nonnegative dictionaries and cluster models such as GMM and HMM was introduced in [16, 17, 18, 19] to remove the limitation related to energy scaling between the source signals while training more powerful models that fit the data properly. Another type of approach, called classification-based speech separation, aims to find hard masks in which each time-frequency bin is classified as belonging to one of the sources. For example, in [20], various classifiers based on GMM, support vector machines, conditional random fields, and deep neural networks were used for classification.
1.2 Contributions
In this paper, we model the training data for the source signals using a single joint deep neural network (DNN). The DNN is used as a spectral-domain classifier that classifies its input spectra into each possible source type. Unlike classification-based speech separation, where the classifiers assign time-frequency bins to sources and therefore yield hard masks, our approach yields soft masks. The single channel source separation problem is formulated as an energy minimization problem in which each source spectral estimate is encouraged to fit the trained DNN model and the mixed signal spectrum is encouraged to be a weighted sum of the estimated source spectra. In effect, the DNN checks whether the estimated source signals lie on their corresponding nonlinear manifolds, which are represented by the trained joint DNN. Using a DNN to model the sources and to handle the energy differences between training and testing is the main novelty of this paper. The DNN is a well-known model for representing detailed structures in complex real-world data [21, 22]. Another novelty of this paper is the use of nonnegative matrix factorization [23] to find initial estimates for the sources rather than using random initialization.
1.3 Organization of the paper
This paper is organized as follows. Section 2 gives a mathematical formulation of the single channel source separation (SCSS) problem. Section 3 briefly describes the NMF method for source separation. In Section 4, we introduce our new method. We present our experimental results in Section 5 and conclude the paper in Section 6.
2 Problem formulation
In single channel source separation problems, the aim is to find estimates of the source signals that are mixed on a single channel $y(t)$. For simplicity, in this paper we assume the number of sources is two. This problem is usually solved in the short-time Fourier transform (STFT) domain. Let $Y(t,f)$ be the STFT of $y(t)$, where $t$ represents the frame index and $f$ is the frequency index. Due to the linearity of the STFT, we have

$$Y(t,f) = S_1(t,f) + S_2(t,f), \tag{1}$$

where $S_1(t,f)$ and $S_2(t,f)$ are the unknown STFTs of the first and second sources in the mixed signal. In this framework [13, 24], the phase angles of the STFT are usually ignored. Hence, we can approximate the magnitude spectrum of the measured signal as the sum of the source signals' magnitude spectra:

$$|Y(t,f)| \approx |S_1(t,f)| + |S_2(t,f)|. \tag{2}$$

We can write the magnitude spectrograms in matrix form as

$$\mathbf{Y} \approx \mathbf{S}_1 + \mathbf{S}_2, \tag{3}$$

where $\mathbf{S}_1$ and $\mathbf{S}_2$ are the unknown magnitude spectrograms of the source signals, which need to be estimated using the observed mixed signal and the training data.
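The difference between exact additivity of the complex spectra and approximate additivity of the magnitudes can be checked numerically. The sketch below is only an illustration: the random frames are stand-ins for real audio, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
s1 = rng.standard_normal(512)   # stand-in frame of source one (not real audio)
s2 = rng.standard_normal(512)   # stand-in frame of source two
y = s1 + s2                     # single-channel mixture frame

# One-sided FFT of a 512-point frame gives 257 frequency bins.
S1, S2, Y = np.fft.rfft(s1), np.fft.rfft(s2), np.fft.rfft(y)

# Linearity: the complex spectra add exactly.
linear_ok = np.allclose(Y, S1 + S2)

# Magnitudes add only approximately. By the triangle inequality,
# |S1| + |S2| is an upper bound on |Y| in every bin, so gap >= 0.
gap = np.abs(S1) + np.abs(S2) - np.abs(Y)
print(linear_ok, float(gap.min()), float(gap.mean()))
```

The gap is zero only where the two sources have the same phase in a bin, which is why the magnitude-domain model is an approximation rather than an identity.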
3 NMF for supervised source separation
In this section, we briefly describe the use of nonnegative matrix factorization (NMF) for supervised single channel source separation. We relate our model to the NMF idea and use the source estimates obtained from NMF as the initialization for our method, so it is appropriate to introduce the use of NMF for source separation first.
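To find a suitable initialization for the source signals, we use nonnegative matrix factorization (NMF) as in [23]. NMF [25] factorizes any nonnegative matrix $\mathbf{V}$ into a basis matrix (dictionary) $\mathbf{B}$ and a gain matrix $\mathbf{G}$ as

$$\mathbf{V} \approx \mathbf{B}\mathbf{G}. \tag{4}$$

The matrix $\mathbf{B}$ contains the basis vectors that are optimized to allow the data in $\mathbf{V}$ to be approximated as a linear combination of its constituent columns. The solution for $\mathbf{B}$ and $\mathbf{G}$ can be found by minimizing the following Itakura-Saito (IS) divergence cost function [26]:

$$\min_{\mathbf{B},\mathbf{G}} D_{IS}(\mathbf{V} \,\|\, \mathbf{B}\mathbf{G}), \tag{5}$$

where

$$D_{IS}(\mathbf{V} \,\|\, \mathbf{B}\mathbf{G}) = \sum_{m,n} \left( \frac{\mathbf{V}_{m,n}}{(\mathbf{B}\mathbf{G})_{m,n}} - \log \frac{\mathbf{V}_{m,n}}{(\mathbf{B}\mathbf{G})_{m,n}} - 1 \right).$$

This divergence is a good measure of the perceptual difference between audio signals. The IS-NMF solution for equation (5) can be computed by alternating multiplicative updates of $\mathbf{G}$ and $\mathbf{B}$ as follows:

$$\mathbf{G} \leftarrow \mathbf{G} \otimes \frac{\mathbf{B}^T \frac{\mathbf{V}}{(\mathbf{B}\mathbf{G})^2}}{\mathbf{B}^T \frac{\mathbf{1}}{\mathbf{B}\mathbf{G}}}, \tag{6}$$

$$\mathbf{B} \leftarrow \mathbf{B} \otimes \frac{\frac{\mathbf{V}}{(\mathbf{B}\mathbf{G})^2} \mathbf{G}^T}{\frac{\mathbf{1}}{\mathbf{B}\mathbf{G}} \mathbf{G}^T}, \tag{7}$$

where $\mathbf{1}$ is a matrix of ones with the same size as $\mathbf{V}$, the operation $\otimes$ is element-wise multiplication, and all divisions and $(\cdot)^2$ are element-wise operations. The matrices $\mathbf{B}$ and $\mathbf{G}$ are usually initialized with positive random numbers and then updated iteratively using equations (6) and (7).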
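As an illustration of the alternating multiplicative updates, the following NumPy sketch factorizes a small positive matrix under the IS divergence. It is a minimal sketch, not the authors' implementation: the function names, the small constant `eps` guarding against division by zero, the iteration count, and the random test matrix are all assumptions made for this example.

```python
import numpy as np

def is_divergence(V, V_hat):
    """Itakura-Saito divergence between positive matrices V and V_hat."""
    R = V / V_hat
    return float(np.sum(R - np.log(R) - 1.0))

def is_nmf(V, n_basis, n_iter=100, seed=0, eps=1e-12):
    """Factorize V ~= B @ G with alternating multiplicative IS updates."""
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    B = rng.random((n_freq, n_basis)) + eps    # positive random dictionary
    G = rng.random((n_basis, n_frames)) + eps  # positive random gains
    for _ in range(n_iter):
        BG = B @ G + eps
        G *= (B.T @ (V / BG**2)) / (B.T @ (1.0 / BG))      # gain update
        BG = B @ G + eps
        B *= ((V / BG**2) @ G.T) / ((1.0 / BG) @ G.T)      # dictionary update
    return B, G

# Toy data: a strictly positive random matrix standing in for a spectrogram.
V = np.random.default_rng(1).random((20, 30)) + 0.1
B0, G0 = is_nmf(V, n_basis=5, n_iter=0)    # random initialization only
B, G = is_nmf(V, n_basis=5, n_iter=50)     # same initialization, 50 updates
print(is_divergence(V, B0 @ G0), is_divergence(V, B @ G))
```

Because the updates are multiplicative and every factor is positive, nonnegativity of the dictionary and gains is preserved without any explicit projection.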
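In the initialization stage, NMF is used to decompose the training frames of each source $i$ into the product of a nonnegative dictionary $\mathbf{B}_i$ and a gain matrix $\mathbf{G}_i^{\text{train}}$:

$$\mathbf{S}_i^{\text{train}} \approx \mathbf{B}_i \mathbf{G}_i^{\text{train}}, \quad i \in \{1, 2\},$$

where $\mathbf{S}_i^{\text{train}}$ is the nonnegative matrix that contains the spectrogram frames of the training data of source $i$. After observing the mixed signal, we calculate its spectrogram $\mathbf{Y}$. NMF is used to decompose the mixed signal's spectrogram matrix with the trained dictionaries as follows:

$$\mathbf{Y} \approx [\mathbf{B}_1 \;\; \mathbf{B}_2]\, \mathbf{G} = [\mathbf{B}_1 \;\; \mathbf{B}_2] \begin{bmatrix} \mathbf{G}_1 \\ \mathbf{G}_2 \end{bmatrix}.$$

The only unknown here is the gain matrix $\mathbf{G}$, since the dictionaries are fixed. The update rule in equation (6) is used to find $\mathbf{G}$. After finding $\mathbf{G}$, the initial estimate of each source magnitude spectrogram is computed as

$$\hat{\mathbf{S}}_1 = \frac{\mathbf{B}_1 \mathbf{G}_1}{\mathbf{B}_1 \mathbf{G}_1 + \mathbf{B}_2 \mathbf{G}_2} \otimes \mathbf{Y}, \qquad \hat{\mathbf{S}}_2 = \frac{\mathbf{B}_2 \mathbf{G}_2}{\mathbf{B}_1 \mathbf{G}_1 + \mathbf{B}_2 \mathbf{G}_2} \otimes \mathbf{Y},$$

where $\otimes$ is element-wise multiplication and the divisions are element-wise.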
The magnitude spectrograms of the initial estimates of the source signals are used to initialize the sources in the separation stage of the DNN approach.
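The initialization stage described above can be sketched as follows. This is a hedged illustration only: the function name `nmf_initial_estimates`, the random stand-in dictionaries, the iteration count, and the `eps` guard are assumptions, and real dictionaries would come from IS-NMF on each source's training spectrogram.

```python
import numpy as np

def nmf_initial_estimates(Y, B1, B2, n_iter=100, seed=0, eps=1e-12):
    """Estimate source magnitude spectrograms from Y with fixed dictionaries."""
    B = np.hstack([B1, B2])                        # concatenated fixed dictionary
    rng = np.random.default_rng(seed)
    G = rng.random((B.shape[1], Y.shape[1])) + eps
    for _ in range(n_iter):
        BG = B @ G + eps
        G *= (B.T @ (Y / BG**2)) / (B.T @ (1.0 / BG))   # only the gains are updated
    G1, G2 = G[:B1.shape[1]], G[B1.shape[1]:]
    P1, P2 = B1 @ G1, B2 @ G2                      # per-source reconstructions
    total = P1 + P2 + eps
    # Element-wise ratio masks applied to the mixture spectrogram.
    return P1 / total * Y, P2 / total * Y

# Toy example with random positive stand-ins for dictionaries and mixture.
rng = np.random.default_rng(2)
B1 = rng.random((32, 4)) + 0.1
B2 = rng.random((32, 4)) + 0.1
Y = rng.random((32, 10)) + 0.1
S1_init, S2_init = nmf_initial_estimates(Y, B1, B2)
```

Since the two masks sum to one in every time-frequency bin, the two initial estimates add up to the mixture spectrogram, whatever the quality of the dictionaries.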
4 The method
In NMF, the basic idea is to model each source with a dictionary, so that source signals appear in the nonnegative span of this dictionary. In the separation stage, the mixed signal is expressed as a nonnegative linear combination of the source dictionaries and separation is performed by taking the parts corresponding to each source in the decomposition.
The basic problem in NMF is that each source is modeled to lie in a nonnegative cone defined by all the nonnegative linear combinations of its dictionary entries. This assumption can be limiting, since the variability within each source suggests that nonlinear models may be more appropriate. This limitation led us to consider nonlinear models for each source. Using nonlinear models or classifiers in source separation is not trivial. Deep neural networks have recently been used with increasing success in speech recognition and object recognition tasks, so they are strong candidates for modeling highly variable real-world signals.
We first train a DNN to model each source in the training stage. We then use an energy minimization objective to estimate the sources and their gains during the separation stage. Each stage is explained below.
4.1 Training the DNN
We train a DNN that can classify the sources present in the mixed signal. The input to the network is a frame of normalized magnitude spectrum, $\mathbf{x} \in \mathbb{R}^{d}$. The neural network architecture is illustrated in Figure 1. The DNN has two outputs, each corresponding to a source. The label of each training instance is a binary indicator: if the instance is from source one, the first output label is 1 and the second output label is 0. Let $n_k$ be the number of nodes in layer $k$ for $k = 0, \ldots, K$, where $K$ is the number of layers. Note that $n_0 = d$ and $n_K = 2$. Let $\mathbf{W}_k \in \mathbb{R}^{n_k \times n_{k-1}}$ be the weights between layers $k-1$ and $k$; then the values of layer $k$, $\mathbf{h}_k \in \mathbb{R}^{n_k}$, are obtained as

$$\mathbf{h}_k = \sigma(\mathbf{W}_k \mathbf{h}_{k-1}),$$

where $\sigma(a) = 1/(1 + e^{-a})$ is the elementwise sigmoid function. We skip the bias terms to avoid clutter in our notation. The input to the network is $\mathbf{h}_0 = \mathbf{x}$ and the output is $f(\mathbf{x}) = \mathbf{h}_K$.

Training a deep network necessitates a good initialization of the parameters. It has been shown that layer-by-layer pretraining with unsupervised methods gives superior performance compared to random initial values. We used restricted Boltzmann machines (RBMs) for initialization. After initialization, the supervised backpropagation algorithm is applied to fine-tune the parameters. The learning criterion we use is least-squares minimization. We can also compute the partial derivatives of the outputs with respect to the inputs, and these derivatives are used in the source separation stage. Let $f : \mathbb{R}^{d} \rightarrow \mathbb{R}^{2}$ be the DNN; then $f_1(\mathbf{x})$ and $f_2(\mathbf{x})$ are the scores that are proportional to the probabilities of source one and source two, respectively, for a given frame of normalized magnitude spectrum $\mathbf{x}$. We use these functions to measure how much the separated spectra carry the characteristics of each source, as elaborated in the next section.
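The bias-free sigmoid network described above can be sketched in a few lines. This is an illustrative forward pass only: the weights are random and untrained (no RBM pretraining or backpropagation is performed), the layer sizes 257-100-50-200-2 are taken from the experimental setup reported later in the paper, and all names are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dnn_forward(x, weights):
    """Return all layer values [h_0, ..., h_K] for h_k = sigmoid(W_k h_{k-1})."""
    activations = [x]
    for W in weights:
        activations.append(sigmoid(W @ activations[-1]))
    return activations

layer_sizes = [257, 100, 50, 200, 2]     # input, three hidden layers, two outputs
rng = np.random.default_rng(0)
weights = [0.1 * rng.standard_normal((n_out, n_in))     # untrained stand-in weights
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

x = np.abs(rng.standard_normal(257))     # stand-in magnitude spectrum frame
x /= np.linalg.norm(x)                   # normalization (Euclidean norm assumed here)
activations = dnn_forward(x, weights)
f1, f2 = activations[-1]                 # scores for source one and source two
```

Because the output nodes are also sigmoids, both scores lie strictly between 0 and 1, which is what makes the least-squares comparison with the binary labels meaningful.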
4.2 Source separation using DNN and energy minimization
In the separation stage, our algorithm works independently on each frame of the mixed audio signal. For each frame of the mixed signal spectrum, we calculate the normalized magnitude spectrum $\mathbf{y}$. We would like to express $\mathbf{y} = u\,\mathbf{x}_1 + v\,\mathbf{x}_2$, where $u$ and $v$ are the gains and $\mathbf{x}_1$ and $\mathbf{x}_2$ are the normalized magnitude spectra of sources one and two, respectively.

We formulate the problem of finding the unknown parameters $\theta = (\mathbf{x}_1, \mathbf{x}_2, u, v)$ as an energy minimization problem. The source estimates need to satisfy three criteria. First, they must fit well to the DNN trained in the training stage. Second, their linear combination must sum to the mixed spectrum $\mathbf{y}$. Third, the source estimates must be nonnegative, since they correspond to the magnitude spectra of each source.

The energy functions $E_1$ and $E_2$ below are least-squares cost functions that quantify the fitness of a source estimate $\mathbf{x}$ to each corresponding source model in the DNN:

$$E_1(\mathbf{x}) = (1 - f_1(\mathbf{x}))^2 + (f_2(\mathbf{x}))^2,$$

$$E_2(\mathbf{x}) = (f_1(\mathbf{x}))^2 + (1 - f_2(\mathbf{x}))^2.$$

Basically, we expect to have $E_1(\mathbf{x}) \approx 0$ when $\mathbf{x}$ comes from source one, and vice versa. We also define the following energy function, which quantifies the energy of the least-squares error between the mixed spectrum $\mathbf{y}$ and its estimate found by the linear combination of the two source estimates $\mathbf{x}_1$ and $\mathbf{x}_2$:

$$E_{err}(\mathbf{x}_1, \mathbf{x}_2, \mathbf{y}, u, v) = \| u\,\mathbf{x}_1 + v\,\mathbf{x}_2 - \mathbf{y} \|^2.$$

Finally, we define an energy function that measures the negative energy of a variable:

$$R(\theta) = (\min(\theta, 0))^2.$$

In order to estimate the unknowns in the model, we solve the following energy minimization problem:

$$(\hat{\mathbf{x}}_1, \hat{\mathbf{x}}_2, \hat{u}, \hat{v}) = \arg\min_{\mathbf{x}_1, \mathbf{x}_2, u, v} E(\mathbf{x}_1, \mathbf{x}_2, \mathbf{y}, u, v),$$

where

$$E(\mathbf{x}_1, \mathbf{x}_2, \mathbf{y}, u, v) = E_1(\mathbf{x}_1) + E_2(\mathbf{x}_2) + \lambda E_{err}(\mathbf{x}_1, \mathbf{x}_2, \mathbf{y}, u, v) + \beta \sum_i R(\theta_i)$$

is the joint energy function which we seek to minimize. $\lambda$ and $\beta$ are regularization parameters which are chosen experimentally. Here $\theta = [\mathbf{x}_1, \mathbf{x}_2, u, v]$ is a vector containing all the unknowns, which must all be nonnegative. Note that nonnegativity can also be imposed as an optimization constraint; however, we obtained a faster solution of the optimization problem when we used the negative energy penalty instead. If some of the parameters are found to be negative after solving the optimization problem (which rarely happens), we set them to zero. We used the LBFGS algorithm to solve the unconstrained optimization problem.
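To make the structure of the objective concrete, the sketch below evaluates a joint energy built from the three criteria: DNN fitness, reconstruction error, and a penalty on negative entries. It is an assumption-laden illustration, not the authors' code: `toy_dnn` is a hypothetical single-layer stand-in for the trained network, the way the terms are weighted by `lam` and `beta` is one plausible reading of the text, and the values `lam=1.0` and `beta=1.0` are arbitrary placeholders rather than the values used in the experiments.

```python
import numpy as np

def fit_energies(f, x):
    """Squared distances of the two DNN scores from the labels (1,0) and (0,1)."""
    f1, f2 = f(x)
    return (1.0 - f1) ** 2 + f2 ** 2, f1 ** 2 + (1.0 - f2) ** 2

def joint_energy(theta, y, f, lam, beta):
    """Joint energy for theta = [x1, x2, u, v] and one mixture frame y."""
    d = y.size
    x1, x2, u, v = theta[:d], theta[d:2 * d], theta[2 * d], theta[2 * d + 1]
    e1, _ = fit_energies(f, x1)                     # x1 should look like source one
    _, e2 = fit_energies(f, x2)                     # x2 should look like source two
    e_err = np.sum((u * x1 + v * x2 - y) ** 2)      # mixture reconstruction error
    penalty = np.sum(np.minimum(theta, 0.0) ** 2)   # energy of negative entries
    return e1 + e2 + lam * e_err + beta * penalty

rng = np.random.default_rng(0)
W_toy = rng.standard_normal((2, 8))                 # hypothetical stand-in classifier

def toy_dnn(x):
    return 1.0 / (1.0 + np.exp(-(W_toy @ x)))

x1 = np.abs(rng.standard_normal(8))
x2 = np.abs(rng.standard_normal(8))
u, v = 0.7, 0.4
y = u * x1 + v * x2                                 # mixture that theta explains exactly
theta = np.concatenate([x1, x2, [u, v]])
energy_exact = joint_energy(theta, y, toy_dnn, lam=1.0, beta=1.0)

theta_bad = theta.copy()
theta_bad[-1] = -0.4                                # flip the sign of the second gain
energy_negative_gain = joint_energy(theta_bad, y, toy_dnn, lam=1.0, beta=1.0)
```

A function of this form, together with its gradient, could be passed to any quasi-Newton solver; flipping a gain to a negative value raises both the reconstruction error and the negativity penalty, which is how the penalty steers the unconstrained search back toward nonnegative solutions.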
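We need to calculate the gradient of the DNN outputs with respect to the input to be able to solve the optimization problem. The gradient of the output $f_i(\mathbf{x})$ with respect to the input $\mathbf{x}$ is given as $\partial f_i(\mathbf{x}) / \partial \mathbf{x} = \mathbf{q}_{1,i}$ for $i = 1, 2$, where

$$\mathbf{q}_{k,i} = \mathbf{W}_k^T \left( \mathbf{q}_{k+1,i} \otimes \mathbf{h}_k \otimes (1 - \mathbf{h}_k) \right), \quad k = 1, \ldots, K-1,$$

and $\mathbf{q}_{K,i} = f_i(\mathbf{x}) (1 - f_i(\mathbf{x}))\, \mathbf{w}_{K,i}$, where $\mathbf{w}_{K,i}^T$ contains the weights between node $i$ of the output layer and the nodes of the previous layer, in other words the $i$th row of $\mathbf{W}_K$.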
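The backward recursion for the input gradient can be verified against finite differences. The sketch below assumes a bias-free sigmoid network as described in Section 4.1; the small layer sizes, the random weights, and the function names are hypothetical choices made so the check runs quickly.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dnn_output(x, weights):
    """Forward pass of a bias-free sigmoid network; returns the output vector."""
    h = x
    for W in weights:
        h = sigmoid(W @ h)
    return h

def output_gradient(x, weights, i):
    """Gradient of output f_i with respect to the input x (chain rule)."""
    h = [x]
    for W in weights:
        h.append(sigmoid(W @ h[-1]))
    f_i = h[-1][i]
    # Gradient with respect to the last hidden layer: sigmoid derivative at
    # output node i times the i-th row of the output weight matrix.
    q = f_i * (1.0 - f_i) * weights[-1][i]
    # Propagate back through the hidden layers. weights[k] maps h[k] to h[k+1].
    for k in range(len(weights) - 2, -1, -1):
        q = weights[k].T @ (q * h[k + 1] * (1.0 - h[k + 1]))
    return q

rng = np.random.default_rng(0)
sizes = [6, 5, 4, 2]
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
x = np.abs(rng.standard_normal(6))

analytic = output_gradient(x, weights, i=0)

# Central finite differences as an independent check.
step = 1e-6
numeric = np.array([
    (dnn_output(x + step * e, weights)[0] - dnn_output(x - step * e, weights)[0])
    / (2 * step)
    for e in np.eye(6)
])
```

Agreement between `analytic` and `numeric` confirms the recursion; the same vector is what a gradient-based solver needs in order to differentiate the fitness energies with respect to the source estimates.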
The flowchart of the energy minimization setup is shown in Figure 2. For illustration, we show the single DNN in two separate blocks in the flowchart. The fitness energies are measured using the DNN and the error energy is found from the summation requirement.
Note that, since there are many parameters to be estimated and the problem is clearly non-convex, the initialization of the parameters is very important. We initialize the estimates $\mathbf{x}_1$ and $\mathbf{x}_2$ from the NMF result after normalizing each by its norm. The gain $u$ is initialized as the norm of the initial NMF estimate of source one divided by the norm of the mixed signal frame. The gain $v$ is initialized in a similar manner.
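The initialization just described can be sketched as below. The choice of the Euclidean norm is an assumption made for this example (the text above does not fix the norm here), and the function name and the `eps` guard are hypothetical.

```python
import numpy as np

def initialize_unknowns(s1_init, s2_init, y_frame, eps=1e-12):
    """Initial (x1, x2, u, v) from NMF source estimates for one frame.

    Assumes the Euclidean norm; the norm type is an assumption of this sketch.
    """
    norm_y = np.linalg.norm(y_frame) + eps
    norm_1 = np.linalg.norm(s1_init) + eps
    norm_2 = np.linalg.norm(s2_init) + eps
    x1 = s1_init / norm_1          # normalized spectrum of source one
    x2 = s2_init / norm_2          # normalized spectrum of source two
    u = norm_1 / norm_y            # gain of source one relative to the mixture
    v = norm_2 / norm_y            # gain of source two relative to the mixture
    return x1, x2, u, v

rng = np.random.default_rng(0)
s1_init = np.abs(rng.standard_normal(257))   # stand-in NMF estimate, source one
s2_init = np.abs(rng.standard_normal(257))   # stand-in NMF estimate, source two
y_frame = s1_init + s2_init                  # mixture frame consistent with them
x1, x2, u, v = initialize_unknowns(s1_init, s2_init, y_frame)
```

With this choice, the initial point already reproduces the normalized mixture frame whenever the NMF estimates sum to the mixture, so the optimizer starts from zero reconstruction error and mainly has to improve the DNN fitness terms.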
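After we obtain $(\hat{\mathbf{x}}_1, \hat{\mathbf{x}}_2, \hat{u}, \hat{v})$ as the result of the energy minimization problem, we use them as spectral estimates in a Wiener filter to reconstruct improved estimates of each source spectrum. For example, for source one we obtain the final estimate as follows:

$$\hat{\mathbf{s}}_1 = \frac{(\hat{u}\,\hat{\mathbf{x}}_1)^2}{(\hat{u}\,\hat{\mathbf{x}}_1)^2 + (\hat{v}\,\hat{\mathbf{x}}_2)^2} \otimes \mathbf{y},$$

where the squares and the division are element-wise.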
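A Wiener-style reconstruction of this kind can be sketched as follows. The use of squared (power-domain) spectra in the gain is an assumption of this example, as are the function name, the `eps` guard, and the random stand-in inputs.

```python
import numpy as np

def wiener_reconstruct(x1, x2, u, v, y, eps=1e-12):
    """Split the mixture frame y using power-domain soft masks (assumed form)."""
    p1 = (u * x1) ** 2            # power-like estimate of source one
    p2 = (v * x2) ** 2            # power-like estimate of source two
    total = p1 + p2 + eps
    return p1 / total * y, p2 / total * y

rng = np.random.default_rng(0)
x1 = np.abs(rng.standard_normal(257)) + 0.01   # stand-in optimized spectra
x2 = np.abs(rng.standard_normal(257)) + 0.01
y = np.abs(rng.standard_normal(257))           # stand-in mixture magnitude frame
s1_hat, s2_hat = wiener_reconstruct(x1, x2, u=0.7, v=0.4, y=y)
```

The mask takes values between 0 and 1 in each frequency bin rather than a hard 0/1 decision, and the two reconstructed spectra always sum to the mixture frame.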
5 Experiments and Discussion
We applied the proposed algorithm to separate speech and music signals from their mixture. We ran our algorithm on a collection of speech and piano data at a 16 kHz sampling rate. For speech data, we used the training and testing male speech data from the TIMIT database. For music data, we downloaded piano music from the Piano Society web site [27]. We used 39 pieces, with a total duration of approximately 185 minutes, from different composers but from a single artist for training, and left out one piece for testing. The magnitude spectrograms of the speech and music data were calculated using the STFT: a Hamming window of 480 points with overlapping frames was used, and the FFT was taken at 512 points. Only the first 257 FFT points were used, since the remaining 255 points are complex conjugates of points among the first 257.
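The spectrogram computation can be sketched with NumPy as follows. The 480-point Hamming window and 512-point FFT come from the description above; the hop size of 192 samples is purely an assumption for this sketch, and the function name is hypothetical.

```python
import numpy as np

def magnitude_spectrogram(signal, win_len=480, n_fft=512, hop=192):
    """Magnitude STFT with a Hamming window; hop=192 is an assumed value."""
    window = np.hamming(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop
    frames = np.stack(
        [signal[i * hop: i * hop + win_len] * window for i in range(n_frames)],
        axis=1,
    )
    # rfft zero-pads each 480-sample frame to 512 points and keeps the
    # 257 non-redundant bins of the spectrum of a real signal.
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=0))

fs = 16000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 440.0 * t)       # one second of a 440 Hz test tone
spec = magnitude_spectrogram(tone)
peak_bin = int(np.argmax(spec.mean(axis=1)))
```

With a 512-point FFT at 16 kHz each bin spans 31.25 Hz, so the 440 Hz tone peaks at bin 14.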
The mixed data was formed by adding random portions of the test music file to 20 speech files from the test data of the TIMIT database at different speech-to-music ratios. The audio power level of each file was found using the “speech voltmeter” program from the G.191 ITU-T STL software suite [28].
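Mixing at a prescribed speech-to-music ratio can be sketched as below. Note the substitution: this example measures level as plain mean power, which is a simpler stand-in for the ITU-T speech voltmeter mentioned above and will not give identical levels on real speech. The function name and the random test signals are hypothetical.

```python
import numpy as np

def mix_at_smr(speech, music, smr_db):
    """Scale music so that speech power / music power equals smr_db (in dB)."""
    p_speech = np.mean(speech ** 2)     # mean power, not active speech level
    p_music = np.mean(music ** 2)
    gain = np.sqrt(p_speech / (p_music * 10.0 ** (smr_db / 10.0)))
    scaled_music = gain * music
    return speech + scaled_music, scaled_music

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)           # stand-in for a speech file
music = 3.0 * rng.standard_normal(16000)      # stand-in for a music excerpt
mixture, scaled_music = mix_at_smr(speech, music, smr_db=-5.0)
measured_smr = 10.0 * np.log10(np.mean(speech ** 2) / np.mean(scaled_music ** 2))
```

Scaling only the music keeps the speech reference unchanged across conditions, which simplifies comparing separation results at different ratios.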
For the initialization of the source signals using nonnegative matrix factorization, we used a dictionary size of 128 for each source. For training the NMF dictionaries, we used 50 minutes of data for music and 30 minutes of the training data for speech. For training the DNN, we used a subset of the music and speech training data totaling 50 minutes, for computational reasons.
For the DNN, we used three hidden layers with 100, 50, and 200 nodes, respectively. A sigmoid nonlinearity was used at each node, including the output nodes. The DNN was initialized with RBM training using contrastive divergence. We used 150 epochs to train each layer's RBM and 500 epochs for backpropagation training. The first five epochs were used to optimize the output layer while keeping the lower-layer weights untouched.
In the energy minimization problem, the regularization parameters were fixed to experimentally chosen values. We used Mark Schmidt's minFunc MATLAB LBFGS solver to solve the optimization problem [29].
The performance of the separation algorithm was measured using the signal to distortion ratio (SDR) and the signal to interference ratio (SIR) [30]. The average SDR and SIR over the 20 test utterances are reported. SDR is defined as the ratio of the target energy to all errors in the reconstructed signal, where the target signal is the projection of the predicted signal onto the original speech signal. SIR is defined as the ratio of the target energy to the interference error due to the music signal only. Higher SDR and SIR values indicate better performance. We also use the output signal to noise ratio (SNR) as an additional performance criterion.
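The definitions above can be illustrated with a simplified computation. This sketch is not the full evaluation toolkit referenced in the paper: it uses plain orthogonal projections onto the two source signals, without the short distortion filters that standard implementations allow, and the function name and synthetic signals are assumptions.

```python
import numpy as np

def sdr_sir(estimate, target, interferer):
    """Simplified SDR and SIR (dB) using projections onto the true sources."""
    # Target component: projection of the estimate onto the true target.
    s_target = (estimate @ target) / (target @ target) * target
    # Projection onto the span of both sources (least squares).
    A = np.stack([target, interferer], axis=1)
    coeffs, *_ = np.linalg.lstsq(A, estimate, rcond=None)
    projection = A @ coeffs
    e_interf = projection - s_target      # error explained by the interferer
    e_artif = estimate - projection       # remaining error (artifacts)
    target_energy = np.sum(s_target ** 2)
    sdr = 10.0 * np.log10(target_energy / np.sum((e_interf + e_artif) ** 2))
    sir = 10.0 * np.log10(target_energy / np.sum(e_interf ** 2))
    return sdr, sir

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)       # stand-in for the clean speech
music = rng.standard_normal(16000)        # stand-in for the music
# Estimate with some leaked music and a small unrelated distortion.
estimate = speech + 0.1 * music + 0.05 * rng.standard_normal(16000)
sdr, sir = sdr_sir(estimate, speech, music)
```

Because the artifact error is orthogonal to the span of the two sources, SDR can never exceed SIR in this formulation; SDR accounts for both leaked music and other distortions, while SIR accounts for the leaked music alone.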
The results are presented in Tables 1 and 2. We also experimented with a multi-frame DNN, where the inputs to the DNN were taken from neighboring spectral frames for both training and testing instead of from a single spectral frame. Using the DNN and the energy minimization idea improves the source separation performance for all input speech-to-music ratio (SMR) values from -5 to 5 dB. In all cases, the DNN is better than regular NMF, and the improvement in SDR and SNR is usually around 1-1.5 dB. The improvement in SIR can be as high as 3 dB, which indicates that the introduced method reduces the music remaining in the reconstructed speech signal. Using neighboring frames improved the results compared to using a single-frame input to the DNN. In the multi-frame setting, we used 500 nodes in the third layer of the DNN instead of 200. We conjecture that better results can be obtained if a higher number of neighboring frames is used.
6 Conclusion
In this work, we introduced a novel approach for single channel source separation (SCSS) using deep neural networks (DNN). The DNN was used to model each source signal, and was trained on the training data for the source signals. The trained DNN was then used in an energy minimization framework to separate the mixed signals while also estimating the scale of each source in the mixed signal. Many adjustments to the model parameters can be made to improve the proposed SCSS approach. Different types of DNN, such as deep autoencoders and deep recurrent neural networks, which can handle the temporal structure of the source signals, can be tested on the SCSS problem. We believe that many improvements to the performance of this approach will be possible in the near future.
References
- [1] T. Kristjansson, J. Hershey, and H. Attias, “Single microphone source separation using high resolution signal reconstruction,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2004.
- [2] A. M. Reddy and B. Raj, “Soft mask methods for single-channel speaker separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, Aug. 2007.
- [3] A. M. Reddy and B. Raj, “A minimum mean squared error estimator for single channel speaker separation,” in International Conference on Spoken Language Processing (InterSpeech), 2004.
- [4] T. Virtanen, “Speech recognition using factorial hidden Markov models for separation in the feature space,” in International Conference on Spoken Language Processing (InterSpeech), 2006.
- [5] S. T. Roweis, “One microphone source separation,” Neural Information Processing Systems, 13, pp. 793–799, 2000.
- [6] A. N. Deoras and A. H. Johnson, “A factorial HMM approach to simultaneous recognition of isolated digits spoken by multiple talkers on one audio channel,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2004.
- [7] M. H. Radfar, W. Wong, W. Y. Chan, and R. M. Dansereau, “Gain estimation in model-based single channel speech separation,” in Proc. IEEE International Workshop on Machine Learning for Signal Processing (MLSP), Grenoble, France, Sept. 2009.
- [8] M. H. Radfar, R. M. Dansereau, and W. Y. Chan, “Monaural speech separation based on gain adapted minimum mean square error estimation,” Journal of Signal Processing Systems, Springer, vol. 61, no. 1, pp. 21–37, 2008.
- [9] M. H. Radfar and R. M. Dansereau, “Long-term gain estimation in model-based single channel speech separation,” in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, 2007.
- [10] A. Ozerov, C. Févotte, and M. Charbit, “Factorial scaled hidden Markov model for polyphonic audio representation and source separation,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Mohonk, NY, 2009.
- [11] M. H. Radfar, W. Wong, R. M. Dansereau, and W. Y. Chan, “Scaled factorial hidden Markov models: a new technique for compensating gain differences in model-based single channel speech separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2010.
- [12] J. R. Hershey, S. J. Rennie, P. A. Olsen, and T. T. Kristjansson, “Super-human multi-talker speech recognition: A graphical modeling approach,” Computer Speech and Language, vol. 24, no. 1, pp. 45–66, Jan. 2010.
- [13] M. N. Schmidt and R. K. Olsson, “Single-channel speech separation using sparse non-negative matrix factorization,” in International Conference on Spoken Language Processing (InterSpeech), 2006.
- [14] E. M. Grais and H. Erdogan, “Spectro-temporal post-smoothing in NMF based single-channel source separation,” in European Signal Processing Conference (EUSIPCO), 2012.
- [15] E. M. Grais and H. Erdogan, “Single channel speech music separation using nonnegative matrix factorization with sliding window and spectral masks,” in Annual Conference of the International Speech Communication Association (InterSpeech), 2011.
- [16] E. M. Grais and H. Erdogan, “Regularized nonnegative matrix factorization using Gaussian mixture priors for supervised single channel source separation,” Computer Speech and Language, vol. 27, no. 3, pp. 746–762, May 2013.
- [17] E. M. Grais and H. Erdogan, “Gaussian mixture gain priors for regularized nonnegative matrix factorization in single-channel source separation,” in Annual Conference of the International Speech Communication Association (InterSpeech), 2012.
- [18] E. M. Grais and H. Erdogan, “Hidden Markov models as priors for regularized nonnegative matrix factorization in single-channel source separation,” in Annual Conference of the International Speech Communication Association (InterSpeech), 2012.
- [19] E. M. Grais and H. Erdogan, “Spectro-temporal post-enhancement using MMSE estimation in NMF based single-channel source separation,” in Annual Conference of the International Speech Communication Association (InterSpeech), 2013.
- [20] Y. Wang and D. L. Wang, “Cocktail party processing via structured prediction,” in Advances in Neural Information Processing Systems (NIPS), 2012.
- [21] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, no. 7, pp. 1527–1554, July 2006.
- [22] G. Hinton, L. Deng, D. Yu, G. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury, “Deep neural networks for acoustic modeling in speech recognition,” Signal Processing Magazine, 2012.
- [23] E. M. Grais and H. Erdogan, “Single channel speech music separation using nonnegative matrix factorization and spectral masks,” in International Conference on Digital Signal Processing, 2011.
- [24] T. Virtanen, “Monaural sound source separation by non-negative matrix factorization with temporal continuity and sparseness criteria,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, pp. 1066–1074, Mar. 2007.
- [25] D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix factorization,” Advances in Neural Information Processing Systems, vol. 13, pp. 556–562, 2001.
- [26] C. Févotte, N. Bertin, and J. L. Durrieu, “Nonnegative matrix factorization with the Itakura-Saito divergence. With application to music analysis,” Neural Computation, vol. 21, no. 3, pp. 793–830, 2009.
- [27] URL, “http://pianosociety.com,” 2009.
- [28] URL, “http://www.itu.int/rec/T-REC-G.191/en,” 2009.
- [29] URL, “http://www.di.ens.fr/~mschmidt/software/minfunc.html,” 2012.
- [30] E. Vincent, R. Gribonval, and C. Févotte, “Performance measurement in blind audio source separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, July 2006.