Multi-Resolution Fully Convolutional Neural Networks for Monaural Audio Source Separation

10/28/2017 · by Emad M. Grais, et al.

In deep neural networks with convolutional layers, each layer typically has a fixed-size, single-resolution receptive field (RF). Convolutional layers with a large RF capture global information from their input features, while layers with a small RF capture local details at high resolution. In this work, we introduce novel deep multi-resolution fully convolutional neural networks (MR-FCNN), where each layer has filters with different RF sizes, so that it extracts multi-resolution features capturing both the global structure and the local details of its input. The proposed MR-FCNN is applied to separate a target audio source from a mixture of many audio sources. Experimental results show that MR-FCNN improves performance over feedforward deep neural networks (DNNs) and single-resolution deep fully convolutional neural networks (FCNNs) on the audio source separation problem.


1 Introduction

Monaural audio source separation (MASS) aims to separate audio sources from their mono/single mixture [1, 2, 3]. Many deep learning techniques have been used before to tackle this problem [4, 5, 6, 7].

A variety of deep neural networks with convolutional layers have recently been used to tackle the MASS problem [8, 9, 10, 11, 12, 13, 14]. One of the main differences between those works lies in whether they use fully convolutional neural networks (FCNN), where all the network layers are convolutional, or networks where some layers are convolutional and others are fully connected. The common aspect of those works is that each convolutional layer is composed of a set of filters that all have the same receptive field (RF) size. The RF is the field of view of a unit (a filter, in the FCNN case) in a certain layer of the network [15]. In fully connected neural networks (DNN), the output of each unit in a certain layer depends on the entire input to that layer, whereas the output of a unit in a convolutional layer depends only on a region of the input; this region is the RF of that unit. The RF size is a crucial issue in many audio and visual tasks, as the output must respond to areas whose sizes correspond to the sizes of the different objects/patterns in the input data in order to extract useful information/features about each object [15]. The size of the RF equals the size of the filters in a convolutional layer. A large filter size captures the global structure of its input features [16, 17]. A small filter size captures local details at high resolution but does not capture the global structure of its input features. Intuitively, it might be useful to have sets of filters that can extract both the global structure and the local details from the input features in each layer. This could be particularly useful for the MASS problem, since the input signal is a mixture of different audio sources, and useful features for a given source may be extracted at time-frequency resolutions that differ from one source to another.

The concept of extracting multi-resolution features has recently been proposed in many applications, with different ways of extracting and combining the multi-resolution features from the input data [18, 19, 20, 16]. In this paper, we introduce a novel multi-resolution fully convolutional neural network (MR-FCNN) model for MASS, where each layer is a convolutional layer composed of different sets of filters with different sizes, which extract the global and local information from its input features at different resolutions. Each set contains filters of the same size, and this size differs from the sizes of the filters in the other sets. To our knowledge, this is the first time a deep neural network has been proposed in which each layer is composed of multi-resolution filters that extract multi-resolution features from the preceding layer, and the first time the concept of extracting multi-resolution features has been used for the MASS problem. The inputs and outputs of the MR-FCNN are two-dimensional (2D) segments from the magnitude spectrograms of the mixed and target source signals, respectively. The MR-FCNN in this work is trained to extract useful spectro-temporal features and patterns at different time-frequency resolutions in order to separate the target source from the input mixture.

This paper is organized as follows: Section 2 gives a brief introduction to fully convolutional neural networks and presents the proposed MR-FCNN. The proposed approach of using the MR-FCNN for MASS is presented in Section 3. The rest of the paper covers the experiments, discussion, and conclusions.

2 Multi-resolution fully convolutional neural networks

In this section we first give an introduction to the fully convolutional neural network (FCNN) that we use in this study as a core model, and then we introduce the proposed MR-FCNN.

2.1 Fully convolutional neural networks

The FCNN model used here is somewhat similar to the convolutional denoising encoder-decoder (auto-encoder) networks (CDEDs) used in [11, 21], but without any down-sampling (pooling) or up-sampling, as shown in Fig. 1. The encoder part of the FCNN is composed of repetitions of a convolutional layer followed by an activation layer. Each convolutional layer consists of a set of filters with the same size that extract features from its input, and the activation layer is a rectified linear unit (ReLU) that imposes nonlinearity on the feature maps. The FCNN is trained from corrupted input signals: the encoder part extracts noise-robust features that the decoder can use to reconstruct a cleaned-up version of the input data [21, 22]. In MASS, the input mixed signal can be seen as the sum of the target source that needs to be separated and background noise (the other sources in the mixture). The decoder part consists of repetitions of a deconvolutional (transposed convolution) layer followed by an activation layer. The input and output data are 2D signals (magnitude spectrograms) and the filtering is a 2D operation.

Figure 1: Overview of the structure of an FCNN that separates one target source from the mixed signal. Each layer consists of a single set of filters with the same size, followed by a rectified linear unit (ReLU) as the activation function. The sets of filters in the layers closest to the input and output have large filter sizes and a small number of filters; the number of filters increases and the filter size decreases for layers further from the input and output layers [21]. The filter sizes and numbers of filters are symmetric between the encoder and decoder sides.
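To make the structure of Fig. 1 concrete, the following is a minimal Keras sketch of such a single-resolution FCNN, using the filter counts and sizes listed for the FCNN in Table 1. The 'same' padding, the ReLU on the output layer, and the function name are assumptions made for illustration; the paper only states that every feature map keeps the size of the input segment.

```python
# Minimal sketch of a single-resolution FCNN (encoder-decoder without
# pooling/up-sampling); filter counts/sizes follow the FCNN column of Table 1.
from tensorflow.keras import layers, models

def build_fcnn(time_frames=15, freq_bins=1025):
    x_in = layers.Input(shape=(time_frames, freq_bins, 1))
    # Encoder: repeated Conv2D + ReLU; 'same' padding keeps the 2D size.
    h = layers.Conv2D(13, (13, 21), padding='same', activation='relu')(x_in)
    h = layers.Conv2D(18, (9, 13), padding='same', activation='relu')(h)
    h = layers.Conv2D(24, (7, 9), padding='same', activation='relu')(h)
    h = layers.Conv2D(42, (3, 3), padding='same', activation='relu')(h)
    # Decoder: transposed convolutions mirror the encoder.
    h = layers.Conv2DTranspose(24, (7, 9), padding='same', activation='relu')(h)
    h = layers.Conv2DTranspose(18, (9, 13), padding='same', activation='relu')(h)
    h = layers.Conv2DTranspose(13, (13, 21), padding='same', activation='relu')(h)
    # Single-channel output: the estimated magnitude spectrogram segment.
    y_out = layers.Conv2D(1, (time_frames, freq_bins), padding='same',
                          activation='relu')(h)
    return models.Model(x_in, y_out)
```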

2.2 MR-FCNN

Each layer in the FCNN in Fig. 1 is composed of one set of filters that all have the same RF size. The size of the RF is a very important parameter, as the output of each filter must respond to areas whose sizes correspond to the sizes of the different objects/patterns in the input in order to extract useful information/features from the input data [15]. For example, if the RF of a filter is much bigger than the input pattern, the filter will capture blurred features from that pattern, while if the RF is smaller than the input pattern, the output of the filter loses the global structure of that pattern [15].

In audio source separation problems, the spectrogram of the input mixed signal usually contains different combinations of spectro-temporal patterns from the different audio signals. Each source is associated with a unique set of patterns in the spectrogram of the mixture, and these patterns appear at different spectro-temporal sizes that are source dependent [23]. So, to use the FCNN to extract useful information about the individual sources in the spectrogram of their mixture, it might be useful to use filters with different RF sizes in each layer, where the different RF sizes are proportional to the diversity of the spectro-temporal sizes of the patterns in the spectrogram. Bearing these issues in mind, we propose the MR-FCNN, which is the FCNN shown in Fig. 1 but with multi-resolution filters (filters with different sizes) in each layer. Thus, each layer in the MR-FCNN has several sets of 2D filters. The filters within a set have the same size, which differs from the size of the filters in the other sets in the same layer. Each set of filters generates feature maps with a certain time-frequency resolution. Fig. 2 shows the detailed structure of each layer in the MR-FCNN. Each layer in the MR-FCNN generates multi-resolution features from its input features and also combines the multi-resolution features from the previous layers to generate accurate patterns that compose the structure of the underlying data.

Figure 2: Overview of the proposed structure of each layer of the MR-FCNN. N_{i,l} denotes the number of filters with size T_{i,l} × F_{i,l} in set i in layer l, where T_{i,l} is the dimension of the filters in the time direction and F_{i,l} is their dimension in the frequency direction for set i in layer l. The filters in different sets have different sizes, and the filters within a set have the same size. Each set i in layer l generates N_{i,l} feature maps. The number of feature maps that layer l generates equals the sum of the numbers of feature maps generated by all the sets in layer l (Σ_i N_{i,l}). ReLU denotes a rectified linear unit used as the activation function.
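As a sketch of the layer structure in Fig. 2, each multi-resolution layer can be built as several parallel sets of 2D filters applied to the same input, with the resulting feature maps concatenated along the channel axis. The default filter counts below follow the first MR-FCNN layer in Table 1; the 'same' padding and the helper name mr_conv_layer are illustrative assumptions.

```python
from tensorflow.keras import layers

def mr_conv_layer(x, sets=((12, (13, 21)), (3, (7, 9)), (3, (3, 3)))):
    """One multi-resolution layer: each (n_filters, size) pair is one set.

    All sets see the same input; 'same' padding keeps every feature map at
    the input's time-frequency size so the maps can be concatenated.
    """
    maps = [layers.Conv2D(n, size, padding='same', activation='relu')(x)
            for n, size in sets]
    # Stack the multi-resolution feature maps along the channel dimension.
    return layers.Concatenate(axis=-1)(maps)
```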

3 MR-FCNN for MASS

Given a mixture of I sources y(t) = Σ_{i=1}^{I} s_i(t), the aim of MASS is to estimate the sources s_i(t), for all i, from the mixed signal y(t) [24, 25]. We work here in the short-time Fourier transform (STFT) domain. Given the STFT Y(t, f) of the mixed signal, where t is the time-frame index and f is the frequency-bin index, the main goal is to estimate the STFT S_i(t, f) of each source in the mixture.

In this work, we propose to use as many MR-FCNNs as there are sources to be separated from the mixed signal. Each MR-FCNN sees the mixed signal as a combination of its target source and background noise, and its main aim is to estimate a clean signal for its corresponding source given the other background sources that exist in the mixed signal. This is a challenging task, since each MR-FCNN has to deal with highly nonstationary background noise (the other sources in the mixture). Each MR-FCNN is trained to map the magnitude spectrogram of the mixture to the magnitude spectrogram of its corresponding target source. Each MR-FCNN in this work is a deep, fully 2D, multi-resolution convolutional neural network without any fully connected layer, which keeps the number of parameters to be optimized small. Using only 2D convolutional layers also maintains a 2D spectro-temporal representation of the data through all the layers of the network. The inputs and outputs of the MR-FCNNs are 2D segments from the magnitude spectrograms of the mixed and target signals, respectively; the MR-FCNNs therefore span multiple spectral frames to capture the spectro-temporal characteristics of each source. Each input segment contains T spectral frames with F frequency bins, where F is the dimension of the whole spectral frame.
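As a minimal sketch of this one-network-per-source setup (the function and variable names, and the build_mr_fcnn constructor, are assumptions for illustration):

```python
# One MR-FCNN per source: each model maps magnitude-spectrogram segments of
# the mixture to the corresponding segments of one target source.
def build_separators(source_names, build_mr_fcnn):
    """Return a dict {source name: untrained MR-FCNN model}."""
    return {name: build_mr_fcnn(time_frames=15, freq_bins=1025)
            for name in source_names}

# e.g. separators = build_separators(['vocals', 'bass', 'drums', 'other'],
#                                    build_mr_fcnn)
```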

3.1 Training the MR-FCNNs for source separation

Let us assume we have training data for the mixed signals and their corresponding clean/target sources. Let Y_tr be the magnitude spectrogram of the mixed signal and S_{j,tr} be the magnitude spectrogram of the clean source j, where the subscript "tr" denotes the training data. The MR-FCNN that separates source j from the mixture is trained to minimize the following cost function:

C_j = Σ_{t,f} ( Z_j(t, f) − S_{j,tr}(t, f) )²     (1)

where Z_j is the actual output of the last layer of the MR-FCNN of source j, S_{j,tr} is the reference clean output signal for source j, and t and f are the time and frequency indices, respectively. The input of the MR-FCNNs is the magnitude spectrogram Y_tr of the mixed signal. The input and output instances of the MR-FCNN are 2D segments, each composed of consecutive spectral frames taken from the magnitude spectrograms. This allows the MR-FCNN to learn multi-resolution spectro-temporal patterns for each source.
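For concreteness, the cost of Eq. (1) for one 2D segment could be computed as below; the array names z_j and s_j are illustrative.

```python
import numpy as np

def segment_cost(z_j, s_j):
    """Squared-error cost of Eq. (1) between the MR-FCNN output z_j and the
    reference clean magnitude spectrogram segment s_j (both T x F arrays)."""
    return np.sum((z_j - s_j) ** 2)
```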

3.2 Testing the MR-FCNNs for source separation

Given the trained MR-FCNNs, the magnitude spectrogram of the mixed signal is passed through each trained MR-FCNN. The output of the MR-FCNN of source j is the estimate of the magnitude spectrogram of source j.
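A sketch of this separation step is shown below. It assumes, as is common practice but not stated above, that the estimated magnitude spectrogram is combined with the phase of the mixture to reconstruct a time-domain signal; librosa, the hop length, and the simple non-overlapping segmentation are illustrative choices, not details from the paper.

```python
import numpy as np
import librosa

def separate(mixture_wav, mr_fcnn, n_fft=2048, hop=512, frames=15):
    """Estimate one source's waveform from a mono mixture with a trained MR-FCNN."""
    stft = librosa.stft(mixture_wav, n_fft=n_fft, hop_length=hop, window='hann')
    mag, phase = np.abs(stft), np.angle(stft)            # (1025, n_frames)
    est = np.zeros_like(mag)
    # Slice the magnitude spectrogram into 15-frame 2D segments, pass each
    # through the network, and stitch the outputs back together
    # (trailing frames that do not fill a segment are skipped in this sketch).
    for start in range(0, mag.shape[1] - frames + 1, frames):
        seg = mag[:, start:start + frames].T[None, :, :, None]  # (1, 15, 1025, 1)
        out = mr_fcnn.predict(seg, verbose=0)[0, :, :, 0].T
        est[:, start:start + frames] = out
    # Reconstruct with the mixture phase (an assumption, see text above).
    return librosa.istft(est * np.exp(1j * phase), hop_length=hop, window='hann')
```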

4 Experiments

We applied our proposed MR-FCNN-based MASS approach to separate the voice/vocal sources from a group of songs in the SiSEC-2015-MUS-task dataset [26]. The dataset contains 100 stereo songs with different genres and instrumentations. To use the data with the proposed MASS approach, we converted the stereo songs to mono by averaging the two channels for all songs and sources in the dataset. Each song is a mixture of vocals, bass, drums, and other musical instruments. We used our proposed algorithm to separate the vocal signal from each song.

The first 50 songs in the dataset were used as training and validation data to train the MR-FCNN for separation, and the last 50 songs were used for testing. The data were sampled at 44.1 kHz. The magnitude spectrograms of the data were computed using the STFT with a Hanning window of length 2048 samples, an overlap interval of 512 samples, and a 2048-point FFT. Only the first 1025 FFT points were used as features, since the remaining points are the complex conjugates of points in the first half.

For the input and output data of the MR-FCNN, we chose each 2D segment to contain 15 spectral frames. This means the dimension of each input and output instance for the MR-FCNN is 15 (time frames) × 1025 (frequency bins), as in [11]. Thus, each input and output instance (a 2D segment of the spectrogram) spans around 370 msec of the waveform.
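A sketch of how the training features might be prepared with the settings above (44.1 kHz mono audio, 2048-point Hanning window and FFT, 1025 retained bins, non-overlapping 15-frame segments). The use of librosa, the hop length of 512 samples, and the function name are assumptions for illustration.

```python
import numpy as np
import librosa

def spectrogram_segments(wav_path, n_fft=2048, hop=512, frames=15):
    """Load a mono signal and return its magnitude spectrogram cut into
    (frames x 1025) 2D segments, shaped for a Keras Conv2D model."""
    y, _ = librosa.load(wav_path, sr=44100, mono=True)
    mag = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop, window='hann'))
    mag = mag.T                                   # (n_frames, 1025)
    n_seg = mag.shape[0] // frames
    segs = mag[:n_seg * frames].reshape(n_seg, frames, -1)
    return segs[..., None]                        # (n_seg, 15, 1025, 1)
```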

The quality of the separated sources was measured using the signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR), and signal-to-artefact ratio (SAR) [27]. SIR indicates how well the sources are separated in terms of the remaining interference between the sources after separation. SAR indicates the artefacts introduced by the separation algorithm in the estimated separated sources. SDR measures the overall distortion (interference and artefacts) of the separated sources. The SDR values are usually considered the overall performance measure for any source separation approach [27]. High SDR, SIR, and SAR values indicate good separation performance.
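The SDR, SIR, and SAR of [27] can be computed, for example, with the mir_eval implementation of BSS Eval; this is one possible tooling choice, not necessarily the one used by the authors.

```python
import numpy as np
import mir_eval

def evaluate(reference_sources, estimated_sources):
    """BSS Eval metrics [27]; inputs are (n_sources, n_samples) arrays."""
    sdr, sir, sar, _ = mir_eval.separation.bss_eval_sources(
        np.asarray(reference_sources), np.asarray(estimated_sources))
    return sdr, sir, sar
```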

We compared the performance of the proposed MR-FCNN model against feedforward deep neural networks (DNNs) and the single-resolution FCNN in separating the vocal signal from each song in the test set. The size of each input and output instance is the same for the FCNN and MR-FCNN (15 × 1025). Each input and output instance of the DNN is a single frame of the magnitude spectrograms of the input and output signals, respectively. Table 1 shows the number of layers, the number of filters in each layer, and the size of the filters for the FCNN and MR-FCNN.

FCNN and MR-FCNN model summary
Input/output data size: 15 frames × 1025 frequency bins

Layer | FCNN                | MR-FCNN
1     | Conv2D[13,(13,21)]  | set 1: Conv2D[12,(13,21)]; set 2: Conv2D[3,(7,9)];  set 3: Conv2D[3,(3,3)]
2     | Conv2D[18,(9,13)]   | set 1: Conv2D[3,(13,21)];  set 2: Conv2D[16,(7,9)]; set 3: Conv2D[3,(3,3)]
3     | Conv2D[24,(7,9)]    | set 1: Conv2D[3,(13,21)];  set 2: Conv2D[12,(7,9)]; set 3: Conv2D[7,(3,3)]
4     | Conv2D[42,(3,3)]    | set 1: Conv2D[3,(13,21)];  set 2: Conv2D[3,(7,9)];  set 3: Conv2D[32,(3,3)]
5     | Conv2D[24,(7,9)]    | set 1: Conv2D[3,(13,21)];  set 2: Conv2D[12,(7,9)]; set 3: Conv2D[7,(3,3)]
6     | Conv2D[18,(9,13)]   | set 1: Conv2D[3,(13,21)];  set 2: Conv2D[16,(7,9)]; set 3: Conv2D[3,(3,3)]
7     | Conv2D[13,(13,21)]  | set 1: Conv2D[12,(13,21)]; set 2: Conv2D[3,(7,9)];  set 3: Conv2D[3,(3,3)]
8     | Conv2D[1,(15,1025)] | Conv2D[1,(15,1025)]
Total number of parameters: FCNN 445,173 | MR-FCNN 558,181
Table 1: Detailed information about the number and sizes of the filters in each layer. For example, "Conv2D[13,(13,21)]" denotes a 2D convolutional layer with 13 filters, where each filter has size 13 × 21, with 13 being the filter size in the time-frame direction and 21 in the frequency direction of the spectrogram.

As in many deep learning models, there are many hyperparameters in the proposed MR-FCNN to be chosen (number of layers, filter sizes, and the number of filters in each set), and these choices are usually data and application dependent. Choosing the parameters for the FCNN is also not easy. In this work, we follow the same strategy as in [21], where the filter size decreases and the number of filters increases going deeper into the encoder part, and the opposite (the filter size increases and the number of filters decreases) holds in the decoder part towards the output. For the MR-FCNN, the number and size of the filters in each set in each layer need to be decided. We restricted ourselves in this work to three sets of filters for the whole network: the first set with size 13 × 21, the second set with size 7 × 9, and the third set with size 3 × 3. This means each layer has sets of filters with three different resolutions. Also following the same concept as in [21] for choosing the number of filters, the layers close to the input and output layers have more filters of the large size than the layers in the middle, while the layers in the middle have more filters in the set with the small filter size than the layers towards the input and output layers. For example, the first layer of the MR-FCNN has a set of 12 filters of size 13 × 21, a set of 3 filters of size 7 × 9, and a set of 3 filters of size 3 × 3. Thus, the first layer generates 18 feature maps with three different resolutions, each of size 15 × 1025 (the same size as the input and output data). The DNN has three hidden layers with ReLU activation functions, and each hidden layer has 1025 nodes. The parameters of the DNN were tuned based on our previous work on the same dataset [28, 29]. The DNN has 4,206,600 parameters, the FCNN has 445,173 parameters, and the MR-FCNN has 558,181 parameters.
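Putting Table 1 together, the full MR-FCNN might be assembled as sketched below, reusing the mr_conv_layer helper sketched in Section 2.2. The 'same' padding and the ReLU on the final layer are assumptions; the filter counts and sizes are taken directly from the MR-FCNN column of Table 1 and, under these assumptions, match its stated parameter count.

```python
from tensorflow.keras import layers, models

# (n_filters, kernel) triples per layer, taken from the MR-FCNN column of Table 1.
MR_LAYERS = [
    ((12, (13, 21)), (3, (7, 9)), (3, (3, 3))),   # layer 1
    ((3, (13, 21)), (16, (7, 9)), (3, (3, 3))),   # layer 2
    ((3, (13, 21)), (12, (7, 9)), (7, (3, 3))),   # layer 3
    ((3, (13, 21)), (3, (7, 9)), (32, (3, 3))),   # layer 4
    ((3, (13, 21)), (12, (7, 9)), (7, (3, 3))),   # layer 5
    ((3, (13, 21)), (16, (7, 9)), (3, (3, 3))),   # layer 6
    ((12, (13, 21)), (3, (7, 9)), (3, (3, 3))),   # layer 7
]

def build_mr_fcnn(time_frames=15, freq_bins=1025):
    x_in = layers.Input(shape=(time_frames, freq_bins, 1))
    h = x_in
    for sets in MR_LAYERS:
        h = mr_conv_layer(h, sets)                # multi-resolution layer (Fig. 2)
    # Layer 8: a single filter spanning the whole segment, as in Table 1.
    y_out = layers.Conv2D(1, (time_frames, freq_bins), padding='same',
                          activation='relu')(h)
    return models.Model(x_in, y_out)
```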

The parameters of all the networks were initialized randomly. All the networks were trained with backpropagation using the Adam optimizer [30], a batch size of 100, and a learning rate that is reduced by a factor of 10 when the value of the cost function on the validation set does not decrease for 3 consecutive epochs. The maximum number of epochs was 100. We implemented our proposed algorithm using Keras with the TensorFlow backend [31].
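A sketch of this training configuration with standard Keras components is given below. The initial learning rate and the Adam moment parameters are not reproduced here, so the Keras defaults are used as a stand-in; note also that Keras' mean squared error differs from the summed cost of Eq. (1) only by a constant scale factor.

```python
from tensorflow.keras.callbacks import ReduceLROnPlateau
from tensorflow.keras.optimizers import Adam

def train(model, x_mix_segments, y_source_segments, x_val, y_val):
    """Train one MR-FCNN with a squared-error cost as in Eq. (1)."""
    model.compile(optimizer=Adam(), loss='mean_squared_error')
    # Reduce the learning rate by a factor of 10 when the validation cost
    # has not decreased for 3 consecutive epochs.
    reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=3)
    model.fit(x_mix_segments, y_source_segments,
              validation_data=(x_val, y_val),
              batch_size=100, epochs=100, callbacks=[reduce_lr])
    return model
```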

To compare the proposed MR-FCNN model to the FCNN, we adjusted the number of filters and their sizes in each layer of both models so that the total numbers of parameters in the two models are close to each other, as shown in Table 1. Fig. 3 shows box plots of the SDR, SIR, and SAR of the separated vocal sources for the three deep learning models: DNN, FCNN, and MR-FCNN. The figure also shows the SDR and SIR values of the target vocal source in the mixed signal (denoted as Mix in Fig. 3). We do not show the SAR of the mixed signal because it is usually very high (around 250 dB) and would distort the scale of the figure. From the figure we can see that the vocal signals in the input mixed signal (Mix) have very low SDR and SIR values, which shows that we are dealing with a very challenging source separation problem.

As can be seen from Fig. 3, all three methods perform well in terms of the SDR, SIR, and SAR of the separated vocal signals. The proposed MR-FCNN model outperforms the other two models in SDR and SAR, while all the models perform similarly in SIR. The difference between each pair of models for all the shown SDR and SAR results is statistically significant according to a Wilcoxon signed-rank test [32] with Bonferroni correction [33].
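The significance test can be reproduced with SciPy as sketched below, assuming the per-song metric values of the two models being compared are available as paired arrays; the significance level of 0.05 is an assumption, as the exact threshold is not reproduced above.

```python
from scipy.stats import wilcoxon

def paired_test(metric_a, metric_b, n_comparisons=3, alpha=0.05):
    """Wilcoxon signed-rank test [32] on paired per-song metric values,
    with a Bonferroni correction [33] for the number of model pairs.
    Returns the raw p-value and whether it passes the corrected threshold."""
    _, p = wilcoxon(metric_a, metric_b)
    return p, p < (alpha / n_comparisons)
```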

Figure 3: (a) SDR, (b) SIR, and (c) SAR (values in dB) for the separated vocal signals obtained with deep fully connected feedforward neural networks (DNNs), deep fully convolutional neural networks (FCNNs), and the proposed multi-resolution fully convolutional neural networks (MR-FCNN). "Mix" denotes the input mixed signal.

5 Conclusions

In this work we proposed a new approach for monaural audio source separation (MASS) based on deep multi-resolution fully convolutional neural networks (MR-FCNN). The MR-FCNN learns unique multi-resolution patterns for each source and uses this information to separate the components of each source from the mixed signal. The experimental results indicate that using the MR-FCNN for MASS is a promising approach that can achieve better results than feedforward neural networks and the single-resolution FCNN.

In our future work, we will investigate applying the MR-FCNN to raw audio data (time-domain signals) to extract multi-resolution time-frequency features that may represent the input data better than STFT features. Some audio sources require higher resolution in time than in frequency, while others require the opposite. By applying the MR-FCNN to the raw audio data, we hope to extract useful features for each source according to its preferred time-frequency resolution, which could improve the performance of any audio processing approach.

6 Acknowledgements

This work is supported by grants EP/L027119/1 and EP/L027119/2 from the UK Engineering and Physical Sciences Research Council (EPSRC).

References

  • [1] X. Zhang and D. Wang, “Deep ensemble learning for monaural speech separation,” IEEE/ACM Trans. on audio, speech, and language processing, vol. 24, no. 5, pp. 967–977, 2016.
  • [2] E. M. Grais and H. Erdogan, “Source separation using regularized NMF with MMSE estimates under GMM priors with online learning for the uncertainties,” Digital Signal Processing, vol. 29, pp. 20–34, 2014.
  • [3] T. Virtanen, “Monaural sound source separation by non-negative matrix factorization with temporal continuity and sparseness criteria,” IEEE Trans. on Audio, Speech, and Language Processing, vol. 15, pp. 1066–1074, Mar. 2007.
  • [4] S. I. Mimilakis, K. Drossos, T. Virtanen, and G. Schuller, “A recurrent encoder-decoder approach with skip-filtering connections for monaural singing voice separation,” in arXiv:1709.00611, 2017.
  • [5] E. M. Grais, G. Roma, A. J. Simpson, and M. D. Plumbley, “Two stage single channel audio source separation using deep neural networks,” IEEE/ACM Trans. on Audio, Speech, and Language Processing, vol. 25, no. 9, pp. 1469–1479, 2017.
  • [6] Y. Wang and D. Wang, “A structure-preserving training target for supervised speech separation,” in Proc. ICASSP, 2014, pp. 6148–6152.
  • [7] E. M. Grais, G. Roma, A. J.R. Simpson, and M. D. Plumbley, “Discriminative enhancement for single channel audio source separation using deep neural networks,” in Proc. LVA/ICA, 2017, pp. 236–246.
  • [8] P. Chandna, M. Miron, J. Janer, and E. Gomez, “Monoaural audio source separation using deep convolutional neural networks,” in Proc. LVA/ICA, 2017, pp. 258–266.
  • [9] S. Venkataramani, Y. C. Subakan, and P. Smaragdis, “Neural network alternatives to convolutive audio models for source separation,” in Proc. MLSP, 2017.
  • [10] S. Venkataramani and P. Smaragdis, “End-to-end source separation with adaptive front-ends,” in Proc. WASPAA, 2017.
  • [11] E. M. Grais and M. D. Plumbley, “Single channel audio source separation using convolutional denoising autoencoders,” in Proc. GlobalSIP, 2017.
  • [12] M. Miron, J. Janer, and E. Gomez, “Monaural score-informed source separation for classical music using convolutional neural networks,” in Proc. ISMIR, 2017.
  • [13] W. Lim and T. Lee, “Harmonic and percussive source separation using a convolutional auto encoder,” in Proc. EUSIPCO, 2017.
  • [14] S. Fu, Y. Tsao, X. Lu, and H. Kawai, “End-to-end waveform utterance enhancement for direct evaluation metrics optimization by fully convolutional neural networks,” in arXiv:1709.03658, 2017.
  • [15] L. Wenjie, L. Yujia, U. Raquel, and Z. Richard, “Understanding the effective receptive field in deep convolutional neural networks,” in Proc. NIPS, 2016, pp. 4898–4906.
  • [16] J. Kawahara and G. Hamarneh, “Multi-resolution-tract CNN with hybrid pretrained and skin-lesion trained layers,” in Proc. MICCAI MLMI, 2016, vol. 10019, pp. 164–171.
  • [17] Y. Tang and A. Mohamed, “Multiresolution deep belief networks,” in Proc. AISTATS, 2012.
  • [18] Q. Zhang, D. Zhou, and X. Zeng, “HeartID: a multiresolution convolutional neural network for ECG-based biometric human identification in smart health applications,” IEEE Access, Special Section on Body Area Networks, pp. 11805–11816, 2017.
  • [19] W. Xue, H. Zhao, and L. Zhang, “Encoding multi-resolution two-stream cnns for action recognition,” in Proc. ICONIP, 2016, pp. 564–571.
  • [20] N. Naderi and B. Nasersharif, “Multiresolution convolutional neural network for robust speech recognition,” in Proc. ICEE, 2017.
  • [21] S. R. Park and J. W. Lee, “A fully convolutional neural network for speech enhancement,” in Proc. Interspeech, 2017.
  • [22] M. Zhao, D. Wang, Z. Zhang, and X. Zhang, “Music removal by convolutional denoising autoencoder in speech recognition,” in Proc. APSIPA, 2016.
  • [23] A. Klapuri and M. Davy (Eds.), Signal Processing Methods for Music Transcription, Springer, 2007.
  • [24] E. M. Grais, I. S. Topkaya, and H. Erdogan, “Audio-Visual speech recognition with background music using single-channel source separation,” in Proc. SIU, 2012.
  • [25] E. M. Grais and H. Erdogan, “Spectro-temporal post-enhancement using MMSE estimation in NMF based single-channel source separation,” in Proc. InterSpeech, 2013.
  • [26] N. Ono, Z. Rafii, D. Kitamura, N. Ito, and A. Liutkus, “The 2015 signal separation evaluation campaign,” in Proc. LVA/ICA, 2015, pp. 387–395.
  • [27] E. Vincent, R. Gribonval, and C. Fevotte, “Performance measurement in blind audio source separation,” IEEE Trans. on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–69, July 2006.
  • [28] E. M. Grais, G. Roma, A. J. R. Simpson, and M. D. Plumbley, “Single channel audio source separation using deep neural network ensembles,” in Proc. 140th Audio Engineering Society Convention, 2016.
  • [29] E. M. Grais, G. Roma, A. J. R. Simpson, and M. D. Plumbley, “Combining mask estimates for single channel audio source separation using deep neural networks,” in Proc. InterSpeech, 2016.
  • [30] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. ICLR, 2015.
  • [31] F. Chollet, “Keras,” https://github.com/fchollet/keras, 2015.
  • [32] F. Wilcoxon, “Individual comparisons by ranking methods,” Biometrics Bulletin, vol. 1, no. 6, pp. 80–83, 1945.
  • [33] Y. Hochberg and A. C. Tamhane, Multiple Comparison Procedures, John Wiley and Sons, 1987.