1 Introduction
For a given mixed musical signal composed of several instrumental sounds, Musical Source Separation (MSS) is a signal processing task that aims to separate the mixture into its individual acoustic sources, such as singing voice or drums. Recently, many machine learning-based methods have been proposed for the MSS task. Typical MSS models, including state-of-the-art models [mmdenselstm, dilatedlstm], apply the Short-Time Fourier Transform (STFT) to the mixed signal to obtain spectrograms and transform these with deep neural networks to estimate the corresponding spectrograms of a target instrument. Finally, they restore the target signals by applying the inverse STFT (iSTFT) to the estimated spectrograms.
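This STFT → transform → iSTFT pipeline can be sketched as follows (a minimal PyTorch sketch; the function and parameter names are our own, and the identity "network" is only a placeholder for an actual separation model):

```python
import torch

def separate(mixture, network, n_fft=2048, hop=1024):
    """Spectrogram-domain separation: STFT -> neural transform -> iSTFT."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(mixture, n_fft, hop_length=hop, window=window,
                      return_complex=True)       # complex mixture spectrogram
    est_spec = network(spec)                     # estimate target spectrogram
    return torch.istft(est_spec, n_fft, hop_length=hop, window=window,
                       length=mixture.shape[-1])

mix = torch.randn(44100)                         # one second of audio
out = separate(mix, lambda s: s)                 # identity "network"
```

With the identity network, the round trip recovers the input signal up to floating-point error, which is a convenient sanity check before plugging in a real model.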
For spectrogram transformation, existing works use various types of neural networks. For example, some models [mmdensenet, mmdenselstm] use 2D Convolutional Neural Networks (CNNs) to map a given spectrogram to another spectrogram-like (image-like) representation by filtering out non-target instrumental features. Other models [mmdenselstm, dilatedlstm] transform the input using Recurrent Neural Networks (RNNs) to capture the temporal sequential patterns observed in musical signals. However, a thorough search of the relevant literature indicates that no existing work directly evaluates and compares these types of networks.
In this paper, we aim to design a variety of neural transformation methods based on our observations and empirically evaluate their performance. Our approach to designing a neural model is mainly based on capturing the features observed in musical sources that enable a model to distinguish a specific source from the mixture. Figure 1 shows an example where four different instruments play the same note, C4. We illustrate the magnitude spectrogram of each instrument in the figure. Even though they play the same note, the observed frequency patterns are quite different and may be enough to distinguish each source from the other instruments. Based on this observation, we first design a set of transformation methods that capture time-invariant patterns observed in musical sources. These models operate only along the frequency axis and do not leverage temporal patterns.
Although the Signal-to-Distortion Ratio (SDR) performance of some time-invariant models was above our expectation (see Section 3.2), it was still considerably inferior to that of current state-of-the-art methods. This is because features observed in musical sources also include sequential patterns such as vibrato, tremolo, and crescendo, or patterns due to musical structures such as rhythm. Intuitively, these patterns could help distinguish instruments with similar frequential patterns that could not be separated without the notion of time. Existing SOTA methods use CNNs or RNNs to capture these temporal sequential patterns. Thus, on top of time-invariant feature extractors, we also design a set of neural transformation methods to capture temporal patterns, and we propose novel transform methods by extending our best-performing time-invariant (frequency-only) model into the time-frequency domain. Our model outperforms SOTA methods in the singing voice separation task, even with fewer parameters.
We define our neural transform methods in Section 2. We evaluate and compare these models in Section 3 to investigate deep neural transformation networks for the MSS task. We compare SDR performance on singing voice separation, a well-studied musical source separation task. Our experiments provide abundant material for future work by comparing several transformation methods. Finally, instead of only using magnitude spectrograms for network training, we choose a phase-aware method by training with complex-valued spectrograms (raw STFT outputs).
2 Deep Neural Transformations for Musical Source Separation
We first introduce a deep neural transformation framework for Music Source Separation, then present several models based on this framework.
2.1 Model Framework
The framework described in Figure 2 consists of three parts: (1) spectrogram extraction, (2) a Deep Neural Transformation Network, and (3) signal reconstruction.
2.1.1 Spectrogram Extraction
The spectrogram extraction layer takes a mixture signal and extracts a spectrogram, which contains useful frequency-related information represented in the time-frequency domain. In our framework, the spectrogram extraction layer produces a complex-valued spectrogram by applying the STFT, which is fed to a deep neural transformation network for source spectrogram estimation.
Currently, no SOTA method (such as [mmdenselstm, dilatedlstm]) fully utilizes the complex-valued spectrogram of the mixture signal. These methods decompose it into magnitude and phase, then use only the magnitude spectrogram as input for their neural network. They estimate the target source magnitude and combine it with the mixture phase for signal reconstruction, which can be critical for sources with low SNR. In general, considering phase information improves the estimation quality, as discussed in [phase1, phase2, phase3, complex]. There are several methods for considering both magnitude and phase, such as phase reconstruction methods [phase1, phase2, phase3], or using raw complex STFT outputs [complex]. The latter is an efficient way to improve magnitude-only models, since phase estimation can be done by simply extending a model to learn mappings between mixture and target complex spectrograms, instead of mixture and target magnitude spectrograms.
As in [cac1, cac2], we view a $c$-channeled (usually stereo or mono) complex-valued spectrogram of a mixture signal, $X \in \mathbb{C}^{c \times F \times T}$, as a $2c$-channeled real-valued spectrogram in $\mathbb{R}^{2c \times F \times T}$, where $F$ denotes the number of frequency bins and $T$ denotes the number of frames in the spectrogram. In other words, we regard the real and imaginary parts of a spectrogram as separate channels. Thus, the output of our spectrogram extraction layer, and also the network parameters, are real-valued.
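A minimal sketch of this "complex as channels" view and its inverse (NumPy; the array sizes are hypothetical examples):

```python
import numpy as np

def complex_as_channels(spec):
    """View a c-channel complex spectrogram (c, F, T) as a real-valued
    (2c, F, T) tensor by treating real/imaginary parts as extra channels."""
    return np.concatenate([spec.real, spec.imag], axis=0)

def channels_as_complex(x):
    """Inverse tensor manipulation: (2c, F, T) real -> (c, F, T) complex."""
    c = x.shape[0] // 2
    return x[:c] + 1j * x[c:]

# hypothetical stereo spectrogram: 2 channels, 1024 bins, 128 frames
stereo_spec = (np.random.randn(2, 1024, 128)
               + 1j * np.random.randn(2, 1024, 128))
real_view = complex_as_channels(stereo_spec)      # shape (4, 1024, 128)
```

The round trip is lossless, so the network can operate entirely on real tensors while the reconstruction layer recovers the complex spectrogram exactly.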
2.1.2 Deep Neural Transformation Network
A deep neural transformation network takes the spectrogram $X$ of a mixture signal as input and outputs an estimated spectrogram $\hat{Y}$, which is used for reconstructing the target source signal. The network is trained in a supervised fashion to minimize the mean square error between the output of the network and the ground-truth spectrogram $Y$ of the corresponding target source signal.
We present various deep neural transformation networks in Section 2.2. We design each convolutional model (thus excluding Section 2.2.2) to have a U-Net [medicalunet, unet]-like structure, since it can be easily extended to adopt various neural transforms, and many SOTA models [mmdensenet, mmdenselstm] also use it as their base architecture.
A deep neural transformation network consists of an encoder and a decoder. Its encoder transforms the given mixture spectrogram tensor $X$ into a multi-channel, downsized representation. Its decoder takes this representation and returns the estimated spectrogram $\hat{Y}$. The number of downsampling layers and upsampling layers is the same, as shown in Figure 3. Also, the deep neural transformation network has skip connections that concatenate output feature maps of the same scale between encoder and decoder.
There are two types of components in the architecture: neural transform layers and down/upsampling layers. We summarize these two components as follows.

A Neural Transform Layer transforms an input tensor into an equally-sized tensor (possibly with a different number of channels).

A Down/Up Sampling Layer halves/doubles the scale of an input tensor while preserving the number of channels.
Since there are multiple options available for each component, we can implement various models based on this framework. We present several models that extend this framework in Section 2.2, which are compared in the experiments section.
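The framework above — paired down/up-sampling layers, pluggable transform layers, and same-scale skip connections — can be sketched as follows (a hypothetical PyTorch skeleton, not the authors' implementation; the constructor arguments and the plain-convolution example transforms are our own):

```python
import torch
import torch.nn as nn

class TransformUNet(nn.Module):
    """U-Net-like skeleton: any transform / down / up factory can be plugged
    in, as long as transforms preserve size and down/up halve/double it."""
    def __init__(self, transform, down, up, depth, ch):
        super().__init__()
        self.enc = nn.ModuleList([transform(ch, ch) for _ in range(depth)])
        self.down = nn.ModuleList([down(ch) for _ in range(depth)])
        self.mid = transform(ch, ch)
        self.up = nn.ModuleList([up(ch) for _ in range(depth)])
        # decoder transforms see concatenated skip features (2 * ch channels)
        self.dec = nn.ModuleList([transform(2 * ch, ch) for _ in range(depth)])

    def forward(self, x):
        skips = []
        for t, d in zip(self.enc, self.down):
            x = t(x)
            skips.append(x)              # same-scale feature for the decoder
            x = d(x)
        x = self.mid(x)
        for u, t, s in zip(self.up, self.dec, reversed(skips)):
            x = t(torch.cat([u(x), s], dim=1))
        return x

# e.g. plain 2D convolutions as the transform, strided convs for scaling
net = TransformUNet(
    transform=lambda i, o: nn.Conv2d(i, o, 3, padding=1),
    down=lambda c: nn.Conv2d(c, c, 2, stride=2),
    up=lambda c: nn.ConvTranspose2d(c, c, 2, stride=2),
    depth=3, ch=12)
y = net(torch.randn(1, 12, 64, 64))      # output matches input size
```

The models in Section 2.2 can be read as different choices of the `transform`, `down`, and `up` factories in such a skeleton.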
2.1.3 Signal Reconstruction
The reconstruction layer first reshapes the input tensor into a complex-valued spectrogram, which is the inverse of the 'complex as channels' tensor manipulation (§2.1.1). It then restores the target signal by applying the inverse STFT to the estimated complex-valued spectrogram.
Table 1. Configuration of each model.

Model    | Transformation Method                         | Unit of Transforms | Down/Up-Sampling
TIC      | dense block of 1D Convs                       | single frame       | 1D (frequency axis)
TIF      | fully-connected network                       | single frame       | N/A
TFC      | dense block of 2D Convs                       | series of frames   | 2D (time-frequency)
TFC-TIF  | dense block of 2D Convs + fully-connected     | series of frames   | 2D (time-frequency)
TIC-RNN  | dense block of 1D Convs + RNN                 | series of frames   | 1D (frequency axis)
2.2 Models
We present several models based on the framework. All models have the same spectrogram extraction and signal reconstruction layers, as described in Section 2.1. The deep neural transformation networks of all the models are based on the U-Net-like architecture, except for the model presented in Section 2.2.2.
Also, every model has two additional convolution layers besides its neural transform network. For an input $X \in \mathbb{R}^{2c \times F \times T}$, each model first applies a convolution with $c_{in}$ output channels followed by ReLU [relu] activation. Thus, the size of the actual input of the network is $c_{in} \times F \times T$, and the actual output size is the same. To adjust the number of channels back to $2c$, every model applies a final convolution with $2c$ output channels to the network output. We set the parameter $c_{in}$ to be 12 for all the implemented models. Models use different neural transformations except for these two convolution layers. We summarize the configuration for each model in Table 1. Before we describe the models in detail, we introduce the notation used in our descriptions. We denote the input of the $l$-th neural transformation layer by $X^{(l)}$ and its output by $\hat{X}^{(l)}$. The size of $X^{(l)}$ is denoted as $c^{(l)} \times F^{(l)} \times T^{(l)}$, where $c^{(l)}$ and $F^{(l)} \times T^{(l)}$ represent the number of channels and the spectrogram size, respectively. Also, we denote the size of $\hat{X}^{(l)}$ by $\hat{c}^{(l)} \times F^{(l)} \times T^{(l)}$, where $\hat{c}^{(l)}$ is the number of output channels.
2.2.1 Time-Invariant Convolutions
We first introduce a model called Time-Invariant Convolutions (TIC). The TIC model uses a time-invariant convolutional transformation in each transformation layer.
Figure 4 illustrates the time-invariant convolutional transformation. Suppose that the $l$-th neural transformation layer of a TIC model transforms an input $X^{(l)}$ into an output $\hat{X}^{(l)}$. It applies a series of 1D convolution layers separately and identically to each frame (i.e., each time bin) in order to transform the input tensor in a time-invariant fashion. The series of 1D convolution layers takes the form of a dense block [densenet] structure. A dense block consists of densely connected composite layers, where each composite layer is defined as three consecutive operations: convolution, Batch Normalization (BN) [bn], and ReLU. As discussed in [densenet, mmdensenet, mmdenselstm], the densely connected structure enables each layer to propagate the gradient directly to all preceding layers, making the training of a deep convolutional network more efficient.

In each downsampling layer, the TIC model applies a 1D convolution layer with stride 2 to halve the frequency resolution. In each upsampling layer, the TIC model applies a 1D transposed convolution layer with stride 2 to recover the frequency resolution.
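A sketch of the time-invariant dense block and its frequency-only down/up-sampling (PyTorch; realizing the per-frame 1D convolutions as (k, 1) 2D kernels is our implementation choice, with the layer count and growth rate taken from the hyperparameters reported in Section 3.2):

```python
import torch
import torch.nn as nn

class TIConvBlock(nn.Module):
    """Time-invariant dense block: 1D convolutions along the frequency axis,
    applied identically to every frame. The (k, 1) kernels guarantee that
    frames never interact."""
    def __init__(self, in_ch, growth=12, n_layers=4, k=3):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(n_layers):
            self.layers.append(nn.Sequential(
                nn.Conv2d(ch, growth, (k, 1), padding=(k // 2, 0)),
                nn.BatchNorm2d(growth), nn.ReLU()))
            ch += growth                 # dense connectivity grows the input

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return feats[-1]

down = nn.Conv2d(12, 12, (2, 1), stride=(2, 1))      # halve frequency only
up = nn.ConvTranspose2d(12, 12, (2, 1), stride=(2, 1))

x = torch.randn(1, 12, 256, 16)          # (batch, channels, freq, frames)
block = TIConvBlock(12).eval()
h = block(x)                             # equally-sized output
```

Because the kernels have width 1 on the time axis, permuting the frames of the input permutes the output in exactly the same way — the transformation is time-invariant by construction.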
2.2.2 Time-Invariant Fully-connected Networks
A simple alternative way to transform spectrograms in a time-invariant fashion is to use fully-connected layers, i.e., time-invariant fully-connected transformations, as illustrated in Figure 5. The Time-invariant Fully-connected networks (TIF) model uses time-invariant fully-connected transformations instead of convolutional transforms.

Figure 5 describes the time-invariant fully-connected transformation. This method applies a multi-layer fully-connected network to each channel of each frame separately and identically. The multi-layer fully-connected network is a series of two composite layers, where each composite layer is defined as three consecutive operations: fully-connected layer, BN, and ReLU. The first composite layer maps an input to a hidden feature space, and the second composite layer maps the hidden vector back to the original frequency dimension. The time-invariant fully-connected transformation is designed to preserve the number of channels (i.e., $\hat{c}^{(l)} = c^{(l)}$ holds for every $l$ in the TIF model). Also, it does not have down/up sampling components.
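A sketch of the TIF transformation (PyTorch; our own implementation choices — the same two-composite-layer MLP is applied over the frequency axis of every channel of every frame, with the 'bottleneck factor' of 16 used in the experiments):

```python
import torch
import torch.nn as nn

class TIFBlock(nn.Module):
    """Time-invariant fully-connected transformation: a shared MLP over the
    frequency axis, applied per channel and per frame."""
    def __init__(self, n_freq=1024, bottleneck=16):
        super().__init__()
        hidden = n_freq // bottleneck            # e.g. 1024 // 16 = 64 units
        self.fc1, self.bn1 = nn.Linear(n_freq, hidden), nn.BatchNorm1d(hidden)
        self.fc2, self.bn2 = nn.Linear(hidden, n_freq), nn.BatchNorm1d(n_freq)

    def forward(self, x):                        # x: (batch, ch, freq, frames)
        b, c, f, t = x.shape
        v = x.permute(0, 1, 3, 2).reshape(-1, f) # one vector per (ch, frame)
        v = torch.relu(self.bn1(self.fc1(v)))    # composite layer 1: FC-BN-ReLU
        v = torch.relu(self.bn2(self.fc2(v)))    # composite layer 2: FC-BN-ReLU
        return v.reshape(b, c, t, f).permute(0, 1, 3, 2)

y = TIFBlock()(torch.randn(2, 12, 1024, 8))      # size and channels preserved
```

Unlike the convolutional blocks, each output bin here can depend on every frequency bin of its frame, at the cost of a dense weight matrix per layer.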
2.2.3 Time-Frequency Convolutions
The Time-Frequency Convolutions (TFC) model is similar to conventional U-Net-based MSS models such as [unet, mmdensenet]. It uses the time-frequency convolutional transformation in each transformation layer.

As shown in Figure 6, TFC uses dense blocks [densenet] of 2D CNNs for neural transformation layers, as in MDenseNet [mmdensenet]. A dense block of the TFC model consists of composite layers, where each layer is defined as three consecutive operations: 2D convolution, BN, and ReLU. Its dense blocks are applied to the entire time-frequency spectrogram, unlike in the two previous time-invariant models, where they are applied to each time bin separately. Every convolution layer in a dense block has kernels of size $k_F \times k_T$, where $k_F > 1$ and $k_T > 1$.

In each downsampling layer, the TFC model applies a 2D convolution layer with stride (2, 2) to reduce the time-frequency resolution by a factor of 4. In each upsampling layer, it applies a 2D transposed convolution layer with stride (2, 2) to recover the time-frequency resolution.
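The time-frequency dense block differs from its time-invariant counterpart only in that the kernels span both axes, so temporal context enters every composite layer. A sketch (PyTorch; the layer count, growth rate, and kernel size are illustrative defaults, not the paper's exact hyperparameters):

```python
import torch
import torch.nn as nn

class TFConvBlock(nn.Module):
    """Time-frequency dense block: 2D convolutions over the whole
    spectrogram, with dense connectivity between composite layers."""
    def __init__(self, in_ch, growth=24, n_layers=4, k=3):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(n_layers):
            self.layers.append(nn.Sequential(
                nn.Conv2d(ch, growth, k, padding=k // 2),  # k x k kernels
                nn.BatchNorm2d(growth), nn.ReLU()))
            ch += growth                         # dense connectivity

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return feats[-1]

down = nn.Conv2d(24, 24, 2, stride=(2, 2))       # scales both axes by 2
x = torch.randn(1, 24, 128, 32)                  # (batch, ch, freq, frames)
h = TFConvBlock(24)(x)                           # equally-sized output
```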
2.2.4 Time-Frequency Convolutions with Time-Invariant Fully-connected Networks
The TFC-TIF model utilizes two different transformations: the time-frequency convolutional transformation and the time-invariant fully-connected transformation. We found that such a combination significantly reduces the number of layers while maintaining SDR performance.
Figure 7 describes the neural transformation used in the TFC-TIF model. It first maps the input to an equally-sized representation with $\hat{c}^{(l)}$ channels by applying a time-frequency convolutional transformation. Then the time-invariant fully-connected transformation is applied to the dense block output. A residual connection is also added for easier gradient flow.

Other than the neural transformation, the TFC-TIF model is equivalent to the TFC model and uses the same down/upsampling layers as the TFC model.
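A sketch of one TFC-TIF transformation (PyTorch; the two-layer convolution stack stands in for a full dense block, and all sizes are illustrative, not the paper's exact hyperparameters):

```python
import torch
import torch.nn as nn

class TFCTIF(nn.Module):
    """One TFC-TIF transformation: a 2D conv stack (TFC part) followed by a
    frequency-axis MLP shared across frames (TIF part), with a residual
    connection around the TIF part."""
    def __init__(self, ch=24, n_freq=128, bottleneck=16):
        super().__init__()
        self.tfc = nn.Sequential(                 # stand-in for a dense block
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU())
        hidden = n_freq // bottleneck
        self.tif = nn.Sequential(                 # frequency-axis MLP
            nn.Linear(n_freq, hidden), nn.ReLU(),
            nn.Linear(hidden, n_freq), nn.ReLU())

    def forward(self, x):                         # x: (batch, ch, freq, frames)
        h = self.tfc(x)
        # move freq to the last dim so the Linear acts per channel and frame
        v = self.tif(h.transpose(2, 3)).transpose(2, 3)
        return h + v                              # residual connection

y = TFCTIF()(torch.randn(1, 24, 128, 32))
```

The design intuition is complementary coverage: the convolutions capture local time-frequency patterns, while the fully-connected part lets each frame attend to its entire frequency range in a single layer.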
2.2.5 Time-Invariant Convolutions with Recurrent Neural Networks
We present the Time-Invariant Convolutions with Recurrent Neural Networks (TIC-RNN) model. For each transformation layer, the TIC-RNN model uses the time-invariant convolutional transformation followed by a recurrent neural network (RNN).

The transformation method of the TIC-RNN is illustrated in Figure 8. It applies the time-invariant convolutional transformation to an input $X^{(l)}$ and obtains an equally-sized hidden representation with $\hat{c}^{(l)}$ channels. The RNN then processes this hidden representation along the temporal axis and also outputs an equally-sized tensor. The TIC-RNN model is an extension of the TIC model to the time-frequency domain, adding an RNN to learn temporal features that TIC cannot capture. We use Gated Recurrent Units [gru], a variant of the LSTM. The TIC-RNN model uses the same down/upsampling layers as the TIC model.

3 Experiment
In this section, we evaluate the models introduced in Section 2.2. We compare SDR performance on singing voice separation, a well-studied musical source separation task. We compare our models in Section 3.2 and Section 3.3. Also, we compare our models with SOTA models in Section 3.4. Section 3.5 describes experimental results on methods for dealing with complex-valued spectrograms. We present ablation studies in Section 3.6.
3.1 Setup
3.1.1 Dataset
Train and test data were obtained from the MUSDB18 dataset [musdb18]. The train and test sets of MUSDB18 contain 100 and 50 musical tracks, respectively, all stereo and sampled at 44100 Hz. Each track file consists of the mixture and its four source audios: 'vocals,' 'drums,' 'bass,' and 'other.' Since we are evaluating on singing voice separation, we only use the 'vocals' source audio as the separation target for each mixture track.
For validation, we use the default validation set (14 tracks) as defined in the musdb package, and use the MSE between the target and estimated signal (waveform) as the validation metric. Data augmentation [blend] was done on the fly to obtain fixed-length mixture audio clips composed of source audio clips from different tracks.
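The on-the-fly augmentation can be sketched as follows (NumPy; the function and variable names are hypothetical — fixed-length clips are drawn from different randomly chosen tracks and summed into a new training mixture):

```python
import numpy as np

def random_mix(vocal_tracks, accomp_tracks, clip_len, rng=np.random):
    """Draw fixed-length source clips from randomly chosen (independent)
    tracks and sum them into a new (mixture, target) training pair."""
    def clip(track):
        start = rng.randint(0, track.shape[-1] - clip_len + 1)
        return track[..., start:start + clip_len]

    vocals = clip(vocal_tracks[rng.randint(len(vocal_tracks))])
    accomp = clip(accomp_tracks[rng.randint(len(accomp_tracks))])
    return vocals + accomp, vocals               # (mixture, target)

voc = [np.random.randn(2, n) for n in (44100, 88200)]  # toy stereo "tracks"
acc = [np.random.randn(2, n) for n in (66150, 88200)]
mix, target = random_mix(voc, acc, clip_len=44100)
```

Because the pairing is sampled anew every step, the model effectively sees far more distinct mixtures than the 100 training tracks alone provide.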
3.1.2 STFT Parameters
An FFT window size of 2048 and a hop size of 1024 are used for the STFT unless otherwise mentioned.
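For reference, these parameters yield n_fft / 2 + 1 = 1025 STFT bins; we assume the 1024-bin counts quoted in the experiments drop one bin for power-of-two sizing:

```python
import torch

n_fft, hop = 2048, 1024
signal = torch.randn(3 * 44100)                  # three seconds at 44.1 kHz
spec = torch.stft(signal, n_fft, hop_length=hop,
                  window=torch.hann_window(n_fft), return_complex=True)
# spec.shape[0] == n_fft // 2 + 1 == 1025 frequency bins
```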
3.1.3 Training and Evaluation
Network weights were optimized with RMSprop [rmsprop], with the learning rate depending on model depth and batch size. Each model is trained to minimize the mean square error between the target 'vocals' spectrogram and its neural network estimate. The evaluation metric (SDR) was computed with the official evaluation tool for MUSDB18. We use the median SDR value over all the test set tracks to compare model performance, as done in SiSEC2018 [sisec].

3.2 Time-Invariant Models
We conduct experiments to evaluate the time-invariant models (i.e., TIC and TIF). We also implement a variation of the TIC model that does not use down/upsampling, in order to preserve the frequency resolution. By comparing the performance of the TIC model with that of this variant, we can investigate the effect of down/upsampling along the frequency axis.
Table 2. Time-invariant models.

model                     | # blocks | # params | SDR
TIC                       | 7        | 0.04M    | 3.88
TIC (no down/upsampling)  | 7        | 0.04M    | 3.48
TIF                       | 1        | 0.13M    | 3.16
We summarize the time-invariant models and their evaluation results in Table 2. The TIC model achieves an SDR of 3.88, the highest score among the three models. It has seven neural transforms, where each transform is a dense block of 4 composite layers with growth rate 12 (a hyperparameter used in dense blocks [densenet]). The kernel size of each convolution layer in a dense block is 3. We set the variant without down/upsampling to have the same configuration as TIC, including the U-Net skip connections (although these may not be necessary since the model is no longer multi-scaled), to isolate the effect of down/upsampling in the TIC model. Results show that the use of down/upsampling in TIC was effective, which may indicate that, for this set of hyperparameters, long-term dependencies are preferred over local features when distinguishing the unique time-invariant frequential patterns of the singing voice.
The TIF model did not perform well enough for its parameter budget when compared to the TIC models. The TIF model can be interpreted as a two-layer fully-connected network where the number of hidden units is 64 (number of frequency bins / bottleneck factor = 1024 / 16). Although it contains many more parameters than the other models, it is worth noting that TIF is much shallower than the other models. A 'bottleneck factor' of 16 is used for all TIF transformations in our experiments.
Table 3. Effect of STFT frequency resolution on TIC models (SDR).

window size | # freq. bins | TIC (7 blocks) | TIC (9 blocks)
1024        | 512          | 3.37           | 3.96
2048        | 1024         | 3.88           | 4.17
4096        | 2048         | 3.73           | 4.59
We also investigate the effect of frequency resolution by comparing identical TIC models trained with different FFT window sizes. Intuitively, a higher frequency resolution (a larger FFT window size and more frequency bins) aids in locating and reconstructing target-related features from mixture spectrograms, thus leading to better SDR performance. The results in Table 3 indicate that this is not always the case. The 7-block TIC model from Table 2 did not gain SDR when the number of frequency bins was increased from 1024 to 2048.

We assume this is because a 7-block TIC model is not deep enough to leverage higher resolutions, or because it should have been downsampled one more time to match the resolution in the middle (fourth) dense block. We conducted an additional experiment with a 9-block TIC model and found that, in this case, higher frequency resolution led to higher SDR. Thus, we claim that training with higher frequency resolution does not always give better performance, and that an adequate (in our case, deeper and more downsampled) network is needed to conform to this intuition.
3.3 Time-Frequency Models
In this subsection, we evaluate the TFC, TFC-TIF, and TIC-RNN models. We call these models time-frequency models because their neural transformations take temporal patterns into account, as opposed to only frequency-domain information as in the time-invariant models. We also implement a variant of the TFC model that does not use down/upsampling in the temporal axis, in order to preserve the temporal resolution. Each of its down/upsampling layers uses a kernel of size (2, 1) to scale the frequency resolution while preserving the temporal resolution.
Table 4. Time-frequency models.

model                 | # blocks | # params | SDR
TFC                   | 17       | 1.07M    | 6.83
TFC (time-preserved)  | 17       | 1.05M    | 6.78
TIC-RNN               | 17       | 6.26M    | 6.52
TFC-TIF               | 7        | 0.78M    | 7.11
Table 4 summarizes the results for the time-frequency models. All models are trained on 3 seconds (128 STFT frames) of music and down/upsampled 3 times in the temporal dimension. Models with more than 7 neural transforms (thus more than 3 down/upsampling layers) use both (2, 2)- and (2, 1)-sized down/upsampling layers to scale the frequency axis more than 3 times while keeping the number of scales in the temporal axis at 3. In the first two rows of Table 4, we compare the TFC model with its time-resolution-preserved version to investigate the effect of down/upsampling in the temporal dimension, and conclude that temporal resolution was not as important as frequency resolution. Both models have the same hyperparameters: 17 transforms, where the dense block of each layer has four 2D convolution layers with 3 × 3 kernels. We set the growth rate to 24 to enlarge the number of internal channels. This enlargement was enough to achieve results comparable to state-of-the-art methods, even with lower frequency resolution. The number of parameters of the TFC model is slightly larger, since the TFC model uses larger kernels for down/upsampling than its time-preserved variant.
The TIC-RNN model keeps the same frequency-axis hyperparameters as the convolutional models above, while using RNNs instead of convolutions for the temporal axis. The RNNs are implemented as bidirectional GRUs with a single hidden layer of 128 hidden units. Although it has many more parameters and a better potential for capturing long temporal dependencies than the two fully convolutional models, TIC-RNN performs worse than both TFC variants. Increasing the number of hidden units or hidden layers might have increased SDR, since many other state-of-the-art recurrent models use a hidden size of at least 512. Increasing the number of STFT frames, i.e., training on longer clips of music, might also have helped.
The final row of Table 4 shows promising results for the TFC-TIF model. It achieves state-of-the-art SDR (see Section 3.4) even with fewer parameters. Also, the TFC-TIF model outperforms the other 17-block models in Table 4 with roughly half as many layers. These results show that fully-connected layers can be a useful transformation for spectrogram-based separation models, since they reduce the number of layers while maintaining SDR performance.
For further investigation of the temporal axis, we directly compare TFC models and their time-invariant versions in Section 3.6.1.
3.4 Comparison with SOTA models
In this section, we compare our models with other models on the MUSDB18 benchmark. Table 5 shows the top-performing models (in terms of SDR) from SiSEC2018 that do not use additional training data (TAK1 [mmdenselstm], UHL2 [blend], JY3 [dilated]), along with other SOTA models (UMX [UMX], DGRU-DGConv [dilatedlstm]). Comparing with Table 4, we can see that even with less frequency resolution, our models perform comparably to or even outperform previous models. On top of that, our TFC extensions do not use recurrent layers, a key component of the previous models, which may lead to shorter forward/backward propagation times and faster source separation. Also, it is worth noting that the previous models adopt Multi-channel Wiener Filtering as a post-processing method to further enhance SDR, while ours directly use the signal reconstruction output without such post-processing.

For a fair comparison with SOTA models, we also trained an additional TFC-TIF (denoted 'large') with the same frequency resolution as the other SOTA models (FFT window size = 4096). Its hyperparameters are specified in Section 3.6.1. The TFC-TIF model from Table 4 is denoted 'small'.
Table 5. Comparison with SOTA models.

model           | # parameters  | SDR (vocals)
DGRU-DGConv     | more than 1.9M| 6.99
TAK1            | 1.22M         | 6.60
UMX             | N/A           | 6.32
UHL2            | N/A           | 5.93
JY3             | N/A           | 5.74
TFC-TIF (small) | 0.78M         | 7.11
TFC-TIF (large) | 2.24M         | 7.99
3.5 Complex as Channels
Our models view a channeled complex-valued spectrogram as a channeled real-valued spectrogram, as mentioned in Section 2.1.1. Although 'Complex as Channels (CaC)' is an easy and efficient way to extend magnitude-only models, it does not maintain the notion of real and imaginary parts within the network. On the other hand, Deep Complex Networks (DCNs) propose a more natural way to handle complex-valued tensors, such as applying separate parameters to the real and imaginary parts and using complex multiplication.
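The difference can be made concrete with a linear layer (a NumPy sketch; the names are ours). A DCN-style layer keeps separate real and imaginary weights and combines them by the complex multiplication rule, whereas a CaC model would mix all channels with unconstrained real weights:

```python
import numpy as np

def complex_linear(w_r, w_i, x_r, x_i):
    """DCN-style linear layer built from four real matrix products, following
    (a + bi)(c + di) = (ac - bd) + (ad + bc)i."""
    y_r = x_r @ w_r - x_i @ w_i
    y_i = x_r @ w_i + x_i @ w_r
    return y_r, y_i

# the four real products reproduce a genuine complex matrix product
w = np.random.randn(8, 8) + 1j * np.random.randn(8, 8)
x = np.random.randn(4, 8) + 1j * np.random.randn(4, 8)
y_r, y_i = complex_linear(w.real, w.imag, x.real, x.imag)
```

Note that the two weight matrices are reused across all four products; this weight tying is exactly the structural constraint that a CaC model drops.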
Table 6. Complex as Channels vs. Deep Complex Networks (dense block params: # convs, growth rate).

method | # convs | growth rate | # params | SDR
CaC    | 4       | 12          | 0.11M    | 5.54
DCN    | 4       | 16          | 0.10M    | 5.49
DCN    | 6       | 12          | 0.11M    | 5.45
We compare these two methods on the 7-block TFC (first row of Table 7). Despite the nature of DCNs, which can go wider and deeper for a given number of parameters, Table 6 does not show any meaningful difference between the two methods. However, since a DCN can go exponentially deeper as the number of layers increases, DCNs may have performed better with a deeper model. Meanwhile, our main reason for experimenting with 'Complex as Channels' instead of DCNs was that no existing implementation (including the official release) parallelizes the real and imaginary operations, and we found that DCNs were at least twice as slow as their 'Complex as Channels' counterparts.
3.6 Ablation Study
3.6.1 Ablation Study on Temporal Axis
In Section 1, we gave the intuition that temporal patterns could help distinguish instruments with similar frequential patterns. To verify this, we compare TFC models with their time-invariant versions. Table 7 shows that learning temporal patterns consistently leads to significantly higher SDR performance.
We also visualize this difference by comparing estimated source spectrograms for a given sub-track (1:08–1:10 of 'AM Contra – Heart Peripheral') in Figure 9. We computed the magnitudes of each estimated complex spectrogram, then visualized the average magnitude spectrogram of the left and right channels. The 17-block TFC model and its time-invariant form were used for this experiment. Figure 9 (a) is the spectrogram of the two-second mixture signal, and (b) is the corresponding ground-truth 'vocals' spectrogram. The four large periodic rectangular peaks in the mixture signal come from drum sounds. The TFC model largely succeeded in separating out these noises, whereas its TIC version failed to eliminate the drum peaks and left behind some weaker peaks that look much like the leftmost peak in Figure 9 (b) (a human inhale sound). This may be because inhale/exhale or strong consonant sounds are quite similar to drum sounds from a spectrogram point of view, since both spread over a large number of consecutive frequencies. Interestingly, the best results of SiSEC2018 also suffer from these leftover drum noises, which were also the hardest sounds for our models to remove.
Table 7. TFC models vs. their time-invariant versions (SDR).

model           | # blocks | SDR
TFC             | 7        | 5.54
TIC             | 7        | 3.88
TFC             | 17       | 6.83
TIC             | 17       | 4.95
TFC-TIF (large) | 9        | 7.99
TIC             | 9        | 5.87
3.6.2 Complex vs Magnitude
Table 8. Complex vs. magnitude training.

model | # blocks | # params | SDR
CaC   | 9        | 2.24M    | 7.99
Mag   | 9        | 2.24M    | 7.12
In our final experiment, we measure how much SDR is gained by extending a magnitude-only model into a 'Complex as Channels' model. Our TFC-TIF (large) model from Table 5 is compared to its magnitude-only form (referred to as 'Mag'). Both use the same hyperparameter set except for the number of input/output channels. Mag also has an additional ReLU after the final convolution to obtain positive-valued output spectrograms. The results show that simply training with raw STFT outputs instead of magnitudes (with minimal change to the network specification) significantly boosts SDR performance. It is also notable that the Mag model still outperforms all previous state-of-the-art models in Table 5.
4 Conclusion and future work
In this paper, we design several neural transformation methods, including time-invariant methods, time-frequency methods, and a mixture of two different transformations. We implement deep neural transformation models for the MSS task and empirically evaluate their performance. Our experiments provide abundant material for future work by comparing several transformation methods. Also, one of our models (TFC-TIF) outperforms SOTA methods in the singing voice separation task. For future work, we would like to extend this model to utilize attention networks for modeling long-term dependencies observed in both the frequency and the temporal axes.
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF2019R1F1A1062719).