Investigating Deep Neural Transformations for Spectrogram-based Musical Source Separation

12/02/2019 ∙ by Woosung Choi, et al. ∙ Korea University

Musical Source Separation (MSS) is a signal processing task that tries to separate the mixed musical signal into each acoustic sound source, such as singing voice or drums. Recently many machine learning-based methods have been proposed for the MSS task, but there were no existing works that evaluate and directly compare various types of networks. In this paper, we aim to design a variety of neural transformation methods, including time-invariant methods, time-frequency methods, and mixtures of two different transformations. Our experiments provide abundant material for future works by comparing several transformation methods. We train our models on raw complex-valued STFT outputs and achieve state-of-the-art SDR performance in the MUSDB18 singing voice separation task by a large margin of 1.0 dB.



1 Introduction

For a given mixed musical signal composed of several instrumental sounds, Musical Source Separation (MSS) is a signal processing task that tries to separate the mixture into its individual acoustic sound sources, such as singing voice or drums. Recently, many machine learning-based methods have been proposed for the MSS task. Typical MSS models, including state-of-the-art (SOTA) models [mmdenselstm, dilatedlstm], apply the Short-Time Fourier Transform (STFT) to the mixed signal to obtain spectrograms and transform these with deep neural networks to estimate the corresponding spectrograms for a target instrument. Finally, they restore target signals by applying the inverse STFT (iSTFT) to the estimated spectrograms.

For spectrogram transformation, existing works use various types of neural networks. For example, some models [mmdensenet, mmdenselstm] use 2-D Convolutional Neural Networks (CNNs) to map a given spectrogram to another spectrogram-like (image-like) representation by filtering out non-target instrumental features. Some models [mmdenselstm, dilatedlstm] transform the input using Recurrent Neural Networks (RNNs) to capture the temporal sequential patterns observed in musical signals. However, a thorough search of the relevant literature found no existing work that evaluates and directly compares these types of networks.

In this paper, we aim to design a variety of neural transformation methods based on our observations and empirically evaluate their performance. Our approach to designing a neural model was mainly based on capturing the features of a musical source that distinguish it from the rest of the mixture. Figure 1 shows an example where four different instruments play the same note, C4. We illustrate the magnitude spectrogram of each instrument in the figure. Even though they played the same note, the observed frequency patterns are quite different and may be enough to distinguish each source from the other instruments. Based on this observation, we first design a set of transformation methods that capture time-invariant patterns observed in musical sources. These models operate only along the frequency axis and do not leverage temporal patterns.

Figure 1: Magnitude Spectrograms of four different instruments playing the same note, C4

Although the Signal-to-Distortion Ratio (SDR) performance of some time-invariant models exceeded our expectations (see Section 3.2), it was still considerably inferior to that of current state-of-the-art methods. This was because features observed in musical sources also include sequential patterns such as vibrato, tremolo, and crescendo, or patterns due to musical structures such as rhythm. Intuitively, these patterns could help distinguish instruments with similar frequential patterns that could not be separated without the notion of time. Existing SOTA methods use CNNs or RNNs to capture these temporal sequential patterns. Thus, on top of time-invariant feature extractors, we also design a set of neural transformation methods to capture temporal patterns and propose novel transform methods by extending our best performing time-invariant (frequency-only) model into the time-frequency domain. Our model outperforms SOTA methods in the singing voice separation task, even with fewer parameters.

We define our neural transform methods in Section 2. We evaluate and compare these models in Section 3 to investigate deep neural transformation networks for the MSS task. We compare SDR performance on singing voice separation, a well-studied musical source separation task. Our experiments provide abundant material for future works by comparing several transformation methods. Finally, instead of only using magnitude spectrograms for network training, we choose a phase-aware method by training with complex-valued spectrograms (raw STFT outputs).

2 Deep Neural Transformations for Musical Source Separation

We first introduce a deep neural transformation framework for Music Source Separation, then present several models based on this framework.

2.1 Model Framework

The framework described in Figure 2 consists of three parts: (1) spectrogram extraction, (2) a Deep Neural Transformation Network, and (3) signal reconstruction.

Figure 2: The Deep Neural Transformation Framework for MSS task

2.1.1 Spectrogram Extraction

The spectrogram extraction layer takes a mixture signal and extracts a spectrogram which contains useful frequency related information represented in the time-frequency domain. In our framework, the spectrogram extraction layer produces a complex-valued spectrogram by applying STFT, which is fed to a deep neural transformation network for source spectrogram estimation.

Currently, no SOTA method (such as [mmdenselstm, dilatedlstm]) fully utilizes the complex-valued spectrogram of the mixture signal. These methods decompose it into magnitude and phase, then use only the magnitude spectrogram as input for their neural network. They estimate the target source magnitude and combine it with the mixture phase for signal reconstruction. Reusing the mixture phase in this way can be a critical limitation for sources with low SNR. In general, considering phase information improves the estimation quality, as discussed in [phase1, phase2, phase3, complex]. There are several methods for considering both magnitude and phase, such as phase reconstruction methods [phase1, phase2, phase3], or using raw complex STFT outputs [complex]. The latter is an efficient way to improve magnitude-only models, since phase estimation can be done by simply extending a model to learn mappings between mixture and target complex spectrograms, instead of between mixture and target magnitude spectrograms.
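To make the limitation of mixture-phase reconstruction concrete, the following minimal sketch looks at a single time-frequency bin. The bin values and the helper name `combine_with_mixture_phase` are hypothetical and chosen only for illustration; the point is that even a perfect magnitude estimate leaves an error when the mixture phase differs from the target phase.

```python
import cmath


def combine_with_mixture_phase(est_magnitude, mixture_bin):
    """Magnitude-only reconstruction: estimated magnitude, phase borrowed from the mixture."""
    return cmath.rect(est_magnitude, cmath.phase(mixture_bin))


# One time-frequency bin with illustrative (made-up) values.
target_bin = complex(1.0, 1.0)    # true STFT coefficient of the target source
mixture_bin = complex(1.0, -1.0)  # mixture coefficient with a different phase

# Assume the network estimated the target magnitude perfectly.
magnitude_only = combine_with_mixture_phase(abs(target_bin), mixture_bin)
magnitude_only_error = abs(target_bin - magnitude_only)

# A complex-valued estimate can, in principle, match the bin exactly.
complex_error = abs(target_bin - target_bin)

print(magnitude_only_error, complex_error)
```

In this toy case the magnitude-only reconstruction is as far from the target as the interfering source itself, which is the situation the low-SNR remark above refers to.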

As in [cac1, cac2], we view a $c$-channeled (usually stereo or mono) complex-valued spectrogram of a mixture signal as a $2c$-channeled real-valued spectrogram of size $2c \times F \times T$, where $F$ denotes the number of frequency bins and $T$ denotes the number of frames in the spectrogram. In other words, we regard the real and imaginary parts of a spectrogram as separate channels. Thus, the output of our spectrogram extraction layer and the network parameters are real-valued.

We call this tensor manipulation ‘Complex as Channels’ throughout the rest of this paper. We compare this method with Deep Complex Networks [complex], where the input tensor and network parameters are complex-valued, later in Section 3.5.
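As a minimal sketch of the ‘Complex as Channels’ manipulation and its inverse, the functions below convert a nested-list spectrogram between the two views. The channel ordering (all real parts first, then all imaginary parts) and the function names are assumptions made for this example; the text does not specify them.

```python
def complex_as_channels(spec):
    """[c][F][T] complex values -> [2c][F][T] real values (real parts, then imaginary parts)."""
    real = [[[z.real for z in row] for row in channel] for channel in spec]
    imag = [[[z.imag for z in row] for row in channel] for channel in spec]
    return real + imag


def channels_as_complex(tensor):
    """Inverse manipulation: [2c][F][T] real values -> [c][F][T] complex values."""
    c = len(tensor) // 2
    return [
        [
            [complex(re, im) for re, im in zip(re_row, im_row)]
            for re_row, im_row in zip(tensor[ch], tensor[ch + c])
        ]
        for ch in range(c)
    ]


# A tiny stereo example: 2 channels, 2 frequency bins, 3 frames.
stereo = [
    [[1 + 2j, 3 - 1j, 0j], [2j, -1 + 0j, 0.5 + 0.5j]],
    [[0j, 1j, 1 + 0j], [2 + 2j, -3j, 4 + 0j]],
]
real_view = complex_as_channels(stereo)
print(len(real_view))                            # number of real-valued channels
print(channels_as_complex(real_view) == stereo)  # round trip recovers the input
```

The inverse function corresponds to the reshaping step performed before the inverse STFT during signal reconstruction.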

2.1.2 Deep Neural Transformation Network

A deep neural transformation network takes the spectrogram of a mixture signal as input and outputs the estimated spectrogram, which is used for reconstructing the target source signal. The network is trained in a supervised fashion to minimize the mean square error between the output of the network and the ground-truth spectrogram of the corresponding target source signal.

We present various deep neural transformation networks in Section 2.2. We design each convolutional model (thus excluding Section 2.2.2) to be a U-Net [medicalunet, unet]-like structure since it can be easily extended to adopt various neural transforms, and many SOTA [mmdensenet, mmdenselstm] models also use it as their base architecture.

Figure 3: The Architecture of a Deep Neural Transformation Network

A deep neural transformation network consists of an encoder and a decoder. Its encoder transforms a given tensor of the mixture spectrogram into a multi-channel downsized representation. Its decoder takes this representation and returns the estimated spectrogram. The numbers of down-sampling and up-sampling layers are the same, as shown in Figure 3. Also, the deep neural transformation network has skip connections that concatenate output feature maps of the same scale between the encoder and the decoder.

There are two types of components in the architecture: neural transform layers and down/up-sampling layers. We summarize these two components as follows.

  1. A Neural Transform Layer transforms an input tensor into an equally-sized tensor (possibly with a different number of channels).

  2. A Down/Up Sampling Layer halves/doubles the scale of an input tensor while preserving the number of channels.

Since there are multiple options available for each component, we can implement various models based on this framework. We present several models that extend this framework in Section 2.2, which are compared in the experiments section.
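The size bookkeeping implied by this framework can be sketched as follows. The sketch assumes one neural transform layer per scale, so that a network with an odd number of transforms has (n - 1) / 2 down-sampling and as many up-sampling layers; the function name and the example sizes are illustrative only.

```python
def unet_scales(n_blocks, freq_bins, frames, scale_time=True):
    """Return the (frequency, time) size seen by each neural transform layer."""
    if n_blocks % 2 == 0:
        raise ValueError("this sketch assumes an odd number of transform layers")
    n_sampling = (n_blocks - 1) // 2
    sizes = [(freq_bins, frames)]
    f, t = freq_bins, frames
    for _ in range(n_sampling):          # encoder: each down-sampling layer halves the scale
        f //= 2
        if scale_time:
            t //= 2
        sizes.append((f, t))
    for size in reversed(sizes[:-1]):    # decoder: each up-sampling layer doubles it back
        sizes.append(size)
    return sizes


sizes = unet_scales(n_blocks=7, freq_bins=1024, frames=128)
for i, size in enumerate(sizes):
    print(i, size)

# Skip connections join encoder layer i with decoder layer n_blocks - 1 - i,
# which operate at the same scale.
skip_pairs = [(i, len(sizes) - 1 - i) for i in range(len(sizes) // 2)]
print(skip_pairs)
```

Setting `scale_time=False` gives the bookkeeping for models that only scale the frequency axis.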

2.1.3 Signal Reconstruction

The reconstruction layer first reshapes the input tensor into a complex-valued spectrogram, which is the inverse of the ‘Complex as Channels’ tensor manipulation (Section 2.1.1). It then restores the target signal by applying the inverse STFT to the estimated complex-valued spectrogram.

| Model | Neural transform: transformation method | Neural transform: unit of transforms | Down/up-sampling |
|---|---|---|---|
| TIC: Time-Invariant Convolutions | dense block of 1-D Convs | single frame | 1-D Conv/TransposedConv (stride: 2) |
| TIF: Time-Invariant Fully-connected networks | Fully-connected layers | single frame | N/A |
| TFC: Time-Frequency Convolutions | dense block of 2-D Convs | series of frames | 2-D Conv/TransposedConv |
| TFC-TIF: Time-Frequency Convolutions with Time-Invariant Fully-connected networks | dense block of 2-D Convs | series of frames | 2-D Conv/TransposedConv |
| TIC-RNN: Time-Invariant Convolutions with Recurrent Neural Networks | RNN after applying TIC frame-wisely | series of frames | 1-D Conv/TransposedConv (stride: 2) |

Table 1: Summary of our Deep Neural Transformation Networks

2.2 Models

We present several models based on the framework. Models have the same spectrogram extraction and signal reconstruction layers, as described in Section 2.1. Deep neural transformation networks of all the models are based on the U-Net-like architecture except for the model presented in Section 2.2.2.

Also, every model has two additional convolution layers besides its neural transform network. Each model first applies a convolution to the input spectrogram, followed by ReLU [relu] activation; the number of output channels of this convolution is a parameter of the model. Thus, the actual input of the network is the multi-channel output of this convolution, and the actual output of the network has the same size. To restore the number of channels of the input spectrogram, every model applies a final convolution to the network output. We set the channel parameter of the first convolution to 12 for all the implemented models.

Apart from these two convolution layers, the models use different neural transformations. We summarize the configuration of each model in Table 1. Before we describe the models in detail, we fix the conventions used in our descriptions. For the $l$-th neural transformation layer, we refer to its input and its output as tensors whose sizes are each given by a number of channels and a spectrogram size (frequency bins by frames). The output has the same spectrogram size as the input but may have a different number of channels.

2.2.1 Time-Invariant Convolutions

We first introduce a model called Time-Invariant Convolutions (TIC). The TIC model uses a time-invariant convolutional transformation in each transformation layer.

Figure 4: Time-Invariant Convolutional Transformation

Figure 4 illustrates the time-invariant convolutional transformation. Suppose that the $l$-th neural transformation layer of a TIC model transforms its input tensor into an output tensor. It applies a series of 1-D convolution layers separately and identically to each frame (i.e., each time slice of the input) in order to transform the input tensor in a time-invariant fashion. The series of 1-D convolution layers takes the form of a dense block [densenet] structure. A dense block consists of densely connected composite layers, where each composite layer is defined as three consecutive operations: convolution, Batch Normalization (BN) [bn], and ReLU. As discussed in [densenet, mmdensenet, mmdenselstm], the densely connected structure enables each layer to propagate the gradient directly to all preceding layers, making the training of a deep convolutional network more efficient.
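The channel bookkeeping of a dense block can be sketched as below. This assumes the standard DenseNet convention, in which each composite layer receives the concatenation of the block input and all previous layer outputs and emits a fixed number of new channels (the growth rate). The example values are illustrative, and whether the block passes on the full concatenation or only part of it is not specified in the text.

```python
def dense_block_channels(in_channels, n_layers, growth_rate):
    """Channels seen by each composite layer, and the size of the full concatenation."""
    per_layer_inputs = [in_channels + i * growth_rate for i in range(n_layers)]
    concatenated = in_channels + n_layers * growth_rate
    return per_layer_inputs, concatenated


# Illustrative values: 12 input channels, 4 composite layers, growth rate 12.
per_layer_inputs, concatenated = dense_block_channels(12, 4, 12)
print(per_layer_inputs)  # input width of each composite layer
print(concatenated)      # channels available after the last layer
```

Because every layer is wired to all earlier outputs, the gradient of the loss reaches each layer through a direct path, which is the property the paragraph above refers to.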

In each down-sampling layer, the TIC model applies a 1-D convolution layer with stride 2 to halve the frequency resolution. In each up-sampling layer, the TIC model applies a 1-D transposed convolution layer with stride 2 to recover the frequency resolution.
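The halving and recovery of the frequency resolution follow from the usual output-length formulas for strided and transposed convolutions. In the sketch below, the kernel sizes and paddings are assumptions (the text specifies only the stride), and the function names are illustrative.

```python
def conv_out_len(n, kernel, stride, padding=0):
    """Output length of a strided 1-D convolution."""
    return (n + 2 * padding - kernel) // stride + 1


def transposed_conv_out_len(n, kernel, stride, padding=0, output_padding=0):
    """Output length of a 1-D transposed convolution."""
    return (n - 1) * stride - 2 * padding + kernel + output_padding


# Kernel 2, stride 2, no padding: 1024 bins are halved and then recovered exactly.
down = conv_out_len(1024, kernel=2, stride=2)
up = transposed_conv_out_len(down, kernel=2, stride=2)
print(down, up)

# Kernel 3, stride 2, padding 1 also halves an even length.
print(conv_out_len(1024, kernel=3, stride=2, padding=1))
```

With other kernel and padding choices the up-sampled length may be off by one, in which case an output padding term or cropping is needed to match the encoder feature map for the skip connection.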

2.2.2 Time-Invariant Fully-connected networks

A simple alternative way to transform spectrograms in a time-invariant fashion is to use fully-connected layers, which gives the time-invariant fully-connected transformation illustrated in Figure 5. The Time-Invariant Fully-connected networks (TIF) model uses time-invariant fully-connected transformations instead of convolutional transforms.

Figure 5: Time-Invariant Fully-connected Transformation

Figure 5 describes the time-invariant fully-connected transformation. This method applies a multi-layer fully-connected network to each channel of each frame separately and identically. The multi-layer fully-connected network is a series of two composite layers, where each composite layer is defined as three consecutive operations: a fully-connected layer, BN, and ReLU. The first composite layer maps an input to the hidden feature space, and the second composite layer maps the internal vector back to the original frequency dimension.

The time-invariant fully-connected transformation is designed to preserve the number of channels (i.e., the numbers of input and output channels are equal for every layer in the TIF model). Also, it does not have down/up-sampling components.
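A minimal sketch of the time-invariant fully-connected transformation is given below. It omits Batch Normalization and biases for brevity, stores the spectrogram as `[channel][frame][frequency]` nested lists, and uses small made-up sizes and random weights; all of these are assumptions for the example. The same two weight matrices are applied to every frame of every channel, so identical frames produce identical outputs regardless of their position in time.

```python
import random


def relu(vec):
    return [max(0.0, x) for x in vec]


def linear(weights, vec):
    """Matrix-vector product with weights stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def tif(spectrogram, w1, w2):
    """Apply the same two-layer network to every frame of every channel."""
    return [
        [relu(linear(w2, relu(linear(w1, frame)))) for frame in channel]
        for channel in spectrogram
    ]


rng = random.Random(0)
n_freq, bottleneck_factor = 8, 4
hidden = n_freq // bottleneck_factor  # hidden width = frequency bins / bottleneck factor
w1 = [[rng.uniform(-1, 1) for _ in range(n_freq)] for _ in range(hidden)]
w2 = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(n_freq)]

frame_a = [rng.uniform(-1, 1) for _ in range(n_freq)]
frame_b = [rng.uniform(-1, 1) for _ in range(n_freq)]
x = [[frame_a, frame_b, frame_a]]  # one channel, three frames; frames 0 and 2 are identical
y = tif(x, w1, w2)

print(len(y), len(y[0]), len(y[0][0]))  # the tensor size is preserved
print(y[0][0] == y[0][2])               # identical frames give identical outputs
```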

2.2.3 Time-Frequency Convolutions

The Time-Frequency Convolutions (TFC) model is similar to conventional U-Net-based MSS models such as [unet, mmdensenet]. It uses the time-frequency convolutional transformation in each transformation layer.

Figure 6: Time-Frequency Convolutional Transformation

As shown in Figure 6, TFC uses dense blocks [densenet] of 2-D CNNs for neural transformation layers as in MDenseNet [mmdensenet]. A dense block of the TFC model consists of composite layers, where each layer is defined as three consecutive operations: 2-D convolution, BN, and ReLU. Its dense blocks are applied to the entire time-frequency spectrogram, unlike in the two previous time-invariant models where they are applied to each time bin separately. Every convolution layer in a dense block has kernels of size , where and .

In each down-sampling layer, the TFC model applies a 2-D convolution layer with stride (2,2), halving both the frequency and the time resolution (a reduction by a factor of 4 overall). In each up-sampling layer, it applies a 2-D transposed convolution layer with stride (2,2) to recover the time-frequency resolution.

2.2.4 Time-Frequency Convolutions with Time-Invariant Fully-connected networks

The TFC-TIF model utilizes two different transformations: time-frequency convolutional transformation and time-invariant fully-connected transformation. We found that such a combination significantly reduces the number of layers while maintaining SDR performance.

Figure 7: Neural transformation of the TFC-TIF model

Figure 7 describes the neural transformation used in the TFC-TIF model. It first maps the input to a same-sized representation (possibly with a different number of channels) by applying time-frequency convolutional transformations. Then the time-invariant fully-connected transformation is applied to the dense block output. A residual connection is also added for easier gradient flow.

Other than the neural transformation, the TFC-TIF model is equivalent to the TFC model and uses the same down/up-sampling layers of the TFC model.

2.2.5 Time-Invariant Convolutions with Recurrent Neural Networks

We present the time-invariant convolutions with recurrent neural networks (TIC-RNN) model. For each transformation layer, the TIC-RNN model uses time-invariant convolutional transformation followed by a recurrent neural network (RNN).

Figure 8: Neural transformation method of the TIC-RNN model

The transformation method of the TIC-RNN model is illustrated in Figure 8. It applies the time-invariant convolutional transformation to an input and obtains a same-sized hidden representation (possibly with a different number of channels). The RNN then processes this hidden representation along the temporal axis and outputs an equally sized tensor.

The TIC-RNN model is an extension of the TIC model to the time-frequency domain by adding an RNN to learn temporal features that TIC cannot capture. We use gated recurrent units [gru], a variant of the LSTM. The TIC-RNN model uses the same down/up-sampling layers as the TIC model.
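For reference, one common formulation of the gated recurrent unit is written out below. The symbols ($x_t$ for the input at frame $t$, $h_t$ for the hidden state, $W$, $U$, $b$ for the learned parameters) are introduced here for illustration and are not taken from this paper; some references swap the roles of $z_t$ and $1 - z_t$ in the last line.

```latex
\begin{aligned}
z_t &= \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right) && \text{(update gate)} \\
r_t &= \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right) && \text{(reset gate)} \\
\tilde{h}_t &= \tanh\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right) + b_h\right) && \text{(candidate state)} \\
h_t &= \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{(new hidden state)}
\end{aligned}
```

Because $h_t$ depends on $h_{t-1}$, the output at a given frame can depend on all earlier frames, which is what allows the recurrent layer to model temporal patterns that a frame-wise transformation cannot.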

3 Experiment

In this section, we evaluate the models introduced in Section 2.2. We compare SDR performance on singing voice separation, a well-studied musical source separation task. We compare our models with each other in Section 3.2 and Section 3.3, and with SOTA models in Section 3.4. Section 3.5 presents experimental results on methods for handling complex-valued spectrograms. We present ablation studies in Section 3.6.

3.1 Setup

3.1.1 Dataset

Train and test data were obtained from the MUSDB18 dataset [musdb18]. The train and test sets of MUSDB18 have 100 and 50 musical tracks each, all stereo and sampled at 44100 Hz. Each track file consists of the mixture and its four source audios: ‘vocals,’ ‘drums,’ ‘bass’ and ‘other.’ Since we are evaluating on singing voice separation, we only use the ‘vocals’ source audio as the separation target for each mixture track.

For validation, we use the default validation set (14 tracks) as defined in the musdb package, and use the MSE between target and estimated signal (waveform) as the validation metric. Data augmentation [blend] was done on the fly to obtain fixed-length mixture audio clips comprised of source audio clips from different tracks.
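One simple way to build such mixtures on the fly is sketched below. It is an illustration under stated assumptions rather than the procedure of [blend]: sources are mono lists of samples, each source is drawn from an independently chosen track at a random offset, and the names `random_mixture` and `SOURCES` are hypothetical.

```python
import random

SOURCES = ("vocals", "drums", "bass", "other")


def random_mixture(tracks, clip_len, rng):
    """Draw each source from a randomly chosen track and sum the clips into a mixture."""
    clips = {}
    for name in SOURCES:
        track = rng.choice(tracks)
        start = rng.randrange(len(track[name]) - clip_len + 1)
        clips[name] = track[name][start:start + clip_len]
    mixture = [sum(samples) for samples in zip(*clips.values())]
    return mixture, clips


# Three toy "tracks" with 10 samples per source.
toy_tracks = [
    {name: [float(i + 10 * k) for i in range(10)] for name in SOURCES}
    for k in range(3)
]
mixture, clips = random_mixture(toy_tracks, clip_len=4, rng=random.Random(0))
print(mixture)
print(clips["vocals"])  # separation target for this training example
```

Since the mixture is the sample-wise sum of the drawn clips, the ‘vocals’ clip remains a valid ground truth for the synthesized mixture.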

3.1.2 STFT Parameters

An FFT window size of 2048 and hop size of 1024 are used for STFT unless otherwise mentioned.
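The arithmetic behind these settings can be checked with a short sketch. The constant and function names are illustrative. A one-sided STFT with a 2048-point window yields 1025 bins, whereas the experiments report 1024 frequency bins, which suggests that one bin is dropped; which bin is dropped is not stated in the text.

```python
SAMPLE_RATE = 44100  # Hz, as in MUSDB18
N_FFT = 2048
HOP = 1024


def one_sided_bins(n_fft):
    """Number of non-redundant frequency bins of a real-input FFT."""
    return n_fft // 2 + 1


def clip_seconds(n_frames, hop, sample_rate):
    """Approximate duration covered by n_frames hops."""
    return n_frames * hop / sample_rate


print(one_sided_bins(N_FFT))                 # bins produced by the transform
print(clip_seconds(128, HOP, SAMPLE_RATE))   # duration of 128 frames, roughly 3 seconds
```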

3.1.3 Training and Evaluation

Network weights were optimized with RMSprop [rmsprop], with the learning rate depending on model depth and batch size. Each model is trained to minimize the mean square error between the target ‘vocals’ spectrogram and its neural network estimate.

The evaluation metric (SDR) was computed with the official evaluation tool for MUSDB18. We use the median SDR value over all the test set tracks to compare model performance, as done in SiSEC2018.
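As a rough illustration of the reported statistic, the sketch below computes a simplified SDR, defined here as the energy ratio between the reference and the estimation error in decibels, and takes the median over tracks. This is an assumption made for illustration; the official evaluation tool implements the full BSS Eval procedure, which differs in detail.

```python
import math
import statistics


def sdr_db(reference, estimate):
    """Simplified SDR in dB; assumes the estimate is not identical to the reference."""
    signal = sum(r * r for r in reference)
    error = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10(signal / error)


def median_sdr(pairs):
    """Median of per-track SDR values over (reference, estimate) pairs."""
    return statistics.median(sdr_db(ref, est) for ref, est in pairs)


# Toy "tracks": each estimate is a scaled copy of the reference.
reference = [1.0, 0.0, -1.0, 0.0]
toy_pairs = [(reference, [scale * r for r in reference]) for scale in (0.9, 0.5, 0.99)]
print([round(sdr_db(ref, est), 2) for ref, est in toy_pairs])
print(round(median_sdr(toy_pairs), 2))
```

The median is less sensitive than the mean to a few tracks with very low or very high scores.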


3.2 Time-Invariant Models

We conduct experiments to evaluate time-invariant models (i.e., TIC, TIF). We also implement a variation of the TIC model that does not use down/up-sampling, in order to preserve the frequency resolution. By comparing the performance of the TIC model with that of this variant, we can investigate the effect of down/up-sampling along the frequency axis.

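| model | # blocks | hyperparameters | # params | SDR |
|---|---|---|---|---|
| TIC | 7 | dense block params: { convs:4, gr:12, kernel:3 } | 0.04M | 3.88 |
| TIC without down/up-sampling | 7 | dense block params: { convs:4, gr:12, kernel:3 } | 0.04M | 3.48 |
| TIF | | TIF params: { bottleneck factor:16 } | 0.13M | 3.16 |

Table 2: Time-Invariant models and their evaluation results. (# blocks means the number of neural transformations, gr means the growth rate [densenet].)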

We summarize time-invariant models with their hyperparameters and their evaluation results in Table 2. The TIC model achieves an SDR of 3.88, the highest score among the three models. It has seven neural transforms, where each transform is a dense block with 4 composite layers and growth rate 12 (a hyperparameter used in dense blocks [densenet]). The kernel size of each convolution layer in a dense block is 3. We set the variant without down/up-sampling to have the same configuration as TIC, including the U-Net skip connections (although this may not be necessary since the model is no longer multi-scaled), to isolate the effect of down/up-sampling in the TIC model. Results show that the use of down/up-sampling in TIC was effective, which may indicate that, for this set of hyperparameters, long-range dependencies along the frequency axis are preferred over local features when distinguishing unique time-invariant frequential patterns of singing voice.

The TIF model did not perform well enough for its parameter budget when compared to the TIC models. The TIF model can be interpreted as a two-layer fully-connected network where the number of the hidden units is 64 (number of frequency bins/bottleneck factor = 1024/16). Although it contains a lot more parameters than the other models, it is worth noting that TIF is much shallower than the other models. A ‘bottleneck factor’ of 16 is used for all TIF transformations used in our experiments.
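The hidden width and a rough weight count for one such transformation can be worked out as follows. The sketch counts only the two weight matrices (no biases or Batch Normalization parameters) and assumes a single transformation shared across channels and frames; the function name is illustrative.

```python
def tif_weight_count(freq_bins, bottleneck_factor):
    """Hidden width and number of weights in the two fully-connected layers."""
    hidden = freq_bins // bottleneck_factor
    weights = freq_bins * hidden + hidden * freq_bins
    return hidden, weights


hidden, weights = tif_weight_count(freq_bins=1024, bottleneck_factor=16)
print(hidden)   # hidden units
print(weights)  # weights in the two layers
```

Under these assumptions the count is about 0.13M, which is of the same order as the parameter budget reported for TIF in Table 2.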

| FFT window size | # freq. bins | TIC (7 blocks) SDR | TIC (9 blocks) SDR |
|---|---|---|---|
| 1024 | 512 | 3.37 | 3.96 |
| 2048 | 1024 | 3.88 | 4.17 |
| 4096 | 2048 | 3.73 | 4.59 |

Table 3: TIC model performance for different frequency resolutions

We also investigate the effect of frequency resolution by comparing identical TIC models with different FFT window sizes. Intuitively, a higher frequency resolution (larger FFT window size and more frequency bins) aids locating and reconstructing target-related features from mixture spectrograms, thus leading to better SDR performance. Results in Table 3 indicate that this was not always the case. The 7-blocked TIC model from Table 2 did not gain SDR when the number of frequency bins was increased from 1024 to 2048.

We assumed this was because a 7-blocked TIC model was not deep enough to leverage higher resolutions, or because it should have been down-sampled one more time to match the resolution in the middle (fourth) dense block. We conducted an additional experiment with a 9-blocked TIC model and found that, in this case, higher frequency resolution led to higher SDR. Thus, we claim that training with higher frequency resolution does not always give better performance, and that an adequate (in our case, a deeper and more down-sampled) network is needed for this intuition to hold.

3.3 Time-Frequency Models

In this subsection, we evaluate the TFC, TFC-TIF, and TIC-RNN models. We call these models time-frequency models because their neural transformations take temporal patterns into account, as opposed to using only frequency-domain information as in time-invariant models. We also implement a variant of the TFC model that does not use down/up-sampling along the temporal axis, in order to preserve temporal resolution; we refer to it as the time resolution-preserved TFC. The kernel size used in each down/up-sampling layer of this variant is chosen to preserve the temporal resolution while scaling the frequency resolution.

| model | # blocks | hyperparameters | # params | SDR |
|---|---|---|---|---|
| TFC | 17 | dense block params: { convs:4, gr:24, kernel: } | 1.07M | 6.83 |
| TFC (time resolution preserved) | 17 | dense block params: { convs:4, gr:24, kernel: } | 1.05M | 6.78 |
| TIC-RNN | | dense block params: { convs:4, gr:24, kernel: }, GRU params: { # layers: 1, hidden size: 128 } | 6.26M | 6.52 |
| TFC-TIF | | dense block params: { convs:4, gr:24, kernel: } | 0.78M | 7.11 |

Table 4: Time-Frequency models and their evaluation results. (# blocks means the number of neural transformations.)

Table 4 summarizes the results for time-frequency models. All models are trained on 3 seconds (128 STFT frames) of music and are down/up-sampled 3 times in the temporal dimension. Models with more than 7 neural transforms (thus more than 3 down/up-sampling layers) mix two kinds of down/up-sampling layers, those that scale both axes and those that scale only the frequency axis, so that the frequency axis is scaled more than 3 times while the temporal axis is scaled exactly 3 times. In the first two rows of Table 4, we compare the TFC model with its time resolution-preserved version to investigate the effect of down/up-sampling in the temporal dimension, and conclude that temporal resolution was not as important as frequency resolution. Both models have the same hyperparameters: each has 17 transforms, and the dense block of each layer has four 2-D convolution layers. We set the growth rate to 24 to enlarge the number of internal channels. This enlargement was enough to achieve results comparable with state-of-the-art methods, even with lower frequency resolution. The number of parameters of the TFC model is slightly larger since it uses larger kernels for down/up-sampling than its time resolution-preserved version.

The TIC-RNN model maintains the hyperparameters of the two fully convolutional models above for the frequency axis while using RNNs instead of convolutions for the temporal axis. RNNs were implemented as bidirectional GRUs with a single hidden layer of 128 hidden units. Although it has many more parameters and a better potential for capturing long temporal dependencies than the two fully convolutional models, TIC-RNN performs worse than the TFC model and its time resolution-preserved version. Increasing the number of hidden units or hidden layers could have increased SDR, since many other state-of-the-art recurrent models use a hidden size of at least 512. Increasing the number of STFT frames, and thus training on longer clips of music, might have also helped.

The final row of Table 4 shows promising results for the TFC-TIF model. It achieves state-of-the-art SDR (see Section 3.4) even with fewer parameters. Also, the TFC-TIF model outperforms the other 17-blocked models in Table 4 with roughly half as many layers. These results show that fully-connected layers can be a useful transformation for spectrogram-based separation models, since they reduce the number of layers while maintaining SDR performance.

For further investigation on the temporal axis, we directly compare TFC models and their time-invariant versions in Section 3.6.1.

3.4 Comparison with SOTA models

In this section, we compare our models with other models on the MUSDB18 benchmark. Table 5 shows the top-performing models (regarding SDR) from SiSEC2018 that do not use additional training data (TAK1 [mmdenselstm], UHL2 [blend], JY3 [dilated]) along with other SOTA models (UMX [UMX], DGRU-DGConv [dilatedlstm]). Comparing with Table 4, we can see that even with lower frequency resolution, our models perform comparably to or even outperform previous models. On top of that, our TFC extensions do not use recurrent layers, which are a key component of the previous models; this may lead to shorter forward/backward propagation time and faster source separation. Also, it is worth noting that previous models adopt Multi-channel Wiener Filtering as a post-processing method to further enhance SDR, while ours directly use the signal reconstruction output without such post-processing.

For fair comparison with SOTA models, we also trained an additional TFC-TIF (notated as ‘large’) with the same frequency resolution as the other SOTA models (FFT window size = 4096). Hyperparameters are specified in Section 3.6.1. The TFC-TIF model from Table 4 is notated as ‘small’.

| model | # parameters | SDR (vocals) |
|---|---|---|
| DGRU-DGConv | more than 1.9M | 6.99 |
| TAK1 | 1.22M | 6.60 |
| UMX | N/A | 6.32 |
| UHL2 | N/A | 5.93 |
| JY3 | N/A | 5.74 |
| TFC-TIF (small) | 0.78M | 7.11 |
| TFC-TIF (large) | 2.24M | 7.99 |

Table 5: Comparison results: SDR median value on test set. (We estimate the lower bound of the number of parameters of DGRU-DGConv with 1-D CNN parameters without considering its GRUs.)

3.5 Complex as Channels

Our models view a $c$-channeled complex-valued spectrogram as a $2c$-channeled real-valued spectrogram, as mentioned in Section 2.1.1. Although ‘Complex as Channels (CaC)’ is an easy and efficient way to extend magnitude-only models, it does not maintain the notion of real and imaginary parts within its network. On the other hand, Deep Complex Networks (DCNs) propose a more natural way to handle complex-valued tensors, such as applying separate parameters for real and imaginary parts and using complex multiplication.
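The complex multiplication that a complex-valued layer relies on can be written with real-valued operations only, as in the sketch below. The function name is illustrative. A ‘Complex as Channels’ network, by contrast, mixes real and imaginary channels with unconstrained real weights and is not tied to this structure.

```python
def complex_mul_real_ops(a_re, a_im, b_re, b_im):
    """(a_re + i a_im)(b_re + i b_im) computed with four real multiplications."""
    return a_re * b_re - a_im * b_im, a_re * b_im + a_im * b_re


# Check against Python's built-in complex arithmetic.
product = complex_mul_real_ops(1.0, 2.0, 3.0, 4.0)
builtin = (1 + 2j) * (3 + 4j)
print(product, (builtin.real, builtin.imag))
```

Each complex product needs several real multiplications and additions, which may partly account for the slowdown reported for implementations that do not parallelize the real and imaginary operations.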

| method | dense block # convs | dense block growth rate | # params | SDR |
|---|---|---|---|---|
| CaC | 4 | 12 | 0.11M | 5.54 |
| DCN | 4 | 16 | 0.10M | 5.49 |
| DCN | 6 | 12 | 0.11M | 5.45 |

Table 6: Comparison: Complex as Channels (CaC) method and DCN

We compare these two methods on the 7-blocked TFC (first row of Table 4). Although DCNs can be made wider or deeper for a given number of parameters, Table 6 shows no meaningful difference between the two methods (5.54 dB for CaC versus 5.49 dB and 5.45 dB for the DCNs). However, since a DCN can go exponentially deeper as the number of layers increases, DCNs may have performed better with a deeper model. Meanwhile, our main reason for experimenting with ‘Complex as Channels’ instead of DCNs was that no existing implementation (including the official release) parallelizes the real and imaginary operations, and we found that DCNs were at least twice as slow as their ‘Complex as Channels’ counterpart.

3.6 Ablation Study

3.6.1 Ablation Study on Temporal Axis

In Section 1, we gave an intuition that temporal patterns could help distinguish instruments with similar frequential patterns. To verify this, we compare TFC models with their time-invariant versions. Table 7 shows that learning temporal patterns consistently leads to significantly higher SDR performance.

Figure 9: Magnitude spectrograms of the sub-track: (a) Mixture signal, (b) Corresponding Vocals signal, (c) Estimated signal of the TFC model, (d) Estimated signal of the TIC model

We also visualize this difference by comparing estimated source spectrograms for a given sub-track (1:08-1:10 of AM Contra - Heart Peripheral) in Figure 9. We computed the magnitude of each estimated complex spectrogram, then visualized the average magnitude spectrogram of the left and right channels. The 17-blocked TFC model and its time-invariant form were used for this experiment. Figure 9 (a) is the spectrogram of the two-second mixture signal, and (b) is the corresponding ground-truth ‘vocals’ spectrogram. The four large periodic rectangular peaks in the mixture signal come from drum sounds. The TFC model somewhat succeeded in separating these noises, whereas its TIC version failed to eliminate the drum peaks and left behind some weaker peaks that look very much like the left-most peak in Figure 9 (b) (a human inhale sound). This may be because inhale/exhale or strong consonant sounds are quite similar to drum sounds from a spectrogram point of view, since they both spread over a large number of consecutive frequencies. Interestingly, the best results of SiSEC2018 also suffer from these leftover drum noises, which were also the hardest sounds for our models to remove.

# blocks hyperparameters SDR

dense block params:
{ convs:4, gr:12 }


dense block params:
{ convs:4, gr:24 }

TFC-TIF (large) 9
dense block params:
{ convs:5, gr:24 }


Table 7: Ablation Study on the Temporal Axis.

3.6.2 Complex vs Magnitude

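| method | # blocks | hyperparameters | # params | SDR |
|---|---|---|---|---|
| CaC | 9 | dense block params: { convs:5, gr:24, kernel: } | 2.24M | 7.99 |
| Mag | 9 | dense block params: { convs:5, gr:24, kernel: } | 2.24M | 7.12 |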

Table 8: Ablation Study on Complex as Channels: TFC-TIF (large) model and its magnitude-only version.

For our final experiment, we measure how much SDR is gained by extending a magnitude-only model into a ‘Complex as Channels’ model. Our TFC-TIF (large) model in Table 5 is compared to its magnitude-only form (referred to as ‘Mag’). Both use the same hyperparameter set except for the number of input/output channels. Mag also has an additional ReLU after the final convolution to obtain positive-valued output spectrograms. Results show that simply training with raw STFT outputs instead of magnitudes (with minimal change to network specifications) significantly boosts SDR performance, from 7.12 dB to 7.99 dB. It is also notable that the Mag model still outperforms all previous state-of-the-art models in Table 5.

4 Conclusion and future work

In this paper, we design several neural transformation methods, including time-invariant methods, time-frequency methods, and mixtures of two different transformations. We also implement deep neural transformation models for the MSS task and empirically evaluate their performance. Our experiments provide abundant material for future works by comparing several transformation methods. Also, one of our models (i.e., the TFC-TIF) outperforms SOTA methods in the singing voice separation task. For future work, we would like to extend this model to utilize attention networks for modeling long-term dependencies observed in both the frequency and the temporal axis.

This work was also supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2019R1F1A1062719).