Evolving Multi-Resolution Pooling CNN for Monaural Singing Voice Separation

by   Weitao Yuan, et al.
University of Surrey

Monaural Singing Voice Separation (MSVS) is a challenging task that has been studied for decades. Deep neural networks (DNNs) are the current state-of-the-art methods for MSVS. However, existing DNNs are often designed manually, which is time-consuming and error-prone. In addition, the network architectures are usually pre-defined and not adapted to the training data. To address these issues, we introduce a Neural Architecture Search (NAS) method to the structure design of DNNs for MSVS. Specifically, we propose a new multi-resolution Convolutional Neural Network (CNN) framework for MSVS, namely the Multi-Resolution Pooling CNN (MRP-CNN), which uses pooling operators of various sizes to extract multi-resolution features. Based on NAS, we then develop an evolving framework, namely the Evolving MRP-CNN (E-MRP-CNN), which automatically searches for effective MRP-CNN structures using genetic algorithms, optimized either for a single objective that considers only the separation performance or for multiple objectives that consider both the separation performance and the model complexity. The multi-objective E-MRP-CNN gives a set of Pareto-optimal solutions, each providing a trade-off between separation performance and model complexity. Quantitative and qualitative evaluations on the MIR-1K and DSD100 datasets demonstrate the advantages of the proposed framework over several recent baselines.




I Introduction

Popular music, which plays a central role in the entertainment industry, usually consists of two components: singing voice (Vocal) and music accompaniment (Acc) [34]. Human beings can easily distinguish the singing voice from the music accompaniment when listening to a popular song. This task, effortless for humans, is however very difficult for machines, which raises both challenges and opportunities to advance audio signal processing techniques [20, 34]. Monaural singing voice separation (MSVS), an important research branch of music source separation (MSS), aims to separate the singing voice and the background music accompaniment from a single-channel mixture signal. Research on MSVS is useful in many areas such as automatic lyrics recognition/alignment, singer identification, and music information retrieval [20]. Moreover, it would benefit our understanding of the perception and interpretation mechanisms of the human auditory system [20].

Traditional (largely unsupervised) methods have provided many effective frameworks for MSVS [34], e.g., time-frequency (T-F) masking methods [20] and robust principal component analysis (RPCA) based methods [11]. A comprehensive overview of the traditional MSVS methods can be found in [34]. Building on these methods, recent data-driven methods, especially Deep Neural Networks (DNNs) [17], have strongly boosted the performance of MSVS with the help of large-scale data. The basic building blocks of DNNs for MSVS mainly include the Feed-Forward Network (FFN) [39], Recurrent Neural Network (RNN) [13], Convolutional Neural Network (CNN) [5], and attention mechanism [51]. Among these building blocks, the CNN has proven very effective in extracting vocal/musical features for MSVS, since convolutional filters with shared weights can learn efficient representations of the discriminative features of vocals and music.

In fact, music relies heavily on its multi-scale repetitions (e.g., from very basic elements such as individual notes, timbre, or pitch to larger structures such as chords [33]) to build its logical structure and meaning [10]. These multi-resolution repetitions, which appear at various musical levels, also distinguish the music accompaniment from vocals, which are less redundant and mostly harmonic [34]. As an important CNN for MSVS, the Multi-Resolution CNN (MR-CNN) [32, 14, 7, 6], which captures multi-resolution features by constructing receptive fields (RFs) of various sizes, has been found effective in modeling the multi-scale repetitive music structures and extracting discriminative features (e.g., global or local features). The MR-CNN has been widely employed by many state-of-the-art (SOTA) MSVS methods and is also the research focus of this work.

According to how the multi-resolution RFs are implemented, existing MR-CNNs for MSVS/MSS can be divided into two types. The first type, e.g., the Stacked Hourglass Network (SHN) [32] and U-net [14], is constructed in a cascade manner with a fixed-size or single-resolution RF in each layer. The input signal is repeatedly convolved and downsampled to form multiple consecutive layers. In this case, features of different resolutions can only be found in different layers, and thus the cascade structure of the first type of MR-CNN should be deep enough to extract effective multi-resolution features. In contrast, the second type of MR-CNN, such as the Multi-Resolution Convolutional Auto-Encoder (MRCAE) [7] and the Multi-Resolution Fully Convolutional Neural Network (MR-FCNN) [6], directly implements multi-resolution RFs in the same layer by using multiple sets of convolutional operators of various sizes. Accordingly, multi-resolution features can be extracted in one or a few layers without deepening the cascade structure.

In spite of these achievements, several issues need to be addressed for current MR-CNNs:

(1) Architecture limitations

The first type MR-CNN depends on its cascade structure to extract multi-scale music features. However, according to [52], the optimization algorithms would be less effective in capturing the dependencies across multiple layers. This problem could be aggravated in the first type MR-CNN since it heavily relies on its deep cascade structure to improve the separation performance of MSVS.

In contrast, the second type of MR-CNN does not suffer from this optimization issue. However, in order to extract global features, large convolutional filters must be used. According to [44], large convolutional filters result in low computational efficiency. Moreover, for MSVS, a minor linear shift in T-F representations (e.g., the magnitude spectrogram) could cause significant distortions in vocal and music perception [14]. To address this issue, many MSVS networks employ skip or similar connections to directly transmit low-level information between different layers [14, 32]. However, such skip connections (or similar mechanisms) have not been implemented for the second type of MR-CNN.

(2) Manual design

Current MR-CNNs (or DNNs) based MSVS methods are usually designed manually. This manual design procedure usually has the following shortcomings.

  1. Manual design is often achieved empirically via trial and error: MSVS is a challenging task, as the music accompaniment and vocals often exhibit highly synchronous non-stationary spectro-temporal structures over time and frequency [34]. The MR-CNN learns hierarchical feature extractors (e.g., the coefficients of the convolutional operators) from the data in an “end-to-end” fashion. In this case, slight modifications to the architecture may significantly affect the separation performance. Finding suitable structures for MSVS requires a large number of architecture modifications and repeated training and testing, which is inevitably time-consuming, error-prone, and inefficient.

  2. Domain knowledge may not be sufficient for detailed architecture design: For MSVS, domain knowledge may suggest using vertical filters to learn timbral representations [18] and horizontal filters to learn long temporal cues [37] in the T-F domain. However, when dealing with an actual MSVS network, how to combine and deploy these filters, and how to select an effective combination from the many possible combinations, may not be answered sufficiently by domain/expert knowledge.

  3. Pre-designed structures lack a mechanism to adapt their architectures to the training data: The data-driven optimization process of MR-CNNs can learn the parameters of the convolutional filters. However, the pre-defined convolutional operator sizes, the hyper-parameters, and the architecture of MR-CNNs cannot be changed or adapted to the dataset during the training process. As a result, the information learned from real data is not utilized for improving the pre-designed structures.

To address these issues, this paper proposes a flexible and effective MR-CNN for MSVS, namely the Multi-Resolution Pooling CNN (MRP-CNN). We also extend the proposed MRP-CNN into an evolving framework, i.e., the E-MRP-CNN, using the Neural Architecture Search (NAS) technique. The E-MRP-CNN can automatically evolve its neural architecture according to the training data using two kinds of genetic algorithms: the single-objective genetic algorithm and the multi-objective genetic algorithm. The details of our work are described below.

(1) Multi-resolution Pooling CNN

The MRP-CNN utilizes sets of average pooling operators of various sizes in parallel at the same layer to obtain multi-resolution features. All these pooling operators are embedded in stacked convolution networks with small and fixed-size convolutional kernels. Compared with the cascade framework U-net or SHN (the first type MR-CNN), the MRP-CNN does not need to optimize the deep cascade structure. Compared with the second type MR-CNN, large-size pooling (downsampling) operators rather than large-size convolutional filters are used to extract global features, which reduces the number of trainable parameters and leads to much better memory and computational efficiency. Moreover, the MRP-CNN is a flexible design and allows skip connections (or other similar connections) to be implemented between different layers for low-level features transmission.
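The parallel pooling idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the convolution stages (CG/PCG) are omitted entirely, nearest-neighbour upsampling is an assumption, and the toy input is assumed to have dimensions divisible by every pooling size.

```python
# Illustrative sketch only: average-pool a 2D feature map at several sizes,
# upsample each result back to the input size, and collect the results.
# The convolutional stages of the MRP-CNN are deliberately left out.

def avg_pool(x, pt, pf):
    """Non-overlapping average pooling with a pt (time) x pf (frequency) window."""
    n_time, n_freq = len(x), len(x[0])
    return [[sum(x[t + i][f + j] for i in range(pt) for j in range(pf)) / (pt * pf)
             for f in range(0, n_freq, pf)]
            for t in range(0, n_time, pt)]

def upsample(x, pt, pf):
    """Nearest-neighbour upsampling by factors pt x pf (an assumed choice)."""
    return [[x[t // pt][f // pf] for f in range(len(x[0]) * pf)]
            for t in range(len(x) * pt)]

def multi_resolution_features(x, pool_sizes):
    """One same-sized feature map per pooling size; a stand-in for concatenation."""
    return [upsample(avg_pool(x, pt, pf), pt, pf) for pt, pf in pool_sizes]

# Toy 4x4 "spectrogram" with values 0..15 (rows: time, columns: frequency).
spec = [[float(t * 4 + f) for f in range(4)] for t in range(4)]
feats = multi_resolution_features(spec, [(1, 1), (1, 4), (4, 4)])
print(feats[1][0])  # first time frame averaged over all frequency bins
print(feats[2][0])  # global average broadcast back to full size
```

Larger pooling windows summarize wider T-F regions without adding trainable parameters, which is the efficiency argument made in the paragraph above.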

(2) Automatic Neural Architecture Search

We introduce NAS to the MRP-CNN and construct the E-MRP-CNN, which can automatically search effective MRP-CNN architectures for MSVS. As the first attempt to introduce NAS in the MSVS field, we aim to enhance the existing MR-CNNs and make the DNN based MSVS methods less dependent on domain/expert knowledge, with single-objective E-MRP-CNN and multi-objective E-MRP-CNN.

The single-objective E-MRP-CNN evolves its architecture with the sole objective of optimizing the separation performance. This evolving process provides insight into how different MRP-CNN architectures affect the separation performance and which structures work well on the MSVS problem. The single-objective E-MRP-CNN tends to optimize the separation performance at the cost of choosing a more complex model. In some real applications (e.g., an embedded FPGA platform) [8], however, the computing resources and on-chip memory are usually limited. In this case, both the model complexity and the separation performance should be considered.

The multi-objective E-MRP-CNN is proposed to balance model complexity against separation performance. It provides a set of Pareto-optimal solutions [1] for MSVS, i.e., Pareto-optimal MRP-CNN architectures. Each solution (architecture) is Pareto-optimal, that is, no objective can be improved without degrading the other objective, e.g., the separation performance cannot be improved without increasing the model complexity. We approximate the Pareto-optimal solution set using a classic multi-objective evolutionary genetic algorithm: the Non-dominated Sorting Genetic Algorithm II (NSGA-II) [1]. With the multi-objective E-MRP-CNN, we obtain multiple architectures, each providing a good separation performance under a fixed model complexity.
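As a small illustration of Pareto optimality for the two objectives named above, the following sketch filters a list of candidates down to the non-dominated ones. The candidate values and function names are hypothetical and serve only to show the dominance test (higher SDR is better, fewer parameters is better).

```python
# Illustrative sketch: Pareto dominance for (SDR, parameter count) pairs.
# All numbers below are made up for the example.

def dominates(a, b):
    """True if a is no worse than b in both objectives and strictly better in one."""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better

def pareto_front(candidates):
    """Keep the candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]

# Hypothetical (SDR in dB, parameters in millions) pairs.
candidates = [(9.0, 5.0), (9.5, 8.0), (8.0, 6.0), (7.0, 1.0)]
front = pareto_front(candidates)
print(front)  # (8.0, 6.0) is dropped: (9.0, 5.0) is better in both objectives
```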

Our main contributions are summarized as follows.

  • We propose a flexible MR-CNN framework, i.e., MRP-CNN, for extracting multi-resolution spectro-temporal features for MSVS;

  • Based on MRP-CNN, we introduce the first evolutionary scheme for MSVS, i.e., the E-MRP-CNN, which can evolve its architecture and search for effective architectures for MSVS based on the training data. This automatic scheme not only avoids the empirical manual design process but also provides better separation performance (via the single-objective E-MRP-CNN) and a well-balanced trade-off between model complexity and separation performance (via the multi-objective E-MRP-CNN) for MSVS.

Fig. 1: The architecture of the proposed MRP-CNN.

II Related works

The existing deep networks for MSVS/MSS mainly use RNN [27, 28] and CNN structures [5, 32, 14, 43, 40]. The RNN can effectively model dependencies of temporal patterns and structures of music (e.g., rhythm, beat/tempo, melody) [27, 28]. The CNN, which is effective for feature extraction in the T-F domain, is usually constructed as a convolutional encoder-decoder architecture with skip connections, such as the U-net [14], Wave-U-net [43], Exp-Wave-U-Net [40], and SHN [32]. The CNN can also be combined with other structures to obtain better MSS/MSVS performance. For example, in [21], CNN and RNN are combined to improve the MSS performance; the Skip Attention (SA) [51], inspired by the Transformer [48], was introduced into the CNN encoder-decoder structure to improve the separation performance. In addition to these works, [25, 45] considered using generative adversarial networks (GANs) for (semi-supervised) MSVS; [24] designed the Chimera network for singing voice separation based on deep clustering; [26] examined the mapping functions of neural networks based on the denoising autoencoder (DAE) model for MSS. However, these works lack flexibility for adapting the architectures to the data, as compared with the use of pooling operators of various sizes in our approach. In addition, all of these networks are designed manually and none of them considered the use of NAS for automatic architecture design.

Over the past few years, NAS has achieved impressive progress in many research areas and has begun to outperform human-designed deep models [41, 2]. As a classic search strategy of NAS, the NeuroEvolution of Augmenting Topologies (NEAT) [42] adopted the Genetic Algorithm (GA) to evolve both its artificial neural networks and their weights. Recently, the Evolved Transformer [41] considered the use of NAS to find a better alternative to the Transformer for sequence-to-sequence tasks. Reinforcement Learning (RL) based NAS has also been introduced to Generative Adversarial Networks (GANs) [3]. However, to our knowledge, NAS has not been explored for the MSVS/MSS tasks and no work has yet attempted to design an evolving MR-CNN framework for MSVS/MSS. In particular, since the neural architecture for MSVS usually has millions of weights, we use the GA to optimize the neural architecture and a gradient-based method to optimize the weights [2], which is different from NEAT [42]. In addition, compared with RL-based NAS (e.g., [3]), evolution-guided NAS would be simpler and more efficient for MSVS.


III-A Proposed framework

The proposed MRP-CNN in Fig. 1(a) is composed of five stacked Blocks. (The number of Blocks was chosen empirically here and can be chosen flexibly in an application; using more than five stacked Blocks may yield a higher separation performance, but at a higher computational complexity.) Each Block works as a basic unit to extract multi-resolution features, and the five Blocks form a stacked structure. Skip connections (dotted lines in Fig. 1(a)) can be optionally used between different Blocks to improve the separation performance.

As illustrated in Fig. 1(b), each Block consists of a convolution-group (CG), multiple pooling layers (PLs), a concatenation, and a post-convolution-group (PCG) layer. A skip connection can be used optionally (dotted lines in Fig. 1(b)). Each PL in a Block is composed of three components: an average pooling operator of a given size, a PCG layer, and an upsampling operation. Each PL is responsible for extracting features at one specific resolution, so a Block with multiple PLs can extract multi-resolution features. The CG and PCG in each Block have the same structure. As shown in Fig. 1(c), both CG and PCG are made of two consecutive convolution layers of the same size and a possible skip connection, where the size is determined by the kernel size of the 2D convolutional operator and the channel number.

Using the hyper-parameters (e.g., the number of PLs, the pooling sizes, and the channel numbers) and flexible components (e.g., skip connections) of the basic MRP-CNN framework, many different MRP-CNN architectures can be induced. For example, in each Block, the exact number of PLs can be adjusted by the data-driven evolution process of E-MRP-CNN. In particular, when the size of the average pooling operator of one PL is changed to 1x1 during the evolution process, this PL will not be used in the current Block. In addition, the CG/PCG can have different channel numbers, and when the channel number is set to None, the CG/PCG is turned into a direct connection; skip connections can be used optionally between different Blocks; nonlinear activation functions can be different (e.g., ReLU or sigmoid). Hence, the proposed MRP-CNN provides a flexible framework for MSVS.

III-B Encoding method

The encoding process assigns each specific MRP-CNN architecture a unique code, i.e., the gene. With the gene-encoded MRP-CNN architectures, a search space is constructed, thus enabling our NAS to find the appropriate architectures for MSVS (see Section IV) under the defined objective. For convenience of presentation, we divide the proposed MRP-CNN framework in Fig. 1 into the following four levels, from low to high: Convolution-level, Pooling-level, Block-level, and Full-level. The Convolution-level represents convolutional layers, and CG and PCG belong to this level; the Pooling-level, Block-level, and Full-level correspond to the PL, the Block, and the whole MRP-CNN structure, respectively. The whole MRP-CNN structure can be encoded as in Table I, where all four levels are included.

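Component   | Field                              | Code
Full-level  | Channel number of Block PCG        | 2bit: 00(32), 01(64), 11(128), 10(256)
Full-level  | Skip connections between Blocks    | 10bit: b-bb-bbb-bbbb (b = 0 or 1)
CG          | Channel number                     | 2bit: 00(None), 01(32), 11(64), 10(128)
CG          | Skip connection                    | 1bit: 0(No), 1(Yes)
CG          | Activation, first conv layer       | 1bit: 0(ReLU), 1(Sigmoid)
CG          | Activation, second conv layer      | 1bit: 0(ReLU), 1(Sigmoid)
PL          | Pooling size (time)x(frequency)    | (2bit)x(2bit): 00(1), 01(4), 11(16), 10(64)
PL          | Channel number of PCG in PL        | 2bit: 00(16), 01(32), 11(64), 10(128)
PL          | PCG skip connection                | 1bit: 0(No), 1(Yes)
PL          | PCG activation, first conv layer   | 1bit: 0(ReLU), 1(Sigmoid)
PL          | PCG activation, second conv layer  | 1bit: 0(ReLU), 1(Sigmoid)
Block PCG   | Skip connection                    | 1bit: 0(No), 1(Yes)
Block PCG   | Activation, first conv layer       | 1bit: 0(ReLU), 1(Sigmoid)
Block PCG   | Activation, second conv layer      | 1bit: 0(ReLU), 1(Sigmoid)
….          | ….                                 | ….
TABLE I: Encoding method of the proposed MRP-CNN.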

III-B1 Full-level

The Full-level, i.e., the whole MRP-CNN structure, is encoded by concatenating three kinds of codes: a code for the number of channels of the last PCG layer in all Blocks (see Fig. 1(b)), a code for the possible skip connections between different Blocks, and the codes of the individual Blocks.

The channel number can be 32, 64, 128, or 256, as shown in Table I, where we use 2 bits to represent the four options: 00(32), 01(64), 11(128), 10(256), respectively. Here, the same channel number (one of the four options) is used for all Blocks in one MRP-CNN structure, since the output channels of different Blocks should be the same to enable skip connections.

The skip-connection code is encoded in the form “b-bb-bbb-bbbb” using ten bits (see the second row in Table I). The first bit ‘b’ stands for the skip connection from the first Block to the second Block, the second group ‘bb’ stands for the skip connections from the first and second Blocks to the third Block, and so on. The value of b decides whether a skip connection exists (b=1) or not (b=0).
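To make the ten-bit skip-connection code concrete, the sketch below decodes a string of the form b-bb-bbb-bbbb into (source Block, target Block) pairs. The function name is hypothetical, and the ordering of bits inside each group (first bit for the first Block, and so on) is an assumption consistent with the description above rather than something the text states explicitly.

```python
# Illustrative sketch: decode the Full-level skip-connection code.
# Assumption: within each group, bit k refers to a skip from Block k.

def decode_block_skips(code):
    """Map 'b-bb-bbb-bbbb' to a list of (source_block, target_block), 1-indexed."""
    groups = code.split("-")
    if [len(g) for g in groups] != [1, 2, 3, 4]:
        raise ValueError("expected a code of the form b-bb-bbb-bbbb")
    skips = []
    # The first group targets Block 2, the second Block 3, and so on.
    for target, group in enumerate(groups, start=2):
        for source, bit in enumerate(group, start=1):
            if bit == "1":  # 1 = skip connection exists, 0 = no connection
                skips.append((source, target))
    return skips

skips = decode_block_skips("1-01-000-1001")
print(skips)
```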

III-B2 Block-level

This level is important for extracting multi-resolution features. Each Block is encoded as the concatenation of the codes of its CG, its PLs, and its PCG, all of which have been defined earlier. Both CG and PCG belong to the Convolution-level, and the PLs working in parallel belong to the Pooling-level.

Blocks Block 1 Block 2 Block 3 Block 4 Block 5
11(64) 11(64) 11(64) 11(64) 11(64)
1(Yes) 1(Yes) 1(Yes) 1(Yes) 1(Yes)
0(ReLU) 0(ReLU) 0(ReLU) 0(ReLU) 0(ReLU)
0(ReLU) 0(ReLU) 0(ReLU) 0(ReLU) 0(ReLU)
0011(1x16) 0011(1x16) 0011(1x16) 0011(1x16) 0011(1x16)
11(64) 11(64) 11(64) 11(64) 11(64)
1(Yes) 1(Yes) 1(Yes) 1(Yes) 1(Yes)
0(ReLU) 0(ReLU) 0(ReLU) 0(ReLU) 0(ReLU)
0(ReLU) 0(ReLU) 0(ReLU) 0(ReLU) 0(ReLU)
0000(1x1) 0000(1x1) 0000(1x1) 0000(1x1) 0000(1x1)
11(64) 11(64) 11(64) 11(64) 11(64)
1(Yes) 1(Yes) 1(Yes) 1(Yes) 1(Yes)
0(ReLU) 0(ReLU) 0(ReLU) 0(ReLU) 0(ReLU)
0(ReLU) 0(ReLU) 0(ReLU) 0(ReLU) 0(ReLU)
1(Yes) 1(Yes) 1(Yes) 1(Yes) 1(Yes)
0(ReLU) 0(ReLU) 0(ReLU) 0(ReLU) 0(ReLU)
0(ReLU) 0(ReLU) 0(ReLU) 0(ReLU) 0(ReLU)
TABLE II: The code (gene) of an example MRP-CNN.

III-B3 Convolution-level

The CG and PCG, which have the same architecture (see Fig. 1(c)), are encoded differently. The CG is encoded as the concatenation of a channel-number code, a skip-connection bit, and two activation bits. The channel-number code encodes the number of channels of the convolutional layers in CG (see Fig. 1(b)), the skip-connection bit indicates whether the skip connection is used (0: no, 1: yes), and the two consecutive activation bits give the activation functions of the two convolution layers, where 0 represents ReLU and 1 represents Sigmoid. The channel number can be None, 32, 64, or 128. When it is None, the CG turns into a direct connection, i.e., there is no convolution, activation, or skip connection. In this case, the rest of the CG code is ignored.

The code of PCG is similar to that of CG but without the channel number information, i.e., it consists of the skip-connection bit followed by the two activation bits.

According to Fig. 1(b), the PCG is employed in both the Block and the PL. Thus the channel number of the PCG in a Block and in a PL is decided by the channel-number code in the Full-level and in the Pooling-level (see the following), respectively.
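The CG and PCG codes can be decoded with a few lines of code. The sketch below follows the bit values listed in Table I; the function names, the dictionary layout, and the assumed field order (channel number, skip connection, then two activations) are illustrative choices, not taken from the paper.

```python
# Illustrative sketch: decode Convolution-level codes using the Table I values.
# Field order inside each code is an assumption for this example.

CG_CHANNELS = {"00": None, "01": 32, "11": 64, "10": 128}
ACTIVATION = {"0": "relu", "1": "sigmoid"}

def decode_pcg(bits):
    """3 bits: skip connection, activation of layer 1, activation of layer 2."""
    if len(bits) != 3:
        raise ValueError("a PCG code has 3 bits")
    return {"skip": bits[0] == "1",
            "activations": (ACTIVATION[bits[1]], ACTIVATION[bits[2]])}

def decode_cg(bits):
    """5 bits: 2-bit channel number followed by the same 3 bits as a PCG."""
    if len(bits) != 5:
        raise ValueError("a CG code has 5 bits")
    channels = CG_CHANNELS[bits[:2]]
    if channels is None:
        # Direct connection: the remaining bits are ignored.
        return {"channels": None, "direct": True}
    config = decode_pcg(bits[2:])
    config.update({"channels": channels, "direct": False})
    return config

print(decode_cg("11100"))  # the CG used in every Block of the Table II example
print(decode_cg("00111"))  # channel code None -> direct connection
```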

III-B4 Pooling-level

Each PL is encoded as the concatenation of three codes: the size of the pooling operator in the PL, the channel number of its PCG (see Fig. 1(b)), and the code of the post-convolution-group itself. The pooling size is defined as a pair giving the downsampling size along the time axis and along the frequency axis. When the pooling size is 1x1, the PL will not appear in its Block and the rest of its code will be ignored. We use 2 bits to encode each of the two entries of the pooling size. As shown in Table I, the four possible values are represented by 00(1), 01(4), 11(16), and 10(64). The channel number is also encoded by 2 bits: 00(16), 01(32), 11(64), and 10(128). The upsampling operator in the PL is not encoded, since it has no freedom but to upsample the extracted features back to the same size as the input of the current PL.
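A PL code can be decoded in the same spirit. The sketch below uses the bit values from Table I; the function name, the returned dictionary, and the assumption that the first 2-bit field is the time-axis size are illustrative. Note that the ordering 00, 01, 11, 10 is a 2-bit Gray code, so consecutive values in that list differ by a single bit.

```python
# Illustrative sketch: decode a Pooling-level code (values from Table I).
# Assumed layout: 2 bits time size, 2 bits frequency size, 2 bits channels, 3 bits PCG.

POOL_SIZE = {"00": 1, "01": 4, "11": 16, "10": 64}
PL_CHANNELS = {"00": 16, "01": 32, "11": 64, "10": 128}

def decode_pl(bits):
    """Return the PL configuration, or None when the PL is switched off (1x1)."""
    if len(bits) != 9:
        raise ValueError("a PL code has 4 + 2 + 3 = 9 bits")
    pool = (POOL_SIZE[bits[0:2]], POOL_SIZE[bits[2:4]])
    if pool == (1, 1):
        return None  # a 1x1 pooling size means the PL does not appear in the Block
    return {"pool": pool, "channels": PL_CHANNELS[bits[4:6]], "pcg_code": bits[6:9]}

print(decode_pl("001111100"))  # first PL of the Table II example: 1x16 pooling
print(decode_pl("000011100"))  # second PL of the Table II example: unused
```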

A simple example of MRP-CNN is shown in Table II, where all five Blocks have two PLs. The pooling size of the second PL is 0000 (1x1), i.e., this PL is not used and the remaining codes of this PL are ignored. This MRP-CNN (or other MRP-CNN architectures) can be used as a seed in E-MRP-CNN.


Using the above encoding method, each possible MRP-CNN structure can be represented by a unique code (i.e., gene). All these genes form a large search space. The proposed E-MRP-CNN uses genetic algorithms to automatically search this space for effective genes, i.e., effective MRP-CNN structures. Here, we propose two types of evolution schemes: the single-objective and the multi-objective E-MRP-CNN scheme.
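The size of this search space can be estimated from the bit widths in Table I. The short calculation below is only a sketch: it assumes that every field is stored in the gene even when it is ignored, and it uses five Blocks with two PLs each, as in the example of Table II. The function name is hypothetical.

```python
# Illustrative arithmetic: gene length implied by the bit widths in Table I.
FULL_BITS = 2 + 10      # Block PCG channel number + skip connections between Blocks
CG_BITS = 2 + 1 + 2     # channel number + skip connection + two activations
PL_BITS = 4 + 2 + 3     # pooling size + channel number + PCG code
PCG_BITS = 1 + 2        # skip connection + two activations

def gene_length(num_blocks, pls_per_block):
    """Total number of bits for a given number of Blocks and PLs per Block."""
    block_bits = CG_BITS + pls_per_block * PL_BITS + PCG_BITS
    return FULL_BITS + num_blocks * block_bits

bits = gene_length(num_blocks=5, pls_per_block=2)
print(bits)        # 142 bits for the Table II layout
print(2 ** bits)   # upper bound on distinct bit strings
```

Since ignored fields do not change the decoded network, the number of distinct architectures is smaller than the number of bit strings; the figure is an upper bound.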

Both the single-objective and multi-objective schemes start with an initial population, which is made of a seed gene (a specific MRP-CNN structure) and other genes (structures) randomly mutated from this seed gene. After initialization, both schemes iteratively generate new offspring genes by applying genetic operations (i.e., crossover and mutation) to randomly selected gene(s) from the current population. The new offspring genes are decoded into the corresponding MRP-CNN structures, which are then trained/tested and assigned fitness values. The fitness values for the single-objective and multi-objective schemes are computed in different ways: the single-objective scheme considers only the separation performance, while the multi-objective scheme considers both separation performance and model complexity. Once the fitness values for all genes are computed, the genes with low fitness in the current generation are removed. This evolution iteration is repeated and finally ends with a generation made of well-performing genes (structures).

IV-A Single-objective E-MRP-CNN

According to the BSS-EVAL toolkit [49], there are usually three metrics to measure the separation performance of MSVS: source-to-distortion ratio (SDR), source-to-interferences ratio (SIR), and sources-to-artifacts ratio (SAR). As a proof of concept, we choose SDR as the fitness function to guide the evolution process of the single-objective scheme, because it is a global performance measure that treats three goals as equally important [49]. (According to [49], the three goals are (i) rejection of the interferences, (ii) absence of forbidden distortions and “burbling” artifacts, and (iii) rejection of the sensor noise.) In particular, since each gene is only partially trained in the evolution process (to accelerate the computation), the global measure SDR would be more suitable than the SIR and SAR.

The single-objective scheme is presented in Algorithm 1, where Rows 1-4 show the initialization process and Rows 5-12 show the evolution process.

IV-A1 Initialization process

  • In the first step (Row 1), we generate the initial population, which includes one seed gene and other genes randomly mutated from this seed. To do this, a random number of bits of the seed gene, up to a maximum flipping number, are flipped to generate a new gene. We repeat this process until the required number of different genes is obtained.

  • In the second step (Row 2), we divide the training dataset into three subsets: a training subset used for training, a testing subset used for computing the fitness, and a validation subset used to decide when to stop the evolution process of the single-objective scheme.

  • In the third step (Rows 3-4), we compute the fitness of each gene in the initial population. Specifically, the MRP-CNN structure decoded from each gene is trained on the training subset for only a few iterations (i.e., partial training). These partially trained structures are tested on the testing subset, and we compute the SDR averaged over all clips of this subset as the fitness of each gene. (This averaged SDR score can be considered an approximation of the separation performance on the full testing dataset in the final evaluation.) The genes with low fitness are removed according to the population limit.
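The first initialization step above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name and parameters are hypothetical, genes are represented as bit strings, and the number of flipped bits is assumed to be drawn uniformly between 1 and the maximum flipping number.

```python
import random

# Illustrative sketch of Row 1: seed gene plus distinct mutated copies.
def init_population(seed_gene, size, max_flips, rng):
    """Return `size` distinct genes: the seed and genes with 1..max_flips bits flipped."""
    population = [seed_gene]
    seen = {seed_gene}
    # Note: this loops until enough distinct genes exist, so the gene must be
    # long enough for `size` distinct mutations to be reachable.
    while len(population) < size:
        k = rng.randint(1, max_flips)                      # assumed uniform choice
        positions = rng.sample(range(len(seed_gene)), k)   # k distinct bit positions
        bits = list(seed_gene)
        for p in positions:
            bits[p] = "1" if bits[p] == "0" else "0"
        gene = "".join(bits)
        if gene not in seen:                               # keep only new genes
            seen.add(gene)
            population.append(gene)
    return population

population = init_population("0" * 16, size=8, max_flips=3, rng=random.Random(0))
print(len(population), population[0])
```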

IV-A2 Evolution process

  • In each iteration of evolution, we use crossover (Row 6) and mutation (Row 7) operators to generate new offspring genes. The crossover operator recombines the information of two randomly selected genes, where one gene is used as the baseline and each bit within it has a given probability (prob.) of being exchanged with the corresponding bit of the other gene. We apply crossover repeatedly to create a set of new offspring. The mutation operator produces a new offspring by randomly flipping each bit of one gene with a given prob. We apply the mutation operator to each gene of the current generation and to the newly obtained offspring (generated by the crossover) to create further new offspring.

  • The SDR fitness values of all new offspring are computed (Row 8). The whole population, including the new offspring and the current genes, is sorted by fitness (Row 9), and the low-fitness genes are removed according to the population limit (Row 10).

  • We check whether the stopping criterion is satisfied using the validation subset (Row 11). Specifically, we test the best-fitness gene of the current generation on the validation subset to compute its SDR, which can be considered the best SDR performance of the current generation. This SDR is then compared with the SDRs of several recent generations; if this value has not improved for a few generations, the evolution iteration is stopped and the earliest generation with no SDR improvement is given as the output.
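The genetic operators and the truncation step described in this list can be sketched on bit-string genes as below. The function names and the probability arguments are hypothetical, and the fitness in the usage lines is a placeholder (number of 1 bits), not the SDR of a trained network.

```python
import random

# Illustrative sketch of the operators in Rows 6, 7, 9 and 10.
def crossover(base, other, p_cross, rng):
    """Each bit of the baseline gene is replaced by the other gene's bit with prob. p_cross."""
    return "".join(o if rng.random() < p_cross else b for b, o in zip(base, other))

def mutate(gene, p_mut, rng):
    """Flip each bit independently with prob. p_mut."""
    return "".join(("1" if g == "0" else "0") if rng.random() < p_mut else g
                   for g in gene)

def select(population, fitness, limit):
    """Sort distinct genes by fitness (best first) and keep at most `limit` of them."""
    return sorted(set(population), key=fitness, reverse=True)[:limit]

rng = random.Random(0)
child = crossover("000000", "111111", p_cross=0.5, rng=rng)
mutant = mutate(child, p_mut=0.1, rng=rng)
survivors = select(["000", "011", "111", "011"], lambda g: g.count("1"), limit=2)
print(child, mutant, survivors)
```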

1: Generate the initial population
2: Data preparation: split the training set into subsets
3: Compute SDR fitness of each gene in the initial population
4: Remove low-fitness genes according to the population limit
5: for each generation, up to the maximum generation, do
6:     Generate new genes by crossover with the crossover prob.
7:     Generate new genes by mutation with the mutation prob.
8:     Compute SDR fitness for all new offspring
9:     Sort all genes (current+new) by SDR fitness
10:     Remove low-fitness genes by the population limit
11:     break, if stopping criterion is satisfied
12: end for
Algorithm 1 Single-objective E-MRP-CNN
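The control flow of Algorithm 1 can be exercised end to end with a toy fitness in place of network training. The sketch below is an assumption-laden illustration: all names and default values are hypothetical, the initial population is produced by per-bit flipping for brevity, the number of offspring per generation is an arbitrary choice, and a single toy fitness stands in for both the testing-subset SDR and the validation-subset SDR.

```python
import random

def flip_bits(gene, p, rng):
    return "".join(("1" if b == "0" else "0") if rng.random() < p else b for b in gene)

def mix(base, other, p, rng):
    return "".join(o if rng.random() < p else b for b, o in zip(base, other))

def evolve(seed, fitness, pop_limit, max_gen, patience, rng, p_cross=0.5, p_mut=0.1):
    """Toy version of Algorithm 1; `fitness` replaces partial training + SDR scoring."""
    # Rows 1-4 (simplified): seed plus mutated copies, sorted and truncated.
    population = [seed] + [flip_bits(seed, p_mut, rng) for _ in range(pop_limit - 1)]
    population = sorted(population, key=fitness, reverse=True)[:pop_limit]
    best_history = [fitness(population[0])]
    for _ in range(max_gen):                                            # Row 5
        children = [mix(*rng.sample(population, 2), p_cross, rng)
                    for _ in range(pop_limit)]                          # Row 6
        children += [flip_bits(g, p_mut, rng) for g in population + children]  # Row 7
        population = sorted(population + children,
                            key=fitness, reverse=True)[:pop_limit]      # Rows 8-10
        best_history.append(fitness(population[0]))
        # Row 11: stop when the best fitness has not improved for `patience` generations.
        if len(best_history) > patience and best_history[-1] <= best_history[-1 - patience]:
            break
    return population[0], best_history

best, history = evolve("0" * 12, lambda g: g.count("1"),
                       pop_limit=6, max_gen=40, patience=5, rng=random.Random(1))
print(best, history)
```

Because parents compete with their offspring in the sort-and-truncate step, the best fitness never decreases from one generation to the next in this sketch.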

IV-B Multi-objective E-MRP-CNN

The single-objective scheme evolves only to improve the separation performance. Thus it may favor more complex neural structures that provide better separation performance. Since the model complexity is an important factor for limited-memory applications [8], the multi-objective scheme is designed to balance two objectives, i.e., model complexity and separation performance. In fact, these two objectives are conflicting: a complicated structure is more likely to provide a higher performance than a simple one. Thus the multi-objective scheme tries to approximate the Pareto front, which includes many solutions, each providing a good separation performance under a fixed model complexity.

There are generally two properties to consider when designing evolutionary multi-objective optimization algorithms: convergence and diversity [19]. Convergence measures the distance of the solutions from the Pareto front (i.e., the Pareto-optimal front), which should be as small as possible [19]. Diversity is the spread of solutions along the Pareto front, which should be as uniform as possible [19]. For MSVS, convergence encourages each evolved structure to offer a separation performance as good as possible under a certain complexity; diversity encourages the evolved structures to be varied enough to handle different complexity levels. To achieve these, the proposed multi-objective scheme is implemented based on NSGA-II [1], where the fast non-dominated sorting is used to promote convergence and the crowded-comparison operator is employed to address diversity [1].
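The two NSGA-II ingredients mentioned above can be sketched for the (SDR, Params) pair as follows. This is an illustration under stated assumptions: the function names and data are hypothetical, and the front peeling below is a simple quadratic version that yields the same non-dominated levels as the fast non-dominated sorting of NSGA-II without its bookkeeping.

```python
# Illustrative sketch: non-dominated levels and crowding distance for
# points of the form (SDR, Params); higher SDR and fewer Params are better.

def dominates(a, b):
    return a[0] >= b[0] and a[1] <= b[1] and (a[0] > b[0] or a[1] < b[1])

def non_dominated_levels(points):
    """Return a list of fronts (lists of indices), best front first."""
    remaining = set(range(len(points)))
    fronts = []
    while remaining:
        front = sorted(i for i in remaining
                       if not any(dominates(points[j], points[i])
                                  for j in remaining if j != i))
        fronts.append(front)
        remaining -= set(front)
    return fronts

def crowding_distance(points, front):
    """Larger distance means a less crowded region; boundary points get infinity."""
    dist = {i: 0.0 for i in front}
    for m in range(2):  # objective index: 0 = SDR, 1 = Params
        order = sorted(front, key=lambda i: points[i][m])
        lo, hi = points[order[0]][m], points[order[-1]][m]
        dist[order[0]] = dist[order[-1]] = float("inf")
        if hi == lo:
            continue
        for k in range(1, len(order) - 1):
            dist[order[k]] += (points[order[k + 1]][m] - points[order[k - 1]][m]) / (hi - lo)
    return dist

# Hypothetical (SDR in dB, parameters in millions) values.
points = [(9.0, 5.0), (9.5, 8.0), (8.0, 6.0), (7.0, 1.0), (6.0, 7.0)]
fronts = non_dominated_levels(points)
distances = crowding_distance(points, fronts[0])
print(fronts, distances)
```

When the population limit cuts through a front, genes within that front would be ranked by decreasing crowding distance, which keeps the surviving solutions spread along the front.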

The multi-objective scheme is presented in Algorithm 2, where Rows 1-4 show the initialization process and Rows 5-11 show the evolution iteration.

The first two steps in the initialization process (Rows 1-2) are the same as those in the single-objective scheme (note that the validation subset is not used here). In the third step, we compute the fitness of each gene in the initial population, but instead of considering the SDR as the only fitness, we calculate both the SDR score and the model complexity (measured by the number of parameters (Params)) of each gene. Then we use the fast non-dominated sorting of NSGA-II [1] to calculate the non-dominated levels of all genes. After sorting the genes within these levels with the crowded-comparison operator, low-fitness genes are removed according to the population limit (Row 4).

In each iteration of the evolution, we use crossover (Row 6) and mutation (Row 7) to generate new offspring. The SDR and model complexity of all new offspring are computed. Both the current population and the new offspring are sorted by the fast non-dominated sorting and the crowded-comparison operator of NSGA-II. We remove low-fitness genes according to the population limit. The multi-objective scheme stops when the maximum iteration number is reached.

1:Generate the initial population of size
2:Data preparation: training set
3:Compute SDR and model complexity and then perform fast non-dominated sorting and crowded-comparison
4:Remove low-fitness genes according to the population limit
5:for  to (maximum generation) do
6:     Generate new genes by crossover with prob.
7:     Generate new genes by mutation with prob.
8:     Compute SDR and Params for all new offspring
9:     Sort all genes (current+new) using fast non-dominated sorting and crowded-comparison
10:     Remove low-fitness genes by the population limit
11:end for
Algorithm 2 Multi-objective E-MRP-CNN
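
The survival step of Algorithm 2 (Rows 9-10) can be illustrated with a minimal sketch. It assumes each gene is scored by a pair (SDR, Params), with SDR maximized and Params minimized. The function names are hypothetical, and the front extraction below is a simple quadratic-per-front loop rather than the bookkeeping-based fast non-dominated sorting of NSGA-II [1]; it produces the same fronts on small populations.

```python
from typing import Dict, List, Tuple

# Hypothetical score layout: (sdr, params). SDR is maximized, Params minimized.
Score = Tuple[float, float]


def dominates(a: Score, b: Score) -> bool:
    """True if a is no worse than b in both objectives and better in at least one."""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    better = a[0] > b[0] or a[1] < b[1]
    return no_worse and better


def non_dominated_sort(scores: List[Score]) -> List[List[int]]:
    """Group gene indices into fronts; front 0 is the non-dominated set."""
    remaining = set(range(len(scores)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(scores[j], scores[i])
                            for j in remaining if j != i)]
        fronts.append(sorted(front))
        remaining -= set(front)
    return fronts


def crowding_distance(scores: List[Score], front: List[int]) -> Dict[int, float]:
    """Crowding distance of each gene in one front (boundary genes get infinity)."""
    dist = {i: 0.0 for i in front}
    for k in range(2):  # one pass per objective
        ordered = sorted(front, key=lambda i: scores[i][k])
        lo, hi = scores[ordered[0]][k], scores[ordered[-1]][k]
        dist[ordered[0]] = dist[ordered[-1]] = float("inf")
        if hi == lo:
            continue
        for prev, cur, nxt in zip(ordered, ordered[1:], ordered[2:]):
            dist[cur] += (scores[nxt][k] - scores[prev][k]) / (hi - lo)
    return dist


def select_survivors(scores: List[Score], limit: int) -> List[int]:
    """Keep at most `limit` genes: whole fronts first, then the least crowded genes."""
    survivors: List[int] = []
    for front in non_dominated_sort(scores):
        if len(survivors) + len(front) <= limit:
            survivors.extend(front)
            continue
        dist = crowding_distance(scores, front)
        ranked = sorted(front, key=lambda i: dist[i], reverse=True)
        survivors.extend(ranked[:limit - len(survivors)])
        break
    return survivors
```

For example, with scores `[(9.0, 2.0), (8.0, 1.0), (7.0, 3.0), (9.5, 4.0)]`, the third gene is dominated by the first and falls into the second front, so it is the first to be removed when the population limit is 3.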

V Experiment setting

V-A Datasets and evaluation metrics

The proposed method was evaluated on two popular datasets: MIR-1K [9] and DSD100 [30]. The MIR-1K dataset contains a thousand song clips extracted from 110 karaoke songs. For a fair comparison, we followed the evaluation conditions in [12, 50, 32, 51]: clips performed by one male singer ‘abjones’ and one female singer ‘amy’ were used for training, and the clips performed by the other singers were used for testing. On DSD100, songs of the "Dev" subset were used for training, and we followed [32, 51] to convert all sources to monophonic and then summed the three non-vocal sources to form the musical accompaniment (i.e., the Acc) source.

The separation performance was quantitatively measured by the BSS-EVAL toolkit [49] with respect to three criteria: SDR, SIR, and SAR. Normalized SDR (NSDR) [31] was calculated to show the improvement of SDR compared to the original mixture. Global NSDR (GNSDR), Global SIR (GSIR), and Global SAR (GSAR) were computed by taking the weighted means of the NSDRs, SIRs, SARs, respectively, over all the test clips weighted by their length [12, 32]. Some qualitative results were also presented to verify the separation performance of the proposed method.
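
As a minimal sketch of the length-weighted averaging described above, the following assumes that per-clip SDR values have already been obtained (for example from BSS-EVAL) and that clip lengths are given in any consistent unit. The function names are illustrative only.

```python
from typing import Sequence


def nsdr(sdr_estimate: float, sdr_mixture: float) -> float:
    """SDR improvement of the estimate over using the raw mixture as the estimate."""
    return sdr_estimate - sdr_mixture


def global_metric(values: Sequence[float], lengths: Sequence[float]) -> float:
    """Length-weighted mean of a per-clip metric (NSDR, SIR, or SAR)."""
    total = sum(lengths)
    return sum(v * w for v, w in zip(values, lengths)) / total
```

Passing per-clip NSDRs gives the GNSDR, and passing per-clip SIRs or SARs gives the GSIR or GSAR, so longer clips contribute proportionally more to the global score.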

V-B T-F masking framework

The proposed MRP-CNN and E-MRP-CNN were evaluated based on the T-F masking framework in Fig. 2, where the red rectangle is the key separation module (which can be the proposed structure or any of the compared structures). The output of the separation module is fed to the convolution layer (blue rectangle), which has two outputs for estimating the T-F masks of the Vocal and Acc sources in MSVS. This framework (or similar frameworks) is widely employed in MSVS/MSS methods (see [32], [28, 27, 22]).

Footnote 4: Although it is advantageous to use an independent separation module for each source, i.e., two separation modules for two sources, this is computationally expensive according to [32]. Hence, following [32], we use only one separation module.

The above framework was used in both the evolution process (denoted by Evo) and the final evaluation (denoted by Eva). For each situation, we have two scenarios: training (Tra) and testing (Tes). For Evo, we trained the evolved structures in the T-F masking framework using (Evo&Tra) and then tested the trained structures on (Evo&Tes) to obtain the SDR fitness. For Eva, the final evolved structures were trained in the T-F masking framework using the full training set (Eva&Tra) and then tested on the full testing set (Eva&Tes).

Fig. 2: The T-F masking framework.

When using the T-F masking framework, the input mixture signal (time-domain) was first downsampled to kHz to speed up computation [32]. The kHz mixture signal was transformed to its spectrogram (a complex matrix) via short-time Fourier transform (STFT) using a window size of and a hop size of . The magnitude spectrogram of the mixture was normalized by dividing by its maximum value and then split into blocks of size (frequency × frames) to form batches. The batches of the mixture were fed to the separation module, and its output was fed to the convolution layer to predict the masks (in batches) for the Vocal and Acc sources. The predicted masks were used in (i) the training process (Evo&Tra and Eva&Tra) to compute the loss function and (ii) the testing process (Evo&Tes and Eva&Tes) to output the time-domain estimated sources.
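
The normalization and block-splitting step described above can be sketched as follows. This is only an illustration: the magnitude spectrogram is represented as a list of frequency rows, the block width `block_frames` is a hypothetical parameter, and dropping trailing frames that do not fill a block is an assumption (padding would be an equally plausible choice).

```python
from typing import List, Tuple

Matrix = List[List[float]]  # rows are frequency bins, columns are time frames


def normalize_and_split(magnitude: Matrix, block_frames: int) -> Tuple[List[Matrix], float]:
    """Divide by the global maximum, then cut non-overlapping blocks along time."""
    peak = max(max(row) for row in magnitude)  # assumed to be positive
    normalized = [[v / peak for v in row] for row in magnitude]
    n_frames = len(magnitude[0])
    blocks = []
    # Assumption: frames beyond the last full block are discarded.
    for start in range(0, n_frames - block_frames + 1, block_frames):
        blocks.append([row[start:start + block_frames] for row in normalized])
    return blocks, peak
```

The returned `peak` is kept so that the estimated sources can be de-normalized after masking.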

In the training process (Evo&Tra and Eva&Tra), the loss function norm in [35, 32] was adopted for a fair comparison. Formally, given the mixture , the -th ground truth source , and the predicted mask for the -th source (, in MSVS), the loss function is defined as where denotes the element-wise multiplication of matrices. Note that when computing the loss function, the magnitude spectrograms of the ground-truth Vocal and Acc sources were also normalized by dividing by the maximum value of their mixture’s magnitude spectrogram.
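
A minimal sketch of a mask-based loss of this kind is given below. The norm and the symbols are not legible in this text, so the sketch assumes an entrywise L1 norm (the sum of absolute differences) between each masked mixture and its ground-truth source, summed over sources. All names are hypothetical, and the inputs are assumed to be already-normalized magnitude spectrograms of equal shape.

```python
from typing import List

Matrix = List[List[float]]


def mask_loss(mixture: Matrix, sources: List[Matrix], masks: List[Matrix]) -> float:
    """Sum over sources of |mask * mixture - source|, taken element by element.

    Assumption: an entrywise L1 norm; the norm used in the paper may differ.
    """
    loss = 0.0
    for source, mask in zip(sources, masks):
        for x_row, y_row, m_row in zip(mixture, source, mask):
            loss += sum(abs(m * x - y) for x, y, m in zip(x_row, y_row, m_row))
    return loss
```

In MSVS the `sources` list would hold two matrices, one for the Vocal source and one for the Acc source.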

In the testing process (Evo&Tes and Eva&Tes), the predicted masks for Vocal and Acc were truncated to the range of and multiplied with the normalized spectrogram of the mixture [32]. After de-normalization and batch combination, the time-domain sources were obtained via inverse STFT (ISTFT) followed by upsampling.

In particular, for Eva&Tra, two data augmentation operations, gain and sliding, were applied to the original time-domain ground-truth sources to create new mixtures. The gain operation multiplied the original source by a random factor () and the sliding operation added a random short delay (ss) to the beginning of the original source. The newly obtained ground-truth sources were mixed to form new mixtures. The ratio of the augmented data to the original data is . All the differences of using the T-F masking framework for the four scenarios are summarized in Table III.
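
The gain and sliding operations can be sketched as below. The gain range and the delay range are not legible in this text, so `max_gain_change` and `max_delay_samples` are hypothetical parameters. Drawing an independent gain and delay for each source, and truncating the delayed signal to the original length, are also assumptions.

```python
import random
from typing import List, Tuple

Signal = List[float]


def augment_pair(vocal: Signal, acc: Signal, max_gain_change: float,
                 max_delay_samples: int, rng: random.Random
                 ) -> Tuple[Signal, Signal, Signal]:
    """Apply a random gain and a random short delay to each source, then remix."""

    def gain_and_slide(source: Signal) -> Signal:
        # Hypothetical gain range centered on 1.0.
        gain = rng.uniform(1.0 - max_gain_change, 1.0 + max_gain_change)
        delay = rng.randint(0, max_delay_samples)
        shifted = [0.0] * delay + [gain * s for s in source]
        return shifted[:len(source)]  # keep the original length

    new_vocal = gain_and_slide(vocal)
    new_acc = gain_and_slide(acc)
    mixture = [v + a for v, a in zip(new_vocal, new_acc)]
    return new_vocal, new_acc, mixture
```

Because the new mixture is formed from the augmented sources, the ground truth and the mixture stay consistent with each other.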

Scenarios Evo&Tra Evo&Tes Eva&Tra Eva&Tes

Data augmentation
Training dataset
Testing dataset
Subset of
Subset of
Subset of (Single)
TABLE III: Differences between the scenarios of using the T-F masking framework.

V-C Hyperparameters of the E-MRP-CNN

Table IV lists the hyperparameters of the E-MRP-CNN. Since the multi-objective scheme requires more diversity, its population limit and mutation number were higher than those of the single-objective scheme. For MIR-1K, the , and were set as (clips). For DSD100, the , and were set as (songs). For the multi-objective scheme, and were not used.

Single 22 20 100 15 10 25 0.5 0.02 100/30 55/15 20/5 8
Multi. 37 20 100 25 10 35 0.5 0.02 100/30 55/15
TABLE IV: Hyperparameters of the E-MRP-CNN.
(a) Multi-objective scheme
(b) Single-objective scheme
Fig. 3: The evolution processes of the single-objective and multi-objective E-MRP-CNN on MIR-1K.
(a) Multi-objective scheme
(b) Single-objective scheme
Fig. 4: The evolution processes of the single-objective and multi-objective E-MRP-CNN on DSD100.

V-D Training parameters

The Adam optimizer [16] was employed to train the T-F masking framework. In Evo&Tra, we aim to compute the ‘fitness’ of each gene, so the T-F masking framework was only partially trained with iterations for the MIR-1K dataset and iterations for the DSD100 dataset using batch size . In Eva&Tra, the framework was fully trained with iterations for the MIR-1K dataset and iterations for the DSD100 dataset using batch size .

In both Evo&Tra and Eva&Tra, two tricks were used: (i) cosine decay learning rate and warm restart [23] and (ii) learning rate warmup [4]. For (i), we set and for both datasets in Evo&Tra, and () for MIR-1K (DSD100) and in Eva&Tra, where is the length of first decay period [23] and is the multiplication factor for decay period length at every new warm restart [23]. The maximum learning rate for Evo&Tra and Eva&Tra was . The minimum learning rates for Evo&Tra and Eva&Tra were and , respectively (more details can be found in [23]). For (ii), we scaled the learning rate in the first () iterations for Evo&Tra (Eva&Tra) with a factor , to avoid the maximum learning rate being too large for some genes.
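
The combination of cosine decay with warm restarts [23] and a constant-factor warmup can be sketched as follows. The numeric settings are not legible in this text, so every argument is a placeholder. The parameter names `first_period` and `period_mult` stand for the length of the first decay period and the multiplication factor applied to the period length at each restart.

```python
import math


def learning_rate(step: int, lr_max: float, lr_min: float, first_period: int,
                  period_mult: int, warmup_steps: int, warmup_factor: float) -> float:
    """Cosine decay with warm restarts, scaled down during the first warmup steps."""
    period, t = first_period, step
    # Find the position t inside the current decay period.
    while t >= period:
        t -= period
        period *= period_mult
    lr = lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t / period))
    if step < warmup_steps:
        lr *= warmup_factor  # assumption: a constant scaling factor during warmup
    return lr
```

With `first_period=10` and `period_mult=2`, the rate restarts at `lr_max` at steps 0, 10, and 30, and each decay period is twice as long as the previous one.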

VI Experimental results

VI-A Evolution process of the E-MRP-CNN

For both single-objective and multi-objective schemes, the MRP-CNN structure in Table III-B2 was used as the seed of the initial population of E-MRP-CNN on two datasets. The evolved genes (structures) of E-MRP-CNN are represented in the form “S/M-G-Index-Dataset”, where S and M denote the single-objective scheme and multi-objective scheme, respectively, G represents the generation (evolution) number, Index is the gene index in the G-th generation, and Dataset can be MIR (MIR-1K) or DSD (DSD100). For the single-objective scheme, the Index is the SDR ranking of a gene in the current generation; for the multi-objective scheme, the Index is the gene index in the current generation. For example, “S-25-2-MIR” represents the structure with the second highest SDR performance in the 25th generation of the single-objective scheme on the MIR-1K dataset, and “M-99-2-DSD” represents the No. 2 evolved structure in the 99th generation of the multi-objective scheme on the DSD100 dataset.
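
The naming convention above can be made concrete with a small parser. This is an illustrative helper, not part of the original method; the type and function names are hypothetical.

```python
from typing import NamedTuple


class GeneName(NamedTuple):
    scheme: str      # "S" (single-objective) or "M" (multi-objective)
    generation: int  # G, the generation number
    index: int       # SDR rank (S) or gene index (M) within the generation
    dataset: str     # "MIR" (MIR-1K) or "DSD" (DSD100)


def parse_gene_name(name: str) -> GeneName:
    """Split a name of the form S/M-G-Index-Dataset into its four fields."""
    scheme, generation, index, dataset = name.split("-")
    if scheme not in ("S", "M") or dataset not in ("MIR", "DSD"):
        raise ValueError(f"unexpected gene name: {name!r}")
    return GeneName(scheme, int(generation), int(index), dataset)
```

For instance, `parse_gene_name("S-25-2-MIR")` yields scheme `"S"`, generation 25, index 2, and dataset `"MIR"`.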

We recorded the dynamic evolution process of E-MRP-CNN in Fig. 3 for the MIR-1K dataset and Fig. 4 for the DSD100 dataset. The vertical axis in each figure represents the model complexity measured by Params and the horizontal axis represents the fitness score measured by SDR. Each colored data point stands for a gene, i.e., an MRP-CNN structure. The genes of different generations are distinguished by colors changing from red (initial generation) to pink (highest generation). We set the highest evolution number as . In our experiments, the single-objective scheme stopped evolving at the th generation on the MIR-1K dataset and the rd generation on the DSD100 dataset, when the SDR of the best gene had not improved for consecutive generations. For the multi-objective scheme, we can observe the evolution process of all generations (i.e., G) on both MIR-1K and DSD100 datasets. By comparing the two subfigures in Fig. 3 and in Fig. 4, we can find that the single-objective scheme and the multi-objective scheme had different evolution trends.
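
The stopping rule of the single-objective scheme can be sketched as a patience counter. The number of consecutive generations is not legible in this text, so `patience` is a placeholder. Generations are counted from zero here, which is an assumption.

```python
from typing import Optional, Sequence


def stopping_generation(best_sdr_per_generation: Sequence[float],
                        patience: int) -> Optional[int]:
    """Return the generation at which evolution stops, or None if it never does.

    Evolution stops once the best SDR has not improved for `patience`
    consecutive generations.
    """
    best = float("-inf")
    stale = 0
    for generation, sdr in enumerate(best_sdr_per_generation):
        if sdr > best:
            best, stale = sdr, 0
        else:
            stale += 1
            if stale >= patience:
                return generation
    return None
```

With best-gene SDRs of `[8.0, 8.5, 8.5, 8.4, 8.5]` and `patience=3`, the last improvement happens at generation 1 and the rule fires at generation 4.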

As shown in Fig. 3(a) and Fig. 4(a), the multi-objective scheme pushed the genes toward the Pareto front generation by generation during the evolution process. In each generation, a set of genes with different model complexities and SDR fitnesses were obtained. More specifically, we can see that the seed gene (represented by the black inverted triangle) had a relatively high model complexity (Params M) and a low SDR score ( dB for MIR-1K and dB for DSD100). As the evolution proceeded, the new generations gradually moved toward the Pareto optimal front. For example, the first generations in Fig. 3(a) and Fig. 4(a) (red and yellow points) spread widely, the generations from to (yellow and green points) started to move to the lower-right boundary, and the higher generations, e.g., to generations (blue and pink points), converged to the Pareto optimal front approximately. These results suggested that we could obtain better genes (in model complexity, in SDR performance, or in both) as the evolution proceeded. Finally, a set of structures with better overall performance in model complexity and/or SDR performance were obtained, which can deal with different complexity requirements.

Unlike in the multi-objective scheme, the model complexity of the genes in the single-objective scheme was not reduced during the evolution process, as shown in Fig. 3(b) and Fig. 4(b), because the model complexity was not considered in the single-objective scheme. In particular, we can see from Fig. 3(b) and Fig. 4(b) that the single-objective scheme, without the constraint of model complexity, could steadily improve the SDR performance generation by generation. In the multi-objective scheme, by contrast, the genes of one generation (Fig. 3(a) and Fig. 4(a)) differed considerably in SDR performance (so that they can deal with different complexity levels). In addition, by comparing Fig. 3(a) with Fig. 3(b), and Fig. 4(a) with Fig. 4(b), we can find that the single-objective scheme could achieve a similar SDR performance to the multi-objective scheme with far fewer generations. For example, in Fig. 3(b), the single-objective scheme reached an SDR dB using only G generations, while this required at least generations in the multi-objective scheme. Nevertheless, we can observe that the multi-objective scheme could achieve a lower model complexity at SDR dB compared with the single-objective scheme. It is also found that the single-objective scheme behaved differently on the two datasets. On the MIR-1K dataset (see Fig. 3(b)), the model complexity increased significantly at high SDR scores, while this phenomenon was not apparent on the DSD100 dataset (see Fig. 4(b)).

We also labelled some representative genes in Fig. 3 and Fig. 4 (see the legend in each subfigure). For the multi-objective scheme in Fig. 3(a) and Fig. 4(a), the seed gene, genes of early generations (G and G), and genes of the final generation (G) are plotted. It is clear that better genes (in model complexity, in SDR performance, or in both) can be obtained during the evolution process. For the single-objective scheme, we intentionally continued the evolution process for a few more generations. Typical genes including the seed gene, genes of early generations (G, G for MIR-1K and G, G, G for DSD100), genes of the final generation (G for MIR-1K and G for DSD100), and genes after the final generation (G for MIR-1K and G for DSD100) are plotted in Fig. 3(b) and Fig. 4(b). It is found from Fig. 3(b) that the gene in a later generation, i.e., S-29-1-MIR, provided higher SDR performance than the best gene obtained in the evolution process, i.e., S-16-1-MIR, on the testing subset . For DSD100 in Fig. 4(b), the gene after the final generation, i.e., S-49-1-DSD, provided similar SDR performance to the final evolved genes, i.e., S-31-1-DSD and S-31-2-DSD. The performance of all these evolved genes was evaluated and compared using the full training and testing datasets, as shown in the next section.

VI-B Final evaluations

In this section, we compare the evolved structures and other SOTA MSVS methods using the full MIR-1K and DSD100 datasets. In accordance with previous methods [36, 15, 29, 47, 46], on the DSD100 dataset, we computed SDRs/SIRs/SARs of all songs and then computed their median values.

VI-B1 Quantitative evaluations

The evolved structures in Fig. 3 and Fig. 4 were first compared with some typical MR-CNN based MSVS methods on the T-F masking framework: MR-FCNN [6], SHN [32], and SA-SHN [51]. The performances of all structures were evaluated with respect to computational efficiency and separation performance. The results on computational efficiency are listed in Table V and the results on separation performance are listed in Table VI (for MIR-1K dataset) and Table VII (for DSD100 dataset). In these Tables, we use SHN- and SA-SHN- to represent the -layer SHN and -layer SA-SHN, respectively.

Method Params FLOPs Training Speed Inferring Speed
[M] [G] [bat./s] [bat./s]
Seed 2.33 129.72 31.61 93.09
M-25-5-MIR 1.21 39.33 57.19 194.54
M-50-2-MIR 0.37 18.27 97.50 349.71
M-50-5-MIR 1.04 52.30 53.13 171.80
M-99-2-MIR 0.30 11.64 121.37 445.10
M-99-4-MIR 1.24 48.69 50.76 171.02
M-99-5-MIR 2.42 130.66 31.58 94.34
M-99-8-MIR 8.27 454.28 13.13 35.78
M-25-4-DSD 0.77 26.91 71.30 249.70
M-50-8-DSD 2.73 139.41 29.21 87.59
M-99-2-DSD 0.15 7.91 175.22 621.22
M-99-4-DSD 0.38 15.64 108.06 408.32
M-99-6-DSD 0.59 21.63 87.76 311.76
M-99-7-DSD 3.18 151.41 27.83 82.70
S-1-1-MIR 2.47 135.31 30.40 89.67
S-8-1-MIR 3.91 193.09 23.09 67.28
S-16-1-MIR 6.76 404.45 13.89 38.72
S-29-1-MIR 6.67 400.50 13.90 38.79
S-1-1-DSD 2.55 135.85 30.16 89.48
S-8-1-DSD 2.80 138.79 29.87 88.55
S-16-1-DSD 2.53 136.47 30.45 91.05
S-31-1-DSD 2.15 116.94 33.62 102.16
S-31-2-DSD 2.84 144.63 29.13 85.66
S-49-1-DSD 2.48 125.90 32.72 97.97
MR-FCNN 0.56 36.56 9.03 18.59
SHN-1 9.06 168.29 29.94 87.70
SHN-2 17.46 292.87 16.70 49.19
SHN-4 34.18 537.66 8.84 26.09
SA-SHN-1 9.85 197.29 14.41 40.08
SA-SHN-2 19.03 350.87 7.56 20.95
SA-SHN-4 37.33 653.67 3.87 10.70
TABLE V: Computational efficiency of the proposed method (Seed, M-, and S-) and the existing methods (MR-FCNN, SHN-, and SA-SHN-).

Computational efficiency: The computational efficiency in Table V was both calculated in theory and measured in a real hardware/software environment. The theoretical efficiency is given by Params and FLOPs, where Params denotes the number of trainable parameters of each structure and FLOPs represents the number of floating-point operations for testing (inferring) on one batch. In practice, two structures with similar Params and FLOPs may have different computation speeds, so the computational efficiency was also measured in a real hardware/software environment. The real computational efficiency in training and inferring is given in bat./s, that is, the number of batches per second.

Footnote 5: The GPU is an RTX 2080Ti, the CPU is an Intel Core i9 9900K, and the memory is 4×16G DDR4 (3200 MHz). On the Linux operating system, we use TensorFlow 2.0 with CUDA 10.1 and cuDNN 7.6.
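
To illustrate how the two theoretical measures are typically obtained, the sketch below counts the Params and FLOPs of a single 2-D convolution layer. Counting each multiply-accumulate as two floating-point operations is an assumption; the counting convention behind Table V is not stated in this text, and the function names are hypothetical.

```python
def conv2d_params(kernel_h: int, kernel_w: int, in_channels: int,
                  out_channels: int, bias: bool = True) -> int:
    """Number of trainable parameters of one 2-D convolution layer."""
    weights = kernel_h * kernel_w * in_channels * out_channels
    return weights + (out_channels if bias else 0)


def conv2d_flops(kernel_h: int, kernel_w: int, in_channels: int,
                 out_channels: int, out_h: int, out_w: int) -> int:
    """FLOPs of one 2-D convolution for a single input, ignoring the bias.

    Assumption: one multiply-accumulate counts as two operations.
    """
    macs = kernel_h * kernel_w * in_channels * out_channels * out_h * out_w
    return 2 * macs
```

Params depends only on the kernel and channel sizes, whereas FLOPs also scales with the output resolution, which is one reason two structures with similar Params can differ in measured speed.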

According to Table V, the multi-objective scheme provided multiple structures with varying model complexities in one generation, e.g., M-99-2/4/5/8-MIR. In particular, most evolved structures in the multi-objective scheme had a lower model complexity than the seed on both datasets. For single-objective scheme, the model complexity of the evolved structures on the MIR-1K was increased generation by generation and most structures had a slightly higher model complexity than the seed. On DSD100, however, the increase in the model complexity was not apparent during the evolution process.

Method Acc (GNSDR GSIR GSAR) Vocal (GNSDR GSIR GSAR) Mean (GNSDR GSIR GSAR)
Seed 10.23 14.16 13.08 11.26 17.29 12.94 10.74 15.72 13.01
M-25-5-MIR 10.03 13.24 13.56 11.80 18.95 13.11 10.92 16.10 13.33
M-50-2-MIR 10.20 14.00 13.25 11.41 17.56 13.05 10.80 15.78 13.15
M-50-5-MIR 10.41 14.85 12.97 11.42 17.88 12.94 10.91 16.37 12.96
M-99-2-MIR 10.13 14.04 13.09 11.26 17.34 12.92 10.69 15.69 13.00
M-99-4-MIR 10.04 13.67 13.25 11.42 17.80 12.99 10.73 15.74 13.12
M-99-5-MIR 10.25 13.94 13.38 11.54 17.39 13.28 10.90 15.66 13.33
M-99-8-MIR 10.31 13.68 13.68 11.89 17.89 13.55 11.10 15.78 13.62
S-1-1-MIR 9.84 12.71 13.71 11.69 18.03 13.22 10.76 15.37 13.47
S-8-1-MIR 10.16 13.27 13.77 11.85 18.02 13.45 11.00 15.64 13.61
S-16-1-MIR 10.55 14.18 13.65 11.89 17.80 13.60 11.22 15.99 13.63
S-29-1-MIR 10.51 14.20 13.58 11.83 17.79 13.51 11.17 16.00 13.54
MR-FCNN 8.65 11.65 12.35 9.66 15.72 11.40 9.16 13.68 11.87
SHN-1 9.85 13.66 12.85 10.88 16.63 12.71 10.36 15.15 12.78
SHN-2 9.94 13.67 12.96 11.10 17.13 12.82 10.52 15.40 12.89
SHN-4 9.97 13.65 13.08 11.13 17.09 12.89 10.55 15.37 12.98
SA-SHN-1 10.12 13.78 13.25 11.32 17.15 13.10 10.72 15.47 13.18
SA-SHN-2 10.34 13.99 13.46 11.71 17.58 13.44 11.02 15.79 13.45
SA-SHN-4 10.53 14.54 13.38 11.75 17.87 13.40 11.14 16.21 13.39
TABLE VI: The separation performance on MIR-1K (in dB) of the proposed method (Seed, M-, and S-) and the existing SOTA methods (MR-FCNN, SHN-, and SA-SHN-).

The theoretical model complexity of MR-FCNN was lower than those of the seed and some of the evolved structures (see Params and FLOPs). However, in real environment, its computation speed was much slower than the seed and the evolved structures, e.g., MR-FCNN vs. S-8-1-MIR, MR-FCNN vs. M-50-8-DSD. In particular, we can also find that some evolved structures, e.g., M-50-2-MIR, M-99-2-MIR, and M-99-2-DSD, could achieve lower theoretical model complexity than MR-FCNN. In SHN and SA-SHN, the model complexity was increased with layer number and the model complexities of these two methods were much higher than those of the seed, the multi-objective scheme, the single-objective scheme, and the MR-FCNN.

Separation performance: We can see from Table VI (MIR-1K dataset) that the evolved structures in both single-objective and multi-objective schemes achieved higher GNSDR and GSIR performance on the Vocal source and higher GSAR performance on the Acc source than the seed. For DSD100 in Table VII, most evolved structures achieved higher SDR performance on Acc and Vocal sources than the seed. For the Vocal source, most evolved structures achieved higher SIR/SAR performance. The last three columns of Table VI and Table VII list the mean results of Vocal and Acc. One can see that the overall separation performances of most evolved structures in single-objective and multi-objective schemes outperform the seed in three evaluation metrics. In addition, by comparing the proposed method (including the seed, the single-objective scheme, and the multi-objective scheme) with other methods, one can see that the single-objective scheme, the multi-objective scheme, and the SA-SHN outperformed the MR-FCNN and the SHN.

Computational efficiency vs. Separation performance:

(i) Proposed method: By comparing Table V and the mean results in Tables VI-VII, we can find that within the same generation of the multi-objective scheme, the structures with a higher model complexity can provide higher performance on both datasets, e.g., from M-99-2-MIR to M-99-8-MIR, from M-99-2-DSD to M-99-7-DSD. In the single-objective scheme, a higher generation (with increased model complexity) usually achieved better separation performance, e.g., from S-1-1-MIR to S-16-1-MIR, from S-1-1-DSD to S-31-1-DSD. In particular, according to Fig. 3(b), the S-29-1-MIR (a structure of a later generation after the stopping criterion was satisfied) provided higher SDR performance than the final evolved gene S-16-1-MIR on the testing subset , while according to Table VI, this gene does not outperform the S-16-1-MIR on the full MIR-1K dataset. This result supports the effectiveness of the stopping criterion of the single-objective scheme.

Method Acc, median (SDR SIR SAR) Vocal, median (SDR SIR SAR) Mean (SDR SIR SAR)
Seed 12.18 18.36 14.47 5.47 13.16 7.01 8.83 15.76 10.74
M-25-4-DSD 12.78 17.91 14.80 6.21 14.32 7.24 9.50 16.12 11.02
M-50-8-DSD 12.70 18.34 14.88 6.31 14.85 7.45 9.51 16.60 11.16
M-99-2-DSD 11.96 18.40 13.95 5.36 13.10 6.53 8.66 15.75 10.24
M-99-4-DSD 12.52 18.25 14.43 5.95 14.27 7.12 9.23 16.26 10.78
M-99-6-DSD 12.64 18.09 14.70 6.15 14.53 7.25 9.39 16.31 10.98
M-99-7-DSD 12.64 18.33 14.83 6.42 14.79 7.51 9.53 16.56 11.17
S-1-1-DSD 12.33 18.49 14.45 5.68 13.16 7.22 9.01 15.82 10.84
S-8-1-DSD 12.39 17.78 14.44 5.82 14.56 7.04 9.11 16.17 10.74
S-16-1-DSD 12.41 18.05 14.74 6.26 15.24 7.14 9.34 16.64 10.94
S-31-1-DSD 12.60 18.48 14.72 6.15 14.76 7.36 9.38 16.62 11.04
S-31-2-DSD 12.70 18.28 14.69 6.24 14.68 7.31 9.47 16.48 11.00
S-49-1-DSD 12.62 18.25 14.54 6.23 14.89 7.24 9.42 16.57 10.89
MR-FCNN 11.28 16.48 13.59 4.76 12.43 5.83 8.02 14.45 9.71
SHN-1 12.11 17.78 14.20 5.42 13.46 6.66 8.76 15.62 10.43
SHN-2 12.01 17.95 14.43 5.67 13.80 6.76 8.84 15.88 10.60
SHN-4 12.17 17.63 14.61 5.85 14.29 7.07 9.01 15.96 10.84
SA-SHN-1 12.17 17.71 14.73 5.91 14.76 7.17 9.04 16.23 10.95
SA-SHN-2 12.33 18.06 14.73 6.11 14.79 7.27 9.22 16.42 11.00
SA-SHN-4 12.63 18.04 14.90 6.24 15.14 7.31 9.43 16.59 11.10
TABLE VII: The separation performance on DSD100 (in dB) of the proposed method (Seed, M-, and S-) and the existing SOTA methods (MR-FCNN, SHN-, and SA-SHN-).
(a) On MIR-1K dataset
(b) On DSD100 dataset
Fig. 5: Visualization of all structures in Params (vertical axis) and mean GNSDR/SDR (horizontal axis).
(a) Vocal
(b) Acc
Fig. 6: Qualitative comparison of the proposed method with MR-FCNN, SHN-4, and SA-SHN-4 on MIR-1K dataset.

By comparing the multi-objective scheme and the seed, one can see that the multi-objective scheme can obtain better separation results than the seed with a lower model complexity. For example, the M-99-6-DSD achieved dB improvement on mean SDR than the seed using only % Params and % FLOPs of the seed. In real environment, this structure was also times (Training) and times (Inferring) faster than the Seed. When comparing the single-objective scheme with the seed, we can find that the single-objective scheme achieved much better separation results with a slightly higher model complexity on the MIR-1K dataset, e.g., the S-16-1-MIR, which had dB improvement on mean GNSDR than that of the seed with additional cost of M Params and FLOPs. On the DSD100 dataset, the single-objective scheme achieved much better separation results with similar or even lower model complexity to the seed, e.g., the S-31-1-DSD, which obtained dB improvement on mean SDR and had a lower model complexity (only in Params and in FLOPs) than the seed.

When comparing the single-objective scheme with the multi-objective scheme, we can find that the multi-objective scheme could sometimes find more effective and efficient structures (similar or lower model complexity but better separation performance) than the single-objective scheme. For example, the M-99-6-DSD is dB higher in the mean SDR than the S-31-1-DSD but with only % Params and % FLOPs of the S-31-1-DSD. In the real environment, the M-99-6-DSD was times (Training) and times (Inferring) faster than the S-31-1-DSD. Such phenomenon can also be observed on the MIR-1K dataset, e.g., the M-50-5-MIR with only % Params and % FLOPs of the S-8-1-MIR was only dB lower than the S-8-1-MIR in mean GNSDR. In the real environment, the M-50-5-MIR was times (training) and times (Inferring) faster than the S-8-1-MIR. These observations suggested that the multi-objective scheme can greatly reduce the model complexity while maintaining acceptable separation performance.
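
The comparisons in the two paragraphs above can be recomputed from the entries of Table V and Table VII. The sketch below does this for the three DSD100 structures discussed. The values are copied from those tables, while the helper name and the dictionary layout are illustrative, and the resulting figures may be rounded differently from those in the original text.

```python
# Entries copied from Table V (Params, FLOPs, speeds) and Table VII (mean SDR).
TABLE = {
    "Seed":       {"params": 2.33, "flops": 129.72, "train": 31.61, "infer": 93.09,  "sdr": 8.83},
    "M-99-6-DSD": {"params": 0.59, "flops": 21.63,  "train": 87.76, "infer": 311.76, "sdr": 9.39},
    "S-31-1-DSD": {"params": 2.15, "flops": 116.94, "train": 33.62, "infer": 102.16, "sdr": 9.38},
}


def compare(candidate: str, reference: str) -> dict:
    """Relative cost and gain of `candidate` with respect to `reference`."""
    c, r = TABLE[candidate], TABLE[reference]
    return {
        "sdr_gain_db": c["sdr"] - r["sdr"],
        "params_percent": 100.0 * c["params"] / r["params"],
        "flops_percent": 100.0 * c["flops"] / r["flops"],
        "train_speedup": c["train"] / r["train"],
        "infer_speedup": c["infer"] / r["infer"],
    }


if __name__ == "__main__":
    for reference in ("Seed", "S-31-1-DSD"):
        result = compare("M-99-6-DSD", reference)
        print(reference, {k: round(v, 2) for k, v in result.items()})
```

Under these table entries, M-99-6-DSD uses roughly a quarter of the seed's Params and about a sixth of its FLOPs while gaining about half a dB in mean SDR.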

(ii) Proposed method vs. other methods: Compared with the proposed method (the seed, the single-objective scheme, and the multi-objective scheme), the MR-FCNN has a low theoretical model complexity (0.56 M in Params and 36.56 G in FLOPs according to Table V), while in the real environment, it was much slower in training (9.03 bat./s) and inferring (18.59 bat./s) than the proposed method. For separation performance, the MR-FCNN was much worse than the proposed method. In particular, we can see from Tables VI-VII that the evolved structures in the multi-objective scheme, e.g., M-99-2-MIR, M-50-2-MIR, and M-99-2-DSD, could achieve better separation performance in mean GNSDR/SDR than MR-FCNN even with lower model complexity.

The SHN and SA-SHN also achieved good separation performance, especially the SA-SHN. However, these two methods had low computational efficiency. For example, on the MIR-1K dataset, the SHN-4 (the best performance of SHN) and the M-99-2-MIR of the multi-objective scheme have similar mean GNSDR results, while the model complexity of SHN-4 was times (Params) and times (FLOPs) of those of M-99-2-MIR. In the real environment, the SHN-4 was times (training) and times (Inferring) slower than the M-99-2-MIR. Similar phenomena can also be observed in, e.g., SHN-4 vs. S-1-1-MIR, SHN-4 vs. S-1-1-DSD, SHN-4 vs. M-99-6-DSD. For SA-SHN, we can see from Tables VI-VII that when the SA-SHN-4 (the best performance of SA-SHN) had similar GNSDR/SDR results to the single-objective and multi-objective schemes on the MIR-1K and DSD100 datasets, its model complexity was much higher than the proposed structures, e.g., SA-SHN-4 vs. S-16-1-MIR, SA-SHN-4 vs. M-25-4-DSD.

All the above results suggested that the proposed method (especially the multi-objective scheme) was more effective and efficient than the MR-FCNN, the SHN, and the SA-SHN. In order to clearly visualize these quantitative results, we plotted all the data (except for MR-FCNN) of Tables V-VII in Fig. 5, where the vertical axis is the Params and the horizontal axis is the mean GNSDR/SDR.

Method Vocal (GNSDR GSIR GSAR) Acc (GNSDR GSIR GSAR)
MLRR [50] 3.85 5.63 10.70 4.19 7.80 8.22
DRNN [12] 7.45 13.08 9.68 – – –
ModGD [38] 7.50 13.73 9.45 – – –
U-Net [14] 7.43 11.79 10.42 7.45 11.43 10.41
Seed 11.26 17.29 12.94 10.23 14.16 13.08
M-25-5-MIR 11.80 18.95 13.11 10.03 13.24 13.56
M-50-2-MIR 11.41 17.56 13.05 10.20 14.00 13.25
M-50-5-MIR 11.42 17.88 12.94 10.41 14.85 12.97
M-99-2-MIR 11.26 17.34 12.92 10.13 14.04 13.09
M-99-4-MIR 11.42 17.80 12.99 10.04 13.67 13.25
M-99-5-MIR 11.54 17.39 13.28 10.25 13.94 13.38
M-99-8-MIR 11.89 17.89 13.55 10.31 13.68 13.68
S-1-1-MIR 11.69 18.03 13.22 9.84 12.71 13.71
S-8-1-MIR 11.85 18.02 13.45 10.16 13.27 13.77
S-16-1-MIR 11.89 17.80 13.60 10.55 14.18 13.65
S-29-1-MIR 11.83 17.79 13.51 10.51 14.20 13.58

TABLE VIII: Comparison of the proposed method (Seed, M-, and S-) with other MSVS methods (MLRR, DRNN, ModGD, and U-Net) on MIR-1K dataset (in dB), where “–” means corresponding results were not provided by the method.

VI-B2 Qualitative results

We also qualitatively compared the separation performance of the above methods. The separation results on an exemplar MIR-1K song clip (geniusturtle_6_04) are shown in Fig. 6. By comparing with the ground truth (G.T.) Vocal and Acc, one can see that the MR-FCNN, the SHN-4, the SA-SHN-4, and the seed of the proposed method wrongly assigned an important frequency component of Acc ( Hz appearing around ss) to Vocal. Besides, the MR-FCNN and the SHN-4 could not capture some of the fine vocal details. In contrast, the evolved structures in the single-objective scheme, e.g., S-1-1-MIR, S-8-1-MIR, and S-16-1-MIR, correctly put this frequency component back to the Acc. For the multi-objective scheme, the separation results of several structures in the 99th generation, e.g., M-99-2/4/5/8-MIR, are exhibited. It is shown that the M-99-4/5/8-MIR correctly assigned the Hz frequency component back to Acc while the M-99-2-MIR did not. According to Table V, the M-99-2-MIR traded separation performance for a very low model complexity. Finally, one can see that the estimated magnitude spectrograms of the Vocal and Acc obtained by M-99-4/5/8-MIR were quite similar to those of the ground truth Vocal and Acc sources.

VI-B3 Comparison with other methods

We finally compared the proposed method with other MSVS methods. The results are listed in Tables VIII-IX. These numerical results verified the separation performance of the proposed method.

Method Vocal Acc
DeepNMF [36] 2.75 8.90
wRPCA [15] 3.92 9.45
NUG [29] 4.55 10.29
BLEND [47] 5.23 11.70
MM-DenseNet [46] 6.00 12.10
Seed 5.47 12.18
M-25-4-DSD 6.21 12.78
M-50-8-DSD 6.31 12.70
M-99-2-DSD 5.36 11.96
M-99-4-DSD 5.95 12.52
M-99-6-DSD 6.15 12.64
M-99-7-DSD 6.42 12.64
S-1-1-DSD 5.68 12.33
S-8-1-DSD 5.82 12.39
S-16-1-DSD 6.26 12.41
S-31-1-DSD 6.15 12.60
S-31-2-DSD 6.24 12.70
S-49-1-DSD 6.23 12.62
TABLE IX: Median SDR values of the proposed method (Seed, M-, and S-) and other MSVS methods (DeepNMF, wRPCA, NUG, BLEND, and MM-DenseNet) on DSD100 dataset (in dB).

VII Conclusions

As the first attempt in the field of MSVS, this paper proposed an evolutionary framework, i.e., the E-MRP-CNN, to automatically find effective neural networks for MSVS. The proposed E-MRP-CNN is based on a novel MR-CNN namely MRP-CNN, which utilizes various-size average pooling operators for feature extraction. Compared with existing MR-CNNs, the MRP-CNN has a low computational complexity and can effectively extract multi-resolution features for MSVS. We derived the E-MRP-CNN using single-objective and multiple-objective genetic algorithms. The single-objective E-MRP-CNN considers only the separation performance while the multi-objective E-MRP-CNN considers both the separation performance and the model complexity, and thus it provides a set of solutions to handle different separation performance and/or model complexity requirements. Experimental results on the MIR-1K and DSD100 datasets showed that the proposed method (especially the multi-objective scheme) is more effective and efficient than the SOTA MSVS methods, which verified the effectiveness of the proposed method.

VIII Acknowledgment

This work was supported by National Natural Science Foundation of China (No. 61902280, 61373104), Natural Science Foundation of Tianjin (No. 19JCYBJC15600). It was also supported by a Grant-in-Aid for Scientific Research (B) (No. 17H01761) and I-O DATA foundation.


  • [1] K. Deb, S. Agrawal, A. Pratap, and T. Meyarivan (2002) A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evolutionary Computation 6 (2), pp. 182–197. Cited by: §I, §IV-B, §IV-B.
  • [2] T. Elsken, J. H. Metzen, and F. Hutter (2019) Neural architecture search: A survey. J. Mach. Learn. Res. 20, pp. 55:1–55:21. Cited by: §II.
  • [3] X. Gong, S. Chang, Y. Jiang, and Z. Wang (2019) AutoGAN: neural architecture search for generative adversarial networks. CoRR abs/1908.03835. Cited by: §II.
  • [4] P. Goyal, P. Dollár, R. B. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He (2017) Accurate, large minibatch SGD: training imagenet in 1 hour. CoRR abs/1706.02677. Cited by: §V-D.
  • [5] E. M. Grais and M. D. Plumbley (2017) Single channel audio source separation using convolutional denoising autoencoders. In Proc. IEEE Global Conf. Signal and Info. Process., pp. 1265–1269. Cited by: §I, §II.
  • [6] E. M. Grais, D. Ward, and M. D. Plumbley (2018) Raw multi-channel audio source separation using multi-resolution convolutional auto-encoders. In Proc. 26th European Signal Processing Conf., pp. 1577–1581. Cited by: §I, §I, §VI-B1.
  • [7] E. M. Grais, H. Wierstorf, D. Ward, and M. D. Plumbley (2018) Multi-resolution fully convolutional neural networks for monaural audio source separation. In Proc. 14th Int. Conf. Latent Variable Anal. Signal Separation (LVA/ICA), pp. 340–350. Cited by: §I, §I.
  • [8] K. Guo, S. Zeng, J. Yu, Y. Wang, and H. Yang (2019) A survey of fpga-based neural network inference accelerators. TRETS 12 (1), pp. 2:1–2:26. Cited by: §I, §IV-B.
  • [9] C. Hsu and J. R. Jang (2010) On the improvement of singing voice separation for monaural recordings using the MIR-1K dataset. IEEE Trans. Audio, Speech & Language Processing 18 (2), pp. 310–319. Cited by: §V-A.
  • [10] C. A. Huang, A. Vaswani, J. Uszkoreit, N. Shazeer, C. Hawthorne, A. M. Dai, M. D. Hoffman, and D. Eck (2018) An improved relative self-attention mechanism for transformer with application to music generation. CoRR abs/1809.04281. Cited by: §I.
  • [11] P. Huang, S. D. Chen, P. Smaragdis, and M. Hasegawa-Johnson (2012) Singing-voice separation from monaural recordings using robust principal component analysis. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., pp. 57–60. Cited by: §I.
  • [12] P. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis (2015) Joint optimization of masks and deep recurrent neural networks for monaural source separation. IEEE/ACM Trans. Audio, Speech & Language Processing 23 (12), pp. 2136–2147. Cited by: §V-A, §V-A, TABLE VIII.
  • [13] P. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis (2015) Joint optimization of masks and deep recurrent neural networks for monaural source separation. IEEE/ACM Trans. Audio, Speech & Language Processing 23 (12), pp. 2136–2147. Cited by: §I.
  • [14] A. Jansson, E. J. Humphrey, N. Montecchio, R. M. Bittner, A. Kumar, and T. Weyde (2017) Singing voice separation with deep u-net convolutional networks. In Proc. 18th Int. Soc. Music Inf. Ret. Conf., pp. 745–751. Cited by: §I, §I, §I, §II, TABLE VIII.
  • [15] I. Jeong and K. Lee (2017) Singing voice separation using RPCA with weighted l_1-norm. In Proc. 13th Int. Conf. Latent Var. Anal. Signal Separation, pp. 553–562. Cited by: §VI-B, TABLE IX.
  • [16] D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In Proc. 3rd Int. Conf. Learning Rep., Cited by: §V-D.
  • [17] Y. LeCun, Y. Bengio, and G. E. Hinton (2015) Deep learning. Nature 521 (7553), pp. 436–444. Cited by: §I.
  • [18] H. Lee, P. T. Pham, Y. Largman, and A. Y. Ng (2009) Unsupervised feature learning for audio classification using convolutional deep belief networks. In Proc. Adv. Neural Inf. Process. Syst., pp. 1096–1104. Cited by: item 2.
  • [19] K. Li, S. Kwong, Q. Zhang, and K. Deb (2015) Interrelationship-based selection for decomposition multiobjective optimization. IEEE Trans. Cybernetics 45 (10), pp. 2076–2088. Cited by: §IV-B.
  • [20] Y. Li and D. Wang (2007) Separation of singing voice from music accompaniment for monaural recordings. IEEE Trans. Audio, Speech & Language Processing 15 (4), pp. 1475–1487. Cited by: §I, §I.
  • [21] J. Liu and Y. Yang (2019) Dilated convolution with dilated GRU for music source separation. In Proc. 28th Int. Joint Conf. on Artificial Intelligence, pp. 4718–4724. Cited by: §II.
  • [22] A. Liutkus and R. Badeau (2015) Generalized wiener filtering with fractional power spectrograms. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., pp. 266–270. Cited by: §V-B.
  • [23] I. Loshchilov and F. Hutter (2017) SGDR: stochastic gradient descent with warm restarts. In Proc. of the 5th Int. Conf. Learning Representations, Cited by: §V-D.
  • [24] Y. Luo, Z. Chen, J. R. Hershey, J. L. Roux, and N. Mesgarani (2017) Deep clustering and conventional networks for music separation: stronger together. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., pp. 61–65. Cited by: §II.
  • [25] M. Michelashvili, S. Benaim, and L. Wolf (2019) Semi-supervised monaural singing voice separation with a masking network trained on synthetic mixtures. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., pp. 291–295. Cited by: §II.
  • [26] S. I. Mimilakis, K. Drossos, E. Cano, and G. Schuller (2019) Examining the mapping functions of denoising autoencoders in music source separation. CoRR abs/1904.06157. Cited by: §II.
  • [27] S. I. Mimilakis, K. Drossos, J. F. Santos, G. Schuller, T. Virtanen, and Y. Bengio (2018) Monaural singing voice separation with skip-filtering connections and recurrent inference of time-frequency mask. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., pp. 721–725. Cited by: §II, §V-B.
  • [28] S. I. Mimilakis, K. Drossos, T. Virtanen, and G. Schuller (2017) A recurrent encoder-decoder approach with skip-filtering connections for monaural singing voice separation. In Proc. Int. Workshop on Mach. Learn. for Signal Processing, pp. 1–6. Cited by: §II, §V-B.
  • [29] A. A. Nugraha, A. Liutkus, and E. Vincent (2016) Multichannel music separation with deep neural networks. In Proc. 24th European Signal Process. Conf., pp. 1748–1752. Cited by: §VI-B, TABLE IX.
  • [30] N. Ono, Z. Rafii, D. Kitamura, N. Ito, and A. Liutkus (2015) The 2015 signal separation evaluation campaign. In Proc. 12th Int. Conf. Latent Var. Anal. Signal Separation, pp. 387–395. Cited by: §V-A.
  • [31] A. Ozerov, P. Philippe, F. Bimbot, and R. Gribonval (2007) Adaptation of bayesian models for single-channel source separation and its application to voice/music separation in popular songs. IEEE Trans. Audio, Speech & Language Processing 15 (5), pp. 1564–1578. Cited by: §V-A.
  • [32] S. Park, T. Kim, K. Lee, and N. Kwak (2018) Music source separation using stacked hourglass networks. In Proc. 19th Int. Soc. Music Inf. Ret. Conf., pp. 289–296. Cited by: §I, §I, §I, §II, §V-A, §V-A, §V-B, §V-B, §V-B, §V-B, §VI-B1, footnote 4.
  • [33] J. Paulus, M. Müller, and A. Klapuri (2010) State of the art report: audio-based music structure analysis. In Proc. 11th Int. Soc. Music Inf. Ret. Conf., pp. 625–636. Cited by: §I.
  • [34] Z. Rafii, A. Liutkus, F. Stöter, S. I. Mimilakis, D. FitzGerald, and B. Pardo (2018) An overview of lead and accompaniment separation in music. IEEE/ACM Trans. Audio, Speech & Language Processing 26 (8), pp. 1307–1335. Cited by: item 1, §I, §I, §I.
  • [35] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In Proc. 18th Int. Conf. Med. Image Computing and Computer-Assisted Intervention, pp. 234–241. Cited by: §V-B.
  • [36] J. L. Roux, J. R. Hershey, and F. Weninger (2015) Deep NMF for speech separation. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., pp. 66–70. Cited by: §VI-B, TABLE IX.
  • [37] J. Schlüter and S. Böck (2014) Improved musical onset detection with convolutional neural networks. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., pp. 6979–6983. Cited by: item 2.
  • [38] J. Sebastian and H. A. Murthy (2016) Group delay based music source separation using deep recurrent neural networks. In Proc. Int. Conf. Signal Process. and Comm., pp. 1–5. Cited by: TABLE VIII.
  • [39] A. J. R. Simpson, G. Roma, and M. D. Plumbley (2015) Deep karaoke: extracting vocals from musical mixtures using a convolutional deep neural network. In Proc. 12th Int. Conf. Latent Var. Anal. Signal Separation, pp. 429–436. Cited by: §I.
  • [40] O. Slizovskaia, L. Kim, G. Haro, and E. Gómez (2019) End-to-end sound source separation conditioned on instrument labels. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., pp. 306–310. Cited by: §II.
  • [41] D. So, Q. Le, and C. Liang (2019) The evolved transformer. In Proc. 36th Int. Conf. Mach. Learn., pp. 5877–5886. Cited by: §II.
  • [42] K. O. Stanley and R. Miikkulainen (2002) Evolving neural networks through augmenting topologies. Evolutionary Computation 10 (2), pp. 99–127. Cited by: §II.
  • [43] D. Stoller, S. Ewert, and S. Dixon (2018) Wave-u-net: A multi-scale neural network for end-to-end audio source separation. In Proc. Conf. Int. Soc. Music Inf. Retrieval, pp. 334–340. Cited by: §II.
  • [44] A. Stoutchinin, F. Conti, and L. Benini (2019) Optimally scheduling CNN convolutions for efficient memory access. CoRR abs/1902.01492. Cited by: §I.
  • [45] Y. C. Sübakan and P. Smaragdis (2018) Generative adversarial source separation. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., pp. 26–30. Cited by: §II.
  • [46] N. Takahashi and Y. Mitsufuji (2017) Multi-scale multi-band densenets for audio source separation. In Proc. IEEE Workshop on App. of Signal Process. to Audio and Acoustics, pp. 21–25. Cited by: §VI-B, TABLE IX.
  • [47] S. Uhlich, M. Porcu, F. Giron, M. Enenkl, T. Kemp, N. Takahashi, and Y. Mitsufuji (2017) Improving music source separation based on deep neural networks through data augmentation and network blending. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., pp. 261–265. Cited by: §VI-B, TABLE IX.
  • [48] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS, pp. 6000–6010. Cited by: §II.
  • [49] E. Vincent, R. Gribonval, and C. Févotte (2006) Performance measurement in blind audio source separation. IEEE Trans. Audio, Speech & Language Processing 14 (4), pp. 1462–1469. Cited by: §IV-A, §V-A, footnote 2.
  • [50] Y. Yang (2013) Low-rank representation of both singing voice and music accompaniment via learned dictionaries. In Proc. 14th Int. Soc. Music Inf. Ret. Conf., pp. 427–432. Cited by: §V-A, TABLE VIII.
  • [51] W. Yuan, S. Wang, X. Li, M. Unoki, and W. Wang (2019) A skip attention mechanism for monaural singing voice separation. IEEE Signal Process. Lett. 26 (10), pp. 1481–1485. Cited by: §I, §II, §V-A, §VI-B1.
  • [52] H. Zhang, I. J. Goodfellow, D. N. Metaxas, and A. Odena (2019) Self-attention generative adversarial networks. In Proc. 36th Int. Conf. on Mach. Learn., pp. 7354–7363. Cited by: §I.