Singer separation for karaoke content generation

by Hsuan-Yu Chen, et al.

Thanks to the rapid development of deep learning, singing voice can now be successfully separated from monaural music. However, such separation only distinguishes human voices from musical instruments, which is insufficient for karaoke content generation, where the lead singers themselves must be separated. For this karaoke application, we need to separate music containing a male and female duet into two vocals, or extract a single lead vocal from music containing vocal harmony. We therefore propose a singer separation system that generates karaoke content for one or two separated lead singers. In particular, we introduce three models for the singer separation task and design an automatic model selection scheme to distinguish how many lead singers are in the song. We also collected a sufficiently large dataset, MIR-SingerSeparation, which has been publicly released to advance research in this area. Our singer separation is most suitable for sentimental ballads and can be directly applied to karaoke content generation. As far as we know, this is the first singer separation work for real-world karaoke applications.




1 Introduction

Audio signal separation models were initially used to separate a target voice from background interference. One example is the TasNet model [11], which uses an encoder to extract two-dimensional speech features, a separation module to estimate the speaker masks, and a decoder to convert the two-dimensional features back into a speech waveform to obtain the separated speech. Because the decoder cannot reconstruct the signal perfectly, TasNet has been explored and modified in depth, leading to the development of the multi-phase Gammatone filterbank, which obtains a better frequency response distribution than a filterbank learned from random initialization. Compared with a single-channel audio signal, a multi-channel audio signal provides more spatial information, which further assists speech separation. Wave-U-Net [14] splices multi-channel signals as input to a U-Net, which requires changing the number of input channels. However, the input length in the time domain is usually not fixed, and very long sequences are hard to optimize, so the traditional RNN model cannot be used effectively. Dual-path recurrent neural networks (DPRNN) organize the RNN layers of a deep model so that extremely long speech sequences can be processed [10]. Later, the dual-path transformer network (DPTNet) [2] replaced the RNN with a transformer to improve source separation.

However, due to the particular complexities of musical structure [4], this specific case of source separation brings both challenges and opportunities. A song includes many different instruments and vocals, and the instrumental accompaniment and multiple vocal sources cannot be directly separated by applying a multi-channel model. We propose a novel singer separation framework for duet songs, whether the two vocal parts are sung by one singer or by different singers. In addition, we have released four datasets of English and Chinese songs to assist future research in singer separation for duet songs. The proposed method automatically selects an appropriate model for the lyrics language or the number of singers. The remainder of this paper is organized as follows. In Section 2, we introduce the datasets and their methods of generation. Section 3 introduces the proposed system architecture and auto-selection method. Section 4 presents the content, results, and findings of each experiment. Section 5 presents conclusions and directions for future work.

2 Dataset

Many datasets have been developed for music source separation. For instance, MUSDB18 is composed of drums, bass, vocals, and the remaining accompaniment [12]. The Mixing Secrets dataset is another multi-track dataset used for instrument recognition in polyphonic music [5], but it suffers from serious leakage between tracks. The Choral Singing Dataset is a good multi-microphone dataset [3] that contains individual recordings of 16 singers from Spain, but it has only three songs. Because no existing dataset is appropriate for the singer separation task, we created and released the MIR-SingerSeparation dataset, consisting of 476 English and 500 Chinese songs publicly available on YouTube.

The Chinese songs are the same as those of MIR-ST500 [16], while the English songs are 476 popular songs randomly selected from YouTube. The male/female vocalist ratio is 269:207 for the English songs and 223:277 for the Chinese songs. First, we divided the songs into training and testing subsets (80:20 ratio), ensuring that no vocalist appeared in both subsets. We then resampled each song to 8 kHz and used an improved accompaniment separation algorithm [6] to separate the vocals from the accompaniment. Finally, we cut the songs into 10-second segments and mixed pairs of vocal waveforms at a randomly selected SNR between -5 and 5 dB.
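The final mixing step can be illustrated with a minimal sketch. It assumes the SNR is defined as the power ratio of the first vocal to the second, and it uses random noise as a stand-in for real vocal segments; the function name mix_at_snr and all variable names are hypothetical and not taken from the released dataset code.

```python
import numpy as np

def mix_at_snr(vocal_a, vocal_b, snr_db):
    """Scale vocal_b so that vocal_a is snr_db dB above it, then sum the two."""
    power_a = np.mean(vocal_a ** 2)
    power_b = np.mean(vocal_b ** 2)
    # Power vocal_b must have so that 10*log10(power_a / power_b_scaled) == snr_db.
    target_power_b = power_a / (10.0 ** (snr_db / 10.0))
    gain = np.sqrt(target_power_b / power_b)
    scaled_b = gain * vocal_b
    return vocal_a + scaled_b, scaled_b

rng = np.random.default_rng(0)
sample_rate = 8000                 # 8 kHz, as in the text
segment_len = 10 * sample_rate     # 10-second segments
# Placeholder signals; real data would be separated vocal segments.
vocal_a = rng.standard_normal(segment_len)
vocal_b = 0.3 * rng.standard_normal(segment_len)

snr_db = rng.uniform(-5.0, 5.0)    # random SNR in the -5 to 5 dB range
mixture, scaled_b = mix_at_snr(vocal_a, vocal_b, snr_db)
achieved_snr = 10 * np.log10(np.mean(vocal_a ** 2) / np.mean(scaled_b ** 2))
```

Keeping the scaled second vocal alongside the mixture is convenient because both sources are needed as ground truth when training a separation model.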

From left to right, the pairing methods in Fig. 1 are the English duet (EN-D), Chinese duet (CH-D), and English self-harmonic vocal (EN-S). The duet datasets extract vocal segments from different singers and pair them together. Since we wanted EN-D and CH-D to have similar amounts of data, we repeated the pairing of English singers twice. The pairing method for EN-S is similar to that for EN-D, except that EN-D pairs randomly extracted segments from different singers while EN-S pairs randomly extracted segments from the same singer. The distribution of the pairing results is shown in Table 1. MIR-SingerSeparation will continue to expand in the future, for instance, by adding a label for slow- or fast-tempo music.
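The two pairing strategies can be sketched as below. This is only an illustration under assumptions: segments are represented by string labels, a self-harmonic pair is taken to be two different segments of one singer, and the function make_pairs and its parameters are hypothetical names.

```python
import random

def make_pairs(segments_by_singer, same_singer, num_pairs, seed=0):
    """Return a list of (segment_a, segment_b) pairs.

    same_singer=False mimics the duet pairing (two different singers);
    same_singer=True mimics the self-harmonic pairing (one singer, two segments).
    """
    rng = random.Random(seed)
    singers = sorted(segments_by_singer)
    pairs = []
    for _ in range(num_pairs):
        if same_singer:
            # Only singers with at least two segments can be paired with themselves.
            eligible = [s for s in singers if len(segments_by_singer[s]) >= 2]
            singer = rng.choice(eligible)
            seg_a, seg_b = rng.sample(segments_by_singer[singer], 2)
        else:
            singer_a, singer_b = rng.sample(singers, 2)
            seg_a = rng.choice(segments_by_singer[singer_a])
            seg_b = rng.choice(segments_by_singer[singer_b])
        pairs.append((seg_a, seg_b))
    return pairs

# Toy data: letters are singers, digits are segment indices (as in Fig. 1).
segments = {"A": ["A1", "A2", "A3"], "B": ["B1", "B2"], "C": ["C1", "C2"]}
duet_pairs = make_pairs(segments, same_singer=False, num_pairs=5)
self_pairs = make_pairs(segments, same_singer=True, num_pairs=5)
```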

Figure 1: The pairing methods of different datasets. Capital letters represent different singers’ vocals (yellow, green, and red blocks). Numbers represent the index of segments in a song.
Figure 2: System flowchart of singer separation system.

3 System Framework

Each song includes many components, such as drums, piano, vocals, and harmonies. We divide these components into accompaniment and vocals. The singer separation system (SSSYS) therefore includes two stages (Fig. 2): stage 1 separates the song into accompaniment and vocals, and stage 2 divides the mixed vocals into two vocals. In the vocal separation stage, we use an improved Wave-U-Net model to accurately differentiate between vocals and accompaniment. We then use the output vocal files as the training material for the second stage. In the lead vocal separation stage, vocals are mixed to form multiple audio files. Following the DPRNN and DPTNet models, we convert the two-vocal mixtures into separate singing files for the two vocals.

11373 12970 384hr 17.5min
4059 4059 205hr 6.5min
6489 6489 256hr 31min
Table 1: Data distribution of each dataset.

3.1 Vocal separation in Wave-U-Net

U-Net was first proposed by Ronneberger et al. in 2015 [13] for biomedical image segmentation. The model can be divided into three parts: encoder, decoder, and skip connections. Two years later, Jansson et al. proposed applying U-Net to spectrograms for singing voice separation [7]. The main difference between Jansson’s U-Net and Ronneberger’s original U-Net is that each encoder layer uses a stride of two and no pooling layer, which is closer to the model proposed by S. Boll [1]. Combining the above methods, the Wave-U-Net model [6] has three characteristics:

  • Use the encoder and decoder architectures in Ronneberger’s U-Net.

  • Adjust the stride setting of the pooling layer so that after the input spectrum passes through the pooling layer, the time dimension remains unchanged.

  • Refer to Jansson’s U-Net and set the number of model encoders to six layers.

The Wave-U-Net model is used in the first stage. The input is music, and the outputs are accompaniment and vocals.

(a) The user interface of the karaoke application. (b) Pitch data of actual singing.
Figure 3: The pitch trend in a song. The blue and red lines at the bottom of (a) are the pitches of the two lead singers, corresponding to vocals A and B in (b), respectively.

3.2 Lead vocal separation in DPRNN or DPTNet

Dual-path RNN (DPRNN) consists of three stages: segmentation, block processing, and overlap-add [10]. It divides a long audio input into smaller chunks and iteratively applies intra-chunk and inter-chunk operations, where the input length of each operation (the chunk size) can be made proportional to the square root of the original audio length. The dual-path transformer network (DPTNet) also has three parts: encoder, separation layer, and decoder [2]. By introducing an improved transformer, the elements in the speech sequence can interact directly, enabling DPTNet to model speech sequences with direct context-awareness.
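The segmentation and overlap-add stages can be illustrated with a small NumPy sketch. It omits the block processing (the intra-chunk and inter-chunk networks) entirely, assumes 50% overlap between chunks, and picks an even chunk size near the square root of twice the sequence length; these choices and the function names are assumptions for illustration, not the configuration used in this work.

```python
import math
import numpy as np

def segment(features, chunk_size):
    """Split (N, L) features into 50%-overlapping chunks of shape (N, K, S)."""
    hop = chunk_size // 2          # chunk_size is assumed to be even
    length = features.shape[1]
    rest = (-length) % hop         # zeros needed to reach a multiple of hop
    # Pad hop zeros on both sides so every original frame lies in exactly two chunks.
    padded = np.pad(features, ((0, 0), (hop, hop + rest)))
    num_chunks = (padded.shape[1] - chunk_size) // hop + 1
    return np.stack(
        [padded[:, i * hop : i * hop + chunk_size] for i in range(num_chunks)],
        axis=2,
    )

def overlap_add(chunks, length):
    """Invert segment(): sum the overlapping chunks and drop the padding."""
    n_feat, chunk_size, num_chunks = chunks.shape
    hop = chunk_size // 2
    out = np.zeros((n_feat, hop * (num_chunks - 1) + chunk_size))
    for i in range(num_chunks):
        out[:, i * hop : i * hop + chunk_size] += chunks[:, :, i]
    # Each original frame was added twice, so halve the sum.
    return out[:, hop : hop + length] / 2.0

length = 1000
chunk_size = 2 * max(1, round(math.sqrt(2 * length) / 2))  # even, near sqrt(2L)
features = np.random.default_rng(0).standard_normal((4, length))
chunks = segment(features, chunk_size)
restored = overlap_add(chunks, length)
```

With this choice, the chunk length and the number of chunks are of similar size, so both the intra-chunk and the inter-chunk passes operate on sequences far shorter than the original input.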

In the second stage, we use the DPRNN or DPTNet model to split the mixed vocals into two vocals.

3.3 Model auto-selection

Given an input song, we cannot choose the corresponding model in advance because we do not know the lyric language or whether the same singer performs the harmony vocals. We found that the pitch trends of the two singers are almost the same in most songs (Fig. 3), so if the pitch trends of the two separated channels are the same, we can consider that model to have the best separation effect. The automatic model selection method (1) therefore chooses, among the model types (EN-Duet, CH-Duet, and EN-Self), the model whose separated vocals A and B have the closest pitch trends. The pitch trend of a vocal is the difference between pitches within that same vocal (2), where the pitches are obtained from CREPE [8]. It is worth noting that using a sliding window with a unit of three (3) ensures that the algorithm skips audio segments in which only one singer is singing.


The above formula takes its smallest value when a model erroneously divides the singing voices of two people into two tracks, one of which consists of silence only. Therefore, if all pitches in a channel are zero, we assign a high penalty value so that a model in this situation is not selected.

Through the auto-selection algorithm (1), we obtain a separation result with the closest trend as the final system output.
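The selection rule described in this section could be sketched as follows. This sketch does not reproduce the paper's exact formulas: the score (mean absolute difference between the frame-to-frame pitch changes of the two channels), the treatment of a zero pitch as unvoiced, the penalty value, and the names trend_distance, select_model, and PENALTY are all assumptions for illustration. In practice the pitch tracks would come from a pitch estimator such as CREPE; here they are hard-coded toy values.

```python
import numpy as np

PENALTY = 1e9  # hypothetical large score for a channel that is silent throughout

def trend_distance(pitch_a, pitch_b, window=3):
    """Score how differently two pitch tracks move; lower means closer trends."""
    n = min(len(pitch_a), len(pitch_b))
    pitch_a = np.asarray(pitch_a[:n], dtype=float)
    pitch_b = np.asarray(pitch_b[:n], dtype=float)
    # A channel whose pitches are all zero is treated as a failed separation.
    if not pitch_a.any() or not pitch_b.any():
        return PENALTY
    distances = []
    for start in range(n - window + 1):
        win_a = pitch_a[start : start + window]
        win_b = pitch_b[start : start + window]
        # Skip windows in which either singer is silent in any frame.
        if np.all(win_a > 0) and np.all(win_b > 0):
            trend_a = np.diff(win_a)   # pitch differences within vocal A
            trend_b = np.diff(win_b)   # pitch differences within vocal B
            distances.append(np.mean(np.abs(trend_a - trend_b)))
    return float(np.mean(distances)) if distances else PENALTY

def select_model(pitches_by_model):
    """Pick the model whose two output channels have the closest pitch trends."""
    scores = {name: trend_distance(a, b) for name, (a, b) in pitches_by_model.items()}
    return min(scores, key=scores.get), scores

# Toy pitch tracks in Hz for the two channels produced by each candidate model.
candidates = {
    "EN-Duet": ([220, 230, 240, 250, 0, 0], [330, 340, 350, 360, 0, 0]),
    "CH-Duet": ([220, 230, 240, 250, 0, 0], [330, 320, 350, 330, 0, 0]),
    "EN-Self": ([220, 230, 240, 250, 0, 0], [0, 0, 0, 0, 0, 0]),
}
best, scores = select_model(candidates)
```

In this toy case the EN-Duet candidate wins because its two channels rise in parallel, while the EN-Self candidate receives the penalty because one of its channels is silent.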

4 Experiments and Results

In all experiments, we evaluate system effectiveness based on signal fidelity. The improvement in signal fidelity is measured by scale-invariant signal-to-noise ratio improvement (SI-SNRi) [9] and signal-to-distortion ratio improvement (SDRi) [15], both of which are commonly used for audio source separation systems. We calculate these two indicators for the vocal part only, which makes the experimental results fairer.
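As a reference for the first metric, the following is a minimal sketch of SI-SNR and its improvement over the unprocessed mixture, using the standard definition (zero-mean signals, projection of the estimate onto the reference). The function names, the small constant eps, and the synthetic signals are illustrative assumptions; the evaluation code used for the reported numbers is not shown in the text.

```python
import numpy as np

def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to get the target component.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return 10 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))

def si_snr_improvement(estimate, mixture, reference):
    """SI-SNRi: gain of the separated estimate over the raw mixture."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)

rng = np.random.default_rng(0)
reference = rng.standard_normal(8000)        # stand-in for the true vocal
interference = rng.standard_normal(8000)     # stand-in for the other vocal
mixture = reference + interference
estimate = reference + 0.1 * interference    # a fairly good separation
improvement = si_snr_improvement(estimate, mixture, reference)
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate does not change the score, which is what makes the measure scale-invariant.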

Model   Training time  SI-SNRi  SDRi
DPRNN   10D23H46M      8.2679   8.7844
DPTNet  11D21H50M      9.3741   8.8861
Table 2: Comparison of different models on the English duet data.
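Model    EN-D SI-SNRi  EN-D SDRi  CH-D SI-SNRi  CH-D SDRi  EN-S SI-SNRi  EN-S SDRi
EN-Duet  8.2679        8.7844     10.0857       10.4660    4.5389        3.2236
CH-Duet  6.7841        7.3587     10.8926       11.1288    -             -
EN-Self  6.6387        7.1639     -             -          4.7366        5.2674
Table 3: Comparison of results on the English duet, Chinese duet, and English self-harmonic datasets.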

4.1 Experimental configurations

Experiments were run on a computer running the Ubuntu 18.04.5 LTS operating system, with 8 AMD CPUs, a GeForce RTX 3090 GPU, 64 GB RAM, and a 1 TB SSD. Each model was trained for 100 epochs, with a learning rate of

, and using Adam as the optimizer. Training stops early if no new best model is found on the validation set for ten consecutive epochs.
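The stop condition can be expressed as a short sketch. It assumes the "best model" is judged by a validation loss where lower is better, and it replaces the actual training loop with a list of precomputed losses; the function name and its parameters are hypothetical.

```python
def run_with_early_stopping(validation_losses, max_epochs=100, patience=10):
    """Return (best_epoch, last_epoch_run) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    losses = validation_losses[:max_epochs]
    for epoch, loss in enumerate(losses):
        if loss < best_loss:
            # A new best model: remember it and reset the waiting period.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` consecutive epochs: stop training.
            return best_epoch, epoch
    return best_epoch, len(losses) - 1

# Toy run: the loss improves for three epochs and then plateaus.
toy_losses = [1.0, 0.8, 0.7] + [0.75] * 20
best_epoch, last_epoch = run_with_early_stopping(toy_losses)
```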

4.2 Singer Separation Results

We first compare the SSSYS with DPRNN or DPTNet in the second stage (Table 2). In the lead vocal separation stage, DPTNet performs slightly better than DPRNN, but its training time is also longer.

Speech separation uses different models for different languages, so we also divide the songs by language to compare singer separation (Table 3). On the English duet dataset, the English duet model achieves 8.2679 and 8.7844, outperforming the Chinese duet model (6.7841 and 7.3587). On the Chinese duet dataset, the Chinese duet model achieves 10.8926 and 11.1288, compared with 10.0857 and 10.4660 for the English duet model, indicating that a model trained with data in the same language performs better on most songs. Some harmonies are sung by the lead vocalist, a situation not seen in speech data. Because the voice characteristics of the same singer are similar, self-harmony raises the difficulty of the singer separation task. The experimental results (Table 3) indicate that the EN-Duet model performs less well on EN-S, where the EN-Self model outperforms the EN-Duet model by more than 4%.

Based on the above experiments, the lyric language and the number of singers affect singer separation results. Applying the model auto-selection method to 34 actual songs taken from the CMedia karaoke app, we obtain the confusion matrix in Table 4. The accuracy is 70.59%, and the average SI-SNRi and SDRi are 8.9775 and 6.8282, which is better than any single model and less than 10% worse than the results obtained when every song is classified correctly (Table 5). The method is therefore feasible for separating vocals in duet and harmonic songs, and the algorithm could also be incorporated into neural network training in the future.

Predicted classes
EN-D CH-D EN-S Total
EN-D 9 8 0 17
CH-D 1 13 0 14
EN-S 0 1 2 3
Total 10 22 2 34
Table 4: Confusion matrix of model auto-selection.
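The reported accuracy follows directly from Table 4, as the short check below shows. The per-class recall is an additional quantity derived from the same table and is not reported in the text; rows are taken to be the actual classes, since the columns are labelled as predicted classes.

```python
import numpy as np

labels = ["EN-D", "CH-D", "EN-S"]
# Rows: actual class, columns: predicted class (values from Table 4).
confusion = np.array([
    [9, 8, 0],
    [1, 13, 0],
    [0, 1, 2],
])

# Correct predictions lie on the diagonal: (9 + 13 + 2) / 34.
accuracy = np.trace(confusion) / confusion.sum()
# Fraction of each actual class that was assigned to the right model.
recall = dict(zip(labels, np.diag(confusion) / confusion.sum(axis=1)))
```

Most of the errors come from English duets being assigned to the CH-D model (8 of 17 songs).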


5 Conclusion

This paper proposes a singer separation system that uses music characteristics to separate the song source layer by layer. A single sound source can be successfully separated into an accompaniment component and the individual vocals of two singers. We also created and publicly released MIR-SingerSeparation, which includes three datasets: EN-Duet, CH-Duet, and EN-Self. We compared the separation of singing vocals in different languages and for different singers, found that these factors affect model performance, and therefore proposed a model auto-selection method to maximize performance.

Model                   SI-SNRi  SDRi
EN-Duet                 8.1860   6.8105
CH-Duet                 8.5900   6.3200
EN-Self                 6.0796   5.0517
Auto Selection          8.9775   6.8282
Correct Classification  9.6080   7.4319
Table 5: Experimental results of the auto-selection method.

Future work will compare songs at different tempos to evaluate the impact of tempo on model performance. We will also perform data augmentation using a single singer’s voice, changing the pitch for the same lyrics and mixing the result into the ground truth data. After identifying all the features that affect singer separation, we will use joint training to replace the current singer separation model with a more powerful one.


References

  • [1] S. Boll (1979) Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on acoustics, speech, and signal processing 27 (2), pp. 113–120.
  • [2] J. Chen, Q. Mao, and D. Liu (2020) Dual-path transformer network: direct context-aware modeling for end-to-end monaural speech separation. arXiv preprint arXiv:2007.13975.
  • [3] H. Cuesta, E. Gómez Gutiérrez, A. Martorell Domínguez, and F. Loáiciga (2018) Analysis of intonation in unison choir singing.
  • [4] C. Drake and C. Palmer (1993) Accent structures in music performance. Music perception 10 (3), pp. 343–378.
  • [5] S. Gururani and A. Lerch (2017) Mixing secrets: a multi-track dataset for instrument recognition in polyphonic music. Proc. ISMIR-LBD, pp. 1–2.
  • [6] H. Hsiang-Yu (2020) On the improvement of singing voice separation using u-net. Master’s Thesis, National Taiwan University.
  • [7] A. Jansson, E. Humphrey, N. Montecchio, R. Bittner, A. Kumar, and T. Weyde (2017) Singing voice separation with deep u-net convolutional networks.
  • [8] J. W. Kim, J. Salamon, P. Li, and J. P. Bello (2018) Crepe: a convolutional representation for pitch estimation. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 161–165.
  • [9] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey (2019) SDR–half-baked or well done?. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 626–630.
  • [10] Y. Luo, Z. Chen, and T. Yoshioka (2020) Dual-path rnn: efficient long sequence modeling for time-domain single-channel speech separation. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 46–50.
  • [11] Y. Luo and N. Mesgarani (2018) Tasnet: time-domain audio separation network for real-time, single-channel speech separation. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 696–700.
  • [12] Z. Rafii, A. Liutkus, F. Stöter, S. I. Mimilakis, and R. Bittner (2017) MUSDB18-a corpus for music separation.
  • [13] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241.
  • [14] D. Stoller, S. Ewert, and S. Dixon (2018) Wave-u-net: a multi-scale neural network for end-to-end audio source separation. arXiv preprint arXiv:1806.03185.
  • [15] E. Vincent, R. Gribonval, and C. Févotte (2006) Performance measurement in blind audio source separation. IEEE transactions on audio, speech, and language processing 14 (4), pp. 1462–1469.
  • [16] J. Wang and J. R. Jang (2021) On the preparation and validation of a large-scale dataset of singing transcription. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 276–280.