Addressing the confounds of accompaniments in singer identification

by   Tsung-Han Hsieh, et al.
Academia Sinica

Identifying singers is an important task with many applications. However, the task remains challenging due to several issues. One major issue relates to the confounding factors from the background instrumental music that is mixed with the vocals in music production. A singer identification model may learn to extract non-vocal features from the instrumental part of the songs if a singer only sings in certain musical contexts (e.g., genres). The model therefore cannot generalize well when the singer sings in unseen contexts. In this paper, we attempt to address this issue. Specifically, we employ open-unmix, an open-source tool with state-of-the-art performance in source separation, to separate the vocal and instrumental tracks of music. We then investigate two means to train a singer identification model: learning from the separated vocals only, or from an augmented set of data where we "shuffle-and-remix" the separated vocal tracks and instrumental tracks of different songs to artificially make the singers sing in different contexts. We also incorporate melodic features learned from the vocal melody contour for better performance. Evaluation results on a benchmark dataset called artist20 show that this data augmentation method greatly improves the accuracy of singer identification.








1 Introduction

Singer identification (SID), a.k.a., artist identification, is a classic task in the field of music information retrieval (MIR). It aims at identifying the performing singers in given audio samples to facilitate management of music libraries. When properly trained, an SID model also learns the embedding of singing voices that can be used in downstream singing-related applications such as similarity search, playlist generation, or singing synthesis [3, 8, 26, 9, 6]. We refer readers to [6] for a recent overview of research on singing voice analysis and processing, and the role of SID in related tasks.

Figure 1: The architecture of the proposed convolutional recurrent neural network with melody (CRNNM) model for singer identification. The inputs are mel-spectrograms and melody contours extracted by CREPE [7]. The model cascades convolutional blocks, gated recurrent units (GRUs), and a dense layer. The "+" symbol stands for channel-wise concatenation.

Despite its importance, SID is to date not a settled task [6]. There are at least two main challenges. First, as human beings share a similar mechanism for producing sounds [24], the difference between the singing voices of two singers may not always be obvious. This becomes more severe as the number of singers to be considered increases. Second, due to the difficulty of acquiring solo recordings of singers, the training data for SID usually consists of audio recordings of singers singing over instrumental accompaniment tracks. The vocal track and instrumental track of a song are usually mixed in such a recording [15]. The presence of instrumental accompaniment not only makes it difficult for an SID model to extract only vocal-related features from the audio, but also introduces confounding factors [22] that hurt the model's generalizability. This is especially the case as singers usually have their preferred musical genres or styles. In trying to reproduce as many of the ground-truth artist labels of a training dataset as possible (e.g., while minimizing a classification loss function), an SID model may learn to capitalize on non-vocal features, which is not what the task is actually about.

We intend to address the second challenge in this paper. Intuitively, the challenge can be tackled by enhancing, or isolating, the vocal part of a song, to minimize the effect of the instrumental part on the SID model. While singing voice enhancement and separation were difficult just a few years ago [4, 19], state-of-the-art models can now perform the task with low distortion, interference, and artifacts [15, 21, 11], thanks to advances in deep learning. Using source separation (SS) to improve SID has therefore become feasible.

While the idea of using SS to improve SID has been attempted before [12, 19, 23, 20], our work differs from the prior art in two ways. First, except for the concurrent work [20], the prior art that we are aware of did not use deep learning-based SS models. In contrast, in our work both the SS model and the SID model employ deep learning. Specifically, we use open-unmix [21], an open-source three-layer bidirectional deep recurrent neural network for SS. Moreover, we build our SID model upon the implementation of a convolutional recurrent neural network made available by Nasrullah and Zhao [13], which attains the highest song-level F1-score of 0.67 on the per-album split of the artist20 dataset [5], a standard dataset for SID. As neural networks may find their own way of extracting relevant features or patterns from the input, it remains to be studied whether the use of SS can improve the performance of a deep learning-based SID model.

Second, unlike the prior art (including [20]), we investigate one additional way to employ SS to improve SID. Given the separated vocal tracks and instrumental tracks of the audio recordings in the training set, we perform "data augmentation" [28, 18, 25, 10] by randomly shuffling the separated tracks of different songs and then remixing them. For example, we remix the vocal part of a song from one singer with the instrumental part of another song from a different singer. In this way, we artificially make the singers sing over a variety of accompaniment tracks, and may therefore break the "bonds" between the vocal and accompaniment tracks, mitigating the confounds from the accompaniments. We intend to empirically validate the effectiveness of such a data augmentation method, which can be said to be task-specific to SID.
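The shuffle-and-remix step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' released code; it assumes the separated tracks have already been loaded as equal-length waveform arrays (in practice one would trim or pad to a common length):

```python
import numpy as np

def shuffle_and_remix(vocals, accompaniments, labels, rng):
    """Remix each separated vocal track with the accompaniment of a
    randomly chosen song; the singer label follows the vocal track."""
    perm = rng.permutation(len(vocals))
    remixed = [v + accompaniments[j] for v, j in zip(vocals, perm)]
    return remixed, list(labels)  # labels are unchanged

# toy usage with three 4-sample "songs"
rng = np.random.default_rng(0)
vox = [np.full(4, float(i)) for i in range(3)]
acc = [np.full(4, 10.0 * i) for i in range(3)]
remixed, labels = shuffle_and_remix(vox, acc, ["s0", "s1", "s2"], rng)
```

Applied to the separated tracks of artist20, this yields the Remix set described in Section 2.2.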

As a secondary contribution, we explore adding to our SID model features extracted from the vocal melody contour, which is related to singing timbre [14]. While the extraction of the vocal melody contour is done using CREPE [7], an open-source tool with state-of-the-art performance in melody extraction, we use a stack of convolutional layers and gated recurrent unit (GRU) layers [2] to learn features not only from the mel-spectrogram but also from the melody contour.¹

¹ Features extracted from the melody contour have been shown useful in many other MIR tasks [17, 16, 14, 1]. However, we note that most existing work used hand-crafted features, rather than features learned by a neural network.

Figure 1 shows the architecture of our SID model, dubbed convolutional recurrent neural network with melody (CRNNM). Code available at

Figure 2: A diagram of the “shuffle-and-remix” data augmentation method, which has been used before for SS [10].

2 Methodology

2.1 Singer Identification (SID) Models

We consider as the baseline model the convolutional recurrent neural network proposed in [13], which represents the state of the art for SID on the artist20 dataset. This model uses a stack of four convolutional layers, two GRU layers, and one dense (i.e., fully-connected) layer, as depicted in Fig. 1, but without the lower melody-related branch. We follow exactly the same design (i.e., number of filters, kernel sizes, activation functions, loss function, optimizer, learning rate, etc.) as [13]. We refer to this model as 'CRNN' below.

The proposed CRNNM model extends the CRNN model in two ways. First, in addition to the mel-spectrogram, we use CREPE [7] to extract the melody contour from the mixture audio recordings and establish an additional convolutional branch to learn melodic features for SID. For simplicity, we use the same design for the mel-spectrogram branch and the melody contour branch. Second, instead of using the mel-spectrogram of the mixture audio recordings, we employ open-unmix [21] to remove the instrumental part of the music, and use the proposed data augmentation technique to increase the size of the training data, as described below.

As CRNNM has more parameters than CRNN, in our experiment we also implement a larger variant of CRNN that has a similar number of parameters as CRNNM.
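For concreteness, a minimal PyTorch sketch of the two-branch CRNNM design of Figure 1 is given below. The filter counts, GRU width, and input shapes here are illustrative assumptions, not the exact hyperparameters of [13], which our actual experiments follow:

```python
import torch
import torch.nn as nn

class ConvBranch(nn.Module):
    """Four conv blocks, as in the upper/lower branches of Figure 1."""
    def __init__(self, in_ch=1, widths=(64, 128, 128, 128)):
        super().__init__()
        layers, ch = [], in_ch
        for w in widths:
            layers += [nn.Conv2d(ch, w, 3, padding=1), nn.ELU(), nn.MaxPool2d(2)]
            ch = w
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

class CRNNM(nn.Module):
    def __init__(self, n_classes=20, freq_bins=128):
        super().__init__()
        self.spec_branch = ConvBranch()    # mel-spectrogram branch
        self.melody_branch = ConvBranch()  # melody-contour branch
        f = freq_bins // 16                # frequency bins left after 4 poolings
        self.gru = nn.GRU(2 * 128 * f, 32, num_layers=2, batch_first=True)
        self.fc = nn.Linear(32, n_classes)

    def forward(self, spec, melody):
        a, b = self.spec_branch(spec), self.melody_branch(melody)
        x = torch.cat([a, b], dim=1)          # "+": channel-wise concatenation
        x = x.permute(0, 3, 1, 2).flatten(2)  # (batch, time, channels * freq)
        out, _ = self.gru(x)
        return self.fc(out[:, -1])            # classify from the last time step

model = CRNNM()
logits = model(torch.randn(2, 1, 128, 216), torch.randn(2, 1, 128, 216))
```

Both inputs are 128-bin time-frequency images of a 5-sec segment; the channel-wise concatenation lets the GRU see spectral and melodic features jointly.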

2.2 Data Augmentation: Separate, Shuffle, and Remix

Data augmentation synthetically creates training examples to improve generalizability and to help capture invariances in the data [28]. This technique has long been popular in the machine learning community. It has also been shown beneficial for MIR tasks such as singing voice detection and source separation [18, 25, 10], but not yet for SID.

As discussed in [18], data augmentation techniques for MIR can be classified into data-independent, audio-specific, and task-specific methods. Data-independent methods, like dropout, achieve augmentation from the model's perspective and are thus data-agnostic. Audio-specific methods, like pitch shifting and time stretching, transform the audio data directly. Task-specific methods incorporate task-specific prior knowledge into the training data. For example, it is known that remixing sources from different songs improves the performance of SS models [25, 10].
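As a toy illustration of the audio-specific category (hypothetical, not a method used in this paper), naive resampling of a waveform changes speed and pitch together; proper time stretching or pitch shifting decouples the two, e.g., via a phase vocoder as in librosa:

```python
import numpy as np

def resample_speed(x, factor):
    """Crude speed change by linear-interpolation resampling: the output
    is len(x)/factor samples long, and pitch shifts along with duration.
    (Real pitch shifting / time stretching keeps one of the two fixed.)"""
    n_out = int(round(len(x) / factor))
    positions = np.linspace(0, len(x) - 1, n_out)
    return np.interp(positions, np.arange(len(x)), x)

halved = resample_speed(np.arange(100.0), 2.0)  # twice as fast, half as long
```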


Our approach is motivated by [10]. Our conjecture is that the same shuffle-and-remix technique can also be used for SID: when the vocal part of a song is mixed with the instrumental part of another song, its singer label should remain the same. This process is illustrated in Figure 2. In this light, we create three additional datasets, Vocal-only, Remix, and Data aug, to evaluate our model.

Origin: The original audio recordings of artist20 [5], containing six albums per artist for 20 artists, with 1,413 tracks in total. Vocals and accompaniments are mixed.

Vocal-only: The vocal tracks separated by open-unmix [21]. In other words, all the accompaniments are removed.

Remix: The dataset is generated by randomly mixing the separated vocal and instrumental tracks of artist20. The size of this dataset is the same as Origin and Vocal-only.

Data aug: Combination of the three sets above.

2.3 Implementation Details

In the SID literature, data splitting can be done in two ways: song-split or album-split. The former splits a dataset by randomly assigning songs to the training, validation, and test subsets, whereas the latter makes sure that songs from the same album fall entirely within the training, validation, or test split. It is known [13] that a song-split may leak production details associated with an album across the training and testing subsets, giving an SID model additional clues for classification. Accordingly, the accuracy under a song-split may be overly optimistic and tends to be higher than that under an album-split. We therefore only consider the album-split in our work.
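An album-split can be sketched as follows: shuffle the albums, then assign whole albums to the splits so that no album straddles two subsets. The split fractions below are illustrative assumptions:

```python
import random

def album_split(songs, val_frac=0.1, test_frac=0.2, seed=0):
    """Assign whole albums to train/val/test so no album is shared
    across splits (avoids leaking per-album production cues)."""
    albums = sorted({s["album"] for s in songs})
    random.Random(seed).shuffle(albums)
    n_test = int(len(albums) * test_frac)
    n_val = int(len(albums) * val_frac)
    test_a = set(albums[:n_test])
    val_a = set(albums[n_test:n_test + n_val])
    split = {"train": [], "val": [], "test": []}
    for s in songs:
        key = "test" if s["album"] in test_a else "val" if s["album"] in val_a else "train"
        split[key].append(s)
    return split

# toy usage: 50 songs spread over 10 albums
songs = [{"album": f"A{i % 10}", "title": f"t{i}"} for i in range(50)]
split = album_split(songs)
```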

Under the album-split, we consider and compare the result of models trained using the four types of data listed by the end of Section 2.2. The same test set (i.e., the Origin type) is used.

Following [13], we cut the songs into 5-sec segments for training a 20-class classification model. The final prediction for a song is made by majority voting over the per-segment results. For evaluation, we consider both "per 5-sec segment" and "per song" F1 scores; higher is better for both.
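The segment-then-vote pipeline can be sketched as follows (illustrative helper names, assuming a mono waveform array):

```python
import numpy as np
from collections import Counter

def segment(wave, sr, seconds=5):
    """Cut a waveform into non-overlapping fixed-length chunks,
    dropping the incomplete tail."""
    hop = sr * seconds
    return [wave[i:i + hop] for i in range(0, len(wave) - hop + 1, hop)]

def song_prediction(segment_preds):
    """Majority vote over per-segment class predictions."""
    return Counter(segment_preds).most_common(1)[0][0]

# toy usage: a 10.5-sec "song" at 100 Hz yields two full 5-sec segments
chunks = segment(np.zeros(1050), sr=100, seconds=5)
voted = song_prediction(["a", "b", "a"])
```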

For CRNNM, we quantize the frequency axis of the melody contour into 128 bins before feeding it to the subsequent layers.
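One way to realize this quantization, under the assumption of MIDI-style semitone bins (the exact binning is an implementation choice not fixed by the paper), is to map each frame's F0 in Hz to one of 128 one-hot rows, leaving unvoiced frames as zero columns:

```python
import numpy as np

def quantize_contour(f0_hz, n_bins=128):
    """Quantize a frame-wise F0 contour (Hz) into an (n_bins, T)
    one-hot image; unvoiced frames (f0 <= 0) stay all-zero."""
    f0 = np.asarray(f0_hz, dtype=float)
    out = np.zeros((n_bins, f0.size), dtype=np.float32)
    voiced = f0 > 0
    midi = 69 + 12 * np.log2(f0[voiced] / 440.0)  # Hz -> MIDI pitch number
    bins = np.clip(np.round(midi), 0, n_bins - 1).astype(int)
    out[bins, np.nonzero(voiced)[0]] = 1.0
    return out

contour = quantize_contour([440.0, 0.0, 220.0])  # A4, unvoiced, A3
```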

Figure 3: Scatter plots showing the likelihood score for the correct singer of a test song across its 5-sec segments, as predicted by the CRNN model trained on 'Data aug.', for (a) 05-Winter.mp3, (b) 07-Calypso.mp3, and (c) 01-Black Friday.mp3. The segments in each plot are sorted from left to right by vocalness, the average decibel value of the vocal-separated part of that segment. The three plots show the same trend: the model does predict the correct singer for segments with high average vocal dB, but not for the non-vocal segments.

3 Experiments

The models are evaluated using artist20 [5] under the album split, averaging the F1 scores of three independent runs.

Model           Data         F1 / 5-sec   F1 / song
CRNN            Origin          0.50         0.67
                Vocal-only      0.39         0.61
                Remix           0.39         0.65
                Data aug.       0.47         0.74
CRNN (larger)   Origin          0.54         0.67
                Vocal-only      0.48         0.71
                Remix           0.46         0.71
                Data aug.       0.50         0.74
CRNNM           Origin          0.53         0.69
                Vocal-only      0.42         0.66
                Remix           0.39         0.65
                Data aug.       0.45         0.75

Table 1: Average testing F1 score on the artist20 dataset; note that 'CRNN + Origin' corresponds to the model in [13].

3.1 Experimental results

From Table 1, we see that CRNNM performs the best among the three models. This result shows that using melody contour as additional features helps SID.

Our 'CRNNM + Data aug' model achieves a song-level F1 score of 0.75, substantially higher than the 0.67 obtained by the best existing model ('CRNN + Origin') [13] on artist20.

Table 1 also shows that, for all three models, training on Data aug outperforms training on Origin in terms of the song-level result, validating the effectiveness of the data augmentation method. We also note that using Vocal-only performs even worse than using Origin for CRNN and CRNNM, suggesting that models trained on Origin may benefit from the additional (unwanted, confounding) information in the accompaniment. Using Remix alone addresses this issue, but its result is no better than using Origin alone. The combination of the three sets (i.e., Origin, Vocal-only, and Remix) significantly boosts the song-level F1 score.

The F1 score at the 5-sec level is much worse than that at the song level, highlighting the importance of majority voting in aggregating the results. One important reason for this is the presence of non-vocal parts in a song. To demonstrate this, we regard "vocalness" as the mean volume of the vocal-separated clip for each 5-sec segment, and then compute the correlation between the vocalness and the prediction of the ground-truth singer for test songs by our CRNN model trained on the Data aug training set. The resulting correlation coefficient (0.39) indicates a weak positive relationship between these two factors. Figure 3 shows the result for three random test songs. We see that the model assigns high likelihood scores to the correct singer for the vocal segments (i.e., segments with higher average dB) but not for the non-vocal segments. We therefore suggest that 1) song-level accuracy is more important than 5-sec-level accuracy, and 2) future work may consider employing a vocal/non-vocal detector (e.g., [18]) in both the training and testing stages.
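The vocalness measure and its correlation with the correct-singer likelihood can be computed as follows (a sketch; the small eps guards the logarithm for silent segments):

```python
import numpy as np

def vocalness_db(vocal_segment, eps=1e-10):
    """Average level (dB) of a separated-vocal segment: a crude
    proxy for whether the singer is active in that segment."""
    rms = np.sqrt(np.mean(np.asarray(vocal_segment) ** 2))
    return 20 * np.log10(rms + eps)

def correlation(vocalness, likelihood):
    """Pearson correlation between per-segment vocalness and the
    model's likelihood score for the correct singer."""
    return float(np.corrcoef(vocalness, likelihood)[0, 1])
```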

Figure 4: Visualization of the embeddings (projected into 2D by t-SNE) generated by the models trained on the Origin training set for the testing samples (5s segment; under the album split). Upper: the result of CRNN (i.e., the model shown in Figure 1 but without the melody branch); lower: the result of CRNNM (i.e., the model shown in Figure 1).

3.2 Visualization

After training, we can regard the output of the final fully-connected layer as an embedding of the input data. Visualizing these representations gives us some idea of the behavior and performance of our SID models. We therefore employ t-distributed stochastic neighbor embedding (t-SNE) [27] to project the computed embedding vectors onto a 2-D space for visualization, and to explore the structure of the predictions. Due to space limits, only the results of the CRNN and CRNNM models trained on Origin are presented. The audio samples of the test set are drawn and colored according to the ground-truth artist labels in Figure 4. It can be seen from the result of CRNNM that samples from different singers are fairly well separated in the embedding space.² The result of CRNN looks less separated, suggesting again that a model taking additional melody features may do SID better.

² We note that a similar visualization of the learned embedding space is also provided in [13]. However, they consider the song-split setting in their visualization, while we consider the more challenging yet realistic album-split. Therefore, although the embeddings shown in their work seem even more separated, we still consider the result here promising.
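The projection step can be sketched with scikit-learn, here using random stand-ins for the learned embeddings (the actual plot uses the 32-dimensional embeddings of the 5-sec test segments):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# stand-in for 32-dim segment embeddings of three well-separated singers
emb = np.concatenate([rng.normal(loc=c, size=(20, 32)) for c in (0.0, 3.0, 6.0)])
# project to 2-D; perplexity must be smaller than the number of samples
xy = TSNE(n_components=2, perplexity=10, init="pca", random_state=0).fit_transform(emb)
# xy holds one 2-D point per segment, ready for a colored scatter plot
```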

4 Conclusions

This paper proposes a new SID model that extends CRNN with melody information extracted by CREPE [7]. We also adopt a data augmentation method called shuffle-and-remix, enabled by source separation [21], to mitigate the confounds from the accompaniments. Our evaluation shows that both the melody information and the data augmentation improve the result, the latter especially. Future work includes three directions: first, using a vocal detector [18] as a pre-filter for SID; second, investigating replacing convolutions with GRUs in the melody branch, since the melody contour is a time series; and lastly, trying other data augmentation methods such as pitch shifting, time stretching, or a shuffle-and-remix variant that considers key and tempo while remixing.


  • [1] R. M. Bittner, J. Salamon, J. J. Bosch, and J. P. Bello (2017) Pitch contours as a mid-level representation for music informatics. In Proc. AES Conf. Semantic Audio, Cited by: footnote 1.
  • [2] K. Cho, B. Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078. Cited by: §1.
  • [3] A. Demetriou, A. Jansson, A. Kumar, and R. M. Bittner (2018) Vocals in music matter: the relevance of vocals in the minds of listeners. In Proc. Int. Society for Music Information Retrieval Conference, pp. 514–520. Cited by: §1.
  • [4] J. Durrieu, G. Richard, and B. David (2009) An iterative approach to monaural musical mixture de-soloing. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, pp. 105–108. Cited by: §1.
  • [5] D. Ellis (2007) Classifying music audio with timbral and chroma features. In Proc. Int. Society for Music Information Retrieval Conference, Note: [Online] Cited by: §1, §2.2, §3.
  • [6] E. J. Humphrey, S. Reddy, P. Seetharaman, A. Kumar, R. M. Bittner, A. Demetriou, S. Gulati, A. Jansson, T. Jehan, B. Lehner, A. Krupse, and L. Yang (2019) An introduction to signal processing for singing-voice analysis: high notes in the effort to automate the understanding of vocals in music. IEEE Signal Processing Magazine 36 (1), pp. 82–94. Cited by: §1, §1.
  • [7] J.-W. Kim, J. Salamon, P. Li, and J. P. Bello (2018) CREPE: a convolutional representation for pitch estimation. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, Note: [Online] Cited by: Figure 1, §1, §2.1, §4.
  • [8] K. Lee and J. Nam (2019) Learning a joint embedding space of monophonic and mixed music signals for singing voice. In Proc. Int. Society for Music Information Retrieval Conference, Cited by: §1.
  • [9] J.-Y. Liu, Y.-H. Chen, and Y.-H. Yang (2019) Score and lyrics-free singing voice generation. arXiv preprint arXiv:1912.11747. Cited by: §1.
  • [10] J.-Y. Liu and Y.-H. Yang (2018) Denoising auto-encoder with recurrent skip connections and residual regression for music source separation. In Proc. IEEE Int. Conf. Machine Learning and Applications, Cited by: Figure 2, §1, §2.2, §2.2, §2.2.
  • [11] J.-Y. Liu and Y.-H. Yang (2019) Dilated convolution with dilated GRU for music source separation. In Proc. Int. Joint Conf. Artificial Intelligence, pp. 4718–4724. Cited by: §1.
  • [12] A. Mesaros, T. Virtanen, and A. Klapuri (2007) Singer identification in polyphonic music using vocal separation and pattern recognition methods. In Proc. Int. Society for Music Information Retrieval Conference, Cited by: §1.
  • [13] Z. Nasrullah and Y. Zhao (2019) Musical artist classification with convolutional recurrent neural networks. In Proc. Int. Joint Conf. Neural Network, Note: [Online] Cited by: §1, §2.1, §2.3, §2.3, §3.1, Table 1, footnote 2.
  • [14] M. Panteli, R. Bittner, J. P. Bello, and S. Dixon (2017) Towards the characterization of singing styles in world music. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, Cited by: §1, footnote 1.
  • [15] Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, D. FitzGerald, and B. Pardo (2018) An overview of lead and accompaniment separation in music. IEEE/ACM Transactions on Audio, Speech, and Language Processing 26 (8), pp. 1307–1335. Cited by: §1, §1.
  • [16] B. Rocha, R. Panda, and R. P. Paiva (2013) Music emotion recognition: the importance of melodic features. In Proc. Int. Workshop on Machine Learning and Music, Cited by: footnote 1.
  • [17] J. Salamon, B. Rocha, and E. Gomez (2012) Musical genre classification using melody features extracted from polyphonic music signals. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, Cited by: footnote 1.
  • [18] J. Schlüter and T. Grill (2015) Exploring data augmentation for improved singing voice detection with neural networks. In Proc. Int. Society for Music Information Retrieval Conference, Cited by: §1, §2.2, §2.2, §3.1, §4.
  • [19] C.-Y. Sha, Y.-H. Yang, and H. H. Chen (2013) Singing voice timbre classification of Chinese popular music. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, Cited by: §1, §1.
  • [20] B. Sharma, R. K. Das, and H. Li (2019) On the importance of audio-source separation for singer identification in polyphonic music. In Proc. INTERSPEECH, pp. 2020–2024. Cited by: §1, §1.
  • [21] F.-R. Stöter, S. Uhlich, A. Liutkus, and Y. Mitsufuji (2019) Open-unmix - A reference implementation for music source separation. Journal of Open Source Software. Note: [Online] Cited by: §1, §1, §2.1, §2.2, §4.
  • [22] B. L. Sturm (2014) A simple method to determine if a music information retrieval system is a “horse”. IEEE Transactions on Multimedia 16 (6), pp. 1636–1644. Cited by: §1.
  • [23] L. Su and Y.-H. Yang (2013) Sparse modeling for artist identification: exploiting phase information and vocal separation. In Proc. Int. Society for Music Information Retrieval Conference, pp. 349–354. Cited by: §1.
  • [24] J. Sundberg (1989) The science of the singing voice. Northern Illinois University Press. Cited by: §1.
  • [25] S. Uhlich, M. Porcu, F. Giron, M. Enenkl, T. Kemp, N. Takahashi, and Y. Mitsufuji (2017) Improving music source separation based on deep neural networks through data augmentation and network blending. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, Cited by: §1, §2.2.
  • [26] M. Umbert, J. Bonada, M. Goto, T. Nakano, and J. Sundberg (2015) Expression control in singing voice synthesis: features, approaches, evaluation, and challenges. IEEE Signal Processing Magazine 32 (6), pp. 55–73. Cited by: §1.
  • [27] L. van der Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9, pp. 2579–2605. Cited by: §3.2.
  • [28] S. C. Wong, A. Gatt, V. Stamatescu, and M. D. McDonnell (2016) Understanding data augmentation for classification: when to warp?. In Proc. Int. Conf. Digital Image Computing: Techniques and Applications, Cited by: §1, §2.2.