
Simultaneous Separation and Transcription of Mixtures with Multiple Polyphonic and Percussive Instruments

by   Ethan Manilow, et al.

We present a single deep learning architecture that can both separate an audio recording of a musical mixture into constituent single-instrument recordings and transcribe these instruments into a human-readable format at the same time, learning a shared musical representation for both tasks. This novel architecture, which we call Cerberus, builds on the Chimera network for source separation by adding a third "head" for transcription. By training each head with different losses, we are able to jointly learn how to separate and transcribe up to 5 instruments in our experiments with a single network. We show that the two tasks are highly complementary and, when learned jointly, lead to Cerberus networks that are better at both separation and transcription and generalize better to unseen mixtures.


1 Proposed method

Our proposed method simultaneously separates and transcribes musical mixtures using a deep net called Cerberus (depicted in Fig 1), a “three-headed” network, where each head has a different output and a different objective function. The key idea is to first transform the input representation (e.g. a spectrogram) via shared processing layers into a learned representation useful for both transcription and separation. The learned representation can then be processed by smaller networks that are specialized for separation or transcription.

1.1 Source separation

Assume an audio mixture is in a time-frequency representation, such as a short-time Fourier transform (STFT), represented by a matrix $X \in \mathbb{C}^{T \times F}$. Here, each element $X_{t,f}$ indicates the magnitude and phase of the mixture at time $t$ and frequency $f$. We perform a variant of mask-based source separation, where the goal is to make a non-negative mask matrix $M_i \in [0, 1]^{T \times F}$ for each sound source $i$, with values normalized to the interval $[0, 1]$. Each element $M_{i,t,f}$ indicates the degree to which the energy in the auditory mixture at time $t$ and frequency $f$ is due to source $i$. To isolate source $i$, one can element-wise multiply the magnitude of the STFT by the mask: $\hat{S}_i = M_i \odot |X|$.
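The masking step can be illustrated with a minimal sketch. It assumes a tiny hand-written magnitude spectrogram and two made-up masks; the function name `apply_mask` and every value are hypothetical, chosen only for illustration.

```python
# Illustrative sketch of mask-based separation on a toy magnitude spectrogram.
# All names and numbers are hypothetical; a real system would use STFT arrays.

def apply_mask(magnitude, mask):
    """Element-wise multiply a magnitude spectrogram by a mask in [0, 1]."""
    return [[m * x for m, x in zip(mask_row, mag_row)]
            for mask_row, mag_row in zip(mask, magnitude)]

# Toy mixture magnitude: 2 time frames x 3 frequency bins.
mixture_mag = [[1.0, 2.0, 4.0],
               [0.5, 0.0, 3.0]]

# One mask per source; here the two masks sum to 1 in every bin.
mask_a = [[1.0, 0.5, 0.25],
          [0.0, 0.5, 1.0]]
mask_b = [[1.0 - v for v in row] for row in mask_a]

source_a = apply_mask(mixture_mag, mask_a)
source_b = apply_mask(mixture_mag, mask_b)
```

Because the two toy masks sum to one in every bin, the two masked magnitudes add back up to the mixture magnitude.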

To train a deep net to provide output useful for mask-based source separation, ground truth training mixtures and their corresponding ideal source masks are provided and a loss function is used to measure the difference between network output and the ideal output. We build on two similar prior works: deep clustering [9] and the Chimera architecture [12].

In deep clustering, a neural network is trained to map each time-frequency bin in the mixture spectrogram $X$ to a point in a higher-dimensional embedding space where bins that primarily contain energy from the same source are near each other and bins that belong to different sources are far from each other. Call this embedding $V$. Once trained, the network is used to embed a new magnitude spectrogram representing an auditory scene. Mask assignments can then be made by clustering time-frequency points in the higher-dimensional embedding $V$, assigning elements in the same cluster to the same source. With classical deep clustering, a clustering algorithm such as K-means is applied to the embedding space to determine the source assignments for every time-frequency point, which are then used to make a mask.
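As a rough sketch of the clustering step, the following runs a plain K-means over made-up two-dimensional embeddings and turns the cluster labels into binary masks. The embeddings, the helper names, and the choice to initialize the centers from the first points are all assumptions for illustration; a trained network would supply the embeddings.

```python
# Toy sketch: cluster per-bin embeddings with K-means, then build binary masks.
# The embeddings are made up; a trained network would produce them.

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, centers, iterations=10):
    """Plain K-means (Lloyd's algorithm) starting from the given centers."""
    centers = [list(c) for c in centers]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(len(centers)),
                      key=lambda k: squared_distance(p, centers[k]))
                  for p in points]
        # Update step: each center moves to the mean of its assigned points.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels

# Embeddings for a 2-frame x 3-bin spectrogram, flattened in row-major order.
embeddings = [(0.9, 0.1), (0.1, 0.9), (0.8, 0.2),
              (0.2, 0.8), (0.95, 0.05), (0.1, 0.95)]
num_frames, num_bins, num_sources = 2, 3, 2

# Deterministic initialization (an assumption made for reproducibility).
labels = kmeans(embeddings, centers=embeddings[:num_sources])

# One binary mask per cluster, reshaped back to frames x bins.
masks = [[[1.0 if labels[t * num_bins + f] == k else 0.0
           for f in range(num_bins)]
          for t in range(num_frames)]
         for k in range(num_sources)]
```

Each time-frequency bin ends up in exactly one mask, which is the hard-assignment behaviour of classical deep clustering described above.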

The Chimera architecture for audio source separation combines deep clustering with a more traditional signal reconstruction loss [7]. In Chimera, there are two “heads” attached to a single “body”. One head is trained with the deep clustering loss function while the other is trained to create the mask. Both heads are trained simultaneously. In Chimera, the mask inference head directly creates a mask that is element-wise multiplied with the mixture spectrogram to recover the sources. In this case, the deep clustering head acts as a regularizer for the mask inference head. The mask inference head is trained using a phase sensitive approximation (PSA) [7] (or similar).
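The deep clustering loss is not written out in this section. A minimal sketch, assuming the affinity-based form commonly used in the deep clustering literature, $\|VV^\top - YY^\top\|_F^2$ (embeddings $V$, one-hot source assignments $Y$, one row per time-frequency bin), might look as follows. The function names and toy data are hypothetical.

```python
# Sketch of an affinity-based deep clustering loss, assuming the form
# ||V V^T - Y Y^T||_F^2. Names and values are illustrative only.

def affinity(rows):
    """Pairwise inner products between all rows (an N x N matrix)."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]

def deep_clustering_loss(embeddings, assignments):
    """Squared Frobenius distance between estimated and ideal affinities."""
    estimated = affinity(embeddings)
    ideal = affinity(assignments)
    return sum((e - i) ** 2
               for est_row, ideal_row in zip(estimated, ideal)
               for e, i in zip(est_row, ideal_row))

# Three time-frequency bins, two sources (one-hot ground truth).
assignments = [(1, 0), (0, 1), (0, 1)]

perfect = [(1.0, 0.0), (0.0, 1.0), (0.0, 1.0)]   # matches the ground truth
confused = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]  # bin 1 embedded with bin 0

loss_perfect = deep_clustering_loss(perfect, assignments)
loss_confused = deep_clustering_loss(confused, assignments)
```

The loss is zero when bins from the same source share an embedding direction and grows when bins from different sources are embedded together.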

1.2 Adding a Transcription Head: Cerberus

Since Chimera networks have been shown to outperform networks that do only deep clustering or only mask inference, we chose to use both heads in our work. We propose adding a third head to a Chimera network for further multitask learning. We call the proposed architecture a Cerberus network. The new head is used for the automatic transcription of musical mixtures containing multiple polyphonic instruments. Altogether, the Cerberus architecture has three heads: a deep clustering head producing embeddings, a mask inference head creating masks that are applied to mixture spectrograms, and a transcription head that produces a piano roll transcription for each instrument in the mixture. The transcription estimate is a real-valued matrix with shape $T \times P \times I$, where $T$ is the number of time frames (aligned with STFT time frames), $P$ is the number of possible pitches, and $I$ is the number of instruments to transcribe. Once trained, the output of this head is quantized to binary to produce a piano roll transcription.
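The final quantization step can be sketched on a toy estimate with time frames, pitches, and instruments as the three axes. The threshold of 0.5 and all values below are illustrative assumptions, not settings taken from the experiments.

```python
# Toy sketch: quantize a real-valued transcription estimate to a binary
# piano roll. The 0.5 threshold and all values are illustrative assumptions.

def binarize(piano_roll, threshold=0.5):
    """Threshold a frames x pitches x instruments nested list to 0/1."""
    return [[[1 if value >= threshold else 0 for value in pitch]
             for pitch in frame]
            for frame in piano_roll]

# 2 time frames x 3 pitches x 2 instruments.
estimate = [
    [[0.9, 0.1], [0.2, 0.7], [0.4, 0.4]],
    [[0.8, 0.0], [0.6, 0.3], [0.1, 0.95]],
]

binary_roll = binarize(estimate)
```

Reading along the last axis, `binary_roll[t][p][i]` is 1 when instrument `i` is judged to be sounding pitch `p` at frame `t`.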

             Separation   Transcription
Loss Type    SDR (dB)     P      R      F1
                          0.48   0.43   0.44
             9.3          0.48   0.41   0.43
             9.8          0.51   0.46   0.47
             10.0         0.51   0.45   0.47
Table 1: Cerberus networks trained and tested on piano + guitar mixtures. Each row is a distinct network, trained with a distinct combination of three loss functions: Deep Clustering (DC), Mask Inference (MI) and Transcription (TR). The weight applied to a loss function is shown as a subscript. Evaluation measures for the transcription task are precision (P), recall (R) and F1. Evaluation for separation is scale-dependent source to distortion ratio (SDR). Higher values are better. The value in each cell is on the testing data, averaged across both instruments. Grey cells indicate the network was not trained for that task.

The system is trained using a weighted linear combination of three loss functions:

$$L = \alpha L_{DC} + \beta L_{MI} + \gamma L_{TR} \qquad (1)$$

The deep clustering [9] loss ($L_{DC}$) and the mask inference [7] loss ($L_{MI}$) are those used in the Chimera network. For the transcription loss $L_{TR}$, we found that distance to a MIDI-derived piano-roll score works best as the loss (see the next section for details). For inference, any of the three heads can be used, depending on the task.
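A minimal sketch of the weighted combination follows, assuming the three per-head losses have already been computed as scalars. The function name, the loss values, and the weights are hypothetical placeholders.

```python
# Sketch of a weighted linear combination of three head losses.
# Loss values and weights below are made-up placeholders.

def combined_loss(dc_loss, mi_loss, tr_loss, weights):
    """Return w_dc * dc_loss + w_mi * mi_loss + w_tr * tr_loss."""
    w_dc, w_mi, w_tr = weights
    return w_dc * dc_loss + w_mi * mi_loss + w_tr * tr_loss

# Hypothetical per-batch losses for the three heads.
dc_loss, mi_loss, tr_loss = 0.8, 0.5, 0.02

# All three heads active, with a larger (assumed) weight on transcription.
total_all_heads = combined_loss(dc_loss, mi_loss, tr_loss, (1.0, 1.0, 10.0))

# Zero weight on transcription: only the two separation losses contribute.
total_separation_only = combined_loss(dc_loss, mi_loss, tr_loss, (1.0, 1.0, 0.0))
```

Setting a weight to zero removes that head's contribution to the gradient, which is how a single network definition can cover several training configurations.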

2 Experimental Design

Our experiments are designed to answer two questions. The first is whether learning to simultaneously separate and transcribe using a single network helps or hurts performance on either task, versus learning the tasks independently. The second is whether all three heads are needed for an effective separation and transcription system.

2.1 Datasets and evaluation

To train a Cerberus network, a dataset is required that contains mixtures, isolated sources, and ground truth transcriptions for those sources. Slakh2100 [13] is one such dataset. It consists of 2100 mixtures made with sample-based professional synthesizers, along with isolated sources and accompanying MIDI data for each source. We downsample the audio to 16 kHz. To make an example, we pick a mix, pick a subset of the sources (e.g. piano, guitar, bass) in the mixture, and then pick a 5 second segment where all the desired sources have at least 10 note onsets with MIDI velocity above a threshold. The source audio for the desired sources is combined to make a mixture of just those sources. STFTs with a 1024-point window size and a 256-sample hop are calculated from mixture segments as inputs to the network. The musical score is the accompanying MIDI data binarized with a velocity threshold.
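The input size implied by these settings can be worked out with a short calculation. This sketch assumes a one-sided spectrum and no padding at the segment edges; STFT implementations that center or pad their frames will report a different frame count.

```python
# Arithmetic sketch: STFT dimensions for one 5-second segment at 16 kHz.
# Assumes a one-sided spectrum and no edge padding (both are assumptions).

sample_rate = 16_000      # Hz, after downsampling
segment_seconds = 5
window_size = 1024        # samples per STFT window
hop_size = 256            # samples between successive windows

num_samples = sample_rate * segment_seconds
num_freq_bins = window_size // 2 + 1
num_frames_unpadded = 1 + (num_samples - window_size) // hop_size

print(num_samples, num_freq_bins, num_frames_unpadded)
```

Under these assumptions each segment has 80000 samples and yields 513 frequency bins and 309 frames.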


Using this procedure, we made 4 sets, each with 20000 segments (28 hours) for training, 3000 (4 hours) for validation, and 1000 (1.4 hours) for testing. The instrument combinations for the four sets were piano + guitar (set 1), piano + guitar + bass (set 2), piano + guitar + bass + drums (set 3), and piano + guitar + bass + drums + strings (set 4).
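The quoted durations follow directly from the segment counts and the 5 second segment length, as this small check shows (the dictionary keys are just labels chosen here):

```python
# Arithmetic check: hours of audio per split, given 5-second segments.

segment_seconds = 5
splits = {"train": 20_000, "validation": 3_000, "test": 1_000}

hours = {name: count * segment_seconds / 3600 for name, count in splits.items()}

for name, value in hours.items():
    print(f"{name}: {value:.1f} hours")
```

This gives roughly 27.8, 4.2, and 1.4 hours, consistent with the rounded figures above.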

In addition to the synthesized audio data, we evaluate on recordings of real instruments. To our knowledge, no large dataset exists that contains real-world recordings of mixtures, isolated sources, and ground truth transcriptions. Thus, we make mixtures from two datasets of real solo instrument recordings. The first is MAPS [6] (the MUS partition of both ENSTDkAm and ENSTDkCl), which contains 30 live piano performances of classical music recorded with two microphones (60 clips total) and a Disklavier MIDI recorder. The second dataset is GuitarSet [19], which contains 360 30-second guitar excerpts in 5 styles. We downsample to 16 kHz, select 5 second segments with at least 10 note onsets, and use the same STFT parameters. We randomly selected segments from each dataset to make 1000 instantaneous mixtures with accompanying sources and score data. These mixes are incoherent and highly dissimilar to the data we used to train our network.
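An instantaneous mixture is simply the sample-wise sum of the source signals. The sketch below pairs one randomly chosen segment from each of two toy collections; the segment names, sample values, and random seed are all hypothetical stand-ins for real audio.

```python
import random

# Toy sketch: build an instantaneous (sample-wise sum) mixture from one
# randomly chosen segment per dataset. All data here is made up.

def instantaneous_mix(source_a, source_b):
    """Sum two equal-length signals sample by sample."""
    if len(source_a) != len(source_b):
        raise ValueError("segments must have the same length")
    return [a + b for a, b in zip(source_a, source_b)]

piano_segments = {"piano_0": [0.1, -0.2, 0.3], "piano_1": [0.0, 0.5, -0.5]}
guitar_segments = {"guitar_0": [0.2, 0.2, -0.1], "guitar_1": [-0.3, 0.1, 0.4]}

rng = random.Random(0)  # fixed seed so the pairing is repeatable
piano_id = rng.choice(sorted(piano_segments))
guitar_id = rng.choice(sorted(guitar_segments))

mixture = instantaneous_mix(piano_segments[piano_id], guitar_segments[guitar_id])
```

Because the two segments come from unrelated recordings, the resulting mix is musically incoherent, which matches the description of the real-instrument test set.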

For source separation, we use the scale-dependent source-to-distortion ratio [11] for evaluation. For transcription we use precision, recall, and F1-score of note onsets and offsets using the mir_eval toolbox [14]. These are both commonly used measures in the literature for their respective tasks.
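The two kinds of measure can be sketched as follows. The separation function assumes one common definition of scale-dependent SDR, in which the target energy uses the reference rescaled by its projection coefficient while the error is measured against the unscaled reference. The transcription function starts from match counts only, whereas a toolbox such as mir_eval also performs the note matching itself. All names and numbers are illustrative.

```python
import math

# Sketch of the evaluation measures on toy data. The SD-SDR definition used
# here is an assumption: 10 * log10(||alpha * s||^2 / ||s - s_hat||^2),
# with alpha = <s_hat, s> / ||s||^2.

def sd_sdr(reference, estimate):
    """Scale-dependent SDR in dB for two equal-length signals."""
    dot = sum(r * e for r, e in zip(reference, estimate))
    ref_energy = sum(r * r for r in reference)
    scale = dot / ref_energy
    target_energy = scale ** 2 * ref_energy
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10 * math.log10(target_energy / error_energy)

def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Precision, recall, and F1 from counts of matched and unmatched notes."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

reference = [1.0, 0.0, -1.0, 0.0]
estimate = [0.9, 0.1, -0.9, -0.1]
separation_score = sd_sdr(reference, estimate)

# Hypothetical counts: 8 matched notes, 2 spurious notes, 8 missed notes.
precision, recall, f1 = precision_recall_f1(8, 2, 8)
```

Higher is better for all four numbers; an estimate identical to the reference would have zero error energy, so practical implementations usually guard that case.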

2.2 Networks we evaluate

All the networks we trained use a stack of 4 bidirectional long short-term memory (BLSTM) layers. Each BLSTM has 300 hidden units. We trained each network for 100 epochs using an Adam optimizer with an initial learning rate of 2e-4, a fixed batch size, and a fixed sequence length in frames. Each network had three heads. The first head maps each time-frequency point to a 20-dimensional embedding space, with sigmoid activation and unit-normalization. The second head outputs masks for each of the sources we trained the network to separate (between 2 and 5 masks), with a softmax activation across the masks. The third head outputs transcriptions for each source and has a sigmoid activation. Each transcription specifies which pitches are active and when. For evaluation, we binarized the network’s transcriptions using a static threshold, except for drums, which used a separate static threshold. Each network was initialized with the same set of weights. The only differences between the networks are the training data (which instrument combination to separate) and the weights on the three loss functions: deep clustering (DC), mask inference (MI), and transcription (TR).
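The output activations of the first two heads can be sketched for a single time-frequency bin. The example uses a 3-dimensional embedding rather than 20 dimensions to stay readable, and the pre-activation values are invented.

```python
import math

# Toy sketch of the head activations for one time-frequency bin.
# Pre-activation values are made up; the real embedding has 20 dimensions.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def unit_normalize(vector):
    """Scale a vector to unit Euclidean length (left as-is if all zeros)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)

def softmax(logits):
    """Numerically stable softmax over a list of values."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Deep clustering head: sigmoid, then unit-normalization of the embedding.
embedding = unit_normalize([sigmoid(x) for x in [0.0, 2.0, -1.0]])

# Mask inference head: softmax across the per-source mask values in this bin.
mask_values = softmax([2.0, 0.5, -1.0])
```

The softmax across sources makes the mask values in each bin sum to one, so the separated magnitudes together account for the full mixture magnitude.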

3 Results

For the first set of experiments, we trained a Cerberus network to separate and transcribe mixtures of one piano and one guitar from Slakh2100. The dataset for this experiment includes acoustic and electric pianos and acoustic, electric, and distorted guitars. In this experiment set, we set certain loss weights to zero to make seven combinations of Cerberus networks. Turning off the transcription loss (i.e., setting its weight to zero in Equation 1) results in a standard Chimera network. In the Chimera network (4th row of Table 1), we weighted the two separation losses equally. In the Cerberus and Chimera transcription networks, we observed that the scale of the transcription loss was much smaller than the scale of the separation losses. To counteract this, we more heavily weighted the transcription loss during training for these networks, while keeping the two separation losses at equal weight.
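The count of seven follows from switching each of the three losses on or off and discarding the case where all are off. A short enumeration, using the DC/MI/TR abbreviations from the table captions, makes this concrete:

```python
from itertools import product

# Enumerate every non-empty subset of the three losses (2**3 - 1 = 7).

LOSS_NAMES = ("DC", "MI", "TR")

configs = [
    tuple(name for name, enabled in zip(LOSS_NAMES, flags) if enabled)
    for flags in product((False, True), repeat=len(LOSS_NAMES))
    if any(flags)
]

for config in configs:
    print(" + ".join(config))
```

The subset with DC and MI only corresponds to a standard Chimera network, and the subset with all three losses is the full Cerberus network.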

Test                              Separation   Transcription
Dataset   Network Type            SDR (dB)     P      R      F1
GS        Transcription Only                   0.08   0.08   0.08
GS        Cerberus                             0.13   0.11   0.12
M         Transcription Only                   0.19   0.08   0.11
M         Cerberus                             0.19   0.10   0.12
M + GS    Deep Clustering Only    4.3
M + GS    Mask Inference Only     4.1
M + GS    Chimera                 4.5
M + GS    Transcription Only                   0.14   0.08   0.09
M + GS    Cerberus                4.7          0.16   0.10   0.12
Table 2: Piano/Guitar performance on MAPS and GuitarSet data using networks from Table 1 trained on Slakh2100. M means MAPS recordings in isolation, GS means GuitarSet recordings in isolation, and M+GS means incoherent mixtures of recordings from MAPS and GuitarSet. Grey cells indicate the network was not trained for that task. Evaluation measures for the transcription task are precision (P), recall (R) and F1. Evaluation for separation is scale-dependent source to distortion ratio (SDR). Higher values are better.

The results in Table 1 suggest that transcription and separation can be learned jointly, given the correct training regime. First, we find that the best performing model for both transcription and separation was the Cerberus model, which surpassed or tied the highest SDR and precision, recall, and F1 scores of the remaining models. Combining the mask inference and transcription objectives resulted in higher transcription performance but lowered separation performance very slightly. Finally, combining the deep clustering and transcription objectives resulted in a large jump in SDR over just deep clustering, suggesting some natural synergy between the two tasks.

Next, we took networks trained on the synthesized Slakh dataset and evaluated them on the real-world dataset we generated from MAPS and GuitarSet. The results are shown in Table 2. First, we notice a significant drop in separation and transcription performance. This is, in large part, due to the major differences between the training and test data. We note that the Cerberus model that was trained with all three loss functions out-performs all of the single-task networks for both separation and transcription, suggesting that our multi-task approach leads to better generalization.

Cerberus trained/tested       Separation   Transcription
on data that contains:        SDR (dB)     P      R      F1
------------------------------------------------------------
3 Sources    Piano            7.6          0.44   0.42   0.42
             & Guitar         6.9          0.46   0.35   0.38
             & Bass           10.1         0.85   0.80   0.82
------------------------------------------------------------
4 Sources    Piano            6.1          0.38   0.36   0.36
             & Guitar         5.8          0.42   0.32   0.34
             & Bass           7.7          0.82   0.78   0.79
             & Drums*         11.3         0.61   0.76   0.63
------------------------------------------------------------
5 Sources    Piano            3.4          0.31   0.28   0.28
             & Guitar         3.1          0.29   0.20   0.22
             & Bass           6.4          0.77   0.72   0.74
             & Drums*         10.6         0.62   0.75   0.64
             & Strings        4.1          0.39   0.35   0.35
Table 3: Results for individual instruments from three Cerberus networks trained on different sets of instrument combinations, separated by horizontal lines. Each model has its own training, validation, and test set, which depend on the instruments it is trained to separate and transcribe. Drum (*) transcription is evaluated on note onsets only; all other instruments are evaluated with note onset/offset precision, recall, and F-score.

Finally, we trained and tested a Cerberus model on data sets of increasing numbers of simultaneous polyphonic instruments to see how the system scales up to more complex mixtures. The results, shown in Table 3, show that as we add more sources to the mixture, performance across both separation and transcription predictably degrades. While piano, guitar, and strings results are low in the most complex setup (bottom rows), bass and drums can still be separated and transcribed from complex mixtures.

4 Conclusion

We introduced an architecture to simultaneously transcribe and separate multiple instruments in a musical mixture. This architecture, called Cerberus, has three “heads”: one for transcription, one for deep clustering, and one for mask inference. Cerberus networks are more effective at both tasks than single-task networks on both real and synthesized data. Future work could include more involved network architectures, dedicated losses for note onsets and velocities, training on existing score-aligned recordings of isolated instruments to strengthen transcription on real recordings, as well as training on real multi-instrument recordings when aligned score data becomes available.