1 Proposed method
Our proposed method simultaneously separates and transcribes musical mixtures using a deep net called Cerberus (depicted in Fig. 1), a “three-headed” network in which each head has a different output and a different objective function. The key idea is to first transform the input representation (e.g. a spectrogram) via shared processing layers into a learned representation useful for both transcription and separation. The learned representation can then be processed by smaller networks that are specialized for separation or transcription.
1.1 Source separation
Assume an audio mixture is given in a time-frequency representation, such as a short-time Fourier transform (STFT), represented by a complex matrix X, where each element X(t, f) indicates the magnitude and phase of the mixture at time t and frequency f. We perform a variant of mask-based source separation, where the goal is to estimate a non-negative mask matrix M_i for each sound source i, with values normalized to the interval [0, 1]. Each element M_i(t, f) indicates the degree to which the energy in the auditory mixture at time t and frequency f is due to source i. To isolate source i, one can element-wise multiply the magnitude of the STFT by the mask: |S_i| = M_i ⊙ |X|.
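As a concrete illustration of the masking step, here is a minimal NumPy sketch; `apply_masks` is a hypothetical helper (not from the paper) that applies estimated masks to a mixture STFT, reusing the mixture phase for the source estimates:

```python
import numpy as np

def apply_masks(mix_stft, masks):
    """Recover source estimates by element-wise masking.

    mix_stft: complex STFT of the mixture, shape (T, F).
    masks:    real masks in [0, 1], shape (n_sources, T, F).
    Returns complex source estimates that reuse the mixture phase.
    """
    mag, phase = np.abs(mix_stft), np.angle(mix_stft)
    return masks * mag * np.exp(1j * phase)  # broadcasts over sources

# Toy example: a 2-bin "mixture" split by two complementary masks.
mix = np.array([[3 + 4j, 1 + 0j]])                # shape (1, 2)
masks = np.array([[[1.0, 0.0]], [[0.0, 1.0]]])    # shape (2, 1, 2)
est = apply_masks(mix, masks)
```

Each source estimate keeps only the time-frequency energy its mask assigns to it; an inverse STFT of each estimate would then yield the separated waveforms.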
To train a deep net to produce output useful for mask-based source separation, ground-truth training mixtures and their corresponding ideal source masks are provided, and a loss function measures the difference between the network output and the ideal output. We build on two closely related prior works: deep clustering and the Chimera architecture.
In deep clustering, a neural network is trained to map each time-frequency bin to a point in a higher-dimensional embedding space such that bins whose energy comes primarily from the same source are near each other and bins that belong to different sources are far apart. Call this embedding V. Once trained, the network is used to embed a new magnitude spectrogram representing an auditory scene. Mask assignments can then be made by clustering the time-frequency points in the embedding space, assigning elements in the same cluster to the same source. With classical deep clustering, a clustering algorithm such as K-means is applied to the embeddings to determine the source assignment for every time-frequency point, and the assignments are then used to make a mask.
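The clustering step can be sketched as follows. This is a minimal self-contained K-means (to avoid a library dependency) over per-bin embeddings; `embeddings_to_masks` is a hypothetical helper name, and the toy embeddings are illustrative, not trained outputs:

```python
import numpy as np

def embeddings_to_masks(emb, n_src, n_iter=20, seed=0):
    """Cluster per-bin embeddings with K-means, then build binary masks.

    emb: embeddings of shape (n_bins, D), one row per time-frequency bin.
    Returns one-hot masks of shape (n_src, n_bins).
    """
    rng = np.random.default_rng(seed)
    centers = emb[rng.choice(len(emb), n_src, replace=False)]
    for _ in range(n_iter):
        # Assign each bin to its nearest cluster center.
        assign = np.argmin(((emb[:, None] - centers) ** 2).sum(-1), axis=1)
        # Move each center to the mean of its assigned bins.
        for k in range(n_src):
            if np.any(assign == k):
                centers[k] = emb[assign == k].mean(0)
    return (assign[None, :] == np.arange(n_src)[:, None]).astype(float)

# Two well-separated embedding clusters yield two complementary masks.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
masks = embeddings_to_masks(emb, n_src=2)
```

In practice the flat bin axis is reshaped back to (T, F) to produce the spectrogram-shaped masks described above.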
The Chimera architecture for audio source separation combines deep clustering with a more traditional signal-reconstruction loss. In Chimera, two “heads” are attached to a single “body”: one head is trained with the deep clustering loss function, while the other is trained to create the mask. Both heads are trained simultaneously. The mask inference head directly creates a mask that is element-wise multiplied with the mixture spectrogram to recover the sources; in this case, the deep clustering head acts as a regularizer for the mask inference head. The mask inference head is trained using a phase-sensitive approximation (PSA) loss or a similar objective.
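For reference, one common (truncated) form of the PSA objective can be sketched as below. This is a generic illustration of the loss family the text names, not the paper's exact implementation; `psa_loss` is a hypothetical helper:

```python
import numpy as np

def psa_loss(mask, mix_stft, src_stft):
    """Truncated phase-sensitive approximation loss (one common form).

    The target is the source magnitude scaled by the cosine of the phase
    difference between source and mixture, clipped to [0, |X|], so the
    mask compensates for phase mismatch between source and mixture.
    """
    mix_mag = np.abs(mix_stft)
    target = np.abs(src_stft) * np.cos(np.angle(src_stft) - np.angle(mix_stft))
    target = np.clip(target, 0.0, mix_mag)
    return np.mean((mask * mix_mag - target) ** 2)

# A phase-aligned source recovered by the correct mask gives zero loss.
mix = np.array([[2 + 0j]])
src = np.array([[1 + 0j]])
loss = psa_loss(np.array([[0.5]]), mix, src)
```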
1.2 Adding a Transcription Head: Cerberus
Since Chimera networks have been shown to outperform networks that do only deep clustering or only mask inference, we chose to use both heads in our work. We propose adding a third head to a Chimera network for further multitask learning. We call the proposed architecture a Cerberus network. The new head is used for the automatic transcription of musical mixtures containing multiple polyphonic instruments. Altogether, the Cerberus architecture has three heads: a deep clustering head producing embeddings, a mask inference head creating masks that are applied to mixture spectrograms, and a transcription head that produces a piano-roll transcription for each instrument in the mixture. The transcription estimate is a real-valued tensor of shape T × P × N, where T is the number of time frames (aligned with the STFT time frames), P is the number of possible pitches, and N is the number of instruments to transcribe. Once trained, the output of this head is quantized to binary values to produce a piano-roll transcription.
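The shapes and activations of the three heads can be sketched as follows. This is a schematic NumPy forward pass with random stand-in weights (the real body is a BLSTM stack); all dimensions and the 0.5 binarization threshold are illustrative assumptions:

```python
import numpy as np

# T time frames, F frequency bins, D embedding dims, P pitches, N instruments.
T, F, D, P, N = 100, 513, 20, 88, 2
rng = np.random.default_rng(0)
body = rng.standard_normal((T, 300)) * 0.1       # stand-in shared representation

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Head 1 (deep clustering): unit-norm embedding per time-frequency bin.
emb = sigmoid(body @ rng.standard_normal((300, F * D))).reshape(T, F, D)
emb /= np.linalg.norm(emb, axis=-1, keepdims=True)

# Head 2 (mask inference): softmax across sources, so masks sum to one per bin.
logits = (body @ rng.standard_normal((300, F * N))).reshape(T, F, N)
mask_out = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)

# Head 3 (transcription): per-instrument piano roll, sigmoid then thresholded.
rolls = sigmoid(body @ rng.standard_normal((300, P * N))).reshape(T, P, N)
piano_roll = (rolls > 0.5).astype(int)
```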
Table 1: Separation (SDR in dB) and transcription (precision P, recall R, F1) results for each combination of loss functions.
The system is trained using a weighted linear combination of three loss functions:

    L = w_DC · L_DC + w_MI · L_MI + w_TR · L_TR    (1)

The deep clustering loss (L_DC) and the mask inference loss (L_MI) are those used in the Chimera network. For the transcription loss (L_TR), we found that distance to a MIDI-derived piano-roll score works best (see the next section for details). At inference time, any of the three heads can be used, depending on the task.
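The combined objective is a one-liner; the sketch below uses illustrative weights and loss values (the paper only says transcription was weighted more heavily because its loss scale is much smaller):

```python
def cerberus_loss(l_dc, l_mi, l_tr, w_dc=1.0, w_mi=1.0, w_tr=1.0):
    """Weighted linear combination of the three head losses.

    The weights are hyperparameters; up-weighting the transcription loss
    compensates for its smaller scale relative to the separation losses.
    """
    return w_dc * l_dc + w_mi * l_mi + w_tr * l_tr

# Illustrative values: transcription loss is small, so it gets weight 10.
total = cerberus_loss(0.5, 0.4, 0.01, w_dc=1.0, w_mi=1.0, w_tr=10.0)
```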
2 Experimental Design
Our experiments are designed to answer two questions. The first is whether learning to simultaneously separate and transcribe using a single network helps or hurts performance on either task, versus learning the tasks independently. The second is whether all three heads are needed for an effective separation and transcription system.
2.1 Datasets and evaluation
To train a Cerberus network, a dataset is required that contains mixtures, isolated sources, and ground-truth transcriptions for those sources. Slakh2100 is one such dataset. It comprises 2100 mixtures made with sample-based professional synthesizers, along with isolated sources and accompanying MIDI data for each source. We downsample the audio to 16 kHz. To make an example, we pick a mix, pick a subset of the sources (e.g., piano, guitar, bass) in the mixture, and then pick a 5-second segment in which every desired source has at least 10 note onsets above a minimum MIDI velocity. The source audio for the desired sources is combined to make a mixture of just those sources. STFTs with a 1024-point window and a 256-sample hop are calculated from the mixture segments as inputs to the network. The musical score is the accompanying MIDI data, binarized with the same velocity threshold.
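The segment-selection rule can be sketched as follows. `segment_ok`, the note-tuple format, and the velocity threshold value are all illustrative assumptions (the paper's threshold value did not survive extraction):

```python
# Keep a 5-second window only if every desired source has at least
# MIN_ONSETS note onsets whose MIDI velocity clears VEL_THRESH.
# Notes are (onset_seconds, pitch, velocity); the threshold is assumed.
MIN_ONSETS, VEL_THRESH = 10, 30

def segment_ok(notes_per_source, start, dur=5.0):
    end = start + dur
    for notes in notes_per_source:
        onsets = [n for n in notes
                  if start <= n[0] < end and n[2] > VEL_THRESH]
        if len(onsets) < MIN_ONSETS:
            return False
    return True

# A source with 12 loud onsets inside [0, 5) passes; a sparse one fails.
busy = [(0.4 * i, 60, 80) for i in range(12)]
sparse = [(1.0, 40, 80), (2.0, 40, 80)]
```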
Using this procedure, we made 4 sets, each with 20000 segments (28 hours) for training, 3000 (4 hours) for validation, and 1000 (1.4 hours) for testing. The instrument combinations for the four sets were piano + guitar (set 1), piano + guitar + bass (set 2), piano + guitar + bass + drums (set 3), and piano + guitar + bass + drums + strings (set 4).
In addition to the synthesized audio data, we evaluate on recordings of real instruments. To our knowledge, no large dataset exists that contains real-world recordings of mixtures, isolated sources, and ground-truth transcriptions. We therefore make mixtures from two datasets of real solo instrument recordings. The first is MAPS (the MUS partition of both ENSTDkAm and ENSTDkCl), which contains 30 live piano performances of classical music recorded with two microphones (60 clips total) and a Disklavier MIDI recorder. The second is GuitarSet, which contains 360 30-second guitar excerpts in 5 styles. We downsample to 16 kHz, select 5-second segments with at least 10 note onsets, and use the same STFT parameters. We randomly selected segments from each dataset to make 1000 instantaneous mixtures with accompanying sources and score data. These mixes are incoherent and highly dissimilar to the data used to train our network.
2.2 Networks we evaluate
All the networks we trained use a stack of 4 bidirectional long short-term memory (BLSTM) layers, each with 300 hidden units. We trained each network for 100 epochs using the Adam optimizer with an initial learning rate of 2e-4 and a fixed batch size and sequence length. Each network had three heads. The first head maps each time-frequency point to a 20-dimensional embedding space, with sigmoid activation and unit normalization. The second head outputs masks for each of the sources the network is trained to separate (between 2 and 5 masks), with a softmax activation across the masks. The third head outputs a transcription for each source, with a sigmoid activation, indicating which pitches are active at each time frame. For evaluation, we binarized the network’s transcriptions using a static threshold, with a separate threshold for drums. Each network was initialized with the same set of weights. The only differences between the networks are the training data (which instrument combination to separate) and the weights on the three loss functions: deep clustering (DC), mask inference (MI), and transcription (TR).
For the first set of experiments, we trained a Cerberus network to separate and transcribe mixtures of one piano and one guitar from Slakh2100. The dataset for this experiment includes acoustic and electric pianos, and acoustic, electric, and distorted guitars. In this experiment set, we set certain loss weights to zero to make seven combinations of Cerberus networks. Turning off the transcription loss (i.e., setting its weight in Equation 1 to zero) results in a standard Chimera network. In the Chimera network (4th row of Table 1), we weighted the two separation losses equally. In the Cerberus and Chimera-plus-transcription networks, we observed that the scale of the transcription loss was much smaller than that of the separation losses. To counteract this, we weighted the transcription loss more heavily during training for these networks, while keeping the two separation losses equally weighted.
Table 2: Performance on the real-world test set (M + GS = MAPS + GuitarSet). SDR in dB for separation; precision (P), recall (R), and F1 for transcription.

| Dataset | Network Type         | SDR (dB) | P    | R    | F1   |
|---------|----------------------|----------|------|------|------|
| M + GS  | Deep Clustering Only | 4.3      | –    | –    | –    |
| M + GS  | Mask Inference Only  | 4.1      | –    | –    | –    |
| M + GS  | Chimera              | 4.5      | –    | –    | –    |
| M + GS  | Transcription Only   | –        | 0.14 | 0.08 | 0.09 |
| M + GS  | Cerberus             | 4.7      | 0.16 | 0.10 | 0.12 |
The results in Table 1 suggest that transcription and separation can be learned jointly, given the correct training regime. First, we find that the best performing model for both transcription and separation was the Cerberus model, which surpassed or tied the highest SDR and precision, recall, and F1 scores of the remaining models. Combining the mask inference and transcription objectives resulted in higher transcription performance but lowered separation performance very slightly. Finally, combining the deep clustering and transcription objectives resulted in a large jump in SDR over just deep clustering, suggesting some natural synergy between the two tasks.
Next, we took networks trained on the synthesized Slakh dataset and evaluated them on the real-world dataset we generated from MAPS and GuitarSet. The results are shown in Table 2. First, we notice a significant drop in separation and transcription performance. This is, in large part, due to the major differences between the training and test data. We note that the Cerberus model that was trained with all three loss functions out-performs all of the single-task networks for both separation and transcription, suggesting that our multi-task approach leads to better generalization.
Table 3: Results for individual instruments (SDR in dB; transcription P, R, F1) from three Cerberus networks trained on different sets of instrument combinations, separated by horizontal lines. Each model has its own training, validation, and test sets, which depend on the instruments it is trained to separate and transcribe. Drum (*) transcription evaluation measures note onsets only; all other instruments use note on/off precision/recall/F-score.
Finally, we trained and tested a Cerberus model on data sets of increasing numbers of simultaneous polyphonic instruments to see how the system scales up to more complex mixtures. The results, shown in Table 3, show that as we add more sources to the mixture, performance across both separation and transcription predictably degrades. While piano, guitar, and strings results are low in the most complex setup (bottom rows), bass and drums can still be separated and transcribed from complex mixtures.
We introduced an architecture to simultaneously transcribe and separate multiple instruments in a musical mixture. This architecture, called Cerberus, has three “heads”: one for transcription, one for deep clustering, and one for mask inference. Cerberus networks are more effective at both tasks than single-task networks on both real and synthesized data. Future work could include more involved network architectures, dedicated losses for note onsets and velocities, training on existing score-aligned recordings of isolated instruments to strengthen transcription on real recordings, as well as training on real multi-instrument recordings when aligned score data becomes available.
- (2013) Automatic music transcription: challenges and future directions. Journal of Intelligent Information Systems 41 (3), pp. 407–434.
- (2006) Query by humming with the VocalSearch system. Communications of the ACM 49 (8), pp. 49–52.
- (2018) Increasing drum transcription vocabulary using data synthesis. In Proc. of the 21st Int. Conference on Digital Audio Effects (DAFx-18), Aveiro, Portugal.
- (2019) MIMO-Speech: end-to-end multi-channel multi-speaker speech recognition. arXiv preprint arXiv:1910.06522.
- (2011) Soundprism: an online system for score-informed source separation of music audio. IEEE Journal of Selected Topics in Signal Processing 5 (6), pp. 1205–1215.
- (2009) Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle. IEEE Transactions on Audio, Speech, and Language Processing 18 (6), pp. 1643–1654.
- (2015) Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 708–712.
- (2014) Score-informed source separation for musical audio recordings: an overview. IEEE Signal Processing Magazine 31 (3), pp. 116–124.
- (2016) Deep clustering: discriminative embeddings for segmentation and separation. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 31–35.
- (2019) Multitask learning for frame-level instrument recognition. In ICASSP 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 381–385.
- (2019) SDR – half-baked or well done? In ICASSP 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 626–630.
- (2017) Deep clustering and conventional networks for music separation: stronger together. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 61–65.
- (2019) Cutting music source separation some Slakh: a dataset to study the impact of training data quality and quantity. In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA).
- (2014) mir_eval: a transparent implementation of common MIR metrics. In Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR).
- (2017) An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098.
- (2008) Automatic transcription of melody, bass line, and chords in polyphonic music. Computer Music Journal 32 (3), pp. 72–86.
- (2019) A unified neural architecture for instrumental audio tasks. In ICASSP 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 461–465.
- (2018) Jointly detecting and separating singing voice: a multi-task approach. In International Conference on Latent Variable Analysis and Signal Separation, pp. 329–339.
- (2018) GuitarSet: a dataset for guitar transcription. In ISMIR, pp. 453–460.