We propose a unified model for three inter-related tasks: 1) separating individual sound sources from mixed music audio, 2) transcribing each sound source to MIDI notes, and 3) synthesizing new pieces based on the timbre of separated sources. The model is inspired by the fact that when humans listen to music, our minds can not only separate the sounds of different instruments, but also perceive high-level representations such as score and timbre at the same time. To mirror such capability computationally, we designed a pitch-timbre disentanglement module based on a popular encoder-decoder neural architecture for source separation. The key inductive biases are vector-quantization for the pitch representation and pitch-transformation invariance for the timbre representation. In addition, we adopted a query-by-example method to achieve zero-shot learning, i.e., the model is capable of doing source separation, transcription, and synthesis for unseen instruments. The current design focuses on audio mixtures of two monophonic instruments. Experimental results show that our model outperforms existing multi-task baselines, and the transcribed score serves as a powerful auxiliary for separation tasks.
Music source separation (MSS) is a core problem in music information retrieval (MIR), which aims to separate individual sound sources, either instrumental or vocal, from mixed music audio. A good separation benefits various downstream tasks of music understanding and generation [28, 27], since many music-processing algorithms call for "clean" sound sources.
With the development of deep neural networks, we have seen significant performance improvements in MSS. The current mainstream methodology is to train on pre-defined music sources and then infer a mask on the spectrogram (or another data representation) of the mixed audio. More recently, we have seen several new efforts in MSS research, including query-based methods [9, 21, 20, 13] for unseen (not pre-defined) sources, semantic-based separation that incorporates auxiliary information such as score or video [26, 16, 1, 3, 15, 29, 2], and multi-task settings.
This study conceptually combines the aforementioned new ideas but follows a very different methodology — instead of directly applying masks, we regard MSS as an audio pitch-timbre disentanglement and reconstruction problem. Such a strategy is inspired by the fact that when humans listen to music, our minds not only separate the sounds into different sources but also perceive high-level pitch and timbre representations that generalize well during both music understanding and creation. For example, humans can easily identify the same timbre in other pieces or identify the same piece played by other instruments. People can even mimic a learned timbre using the human voice and sing (i.e., synthesize via voice) a learned pitch sequence.
To mirror such capability computationally, we propose a zero-shot multi-task model jointly performing MSS, automatic music transcription (AMT), and synthesis. The model comprises four components: 1) a query-by-example (QBE) network, 2) a pitch-timbre disentanglement module, 3) a transcriptor, and 4) an audio encoder-decoder network. First, the QBE network summarizes the clean query example audio (which contains only one instrument) into a low-dimensional query vector, conditioned on which the audio encoder extracts the latent representation of an individual sound source. Second, the model disentangles the latent representation into pitch and timbre vectors while transcribing the score using the transcriptor. Finally, the audio decoder takes in both the disentangled pitch and timbre representations, generating a separated sound source. When the model further equips the timbre representation with a pitch-transformation invariance loss, the decoder becomes a synthesizer, capable of generating new sounds based on an existing timbre vector and new scores.
The current model focuses on audio mixtures of two monophonic instruments and performs in a frame-by-frame fashion. Also, it only transcribes pitch and duration information. We leave polyphonic and vocal scenarios as well as a more complete transcription for future work. In sum, our contributions are:
Zero-shot multi-task modeling: To the best of our knowledge, it is the first model that jointly performs separation, transcription, and synthesis. It works for both previously seen and unseen sources using a query-based method.
Well-suited inductive biases: The neural structure is analogous to the "hardware" of the model, which alone is inadequate to achieve good disentanglement. We designed two extra inductive biases: vector-quantization for the pitch representation and pitch-transformation invariance for the timbre representation, which serve as a critical part of the "software" of the model.
Non-mask-based MSS: Our methodology regards MSS as an audio pitch-timbre disentanglement and re-creation problem, unifying music understanding and generation in a representation-learning framework.
Most effective music source separation (MSS) methods are based on well-designed neural networks, such as U-Net and MMDenseLSTM. Here, we review three new trends in MSS related to our work: 1) multi-task learning, 2) zero-shot learning for unseen sources, and 3) taking advantage of auxiliary semantic information.
Several recent studies [12, 5, 24] conduct multi-task separation and transcription by learning a joint representation for both tasks. These works demonstrated that a multi-task setting benefits one or both of the two tasks due to the better generalization capability of the learned joint representation. Our model is likewise multi-task, and can further disentangle pitch and timbre representations for sound synthesis.
Few-shot and zero-shot learning are becoming popular in MIR. For the MSS task, it is meaningful to separate unseen rather than pre-defined sources, since it is unrealistic to collect sufficient training data covering all possible sources. The query-by-example (QBE) network is one solution for zero-shot learning, and recent studies [9, 21, 4, 25, 11, 8, 13] show its strong performance. In this study, we adopt a QBE approach.
Many studies demonstrate that semantic information is a useful auxiliary for MSS. For example, Gover et al. design a score-informed Wave-U-Net to separate choral music; Jeon et al. perform lyrics-informed separation; Meseguer-Brocal et al. develop a phoneme-informed C-U-Net; Zhao et al. take advantage of visual information to separate homogeneous instruments. But these methods cannot separate sources without additional semantic ground truth during inference. Our study can also be regarded as score-informed MSS, but our model does not call for a ground-truth score at inference time.
In this section, we describe our proposed 1) multi-task and QBE model for source separation; 2) pitch-timbre disentanglement module; 3) pitch-translation invariance loss.
Different from previous works that tackle the music separation and music transcription problems separately, we learn a joint representation for both of them. Previous works [12, 5] have shown that the representation learnt by a joint separation and transcription task can generalize better than the representation learnt by single-task models.
We denote the waveforms of two single-source audio segments from different sources as $s_1$ and $s_2$, respectively. We denote their mixture as:
$$x = s_1 + s_2. \tag{1}$$
Our aim is to separate $s_1$ and $s_2$ from $x$. We denote the spectrograms of $s_1$, $s_2$, and $x$ as $S_1$, $S_2$, and $X$, respectively.
We first formalize the general MSS model using an encoder-decoder neural architecture. For instance, U-Net is an encoder-decoder architecture widely used in MSS. Ignoring the skip connections of U-Net, the output of the encoder (the bottleneck of U-Net) can be used as a joint representation for separation and transcription. Different from previous MSS methods that estimate a single-target mask on the mixture spectrogram, we design the separation model to directly output spectrograms. In this way, the model can not only separate a source from a mixture, but also synthesize new audio recordings from joint representations.
For the source separation system, we denote the encoder and decoder as follows:
$$h = \mathrm{Enc}(X), \tag{2}$$
$$\hat{S}_1 = \mathrm{Dec}_1(h), \tag{3}$$
$$\hat{S}_2 = \mathrm{Dec}_2(h), \tag{4}$$
where $h$ is the learned joint representation and $\mathrm{Dec}_k$ is the decoder for the target source $s_k$. The joint representation is used as input to a transcription model:
$$\hat{Y} = \mathrm{Transcriptor}(h), \tag{5}$$
where $\hat{Y} \in [0, 1]^{T \times K}$ are probabilities of the predicted MIDI roll. Typically, we set $K = 89$, including the 88 notes of a piano and a silence state.
When designing the neural networks, to retain the transcription resolution, we do not apply temporal pooling in the encoder, decoder, or transcriptor, so that the temporal resolution of $h$ is consistent with that of $X$. We describe the details of the encoder, decoder, and transcriptor in Section 4.2.
As described in Equations (3) and (4), we need to build one decoder per target source. As the number of target sources increases, the number of parameters also increases. More importantly, a model trained on pre-defined sources cannot adapt to unseen sources. To tackle these problems, we design a QBE module in our model. The advantage of using QBE is that we can separate unseen target sources; that is, we achieve zero-shot separation.
Similar to previous QueryNet designs, we design a QueryNet module as shown in Figure 1(a). The QueryNet module extracts the embedding vector $c \in \mathbb{R}^{D}$ of an input spectrogram $S$, where $D$ is the dimension of the embedding vector:
$$c = \mathrm{QueryNet}(S). \tag{6}$$
Audio recordings from the same source are learnt to have similar embedding vectors. We propose a contrastive loss to encourage embedding vectors from the same source to be close, and embedding vectors from different sources to be far apart:
$$\mathcal{L}_{\mathrm{query}} = \| c_i - c_j \|_2^2 + \max\left(0, \alpha - \| c_i - c_k \|_2\right)^2, \tag{7}$$
where $\alpha$ is a margin, $c_i$ and $c_j$ are from the same source, and $c_i$ and $c_k$ are from different sources. The margin $\alpha$ is set empirically in our experiments.
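The contrastive objective above can be sketched in a few lines. This is a minimal NumPy illustration; the margin value and embedding shapes are illustrative, not the settings used in our experiments:

```python
import numpy as np

def query_contrastive_loss(c_anchor, c_pos, c_neg, margin=0.5):
    """Hinge-style contrastive loss on query embeddings.
    c_anchor/c_pos come from the same source, c_neg from a different one.
    The margin of 0.5 is illustrative."""
    # Pull same-source embeddings together (squared L2 distance).
    pos = np.sum((c_anchor - c_pos) ** 2)
    # Push different-source embeddings at least `margin` apart.
    neg = max(0.0, margin - np.linalg.norm(c_anchor - c_neg)) ** 2
    return pos + neg
```

In practice, such a loss is averaged over many sampled triplets per mini-batch.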
We input the embedding vector $c$ as a condition to each layer of the encoder using Feature-wise Linear Modulation (FiLM) layers. The encoder then outputs a representation $h$. The embedding vector controls which source to separate or transcribe; only one encoder, one decoder, and one transcriptor are needed to separate any source:
$$h = \mathrm{Enc}(X \mid c), \tag{8}$$
$$\hat{S} = \mathrm{Dec}(h), \quad \hat{Y} = \mathrm{Transcriptor}(h). \tag{9}$$
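As a rough illustration of FiLM conditioning, the query embedding is linearly mapped to per-channel scale (gamma) and shift (beta) parameters that modulate the feature maps. The linear projection weights and tensor shapes below are assumptions for this sketch, not our exact layer configuration:

```python
import numpy as np

def film(features, query, W_gamma, b_gamma, W_beta, b_beta):
    """FiLM conditioning: map the query embedding (D,) to per-channel
    scale and shift, then modulate feature maps of shape (C, T, F)."""
    gamma = W_gamma @ query + b_gamma    # (C,) per-channel scale
    beta = W_beta @ query + b_beta       # (C,) per-channel shift
    return gamma[:, None, None] * features + beta[:, None, None]
```

A FiLM layer like this is applied at each encoder layer, so the same encoder weights can serve arbitrary query sources.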
Previous MSS works do not disentangle pitch and timbre for separation. That is, those MSS methods implement separation systems without estimating pitches. In this section, we propose a pitch-timbre disentanglement module based on the query-based encoder-decoder architecture described in previous sections to learn interpretable representations for MSS. Such interpretable representations enable the model to achieve score-informed separation based on predicted scores.
As shown in Figure 1(b), the proposed pitch-timbre disentanglement module consists of a PitchExtractor and a TimbreFilter module. The output of PitchExtractor only contains the pitch information of $S_1$, and the output of TimbreFilter is expected to only contain the timbre information of $S_1$. The PitchExtractor is modeled by an embedding layer $E \in \mathbb{R}^{K \times D_p}$, where $K$ is the number of vectors, which equals the number of pitches in our experiment. Here, $E_k$ denotes the quantized pitch vector for the $k$-th MIDI note. Then, we calculate the disentangled pitch representation $P \in \mathbb{R}^{T \times D_p}$ for $S_1$ as:
$$P_t = \sum_{k=1}^{K} \hat{Y}_{t,k} E_k, \tag{10}$$
where $\hat{Y}_{t,k}$ is the output of the transcriptor containing the predicted presence probability of the $k$-th MIDI note or the silence state at time $t$, and $D_p$ is the dimension of the disentangled pitch representation. During synthesis, we can replace $\hat{Y}$ with one-hot encodings of new scores as input to Equation (10) to obtain pitch representations for synthesizing audio recordings.
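The pitch-representation step amounts to a probability-weighted lookup in the quantized pitch embedding table, as the following sketch shows (shapes are illustrative):

```python
import numpy as np

def pitch_representation(note_probs, embedding):
    """P[t] = sum_k Y[t, k] * E[k]: each frame's pitch vector is the
    transcriptor's note probabilities (or a one-hot score at synthesis
    time) weighted over the pitch embedding table E of shape (K, D_p)."""
    return note_probs @ embedding    # (T, K) @ (K, D_p) -> (T, D_p)
```

When the probabilities collapse to one-hot vectors, this reduces to selecting one quantized pitch vector per frame, which is what makes score-driven synthesis possible.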
TimbreFilter is used to filter the timbre information from $h$:
$$Z = \mathrm{TimbreFilter}(h). \tag{11}$$
Here, TimbreFilter is modeled by a convolutional neural network. Then, we can synthesize $\hat{S}_1$ using the disentangled pitch $P$ and timbre $Z$. Inspired by FiLM, we first split $Z$ into $Z_\gamma$ and $Z_\beta$ along the channel dimension. Then, we entangle $P$ and $Z$ together to produce $\hat{S}_1$:
$$\hat{S}_1 = \mathrm{Dec}(Z_\gamma \odot P + Z_\beta), \tag{12}$$
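The FiLM-style recombination of pitch and timbre can be sketched as follows. The even channel split and the tensor shapes are assumptions for this NumPy illustration:

```python
import numpy as np

def entangle_pitch_timbre(pitch, timbre):
    """Split the TimbreFilter output into a multiplicative (gamma) and
    an additive (beta) part, then modulate the pitch representation
    FiLM-style. Shapes: pitch (T, D), timbre (T, 2*D)."""
    gamma, beta = np.split(timbre, 2, axis=-1)    # (T, D) each
    return gamma * pitch + beta                   # input to the decoder
```

The decoder then maps this modulated representation to the separated (or synthesized) spectrogram.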
and the separation loss is:
$$\mathcal{L}_{\mathrm{sep}} = \| \hat{S}_1 - S_1 \|_1. \tag{13}$$
Different from previous MSS works, we apply both a separation loss and a transcription loss to train the proposed model. The transcription loss is:
$$\mathcal{L}_{\mathrm{trans}} = \mathrm{CE}(\hat{Y}, Y), \tag{14}$$
where $Y$ is the ground truth of the scores and $\mathrm{CE}$ denotes cross-entropy. The aggregated loss function is:
$$\mathcal{L}_{\mathrm{MSI}} = \mathcal{L}_{\mathrm{query}} + \mathcal{L}_{\mathrm{sep}} + \mathcal{L}_{\mathrm{trans}}. \tag{15}$$
The aggregated loss drives the proposed model to be a multi-task score-informed model rather than a synthesizer due to the lack of inductive biases for further timbre disentanglement.
We propose a pitch-translation invariance loss to further improve the timbre disentanglement performance. The pitch-translation invariance assumption is that when the pitch of the audio, together with the corresponding MIDI, is shifted within a certain interval, the timbre remains unchanged.
We shift the pitch of $s_1$ to generate an augmented audio $\tilde{s}_1$. The augmented audio has the same timbre as $s_1$. According to Equation (1), we have a new mixture audio $\tilde{x}$:
$$\tilde{x} = \tilde{s}_1 + s_2. \tag{16}$$
We denote $\tilde{S}_1$ and $\tilde{X}$ as the spectrograms of $\tilde{s}_1$ and $\tilde{x}$, respectively. We extract the disentangled timbre vector of $\tilde{X}$ and denote it as $\tilde{Z}$. Because $\tilde{s}_1$ is a pitch-shifted version of $s_1$, its timbre should be consistent with that of $s_1$. Therefore, the spectrogram reconstructed from the timbre $\tilde{Z}$ and the pitch $P$ should be consistent with $S_1$:
$$\bar{S}_1 = \mathrm{Dec}(\tilde{Z}_\gamma \odot P + \tilde{Z}_\beta), \qquad \mathcal{L}_{\mathrm{inv}} = \| \bar{S}_1 - S_1 \|_1, \tag{17}$$
where $\bar{S}_1$ is the reconstructed spectrogram. We denote $\mathcal{L}_{\mathrm{inv}}$ as the pitch-translation invariance loss. With $\mathcal{L}_{\mathrm{inv}}$, our proposed model is capable of learning the disentanglement of pitch and timbre. A byproduct of the disentanglement system is that the decoder becomes a synthesizer, which can be used to synthesize audio recordings using timbre and pitches as input. When we change $\hat{Y}$ to arbitrary scores, our model can synthesize a new piece of music with the timbre of $s_1$.
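The invariance objective can be sketched as follows. This is a NumPy illustration with a stand-in `decode` callable; an L1 reconstruction loss is assumed:

```python
import numpy as np

def pitch_invariance_loss(decode, timbre_shifted, pitch_original, spec_target):
    """L_inv: the timbre extracted from the pitch-shifted mixture,
    recombined with the original pitch representation, must reconstruct
    the original source spectrogram. `decode` stands in for the decoder
    (including the FiLM-style entangling step)."""
    recon = decode(timbre_shifted, pitch_original)
    return np.abs(recon - spec_target).mean()    # L1 reconstruction error
```

Because the loss ties the shifted-mixture timbre to the original pitch, any pitch information leaking into the timbre vector is penalized.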
In total, the objective function used to train the proposed model with further disentanglement includes a QueryNet loss, a transcription loss, and a pitch-translation invariance loss:
$$\mathcal{L}_{\mathrm{MSI\text{-}DIS}} = \mathcal{L}_{\mathrm{query}} + \mathcal{L}_{\mathrm{trans}} + \mathcal{L}_{\mathrm{inv}}. \tag{18}$$
We utilize the University of Rochester Multimodal Music Performance (URMP) dataset as the experimental dataset. The URMP dataset is a multi-instrument audio-visual dataset covering 44 classical chamber music pieces remixed from 115 single-source tracks of 13 different monophonic instruments. The dataset provides note annotations for each single track. As shown in Figure 2, we divide these instruments into two groups (8 seen and 5 unseen instruments) and the tracks into two subsets (55 tracks of the 8 seen instruments for training, and 32 songs remixed from 60 tracks of all 13 instruments for testing). Note that we count the duration of tracks repeated across different songs in the test set, and we do not exclude silence segments from any track.
We resample all the tracks to a sample rate of 16 kHz and extract short-time Fourier transform (STFT) spectrograms with a window size of 1024 and a hop size of 10 ms. During training, we randomly remix two arbitrary clips of different instruments to generate a mixture. All the training data are augmented using the pitch shifting (by a few semitones) mentioned in Section 3.4.
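The spectrogram extraction above can be sketched as follows. This is a minimal NumPy magnitude STFT; the Hann window is an assumption, and at 16 kHz a 10 ms hop corresponds to 160 samples:

```python
import numpy as np

def stft_magnitude(wave, n_fft=1024, hop=160):
    """Minimal magnitude STFT: window size 1024, 10 ms hop at 16 kHz.
    Returns an array of shape (n_frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=-1))
```

A training mixture is then simply the sum of two such single-instrument clips before the transform.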
Table 1: Separation (SDR) and transcription (precision) results.
We design our models based on U-Net, the current prominent model in MSS. Figures 1(b) and 3 elaborate the details of the proposed multi-task score-informed model (MSI) described in Section 3.3 and the model with further disentanglement (MSI-DIS) illustrated in Section 3.4.
The combination of the encoder and decoder is a general U-Net without temporal pooling. The QueryNet comprises 2 CNN blocks, each of which consists of 2 convolutional layers and a max-pooling module. A fully-connected layer and a tanh activation layer are applied to the last feature maps. We then average the output vectors over the temporal axis to get a 6-dimensional query embedding vector $c$. The architecture of the transcriptor is similar to the QueryNet but without temporal pooling. Each blue block in the TimbreFilter depicted in Figure 3 is a 2-dimensional convolutional layer whose output tensor has the same shape as its input tensor. Each deep blue block in the PitchExtractor is a 1-dimensional convolutional layer. Typically, the bottleneck of U-Net is regarded as $h$. However, when constructing disentangled timbre representations, we regard the set of concatenated residual tensors as $h$, to avoid non-disentangled representations leaking into the decoder.
As shown in Figure 1(a), besides the proposed models illustrated above, we also report the performance of 3 extra baseline models in our experimental results. The QBE transcription-only baseline model (AMT-only) is composed of the QueryNet, encoder, and transcriptor; the QBE separation-only baseline model (MSS-only) is a general U-Net; the QBE multi-task baseline model is composed of a U-Net and a transcriptor. All the hyper-parameters of components in these models are consistent with those of the corresponding components in our models.
All the models are trained with a mini-batch of 12 audio pairs for 200 epochs. All the models are evaluated with the source-to-distortion ratio (SDR), computed by the mir_eval package, for separation, and precision, computed by the sklearn package, for transcription. During training, each audio pair comprises 2 single-track audio clips of different instruments to generate a mixture, 2 corresponding augmented samples for the pitch-transformation invariance loss, and 3 single-track audio clips that exclude silence segments for the contrastive loss. During inference, each test pair comprises a 4-second audio mixture and a query sample. During synthesis, we employ the Griffin-Lim algorithm (GLA) as the phase vocoder, using the torchaudio library. Since we do not hold out a validation set to choose the best-performing model among all the training epochs, we report micro-average results with a 95% confidence interval (CI) over the models at the last 10 epochs. All the experimental results are reproducible via our released source code: https://github.com/kikyo-16/a-unified-model-for-zero-shot-musical-source-separation-transcription-and-synthesis.
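For reference, the Griffin-Lim algorithm alternates between enforcing the target magnitude and re-estimating a consistent phase by round-tripping through the STFT. The following plain-NumPy sketch illustrates the idea (it is not the torchaudio implementation we actually use; window choice and iteration count are illustrative):

```python
import numpy as np

def griffin_lim(mag, n_fft=1024, hop=160, n_iters=32):
    """Recover a waveform from a magnitude spectrogram (n_frames, n_fft//2+1)
    by iteratively enforcing the magnitude and a consistent phase."""
    window = np.hanning(n_fft)
    n_frames = mag.shape[0]
    length = (n_frames - 1) * hop + n_fft

    def istft(spec):
        # Overlap-add synthesis with window-square normalization.
        wave = np.zeros(length)
        norm = np.zeros(length)
        frames = np.fft.irfft(spec, n=n_fft, axis=-1) * window
        for i in range(n_frames):
            wave[i * hop : i * hop + n_fft] += frames[i]
            norm[i * hop : i * hop + n_fft] += window ** 2
        return wave / np.maximum(norm, 1e-8)

    def stft(wave):
        frames = np.stack([wave[i * hop : i * hop + n_fft] * window
                           for i in range(n_frames)])
        return np.fft.rfft(frames, axis=-1)

    # Start from random phase, then refine it iteratively.
    phase = np.exp(2j * np.pi * np.random.rand(*mag.shape))
    for _ in range(n_iters):
        wave = istft(mag * phase)
        spec = stft(wave)
        phase = spec / np.maximum(np.abs(spec), 1e-8)
    return istft(mag * phase)
```

In our pipeline, the decoder's output magnitude spectrogram is passed through such a phase-recovery step to obtain the final waveform.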
Experimental results shown in Table 1 demonstrate that the proposed MSI model outperforms baselines on separation without sacrificing performance on transcription. The instrument-wise performance on unseen instruments depicted in Figure 4 demonstrates that the proposed models are capable of performing zero-shot transcription and separation. We also release synthesized audio demos online222https://kikyo-16.github.io/demo-page-of-a-unified-model-for-separation-transcriptiion-synthesis. These demos demonstrate the success of the proposed inductive biases for disentanglement.
With the aid of the proposed pitch-timbre disentanglement module, the performance of MSI on separation becomes better than that of the multi-task baseline. This indicates that the disentanglement module improves the generalization capability of the joint representation, leading to better separation results. Meanwhile, MSI outperforms the MSS-only baseline on separation by 1.06 points. This demonstrates that even the inaccurate scores transcribed by the model itself serve as a powerful auxiliary for separation.
As depicted in Figures 5(a) and 5(b), it is interesting that despite the two models sharing the same "hardware" (neural network design), the MSI model fails at synthesis while the MSI-DIS model succeeds. This demonstrates that the designed "software" (the pitch-translation loss) is what enables successful disentanglement. As for the separation performance shown in Table 1, the MSI-DIS model falls behind the MSI model. The observation that better synthesis quality does not imply better separation performance suggests a trade-off between disentanglement and reconstruction. It indicates that extra (well-suited) inductive biases are required to further improve pitch and timbre disentanglement while reducing the loss of information necessary for reconstruction.
Comparing the performance on seen and unseen instruments shown in Table 1, we find that the separation quality of the MSI-DIS model is more sensitive to the accuracy of the transcription results than that of the MSI model. This is because the MSI-DIS model synthesizes rather than separates sources, so its separation performance relies more on the accuracy of the transcription results and the capability of the decoder than the MSI model does. However, when comparing the separated spectrograms shown in Figures 5(c) and 5(d), we find that the MSI model sometimes separates multiple pitches at the same time, while the MSI-DIS model yields monophonic results that sound more "clean". We release more synthesized and separated audio demos online.
We contributed a unified model for zero-shot music source separation, transcription, and synthesis via pitch and timbre disentanglement. The main novelty lies in the disentanglement-and-reconstruction methodology for source separation, which naturally empowers the model with transcription and synthesis capabilities. In addition, we designed well-suited inductive biases, including pitch vector quantization and a pitch-translation invariant timbre loss, to achieve better disentanglement. Lastly, we successfully integrated the model with a query-based network, so that all three tasks can be achieved in a zero-shot fashion for unseen sound sources. Experiments demonstrated the zero-shot capability of the model and the powerful auxiliary role of disentangled pitch information in separation. Results on synthesized audio pieces further exhibit that the disentangled factors generalize well. In the future, we plan to extend the proposed framework to multi-instrument and vocal scenarios as well as high-fidelity synthesis.