Automatic Music Transcription (AMT) is the process of creating some form of notation for a music signal and is currently one of the most challenging and discussed topics in the Music Information Retrieval (MIR) community [Benetos2019]. Most AMT systems are designed to transcribe a single monophonic or a single polyphonic source into a musical score (or piano-roll). In this case, the main sub-task involved in the process is Multi-Pitch Estimation (MPE), which predicts the pitch and time localisation of the musical notes. However, when analysing polyphonic multi-instrumental recordings, not only should each note have its pitch and duration properly estimated, but the information regarding the timbre of the sounds should also be correctly processed [Duan2014]: a way of recognising the instrument that played each note is required.
In this paper, we propose a pitch-informed instrument assignment approach whose main objective is to associate each note event of a music signal with one instrument class. In contrast to other state-of-the-art instrument recognition approaches, which usually operate on a frame-level [Hung2018, Hung2019] or clip-level [Han17, Gururani2019] basis, our approach analyses each note event individually. In other words, we perform note-level instrument recognition.
Previous work has shown that the use of pitch information can help frame-level instrument recognition [Hung2018]. Inspired by this, we propose a framework that uses an auxiliary input based on note-event pitch information. Our system is trained using the note annotations provided in the MusicNet [Thickstun17] dataset. However, our main motivation is to create a modular framework that can be combined with any MPE algorithm in order to obtain multi-instrumental pitch predictions, which allows for transcribing music in staff notation, corresponding to the perception of pitch events. Therefore, we also show that our approach can obtain good performance when the note information is predicted by state-of-the-art MPE algorithms such as [Thome17, Wu2019a].
Furthermore, the use of multiple kernel shapes in the filters of a Convolutional Neural Network (CNN) has proven to be an effective strategy for applying domain knowledge in several MIR tasks [Pons2016, Pons17_Timbre, Lordelo19]. In particular, [Lordelo19] combined this strategy with a dense connectivity pattern of skip-connections in order to learn more efficient feature maps and reduce the number of trainable parameters for the task of source separation. In our work, we build our CNN by adapting the architecture in [Lordelo19] to the classification (instrument assignment) task and verify that it also improves performance in this setting. In summary, the main contributions of this paper are as follows:
Pitch-informed instrument assignment: Proposal of a Deep Neural Network (DNN) that associates each note from a music signal to its instrumental source.
Modular Framework: Approach works with any MPE method. We evaluate the performance when using ground-truth note labels as well as state-of-the-art MPE algorithms [Thome17, Wu2019a].
Multiple Kernel Shapes: Proposal of a CNN architecture for instrument assignment that uses multiple kernel shapes for the convolutions, facilitating learning representations for different instruments and note sound states. We show that their use improves instrument assignment performance.
2 Related Work
The instrument recognition task is usually formulated as a multi-label classification task that can be addressed either on a frame-level [Hung2018, Hung2019, Wu2020], where the purpose is to obtain the instrument activations across time, or on a clip-level basis [Han17, Gururani2019, Solanki2019], where the purpose is to estimate the instruments that are present in an audio clip. However, our objective in this work is to approach the instrument recognition task note by note, assigning an instrument class to each. Such a task requires note-event annotations and, in the literature, it is also known as instrument assignment [Benetos2013] or multi-pitch streaming [Duan2014].
Only a few works have explored this particular task. For instance, Duan et al. [Duan2014] approached it using a constrained clustering of frame-level pitch estimates obtained from an MPE algorithm via the minimisation of timbre inconsistency within each cluster. They tested different timbre features for both music and speech signals. In [Arora2015], a similar method was proposed, where the authors applied Probabilistic Latent Component Analysis (PLCA) to decompose the audio signal into multi-pitch estimates and to extract source-specific features. Then, clustering was performed under the constraint of cognitive grouping of continuous pitch contours and segregation of simultaneous pitches into different source streams using Hidden Markov Random Fields. Both of those works, however, assume that each source is monophonic, i.e., each instrument can only play a single note at a time.
An alternative approach is to model the temporal evolution of musical tones [Benetos2013]. This method is based on the use of multiple spectral templates per pitch and instrument source that correspond to sound states. The authors used hidden-Markov-model-based temporal constraints to control the order of the templates and streamed the pitches via shift-invariant PLCA. In more recent work, Tanaka et al. [Tanaka2020] also approached the task via clustering, but applied it to a joint input representation combining the spectrogram and the pitchgram, the latter obtained using an MPE algorithm. In their proposal, each bin of the joint input is encoded onto a spherical latent space that takes timbral characteristics into account, and the piano-rolls of each instrument are later estimated by masking the pitchgram according to the results of a deep spherical clustering technique applied to the latent space.
Recent multitask deep-learning-based works have successfully proposed multi-instrumental AMT methods that directly estimate pitches and jointly associate them with their instrumental source [Bittner2018, Hung2019, Manilow2020]. In [Bittner2018], a multitask deep learning network jointly estimated outputs for various tasks, including multiple-pitch, melody, vocal and bass line estimation. The Harmonic Constant-Q Transform (HCQT) of the audio signal was used as input, and the training data was semi-automatically labelled by remixing a diverse set of multitrack audio from the MedleyDB [Bittner2017] dataset. In [Hung2019], a DNN was used to jointly predict the pitch and instrument for each audio frame; it used the Constant-Q Transform (CQT) as input and was trained on a large amount of audio synthesised from MIDI piano-rolls. Manilow et al. [Manilow2020], on the other hand, jointly transcribed and separated an audio signal into up to four instrumental sources (piano, guitar, bass and strings). However, their system was trained only on synthesised signals.
Our approach is closely related to that of Hung and Yang [Hung2018]
, where a frame-level instrument recogniser is proposed that uses the CQT spectrogram of the music signal together with the pitch information of the note events. We also use pitch annotations to guide the instrument classifier, but our work differs from [Hung2018] in that we perform a classification for each note event individually, while Hung and Yang use the whole piano-roll at once to guide frame-level instrument recognition. While they are able to obtain instrument activations by leveraging the pitch information, they cannot stream the note events into their corresponding instruments.
3 Proposed Method
In our method, we use the same definition of note events as in the MIREX MPE task (https://www.music-ir.org/mirex/). Each note is considered an event with a constant pitch, an onset time and an offset time; a note is therefore uniquely defined by its (pitch, onset, offset) tuple. In our experiments, we use two ways of obtaining this note information: first, ground-truth pitch labels provided by the employed dataset (MusicNet) [Thickstun17]; second, pitch estimates predicted by state-of-the-art MPE algorithms [Thome17, Wu2019a]. We consider the pitch granularity to follow the semitone scale.
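As a minimal sketch (our own illustration, not the authors' code), the note-event definition above can be captured by a small data structure; the class and field names are hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NoteEvent:
    """A note event as defined above: constant pitch plus onset/offset times."""
    pitch: int      # pitch on the semitone (MIDI) scale
    onset: float    # onset time in seconds
    offset: float   # offset time in seconds

    def duration(self) -> float:
        # a valid event has offset > onset
        return self.offset - self.onset

# Example: middle C (MIDI 60) held for 0.55 s
note = NoteEvent(pitch=60, onset=1.25, offset=1.80)
```

Under this definition, a whole recording is just a list of such tuples, which is also the output format of typical MPE algorithms.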
In our framework polyphony is allowed, so more than a single note will usually be active at a time, but our objective is to analyse each note of the audio signal separately in order to assign an instrument class to it. This is done by using two inputs to the model: the main input, a time-frequency representation of a segment of the audio signal around the note's onset, and an auxiliary input, which carries the pitch, onset and offset information of the note. The two inputs are concatenated along the channel dimension into a two-channel input that is fed to the model. Figure 1 shows an overview of the proposed framework.
3.1 Main Audio Input
The main input is a time-frequency representation of a small clip of the music signal, with dimensions given by the number of frequency bins and the number of time frames. The clip is generated by first setting a maximum duration for the note. We tested a range of values for this maximum duration (see Section 7 for details); the shortest obtained the best results, so we kept this value in all of the other experiments. If a note's duration exceeds this maximum, only its initial time span is considered.
Next, for every note, the clip is constructed by picking a segment of the appropriate duration from the original music signal, starting a small interval before the annotated onset to take into account deviations between the true onset and the value we use. Including this extra window of signal also helps the convolutional layers, since it brings some context from before the note onset. The interval's value was set after initial tests. Lastly, if the note ends before the clip does, we set the values of the clip after the note offset to zero.
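The clipping procedure can be sketched as below; `extract_note_clip` is our own illustrative helper, with `t_max` (maximum analysed duration) and `delta` (pre-onset context) as free parameters standing in for the experimentally tuned values:

```python
def extract_note_clip(signal, sr, onset, offset, t_max, delta):
    """Cut a fixed-length clip of t_max + delta seconds around a note,
    starting delta seconds before the onset, then zero out everything
    after the (clamped) note offset."""
    start = max(0, int(round((onset - delta) * sr)))
    length = int(round((t_max + delta) * sr))
    clip = list(signal[start:start + length])
    clip += [0.0] * (length - len(clip))   # zero-pad if the recording ends early
    # samples after the note offset (relative to the clip start) are zeroed
    end = int(round((min(offset - onset, t_max) + delta) * sr))
    for i in range(end, length):
        clip[i] = 0.0
    return clip
```

A time-frequency representation (mel spectrogram or CQT, see Section 7) is then computed from the returned clip.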
3.2 Auxiliary Note-Related Input
The auxiliary input is a harmonic comb representation that uses the pitch value as the first harmonic (we use the definition that the fundamental frequency corresponds to the first harmonic): bins located at integer multiples of the fundamental frequency, up to the total number of harmonics in the representation, are set to one, and all other bins are set to zero. We tested multiple values for the number of harmonics (see Section 6). In practice, we use a tolerance of half a semitone around each harmonic value when constructing the auxiliary input as a mel-spectrogram. Therefore, even though this representation starts as binary, the final mel-spectrogram is not binary due to the mel-filtering procedure. Moreover, we also set the values of the auxiliary input to zero before the note's onset and after its offset. Figure 2 shows an example of a pair of inputs for our framework.
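The comb construction can be illustrated as follows; `harmonic_comb` is a hypothetical helper that returns the comb's active frequencies in Hz, before the half-semitone tolerance and mel filtering are applied:

```python
def harmonic_comb(midi_pitch, n_harmonics):
    """Frequencies of the comb: the fundamental (taken as the first
    harmonic, as in the text) and its integer multiples."""
    f0 = 440.0 * 2.0 ** ((midi_pitch - 69) / 12.0)   # MIDI number -> Hz
    return [h * f0 for h in range(1, n_harmonics + 1)]

# Example: A4 (MIDI 69) with 4 harmonics -> 440, 880, 1320, 1760 Hz
comb = harmonic_comb(69, 4)
```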
The note-level instrument assignment task is tackled as a multi-class, single-label classification task: given the two-channel input, our objective is to classify it as belonging to one of the instrument classes. We use a deep neural network that receives the two-channel input and outputs a vector of class scores. A softmax activation function is applied in the final layer of the network to ensure the output values represent probabilities that sum to one. The model is trained using the cross-entropy loss. At inference time, the class corresponding to the dimension with the highest value in the output is predicted. See Section 4 for details regarding the network architecture.
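The inference step can be sketched with a plain softmax and argmax; the class list follows the seven MusicNet classes used later in the paper, while the scores are made-up numbers for illustration:

```python
import math

def softmax(scores):
    """Numerically stable softmax over the final-layer scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

CLASSES = ["piano", "violin", "viola", "cello", "horn", "bassoon", "clarinet"]
probs = softmax([2.1, 0.3, -1.0, 0.0, -0.5, -2.0, 0.4])  # fabricated scores
predicted = CLASSES[probs.index(max(probs))]             # argmax at inference
```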
In cases where two or more instruments play the same pitch simultaneously, the small differences between the notes' onset and offset values usually generate distinct inputs, so the instrument assignment task can still be properly executed as single-label classification. However, when the pitch, onset and offset values of notes from different instruments exactly match, our system will consider them as a single note and only a single instrument will be estimated. This case rarely happens in real-world scenarios for many musical styles; for instance, in MusicNet only a small fraction of the notes share the same pitch, onset and offset values. For our experiments, we considered notes in MusicNet that were performed by a single instrument and discarded the notes that were concurrently produced (with the same pitch, onset and offset times) by multiple instruments. As a proof of concept, we believe this is not a severe limitation of our framework, and we leave multi-label approaches as future work.
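The discarding step described above amounts to dropping notes whose (pitch, onset, offset) tuple is not unique across instruments; a sketch under our own (assumed) data layout:

```python
from collections import Counter

def drop_exact_duplicates(notes):
    """Keep only notes whose (pitch, onset, offset) tuple occurs once.
    `notes` is a list of (pitch, onset, offset, instrument) tuples."""
    counts = Counter((p, on, off) for p, on, off, _ in notes)
    return [n for n in notes if counts[n[:3]] == 1]
```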
[Table 2: Note-level F-scores per instrument class (Piano, Violin, Viola, Cello, Horn, Bassoon, Clarinet) and their mean, for different main and auxiliary input configurations.]
4 Network Architecture

When processing music spectrograms with CNNs, combining vertical and horizontal kernel shapes in the model architecture can facilitate the learning of timbre-discriminative feature maps [Pons2016, Pons17_Timbre, Lordelo19]. In our work, we propose a CNN adapted from the W-MDenseNet [Lordelo19]. This architecture was originally proposed for harmonic-percussive source separation and consists of an encoder-decoder model that estimates spectrograms for two sources; the outputs of the W-MDenseNet therefore have the same shape as the mixture spectrogram used as input. In this architecture, three MDenseNets [MDenseNet17] run in parallel in separate branches, each with a unique kernel shape (vertical, square and horizontal). The MDenseNets are only combined at the final layer, i.e., after both the encoding and decoding procedures are performed. In our work, we adopt a similar methodology, taking only the encoder layers from [Lordelo19] and adding fully connected layers at the end in order to perform classification rather than separation. We also modify the original encoder layers: instead of combining the branches with a concatenation layer only at the final stage, we concatenate their feature maps at the end of each downsampling stage. By doing so, we allow each branch to access feature maps computed with all the different kernel shapes from the previous stage.
Figure 3 shows a summary of the architecture we adopt in our work. It consists of a stack of multi-branch convolutional stages followed by fully connected layers. Figure 4 shows the internal structure of the multi-branch convolutional stage. Internally, each stage contains separate branches whose convolutions have unique kernel shapes: one branch with horizontal, one with square, and one with vertical convolution kernels. In each branch, a Densely connected convolutional Network (DenseNet) [Huang2017_DenseNet] with a fixed growth rate and number of layers is used. In short, a DenseNet is a stack of convolutional layers, each with its own activation function, with a dense pattern of skip connections, where each layer receives the concatenation of all previous layers' outputs as input. We used LeakyReLU as the activation function for all layers; the reader is referred to [Huang2017_DenseNet] for the detailed internal structure of a DenseNet. After the DenseNet, a max-pooling layer reduces the feature-map dimensions and increases the receptive field of each branch. Afterwards, the three branches are concatenated and the batch is normalised. The final feature maps serve as input to the next multi-branch convolutional stage. Since we need to concatenate feature maps that originated from multiple kernel shapes, we use padding in the convolutions and in the max pooling to ensure the feature maps maintain the same dimensions across branches. The total number of trainable parameters is in the millions.
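For intuition, the channel bookkeeping of one stage can be sketched as follows, under the (assumed) DenseNet convention that a block outputs its input concatenated with each layer's newly produced feature maps; the paper's actual growth rate and layer count are not reproduced here:

```python
def stage_output_channels(c_in, growth_rate, n_layers, n_branches=3):
    """Channels after one multi-branch stage: each branch's DenseNet adds
    `growth_rate` feature maps per layer on top of its input, and the three
    branches (horizontal, square, vertical) are then concatenated."""
    per_branch = c_in + growth_rate * n_layers   # dense concatenation within a branch
    return n_branches * per_branch               # concat across the three branches
```

Because every branch is concatenated at the end of each stage, each subsequent stage sees features computed with all three kernel shapes, which is the modification described above.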
5 Dataset

We used the MusicNet dataset [Thickstun17] in our experiments. MusicNet is the largest publicly available dataset of non-synthesised data that is strongly labelled for the task of instrument recognition. This means that the exact frames where the instruments are active in the signal are known, which permits training supervised models that perform instrument recognition at the frame, note and clip levels. The dataset contains freely-licensed classical music recordings by several composers, written for a variety of instruments, along with over a million annotated labels indicating the precise time and pitch of each note in the recordings and the instrument that plays it.
The instrument taxonomy of MusicNet comprises piano, violin, viola, cello, french horn, bassoon, clarinet, harpsichord, bass, oboe and flute. However, the last four instruments (harpsichord, bass, oboe and flute) do not appear in the original test set provided by the authors. Therefore, in all our experiments we ignored the labels related to those instruments and performed a 7-class instrument classification using the following classes: piano, violin, viola, cello, french horn, bassoon and clarinet. Table 1 shows the statistics of the note labels provided by MusicNet. The dataset is heavily biased towards piano and violin, given their usual presence in Western classical music recordings.
6 Experimental Setup
In all experiments we used the original train/test split provided by MusicNet with the dataset's original sampling frequency. For experiments that involved computing the Short-Time Fourier Transform (STFT), we used Blackman-Harris windows to compute the Discrete Fourier Transform (DFT). The hop size was kept the same in every experiment.
From the training set we held out a fixed percentage of the notes of each class to create a validation set. We trained the models using the Adam optimiser, reducing the initial learning rate by a fixed factor whenever the cross-entropy loss on the validation set stopped improving for several consecutive epochs; if no improvement was observed after a larger number of epochs, training was stopped early.
The classification performance was evaluated by computing the note-level F-score ($F$), which is directly related to the precision ($P$) and recall ($R$) according to:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F = \frac{2PR}{P + R},$$

where $TP$ is the number of true positives, $FP$ the number of false positives and $FN$ the number of false negatives.
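These metrics can be computed directly from raw counts; a small self-contained helper (our own, for illustration):

```python
def f_score(tp, fp, fn):
    """Precision, recall and F-score from true/false positive/negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```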
For the cases where the instrument assignment is performed on top of MPE algorithms, we provide two groups of metrics, generated following the MIREX evaluation protocol for the music transcription task. In the first group, an estimated note is considered correct if its onset time is within 50 ms of a reference note and its pitch is within a quarter tone of that reference note; offset values are ignored. In the second group, the offsets are also taken into consideration: an estimated note is only considered correct if, in addition, its offset is within 50 ms of the reference note's offset, or within 20% of the reference note's duration around it, whichever is larger. After all notes are verified, the F-score is computed note-wise across time and the average value is reported. This evaluation was computed using the mir_eval.transcription toolbox (https://craffel.github.io/mir_eval/).
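The "onset only" matching rule can be illustrated with a simplified greedy matcher; note that mir_eval itself performs an optimal bipartite matching, so this sketch only demonstrates the tolerance criteria:

```python
def count_onset_matches(est, ref, onset_tol=0.05, pitch_tol=0.5):
    """Count estimated notes that match an unused reference note whose onset
    lies within onset_tol seconds and whose pitch lies within pitch_tol
    semitones (0.5 semitone = a quarter tone).  Notes are
    (pitch_in_semitones, onset_sec, offset_sec) tuples; offsets are ignored."""
    used, tp = set(), 0
    for p_e, on_e, _ in est:
        for j, (p_r, on_r, _) in enumerate(ref):
            if j in used:
                continue
            if abs(on_e - on_r) <= onset_tol and abs(p_e - p_r) <= pitch_tol:
                used.add(j)
                tp += 1
                break
    return tp
```

From the resulting count of matched notes, the F-score follows from the precision/recall definitions given above.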
7 Results

7.1 Effects of the Kernel Shapes
First, we analysed the effect of including multiple kernel shapes in the CNN architecture. The top part of Table 3 compares three versions of the model: one that uses only square filters in a single branch; a version using the branched structure, but with square kernels in every branch; and the proposed multi-branch structure with horizontal, square and vertical kernels. For the single-branch case, we increased the growth rate of the DenseNets in order to keep the number of trainable parameters of the network close to the original.
Analysing the results, we see that the addition of new kernel shapes improved the average F-score across all classes. Per instrument class, string instruments (piano, violin, viola and cello) show a gain in performance, while for non-string instruments (horn, bassoon, clarinet) the performance either drops or shows only a negligible gain compared to the models that used only square filters. This suggests that the inclusion of vertical kernel shapes helped the model learn the percussive characteristics of the timbre of string instruments.
7.2 Evaluation of the Input Size
We also tested different values for the input size; more specifically, we compared multiple values for the maximum valid window of analysis for a note event. The results are shown in the lower part of Table 3. The shortest input size obtained the best results. We believe this is because the average duration of a note event in the test set of MusicNet is short enough that the shortest window already represents the vast majority of the notes. Moreover, when the analysed note event is longer than the window, the initial portion of the note contains most of the important features for the model.
7.3 Auxiliary Input and Types of Representations
To test the importance of the auxiliary input and how its modification affects the performance of the model, we also evaluated a version of the model using only the main mel-spectrogram input, as well as versions using different numbers of harmonics in the auxiliary input. We additionally tested two types of input representation for the model: the Constant-Q Transform (CQT) and the mel-frequency spectrogram. The CQT was computed with a fixed number of bins per octave over the pitch range of interest, while the mel-frequency spectrogram was computed by a linear transformation of an STFT onto a mel-scaled frequency axis. The results are provided in Table 2.
Analysing the results, it is clear that the auxiliary input is essential to the framework: without it, the average F-score drops substantially. Apart from piano, all other classes show a large decrease in performance when we exclude the auxiliary input. We believe the results for the piano class remain high not only because of the MusicNet bias towards piano, but also because some recordings of the test set are solo piano recordings, which facilitates the classification of piano notes from the main input signal alone due to the absence of other classes. Regarding the number of harmonics used in the auxiliary input, we can see that, in general, the CQT works best with few harmonics, while the mel-STFT prefers higher values. A possible explanation is that odd harmonics are harder to represent on the CQT's log-frequency grid; however, more experiments are needed to investigate this assumption.
7.4 Streaming of Multi-Pitch Estimations
Having verified that our model obtains strong performance when the original ground-truth labels are used, we tested the classifier in a more realistic setting where no note-event labels are available. We estimated frame-level pitch values using two third-party MPE algorithms [Thome17, Wu2019a]. For the algorithm in [Thome17] we obtained an implementation from the original authors, while an implementation of [Wu2019a] is available via the Omnizart project (https://github.com/Music-and-Culture-Technology-Lab/omnizart). We ran both algorithms on the music recordings to obtain the note events used to construct the input to the classifier.
It is important to note that errors in the MPE estimation are carried over to the instrument assignment task: if a note is wrongly estimated, no ground-truth instrument class exists for it, so the results cannot be evaluated in the same way as in the other experiments. In this experiment we therefore used the transcription metrics explained at the end of Section 6. The results appear in Table 4. Given the limitations of each MPE method we used, we can see that our approach can successfully generate multi-instrument transcriptions.
[Table 4: Transcription F-scores per instrument for the "Onset" and "Onset + Offset" criteria.]
8 Conclusion

In this work we presented a convolutional neural network for note-level instrument assignment. We approached this problem as a classification task and proposed a framework that uses the pitch information of note events to guide the classification. Our approach can also successfully classify notes provided by an MPE algorithm, which permits generating multi-instrument transcriptions. Our method also shows the benefits of including different kernel shapes in the convolutional layers.
As future work we plan to investigate more deeply the interaction of our method with MPE algorithms as well as how the final estimations can be improved by including a clip-level analysis. The adoption of multi-label classification approaches is also planned.
Acknowledgements

This work is supported by the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 765068 (MIP-Frontiers).