Self-Supervised Beat Tracking in Musical Signals with Polyphonic Contrastive Learning

by Dorian Desblancs, et al.

Annotating musical beats is a long and tedious process. To combat this problem, we present a new self-supervised learning pretext task for beat tracking and downbeat estimation. This task makes use of Spleeter, an audio source separation model, to separate a song's drums from the rest of its signal. The drum signals are used as positives, and by extension negatives, for contrastive learning pre-training. The drum-less signals, on the other hand, are used as anchors. When pre-training a fully-convolutional and recurrent model using this pretext task, an onset function is learned. In some cases, this function was found to be mapped to periodic elements in a song. We found that pre-trained models outperformed randomly initialized models when a beat tracking training set was extremely small (fewer than 10 examples). When that was not the case, pre-training led to a learning speed-up that caused the model to overfit to the training set. More generally, this work defines new perspectives in the realm of musical self-supervised learning. It is notably one of the first works to use audio source separation as a fundamental component of self-supervision.





1 Audio Representation Learning

1.1 Audio-Visual Correspondence

One of the first papers to explore the field of audio representation learning was Look, Listen, and Learn arandjelovic2017look. In their paper, Arandjelović and Zisserman tried to match a sound to an image. This was done by extracting sounds and images from video frames. From there, two neural networks were used to determine whether a sound corresponded to this image. A vision subnetwork studied the image input, while an audio subnetwork studied the log-spectrogram of the audio input (Footnote 5: The network architecture for Look, Listen, and Learn arandjelovic2017look can be found in Figure 12, in the appendix of this thesis.). The outputs of each subnetwork were then fused together to determine whether the image and sound came from the same video. This work was significant in many ways. In the realm of sound, it was found that extracting the audio subnetwork and fine-tuning it to a downstream task led to much better performance on the downstream task. Why? Because this pretext task allowed the network to distinguish sounds by matching them to an image. Note that the work in arandjelovic2017look was later enhanced by using log mel-spectrograms as an input to the audio subnetwork cramer2019look.

In 2018, Zhao et al. zhao2018sound used a similar idea to perform audio source separation. In their work, image frames from two separate videos were used as input to a visual model. The sounds corresponding to these image frames were then mixed and input to an audio network. The goal of their pretext task was to separate the mixture into two audio snippets, corresponding to the separated sounds of each video. Their method was found to achieve extremely good results in the realm of sound separation. The networks were notably found to match sounds to specific objects in an image. Figures 13 and 14 illustrate their methodology and results.

Although early works in the field of audio representation learning focused on audio-visual correspondences, a flurry of papers in the realm of pure audio representation learning were published around the same time. These notably made use of triplet-based metric learning, and will be presented in the next section.

1.2 Triplet-Based Metric Learning

At its most basic level, triplet-based metric learning is used to train a network to distinguish pairs of images, time series, or other data types. Assuming a triplet $(x_a, x_p, x_n)$, we define:

  • $x_a$, the anchor sample

  • $x_p$, the positive sample

  • $x_n$, the negative sample

The goal of a triplet learning task is then to distinguish the anchor and positive from the negative. The anchor and positive are usually either two samples from the same class, or augmented versions of each other. Popular loss functions include margin loss, which is defined as:

$$\mathcal{L}(x_a, x_p, x_n) = \max\big(0,\; d(x_a, x_n) - d(x_a, x_p) + m\big)$$

where $d$ is a distance metric (Footnote 6: Popular margin loss distance metrics are cosine similarity and Euclidean distance. The margin value is usually adapted to the chosen distance function. For example, in the case of cosine similarity, $m$ is between 0 and 1. As $d(x_a, x_p)$ gets closer to 1, and as $d(x_a, x_n)$ gets closer to -1, the loss decreases.) and $m$ is referred to as a margin value weinberger2009distance. Note that each anchor/positive pair can be compared to multiple negatives at a time, in which case the above formula is used for every negative. The values returned are then summed or averaged.
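As a minimal sketch of the margin loss described above, the following assumes cosine similarity as the metric, so the loss is written as max(0, sim(anchor, negative) - sim(anchor, positive) + margin) and averaged over negatives. The function names and the margin value of 0.5 are illustrative choices, not values taken from the works cited here.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def margin_loss(anchor, positive, negatives, margin=0.5):
    """Margin loss with a similarity-type metric, averaged over all negatives.

    The margin of 0.5 is an assumed, illustrative value.
    """
    sim_pos = cosine_similarity(anchor, positive)
    losses = [
        max(0.0, cosine_similarity(anchor, negative) - sim_pos + margin)
        for negative in negatives
    ]
    return sum(losses) / len(losses)


anchor = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])
easy_negative = np.array([-1.0, 0.0])  # points away from the anchor
hard_negative = np.array([0.8, 0.2])   # almost as close as the positive

easy_loss = margin_loss(anchor, positive, [easy_negative])
hard_loss = margin_loss(anchor, positive, [hard_negative])
```

In this toy example the easy negative already sits further from the anchor than the margin requires, so it contributes no loss, whereas the hard negative does.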

Triplet-based learning was used in three works that greatly inspired us. The first was published in 2018 jansen2018unsupervised. In this paper, Jansen et al. sample random snippets of audio from each data point in the AudioSet dataset gemmeke2017audio. These are the anchors. Each anchor is then augmented using a series of signal transformations such as Gaussian noise addition and time-frequency translation to create positive samples. A simple ResNet he2016deep model was then trained using margin loss with a cosine similarity distance metric. The model was fine-tuned on two downstream tasks: query-by-example and sound classification. The fine-tuned models performed close to their fully-supervised counterparts while using extremely little labelled data.

Lee et al. lee2020metric used a similar approach to learn musical representations. In their work, sample pairs were snippets of songs that had the same class (i.e. same genre). Triplet learning was found to successfully pre-train their model for two musical tasks: similarity-based song retrieval and auto-tagging.

Finally, the triplet learning work that inspired our experiments the most was published by Lee et al. in 2019 lee2019learning. Their work relied on mashing up vocals and background tracks that had similar tempo, beat, and key. From there, triplets were generated using tracks that contained the same vocals, but a different background track. They used a margin loss. The goal of their pretext task was to train a model to recognize a vocal from the same singer amidst a different background track (Footnote 7: Note that three settings were used in this triplet learning task: a MONO setting, where all inputs were monophonic (i.e. only contained vocals), a MIXED setting, where all inputs were mashups, and a CROSS setting, where anchors were monophonic and positives were mashups.). The outputs of the pre-trained network were then adapted for two musical tasks: singer identification and query-by-singer. In both tasks, Lee et al. achieved extremely high accuracy. The audio mashup pipeline used in lee2019learning is illustrated in Figure 15 (Appendix).

This work greatly inspired us because of its ingenious use of musical stems. By distinguishing vocals that came from the same song or artist from their background tracks, a neural network could learn accurate voice embeddings that could be used for a variety of musical tasks. In the same spirit, we separated drum stems from the rest of our songs in order to learn a more rhythmic representation of music for tasks such as beat tracking and downbeat estimation.

1.3 Self-Supervised Learning

As mentioned previously, self-supervised learning involves a pretext task, used to pre-train a neural network using lots of data, and a downstream task that is often quite precise and limited in its data. When it comes to designing a pretext task, there exist a wide range of options. In some cases, the tasks are quite general, and are aimed towards learning an audio representation that can span numerous downstream tasks. In other cases, however, the pretext task is geared towards a specific downstream task.

Let us first explore the former. Some of the most interesting pretext tasks used in audio were introduced by Tagliasacchi et al. tagliasacchi2020pre tagliasacchi2019self. These were partially inspired by the famous word2vec word vectorial representations mikolov2013efficient. In their audio2vec tasks, an autoencoder is used to reconstruct missing pieces of a log mel-spectrogram. This task comes in two forms. In its CBoW variant, the autoencoder must reconstruct a central piece of the input spectrogram. In its skip-gram version, the autoencoder must reconstruct the pieces around the central section of a spectrogram. They also introduced a third task coined temporal gap, in which a model must estimate the duration between two sections of a spectrogram. All three tasks are illustrated in Figure 16 (Appendix).

The encoders from the pretext tasks were then isolated, frozen, and fine-tuned using extra linear layers for tasks such as speech recognition and urban sound classification. They found that merely training the linear layers on top of the learned representation led to performances that were almost on par with fully-supervised, state-of-the-art results.

Carr et al. used similar ideas in carr2021self. In their case, they split their input log spectrograms into a nine-piece jigsaw puzzle. They then used an autoencoder network to predict the correct permutation ordering. The encoder network was then extracted and fine-tuned using all layers (i.e. the encoder is not frozen). The performance they obtained surpassed end-to-end fully-supervised training on instrument family detection, instrument labelling, and pitch estimation tasks.

Finally, in the realm of music, Wu et al. wu2021multi recently used an encoder network to predict input song snippets’ classic music features such as MFCCs, Tempograms, and Chromas (Footnote 8: A last, non-musical, paper worth mentioning is ryan2020using. The authors used self-supervised learning applied to bird songs for downstream industrial audio classification.). The encoder was then extracted and trained on downstream datasets such as the FMA genre dataset defferrard2016fma by adding an MLP on top of the network. The encoder was trained in a fully-supervised fashion (with random initializations), in a frozen fashion (pre-trained layers frozen), and a fine-tuning fashion (pre-trained layers also trained). The representations learned through the pretext task allowed the network to achieve results on par with end-to-end supervised training in the frozen context, and superior to end-to-end supervised training in the fine-tuning context.

Let us now focus on pretext tasks that are targeted to a specific downstream task. In the realm of music, two recent papers stand out, and greatly inspired our work. The first one is SPICE gfeller2020spice (Footnote 9: SPICE stands for Self-Supervised Pitch Estimation.). In their work, Gfeller et al. propose a pretext task that can be adapted to automatically estimate musical pitch. Two pitch-shifted pieces of the same audio are used as input to the same encoder-decoder network. Note that the CQT of each pitch-shifted track is used as input to the network. The encoder must then produce a single scalar for each CQT. This scalar is then used for two purposes. First, the relative difference between each scalar produced must be proportional to the initial pitch shift between each encoder input. Second, the scalar is used to reconstruct the un-shifted audio input. Both of these outputs are used in SPICE’s loss function. The model proposed is then able to estimate pitch using a simple affine mapping, from relative to absolute pitch. Supervised learning is only used to calibrate this affine mapping. The results obtained on downstream pitch estimation tasks are superior to other fully-supervised methods, proving the efficacy of SPICE’s pretext task. Figure 6 gfeller2020spice outlines the pretext task’s full pipeline.

Figure 6: SPICE gfeller2020spice Pretext Task Overview.

The second work that inspired us makes use of inverse audio synthesis to detect pitch. In their paper, Engel et al. engel2020self make use of Differentiable Digital Signal Processing (DDSP) modules presented in engel2020ddsp. An input log mel-spectrogram is reconstructed using a mixture of harmonic and sinusoidal synthesizers. By doing so, their network is able to disentangle a piece of music’s pitch and timbre. The resulting pitch estimations outperform SPICE and other methods. To our knowledge, these results are still state-of-the-art.

1.4 Contrastive Learning

Let us now introduce the key concepts behind contrastive learning. In some sense, contrastive learning is a more refined and modern version of triplet-based metric learning. We define anchors, positives, and negatives in the same way. Training batches are created using a sample pair (i.e. an anchor and its corresponding positive) and negatives that correspond to other samples’ positives. These are generated randomly at each epoch. The standard contrastive loss function, defined by:

$$\ell_{i,j} = -\log \frac{\exp\big(\mathrm{sim}(z_i, z_j)/\tau\big)}{\sum_{k=1}^{N} \mathbb{1}_{[k \neq i]} \exp\big(\mathrm{sim}(z_i, z_k)/\tau\big)}$$

is then computed across each batch during the training process. $\mathrm{sim}$ denotes a similarity function (most often Cosine Similarity) and $\tau$ a temperature parameter (usually between 0 and 1). We assume a batch size of $N$, where indices $i$ and $j$ are used for the anchor and positive.
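The contrastive loss for a single anchor can be sketched as the negative log-softmax of the positive's similarity among all candidates (the positive plus the negatives). The sketch below assumes cosine similarity; the function name, the temperature of 0.1 and the embedding sizes are illustrative assumptions rather than settings from any cited work.

```python
import numpy as np


def contrastive_loss(anchor, candidates, positive_index=0, temperature=0.1):
    """Contrastive loss for one anchor embedding.

    `candidates` has shape (n_candidates, dim) and holds the positive and
    all negatives; the anchor itself is excluded, mirroring the k != i
    indicator in the usual formula.
    """
    a = anchor / np.linalg.norm(anchor)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    logits = (c @ a) / temperature          # cosine similarities / tau
    logits = logits - logits.max()          # numerical stability only
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[positive_index])


anchor = np.array([1.0, 0.0, 0.0])
aligned = np.array([[1.0, 0.0, 0.0],    # positive, same direction as anchor
                    [0.0, 1.0, 0.0],    # negatives, orthogonal to anchor
                    [0.0, 0.0, 1.0]])
low_loss = contrastive_loss(anchor, aligned, positive_index=0)
high_loss = contrastive_loss(anchor, aligned, positive_index=1)
```

When every candidate is equally similar to the anchor, the loss reduces to the logarithm of the number of candidates, which is a handy sanity check during training.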

In the field of audio, a few recent works inspired this thesis. First, Saeed et al. saeed2021contrastive published a very simple framework for generating general-purpose audio representations. Anchors and positives were different sections of the same audio clip. Their log mel-spectrogram was then used as input to an EfficientNet model tan2019efficientnet with two additional linear layers (used to project the output to a vector of size 512). The resulting outputs were compared using a contrastive loss with a bilinear similarity metric. The pre-trained EfficientNet model was then extracted and either frozen or fine-tuned for a set of downstream tasks that range from speaker identification to bird song detection. In almost all cases, the fine-tuned model vastly outperformed its fully-supervised counterpart. Figure 17 outlines the simple framework used.

Another recent work in the field was published by Wang et al. wang2021multi. They used contrastive learning to match waveform audio representations to their spectrogram counterpart.

Finally, the most notable paper published in the field of musical contrastive learning is Contrastive Learning of Musical Representations by Spijkervet et al. spijkervet2021contrastive. In their work, anchors and positives are waveform snippets from the same song. Positives were augmented using techniques such as polarity inversion and gain reduction. Their model was then frozen and adapted to a musical tagging downstream task using a fully-connected layer. The results they obtained were in line with fully supervised methods at the time. More importantly, they achieved extremely high performance using merely 1% of the training data. Figure 18 outlines their pretext task pipeline. (Footnote 10: Note that in both saeed2021contrastive and spijkervet2021contrastive, anchors are also compared to each other, i.e. are used as negatives for other anchors.)

2 Audio Source Separation

In audio, the field of blind source separation (BSS) deals with the task of recovering the source signals that compose a mixture. In this context, one does not know how many sources the mixture has. Many algorithms do however assume fixed sources to perform separation (Footnote 11: Spleeter spleeter2020, for example, assumes 'drum', 'bass', 'vocal', and 'other' sources in its '4stem' setting.).

Historically, BSS tasks were solved using techniques from Computational Auditory Scene Analysis (CASA) brown1994computational or matrix decomposition methods. Most notably, independent component analysis (ICA) hyvarinen2000independent was used to separate a mixture into statistically independent and non-Gaussian sources. ozerov2007adaptation provides a more thorough overview of historical audio source separation methods.

In recent years, deep neural networks have taken over the field of source separation. The U-net architecture is extremely popular in the field of music, and has led to high-performing source separation algorithms jansson2017singing. Other high-performing deep learning models in the field include stoller2018wave and lluis2018end. We encourage the reader to consult peeters2021deep for a more thorough overview of deep learning networks applied to source separation.

In this thesis, we used Spleeter spleeter2020 to separate our songs into multiple stems (Footnote 12: As a sidenote, Spleeter spleeter2020 was introduced by Deezer Research. Today, it is one of the most popular source separation algorithms in the field of music. It is notably used by large audio companies such as iZotope and Algoriddim. Note that the Python package is open-source.). Spleeter allows a user to split songs into two stems (vocal and other stems), four stems (vocal, bass, drum, and other stems), and five stems (vocal, bass, drum, piano, and other stems). We made use of the four-stem model to separate drum stems from the rest of our signals.

3 Beat Tracking and Downbeat Estimation

Let us finally introduce some of the important works in the fields of beat tracking and downbeat estimation. As mentioned previously, the beat of a song is often described as the rhythm listeners tap their foot to when listening to a piece of music. The downbeat is the first beat of each bar, and downbeat estimation is the task of locating it. Before delving into methods, let us introduce some details about common datasets in the field. These were used in both our work and previous papers in the field.

3.1 Datasets

We used a total of four datasets to evaluate our beat tracking and downbeat estimation methodology. Table 1 displays some of the important information about each of these (notably whether beat and downbeat annotations are available).

| Dataset | # files | Length | Beats | Downbeats |
|---|---|---|---|---|
| Ballroom gouyon2006experimental krebs2013rhythmic | 685 | 5h57m | yes | yes |
| Hainsworth hainsworth2004particle | 222 | 3h19m | yes | yes |
| GTZAN marchand2015swing tzanetakis2002musical | 1000 | 8h20m | yes | yes |
| SMC holzapfel2012selective | 217 | 2h25m | yes | no |

Table 1: Common Datasets used for Beat Tracking and Downbeat Estimation

On a separate note, the Ballroom dataset gouyon2006experimental is composed of dance music excerpts, such as tangos and waltzes. The Hainsworth hainsworth2004particle dataset covers a wider variety of genres, such as classical and electronic music. The GTZAN tzanetakis2002musical dataset is composed of 10 genres, including hip-hop, jazz, and disco. Finally, the SMC holzapfel2012selective dataset spans a wide range of genres that are similar to those of hainsworth2004particle. One key difference with the other datasets is that each song was selected due to the difficulty of estimating accurate beats. That is why most beat tracking systems perform much worse on this dataset than on others.

3.2 Classic Methods

Before deep learning, most methods in the field of beat tracking relied on a two-step process. The first was a front-end process that extracts onset locations (an onset describes the start of a musical event) from a time-frequency or subband analysis of a signal. A periodicity estimation algorithm would then find the rate at which these events occur. This was notably the case in miguel2004tempo. By 2012, however, deep learning had already achieved state-of-the-art results in the field. We recommend the reader consult peeters2021deep for a more thorough review of historical methods in beat tracking and downbeat estimation.

3.3 Modern Methods

When it comes to deep learning and beat estimation, a wide range of methods have been introduced lately. In these kinds of tasks, the neural network produces an activation function. This function is supposed to equal 1 when a beat occurs, and 0 otherwise. Beats are then "picked" from this function using a Dynamic Bayesian Network (DBN) krebs2015efficient bock2014multi krebs2013rhythmic. These networks are probabilistic, and include Hidden Markov Models (HMMs) and particle filtering models. They read the activation function and output its beat locations. We will come back to these later in this report. The first beat tracking architectures that were found to work were Recurrent Neural Networks (RNNs) such as Long Short-Term Memory (LSTM) networks hochreiter1997long. These types of networks were notably used in bock2014multi and bock2016joint to produce both beat and downbeat activation functions. More recently, temporal convolutional networks were found to perform just as well matthewdavies2019temporal.

When it comes to training networks for beat tracking and/or downbeat estimation, we can distinguish two settings. The first is associated with training a model on one task only (i.e. we train one network on beat tracking or downbeat estimation only). In the second setting, both beat and downbeat locations are learned jointly during training, by two separate networks (the loss from each network is combined). This was found to vastly improve results for downbeat estimation bock2016joint. After all, beat tracking is an easier task, and a downbeat estimation network benefits from knowing where beats are located. Table 2 summarizes the performance of some popular beat estimation methods. We report the papers’ F1-measure of correctly versus incorrectly predicted beats (Footnote 13: Note that, although the F1-measure is the most popular evaluation metric for beat tracking and downbeat estimation, a wide variety of other metrics also exist. We will briefly introduce them later in this report.). Correctness is determined over a small window of 70 ms.
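To make the F1-measure concrete, here is a small sketch of how it can be computed from two lists of beat times. It assumes the 70 ms window is a tolerance of plus or minus 70 ms around each annotated beat, that each annotation can be matched by at most one estimate, and that a simple greedy matching in time order is acceptable; the function name is hypothetical.

```python
def beat_f_measure(reference, estimated, tolerance=0.07):
    """F1 of estimated beat times against reference beat times (seconds).

    Assumes a +/- `tolerance` window and a greedy one-to-one matching in
    time order.
    """
    reference = sorted(reference)
    estimated = sorted(estimated)
    matched = 0
    i = j = 0
    while i < len(reference) and j < len(estimated):
        diff = estimated[j] - reference[i]
        if abs(diff) <= tolerance:
            matched += 1          # hit: consume both beats
            i += 1
            j += 1
        elif diff < 0:
            j += 1                # estimate too early: false positive
        else:
            i += 1                # annotation missed: false negative
    if matched == 0:
        return 0.0
    precision = matched / len(estimated)
    recall = matched / len(reference)
    return 2 * precision * recall / (precision + recall)


reference_beats = [0.5, 1.0, 1.5, 2.0]
estimated_beats = [0.52, 1.0, 1.58, 2.03, 2.5]
score = beat_f_measure(reference_beats, estimated_beats)
```

In this example three of the five estimates fall within the tolerance of an annotation, giving a precision of 0.6, a recall of 0.75, and an F1-measure of about 0.67.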

| Dataset | Methodology | Beat F1 | Downbeat F1 |
|---|---|---|---|
| Ballroom gouyon2006experimental krebs2013rhythmic | TCN matthewdavies2019temporal | 0.933 | NA |
| | Joint RNN bock2016joint | 0.938 | 0.863 |
| | MM bock2014multi | 0.910 | NA |
| | FA-CNN durand2016feature | NA | 0.778/0.797 |
| Hainsworth hainsworth2004particle | TCN matthewdavies2019temporal | 0.874 | NA |
| | Joint RNN bock2016joint | 0.867 | 0.684 |
| | MM bock2014multi | 0.843 | NA |
| | FA-CNN durand2016feature | NA | 0.657/0.664 |
| GTZAN marchand2015swing tzanetakis2002musical | TCN matthewdavies2019temporal | 0.843 | NA |
| | Joint RNN bock2016joint | 0.856 | 0.640 |
| | SPD davies2006spectral | 0.806 | 0.462 |
| | MM bock2014multi | 0.864 | NA |
| | FA-CNN durand2016feature | NA | 0.860/0.879 |
| SMC holzapfel2012selective | TCN matthewdavies2019temporal | 0.543 | NA |
| | Joint RNN bock2016joint | 0.516 | NA |
| | SPD davies2006spectral | 0.337 | NA |
| | MM bock2014multi | 0.529 | NA |

Table 2: F-measures Obtained by Popular Beat Tracking and Downbeat Estimation Algorithms

Most of the results reported above come from past MIREX challenges (Footnote 14: Some of the datasets that comprise the MIREX combined dataset are not available.). In these, the Ballroom gouyon2006experimental krebs2013rhythmic, Hainsworth hainsworth2004particle, and SMC holzapfel2012selective datasets were used for 8-fold cross-validation (CV) whereas the GTZAN marchand2015swing tzanetakis2002musical dataset was used as a test set. We did not have access to the combined dataset, so we evaluated our method using regular 8-fold CV on each dataset separately. This is a key difference that may explain the gap between some of the state-of-the-art results and ours.

4 Audio Input Representation

First, for both pretext and downstream tasks, all our audio input signals were resampled at a rate of 16000 Hz. In the case of the pretext task, this was done for all Spleeter-generated spleeter2020 stems. For the downstream tasks, this was done for each of our beat tracking data tracks.

These signals were then transformed using Librosa’s [mcfee2015librosa] Variable-Q Transform (VQT). We used a hop length (number of audio samples between adjacent Short-Time Fourier Transform (STFT) columns) of 256. The minimum frequency used was 16.35 Hz (i.e. the frequency of the note C0). A total of 96 frequency bins were used for the resulting time-frequency representation. These correspond to a frequency range spanning eight octaves with a resolution of 12 notes per octave (Footnote 15: All VQTs presented in this report follow the specifications described in this section.). This VQT was inspired by the equal temperament tuning system. We then took the absolute value of each bin, followed by its logarithm (Footnote 16: Note that a small number was added to each bin before computing the logarithm.). The resulting matrix was used as input to all of our models.
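The input representation can be sketched as follows, using the parameters stated above (16 kHz audio, hop length 256, 96 bins from 16.35 Hz at 12 bins per octave). The helper names and the offset `eps = 1e-6` are assumptions, since the text only mentions "a small number"; `log_vqt` calls `librosa.vqt` and imports Librosa lazily, so the rest of the snippet runs without it.

```python
import numpy as np

SR = 16000             # sample rate in Hz
HOP = 256              # hop length in samples
FMIN = 16.35           # frequency of C0 in Hz
N_BINS = 96            # eight octaves
BINS_PER_OCTAVE = 12   # one bin per semitone


def n_frames(n_samples, hop=HOP):
    """Frame count of a centred transform: one frame every `hop` samples."""
    return 1 + n_samples // hop


def bin_frequencies():
    """Centre frequency of each bin on an equal-tempered grid."""
    return FMIN * 2.0 ** (np.arange(N_BINS) / BINS_PER_OCTAVE)


def log_magnitude(vqt, eps=1e-6):
    """Absolute value followed by a log; `eps` is an assumed small offset."""
    return np.log(np.abs(vqt) + eps)


def log_vqt(y):
    """Log-magnitude VQT of a mono signal sampled at SR (needs librosa)."""
    import librosa  # imported lazily so the helpers above stay dependency-free
    vqt = librosa.vqt(y, sr=SR, hop_length=HOP, fmin=FMIN,
                      n_bins=N_BINS, bins_per_octave=BINS_PER_OCTAVE)
    return log_magnitude(vqt)


frames_in_five_seconds = n_frames(5 * SR)   # time dimension of a 5 s excerpt
seconds_per_frame = HOP / SR                # temporal resolution
```

With these settings, five seconds of audio yield 313 frames of 16 ms each, and the highest bin sits just under 4 kHz, below the 8 kHz Nyquist frequency.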

5 Model Design

The model we designed was inspired by other beat tracking architectures bock2016joint bock2014multi. Table 3 summarizes the various layers that compose it. The table assumes an input shape of $1 \times 96 \times 313$ (this corresponds to the VQT of five seconds of audio).

The input log-VQT is first fed into a series of convolutional and max-pooling layers. The max-pooling layers only diminish the frequency dimension. The first max-pooling layer reduces the dimension from 96 to 32, the second from 32 to 8, and the third from 8 to 1. In some sense, this is akin to reducing the frequency dimension to one value per octave, and max-pooling the resulting values. The time dimension is not reduced, however. This is due to the fact that our input time dimension already has a resolution of 62.6 bins per second, or approximately 16 ms per bin. For tasks such as beat tracking, this time resolution is standard. In matthewdavies2019temporal, for example, the authors use a time resolution of 10 ms per bin to achieve state-of-the-art results.

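| Layer | Output Dimension (# Channels × Freq. Bins × Time Dim.) | Kernel Size | Stride | Padding |
|---|---|---|---|---|
| Input | 1 × 96 × 313 | | | |
| Conv2d | 64 × 96 × 313 | 3 × 11 | 1 × 1 | 1 × 5 |
| MaxPool2d | 64 × 32 × 313 | 3 × 1 | 3 × 1 | 0 × 0 |
| Conv2d | 128 × 32 × 313 | 5 × 15 | 1 × 1 | 2 × 7 |
| MaxPool2d | 128 × 8 × 313 | 4 × 1 | 4 × 1 | 0 × 0 |
| Conv2d | 256 × 8 × 313 | 3 × 21 | 1 × 1 | 1 × 10 |
| MaxPool2d | 256 × 1 × 313 | 8 × 1 | 8 × 1 | 0 × 0 |
| Conv2d | 128 × 1 × 313 | 1 × 25 | 1 × 1 | 0 × 12 |
| GRU | 256 × 313 | | | |
| Conv1d | 1 × 313 | 1 | 1 | 0 |

Table 3: Model Architecture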
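The output dimensions in Table 3 follow from the usual convolution and pooling size rule, out = floor((in + 2 × padding - kernel) / stride) + 1, applied separately to the frequency and time axes. The short check below traces that arithmetic for the seven 2-D layers; the layer list is transcribed from the table, and the helper names are illustrative.

```python
# (layer, kernel (freq, time), stride (freq, time), padding (freq, time))
LAYERS = [
    ("Conv2d",    (3, 11), (1, 1), (1, 5)),
    ("MaxPool2d", (3, 1),  (3, 1), (0, 0)),
    ("Conv2d",    (5, 15), (1, 1), (2, 7)),
    ("MaxPool2d", (4, 1),  (4, 1), (0, 0)),
    ("Conv2d",    (3, 21), (1, 1), (1, 10)),
    ("MaxPool2d", (8, 1),  (8, 1), (0, 0)),
    ("Conv2d",    (1, 25), (1, 1), (0, 12)),
]


def out_size(size, kernel, stride, padding):
    """Output length of a convolution or pooling layer along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(freq=96, time=313):
    """Return the (freq, time) size after each 2-D layer of the table."""
    shapes = []
    for _name, kernel, stride, padding in LAYERS:
        freq = out_size(freq, kernel[0], stride[0], padding[0])
        time = out_size(time, kernel[1], stride[1], padding[1])
        shapes.append((freq, time))
    return shapes


shapes = trace_shapes()
```

The padding values are exactly half of each (odd) kernel width minus one half, which is why the time dimension stays at 313 throughout while only the pooling layers shrink the frequency axis.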

The convolutional layers’ kernel sizes are widened on the time dimension as the network deepens. The number of channels is also increased up to 256. Once the frequency dimension size is equal to 1, the sequence is fed into a stacked GRU cho2014learning layer (i.e. two consecutive GRUs; the latter reads the output of the former). Each GRU reads sequences in a bidirectional fashion. The stacked GRU output is then fed into a convolutional layer with kernel size 1. This layer reduces the number of channels to 1. A sigmoid activation layer is then used to squash all values between 0 and 1. The resulting sequence corresponds to our model’s output.

Do note that all convolutional and GRU layers are followed by a combination of ReLU activations and Dropout. For our pretext task, we used a Dropout probability of 0.1 (values above 0.3 made it impossible for our model to train). This value was set to 0.5 for our downstream tasks. This was done so that our model would train correctly in the former case, and to combat overfitting in the latter case. Furthermore, we originally chose a Dropout value of 0.5 due to its optimality for a wide range of tasks srivastava2014dropout.
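For readers who prefer code, the architecture of Table 3 could be sketched in PyTorch roughly as follows. This is an approximation under stated assumptions, not the authors' implementation: the class name `BeatNet` is hypothetical, the GRU hidden size of 128 per direction is inferred from the 256-channel GRU output in the table, and the exact placement of ReLU and Dropout around the GRU is a guess.

```python
import torch
from torch import nn


class BeatNet(nn.Module):
    """Hypothetical sketch of the convolutional-recurrent model in Table 3."""

    def __init__(self, dropout: float = 0.1):
        super().__init__()

        def conv(c_in, c_out, kernel, padding):
            # Conv2d followed by ReLU and Dropout, as described in the text.
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel, stride=1, padding=padding),
                nn.ReLU(),
                nn.Dropout(dropout),
            )

        self.frontend = nn.Sequential(
            conv(1, 64, (3, 11), (1, 5)),
            nn.MaxPool2d((3, 1), stride=(3, 1)),   # 96 -> 32 frequency bins
            conv(64, 128, (5, 15), (2, 7)),
            nn.MaxPool2d((4, 1), stride=(4, 1)),   # 32 -> 8
            conv(128, 256, (3, 21), (1, 10)),
            nn.MaxPool2d((8, 1), stride=(8, 1)),   # 8 -> 1
            conv(256, 128, (1, 25), (0, 12)),
        )
        # Two stacked bidirectional GRUs; 128 units per direction (assumed).
        self.gru = nn.GRU(128, 128, num_layers=2, batch_first=True,
                          bidirectional=True, dropout=dropout)
        self.post_gru = nn.Sequential(nn.ReLU(), nn.Dropout(dropout))
        self.head = nn.Conv1d(256, 1, kernel_size=1)

    def forward(self, x):
        # x: (batch, 1, 96, time) log-VQT
        h = self.frontend(x)                     # (batch, 128, 1, time)
        h = h.squeeze(2).transpose(1, 2)         # (batch, time, 128)
        h, _ = self.gru(h)                       # (batch, time, 256)
        h = self.post_gru(h).transpose(1, 2)     # (batch, 256, time)
        return torch.sigmoid(self.head(h)).squeeze(1)  # (batch, time)


model = BeatNet(dropout=0.1).eval()
with torch.no_grad():
    activation = model(torch.zeros(2, 1, 96, 313))
```

Passing a batch of two five-second inputs returns one value between 0 and 1 per time frame, i.e. a tensor of shape (2, 313).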

6 Pretext Task

Let us now introduce the pipeline we created for our contrastive learning experiment.

6.1 Data Processing

The first step of our experiment was to create a very large dataset of drum and Rest-of-Signal (ROS) snippets. Note that we define a ROS stem as a track without its drums (for simplicity purposes).

In order to do so, we first loaded each track in the FMA large defferrard2016fma dataset (106,574 tracks of 30s) at a sampling rate of 44100 Hz (Spleeter assumes an input sample rate of 44100 Hz). We then used Spleeter spleeter2020 to separate each track into its drum and ROS stems. The latter was created by mixing the bass, other, and vocals stems extracted by Spleeter’s 4stems model. Once this step was done, we were presented with a set of problems. Not all tracks contain drums, of course, and matching a ROS stem to an empty signal would likely make our pretext task fail (especially if the training set contains multiple tracks without drums, which is likely the case). We also ran into numerous cases where our drum stem contained all the audio, while our ROS stem was empty. This often occurred when a track was composed of a bassline and its accompanying drums. When both were synchronized, the source separation model struggled to disentangle them.
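The construction of the ROS stem can be sketched as a plain sum of the three non-drum stems. The snippet assumes the stems arrive as a dictionary of equally shaped arrays keyed `vocals`, `drums`, `bass`, and `other`, which is our understanding of what Spleeter's `Separator("spleeter:4stems").separate(waveform)` returns; the helper names are hypothetical, and the example runs on synthetic arrays rather than real audio.

```python
import numpy as np


def separate_4stems(waveform):
    """Run Spleeter's pretrained 4-stem model on a (n_samples, n_channels)
    array sampled at 44100 Hz. Requires Spleeter; imported lazily."""
    from spleeter.separator import Separator
    return Separator("spleeter:4stems").separate(waveform)


def drums_and_ros(stems):
    """Return the drum stem and the Rest-of-Signal mix (bass + other + vocals)."""
    ros = stems["bass"] + stems["other"] + stems["vocals"]
    return stems["drums"], ros


# Synthetic stand-in for separator output: one second of stereo noise per stem.
rng = np.random.default_rng(0)
fake_stems = {name: rng.normal(size=(44100, 2))
              for name in ("vocals", "drums", "bass", "other")}
drums, ros = drums_and_ros(fake_stems)
```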

In order to solve this problem, we computed the Root-Mean-Square (RMS), defined for a signal $x$ of length $N$ as:

$$\mathrm{RMS}(x) = \sqrt{\frac{1}{N} \sum_{n=1}^{N} x_n^2}$$

on both the ROS and drum stems. Note that the RMS was computed over fixed-length frames using a hop length of 512. This allowed us to verify whether drums occurred throughout the extracted stem, and not just in short sections at the beginning or end.
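A frame-wise RMS, the square root of the mean squared sample value within each frame, can be sketched as below. The hop length of 512 comes from the text; the frame length of 2048 is only an assumed placeholder, and the function name is illustrative.

```python
import numpy as np


def frame_rms(signal, frame_length=2048, hop_length=512):
    """RMS of each frame of a mono signal.

    `frame_length=2048` is an assumed placeholder value; `hop_length=512`
    follows the text. Frames are not padded, so the last partial frame is
    dropped.
    """
    signal = np.asarray(signal, dtype=float)
    n = 1 + max(0, len(signal) - frame_length) // hop_length
    return np.array([
        np.sqrt(np.mean(signal[i * hop_length:i * hop_length + frame_length] ** 2))
        for i in range(n)
    ])


constant = np.full(4096, 0.5)
rms_constant = frame_rms(constant)

# A stem that is silent in its first half and active in its second half.
half_silent = np.concatenate([np.zeros(4096), np.ones(4096)])
rms_half_silent = frame_rms(half_silent)
```

Looking at the per-frame values rather than a single global RMS is what makes it possible to tell a stem with drums throughout from one where they only appear briefly.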

Assuming RMS values were extracted for each stem, we then verified that the following inequality was verified, for :

Figure 7: VQT representations of our model inputs. We also provide VQT illustrations of the ground truth drum and ROS stems. As one can notice, the Spleeter-generated stems are similar to the ground truths. The drum stem does however seem to contain artefacts, especially at the higher-frequency level. We will come back to the role these artefacts may have played in our pretext task in subsequent chapters.

If the above was satisfied for over 30% of the values obtained for each stem, 5 seconds of the song’s audio satisfying Equation 5 were extracted and used in our pretext training set (Footnote 17: One final note: we chose to save our pretext data VQTs in memory before training our model. Our data processing can of course be done on-the-fly during training, but loading, re-sampling, and transforming audio to its VQT representation is a very expensive process time-wise.). Note that the process above was found to work very well in practice, and allowed us to obtain training set stems that were balanced, i.e. had nicely separated drum and ROS signals (Footnote 18: This may have contributed to a song selection bias. We will come back to this idea in Chapter 6.). A total of 35200 5-second stem pairs were generated from the FMA defferrard2016fma dataset in this way. This corresponds to almost 49 hours of audio. The VQT of each stem was then computed. Figure 7 illustrates our pretext task inputs (Footnote 19: We apologize for the size of the $x$ and $y$ axes for some figures in this chapter and the next. Unfortunately, many of these were generated during our experiments, and would take an experiment re-run to recreate.).

6.2 Batch Creation

Once all 35200 sample pairs were created, we created our batches by first randomly selecting an anchor (ROS stem VQT) and its corresponding positive (drum stem VQT). We then filled the rest of the batch using other randomly selected positives. For a batch size of 64, this corresponds to 62 other drum stems.

We split our data into a training set of size 28800 and a validation set of size 6400 (Footnote 20: We randomly selected anchors at each validation epoch. We acknowledge this may be bad practice, as the validation set batches were not constant throughout training. Regardless, our model exhibited the desired behaviour during the learning process.). Hence, during each epoch, 450 anchors were used for training and 100 for validation. The contrastive loss defined previously (Equation 3) was then computed over each batch and used to train the model.
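The batch construction and the resulting anchor counts can be sketched as follows. The sketch works on indices only: an index `i` stands for both the ROS VQT (anchor) and the drum VQT (positive) of pair `i`. The function name and the use of Python's `random` module are illustrative assumptions.

```python
import random


def make_batch(n_pairs, batch_size=64, rng=random):
    """Pick one anchor and the drum stems it is compared against.

    Returns the anchor index and a list of `batch_size - 1` drum-stem
    indices: the positive first, followed by `batch_size - 2` negatives
    drawn from other pairs.
    """
    anchor = rng.randrange(n_pairs)
    pool = rng.sample(range(n_pairs), batch_size - 1)
    negatives = [i for i in pool if i != anchor][: batch_size - 2]
    return anchor, [anchor] + negatives


TRAIN_PAIRS, VALID_PAIRS, BATCH_SIZE = 28800, 6400, 64
train_anchors_per_epoch = TRAIN_PAIRS // BATCH_SIZE
valid_anchors_per_epoch = VALID_PAIRS // BATCH_SIZE

anchor, drum_indices = make_batch(TRAIN_PAIRS, BATCH_SIZE, random.Random(0))
```

With a batch size of 64, each batch therefore holds one ROS anchor, its drum positive, and 62 drum negatives, and the split sizes give 450 training and 100 validation anchors per epoch.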

Note that we used Cosine Similarity as a similarity metric. For two vectors $u$ and $v$, it is defined by:

$$\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$$

Similar vectors have a Cosine Similarity close to 1, whereas vectors pointing in opposite directions have a Cosine Similarity close to -1.
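As a quick illustration of the metric, the sketch below computes the Cosine Similarity of two vectors with NumPy. The function name and the small `eps` guard against zero-norm vectors are our own additions for the example and are not taken from our training code.

```python
import numpy as np

def cosine_similarity(u, v, eps=1e-8):
    """Cosine of the angle between u and v (eps avoids division by zero)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    norm_product = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / max(norm_product, eps))

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0  (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
```

Note that for vectors with only non-negative entries, the dot product cannot be negative, so the similarity stays in the range 0 to 1.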

6.3 Experimental Setup

Our model and loss function were implemented using PyTorch NEURIPS2019_9015. We used the Adam optimizer kingma2014adam throughout training, with an initial learning rate that was divided by 2 every 200 epochs. (Footnote 21: Higher learning rates did not enable our model to train. Even worse, they often led to an exploding gradients problem.) We stopped training the model after 425 epochs. (Footnote 22: We initially planned to train our model for 600 epochs. Due to memory issues, training was stopped at 425 epochs. The model’s training plots exhibited the desired behaviour so we opted to stop any further training.) The model that performed the best on the validation set (i.e. that had the lowest average loss across each validation batch) was saved and re-used for our downstream tasks.
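The halving schedule can be written as a simple function of the epoch index. The sketch below is only illustrative: the initial learning rate is a placeholder in relative units, and we assume epochs are counted from 0. In PyTorch, `torch.optim.lr_scheduler.StepLR` with `step_size=200` and `gamma=0.5` expresses the same rule.

```python
def learning_rate(epoch, initial_lr, halve_every=200):
    """Learning rate at a given epoch when it is halved every `halve_every` epochs."""
    return initial_lr * 0.5 ** (epoch // halve_every)

initial_lr = 1.0  # placeholder: relative units, not the value used in our experiments
for epoch in (0, 199, 200, 399, 400, 424):
    print(epoch, learning_rate(epoch, initial_lr))
# 0 1.0
# 199 1.0
# 200 0.5
# 399 0.5
# 400 0.25
# 424 0.25
```

With training stopped after 425 epochs, the learning rate was therefore halved twice over the course of pre-training.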

7 Downstream Tasks

7.1 Data Processing

As outlined earlier, we used four datasets for our beat tracking task and three for our downbeat estimation task. The Ballroom gouyon2006experimental krebs2013rhythmic, Hainsworth hainsworth2004particle, and GTZAN marchand2015swing tzanetakis2002musical datasets were used for both downstream tasks, whereas the SMC holzapfel2012selective dataset was used for beat tracking only. (Footnote 23: Note that the duplicates previously identified in the Ballroom gouyon2006experimental krebs2013rhythmic dataset were removed.)

Figure 8: Beat Tracking Data Processing Pipeline. The first figure represents the input song’s waveform representation, the second its VQT representation, the third its beat tracking activation function, and the fourth its downbeat estimation activation function.

For each song in the datasets, we computed the VQT using the same resolution as the pretext task. It was used as input to our model. The target output of our model was an activation function with the same time resolution as our input VQT. We used the annotations provided by each dataset to determine the locations of each beat (i.e. each value of 1) on the activation function. Note that the time steps preceding and following each beat were annotated with a value of 0.5 to aid the model in its beat identification task.
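The construction of the target activation function can be sketched as follows. This is a minimal example under stated assumptions: `beat_activation` is a hypothetical helper, the frame rate of 10 frames per second is chosen only to keep the output short, and beat times are assumed to be given in seconds.

```python
import numpy as np

def beat_activation(beat_times, num_frames, frames_per_second):
    """Build a target vector with 1.0 on beat frames and 0.5 on their neighbours."""
    target = np.zeros(num_frames, dtype=np.float32)
    frames = np.round(np.asarray(beat_times, dtype=float) * frames_per_second).astype(int)
    frames = frames[(frames >= 0) & (frames < num_frames)]
    # Write the neighbours first so that a beat frame is never overwritten by 0.5.
    for offset in (-1, 1):
        neighbours = frames + offset
        neighbours = neighbours[(neighbours >= 0) & (neighbours < num_frames)]
        target[neighbours] = 0.5
    target[frames] = 1.0
    return target

target = beat_activation([0.2, 0.6], num_frames=10, frames_per_second=10)
print(target.tolist())
# [0.0, 0.5, 1.0, 0.5, 0.0, 0.5, 1.0, 0.5, 0.0, 0.0]
```

The same function applies to downbeat annotations, which yields the second target shown in Figure 8.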

Figure 8 outlines the data processing pipeline we used, with examples of both the input and target outputs of our model.

7.2 Experiments

Let us now introduce the experiments we conducted. For each of these, the model was initialized either randomly or with the pretext task’s pre-trained weights.

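Note that for every experiment, we computed the Binary Cross-Entropy (BCE) loss between the target and the model output. For targets $y$ and predictions $\hat{y}$ of length $N$, it is defined by:

$$\mathrm{BCE}(y, \hat{y}) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$$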
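For illustration, the BCE loss between a target activation and a model output can be computed as below. This sketch assumes the loss is averaged over time steps (the default reduction of PyTorch's `BCELoss`) and clips predictions away from 0 and 1 for numerical safety; both choices are assumptions made for the example.

```python
import numpy as np

def binary_cross_entropy(targets, predictions, eps=1e-7):
    """Mean BCE between targets in [0, 1] and predicted probabilities."""
    y = np.asarray(targets, dtype=float)
    # Clipping avoids log(0) when a prediction is exactly 0 or 1.
    p = np.clip(np.asarray(predictions, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

# An uninformative prediction of 0.5 everywhere gives a loss of ln(2).
print(round(binary_cross_entropy([1.0, 0.0], [0.5, 0.5]), 4))  # 0.6931
```

Because the targets include soft values of 0.5 around each beat, the loss accepts any target in the range 0 to 1 rather than only binary labels.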

dataset Vanilla Learning Pre-trained Learning
Ballroom gouyon2006experimental krebs2013rhythmic
- Beat
- Downbeat
- Joint
Hainsworth hainsworth2004particle
- Beat
- Downbeat
- Joint
GTZAN marchand2015swing tzanetakis2002musical
- Beat
- Downbeat
- Joint
SMC holzapfel2012selective
- Beat
Table 4: Experimental Setup Learning Rates

Table 4 outlines the learning rates used for the experiments described in the subsequent sub-sections. All our models were trained for a maximum of 50 epochs. These were then evaluated using the F1-score, AMLc, AMLt, CMLc, and CMLt evaluation metrics. The CMLc and CMLt metrics evaluate how continuously correct a beat tracking estimate is (they make use of the maximum length of correct predictions). The AMLt and AMLc metrics are similar but allow offbeat variations of an annotated beat sequence to be matched with detected beats. One can read more about each metric in davies2009evaluation.
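To make the F1-score concrete, the sketch below matches estimated beats to annotated beats within a tolerance window and derives precision, recall, and their harmonic mean. It is a simplified stand-in written for this example, not the mir_eval implementation; the 70 ms window is the value commonly used for beat tracking and is an assumption here, and the continuity metrics (CMLc, CMLt, AMLc, AMLt) are not covered.

```python
def beat_f_measure(reference, estimated, tolerance=0.07):
    """F-measure where each reference beat matches at most one estimate within `tolerance` seconds."""
    reference = sorted(reference)
    estimated = sorted(estimated)
    matched = 0
    j = 0
    for ref in reference:
        # Skip estimates that come too early to match this or any later reference beat.
        while j < len(estimated) and estimated[j] < ref - tolerance:
            j += 1
        if j < len(estimated) and abs(estimated[j] - ref) <= tolerance:
            matched += 1
            j += 1  # each estimate can be used only once
    if matched == 0:
        return 0.0
    precision = matched / len(estimated)
    recall = matched / len(reference)
    return 2 * precision * recall / (precision + recall)

annotated = [0.5, 1.0, 1.5, 2.0]
detected = [0.52, 1.0, 1.7, 2.05]
print(beat_f_measure(annotated, detected))  # 0.75
```

In this toy case three of the four detections fall within the window, so precision and recall are both 0.75.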

Pure Beat Tracking and Downbeat Estimation

The first, and simplest, experiments we conducted were aimed at training our model for either one of our downstream tasks. For each beat tracking and downbeat estimation dataset, we used 8-fold Cross Validation (CV) to evaluate our model. Each fold was used once as a test set. The rest of the data samples were used for training or validation. (Footnote 24: For each test fold, the validation set was comprised of randomly selected data samples from the other seven folds. Note that both the test and validation sets were of the same size.) For each experiment, we used the Adam kingma2014adam optimizer with the learning rates described in Table 4 and a batch size of 1. (Footnote 25: We used a batch size of 1 because most tracks did not have the same input size.) The model that achieved the highest mean F1-score on the validation set over 50 epochs was selected for evaluating the test set. This was done for each of the 8 test sets that compose a dataset. We evaluated every single dataset separately.
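The splitting scheme described above can be sketched as a generator over folds. This is an illustrative sketch rather than our experiment code: `cv_splits` is a hypothetical helper, the fold assignment is a random permutation, and the 80-track dataset in the usage example is made up.

```python
import numpy as np

def cv_splits(num_tracks, num_folds=8, seed=0):
    """Yield (train_idx, val_idx, test_idx) with each fold used once as the test set."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(num_tracks), num_folds)
    for k, test_idx in enumerate(folds):
        # Pool the other seven folds and shuffle them.
        rest = np.concatenate([f for i, f in enumerate(folds) if i != k])
        rest = rng.permutation(rest)
        # The validation set is drawn from the pool and has the size of the test set.
        val_idx, train_idx = rest[:len(test_idx)], rest[len(test_idx):]
        yield train_idx, val_idx, test_idx

for train_idx, val_idx, test_idx in cv_splits(80):
    print(len(train_idx), len(val_idx), len(test_idx))  # 60 10 10 (printed 8 times)
```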

The pre-trained DBN in krebs2015efficient was used to "pick" beats from our activation function for both beat tracking and downbeat estimation. Since it is tailored to beat tracking tasks, we used a different Beats-per-Minute (BPM) range for each of the two tasks. These ranges were found to be optimal for each task and every dataset. Note that all DBNs were provided by the madmom bock2016madmom Python library. All the evaluation metrics were computed using the mir_eval raffel2014mir_eval library and its default settings.

Finally, the learning rates used were generally in the same range. For larger datasets, we used the same learning rate for both vanilla and fine-tuning training. Models were found to train correctly in both instances. For smaller datasets, we often divided the learning rate by two for fine-tuning: pre-training was found to greatly accelerate learning, so the model would otherwise quickly overfit. The next chapter will cover this in depth.

Joint Estimation

In this experimental setting, our network’s goal is to learn beat and downbeat annotations in parallel. Beat tracking is usually an easier task, and can guide a downbeat estimation network. This form of learning usually greatly enhances a model’s downbeat estimation capabilities. We defined two networks with the same architecture described previously. One focused on beat tracking whilst the other focused on downbeat estimation. BCE loss was computed on each of the two networks’ outputs. The sum of both losses was then backpropagated to the networks.

We used the DBN defined in bock2016joint to process both activations simultaneously. (Footnote 26: Note that we limited our DBN’s beats-per-bar setting to 3 and 4, i.e. the DBN only models bars with 3 or 4 beats, for our joint estimation task.) This allowed our beat and downbeat outputs to be synchronized in time. At each epoch, we summed the mean F1-scores obtained by each model on the beat tracking and downbeat estimation validation set. The models that achieved the highest sum were selected for testing. The rest of the experimental setup was exactly the same as in the previous section.

Impact of Training Set Size on Learning Performance

In order to verify the impact of our pre-training on downstream performance, we studied the number of training examples needed to achieve a decent performance on the joint estimation task. We isolated an eighth of each one of our datasets as test sets. Another eighth was used for our validation sets. From there, a random subset of the remaining data was used as our training set. This subset corresponded to 1%, 2%, 5%, 10%, 20%, 50%, or 75% of the remaining training set. Table 9 outlines the size of the train sets for each dataset.

We ran this random selection and training process 10 times for each percentage of the train set used. The test set results were then averaged to determine whether pre-trained models needed less data to achieve higher performance.
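The train set sizes listed in Table 9 can be reproduced with a small calculation. The pool sizes below (the number of tracks left after removing the test and validation eighths) are not stated in the text: they are values we inferred for this illustration because, combined with round-half-up, they yield exactly the sizes in Table 9. The helper names are hypothetical.

```python
import numpy as np

PERCENTAGES = (1, 2, 5, 10, 20, 50, 75)
# Assumed pool sizes (inferred for illustration, not quoted from the text).
POOL_SIZES = {"Ballroom": 514, "Hainsworth": 166, "GTZAN": 750}

def subset_size(pool_size, percent):
    """Percentage of the pool, rounded half up, using integer arithmetic."""
    return (pool_size * percent + 50) // 100

def draw_subset(pool_size, percent, rng):
    """Randomly pick training indices for one run of the experiment."""
    return rng.choice(pool_size, size=subset_size(pool_size, percent), replace=False)

for name, pool in POOL_SIZES.items():
    print(name, [subset_size(pool, p) for p in PERCENTAGES])
# Ballroom [5, 10, 26, 51, 103, 257, 386]
# Hainsworth [2, 3, 8, 17, 33, 83, 125]
# GTZAN [8, 15, 38, 75, 150, 375, 563]

rng = np.random.default_rng(0)
subset = draw_subset(POOL_SIZES["Ballroom"], 1, rng)
print(len(subset))  # 5
```

Repeating `draw_subset` with 10 different random states corresponds to the 10 runs averaged for each percentage.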

Cross-dataset Generalization

The final experiment we conducted was centred around determining how our models generalized from one dataset to another. We trained our models on one of the GTZAN tzanetakis2002musical marchand2015swing or Ballroom gouyon2006experimental krebs2013rhythmic datasets, and tested them on the Hainsworth dataset hainsworth2004particle. (Footnote 27: 8-fold CV was also applied. Each fold was used as a validation set.) This was only done in a joint estimation setting.

8 Pretext Task

8.1 Training Behaviour

When training our neural network on our computer-generated data set, one can notice that the loss decreases slowly but surely. (Footnote 28: Do note that our model was tailored to perform well on the downstream tasks too. Popular deep neural networks, such as Residual Networks he2016deep, performed much better on our pretext task, but were not suited to our downstream tasks.) The mean batch loss does not, however, decrease drastically (it only decreases from an initial value of 4.1 to approximately 3.5).
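One way to put these loss values in perspective is to compare them with a chance-level baseline. The check below assumes that the contrastive loss behaves like a softmax cross-entropy over one positive and 62 negatives per anchor; this is an assumption made for the illustration, since the loss itself is defined in an earlier section and not restated here.

```python
import math

num_candidates = 1 + 62  # one positive plus 62 negatives per anchor

# Loss of a model that scores every candidate equally.
chance_loss = math.log(num_candidates)
print(round(chance_loss, 2))  # 4.14

# Under the same assumption, probability assigned to the positive at a loss of 3.5,
# compared with the uniform probability 1/63.
print(round(math.exp(-3.5), 3), round(1 / num_candidates, 3))  # 0.03 0.016
```

Under this reading, an initial loss of about 4.1 corresponds to a model that cannot yet tell the positive from the negatives, and a loss of 3.5 corresponds to roughly doubling the probability assigned to the correct drum stem.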

The evolution of the cosine similarity metric is, however, quite interesting. We obtain the behaviour we initially desired: anchors and positives whose Cosine Similarity increases towards 1, and anchors and negatives whose Cosine Similarity gradually decreases towards 0 (the mean similarity between anchors and negatives plateaus at around 0.2). (Footnote 29: Since our model’s output vectors are positive, the minimum possible similarity between our vectors is 0.) These values are averages over all the 550 batches present in our training and validation sets. (Footnote 30: Since our dataset contains more than one sample pair, achieving a perfect anchor/positive similarity of 1 and a perfect anchor/negative similarity of 0 is almost impossible.) More importantly, both sets exhibit similar behaviour, suggesting that our model is indeed capable of matching the correct drum and ROS stems throughout each batch.

Figure 9: Pretext Task Training Behaviour

8.2 Onset Function

When analyzing the vectorial representation learned by our model, one can immediately notice that it greatly resembles an onset function. For the most part, the vector’s values are close to 0. They do, however, "spike" during certain musical events. We evaluated our model on the MUSDB18 musdb18 data set to judge its performance. This data set is comprised of 150 songs, and each of their stems. We extracted 10 seconds’ worth of drum and ROS audio for each stem in order to gauge our pre-trained network’s performance on longer audio segments.

Figure 10: Successful Stem Match. The Cosine Similarity between the anchor and positive is 0.663.
Figure 11: Failed Stem Match. The Cosine Similarity between the anchor and positive is 0.101.

When computing the drum and ROS stem representations of these clean stems, we first noticed that the performance was somewhat lower. Successful matches had a lower Cosine Similarity than with Spleeter-generated spleeter2020 stems (which reached 0.7+), and a number of sample pairs had a similarity closer to 0.1. This suggests that Spleeter’s artifacts spleeter2020 may have aided the model’s matching of anchors and positives. (Footnote 31: Visually, it seems like Spleeter spleeter2020 mostly created spectral "holes" in the anchor’s VQT throughout the source separation process. These holes were usually drum locations and helped our model match ROS and drum stems. Note that this remains a hypothesis.)

Moreover, the representations learned by the model were found to resemble an onset function. This is due to the fact that output vectors are quite parsimonious (i.e. sparse). The observed "spikes" nevertheless seem to correlate nicely with rhythm. This is notably the case in Figure 10 (b), where the peaks seem to be aligned with the kick sounds (they are just shifted in time). Further work needs to be done in order to determine whether these onset functions can be used in a standalone fashion for tasks such as beat tracking or tempo estimation. (Footnote 32: By standalone, we refer to the idea that the onset functions would be used as the sole input to a MIR algorithm.) Figure 11 illustrates a failed stem match. In this case, the onset functions are much less interpretable.

9 Downstream Tasks

Let us now study our downstream task performance. The following sub-sections contain tables outlining our various results. These were compared to the state-of-the-art methods in the field. For each table, we report the mean and standard deviation of the test set performance (usually F1-measure). Standard deviation values were not provided by any of the other papers.

Pure Beat Tracking and Downbeat Estimation

When looking at our results (Table 5), a few elements stand out. First, our network’s performance on beat tracking tasks is quite good compared to other state-of-the-art methods, using both random and pre-trained initializations. This is especially the case for larger datasets such as GTZAN marchand2015swing tzanetakis2002musical and Ballroom gouyon2006experimental krebs2013rhythmic. This is not the case on the Hainsworth dataset hainsworth2004particle, however. The Hainsworth data set hainsworth2004particle is quite small, and benefits greatly from being trained alongside other data sets. Böck et al. matthewdavies2019temporal bock2016joint do so in their works, whereas we train a different model on each dataset (and for each fold). This most likely explains the performance gap observed.

Second, pre-training does not help with performance. In fact, in most cases, the network’s performance slightly worsens with pre-training. Note that we tried using smaller learning rates, larger drop-out, and frozen layers to no avail. We believe this is due to the fact that pre-training led our network to overfit more easily. Figures 19, 20, and 21 in the appendix illustrate this idea. When the network is pre-trained, we observe that the validation set’s F1-score is higher during the first few epochs. This most likely led our network to fit the training set too quickly, and by extension to generalize less well to unseen data.

Finally, the model’s performance on the pure downbeat estimation task is extremely poor compared to the state-of-the-art today (it is still quite good compared to previous methods). Downbeat estimation is a very complex task which benefits greatly from knowing a song’s beat annotations. Our results in the next subsection demonstrate this idea. Tables 5 and 6 outline the mean test set performance using 8-fold CV for each dataset.

Dataset F1 CMLc CMLt AMLc AMLt
Ballroom gouyon2006experimental krebs2013rhythmic
- Vanilla 0.933 (0.011) 0.865 (0.019) 0.884 (0.020) 0.908 (0.008) 0.929 (0.008)
- Pre-trained 0.920 (0.011) 0.854 (0.022) 0.872 (0.021) 0.896 (0.015) 0.916 (0.014)
- TCN matthewdavies2019temporal 0.933 0.864 0.881 0.909 0.929
Hainsworth hainsworth2004particle
- Vanilla 0.753 (0.029) 0.556 (0.057) 0.627 (0.051) 0.752 (0.078) 0.849 (0.063)
- Pre-trained 0.757 (0.041) 0.533 (0.083) 0.600 (0.088) 0.748 (0.040) 0.845 (0.030)
- TCN matthewdavies2019temporal 0.874 0.755 0.795 0.882 0.930
GTZAN marchand2015swing tzanetakis2002musical
- Vanilla 0.862 (0.022) 0.748 (0.045) 0.771 (0.039) 0.866 (0.032) 0.899 (0.024)
- Pre-trained 0.859 (0.019) 0.737 (0.035) 0.760 (0.032) 0.876 (0.028) 0.906 (0.027)
- TCN matthewdavies2019temporal 0.843 0.695 0.715 0.889 0.914
SMC holzapfel2012selective
- Vanilla 0.528 (0.027) 0.346 (0.062) 0.452 (0.073) 0.473 (0.018) 0.628 (0.030)
- Pre-trained 0.526 (0.057) 0.337 (0.084) 0.451 (0.092) 0.447 (0.081) 0.610 (0.080)
- TCN matthewdavies2019temporal 0.543 0.315 0.432 0.462 0.632
Table 5: Pure Beat Tracking Results. We compare our results with those in matthewdavies2019temporal. Like us, they make use of a CNN and DBN to obtain their results in a supervised fashion. The performance on larger datasets, such as GTZAN marchand2015swing tzanetakis2002musical and Ballroom, matches or outperforms matthewdavies2019temporal. This is not the case for the Hainsworth hainsworth2004particle and SMC holzapfel2012selective datasets. Also, the vanilla model often outperforms the pre-trained model. The 8-fold CV standard deviation is reported in parentheses.
Dataset F1 CMLc CMLt AMLc AMLt
Ballroom gouyon2006experimental krebs2013rhythmic
- Vanilla 0.570 (0.024) 0.090 (0.036) 0.090 (0.036) 0.597 (0.035) 0.603 (0.033)
- Pre-trained 0.557 (0.013) 0.090 (0.020) 0.090 (0.020) 0.583 (0.045) 0.588 (0.043)
- Joint RNN bock2016joint 0.863 NA NA NA NA
Hainsworth hainsworth2004particle
- Vanilla 0.481 (0.079) 0.294 (0.097) 0.302 (0.101) 0.643 (0.108) 0.663 (0.101)
- Pre-trained 0.492 (0.063) 0.276 (0.069) 0.284 (0.070) 0.685 (0.075) 0.701 (0.076)
- Joint RNN bock2016joint 0.684 NA NA NA NA
GTZAN marchand2015swing tzanetakis2002musical
- Vanilla 0.460 (0.008) 0.021 (0.007) 0.022 (0.007) 0.433 (0.029) 0.442 (0.026)
- Pre-trained 0.445 (0.017) 0.015 (0.008) 0.016 (0.009) 0.420 (0.033) 0.429 (0.033)
- Joint RNN bock2016joint 0.640 NA NA NA NA
Table 6: Pure Downbeat Estimation Results. We compare these results with bock2016joint. Although this paper also makes use of a CNN and DBN, the RNN is trained in parallel with a beat tracking network. It therefore significantly outperforms our network, which was trained purely for the downbeat estimation task. Again, we observe that pre-training our network did not lead to a significant performance gain on any of our datasets.
Joint Estimation

In our joint estimation setup, the effects of pre-training were more noticeable. This is most likely due to the fact that two networks were initialized with pre-trained weights. Whilst vanilla learning struggled with training both networks simultaneously (especially during the first 5-10 epochs), our pretext task training allowed the model to achieve better results more quickly. On the Ballroom and GTZAN datasets, pre-training our network led to better mean test set performance for both downbeat estimation and beat tracking; this was not the case on the Hainsworth dataset. Do however note that the obtained results are not yet on par with state-of-the-art joint estimation methods. Tables 7 and 8 contain the experiment’s results and commentary.

One should also notice how much better the downbeat estimation results are. In the pure downbeat estimation task, our model never achieved a mean F1-score above 0.570 on the Ballroom gouyon2006experimental krebs2013rhythmic dataset, 0.492 on the Hainsworth dataset hainsworth2004particle, and 0.460 on the GTZAN dataset tzanetakis2002musical marchand2015swing. In the joint setup, we obtain maximum scores of 0.822, 0.517, and 0.613. The model in both experiments does not change. However, the joint training method allows it to learn much better.

Dataset F1 CMLc CMLt AMLc AMLt
Ballroom gouyon2006experimental krebs2013rhythmic
- Vanilla 0.885 (0.049) 0.744 (0.097) 0.769 (0.102) 0.861 (0.058) 0.889 (0.059)
- Pre-trained 0.909 (0.023) 0.795 (0.059) 0.826 (0.064) 0.872 (0.019) 0.905 (0.021)
- Joint RNN bock2016joint 0.938 NA NA NA NA
Hainsworth hainsworth2004particle
- Vanilla 0.763 (0.062) 0.579 (0.074) 0.657 (0.077) 0.717 (0.087) 0.818 (0.078)
- Pre-trained 0.750 (0.052) 0.541 (0.073) 0.626 (0.071) 0.692 (0.056) 0.798 (0.050)
- Joint RNN bock2016joint 0.867 NA NA NA NA
GTZAN marchand2015swing tzanetakis2002musical
- Vanilla 0.829 (0.017) 0.661 (0.044) 0.690 (0.038) 0.833 (0.026) 0.870 (0.020)
- Pre-trained 0.831 (0.022) 0.675 (0.037) 0.702 (0.031) 0.842 (0.030) 0.877 (0.022)
- Joint RNN bock2016joint 0.856 NA NA NA NA
Table 7: Joint Estimation Beat Tracking Results. We observe that for larger datasets, pre-training was actually quite beneficial, and led to a slight increase in beat tracking performance compared to purely supervised, vanilla training. This was not the case for the smaller, Hainsworth hainsworth2004particle dataset. The results we obtain in this joint setup are also poorer than those obtained in the pure beat tracking experiment. They also are not up-to-par with the RNN methodology presented in bock2016joint (the other metrics were never presented in bock2016joint).

Overall, for both beat tracking and downbeat estimation, the pure and joint experiments did not indicate that our pre-training method was beneficial for test set performance. Generally, vanilla and pre-trained network performances were within a standard deviation of each other. The training plots in Figures 19, 20, and 21 (appendix) did however show that pre-training our models led to faster downstream training. This inspired the next experiment, in which we limited the amount of training data our network was exposed to. We then validated and tested it on a constant 25% of each dataset. (Footnote 34: Experiments remain separate for each dataset.) In this setting, pre-training was found to be quite beneficial.
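The statement that vanilla and pre-trained performances generally lie within a standard deviation of each other can be checked against the mean F1-scores reported in Tables 5 to 8. In the sketch below, the numbers are copied from those tables; comparing the gap with the larger of the two standard deviations is our own choice of criterion for the illustration.

```python
# (vanilla mean, vanilla std, pre-trained mean, pre-trained std) for the F1-score.
F1_RESULTS = {
    ("pure", "Ballroom", "beat"): (0.933, 0.011, 0.920, 0.011),
    ("pure", "Hainsworth", "beat"): (0.753, 0.029, 0.757, 0.041),
    ("pure", "GTZAN", "beat"): (0.862, 0.022, 0.859, 0.019),
    ("pure", "SMC", "beat"): (0.528, 0.027, 0.526, 0.057),
    ("pure", "Ballroom", "downbeat"): (0.570, 0.024, 0.557, 0.013),
    ("pure", "Hainsworth", "downbeat"): (0.481, 0.079, 0.492, 0.063),
    ("pure", "GTZAN", "downbeat"): (0.460, 0.008, 0.445, 0.017),
    ("joint", "Ballroom", "beat"): (0.885, 0.049, 0.909, 0.023),
    ("joint", "Hainsworth", "beat"): (0.763, 0.062, 0.750, 0.052),
    ("joint", "GTZAN", "beat"): (0.829, 0.017, 0.831, 0.022),
    ("joint", "Ballroom", "downbeat"): (0.806, 0.062, 0.822, 0.040),
    ("joint", "Hainsworth", "downbeat"): (0.517, 0.083, 0.501, 0.062),
    ("joint", "GTZAN", "downbeat"): (0.612, 0.043, 0.613, 0.034),
}

def within_one_std(v_mean, v_std, p_mean, p_std):
    """True if the gap between means is at most the larger standard deviation."""
    return abs(v_mean - p_mean) <= max(v_std, p_std)

exceptions = [key for key, row in F1_RESULTS.items() if not within_one_std(*row)]
print(exceptions)  # [('pure', 'Ballroom', 'beat')]
```

With this criterion, the only gap larger than one standard deviation is the pure beat tracking result on Ballroom, where the vanilla model is ahead.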

Dataset F1 CMLc CMLt AMLc AMLt
Ballroom gouyon2006experimental krebs2013rhythmic
- Vanilla 0.806 (0.062) 0.710 (0.107) 0.712 (0.109) 0.875 (0.050) 0.877 (0.051)
- Pre-trained 0.822 (0.040) 0.767 (0.069) 0.768 (0.069) 0.879 (0.038) 0.881 (0.039)
- Joint RNN bock2016joint 0.863 NA NA NA NA
Hainsworth hainsworth2004particle
- Vanilla 0.517 (0.083) 0.452 (0.084) 0.454 (0.083) 0.705 (0.083) 0.712 (0.081)
- Pre-trained 0.501 (0.062) 0.429 (0.097) 0.432 (0.097) 0.692 (0.066) 0.697 (0.066)
- Joint RNN bock2016joint 0.684 NA NA NA NA
GTZAN marchand2015swing tzanetakis2002musical
- Vanilla 0.612 (0.043) 0.527 (0.050) 0.528 (0.049) 0.796 (0.039) 0.799 (0.039)
- Pre-trained 0.613 (0.034) 0.536 (0.044) 0.537 (0.044) 0.806 (0.031) 0.808 (0.031)
- Joint RNN bock2016joint 0.640 NA NA NA NA
Table 8: Joint Estimation Downbeat Estimation Results. We observe the same behaviour as we do with beat tracking.
Impact of Training Set Size on Learning Performance

Our pre-trained model was found to outperform our vanilla model in a limited-data setting. Table 9 displays our results. As one can see, when the number of training samples is extremely low (i.e. under 10), the pre-trained model significantly outperforms the vanilla model. For beat tracking, the difference is only significant when the number of training samples is very low. In the Ballroom gouyon2006experimental krebs2013rhythmic dataset, pre-trained and vanilla models average a similar F1-score at around 26 samples, or only 5% of the training set. The results progress in a similar fashion as the number of training samples increases. On the other hand, when only 1% of the training set is used (5 samples), the difference in performance is huge. The vanilla model averages an F1-score of 0.281, whereas the pre-trained model averages an F1-score of 0.694. The trend is similar for both the Hainsworth hainsworth2004particle and GTZAN tzanetakis2002musical marchand2015swing datasets. After about 15 to 20 training samples, pre-training seems to become insignificant.

For downbeat estimation, the story is similar. The task does however seem to require a bit more data. In general, model performance is similar after using approximately 10-20% of the training set. Table 9 contains all our results for this experiment. Overall, we believe these results to be encouraging. For a few-shot learning task related to musical rhythm, perhaps our pre-trained model initialization could learn more quickly, or even adapt in real-time. It has, after all, shown an ability to learn using very few examples for beat tracking and downbeat estimation tasks.

Dataset 1% 2% 5% 10% 20% 50% 75%
Ballroom gouyon2006experimental krebs2013rhythmic
- Train Set Size 5 10 26 51 103 257 386
- Vanilla Beat F1 0.281 (0.008) 0.677 (0.098) 0.739 (0.100) 0.786 (0.045) 0.860 (0.014) 0.907 (0.018) 0.927 (0.021)
- Pretrain Beat F1 0.694 (0.038) 0.737 (0.027) 0.727 (0.066) 0.775 (0.060) 0.825 (0.025) 0.877 (0.020) 0.908 (0.015)
- Vanilla Down F1 0.061 (0.007) 0.353 (0.148) 0.536 (0.049) 0.576 (0.050) 0.701 (0.015) 0.776 (0.038) 0.839 (0.028)
- Pretrain Down F1 0.423 (0.045) 0.476 (0.038) 0.506 (0.074) 0.579 (0.075) 0.644 (0.052) 0.741 (0.052) 0.802 (0.031)
Hainsworth hainsworth2004particle
- Train Set Size 2 3 8 17 33 83 125
- Vanilla Beat F1 0.273 (0.013) 0.278 (0.016) 0.437 (0.141) 0.606 (0.033) 0.631 (0.014) 0.655 (0.093) 0.717 (0.020)
- Pretrain Beat F1 0.489 (0.054) 0.498 (0.113) 0.588 (0.040) 0.612 (0.025) 0.597 (0.032) 0.695 (0.020) 0.708 (0.026)
- Vanilla Down F1 0.063 (0.009) 0.067 (0.005) 0.074 (0.036) 0.226 (0.078) 0.279 (0.072) 0.376 (0.075) 0.462 (0.038)
- Pretrain Down F1 0.167 (0.078) 0.208 (0.075) 0.245 (0.049) 0.303 (0.056) 0.307 (0.082) 0.389 (0.065) 0.434 (0.056)
GTZAN marchand2015swing tzanetakis2002musical
- Train Set Size 8 15 38 75 150 375 563
- Vanilla Beat F1 0.495 (0.125) 0.739 (0.024) 0.803 (0.016) 0.784 (0.027) 0.814 (0.011) 0.811 (0.010) 0.819 (0.010)
- Pretrain Beat F1 0.656 (0.057) 0.701 (0.054) 0.741 (0.040) 0.783 (0.020) 0.803 (0.019) 0.820 (0.009) 0.831 (0.015)
- Vanilla Down F1 0.089 (0.089) 0.135 (0.022) 0.430 (0.026) 0.470 (0.044) 0.506 (0.026) 0.537 (0.027) 0.580 (0.016)
- Pretrain Down F1 0.325 (0.048) 0.380 (0.043) 0.424 (0.044) 0.482 (0.049) 0.515 (0.033) 0.572 (0.030) 0.573 (0.019)
Table 9: Impact of Training Set Size on Learning Performance. For each training set percentage, we randomly select a subset of the training set for model training. The validation and test sets are constant for each dataset (each 12.5% of the dataset). We report the mean and standard deviation of 10 experiments for each percentage.
Cross-dataset Generalization

Training Dataset F1 CMLc CMLt AMLc AMLt
Ballroom gouyon2006experimental krebs2013rhythmic
- Vanilla Beat F1 0.715 (0.013) 0.475 (0.021) 0.569 (0.024) 0.616 (0.017) 0.760 (0.017)
- Pretrain Beat F1 0.699 (0.011) 0.447 (0.019) 0.545 (0.022) 0.571 (0.011) 0.714 (0.016)
- Vanilla Down F1 0.478 (0.025) 0.403 (0.028) 0.412 (0.026) 0.641 (0.018) 0.656 (0.017)
- Pretrain Down F1 0.468 (0.015) 0.391 (0.023) 0.401 (0.025) 0.619 (0.017) 0.641 (0.017)
GTZAN marchand2015swing tzanetakis2002musical
- Vanilla Beat F1 0.759 (0.014) 0.568 (0.030) 0.690 (0.030) 0.672 (0.018) 0.820 (0.011)
- Pretrain Beat F1 0.756 (0.006) 0.576 (0.016) 0.682 (0.011) 0.695 (0.019) 0.830 (0.010)
- Vanilla Down F1 0.529 (0.012) 0.494 (0.017) 0.501 (0.017) 0.730 (0.016) 0.741 (0.016)
- Pretrain Down F1 0.534 (0.013) 0.505 (0.011) 0.509 (0.012) 0.734 (0.020) 0.742 (0.019)
Table 10: Hainsworth hainsworth2004particle Mean Test Set Results. Each model was trained on 7/8ths of the training data set and validated on the last fold. We report the mean and standard deviation for each evaluation metric on beat tracking and downbeat estimation. The experiment was conducted in a joint estimation setting using 8-fold CV on each train set.

Finally, we did not observe that pre-training helped our model generalize better to new data. Table 10 displays these results. When trained on the Ballroom gouyon2006experimental krebs2013rhythmic data set, our pre-trained model performed worse than our vanilla model on the Hainsworth data set hainsworth2004particle. For beat tracking, the vanilla model averaged an F1-score of 0.715 whereas the pre-trained model averaged a score of 0.699 (for downbeat estimation, we obtain scores of 0.478 and 0.468). When trained on the GTZAN tzanetakis2002musical marchand2015swing dataset, we found the pre-trained model to perform slightly better than the vanilla one. In both cases, however, the evaluation metrics are quite similar. The lack of improvement is most likely due to our pre-trained model overfitting to the training set once again.

Appendix A Chapter 3

a.1 Audio Representation Learning

a.1.1 Audio-Visual Correspondence
Figure 12: Look, Listen, and Learn Audio-Visual Correspondence Architecture arandjelovic2017look
Figure 13: The Sound of Pixels zhao2018sound Audio Source Separation Pipeline
Figure 14: The Sound of Pixels zhao2018sound Sample Sound Localization Results
a.1.2 Triplet-Based Metric Learning
Figure 15: Vocal and Background Track Mashup Pipeline lee2019learning
a.1.3 Self-Supervised Learning
Figure 16: Overview of the Self-Supervised Learning Tasks Introduced by Tagliasacchi et al. tagliasacchi2019self tagliasacchi2020pre. Note that in the tasks that contain multiple inputs, all inputs are passed to the same encoder. Its outputs are then concatenated and either passed to a decoder (in CBoW) or a regular feedforward network (in temporal gap) for further processing.
a.1.4 Contrastive Learning
Figure 17: Overview of Contrastive Learning Applied to Audio saeed2021contrastive
Figure 18: Overview of Contrastive Learning of Musical Representations spijkervet2021contrastive

Appendix B Chapter 6

b.1 Downstream Tasks

Figure 19: Ballroom gouyon2006experimental krebs2013rhythmic F1-score on Train and Validation Sets
Figure 20: Hainsworth hainsworth2004particle F1-score on Train and Validation Sets
Figure 21: GTZAN tzanetakis2002musical marchand2015swing F1-score on Train and Validation Sets