Although there are many different ways to describe musical tempo (e.g., measures per minute, bars per minute, or even a range of Italian terms), beats per minute (BPM) is the most commonly used measurement unit. The estimation of BPM plays an important role in a variety of applications, such as music recommendation, automatic accompaniment, playlist generation, etc. Because of its utility, the automatic estimation of tempo has been an important task and received continuous attention in the field of music information retrieval (MIR) [goto1994beat, scheirer1998tempo, gouyon2006experimental, schreiber2020data].
Traditional methods for automatic tempo estimation are usually based on hand-crafted signal processing. To estimate the tempo of a given audio segment, an onset strength signal (OSS) function is first derived, and the frequency of the major pulses is then extracted and converted to BPM. The OSS function is a function whose peaks should correspond to onset times. Various techniques have been used for this purpose, such as auto-correlation [dixon2001automatic, alonso2006accurate], comb filters [scheirer1998tempo, klapuri2005analysis], and Fourier analysis [cemgil2000tempo]. Machine learning techniques have also been applied [peeters2012perceptual], including support vector machines (SVM) [gkiokas2012reducing, percival2014streamlined], k-nearest neighbors (k-NN) [wu2014supervised, wu2015musical], and so on [schreiber2017post]. Since the work of Böck [bock2015accurate], neural networks have also been adopted in this pipeline [gkiokas2017convolutional, bock2019multi, bock2020deconstruct].
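As a rough illustration of the traditional pipeline described above, the following sketch estimates BPM from an onset strength signal by auto-correlation. It is a minimal toy under stated assumptions, not any cited author's method: the function name `estimate_bpm_autocorr`, the frame rate, and the synthetic OSS are all hypothetical.

```python
import math

def estimate_bpm_autocorr(oss, frame_rate, bpm_min=30.0, bpm_max=285.0):
    """Pick the autocorrelation lag with the highest score and convert it to BPM."""
    n = len(oss)
    mean = sum(oss) / n
    centered = [v - mean for v in oss]
    # Lags (in frames) that correspond to the allowed BPM range.
    lag_min = max(1, int(math.ceil(60.0 * frame_rate / bpm_max)))
    lag_max = min(n - 1, int(math.floor(60.0 * frame_rate / bpm_min)))
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        score = sum(centered[i] * centered[i + lag] for i in range(n - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return 60.0 * frame_rate / best_lag

# Synthetic OSS (assumption): one onset every 50 frames at 100 frames per second.
frame_rate = 100.0
oss = [1.0 if i % 50 == 0 else 0.0 for i in range(1000)]
bpm = estimate_bpm_autocorr(oss, frame_rate)
```

On real signals the strongest lag is often a multiple or a fraction of the true beat period, which is one source of the octave errors discussed in this paper.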
In all the methods mentioned above, the extraction of BPM depends on some post-processing of OSS functions or beat activation functions. Only in recent years have single-step tempo estimation systems based on deep neural networks (DNN) appeared. As the first single-step approach for tempo estimation, the CNN model proposed by Schreiber [schreiber2018single] is capable of extracting the BPM value directly from a Mel-scaled spectrogram. That work showed classification to be an effective formulation of tempo estimation. Adopting a similar idea, Foroughmand [foroughmand2019deep] proposed the Harmonic-Constant-Q-Modulation (HCQM), a new representation of the audio signal, as the input of a relatively simple CNN classification model. The experimental results also showed its effectiveness.
A commonly used metric in tempo estimation is Accuracy1 [gouyon2006experimental], the percentage of estimates that match the reference tempo within a ±4% tolerance. However, automatic tempo estimation systems tend to predict a tempo that is wrong by a factor of 2 or 3, known as octave errors. As an additional measure, Accuracy2 is introduced, which ignores octave errors. In some application scenarios (such as DJ software), accurate tempo annotations are mandatory and octave errors are unacceptable [gartner2013tempo], but the performance of most existing algorithms on Accuracy1 is still far from satisfactory.
Previous works [schreiber2018single, foroughmand2019deep] have shown the potential of CNN-based single-step approaches to improve performance on Accuracy1. Following the success of these methods, in this paper we propose a CNN-based single-step model named Multi-scale Grouped Attention Network (MGANet). A multi-scale network architecture is designed to aggregate information from different scales and produce superior feature representations. Furthermore, a Grouped Attention Module (GAModule) is proposed to capture long-range dependencies and refine the features based on the attention mechanism.
The remainder of this paper is organized as follows. In Section 2, we introduce the proposed method in detail. In Section 3, experimental results are presented to show the effectiveness of our method. Finally, we conclude in Section 4.
2.1 Proposed Model
As in [schreiber2018single] and [foroughmand2019deep], we treat tempo estimation as a classification problem. The output of our model is a probability distribution over 256 BPM classes (from 30 to 285 BPM). Because the Mel scale closely matches human auditory perception, we choose the Mel-scaled spectrogram as the raw feature. First, the original audio data is resampled to 11.025 kHz. Then, we use half-overlapping windows of 1,024 samples, and transform each window into an 81-band Mel-scaled magnitude spectrum. The input of the proposed model is a spectrogram segment of 128 frames, roughly 6 seconds long.
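The framing figures above can be checked with a few lines of arithmetic. The sketch below is illustrative only: it reads the window length as 1,024 samples, and the helper names and the rounding of BPM to the nearest integer class are assumptions rather than details stated in the paper.

```python
SAMPLE_RATE = 11025      # Hz, after resampling
WINDOW = 1024            # window length, read here as samples
HOP = WINDOW // 2        # half-overlapping windows
SEGMENT_FRAMES = 128     # spectrogram frames per model input
BPM_MIN, NUM_CLASSES = 30, 256

def segment_seconds(frames=SEGMENT_FRAMES):
    """Approximate duration covered by `frames` hops."""
    return frames * HOP / SAMPLE_RATE

def bpm_to_class(bpm):
    """Map a BPM value to a class index (nearest-integer rounding is assumed)."""
    idx = int(round(bpm)) - BPM_MIN
    if not 0 <= idx < NUM_CLASSES:
        raise ValueError("BPM outside the 30-285 range")
    return idx

def class_to_bpm(idx):
    """Inverse mapping from class index back to integer BPM."""
    return idx + BPM_MIN
```

With these values, `segment_seconds()` comes to about 5.94 s, which matches the "roughly 6 seconds" quoted for a 128-frame segment.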
In the rest of this section, we first present the overall architecture of the proposed MGANet. Then, we introduce the GAModule, which is the key component of the network.
2.1.1 Multi-scale Network Architecture
The goal of tempo estimation is to extract a periodic pattern from an audio signal. Therefore, global information of the input spectrogram is particularly important. In a CNN, extraction of the overall pattern is usually achieved by stacking multiple layers. But directly repeating convolution layers makes the model difficult to design and optimize. Another way is to use large convolution kernels to enlarge the receptive fields. However, this is also costly because of the increase in parameters and multiply-add operations. To solve the problem, we introduce the idea of a multi-scale structure, which has proved effective in many classification tasks [huang2018multi, wang2020deep, adegun2020fcn]. By downsampling and upsampling the features to different scales and exchanging information repeatedly, high-level representations can be derived after just a few layers.
As shown in Figure 1, the overall architecture of MGANet is mainly composed of three branches for different scales. In each branch, the input features are gradually downsampled over the frequency (vertical) axis, while the resolution on the time (horizontal) axis is maintained through the whole process. Furthermore, the feature maps from different scales are merged repeatedly to integrate contextual information, leading to high-level representations amenable to classification.
Specifically, the input spectrogram is first downsampled by 1/2 and 1/4 over the time axis with average pooling, resulting in three representations of sizes (81, 128), (81, 64), and (81, 32). Then, the representations are fed into three parallel branches respectively to perform feature processing. The processing is mainly done by the proposed GAModule described in Section 2.1.2. Throughout the whole structure, we repeat multi-scale fusion by rescaling and concatenation. Average pooling and transposed convolution [xiao2018simple] layers are used to perform rescaling. Each concatenation is followed by a convolution layer with the exponential linear unit (ELU) [clevert2015fast] activation to adjust the channel number.
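The three input scales can be reproduced with plain average pooling along the time axis. This is a minimal sketch on nested lists with a dummy input; `avg_pool_time` and `shape` are hypothetical helper names, not part of the proposed model's code.

```python
def avg_pool_time(spec, factor):
    """Average-pool a (bands x frames) nested list along the time axis."""
    pooled = []
    for row in spec:
        usable = len(row) - len(row) % factor   # drop a ragged tail, if any
        pooled.append([
            sum(row[t:t + factor]) / factor
            for t in range(0, usable, factor)
        ])
    return pooled

def shape(spec):
    """Return (bands, frames) of a nested-list spectrogram."""
    return (len(spec), len(spec[0]))

# Dummy 81-band, 128-frame input standing in for a Mel spectrogram segment.
spec = [[float(b + t) for t in range(128)] for b in range(81)]
scales = [spec, avg_pool_time(spec, 2), avg_pool_time(spec, 4)]
shapes = [shape(s) for s in scales]
```

The resulting shapes are (81, 128), (81, 64), and (81, 32), matching the sizes of the three branch inputs.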
Processed by GAModules, the features are gradually downsampled over the frequency axis to summarize frequency bands, making periodicity easier to detect in the representations. On each branch, the downsampling is repeated four times. Accordingly, the channel numbers of the features are increased. After the above processes, three feature maps with shapes (1, 128, 128), (1, 64, 128), and (1, 32, 128) are obtained. Then, these feature maps are fused again and fed into a convolution layer to adjust the channel numbers to 256. After global average pooling, three vectors of length 256 are concatenated together. Finally, a fully connected layer takes the vector as input and a softmax layer is used to derive the probability distribution over the 256 tempo classes.
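The shape bookkeeping in this paragraph can be traced with a small sketch. The per-step pooling factor is not recoverable from the text here, so the factor of 3 below is purely an assumption, chosen because 81 = 3^4 reaches a frequency size of 1 after four steps; the function names are hypothetical.

```python
def frequency_sizes(bands=81, factor=3, steps=4):
    """Frequency-axis size after each downsampling step (factor is an assumption)."""
    sizes = [bands]
    for _ in range(steps):
        sizes.append(sizes[-1] // factor)
    return sizes

def head_input_length(num_branches=3, channels=256):
    """Length of the vector fed to the fully connected layer after global
    average pooling and concatenation across the branches."""
    return num_branches * channels

sizes = frequency_sizes()
fc_in = head_input_length()
```

Under this assumption the frequency axis shrinks as 81, 27, 9, 3, 1, and the fully connected layer receives a vector of length 768.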
2.1.2 Grouped Attention Module
The proposed GAModule structure is shown in Figure 2. The module consists of two parts: a trunk branch performing feature processing, and attention branches producing an attention mask to capture global context information and recalibrate the output feature map.
The structure of the attention branch is mainly inspired by the global context network (GCNet) [cao2019gcnet], which is designed for long-range dependency modeling through an attention mechanism. The attention mechanism emphasizes the most informative feature expressions and suppresses the less useful ones. Recently, the benefits of the attention mechanism have been demonstrated in a series of tasks. We introduce the attention mechanism into GAModule mainly for two purposes: 1) to model the long-range dependencies and obtain global context features; 2) to reweight the importance of different channels and improve the representational capacity of the refined feature.
Unlike images in the field of computer vision, the two axes of an audio spectrogram have different meanings: they represent frequency and time respectively. Furthermore, it is known that different musical instruments have different frequency ranges, and different frequency ranges have a different impact on the total sound. These facts indicate that different frequency bands contain relatively independent information. Based on these observations, we believe that it is inappropriate to aggregate the whole spatial scope at once to calculate long-range dependencies. Instead, different frequency positions of the feature should be handled separately, which helps to filter the useful information more efficiently. Therefore, unlike traditional channel-wise attention models that aggregate the entire feature to generate one attention map (e.g., squeeze-and-excitation networks [hu2018squeeze]), we divide the feature equally into groups along the frequency axis and send each fragment into an independent attention branch. We term this operation grouped channel attention.
As shown in Figure 2, the framework of the attention branch is roughly the same as the GC block in GCNet. First, the feature map is squeezed into a channel descriptor by global attention pooling, which is achieved by convolution, softmax, and matrix multiplication. For an input feature map, each position receives a weight given by a softmax, taken over all possible positions, of a linear transformation of the feature at that position; the descriptor is the sum of the features weighted in this way. We adopt ELU as the activation of the convolution layer to further increase robustness. After the pooling, global spatial information is gathered in the descriptor. Then, a two-layer bottleneck is formed to transform the information. We adopt a reduction ratio of 4 and ELU activation in the first layer. A sigmoid function is then applied to rescale the transformation output. Finally, one attention map is obtained from each attention branch. We concatenate these attention maps along the frequency axis to get the output attention map.
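To make the pooling step concrete, the sketch below assumes it follows the GC block formulation of [cao2019gcnet], i.e. a descriptor $z = \sum_j \alpha_j x_j$ with $\alpha_j = \exp(W_k x_j) / \sum_m \exp(W_k x_m)$, applied separately to each frequency group. The symbols, the function names, and the toy feature map are our assumptions; the ELU activation and the two-layer bottleneck with sigmoid are omitted for brevity.

```python
import math

def attention_pool(positions, w_k):
    """Softmax-weighted sum of feature vectors (global attention pooling).

    positions: list of feature vectors x_j, each of length C
    w_k:       length-C weights standing in for the convolution
    """
    logits = [sum(w * v for w, v in zip(w_k, x)) for x in positions]
    peak = max(logits)                      # subtract the max for stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    channels = len(positions[0])
    return [
        sum(e / total * x[c] for e, x in zip(exps, positions))
        for c in range(channels)
    ]

def grouped_descriptors(feature, group_weights):
    """Split feature[f][t] (a length-C vector) along frequency and pool each
    group with its own weights, one independent branch per group."""
    num_groups = len(group_weights)
    rows_per_group = len(feature) // num_groups
    descriptors = []
    for g in range(num_groups):
        rows = feature[g * rows_per_group:(g + 1) * rows_per_group]
        positions = [vec for row in rows for vec in row]
        descriptors.append(attention_pool(positions, group_weights[g]))
    return descriptors

# Toy feature map: 4 frequency rows, 2 frames, 2 channels; zero weights give
# uniform attention, so each descriptor is the plain mean of its group.
feature = [[[float(f), 1.0] for _ in range(2)] for f in range(4)]
desc = grouped_descriptors(feature, [[0.0, 0.0], [0.0, 0.0]])
```

Passing a single weight vector reduces this to one attention branch over the whole feature map, which corresponds to the ungrouped setting.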
Simultaneously, in the trunk branch we simply stack three convolution layers with ELU activation. Because of the attention branches, the trunk does not need a complex structure or many layers, which reduces the number of parameters and the complexity of the model. We use average pooling to downsample the feature map. Finally, broadcast element-wise multiplication is performed to fuse the output of the trunk branch and the attention branches. Through the fusion, the output feature map is refined by the global contextual information gathered by the grouped attention operation.
2.2 Training Data & Augmentation
For training and validation, we adopt the three training datasets used in [schreiber2018single]: LMD Tempo (3,611 items), MTG Tempo (1,159 items), and Extended Ballroom (3,826 items). However, though covering multiple musical genres, the combination of these datasets is not genre-balanced, and some common genres are even missing. It is known that tempo perception is closely related to music genre. For example, for popular music, people usually perceive tempo through drumbeats, while for classical music, people often perceive tempo from bass instruments such as double bass. To alleviate the genre imbalance, we use two additional datasets to supplement the training data:
RWC-popular: To further enhance the model's ability to estimate pop music tempo, we use RWC-popular [goto2002rwc] (a pop music database with 100 pieces) for training. We cut the songs into 30 s fragments without overlap, resulting in 735 items.
FD-Tempo: To enrich the genres of the training data, we select some tracks of classical music. For each track, we choose several 30 s excerpts with stable tempi and annotate them manually. In total, 530 items are obtained as an additional dataset termed FD-Tempo.
We use the combination of the five datasets for training and validation. It contains 9,861 tracks with a total length of 41h 3min. Specifically, we randomly choose 500 tracks for validation, and the remaining 9,361 tracks are used for training.
To alleviate the BPM class imbalance, we further augment the training set by speeding up or slowing down the selected tracks with factors randomly chosen from 0.7 to 1.4, without altering the pitch. We retain the original files and make sure that the same audio is not selected more than 15 times. After augmentation, the number of tracks increases from 9,361 to 23,512. Note that the validation set is not augmented. The tempo distribution in the training set before and after augmentation is shown in Figure 3. In addition, we adopt the scale-&-crop data augmentation described in [schreiber2018single] to further increase the variability of the training data.
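When a track is sped up or slowed down, its tempo label scales by the same factor. The sketch below illustrates this, reading the stated factor range as 0.7 to 1.4. The rejection of factors that would push the label outside the 30 to 285 BPM class range is our assumption, as are the function names; the actual time-stretching of audio is not shown.

```python
import random

BPM_MIN, BPM_MAX = 30, 285

def stretched_label(bpm, factor):
    """Tempo label after changing playback speed by `factor` (pitch unchanged)."""
    return bpm * factor

def sample_factor(bpm, rng, low=0.7, high=1.4, max_tries=100):
    """Draw a speed factor whose resulting tempo stays inside the class range."""
    for _ in range(max_tries):
        factor = rng.uniform(low, high)
        if BPM_MIN <= stretched_label(bpm, factor) <= BPM_MAX:
            return factor
    return 1.0  # fall back to the unmodified track

rng = random.Random(0)
factor = sample_factor(120.0, rng)
new_bpm = stretched_label(120.0, factor)
```

Choosing which tracks to stretch, and how often, is what actually rebalances the BPM classes; that selection policy is not specified here and is left out of the sketch.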
2.3 Training Details
For training, we set the batch size to 32. In each epoch, 128 consecutive frames of each sample are randomly selected for training. We choose the categorical cross-entropy as the loss function, and an Adam optimizer [kingma2014adam] is applied with a learning rate of 0.001. We evaluate Accuracy1 on the validation set every 500 iterations, and save the model with the highest accuracy. Training stops once Accuracy1 has not improved for 50,000 iterations.
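The stopping rule can be sketched independently of any deep learning framework. In the following, `evaluate` is a hypothetical callback that returns validation Accuracy1 at a given iteration, the optimizer step is omitted, and `max_iters` is an assumed safety cap that does not appear in the paper.

```python
def train_with_early_stopping(evaluate, eval_every=500, patience=50_000,
                              max_iters=1_000_000):
    """Return (best_accuracy, best_iteration) under the stopping rule."""
    best_acc, best_iter = float("-inf"), 0
    for iteration in range(eval_every, max_iters + 1, eval_every):
        acc = evaluate(iteration)
        if acc > best_acc:
            best_acc, best_iter = acc, iteration   # the model would be saved here
        elif iteration - best_iter >= patience:
            break                                  # no improvement for `patience` iterations
    return best_acc, best_iter

# Toy validation curve that peaks at iteration 2,000 and is lower elsewhere.
curve = lambda it: 0.8 if it == 2000 else 0.5
best = train_with_early_stopping(curve)
```

In this toy run the loop keeps the checkpoint from iteration 2,000 and stops 50,000 iterations later.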
We choose Accuracy1 (ACC1) and Accuracy2 (ACC2) [gouyon2006experimental] as the evaluation metrics. Accuracy1 is defined as the percentage of correct estimates allowing a ±4% tolerance. Accuracy2 additionally ignores octave errors by a factor of 2 or 3, with the same tolerance. As mentioned earlier, the demand for highly accurate tempo annotations has become increasingly urgent in many application scenarios. Hence we mainly focus on improving Accuracy1.
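The two metrics can be written down in a few lines. This sketch takes a tolerance of 4%, the value commonly used with these metrics, as an assumption, and treats factors of 2, 3, 1/2, and 1/3 as octave errors; the function names and the example values are hypothetical.

```python
def within_tolerance(estimate, reference, tolerance=0.04):
    """True if the estimate lies within the relative tolerance of the reference."""
    return abs(estimate - reference) <= tolerance * reference

def accuracy1(estimates, references, tolerance=0.04):
    """Fraction of estimates matching the reference tempo."""
    hits = sum(within_tolerance(e, r, tolerance)
               for e, r in zip(estimates, references))
    return hits / len(references)

def accuracy2(estimates, references, tolerance=0.04,
              factors=(1.0, 2.0, 3.0, 1 / 2, 1 / 3)):
    """Like accuracy1, but also accepts octave-related tempi."""
    hits = sum(
        any(within_tolerance(e, r * f, tolerance) for f in factors)
        for e, r in zip(estimates, references)
    )
    return hits / len(references)

refs = [120.0, 90.0, 150.0, 100.0]
ests = [121.0, 180.0, 50.0, 110.0]
acc1 = accuracy1(ests, refs)   # only the first estimate is within tolerance
acc2 = accuracy2(ests, refs)   # the double and one-third estimates also count
```

Here Accuracy1 is 0.25 and Accuracy2 is 0.75, which shows how two octave errors are forgiven by the second metric but not by the first.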
We focus on global tempo estimation, based on the assumption that the tempo of the input track stays constant, so that only one BPM value is returned by the estimation system. In the experiments, the global tempo is obtained by averaging the outputs of the softmax layer over different parts of a full track [schreiber2018single].
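A minimal sketch of this averaging step follows, assuming that class index 0 corresponds to 30 BPM and that the final estimate is the class with the highest averaged probability. The helper `one_hot_like` only builds toy distributions and is not part of the method.

```python
def global_tempo(segment_probs, bpm_min=30):
    """Average per-segment class distributions and return the most likely BPM."""
    num_classes = len(segment_probs[0])
    mean = [
        sum(p[c] for p in segment_probs) / len(segment_probs)
        for c in range(num_classes)
    ]
    best_class = max(range(num_classes), key=mean.__getitem__)
    return best_class + bpm_min

def one_hot_like(peak, num_classes=256, strength=0.9):
    """Toy distribution with most of its mass on one class."""
    rest = (1.0 - strength) / (num_classes - 1)
    return [strength if c == peak else rest for c in range(num_classes)]

# Two segments vote for 120 BPM (class 90), one for 60 BPM (class 30).
probs = [one_hot_like(90), one_hot_like(90), one_hot_like(30)]
tempo = global_tempo(probs)
```

Averaging the distributions rather than the per-segment decisions lets confident segments outweigh ambiguous ones.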
3.1 Ablation Study
We study the effect of each idea in our approach. To simplify the discussion, we select two test datasets GTzan [marchand2015gtzan] and ACM Mirum [peeters2012perceptual] for analysis. These two datasets are relatively large (999 and 1,410 items respectively), and both cover rich genres.
To investigate how much the proposed GAModule contributes to the model, we design a set of experiments. First, we remove the attention branches in the module, so that only the trunk branch remains to process features. As shown in Table 1, the performance degrades for both datasets. Accuracy1 decreases by 1.9% for GTzan and 2.3% for ACM Mirum. Then, in another experiment we keep only one attention branch in each module, which is achieved by setting the number of groups in each GAModule to 1. Accuracy1 drops by 0.4% and 3.1% respectively. For Accuracy2, there is also a certain degree of decline in both experiments. These results indicate that the attention mechanism helps capture long-range dependencies and therefore improves the generalization of the model. However, directly using existing modules may limit the benefit. The proposed grouped attention takes the characteristics of spectrograms into account and further improves the model.
Then, we analyze the effect of the multi-scale architecture by changing the architecture to a single-scale one. We remove all downsampled subnetworks and only retain the one with the highest resolution (the topmost branch in Figure 1). As shown in Table 1, the model without the multi-scale architecture performs significantly worse on Accuracy1, which decreases by 3.1% and 10.9% for GTzan and ACM Mirum respectively. For Accuracy2, there is also a certain degree of performance degradation. The results indicate that the multi-scale architecture improves both classification ability and robustness.
3.2 Comparison with Previous Work
To compare with previous work, we use the same test datasets as in [schreiber2018single] (see [schreiber2017post] for details): ACM Mirum [peeters2012perceptual] (1,410 items), Hainsworth [hainsworth2003techniques] (222 items), GTzan [marchand2015gtzan] (999 items), SMC [holzapfel2012selective] (217 items), GiantSteps [knees2015two] (664 items), Ballroom [gouyon2006experimental] (698 items), and ISMIR04 [gouyon2006experimental] (465 items). The union of all test datasets is referred to as Combined. The most recent annotations available are used.
We compare our work (mgan) with previous studies by Schreiber (schr) [schreiber2018single] and Foroughmand (foro) [foroughmand2019deep]. Both are CNN-based single-step models, the class of methods we aim to improve upon, and we consider them the state of the art among single-step approaches. In addition, we compare the model with an RNN-based traditional periodicity analysis approach by Böck (böck) [bock2015accurate]. The results are shown in Table 2. Note that Ballroom, Hainsworth, and SMC are used for training in böck (values marked with asterisks *).
Focusing on Accuracy1, the experimental results show that the proposed model surpasses the other methods in most cases, which supports the effectiveness of the proposed idea for improving Accuracy1. For GiantSteps in particular (664 electronic dance music excerpts), there is a significant improvement of over 6.6%. The richness of electronic dance music in the training data can be considered one reason. The good performance on ACM Mirum and GTzan (both multi-genre datasets) shows the generalization potential of our model. Moreover, for Hainsworth, the model achieves the highest Accuracy1 among single-step approaches. Finally, the proposed method also reaches the highest Accuracy1 for Combined (79.8%) compared with the other methods, an improvement of 5.4%.
As for Accuracy2, it can be observed that böck achieves the highest accuracy in most cases. Ignoring böck, the proposed model shows a similar performance to other single-step methods.
Among all datasets, the worst results of our model are obtained for SMC. This dataset was designed to be difficult for tempo estimation and covers various genres. Although we have tried to supplement and augment the training data, the genre-imbalance problem has not been fully solved. This indicates the need to add more data from different genres in future work.
3.3 Comparison with Multi-task Approaches
In recent years, some works [bock2019multi, bock2020deconstruct] have not focused on a single rhythm attribute, but have combined the estimation of interconnected rhythm attributes (such as beats and downbeats) through multi-task learning, so that these highly related tasks can reinforce each other. These approaches are capable of embedding more musical knowledge into a single model, and of enriching the training data of each task. To further explore the potential of the proposed MGANet and compare its performance with multi-task approaches, we conduct experiments with reference to [bock2019multi], combining the beat tracking task with our model.
To predict beat positions, we add a branch to the original network structure. The inputs of the branch are the feature maps before they are sent into the tempo classifier, with shapes of (1, 128, 128), (1, 64, 128), and (1, 32, 128). The low-resolution feature maps are upsampled to a length of 128 frames on the time axis by transposed convolution layers. Then, the concatenated feature map with shape (1, 128, 384) is processed by three convolution layers (the output channel numbers are set to 128, 32, and 1 respectively). After a sigmoid operation, the beat activation function is derived. This extended network structure is trained as a multi-output model to combine the two tasks.
For the training of beat tracking, we use a combination of the following datasets: Hainsworth [hainsworth2003techniques], SMC [holzapfel2012selective], Ballroom [gouyon2006experimental], ISMIR04 [gouyon2006experimental], Beatles [davies2009evaluation], and HJDB [hockman2012one]. For the training of tempo estimation, the training and validation datasets in Section 2.2 are used. To further enrich the data, the beat-annotated datasets are also adopted for the training of the tempo classifier, using the average BPMs derived from the beat annotations as training labels. We train the two tasks alternately, switching every epoch, without changing the other experimental settings mentioned in Section 2.3.
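One simple way to derive an average BPM label from beat annotations is to take 60 divided by the mean inter-beat interval. The sketch below uses that definition, which is an assumption (a median or a histogram-based estimate would be alternatives), together with a hypothetical helper for the per-epoch task alternation whose starting task is also assumed.

```python
def bpm_from_beats(beat_times):
    """Average tempo implied by beat annotations (times in seconds)."""
    if len(beat_times) < 2:
        raise ValueError("need at least two beats")
    intervals = [b - a for a, b in zip(beat_times, beat_times[1:])]
    mean_interval = sum(intervals) / len(intervals)
    return 60.0 / mean_interval

def task_for_epoch(epoch):
    """Alternate the two tasks from one epoch to the next (starting task assumed)."""
    return "tempo" if epoch % 2 == 0 else "beat"

beats = [0.0, 0.5, 1.0, 1.5, 2.0]   # one beat every half second
label = bpm_from_beats(beats)
schedule = [task_for_epoch(e) for e in range(4)]
```

For beats spaced half a second apart, the derived label is 120 BPM.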
The experimental results are shown in Table 3. Three datasets, ACM Mirum [peeters2012perceptual], GTzan [marchand2015gtzan], and GiantSteps [knees2015two], are used as test datasets. We compare our models (the original model mgan and the multi-task model mgan+) with two multi-task approaches, böck19 [bock2019multi] and böck20 [bock2020deconstruct]. With multi-task training, improvement can be observed on ACM Mirum and GTzan. For ACM Mirum in particular, Accuracy1 increases by 2.5%, achieving the best result among all approaches. Because these two test datasets are both multi-genre datasets, the good performance can be attributed not only to multi-task learning, but also to the beat tracking datasets with rich music genres. As for GiantSteps, mgan+ performs better than böck19 and böck20, but slightly worse than mgan. This is also likely due to the additional data, which reduces the dominance of dance music in the training data.
3.4 Grad-CAM Analysis
Gradient-weighted Class Activation Mapping (Grad-CAM) [grad-cam] is a method that can faithfully highlight the important regions in inputs for a CNN-based classification model. It uses the gradient information in back-propagation as weights (grad-weights) to explain the network’s decisions. We visualize the activation maps derived by Grad-CAM as shown in Figure 4 and Figure 5. Red indicates the part more important in predicting tempo while blue contributes less.
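The core Grad-CAM computation is compact: each channel is weighted by its spatially averaged gradient, the weighted activations are summed, and a ReLU keeps the positive evidence. The following sketch operates on small nested lists with made-up activations and gradients; in practice both would come from a forward and backward pass through the network.

```python
def grad_cam(activations, gradients):
    """Grad-CAM map from per-channel activations and gradients.

    Both arguments are lists of K channels, each an H x W nested list.
    """
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weight: gradient averaged over all spatial positions.
    weights = [
        sum(sum(row) for row in grad) / (height * width)
        for grad in gradients
    ]
    cam = [[0.0] * width for _ in range(height)]
    for weight, act in zip(weights, activations):
        for i in range(height):
            for j in range(width):
                cam[i][j] += weight * act[i][j]
    # ReLU keeps only regions that push the class score up.
    return [[max(0.0, v) for v in row] for row in cam]

acts = [[[1.0, 2.0], [3.0, 4.0]], [[4.0, 3.0], [2.0, 1.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
cam = grad_cam(acts, grads)
```

In this toy case the second channel has a negative weight, so only the bottom row of the map remains active after the ReLU.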
Figure 4 shows the activation maps on branches with different resolutions. The inputs are two audio clips from the Ballroom dataset. The time duration is marked below the corresponding images, following the audio title set in italic. Figure 4(a) comes from a piece of Samba mainly played by piano and kick drum. The piano in the clip has a higher pitch and is played in quarter notes, while the kick drum falls on every beat in the bar. It can be observed from the activation maps that the model mainly focuses on the short-duration piano parts in the high-resolution branch, and on the long-duration kick drum parts in the low-resolution branch. As for the second example, which is a Cha Cha song, the beat positions can be identified from the kick drum in the low-frequency part, the vocal in the middle-frequency part, and the claves in the high-frequency part. Figure 4(b) shows that the low-resolution branch considers downbeats to be important, while the high-resolution branch focuses not only on downbeats but on every other beat in a bar. This suggests that the multi-scale structure is capable of integrating useful information with different granularities.
We also visualize the activation maps before and after the proposed grouped channel attention to explore its effect. The results are shown in Figure 5. The music excerpt of Figure 5(a) is played with regular claves and double bass, hence the high-frequency part and the low-frequency part contribute more to tempo estimation. The attention branch reweights the feature maps from the trunk branch, giving the top and bottom parts higher weights so that tempo information is easier to detect. In contrast, the vocal dominates the rhythm information in the song of Figure 5(b), thus the model gives higher attention to the middle-frequency part after grouped attention. Through grouped attention, the network can efficiently find which parts are important for tempo estimation.
In this paper, we propose a new CNN-based single-step approach for tempo estimation. We introduce the idea of multi-scale network to construct the architecture of the proposed MGANet. The GAModule with the grouped channel attention is designed to be the key component of the network. Compared with previous work, the proposed approach exhibits good performance on Accuracy1 and outperforms existing models in most cases.
This work was supported by the National Key R&D Program of China (2019YFC1711800) and NSFC (61671156).