Automatic music genre recognition (MGR) aims to extract genre tag(s) from an audio recording of music. It is an extension of the well-studied task of automatic genre classification (MGC), where the aim is to extract a single genre tag from a musical piece. MGC can be considered a subtask of music auto-tagging (MAT), where the aim is to extract semantic tags (i.e., high-level descriptive keywords related to genres, moods, time period, and instruments, etc.) from a song. An accurate MGR system enables diverse applications in music production, computational musicology, and especially music search and recommendation, where genres are widely used to organize music and aid in retrieval and discovery [5, 27, 15].
Many recent studies on MAT, MGC, and MGR implement a convolutional neural network (CNN) architecture [32, 26, 38, 8]. Several variants of the CNN have been explored in these studies. At a high level, these architectures differ in the input representation and the amount of domain knowledge leveraged. Most previous studies convert the raw waveform of a music audio recording into a spectrogram (particularly the mel-spectrogram) input representation [39, 8, 30] and leverage popular CNN methods from computer vision (CV) such as VGG-style networks (i.e., a deep stack of convolutional blocks with 2D filters) [8, 30, 38, 34]. On the other hand, a small selection of recent studies experiment with systems that learn directly from raw waveform audio inputs. The most successful waveform-based networks also adapt several approaches from CV, including VGG-style networks, residual connections, and squeeze-and-excitation attention. The primary differences between spectrogram-based networks and these networks are that the latter use 1D convolutions and avoid preprocessing the raw waveform audio.
Despite the success of CNN architectures for music tagging and genre recognition tasks, the inductive biases in the convolutional layer may be too restrictive to account for the complexities of certain music tags, music genres in particular. Genres can be expressed on different timescales (i.e., local vs. global structure), have different levels of abstraction (e.g., language, region, sound, tempo, subject), and overlap with other genres (e.g., rock and country, hip hop and funk) [30, 27, 15]. For example, identifying factors of pop music typically include repeated choruses and hooks, and rhythms or tempos that can be easily danced to. It is also common for pop music to borrow elements from other genres such as rock, dance, and Latin. Yet most successful music tagging networks operate on a single timescale and implement sample-level filters (i.e., filter size < 6) [38, 32, 26].
The self-attention mechanism was recently proposed to model sequential data for natural language processing (NLP) tasks. The key idea behind self-attention is to produce a weighted average of values computed from hidden units. Unlike the pooling or the convolutional operator, the weights used in the weighted average operation are produced dynamically via a similarity function between hidden units. As a result, the interaction between input signals depends on the signals themselves rather than being predetermined by their relative location, as in convolutions. This enables self-attention to capture long-range interactions without increasing the number of parameters.
In this work, we attempt to improve the expressive capacity of waveform-based discriminative music networks by modeling both sequential (temporal) and hierarchical information in an efficient end-to-end architecture. We present MuSLCAT, or Multi-scale and Multi-level Convolutional Attention Transformer, a novel architecture for learning robust representations of complex music tags directly from raw waveform recordings. We also introduce a lightweight variant of MuSLCAT called MuSLCAN, short for Multi-scale and Multi-level Convolutional Attention Network. Both MuSLCAT and MuSLCAN model features from multiple scales and levels by integrating a frontend-backend architecture. The frontend targets different frequency ranges while modeling long-range dependencies and multi-level interactions by using two convolutional attention networks with attention augmented convolution (AAC) blocks. The backend dynamically recalibrates the multi-scale and multi-level features extracted from the frontend by incorporating self-attention. MuSLCAT and MuSLCAN differ in their backend components: MuSLCAT’s backend is a modified version of BERT, while MuSLCAN’s is a simple AAC block. We validate both MuSLCAT and MuSLCAN by comparing them to state-of-the-art networks on four benchmark datasets for music tagging and genre recognition. Our experiments show that MuSLCAT and MuSLCAN consistently yield results competitive with state-of-the-art waveform-based models, yet require considerably fewer parameters.
Attending jointly to channel and temporal subspaces by implementing attention augmented convolution blocks.
Emphasizing different frequencies by using two convolutional attention network branches.
Leveraging all the information in multi-level or multi-scale and level features by recalibrating features using self-attention instead of summarizing them with global pooling.
Modeling long-range dependencies and multi-scale and level feature interactions by integrating either AAC or a modified BERT  in the backend.
Training in an end-to-end fashion (i.e., no pretraining or preprocessing).
Reducing the number of parameters without sacrificing representational power.
The rest of this paper is structured as follows. We review previous music tagging methods and relevant research for sequence modeling in Section 2. In Section 3, we formally describe the proposed MuSLCAT architecture. We describe our experiments in Section 4 and discuss results in Section 5. We offer concluding remarks and ideas for future work in Section 6.
2 Related Work
In this section, we describe the most successful architectures for music tagging and classification, discuss complementary studies, and briefly introduce relevant research in natural language processing (NLP) and computer vision (CV).
2.1 Deep neural networks for discriminative music tasks
Recent research shows feature learning approaches that use a deep neural network (DNN) outperform traditional hand-crafted feature modeling approaches for solving discriminative music tasks such as MAT and MGR[32, 26, 38, 8, 7, 14, 48, 46]. Perhaps the most ubiquitous DNN variant is the convolutional neural network (CNN). It exhibits several implicit biases such as weight sharing and shift equivariance (i.e., translation equivariance) and can be configured to learn invariant feature representations. It also pairs well with current compute and modeling resources, which helps improve training speed and ease of implementation. Many recent CV and NLP studies introduce operations or design configurations to enhance the modeling power of the CNN and reach state-of-the-art performance as a result [18, 20, 19]. MIR studies quickly began integrating these CV and NLP CNN methods for discriminative music tasks.
We detail the state-of-the-art networks for music tagging and genre recognition in the two subsections below. (All networks described in this section use batch normalization followed by ReLU activation after each convolutional operation, unless another method is explicitly stated.) Each subsection represents a popular input representation and the networks that use it.
2.1.1 Spectrogram-based networks
Spectrogram-based networks take a spectrogram (most commonly a mel-spectrogram) as their input representation. This requires a preprocessing stage with multiple computational steps, including the Short-Time Fourier Transform (STFT), an absolute value operation, linear-to-mel mapping, and magnitude compression, to transform the raw waveform audio into spectrograms before training and inference can occur. The preprocessing stage can be seen as a special case of a neural network in which the linear transforms (STFT and linear-to-mel mapping) and non-linearities (absolute value operation and magnitude compression) are fixed by hand design [23, 34, 25]. The hyper-parameters of these “fixed networks,” such as window size, hop size, and mel-filter bank size, are sensitive to different audio domains, so it is common to select them according to best practice in each task. We refer to models that use spectrograms as inputs as “spectrogram-based.”
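The four preprocessing steps named above (STFT, absolute value, linear-to-mel mapping, magnitude compression) can be sketched in plain Python. This is only an illustration under stated assumptions: the function names, the Hann window, the triangular mel filters, the `log(1 + x)` compression, and the tiny default window, hop, and filter-bank sizes are all choices made here, not values taken from any of the cited networks.

```python
import cmath
import math


def hz_to_mel(f):
    # Common HTK-style mel formula (an assumption; other variants exist).
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def stft_magnitude(signal, win_size, hop_size):
    """Steps 1-2: Hann-windowed frames -> naive DFT -> absolute value."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / win_size) for n in range(win_size)]
    n_bins = win_size // 2 + 1
    frames = []
    for start in range(0, len(signal) - win_size + 1, hop_size):
        frame = [signal[start + n] * window[n] for n in range(win_size)]
        frames.append([
            abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / win_size)
                    for n in range(win_size)))
            for k in range(n_bins)
        ])
    return frames


def mel_filterbank(n_mels, win_size, sample_rate):
    """Step 3: triangular filters spaced evenly on the mel scale."""
    top = hz_to_mel(sample_rate / 2)
    hz_points = [mel_to_hz(i * top / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_freqs = [k * sample_rate / win_size for k in range(win_size // 2 + 1)]
    bank = []
    for m in range(1, n_mels + 1):
        lo, center, hi = hz_points[m - 1], hz_points[m], hz_points[m + 1]
        bank.append([
            max(0.0, min((f - lo) / (center - lo), (hi - f) / (hi - center)))
            for f in bin_freqs
        ])
    return bank


def log_mel_spectrogram(signal, sample_rate, win_size=64, hop_size=32, n_mels=8):
    """Step 4: apply the filter bank, then compress magnitudes with log(1 + x)."""
    magnitudes = stft_magnitude(signal, win_size, hop_size)
    bank = mel_filterbank(n_mels, win_size, sample_rate)
    return [
        [math.log(1.0 + sum(w * x for w, x in zip(row, frame))) for row in bank]
        for frame in magnitudes
    ]
```

Every quantity in this pipeline (window size, hop size, number of mel bands) is fixed before training, which is the sense in which the text calls the preprocessing stage a hand-designed "fixed network".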
A deep fully convolutional network (FCN) was one of the first deep learning approaches to reach state-of-the-art performance for music tagging. It consists of only convolutional layers without any fully-connected (FC) layers. The preprocessing step for the FCN involves converting an audio segment of 29.1 seconds to a mel-spectrogram. The mel-spectrogram is then fed as input to a 4-layer CNN. Each layer contains a convolutional operation with 2D filters followed by a max-pooling operation. Different stride sizes are used for the max-pooling layers ((2, 4), (4, 5), (3, 8), (4, 8)) to increase the size of the receptive fields so that they cover the entire input mel-spectrogram. This training method is termed “song-level” training since most music tagging and genre recognition datasets can only provide approximately 30s-long audio clips for songs.
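To see why these pooling strides let the network cover a whole clip, the cumulative downsampling can be worked out directly. The sketch below assumes that each pooling window equals its stride with no padding, and the input size used in the usage note is a hypothetical example rather than a value stated in the text.

```python
from math import prod

# (frequency, time) max-pooling strides of the four FCN layers listed above.
POOL_STRIDES = [(2, 4), (4, 5), (3, 8), (4, 8)]


def total_downsampling(strides):
    """Overall reduction factor along the frequency and time axes."""
    return prod(s[0] for s in strides), prod(s[1] for s in strides)


def pooled_shape(shape, strides):
    """Feature-map size after every pooling layer (floor division, no padding)."""
    freq, time = shape
    for stride_f, stride_t in strides:
        freq, time = freq // stride_f, time // stride_t
    return freq, time
```

For instance, `total_downsampling(POOL_STRIDES)` gives a reduction of 96 along frequency and 1280 along time, so a hypothetical 96-band input with 1366 frames collapses to a single position per channel after the fourth layer.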
MusiCNN enhanced the representational capacity of the CNN for music tagging by leveraging some music domain knowledge. In particular, the researchers encourage MusiCNN to learn timbral characteristics and temporal patterns by carefully designing vertical and horizontal filters in the first convolutional layer. The vertical filters model pitch-invariant timbral features by capturing sub-band information across a short period of time and then enforcing pitch invariance with global max-pooling across the frequency axis. The horizontal filters model the temporal energy envelope of the audio by first applying global mean-pooling across the frequency axis of the input mel-spectrogram and then feeding the output to long horizontal filters. The extracted timbral (vertical output) and temporal (horizontal output) features are concatenated channel-wise and forwarded to a backend CNN. The backend contains three 1D convolutional layers with residual connections followed by two FC blocks. MusiCNN also leverages multi-level feature modeling. It first performs channel-wise concatenation of the features extracted from the frontend and from each convolutional layer of the backend. Channel statistics are then computed by applying both global max-pooling and global mean-pooling across the temporal dimension of each channel. The extracted max and mean channel statistics are concatenated channel-wise and fed to the two FC blocks to predict relevant tags. Unlike the FCN, MusiCNN uses short 3s-long audio excerpts (which are converted to mel-spectrograms) as its inputs during training. This short-audio training method is termed “chunk-level” training.
HarmonicCNN , like MusiCNN, leverages some music domain knowledge in its design. It takes advantage of trainable harmonic band-pass filters in the first layer, which make the model more flexible to learning representations from the spectrogram. The extracted harmonic features are fed to a 7-layer CNN. Each layer contains a convolutional operation with sample-level 2D filters followed by either a or max-pooling operation. The CNN output is then passed through two FC blocks to predict relevant tags. It also uses 5s-long audio segments (which are converted to spectrograms) to perform chunk-level training.
ShortChunkCNN takes advantage of the popular VGG-style CNN that is prevalent for CV tasks. It consists of a 7-layer CNN frontend and a 2-layer fully-connected backend. In the frontend, each layer contains a convolutional operation with 2D filters followed by a max-pooling operation. The frontend was also enhanced with residual connections in the same study. We refer to this enhanced network as ShortChunkCNN + Res. Both networks use short 3.69s-long chunks of audio (which are converted to spectrograms) to perform chunk-level training, hence their names. The main difference between the FCN and these two networks is that the latter use chunk-level training and smaller max-pooling operations.
The sequential nature of music may lead one to believe modeling long-range dependencies could be a desirable property of a DNN. A convolutional recurrent neural network (CRNN) was proposed, to this effect. It extracts local features using a 4-layer CNN frontend, summarizes them with a recurrent neural network (RNN) backend, and feeds the resulting output to a single FC layer to predict relevant tags. Each CNN layer contains a convolution with 2D filters and a max-pooling operation. Different sizes of filters and strides are used for max-pooling layers ((2, 2), (3, 3), (4, 4), (4, 4)) to downsample the input before passing it to the RNN. It also uses 29.1s-long audio chunks (which are converted to mel-spectrograms) to perform song-level training.
CNN + SA was proposed to better model long-range dependencies and avoid the vanishing/exploding gradient problem that occurs in RNNs. It contains a CNN frontend composed of a deep stack of seven 2D convolutional blocks. Each block integrates a convolutional operation with 2D filters, a residual connection, and a max-pooling operation. Long-term dependencies are learned by feeding the extracted features from the frontend to a backend network. The backend is composed of BERT (short for Bidirectional Encoder Representations from Transformers), which is a very popular architecture for various NLP tasks. BERT uses a deep stack of self-attention layers. CNN + SA uses 15s-long audio excerpts (which are converted to mel-spectrograms) to perform chunk-level training.
We note that the main differences between previous sequential networks (CRNN and CNN + SA) and our proposed models are that ours operate directly on waveform inputs, integrate multi-scale and multi-level features, attend jointly to temporal and channel subspaces, and do not remove information from feature maps.
2.1.2 Waveform-based networks
A small collection of recent studies experiment with models that can operate directly on raw waveform input representations [14, 32, 26]. We refer to these models as “waveform-based.” The advantages of waveform-based models are that they can operate in an end-to-end, assumption-free fashion and can learn task- or domain-specific filters, since they avoid preprocessing audio inputs and relax the time-frequency resolution trade-off implicit in spectrogram-based networks [25, 49]. The main disadvantage is that they generally require more training data [34, 48].
SampleCNN was the first waveform-based CNN architecture to achieve competitive performance for music tagging. It is an end-to-end, assumption-free model. It contains a deep stack of ten 1D convolutional blocks. Each block consists of a convolutional operation with sample-level 1D filters followed by a max-pooling operation. However, the first block uses a stride of 3 to downsample the input. SampleCNN + SE subsequently enhanced the representational power of SampleCNN by adding a residual connection and the squeeze-and-excitation (SE) channel-wise attention operation to each convolutional block. Both sample-level networks also integrate multi-level feature modeling by applying channel-wise global max-pooling to the top three layers. The extracted channel-wise statistics from each layer are then concatenated along the channel dimension and fed to two FC blocks to predict relevant tags. A variant of SampleCNN was proposed to target multi-scale features in addition to multi-level ones. We call this variant “SampleCNN (x9).” It consists of a frontend-backend architecture: the frontend contains an ensemble of nine pretrained SampleCNNs, while the backend is a 2-layer fully-connected network. Each SampleCNN in the frontend is trained in a supervised fashion on a different timescale by adjusting its filter and stride size in the first layer. Once trained, multi-scale and multi-level features are extracted and summarized by a combination of average and max-pooling. The resulting outputs are then concatenated and fed to the backend to predict relevant tags. Note that SampleCNN (x9) is not an end-to-end network since it requires pretraining, unlike SampleCNN and SampleCNN + SE. SampleCNN, SampleCNN + SE, and SampleCNN (x9) use short audio clips (3-5s) for chunk-level training. Studies show that SampleCNN and SampleCNN + SE also perform well across different audio tasks including speech and acoustic scene recognition. The main difference between other waveform-based models and the above three networks is that the latter use very small 1D filters (between 2 and 5) in all layers.
To summarize, at a high level, the architectures described in sections 2.1.1 and 2.1.2 can be distinguished by two factors: the input representation and the amount of music domain knowledge used in the design process. Previous studies tend to use the spectrogram or one of its variants, such as the mel-spectrogram, as the input representation for a network [8, 38, 30, 47, 46], while a small selection of recent studies explore raw waveform audio as the input representation [14, 32, 26]. Many previous studies incorporate music domain knowledge in the network design process on some level. This might involve designing filters to target harmonic information [38, 46], using sequential methods to model long-range interactions [7, 47], and/or integrating multi-scale and/or multi-level features to account for hierarchical dependencies of music tags [30, 38, 26]. Only a few studies use little or no music domain knowledge, or at least not explicitly [32, 48].
2.2 Attention and Transformers
Attention mechanisms have enjoyed widespread adoption as a computational module for sequence modeling [2, 44, 12, 3]. Notably, Bahdanau proposed integrating attention with a recurrent neural network (RNN)  for machine translation. Vaswani et al.  introduced the Transformer architecture, obtaining state-of-the-art performance on machine translation, and subsequently, many other NLP tasks. The Transformer consists of an encoder and decoder, which both use a deep stack of self-attention  and point-wise FC layers. It also incorporates positional information by augmenting inputs with absolute position embeddings. The position embeddings can be fixed or learned during training to model the dependency between elements at different positions in the input sequence. Devlin et al.  enhanced the Transformer by introducing BERT, or Bidirectional Encoder Representations from Transformers, which only uses the encoder part from the Transformer. BERT’s main contribution is incorporating bidirectional information into the encoder by using a masked language modeling (MLM) pretraining task.
The primary component of the Transformer and its variants like BERT is self-attention . Self-attention is a form of attention that processes a sequence by replacing each element by a weighted average of the rest of the sequence. It also does not suffer from the vanishing/exploding gradient problem that is common in other sequential modeling techniques such as RNNs.
Despite the inherent sequential structure of music, attention has not been widely explored in MIR, particularly for discriminative music tasks. Notable examples of attention in MIR include studies on music generation, source separation, music tagging [25, 45], and chord recognition. Previous studies on attention mechanisms for music tagging tasks focus on recalibrating convolutional features by addition or gating, and typically only attend to the channel subspace as opposed to the temporal subspace.
In this section, we introduce the motivation behind our proposed waveform-based Multi-Scale, Multi-Level Convolutional Attention Transformer (MuSLCAT) architecture and formally detail its construction. We provide a high-level representation of MuSLCAT’s structure in Figure 2.
3.1 Motivation: Learning representations for complex music tags
Previous MAT / MGR studies suggest that individual music tags have different performance sensitivity to different timescales and levels of features [30, 29, 17, 13, 43, 27]. Unsurprisingly, most of the top-performing MAT / MGR networks enhance their representational power by combining features from different scales, levels, or both [26, 31, 30]. In particular, two approaches are common. Most previous studies learn multi-scale and multi-level features by using a three-stage training method: local feature learning, feature aggregation, and global classification [31, 30]. In stage one, an ensemble of CNNs is trained in a supervised manner with tag labels, where each CNN takes a different input size. The feature aggregation stage extracts multi-level features using the pretrained CNNs and summarizes them into a single song-level feature vector. Finally, the last stage uses a multi-layer fully-connected (FC) network to predict relevant tags. Another, more recent approach is to model only multi-level features (but not multi-scale features) by using squeeze-and-excitation attention and multi-level feature aggregation. Its advantage is that it can be trained in a single stage (i.e., no pretraining).
3.2 The MuSLCAT Architecture
Previous attempts to model multi-scale features, multi-level features, or both either sacrifice training simplicity for enhanced representational power (i.e., multi-stage training) or vice versa (i.e., multi-level modeling only), which is not ideal. Additionally, few previous studies model long-range interactions by learning directly from raw waveform audio. Thus, we present MuSLCAT, or Multi-scale and Multi-level Convolutional Attention Transformer, a novel architecture for learning robust representations of complex music tags directly from raw waveform recordings in an end-to-end fashion. We also introduce a lightweight variant of MuSLCAT called MuSLCAN, short for Multi-scale and Multi-level Convolutional Attention Network. Both MuSLCAT and MuSLCAN model features from multiple scales and levels by integrating a frontend-backend architecture. The frontend targets different frequency ranges while modeling long-range dependencies and multi-level interactions by using two convolutional attention networks with attention augmented convolution (AAC) blocks. The backend dynamically recalibrates the multi-scale and multi-level features extracted from the frontend by incorporating self-attention. MuSLCAT and MuSLCAN differ in their backend components: MuSLCAT’s backend is a modified version of BERT, while MuSLCAN’s is a simple AAC block.
3.2.1 The Frontend
MuSLCAT’s frontend combines two convolutional attention network (CAN) branches to efficiently learn useful representation from raw audio data. Each CAN captures temporal and hierarchical interaction using a mixture of convolution, squeeze-and-excitation (SE), and attention augmented convolution (AAC) blocks. We apply batch normalization  and ReLU activation in convolution and SE blocks and layer normalization  in AAC blocks. We occasionally employ max pooling with filter and stride size 3 after blocks to reduce the network’s memory footprint. Finally, we set the depth (i.e., number of layers) of the CANs to 10 and 11, respectively.
We encourage the CANs to emphasize different frequencies (e.g., low, mid, high) by adjusting their filter and stride size in the first layer. One CAN emphasizes high frequencies using a convolution with a 1D filter, while the other CAN targets low to mid-level frequencies using a convolution with a 1D filter. We name these two branches highCAN (red branch in Figure 2) and lowCAN (blue branch in Figure 2) according to the frequencies they aim to represent. We model multi-level feature interactions by concatenating feature maps from the top four layers and reweighting them using an AAC block in each CAN.
3.2.2 The Backend
MuSLCAT’s backend consists of two modules: 1) our modified BERT architecture (detailed in section 3.6), which recalibrates the multi-scale and multi-level features extracted from the frontend, and 2) a simple 2-layer FC classifier to predict relevant music tags. We apply the channel-wise concatenation procedure described in section 3.5 to extract and combine multi-scale and multi-level features from the frontend and recalibrate them using BERT. For BERT, we use a 6-layer, 512-feature model with 8 attention heads. We set the dropout rate to 0.2 in both the self-attention and feed-forward operations. We treat the multi-scale and multi-level features as BERT’s input embeddings and add a [CLS] token to the beginning of each sequence, since our goal is multi-label binary classification. The activations of the highest layer of the model at the [CLS] token are treated as the final feature representation of the multi-scale and multi-level audio features. These features are then projected to the prediction space using the 2-layer FC classifier.
3.3 Attending to important features
Here we describe the attention mechanisms used in MuSLCAT.
3.3.1 Squeeze-and-Excitation Attention
The convolution operation is limited by its locality and lack of understanding of global context, and multiple MIR studies suggest methods to overcome these limitations [30, 32, 26]. In particular, SampleCNN + SE  extends the original SampleCNN  by integrating squeeze-and-excitation (SE)  blocks, resulting in state of the art performance for waveform-based networks.
The SE attention method performs channel-wise recalibration of convolutional feature maps via two operations. First, a squeeze operation applies global average pooling along the temporal dimension of feature maps to produce channel-wise statistics, resulting in a $C$-dimensional tensor (we omit the batch dimension for clarity), where $C$ represents the number of filters in the convolution. Then, an excitation operation feeds the squeeze operation’s output to a simple gating mechanism to produce per-channel modulation weights. The gating mechanism consists of two fully-connected (FC) layers that compute nonlinear interactions among channels. Finally, channel-wise multiplication between the original convolution’s output and the excitation operation’s output recalibrates the feature maps, producing the final feature maps.
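The squeeze, excitation, and recalibration steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `se_block`, the bias-free weight matrices `w1` and `w2`, and the ReLU and sigmoid non-linearities (taken from the original SE formulation) are assumptions made here.

```python
import math


def se_block(feature_map, w1, w2):
    """Squeeze-and-excitation over a 1D feature map.

    feature_map: list of C channels, each a list of T time steps.
    w1: hidden x C weights of the first FC layer (biases omitted, an assumption).
    w2: C x hidden weights of the second FC layer.
    """
    # Squeeze: global average pooling along time gives one statistic per channel.
    squeezed = [sum(channel) / len(channel) for channel in feature_map]
    # Excitation: FC -> ReLU -> FC -> sigmoid yields one gate per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Recalibration: scale every time step of a channel by that channel's gate.
    return [[g * x for x in channel] for g, channel in zip(gates, feature_map)]
```

Note that each gate is computed from a time-averaged statistic, so the block reweights whole channels and cannot distinguish one time step from another.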
3.3.2 Self-Attention over waveforms
Introduced in  for NLP tasks, self-attention is a recent advance to capture long-range interactions in sequential data. It works by computing a weighted average of values from hidden units using a similarity function that can dynamically adjust the weights. The use of multi-head attention in self-attention enables it to attend jointly to both temporal and channel subspaces and remove none of the temporal information, unlike SE, which only attends to channels and removes temporal information via a global aggregation operation.
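Below we detail self-attention over waveforms. We use the following naming conventions: $T$ and $F_{in}$ refer to the sequence length and number of input channels of a feature map. $N_h$, $d_v$, $d_k$, and $h$ respectively refer to the number of heads, the depth of values, the depth of queries and keys, and a single head in multi-head attention; $d_v^h$ and $d_k^h$ denote the corresponding depths for head $h$. Additionally, we omit the batch dimension for clarity.
Given an input tensor $X$ of shape $(T, F_{in})$, we compute multi-head attention (MHA) as proposed in the original Transformer architecture. The output of the self-attention mechanism for a single head $h$ can be formulated as:
$$O_h = \mathrm{Softmax}\left(\frac{(X W_q)(X W_k)^\top}{\sqrt{d_k^h}}\right)(X W_v)$$
where $W_q, W_k \in \mathbb{R}^{F_{in} \times d_k^h}$ and $W_v \in \mathbb{R}^{F_{in} \times d_v^h}$ are learned linear transformations that project the input $X$ to queries $Q = X W_q$, keys $K = X W_k$, and values $V = X W_v$. The outputs of all heads are then concatenated and projected again as follows:
$$\mathrm{MHA}(X) = \mathrm{Concat}\left[O_1, \dots, O_{N_h}\right] W^O$$
where $W^O \in \mathbb{R}^{d_v \times d_v}$ is a learned linear transformation. $\mathrm{MHA}(X)$ is then reshaped into a tensor of shape $(T, d_v)$ to match the original temporal dimension. The MHA operation incurs a complexity of $O(T^2 d_k)$ and a memory cost of $O(T^2 N_h)$, since it stores attention maps for each head.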
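A small pure-Python sketch of scaled dot-product multi-head attention over a sequence of feature vectors may help make the procedure concrete. The helper names (`matmul`, `attention_head`, `multi_head_attention`) and the list-of-lists layout (time steps by channels) are illustrative assumptions, and the sketch ignores batching and relative positions.

```python
import math


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(x, w_q, w_k, w_v):
    """One head: project to queries, keys, values, then average the values."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(q[0]))  # square root of the per-head key depth
    logits = matmul(q, list(zip(*k)))  # (time x time) similarity matrix
    weights = [softmax([value / scale for value in row]) for row in logits]
    return matmul(weights, v), weights


def multi_head_attention(x, heads, w_o):
    """heads: list of (w_q, w_k, w_v) triples; w_o mixes the concatenated heads."""
    outputs = [attention_head(x, *head)[0] for head in heads]
    concat = [sum((out[t] for out in outputs), []) for t in range(len(x))]
    return matmul(concat, w_o)
```

Each head stores a full time-by-time weight matrix, which is where the quadratic memory cost in the sequence length comes from.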
Positional Embeddings: Without explicit information about positions, self-attention is permutation equivariant [3, 40], which may not be optimal for modeling highly structured data such as music. Several studies suggest positional encoding methods that augment feature maps with explicit temporal information to address permutation issues. In particular, the original Transformer  introduces absolute position representations, using either positional sinusoids or learned position embeddings that are added to the per-position input representations.
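For reference, the sinusoidal variant of absolute position representations can be sketched as below. The formula is the one from the original Transformer; the function names and the list-of-lists layout are assumptions made for this illustration.

```python
import math


def sinusoidal_positions(seq_len, depth):
    """Absolute position table: sine on even indices, cosine on odd indices."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(depth):
            angle = pos / (10000 ** ((2 * (i // 2)) / depth))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def add_positions(features):
    """Add the position table to per-position input features (time x depth)."""
    table = sinusoidal_positions(len(features), len(features[0]))
    return [[f + p for f, p in zip(f_row, p_row)] for f_row, p_row in zip(features, table)]
```

Because each row of the table depends on the absolute index `pos`, shifting the input in time changes the features that attention sees.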
However, absolute positional encodings do not satisfy translation equivariance, which we hypothesise is a desirable property for processing music data. The work of Shaw et al. introduces relative position representations for the original Transformer, which allow attention to be informed by how far apart two positions are in a sequence. This involves learning an additional relative position embedding, which has an embedding $r_{j-i}$ for each possible pairwise distance between a query in position $i$ (the $i$-th row of $Q$) and a key in position $j$ (the $j$-th row of $K$). The embeddings are learned separately for each head. The self-attention output for head $h$ now becomes:
$$O_h = \mathrm{Softmax}\left(\frac{Q K^\top + S^{rel}}{\sqrt{d_k^h}}\right) V$$
where $d_k^h$ is the depth of queries and keys for head $h$ and $S^{rel}$ is the matrix of relative position logits that satisfies $S^{rel}[i, j] = q_i^\top r_{j-i}$, with $q_i$ the $i$-th row of $Q$.
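The relative position logits can be illustrated with a short sketch. It assumes one embedding per signed distance, stored in a table with one row for each of the distances from -(T-1) to T-1 for a sequence of length T; that table layout and the function name `relative_logits` are assumptions made here, not details confirmed by the text.

```python
def relative_logits(q, rel_emb):
    """Relative position logits for unmasked self-attention.

    q: T x d queries (one row per position).
    rel_emb: (2T - 1) x d table; row (j - i) + (T - 1) holds the embedding
        for signed distance j - i (this layout is an assumption).
    Entry [i][j] of the result is the dot product of query i with the
    embedding of distance j - i.
    """
    t = len(q)
    return [
        [sum(a * b for a, b in zip(q[i], rel_emb[(j - i) + t - 1])) for j in range(t)]
        for i in range(t)
    ]
```

The logit for a pair of positions depends only on their distance, never on their absolute indices, which is the translation-equivariance property motivated above. The table holds one embedding per distance instead of one per pair of positions.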
The original implementation of relative attention explicitly stores all relative embeddings in a tensor of shape $(T, T, d_k^h)$, where $T$ is the sequence length and $d_k^h$ the per-head depth of queries and keys. This incurs an additional memory cost of $O(T^2 d_k^h)$, thus restricting its application to long sequences and large batch sizes. Later work introduces a memory-efficient version of relative masked attention. We slightly modify this approach to obtain unmasked relative self-attention for music audio recordings, thus reducing the relative positional embedding memory cost from $O(T^2 d_k^h)$ to $O(T d_k^h)$.
3.3.3 Attention Augmented Convolution
Introduced for image recognition and object detection tasks, attention augmented convolution (AAC) augments convolutional feature maps with self-attention feature maps. This is achieved by concatenating convolutional features and MHA features along the channel axis. The combination of a convolution operation and a self-attention operation enables an AAC block to attend jointly to both channel and temporal subspaces, and to add additional features to convolutional feature maps rather than refining them as in an SE block. It also makes it easy to adjust the proportion of attentional versus convolutional channels in each block. These properties make AAC an ideal candidate for our task, hence we modify the original implementation to operate on 1D data (i.e., raw waveforms) and integrate it in our proposed MuSLCAT network. Formally, given a convolution with input channels and output channels represented by $F_{in}$ and $F_{out}$ respectively, we can describe the corresponding AAC as:
$$\mathrm{AAConv}(X) = \mathrm{Concat}\left[\mathrm{Conv}(X), \mathrm{MHA}(X)\right]$$
where $\upsilon = d_v / F_{out}$ denotes the ratio of attentional channels to the number of original convolutional channels and $\kappa = d_k / F_{out}$ denotes the ratio of key depth to the number of output channels, with $d_v$ the depth of values and $d_k$ the depth of queries and keys. In terms of AAC’s impact on the number of parameters, the MHA component introduces a convolution with $F_{in}$ input filters and $(2 d_k + d_v) = F_{out}(2\kappa + \upsilon)$ output filters to compute query, key, and value features, and an additional convolution with $d_v = F_{out}\upsilon$ input and output filters to mix the contributions of all the heads. Thus, the number of parameters can then be approximated by
where denotes the original convolution operation’s kernel size (we ignore relative position embedding parameters for simplicity since these are negligible). In practice, this yields a slight reduction in parameters when replacing convolutions and a slight increase in parameters when replacing convolutions. Interestingly, in experiments we find that our AAC-enhanced networks achieve competitive results and require fewer parameters than state-of-the-art waveform-based models.
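As an illustration of how the two ratios control an AAC block, the sketch below splits a block's output channels into a convolutional part and an attentional part and concatenates the two feature maps along the channel axis. The function names, the rounding, and the assumption that the convolution keeps the channels not assigned to attention are choices made here for illustration.

```python
def aac_channel_split(f_out, attn_ratio, key_ratio):
    """Divide an AAC block's output channels between convolution and attention.

    attn_ratio: attentional channels / output channels.
    key_ratio: key depth / output channels.
    Assumes the convolution produces the channels not taken by attention.
    """
    attn_channels = int(round(attn_ratio * f_out))
    key_depth = int(round(key_ratio * f_out))
    return {
        "conv_channels": f_out - attn_channels,
        "attn_channels": attn_channels,
        "key_depth": key_depth,
    }


def concat_channels(conv_features, attn_features):
    """Concatenate two (time x channels) feature maps along the channel axis."""
    return [c + a for c, a in zip(conv_features, attn_features)]
```

With 128 output channels, an attentional ratio of 0.25, and a key ratio of 0.5, this split gives 96 convolutional channels, 32 attentional channels, and a key depth of 64.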
3.4 Multi-level feature modeling
Multiple studies suggest music tags have different levels of abstraction, and so combining features from different levels improves performance [30, 29, 17, 13, 26, 31]. The common multi-level feature modeling method has two steps: extract multi-level features and summarize them into a single feature vector. The first step extracts features from the last $M$ layers. The next step applies global pooling (either average or max) to produce channel-wise statistics for each layer, resulting in $M$ tensors of shape $(C_l)$, where $C_l$ is the number of convolutional filters in the $l$-th layer (we ignore the batch dimension for clarity). The summarized per-layer features are finally concatenated to produce the final multi-level feature representation, which is a tensor of shape $(C)$, where $C$ equals the sum of all filters from the last $M$ layers.
This multi-level approach removes potentially valuable information from the final multi-level feature representation by applying global average or max-pooling. We hypothesize that this could limit the representational power of a multi-level model. As a solution, we suggest modeling the interactions between levels by concatenating features from different levels along the temporal axis and using an AAC block to recalibrate them instead. Formally, our multi-level method takes the features from the last few layers of the network (the multi-level depth), concatenates them along the temporal axis, and passes the result through an AAC block with a chosen number of output filters.
The temporal length of the concatenated representation equals the sum of the number of temporal features from all selected layers. Note that our approach requires these layers to have the same number of output filters.
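As a rough illustration of the concatenation step only (the AAC recalibration that follows it is omitted), the sketch below joins feature maps along the temporal axis and enforces the equal-channel requirement. The `[channel][time]` list layout and the function name are assumptions made for the example.

```python
from typing import List

FeatureMap = List[List[float]]  # indexed as [channel][time]


def concat_along_time(layers: List[FeatureMap]) -> FeatureMap:
    """Concatenate feature maps from several layers along the temporal axis.

    All layers must share the same number of channels; the temporal length
    of the result is the sum of the layers' temporal lengths.
    """
    n_channels = len(layers[0])
    if any(len(fmap) != n_channels for fmap in layers):
        raise ValueError("all layers must have the same number of channels")
    return [
        [value for fmap in layers for value in fmap[c]]
        for c in range(n_channels)
    ]


if __name__ == "__main__":
    deep = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]   # 2 channels, 3 time steps
    deeper = [[7.0, 8.0], [9.0, 10.0]]          # 2 channels, 2 time steps
    merged = concat_along_time([deep, deeper])
    print(len(merged), len(merged[0]))          # channels, total time steps
```

Unlike global pooling, no values are dropped here; every temporal feature from every selected layer remains available to the block that recalibrates them.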
3.5 Multi-scale feature modeling
For a waveform-based CNN MAT / MGR system to be useful, the filter bank in the first convolutional layer needs to learn representations that cover the entire frequency spectrum. However, the fixed filter and stride size in the first convolutional layer tend to represent some frequencies better than others, with smaller filter and stride sizes emphasizing high frequencies [14, 32, 49]. This property makes it difficult for a network to span the full frequency spectrum. A few multi-scale feature modeling approaches have been proposed to address this challenge. In particular, Zhu et al. [49] introduce a frontend architecture for speech recognition that learns multi-scale representations by concatenating pooled feature maps from three separate convolutions. Each convolution employs a different filter and stride size to emphasize different frequencies (e.g., low, mid, or high). Then, a backend network passes the aggregated multi-scale features through several layers and makes predictions. Lee et al. develop a similar method for music tagging [30, 32]. However, instead of an end-to-end network, an ensemble of pretrained deep CNNs (trained via supervision on the target task) is used to extract features from different scales. A combination of average and max-pooling is applied to the extracted features, yielding channel-wise summary statistics, which are then concatenated and fed to a 2-layer FC network to predict relevant tags.
Inspired by these approaches, we also integrate a multi-scale method in MuSLCAT. It involves two convolution attention networks and feature concatenation. Different filter and stride sizes are used in the convolutional attention networks to target different scales. The features are extracted from the convolutional attention networks and then concatenated along the channel axis yielding a multi-scale feature representation. Importantly, we do not summarize multi-scale features before concatenating them.
Formally, given two CAN branches, each with its own number of output filters and output features, our multi-scale method concatenates the feature maps of the two branches along the channel axis.
The main differences between previous multi-scale methods and ours are that ours 1) trains end-to-end (i.e., no pretraining) and 2) recalibrates concatenated feature maps instead of pooling them (i.e., no information is lost).
We implement BERT [12] in MuSLCAT's backend to model interactions between multi-scale and multi-level features. We closely follow BERT's original implementation with only slight modifications: we remove the global absolute positional embeddings and add memory-efficient unmasked relative position embeddings to each self-attention layer (as described in section 3.3.2), and we use the multi-scale and multi-level feature embeddings extracted from the frontend as the input embeddings. We also do not pretrain BERT.
4 Experimental Setup
|Dataset||MTAT||MuMu||FMA||MTG-Jamendo|
|Tags||188 (50)||188 (188)||161 (154)||195 (50)|
Table 1: Numbers listed in parentheses show the total number of tags used in experiments. In all cases, the top-n most frequent tags were selected. The original MuMu set contains 147k clips/tracks, but we were only able to retrieve audio previews for 66k clips/tracks. MuMu tags follow the Amazon 4-level genre taxonomy. 30-second audio previews for some tracks can be downloaded from streaming services.
In this section, we detail the datasets used in our experiments (summarized in Table 1) and describe experimental settings.
We evaluate the proposed MuSLCAT architecture on four popular MIR datasets: MuMu [35] and FMA [11] for multi-label genre tagging, and MTAT [28] and MTG-Jamendo [4] for multi-label music tagging. We use an 80%/10%/10% training, validation, and testing split for the FMA and MuMu datasets. For MTG-Jamendo, we use split-0 provided by the dataset's creators (https://github.com/MTG/mtg-jamendo-dataset). For MTAT, we follow the common split used by previous studies (see section 4.1.3 for details). In line with previous studies, all splits apply an artist filter to avoid the "artist effect" (i.e., tracks by the same artist appearing across sets), which can lead to over-optimistic model performance [16, 43, 11, 30]. For all datasets, we resample audio recordings to 16 kHz.
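One way to build an artist-filtered 80%/10%/10% split is to assign whole artists, rather than individual tracks, to each set. The sketch below is only an illustration of that idea, not the procedure used to produce the splits in this work; the `(track_id, artist_id)` input format, the function name, and the seed are all assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def artist_filtered_split(
    tracks: Sequence[Tuple[str, str]],
    ratios: Tuple[float, float, float] = (0.8, 0.1, 0.1),
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Split (track_id, artist_id) pairs so that no artist spans two sets."""
    by_artist: Dict[str, List[str]] = defaultdict(list)
    for track_id, artist_id in tracks:
        by_artist[artist_id].append(track_id)

    # Shuffle artists deterministically, then fill the sets in order.
    artists = sorted(by_artist)
    random.Random(seed).shuffle(artists)

    total = len(tracks)
    train_end = round(ratios[0] * total)
    valid_end = round((ratios[0] + ratios[1]) * total)

    splits: Dict[str, List[str]] = {"train": [], "valid": [], "test": []}
    seen = 0  # number of tracks assigned so far
    for artist in artists:
        if seen < train_end:
            name = "train"
        elif seen < valid_end:
            name = "valid"
        else:
            name = "test"
        splits[name].extend(by_artist[artist])
        seen += len(by_artist[artist])
    return splits


if __name__ == "__main__":
    demo = [(f"t{i}", f"a{i % 20}") for i in range(100)]
    print({name: len(ids) for name, ids in artist_filtered_split(demo).items()})
```

Because artists are assigned as indivisible groups, the realized proportions only approximate the target ratios when artists have uneven track counts.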
The MuMu dataset [35] includes genre annotations based on the Amazon 4-level genre taxonomy. It contains approximately 135k tracks from 31k albums, arranged in a hierarchical taxonomy of 446 genres. The genre annotations are provided at the album level and extended to the track(s) associated with each album. We discard tags with fewer than 200 annotated tracks, which reduces the total number of genres to 188. Since MuMu does not provide audio for tracks, we use the Spotify API (https://developer.spotify.com/documentation/web-api/) to download 30-second audio previews for tracks. We note that the audio preview is not always available, and when it is, we have no control over which part of the song the preview is extracted from (e.g., beginning, middle, or end). After cleaning and retrieving audio, our final MuMu dataset contains 188 genres and approximately 66k tracks. The advantage of MuMu is that it represents commercially popular music from a wide variety of genres on music streaming platforms.
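The tag-filtering step (discarding tags with too few annotated tracks) can be sketched as below. The dictionary layout, the function name, and the choice to keep tracks that end up with no tags are assumptions made for illustration; the threshold is a parameter, so the same helper covers both the 200-track and other cutoffs.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def filter_rare_tags(
    track_tags: Dict[str, List[str]], min_tracks: int
) -> Tuple[Dict[str, List[str]], Set[str]]:
    """Drop tags annotated on fewer than `min_tracks` tracks.

    Returns the filtered annotations and the set of tags that were kept.
    """
    # Count each tag at most once per track.
    counts = Counter(tag for tags in track_tags.values() for tag in set(tags))
    kept = {tag for tag, n in counts.items() if n >= min_tracks}
    filtered = {
        track_id: [tag for tag in tags if tag in kept]
        for track_id, tags in track_tags.items()
    }
    return filtered, kept


if __name__ == "__main__":
    annotations = {"t1": ["rock", "pop"], "t2": ["rock"], "t3": ["jazz"]}
    filtered, kept = filter_rare_tags(annotations, min_tracks=2)
    print(sorted(kept), filtered)
```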
The FMA, or Free Music Archive, dataset [11] is an open and easily accessible large-scale dataset suitable for evaluating several tasks in MIR, genre recognition in particular. It provides 917 GiB and 343 days of Creative Commons-licensed full-length audio for 106,574 tracks from 16,341 artists and 14,854 albums, arranged in a hierarchical taxonomy of 161 genres. We use the large FMA subset for our experiments, which contains the full dataset with audio limited to 30-second clips extracted from the middle of tracks (the entire track is used if it is shorter than 30 seconds). We also discard genre tags with fewer than 20 tracks. After cleaning, our final FMA dataset contains 154 genres and approximately 106k tracks.
The MagnaTagATune (MTAT) dataset [28] is one of the most frequently used datasets for benchmarking automatic music tagging systems. It contains multi-label annotations by genre, mood, and instrumentation for 25,877 30-second audio segments from 5,405 tracks and 230 artists. Typically, only the 50 most frequent tags are used for the task, which include genre and instrumentation labels, decades (e.g., ’80s’ and ’90s’), and moods. The dataset is split into 16 folders, and it is common to use the first 12 folders for training, the 13th for validation, and the last three for testing. We follow the same data split and use the top 50 tags to be consistent with results reported in previous studies [48, 32, 26] (https://github.com/minzwon/sota-music-tagging-models).
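The folder-based split can be expressed as a small lookup. The sketch assumes the 16 folders are named with the hexadecimal digits `0` through `f`, in that order; the constant and function names are illustrative.

```python
# Assumed folder naming: 16 folders labelled with hexadecimal digits.
MTAT_FOLDERS = "0123456789abcdef"


def mtat_split(folder: str) -> str:
    """Map an MTAT folder name to its set under the common split."""
    index = MTAT_FOLDERS.index(folder)  # raises ValueError for unknown names
    if index < 12:
        return "train"   # first 12 folders
    if index == 12:
        return "valid"   # 13th folder
    return "test"        # last three folders


if __name__ == "__main__":
    print({folder: mtat_split(folder) for folder in MTAT_FOLDERS})
```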
4.1.4 MTG-Jamendo Dataset
The MTG-Jamendo (Jamendo) dataset [4] is a newly available, open, and accessible large-scale music tagging dataset. It contains audio for 55,701 full songs and 195 different tags covering genres, instrumentation, moods, and themes, and is built using music publicly available on the Jamendo music platform (https://www.jamendo.com/) under Creative Commons licenses. The minimum duration of each song is 30s, and songs are provided in MP3 format (320 kbps bitrate). Overall, the average song duration is approximately 224s (3m 40s). In contrast to prior music tagging datasets, MTG-Jamendo contains significantly longer audio segments with higher encoding quality. This may be because Jamendo targets its business at royalty-free music for commercial use, including music streaming for venues, and ensures a basic technical quality assessment for its collection. Thus, the audio quality level may be more consistent with commercial music streaming services, making it a well-suited dataset for our purposes. The creators of the dataset provide multiple splits for training, validation, and testing. For all splits, steps were taken to ensure that no track appears in more than one set and that no tracks in any set are from an artist present in other sets. Moreover, the same tags are present in all three sets across all splits, and tags are represented by at least 40 and 20 tracks from 10 and 5 artists in training and validation/testing sets, respectively. In this work, we use split-0 and the 50 most frequent tags.
4.2 Training/Evaluation Details
We train our models on batches of randomly selected short audio clips (3s of raw waveform data). During training, we minimize binary cross-entropy loss and update trainable parameters using stochastic gradient descent (SGD) with Nesterov momentum 0.9 and an initial learning rate of 0.01. We reduce the learning rate by a factor of 5 after model performance on the validation dataset plateaus for three consecutive epochs, and we stop training once the learning rate drops below a minimum threshold. We fix the batch size to 23, as in prior studies [32, 26], to fairly compare MuSLCAT to the state-of-the-art waveform-based music tagging networks. All models are trained on a single NVIDIA 2080 Ti GPU and implemented in the PyTorch deep learning framework [37].
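The learning-rate policy described above (divide by 5 after three epochs without validation improvement, stop once the rate falls below a threshold) can be sketched framework-free as follows. The class name is illustrative, and `min_lr` is a placeholder value, since the stopping threshold is not stated here; a higher validation score is assumed to be better.

```python
class PlateauSchedule:
    """Reduce-on-plateau learning-rate schedule with an early-stop threshold."""

    def __init__(self, lr: float = 0.01, factor: float = 5.0,
                 patience: int = 3, min_lr: float = 1e-5) -> None:
        self.lr = lr
        self.factor = factor        # divide the learning rate by this value
        self.patience = patience    # non-improving epochs tolerated
        self.min_lr = min_lr        # placeholder stopping threshold
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_score: float) -> bool:
        """Update after one epoch; return True while training should continue."""
        if val_score > self.best:
            self.best = val_score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr /= self.factor
                self.bad_epochs = 0
        return self.lr >= self.min_lr


if __name__ == "__main__":
    schedule = PlateauSchedule(min_lr=1e-3)
    for epoch, score in enumerate([0.50, 0.49, 0.49, 0.49, 0.49, 0.49, 0.49]):
        keep_going = schedule.step(score)
        print(epoch, schedule.lr, keep_going)
```

In PyTorch, a comparable reduction rule is available through `torch.optim.lr_scheduler.ReduceLROnPlateau`, whose `factor` argument is a multiplier (0.2 for a reduction by a factor of 5); the stopping check would still need to be added around it.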
Since training on larger datasets is more expensive, researchers typically prototype on a smaller dataset (e.g., MTAT or MuMu) and then validate their models on a larger dataset (e.g., FMA or Jamendo) [4, 32, 26, 48]. We adopt a similar approach by exploring different MuSLCAT configurations on MuMu, a medium-sized dataset, and then training and evaluating the most promising configuration on the three remaining datasets: FMA and Jamendo (large datasets) and MTAT (small dataset).
We follow the approach of previous studies for evaluation [32, 26]. That is, for all models, we pass as input 3s-long clips belonging to the same audio recording and average the model's outputs on these clips to produce a final song-level tag prediction. For metrics, we report the area under the receiver operating characteristic curve (ROC-AUC) and the area under the precision-recall curve (PR-AUC), as the latter can be more informative for evaluation on imbalanced datasets [10]. For both metrics, we compute the average score across all tags on each dataset.
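The evaluation procedure can be sketched in two parts: averaging clip-level outputs into a song-level prediction, and averaging a per-tag metric across tags. The sketch below implements ROC-AUC through its pairwise-ranking interpretation (the probability that a positive example is scored above a negative one); the function names and list-based data layout are assumptions, and in practice a library routine such as scikit-learn's `roc_auc_score` would normally be used instead.

```python
from typing import List, Sequence


def song_level_scores(clip_scores: Sequence[Sequence[float]]) -> List[float]:
    """Average per-clip tag scores (clips x tags) into one score per tag."""
    n_tags = len(clip_scores[0])
    return [
        sum(clip[t] for clip in clip_scores) / len(clip_scores)
        for t in range(n_tags)
    ]


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC for one tag via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    # Ties between a positive and a negative count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_roc_auc(label_matrix: Sequence[Sequence[int]],
                  score_matrix: Sequence[Sequence[float]]) -> float:
    """Average the per-tag ROC-AUC over all tags (songs x tags inputs)."""
    n_tags = len(label_matrix[0])
    per_tag = [
        roc_auc([row[t] for row in label_matrix], [row[t] for row in score_matrix])
        for t in range(n_tags)
    ]
    return sum(per_tag) / n_tags


if __name__ == "__main__":
    labels = [[1, 0], [0, 1], [1, 0], [0, 1]]
    scores = [[0.9, 0.2], [0.1, 0.8], [0.4, 0.3], [0.6, 0.7]]
    print(macro_roc_auc(labels, scores))
```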
5 Results & Discussion
In this section, we present the results from our experiments and offer our interpretation of them.
5.1 Comparison of MuSLCAT components
|Model||Params (M)||ROC-AUC||PR-AUC|
|low + BERT||14.65||0.9064||0.4366|
|high + BERT||14.70||0.9058||0.4345|
|lowCNN + highCNN||3.34||0.9056||0.4384|
Table 3: MuSLCAN is composed of low + high CANs and an AAC block to recalibrate multi-scale and level features. MuSLCAT is composed of low + high CANs and BERT to recalibrate multi-scale and level features.
We first investigate the effectiveness of different MuSLCAT components by comparing their performance on the small-scale MTAT dataset. Table 3 summarizes the evaluation results. They confirm our intuition that combining multi-scale and level features improves performance. A surprising result is that AAC outperforms BERT as a multi-scale and level representation modeling technique for this small-scale dataset. Additionally, they show that relative self-attention (used in AAC blocks and in BERT) improves MuSLCAT’s performance.
Regarding multi-level feature modeling, AAC, which recalibrates features without losing information, outperforms the popular global pooling approach. For multi-scale feature modeling, the combination of both lowCAN and highCAN features results in improved performance, even when a simple AAC block is used to recalibrate multi-scale features instead of our BERT backend. However, performance w.r.t. PR-AUC decreases without our BERT backend. This result may suggest that our BERT backend learns more robust representations for underrepresented tags than other approaches.
5.2 Comparison of MuSLCAT to state-of-the-art
|Model||MTAT ROC-AUC||MTAT PR-AUC||MTG-Jamendo ROC-AUC||MTG-Jamendo PR-AUC|
|Spec-SampleCNNs (x3)||0.9017||-||-||-|
|Harmonic CNN||0.9127||0.4611||0.8322||0.2956|
|Short-chunk CNN + Res||0.9126||0.4614||0.8316||0.2951|
Table 4: Spec-SampleCNNs (x3) uses an ensemble of 3 pretrained CNNs. We include it because it incorporates multi-scale and level features and was shown to perform well on a large-scale dataset. Short-chunk CNN + Res denotes Short-Chunk CNN with residual connections (Res).
Comparison to waveform-based models: We first investigate how MuSLCAT performs on several benchmark music tagging and genre recognition datasets compared to the state-of-the-art waveform-based models. Table 2 shows that MuSLCAT leads to systematic improvements on both auto-tagging and genre recognition tasks across medium-scale and large-scale datasets over the baseline network (SampleCNN) and the top-performing network (SampleCNN + SE). The results also suggest that AAC can be an efficient alternative to the computationally expensive SE operation with minimal impact on performance.
Comparison to spectrogram-based models: Though the focus of this work is learning directly from raw waveform input, we also compare MuSLCAN's performance to state-of-the-art spectrogram-based models on two benchmark music tagging datasets: MTAT (a small-scale dataset) and MTG-Jamendo (a large-scale dataset). Table 4 shows that MuSLCAT yields competitive performance compared to state-of-the-art spectrogram-based models on large-scale music tagging datasets while avoiding task-specific hand-crafted features and input preprocessing.
For multi-scale and level modeling, Table 2 shows that MuSLCAN, which only contains a simple AAC block in its backend, can improve performance over the standard global pooling method. MuSLCAT, which replaces AAC with BERT in its backend, yields the best overall performance on large-scale datasets while still requiring fewer parameters than SampleCNN + SE. These results support our hypothesis that a negative side effect of SE and global pooling is that some useful information is removed from the signal, and they show that AAC can be used to overcome this.
Perhaps surprising is the impact that training dataset size has on MuSLCAT's performance. On the small-scale MTAT dataset, MuSLCAT yields competitive results compared to other waveform-based models. Yet, on medium to large-scale datasets, MuSLCAT generally outperforms other waveform-based models. We hypothesize that this is the result of replacing global max pooling with MHA, since MHA recalibrates features using a learned weighted average, whereas global max-pooling simply reduces the temporal features to one value per channel. The global max-pooling operation may force the network to attend to only the most significant quality in a given signal. Intuitively, this design choice would be helpful when training data is limited (i.e., small or medium-scale), but could become a constraint when training on large-scale datasets. This would help explain the performance gain of MuSLCAT on MTG-Jamendo and FMA. We also anticipate that the implicit biases associated with MHA make MuSLCAT more flexible to the domain data, which can enhance its representational capacity at the expense of needing more training data.
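The contrast drawn above between global max-pooling and a weighted average can be illustrated on a single channel. The sketch below uses a softmax over given scores as a deliberately simplified stand-in for attention weights; it is not MHA itself (there are no learned query, key, or value projections), and the function names and the `scores` input are assumptions for the example.

```python
import math
from typing import List, Sequence


def global_max_pool(channel: Sequence[float]) -> float:
    """Keep only the single largest temporal feature of a channel."""
    return max(channel)


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_average(channel: Sequence[float], scores: Sequence[float]) -> float:
    """Combine all temporal features using softmax-normalized weights."""
    weights = softmax(scores)
    return sum(w * x for w, x in zip(weights, channel))


if __name__ == "__main__":
    channel = [1.0, 2.0, 3.0]
    print(global_max_pool(channel))                    # only the peak survives
    print(weighted_average(channel, [0.0, 0.0, 0.0]))  # uniform weights: the mean
    print(weighted_average(channel, [0.0, 0.0, 5.0]))  # weights favour the last step
```

With uniform scores the weighted average reduces to the mean, and as one score dominates it approaches selecting a single time step, so the weighting can interpolate between the two pooling behaviours depending on what the scores encode.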
6 Conclusion
In this work, we present MuSLCAT, or Multi-scale and Multi-level Convolutional Attention Transformer, a novel architecture for learning robust representations of complex tags directly from raw waveform music recordings. We also introduce MuSLCAN, or Multi-scale and Multi-level Convolutional Attention Network, a lightweight alternative to MuSLCAT. Both MuSLCAT and MuSLCAN capture features from multiple scales and levels by integrating a frontend-backend architecture. The frontend targets different frequency ranges while modeling long-range dependencies and multi-level interactions by using two convolutional attention networks with AAC blocks. The backend dynamically recalibrates multi-scale and level features extracted from the frontend by using BERT in MuSLCAT, or AAC in MuSLCAN.
We validate the effectiveness of MuSLCAT and MuSLCAN by comparing them to state-of-the-art networks across four benchmark music tagging and genre recognition datasets. The experiments show that compared to state-of-the-art waveform-based models both MuSLCAT and MuSLCAN consistently yield competitive performance and require fewer parameters. In particular, given a large-scale training dataset, MuSLCAT produces the best results and is more computationally efficient (% reduction in parameters) than the current state-of-the-art network (SampleCNN + SE). On the other hand, MuSLCAN achieves slightly lower performance but requires significantly fewer parameters (% reduction) than the current state-of-the-art network. The main contributions of MuSLCAT and MuSLCAN include:
Attending jointly to channel and temporal subspaces by incorporating AAC blocks.
Emphasizing different frequencies by using two CAN branches. This relaxes the need for each branch to represent the entire frequency spectrum in their bottom layer.
Minimizing information loss by replacing the global pooling and SE operations with MHA to recalibrate feature maps instead of summarizing them with channel-wise statistics.
Modeling long-range dependencies and multi-scale and level feature interactions by integrating AAC and/or BERT.
Training in an end-to-end fashion (i.e., no pretraining or preprocessing).
Improving computational efficiency while outperforming state-of-the-art waveform-based networks, with an approximate % (MuSLCAT) and % (MuSLCAN) reduction in parameters compared to SampleCNN + SE.
References
[1] (2016) Layer normalization. arXiv preprint arXiv:1607.06450.
[2] (2015) Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473.
[3] (2019) Attention augmented convolutional networks. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3285–3294.
[4] (2019) The MTG-Jamendo dataset for automatic music tagging. In Machine Learning for Music Discovery Workshop, International Conference on Machine Learning (ICML 2019), Long Beach, CA, United States.
[5] (2008) Content-based music information retrieval: current directions and future challenges. Proceedings of the IEEE 96, pp. 668–696.
[6] (2016) Long short-term memory-networks for machine reading. arXiv preprint arXiv:1601.06733.
[7] (2016) Convolutional recurrent neural networks for music classification. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2392–2396.
[8] (2016) Automatic tagging using deep convolutional neural networks. In ISMIR.
[9] Encoding musical style with transformer autoencoders. arXiv preprint arXiv:1912.05537.
[10] (2006) The relationship between precision-recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning, pp. 233–240.
[11] (2017) FMA: a dataset for music analysis. In 18th International Society for Music Information Retrieval Conference (ISMIR).
[12] (2019) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[13] (2013) Multiscale approaches to music audio feature learning. In 14th International Society for Music Information Retrieval Conference (ISMIR 2013), pp. 116–121.
[14] (2014) End-to-end learning for music audio. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6964–6968.
[15] (2004) A theory of musical genres: two applications. Popular Music: Critical Concepts in Media and Cultural Studies 3, pp. 7–35.
[16] (2009) Album and artist effects for audio similarity at the scale of the web. In Proc. of 6th Sound and Music Computing.
[17] Learning features from music audio with deep belief networks. In ISMIR, Vol. 10, pp. 339–344.
[18] Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
[19] (2018) Gather-excite: exploiting feature context in convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 9401–9411.
[20] (2018) Squeeze-and-excitation networks. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7132–7141.
[21] (2018) Music transformer: generating music with long-term structure. arXiv preprint arXiv:1809.04281.
[22] (2020) Pop music transformer: generating music with rhythm and harmony. arXiv preprint arXiv:2002.00212.
[23] (2013) Feature learning and deep architectures: new directions for music informatics. Journal of Intelligent Information Systems 41 (3), pp. 461–481.
[24] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456.
[25] (2019) Comparison and analysis of SampleCNN architectures for audio classification. IEEE Journal of Selected Topics in Signal Processing 13 (2), pp. 285–297.
[26] (2017) Sample-level CNN architectures for music auto-tagging using raw waveforms. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 366–370.
[27] (2008) Social tagging and music information retrieval. Journal of New Music Research 37, pp. 101–114.
[28] (2009) Evaluation of algorithms using games: the case of music tagging. In 10th International Society for Music Information Retrieval Conference (ISMIR 2009), pp. 387–392.
[29] (2009) Unsupervised feature learning for audio classification using convolutional deep belief networks. Advances in Neural Information Processing Systems 22, pp. 1096–1104.
[30] (2017) Multi-level and multi-scale feature aggregation using pretrained convolutional neural networks for music auto-tagging. IEEE Signal Processing Letters 24, pp. 1208–1212.
[31] (2017) Multi-level and multi-scale feature aggregation using sample-level deep convolutional neural networks for music classification. arXiv preprint arXiv:1706.06810.
[32] (2018) SampleCNN: end-to-end deep convolutional neural networks using very small filters for music classification. Applied Sciences 8 (1), pp. 150.
[33] (2020) Voice and accompaniment separation in music using self-attention convolutional neural network. arXiv preprint arXiv:2003.08954.
[34] (2019) Deep learning for audio-based music classification and tagging: teaching computers to distinguish rock from Bach. IEEE Signal Processing Magazine 36, pp. 41–51.
[35] Multi-label music genre classification from audio, text and images using deep features. arXiv preprint arXiv:1707.04916.
[36] (2019) A bi-directional transformer for musical chord recognition. In ISMIR.
[37] (2019) PyTorch: an imperative style, high-performance deep learning library. In NeurIPS.
[38] (2017) End-to-end learning for music audio tagging at scale. arXiv preprint arXiv:1711.02520.
[39] (2017) Designing efficient architectures for modeling temporal features with convolutional neural networks. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2472–2476.
[40] (2018) Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 464–468.
[41] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[42] (2015) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR).
[43] (2014) The state of the art ten years after a state of the art: future research in music information retrieval. Journal of New Music Research 43 (2), pp. 147–172.
[44] (2017) Attention is all you need. In NIPS.
[45] (2019) A hierarchical attentive deep neural network model for semantic music annotation integrating multiple music representations. In ICMR ’19.
[46] (2020) Data-driven harmonic filters for audio representation learning. In Proc. of International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 536–540.
[47] (2019) Toward interpretable music tagging with self-attention.
[48] (2020) Evaluation of CNN-based automatic music tagging models. In Proc. of 17th Sound and Music Computing.
[49] (2016) Learning multiscale features directly from waveforms. In INTERSPEECH.