Music is an important form of human entertainment and has wide appeal. Studies indicate that people value music mainly because of the emotional connotations and characteristics it carries [paper1]. More and more users retrieve music by emotion in music information retrieval (MIR) systems, where music emotion recognition (MER) plays a crucial role [paper2, paper3]. However, the abstract nature of emotion makes it difficult to analyze. How to construct an effective MER method has therefore attracted extensive attention.
Existing MER methods can be divided into two categories, classification and regression, according to the emotion model they use. The former selects a set of emotional adjectives to classify music, but a limited vocabulary cannot describe human emotions exactly [paper4, paper5, paper6]. The latter uses a position in an emotion space to express a person's internal emotion. The two-dimensional valence-arousal (V-A) emotion model proposed by Russell [paper7] is one of the mainstream emotion models for regression tasks [paper8, paper9, paper10, paper11]. It represents emotion by valence and arousal, which stand for the degrees of pleasantness and bodily activation, respectively. In this paper, we aim to use a single point in the V-A emotion space to describe the overall emotion of a clip.
Feature engineering and model design are the two common approaches in MER. The first improves recognition performance by constructing effective feature sets [paper13, paper14, paper29]. However, designing efficient feature sets for different datasets and recognition targets requires a lot of manual effort. For this reason, many researchers hope that the model can learn the affect-salient features autonomously. In the past, traditional machine learning methods (SVR, RF, etc.) were used to recognize music emotion [paper9, paper12], but they were gradually abandoned because of their low flexibility and poor generalization. In recent years, deep learning has made remarkable progress in MER. The convolutional long short-term memory deep network based models proposed in [paper15, paper16] can adaptively learn affect-salient features in music. However, these methods are generally suited to inputs of a certain length and do not consider the relations among different clips. Besides audio, [paper17, paper18] also introduce lyrics and other information to conduct multi-modal learning. Unfortunately, the lack of high-quality aligned multi-modal datasets significantly reduces the scenarios in which they can be applied. Moreover, some researchers try to obtain more information by manipulating the audio to improve their models. For example, [paper8] separates multiple sound sources in music (vocals, bass, drums, etc.) to explore the influence of different musical elements on MER, and [paper28] fuses multi-scale features of different lengths to strengthen prediction performance. Nevertheless, these methods are not strictly end-to-end structures, which may introduce additional errors as data flows through the various modules.
To address the above problems, we propose an end-to-end attention-based deep feature fusion (ADFF) approach for MER. We first take the log Mel-spectrogram as input, then use an adapted VGGNet as the spatial feature learning module (SFLM) to obtain low-to-high level spatial structures. After that, these spatial features are turned into multi-level emotion-related spatial-temporal features (ESTFs) by a squeeze-and-excitation (SE) attention-based temporal feature learning module (TFLM). Finally, the prediction module maps the fused features into the emotion space. A series of experiments on the PMEmo dataset [paper26] demonstrates that the ADFF model achieves an $R^2$ score of 0.4575 for valence and 0.6394 for arousal, which is a relative improvement of 10.43% and 4.82%, respectively, over the state-of-the-art approach. It should be noted that predicting valence is notoriously more challenging than predicting arousal [paper13]. Furthermore, extended experiments on datasets of distinct scales and in a multi-task setting also show that our method can effectively learn the affect-salient features from music clips and complete various tasks in MER.
Our contributions are as follows: (1) our proposed ADFF model for MER achieves better performance than the state-of-the-art method; (2) we introduce SE attention, which enhances the weight of emotion-related features and helps our model work better; (3) we design a novel data processing method that improves the computational efficiency of the model while preserving the quality of MER.
2 Proposed Method
In this section, we describe the ADFF model, which consists of three modules: data processing, multi-level spatial-temporal feature learning, and prediction.
2.1 Data processing
In this work, we use the log Mel-spectrogram as input, whose dimension is $(1, T, F)$, where $T$ and $F$ represent the frame length and the number of Mel bands, respectively. The designed data processing method sequentially cuts the spectrogram into $K$ parts along the time dimension; each part forms a new channel, and the parts are stacked into an image of dimension $(K, T/K, F)$, as shown in the data processing module in Figure 1. We believe this method can reduce the distance among different intervals of the input and thus the long-term dependence. Experiments also show that our data processing method improves the computing efficiency of the model while preserving its recognition performance.
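The cutting-and-stacking step can be illustrated with a minimal NumPy sketch. The function name `stack_time_chunks`, the parameter `num_parts`, and the choice to drop trailing frames that do not fill a whole part are assumptions made for illustration; the text does not specify how a remainder is handled.

```python
import numpy as np


def stack_time_chunks(log_mel, num_parts):
    """Cut a (1, frames, mel_bands) spectrogram into `num_parts` consecutive
    time chunks and stack the chunks along the channel axis."""
    _, n_frames, n_mels = log_mel.shape
    chunk_len = n_frames // num_parts
    # Assumption: frames that do not fill a whole chunk are dropped.
    trimmed = log_mel[0, : chunk_len * num_parts, :]
    # Row-major reshape keeps the time order: chunk 0 holds the earliest frames.
    return trimmed.reshape(num_parts, chunk_len, n_mels)


# Toy input: 12 frames, 4 Mel bands, cut into 3 parts of 4 frames each.
spec = np.arange(1 * 12 * 4, dtype=np.float32).reshape(1, 12, 4)
stacked = stack_time_chunks(spec, num_parts=3)
print(stacked.shape)
```

With this layout, frames that were far apart in time end up at the same spatial position in neighbouring channels, which is the sense in which the distance among intervals is reduced.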
2.2 Spatial-temporal feature learning
Learning emotional features is of vital importance to the MER task. It is commonly thought that emotion is involved in spatial and temporal features. For this reason, we design a specific spatial-temporal feature learning module for emotion learning, as shown by SFLM and TFLM in Figure 1.
2.2.1 Spatial feature learning
The effectiveness of VGGNet in image processing has been widely validated [paper22]. In recent years, with the increasing attention paid to MER, some researchers have applied the VGGNet structure to it [paper19, paper21, paper23, paper24]. However, they only explored their methods based on a pipeline architecture in classification tasks. In this paper, we introduce the convolution subnetwork of VGGNet-16 as the spatial feature learning module (SFLM), which can be divided into 5 levels, and adapt it by adding ReLU and BatchNorm structures at the end of each level to obtain stronger and more robust spatial features. This adaptation makes SFLM more suitable for regression tasks. Suppose the N-th level of SFLM can be represented as $F_N$, and its output is $U^N$:

$$U^N = F_N\left(U^{N-1}\right) \in \mathbb{R}^{H \times W \times C} \tag{1}$$

where $H$, $W$, and $C$ represent the height, width, and number of channels of $U^N$, respectively, and $U^0$ denotes the processed input. If we divide $U^N$ according to the number of channels, it can also be expressed as $U^N = [u^N_1, u^N_2, \dots, u^N_C]$.
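To give a feel for the low-to-high level feature maps, the following sketch computes the output shape of each of the five levels. It assumes the standard VGG-16 configuration (3x3 convolutions with padding 1, channel widths 64/128/256/512/512, and a 2x2 max-pool closing each level); the adapted network described here may differ in these details. The input size of 333 frames by 128 Mel bands is only an illustrative value.

```python
# (number of conv layers, output channels) for the five VGG-16 levels.
VGG16_LEVELS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]


def sflm_output_shapes(height, width):
    """Return (channels, height, width) after each of the five levels."""
    shapes = []
    for _num_convs, channels in VGG16_LEVELS:
        # 3x3 convolutions with padding 1 keep height and width unchanged;
        # the 2x2 max-pool (assumed to close each level) halves both.
        height, width = height // 2, width // 2
        shapes.append((channels, height, width))
    return shapes


for level, shape in enumerate(sflm_output_shapes(height=333, width=128), start=1):
    print(f"level {level}: {shape}")
```

Each level therefore trades spatial resolution for channel depth, which is why the features from different levels carry complementary information.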
2.2.2 SE attention-based temporal feature learning
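As a kind of temporal information, music contains emotion that changes over time. To capture temporally related emotions, we employ a TFLM consisting of SE attention and a Bi-LSTM to learn temporal features. The SE attention is used to obtain the importance of different channels [paper25], which alleviates the problem that traditional attention mechanisms focus on temporal structure only. During the learning process, the single-channel input is transformed into a multi-channel input by our data processing, which strengthens the temporal correlation among distinct channels. In this case, the SE block can suppress the features that do not contribute much to emotion by learning the importance of different channels. Figure 2 shows the transformation of the SE block. Let $U^N \in \mathbb{R}^{H \times W \times C}$ be the output of the N-th level of SFLM with channels $u^N_c$, and let $F_{sq}$ represent the squeeze operation:

$$z^N_c = F_{sq}\left(u^N_c\right) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u^N_c(i, j) \tag{2}$$

By (2) we get $z^N = [z^N_1, \dots, z^N_C]$, which is the global information embedding of the N-th level. To obtain the importance among the channels of $U^N$, $z^N$ still needs to go through an excitation transform, written here in the form used by the SE block of [paper25]:

$$s^N = F_{ex}\left(z^N, W\right) = \sigma\left(W_2\, \delta\left(W_1 z^N\right)\right) \tag{3}$$

where $W = \{W_1, W_2\}$ and $s^N$ represent the learnable weight matrix and the attention matrix, $\delta$ is the ReLU function, and $\sigma$ is the sigmoid function. Then we get the weighted feature map by rescaling $U^N$ with $s^N$:

$$\tilde{u}^N_c = F_{scale}\left(u^N_c, s^N_c\right) = s^N_c \cdot u^N_c \tag{4}$$

By (4), we get $\tilde{U}^N = [\tilde{u}^N_1, \dots, \tilde{u}^N_C]$, which is transformed to $E^N$ (referred to as the ESTF) by a 2-layer Bi-LSTM to complete temporal structure learning.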
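The squeeze, excitation, and rescaling steps of an SE block can be sketched in a few lines of NumPy. This follows the generic SE formulation (global average pooling, two fully connected layers with ReLU and sigmoid, channel-wise scaling); the reduction ratio `r`, the random weights, and the function name `se_block` are illustrative assumptions, not the configuration used in the experiments.

```python
import numpy as np


def se_block(feature_map, w1, w2):
    """Apply squeeze-and-excitation to a (C, H, W) feature map.

    w1 has shape (C // r, C) and w2 has shape (C, C // r)."""
    # Squeeze: global average pooling gives one descriptor per channel.
    z = feature_map.mean(axis=(1, 2))
    # Excitation: bottleneck FC layer with ReLU, then FC layer with sigmoid.
    hidden = np.maximum(w1 @ z, 0.0)
    attention = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))
    # Scale: multiply every channel by its attention weight.
    return feature_map * attention[:, None, None], attention


rng = np.random.default_rng(0)
C, H, W, r = 8, 5, 4, 2  # toy sizes; r is an assumed reduction ratio
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
weighted, attention = se_block(x, w1, w2)
print(weighted.shape, attention.round(3))
```

Because every attention weight lies strictly between 0 and 1, channels judged less relevant are attenuated rather than removed, and the shape of the feature map is preserved for the Bi-LSTM that follows.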
2.3 Fusion strategy and emotion prediction
Deep learning networks can learn high-level abstract emotion-related features for target tasks, but some useful low-level features may be lost in this process [paper20]. To make full use of all the multi-layer emotion-related information, we simply concatenate the ESTFs of all levels as $E = [E^1, E^2, \dots, E^5]$, where $E^N$ denotes the ESTF of the N-th level, so that the internal structure of the ESTFs from different levels is preserved as much as possible. After that, $E$ is fed into the prediction module, which maps it into the emotion space. As shown in Figure 1, the prediction module is composed of several fully connected (FC) layers and a feedforward (FF) layer. The last FC layer has 1 or 2 output units for single-task or multi-task prediction, respectively, and maps the activation of the FF layer to the target output.
3 Experiments and Results
3.1 Datasets
The PMEmo dataset contains 794 chorus clips (provided as audio) of popular songs and their corresponding valence-arousal annotations. In total, 457 annotators from different countries and majors were invited to annotate this dataset, and each chorus received at least 10 annotations. To compare our model with existing methods and make full use of the dataset, we split the dataset in two ways, which yields datasets of two sizes. For each size, we use 6 different input lengths to explore the performance of the ADFF model on various segments, that is, input lengths of [5, 10, 15, 20, 25, 30] seconds.
Simple Datasets. A segment of the specified fixed length is cut at random from each chorus. That is, each chorus and its corresponding annotation appear only once in a simple dataset, which is consistent with [paper8, paper11].
Full Datasets. To make the most of the PMEmo dataset, we cut each chorus into several segments of the specified length in time order. If the last segment is shorter than a minimum length, it is dropped; otherwise, it is extended forward until it reaches the specified length. All segments cut from the same chorus share the same static annotation.
Note that for both kinds of dataset, if the original chorus is shorter than the specified input length, the audio is first zero-padded to ensure a fixed length. In addition, all valence and arousal annotations are linearly scaled to [-1, 1] to improve the robustness of the model.
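The construction of a full dataset can be sketched as follows. Several details are assumptions made for illustration only: the drop threshold for the last segment is exposed as a hypothetical parameter `min_fraction`, "extended forward" is interpreted as moving the start of the last segment earlier so that it overlaps the previous one, and the original annotation range is passed in explicitly as `low` and `high`.

```python
import numpy as np


def segment_chorus(audio, sr, seg_seconds, min_fraction=0.5):
    """Cut a 1-D waveform into fixed-length segments in time order."""
    seg_len = int(round(seg_seconds * sr))
    if len(audio) < seg_len:
        # A chorus shorter than one segment is zero-padded to the fixed length.
        return [np.pad(audio, (0, seg_len - len(audio)))]
    starts = range(0, len(audio) - seg_len + 1, seg_len)
    segments = [audio[s:s + seg_len] for s in starts]
    remainder = len(audio) % seg_len
    # Assumed rule: keep the tail only if it is long enough, and extend it
    # backwards in time so that it has the full segment length.
    if remainder > 0 and remainder >= min_fraction * seg_len:
        segments.append(audio[-seg_len:])
    return segments


def scale_annotation(value, low, high):
    """Linearly map an annotation from [low, high] to [-1, 1]."""
    return 2.0 * (value - low) / (high - low) - 1.0


toy_audio = np.arange(25, dtype=np.float32)  # 25 "samples" at sr=1 for readability
parts = segment_chorus(toy_audio, sr=1, seg_seconds=10)
print(len(parts), scale_annotation(0.75, low=0.0, high=1.0))
```

Every segment returned for one chorus would then be paired with the same scaled static annotation.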
3.2 Experimental details
We use the $R^2$ score and root-mean-squared error (RMSE) as the evaluation metrics in the regression task, and accuracy in the classification task. The $R^2$ score ranges from negative infinity to 1, and larger values are better. In contrast, the smaller the RMSE is, the better. For a fairer comparison and to avoid accidental errors, we take the mean result of 5-fold cross-validation as the final result, following [paper8, paper11]. The input log Mel-spectrogram is extracted with librosa 0.7.2 [paper27], with 128 Mel bands, a sampling rate of 44.1 kHz, and a window size and hop size of 60 ms and 10 ms, respectively. Moreover, we use the Adam optimizer for training, with a weight decay of 1e-5, a learning rate of 1e-5, 200 training epochs, and a batch size of 32. The decay steps are [20, 45, 80, 110, 140, 170].
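For reference, the two regression metrics can be written out directly. This is a minimal NumPy sketch of the standard definitions of the coefficient of determination and RMSE, not the evaluation code used in the experiments; the sample values are made up.

```python
import numpy as np


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root-mean-squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


y = [-0.5, 0.0, 0.5, 1.0]                 # made-up annotations in [-1, 1]
mean_guess = [0.25, 0.25, 0.25, 0.25]     # always predicting the mean
print(r2_score(y, y), r2_score(y, mean_guess), rmse(y, mean_guess))
```

A model that always predicts the mean annotation obtains a score of 0, and a model that does worse than that obtains a negative score, which is why the lower end of the range is unbounded.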
3.3 Experimental results
3.3.1 Method comparison and ablation analysis
To verify the advantages of our proposed model and explore the role of its different modules, we choose EmoMucs [paper8], MLEM [paper11], and two variants of the ADFF model for comparison. EmoMucs recognizes music emotion based on a source separation algorithm and is reported as the most advanced method. MLEM designs a method to debug a biased MER model so that it performs better. The two variants of the ADFF model drop the SE block and the TFLM, respectively. Note that EmoMucs and MLEM both experiment on the simple dataset with an input length of 20 seconds, and we compare the best performance of each model under the same condition.
As shown in Table 1, our proposed ADFF model has clear advantages over the other methods, achieving lower RMSE and higher $R^2$ scores. In particular, the $R^2$ scores for valence and arousal are relatively 10.43% and 4.82% higher than those of EmoMucs, even though valence is more challenging to predict than arousal. Taking the deviation into account, our method is also more stable than MLEM. Moreover, the ADFF model is a strictly end-to-end architecture, whereas EmoMucs and MLEM both need source separation algorithms to process the original audio before training, which may introduce more errors.
In addition, the ablation experiments show that achieving high-performance prediction for arousal does not need a complicated model, a conclusion that is consistent with [paper30]. Compared with the two variants, the ADFF model has a relative improvement of 5.61% and 10.11% in $R^2$ score for predicting valence, which suggests that the SE block enhances the weight of emotion-related features and that the TFLM further improves the emotion capture ability of the model.
3.3.2 The influence of data processing
To explore the influence of our data processing on the model, we compare the results of ADFF on simple datasets of 20 seconds when the number of parts $K$ (the number of channels produced by data processing) varies over [1, 2, 4, 6, 8, 10, 12, 14, 16].
Figure 3 shows that $K$ has different optimal values for different tasks. For valence, the best and second-best results are achieved when $K$ is 1 and 6, whereas for arousal the order is reversed, with 6 best and 1 second-best. Considering the recognition metric only, either 1 or 6 is a good choice for an input length of 20 seconds. Furthermore, we observe that the time cost decreases clearly as $K$ grows, as shown in Table 2. However, if $K$ is too large, the recognition performance of the model degrades. We conjecture that the reason is that the feature represented by each channel becomes too short to contain enough emotional information. Balancing time cost against performance, we choose $K = 6$ in the following experiments.
3.3.3 The performance of different lengths
Previous studies have shown that inputs that are too long or too short degrade the performance of MER [paper16, paper31]. Exploring the influence of different input lengths on the model helps to select the optimal input length and to balance recognition performance against speed. Therefore, we explore the performance of the proposed model as the input length changes. In addition, we explore the influence of dataset size on recognition performance.
[Table 3: column groups Simple Datasets | Full Datasets]
As shown in Table 3, different input lengths lead to different recognition performance. On simple datasets, the model works best when the input length is 20 seconds. Note that each input length should have its own optimal parameters, which we have not investigated one by one here. On full datasets, the recognition performance improves for all lengths. The $R^2$ scores for valence on 5-second and 10-second segments stand out in particular: they increase relatively by 30.21% and 35.91%, respectively, compared with the simple datasets, and even exceed the performance on 20-second segments. This illustrates that the size of the dataset affects the performance of the model. Even if a clip is short, excellent recognition performance can be obtained when the model is fully trained. This conclusion may help existing music software to improve its emotion-related recommendation functions.
3.3.4 The performance of multi-task and classification task
We also explore the multi-task ability of the proposed model. As shown in Table 4, when the model predicts valence and arousal simultaneously, the performance is not much different from the single-task setting. Doing so, however, avoids the trouble of designing or training a separate model for each task. It also demonstrates that our proposed model has an excellent ability to learn affect-salient features for both valence and arousal. Besides, we extend our experiments to two-category and four-category tasks, where the categories are defined by the positive and negative values of valence and arousal. As expected, four-category tasks are more challenging than two-category tasks, and predicting valence is more difficult than predicting arousal. To our surprise, however, different input lengths have no obvious effect on the classification tasks.
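The derivation of class labels from the signs of valence and arousal can be sketched as below. The integer label encoding, the treatment of an exact zero as positive, and the function names are assumptions made for illustration; the text only states that the categories follow the positive and negative values of the two dimensions.

```python
def two_class_label(value):
    """Binary label from the sign of a single dimension (valence or arousal)."""
    # Assumption: a value of exactly zero is counted as positive.
    return 1 if value >= 0 else 0


def four_class_label(valence, arousal):
    """Quadrant label in the V-A plane, encoded as 2 * arousal_sign + valence_sign."""
    return 2 * two_class_label(arousal) + two_class_label(valence)


def accuracy(true_labels, predicted_labels):
    """Fraction of clips whose predicted label matches the true label."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


# Made-up (valence, arousal) pairs, one per quadrant.
annotations = [(0.5, 0.5), (-0.1, 0.4), (0.3, -0.2), (-1.0, -1.0)]
labels = [four_class_label(v, a) for v, a in annotations]
print(labels, accuracy(labels, [3, 2, 0, 0]))
```

Under this encoding the two-category tasks split the plane along one axis, while the four-category task requires getting both signs right at once, which is consistent with it being the harder task.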
4 Conclusion and Future work
In this work, we propose an end-to-end attention-based deep feature fusion method for MER. The proposed model effectively builds a bridge from affect-salient features to the emotion space using the log Mel-spectrogram only. A series of experiments shows that the proposed model outperforms the state-of-the-art method in performance and robustness, and maintains high recognition quality across different input lengths, dataset sizes, and tasks. So far, we have only experimented on the PMEmo dataset, in which English songs account for the vast majority, whereas real-world music covers many genres and languages. In future work, we will investigate the ability of our model on cross-language or cross-genre datasets, and try to introduce pre-training to further improve the effectiveness of our model.