Music is a powerful means to evoke human emotions. Analysing the interactions between them is thus important in affective computing, and is one of the main focuses of Music Emotion Recognition (MER), which attempts to automatically identify the emotions matching a specific piece of music [MER2012_survey]. MER is useful for many potential applications such as music recommendation systems, automatic playlist generation for streaming services, and even music therapy in biomedicine [Hizlisoy_DL_class].
MER is performed differently depending on how emotions are modelled. Two main approaches currently co-exist: the categorical and continuous models. MER using categorical modelling commonly addresses classification into the six ‘general’ emotional categories defined in Ekman’s theory (i.e., happiness, sadness, anger, fear, disgust and surprise) [Ekman_BasicEmotions]. On the other hand, MER based on continuous modelling mostly focuses on regression to suggest ‘specific’ emotional intensities based on Russell’s circumplex model, which decomposes emotions along several axes such as arousal (level of energy) and valence (level of pleasantness) [CircularModel_Russell].
Both the categorical and continuous MER approaches have their pros and cons. While categorical approaches can clearly identify general emotions in music, they cannot take into account the richness and variations of human emotions. For example, there are several degrees of happiness, such as much, moderate and little happiness, that cannot be distinguished from each other with such approaches. On the other hand, continuous approaches can express fine-grained human emotions in a vector space defined by arousal and valence axes. However, it is difficult to identify general emotions because dissimilar emotions such as ‘fear’ and ‘anger’ are located close to each other in the arousal-valence space [CircularModel_Dufour]. Therefore, neither the categorical nor the continuous MER approach has become dominant over the other in the literature, despite the benefits of each approach being essential for MER.
In this paper, we propose a Cross-modal Music Emotion Recognition (CMER) approach that can directly analyse the similarity between music and emotion in a common space called an embedding space.
Fig. 1 shows a comparison between a standard CMER approach and the aforementioned categorical and continuous MER approaches. While the latter respectively aim to estimate discrete general emotions and specific continuous emotions, the former projects music samples and emotions into the embedding space. Here, an ‘embedding’ refers to the vector representation of a music sample or an emotion in the embedding space. Embeddings of music samples with similar features, as well as embeddings of emotions with similar arousal/valence intensities, are located close to each other. CMER can identify general emotions because similar emotions are gathered close to each other in terms of their embeddings. Moreover, music samples and the specific emotions that are highly relevant to them are projected in proximity, that is, their fine-grained relations are preserved in the embedding space. In this way, CMER can treat both general and specific emotions.
In addition, emotional intensities are inherently uncertain because they are subjectively annotated according to human perceptions, which are highly influenced by many factors such as age, personality, cultural background and surrounding conditions [MER2012_survey, Yang_fazzy_class]. Most existing MER approaches ignore this emotional uncertainty and use potentially inaccurate emotional intensities as labels. Emotional uncertainty could cause embeddings to be inaccurate and thus negatively impact the performance of the recognition system.
To deal with the emotional uncertainty, we develop an approach called CMER using Composite Loss (CMER-CL) that trains music and emotion embeddings with a composite loss function examining two statistical characteristics. Firstly, it can be assumed that even if emotional intensities differ from user to user, they nevertheless remain correlated when listening to the same music sample. Thus, Canonical Correlation Analysis (CCA) is used as a correlation-based loss function [Deepcca]. The CCA loss enables us to deal with inter-subject variations in emotional intensities by maximising the correlation between music samples and their associated emotions in the embedding space, so as to find their ‘relative’ connection. That is, the embedding space characterises how audio features change according to the increase/decrease in arousal/valence intensities and vice versa. Secondly, due to the emotional uncertainty, it might not be optimal to project a music sample or an emotion into a single point as shown in Fig. 1. For this reason, each music sample and each emotion is additionally projected as a probability distribution in the embedding space, in order to cover the large intra-class variation resulting from the emotional uncertainty. Based on this idea, our composite loss function measures the Kullback-Leibler (KL) divergence between the probability distribution for a music sample and the one for an emotion in a second embedding space.
To sum up, this paper contains the following three main contributions: Firstly, we propose a cross-modal music emotion recognition approach CMER-CL that can work with both general and specific emotions since it uses the continuous model of emotions to obtain embeddings of emotions, and its embedding space maintains not only similar music samples and emotions close to each other but also dissimilar music samples and emotions far away. Thus, the embedding space serves as a bridge between music samples and emotions and offers bidirectional music emotion recognition as a by-product of CMER-CL. In other words, we can perform not only Music to Emotion (M2E) to detect emotions expressed in specific music samples but also Emotion to Music (E2M) to identify music samples matching a specific emotion. Secondly, we propose a new composite loss combining CCA and KL divergence to take into account emotional uncertainty. Finally, we perform extensive experiments on two benchmark datasets, the MediaEval Database for Emotional Analysis in Music (DEAM) dataset [DEAM] and the PMEmo dataset [PMEmo]. More specifically, we demonstrate not only the superiority of CMER-CL over one-way M2E baselines, but also the effectiveness of the proposed composite loss function. In addition, a detailed analysis of recognition results is performed to show that CMER-CL can robustly recognise samples with high similarity to the query as input.
This paper is organised as follows: Section 2 reviews the literature on existing MER approaches grouped into several categories. Section 3 describes the details of our CMER-CL approach, and Section 4 reports the experimental results showing the effectiveness of our CMER-CL approach through a comparison with baseline approaches. In addition, Section 5 presents a detailed analysis of the recognition results. Finally, Section 6 presents the conclusion and our future work.
2 Related work
This section provides a short review of existing MER approaches by dividing them into M2E and E2M. First, our review of existing M2E approaches is carried out by classifying them into three categories: “feature engineering” that hand-crafts emotion-related acoustic features, “feature learning” based on deep learning that automatically learns emotion-related features, and “relation modelling” that extracts the relationship between emotional intensities and acoustic features obtained by feature engineering or learning. We then discuss the few past works dealing with E2M. Through this review, we clarify the novelty of the proposed CMER-CL, which fuses the benefits of both M2E and E2M to enhance MER.
Feature engineering for M2E: Several libraries such as MIRtoolbox [Lartillot_MIRtoolbox_tool] and openSMILE [Eyben_Opensmile_tool] are currently available to extract fundamental acoustic features such as Zero-Crossing Rate (ZCR), Root-Mean-Square (RMS) energy, Mel-Frequency Cepstral Coefficients (MFCCs) and Short-Time Fourier Transform (STFT). However, acoustic signal analysis alone might not be enough to account for all required acoustic characteristics [CircularModel_Dufour]. As a result, M2E approaches have placed a large focus on feature engineering. Panda et al. [MER_AudioFeatures2020_survey] surveyed the field and distinguished several types of emotion-related acoustic features, including spectral features (low-level features), rhythm clarity (a perceptual feature) and genre (a high-level semantic feature). Various works in the literature have explored these categories of features. Ren et al. [Ren_feat_class] proposed acoustic features consisting of modulation spectral analysis of MFCCs, short-term timbre features and long-term joint frequency features computed from a two-dimensional representation of acoustic and modulation frequencies [Sukittanon_joint-frequency]. Mo and Niu [Mo_feat_class] presented an acoustic feature extraction technique called OMPGW that combines three signal processing algorithms, namely Orthogonal Matching Pursuit (OMP), Gabor functions, and the Wigner distribution function, to provide an adaptive time-varying description of music signals with a higher spatial and temporal resolution. Panda et al. [Panda_feat_class] proposed algorithms to extract acoustic features related to musical texture and expressive performance techniques (e.g., vibrato, tremolo, and glissando). Cho et al. [Cho_ChordFeat_reg] presented acoustic features considering temporal sequences of chords (harmonic sets of musical notes).
The aforementioned works mainly focus on designing and selecting acoustic features to effectively estimate an emotion from music samples. However, feature engineering does not inherently take emotional variation into account, unlike our proposed approach, which is based on cross-modal embedding to analyse both music samples and emotions.
Feature learning for M2E: The advantage of feature learning is the ability to capture high-level features from raw data and hand-crafted (low-level) features. Deep learning models can also provide more effective features than feature engineering thanks to techniques like transfer learning and fine-tuning. In this context, many M2E approaches based on feature learning have been proposed and have reported promising performance. Schmidt et al. [Schmidt_DL_reg] applied regression-based Deep Belief Networks (DBN) [Hinton_DBN_using_Schmidt] to predict arousal/valence directly from spectra obtained by STFT. Weninger et al. [Weninger_DL_reg] evaluated the usefulness of Long Short-Term Memory (LSTM) to estimate arousal/valence levels from hand-crafted acoustic features [interspeech_ComParE_2013]. Li et al. [Li_DL_DBLSTM_reg] proposed a Deep Bidirectional Long Short-Term Memory (DBLSTM) to predict arousal/valence from low-level acoustic features. Malik et al. [Malik_DL_reg] demonstrated the effectiveness of stacking Convolutional Neural Networks (CNN) and bidirectional Gated Recurrent Units (GRU) to predict arousal/valence from acoustic features exclusively based on log mel-band energy. Dong et al. [Dong_DL_reg] developed a Bidirectional Convolutional Recurrent Sparse Network (BCRSN) that uses the spectrogram of audio signals and reduces computational complexity with a transform approach that converts the continuous arousal/valence prediction process into multiple binary classification problems. Sarkar et al. [Sarkar_DL_class] applied a CNN taking log-mel spectrograms as input to a four-class classification problem defined by the quadrants of Russell’s model [CircularModel_Russell]. Hizlisoy et al. [Hizlisoy_DL_class] proposed a Convolutional Long short term memory Deep Neural Network (CLDNN) architecture for classification into three quadrants of Russell’s model, excluding the low-arousal, high-valence quadrant [CircularModel_Russell]. Choi et al. [Choi_DL_transfer] presented a transfer learning approach for audio-related feature extraction. They trained a CNN that takes mel-spectrograms extracted from music signals as inputs for a music tagging task, and then transferred and fine-tuned it for six other tasks, namely ballroom dance genre classification, music genre classification, speech/music classification, emotion prediction, vocal/non-vocal classification and audio event classification.
The aforementioned approaches used discrete emotions or arousal/valence intensities to estimate an emotion from music samples, but without taking into account emotional variation for discrete-based approaches, or emotional uncertainty for discrete-based as well as continuous-based approaches.
On the other hand, our approach uses high-level acoustic features learned by a pre-trained VGGish model [VGGish], and takes advantage of cross-modal embedding using composite loss to deal with both emotional variation and uncertainty.
Relation modelling for M2E: Past work has also investigated the relationships between music samples and emotions. Yang et al. [Yang_modeling_reg] built a group-wise MER scheme (GWMER) which divides users into various groups based on user information such as generation, gender, occupation and personality, and trains an SVR for the prediction of arousal/valence for each group. In this way, GWMER can partially address the problem that annotations of continuous emotions are more affected by subjectivity than those of discrete emotions. Yang and Chen [Yang_modeling_Ranking-based_reg] presented a ranking-based neural network model, called RBF-ListNet, which ranks a collection of music samples by emotion and determines the emotional intensity of each music sample using a cosine Radial Basis Function (RBF) as an activation function. Yang and Chen [Yang_modeling_predDist_reg] and Chin et al. [Chin_modeling_reg] developed probabilistic approaches to deal with emotional uncertainty by estimating the distribution of emotional intensities from hand-crafted acoustic features. Markov and Matsui [Markov_modeling_reg] showed that modelling with Gaussian Processes (GP) was more powerful than SVR for arousal/valence regression with hand-crafted acoustic features. Additionally, Fukayama and Goto [Fukayama_modeling_reg] evaluated the effectiveness of aggregating multiple GP regressions, each trained with different acoustic features. Wang et al. [Wang_modeling_reg] presented Acoustic Emotion Gaussians (AEG) that treat emotional uncertainty by modelling hand-crafted acoustic features as a parametric probability distribution (soft assignment) instead of a single point (hard assignment). Wang et al. [Wang_modeling_histogram_reg] proposed a Histogram Density Mixture (HDM) model that quantises the arousal/valence space into a grid and trains a Gaussian Mixture Model (GMM) for each grid cell independently using hand-crafted acoustic features. Han et al. [Han_modeling_class] proposed a geometric approach to classify each example into specific regions of the arousal/valence space by estimating its distance and angle from the origin of the space. SVR was used to estimate both distance and angle, and was shown to be more effective than SVM and GMM. Wang et al. [Wang_modeling_class] developed a MER system for 34 emotional categories based on a Hierarchical Dirichlet Process Mixture Model (HDPMM) that can link emotion classes using the component-sharing property of the HDPMM.
To the best of our knowledge, relation modelling approaches have so far exclusively relied on feature engineering, and not yet been used with high-level features obtained by feature learning. On the other hand, our cross-modal approach uses high-level acoustic features based on VGGish, and projects emotional intensities into an embedding space via an emotion encoder, which enables us to extract high-level feature representations for emotions.
E2M: E2M has not been explored as extensively as M2E in the literature. Kuo et al. [Kuo_E2M_graph] developed a model for emotion-based film-music recommendation using a Music Affinity Graph, which represents the relationship between the corresponding acoustic features and query emotions. Ruxanda et al. [Ruxanda_E2M_reg] developed an algorithm for dimensionality reduction of emotion-related acoustic features to effectively identify high-dimensional audio features from emotion. Deng et al. [Deng_E2M_reg_recommend] proposed an E2M approach based on the assumption that emotions expressed in music listened to in the past have an influence on the user in the present. Yang et al. [Yang_E2M_MrEmo_reg] developed an emotion-based music recognition system by extending M2E regression approaches. It first collects the arousal and valence predicted by an M2E model, and then identifies music samples by calculating the distances between the collected emotions and the user’s input emotion.
One main reason for the scarcity of E2M research is that the existing acoustic features associated with emotions are high dimensional [Ruxanda_E2M_reg], and thus not easy to predict directly from emotions using regression. In addition, the above-mentioned E2M approaches use raw emotional intensities to link music and emotion, and are therefore potentially affected by emotional uncertainty. Alternatively, our approach based on CCA and KL-divergence losses can take into account emotional uncertainty.
Cross-modal recognition: Cross-modal recognition approaches using embedding have attracted much attention as a technique that can perform effective bidirectional recognition between different modalities (e.g., image, text and audio). Related to audio processing, some researchers have explored cross-modal recognition between audio and image [Zeng_closs_S-DCCA_audio-image, Zeng_closs_TNN-C-CCA_audio-image] and between audio and text (lyrics) [Yu_closs_DCCA_audio-text]. However, to the best of our knowledge, no existing work addresses cross-modal recognition between audio and emotion except our previous study [CMR_takashima], where MultiLayer Perceptrons (MLPs) based on CCA loss are used to compute music and emotion embeddings. This paper extends our previous study by adopting RNNs in addition to MLPs, devising a composite loss function that combines CCA and KL-divergence losses, and conducting a significantly deeper analysis of experimental results.
3 The CMER-CL approach
Fig. 2 shows an overview of our CMER-CL approach. A sequence of acoustic features is first extracted from raw audio recordings using the VGGish model trained by Google [VGGish] (https://github.com/tensorflow/models/tree/master/research/audioset/vggish). In addition, as shown in Fig. 2 (b), the emotion encoder takes as input a sequence which combines raw arousal and valence intensities for a music sample. Note that, for the sake of simplicity, this sequence is called an arousal/valence sequence in the following discussions. The music and emotion encoders, each of which consists of a neural network model (e.g., MLP or bidirectional GRU [Cho_GRU, Schuster_Bidirectional-RNN]), extract high-level features and , respectively. Then, and are projected into an embedding space using different Fully Connected (FC) layers with linear activation. The embeddings for and in are denoted by and , respectively. In addition, two FC layers are used to transform into a mean vector and a variance-covariance matrix . This defines an additional embedding for as a multivariate Gaussian distribution in another embedding space . Similarly, is converted into using two FC layers. Under the above-mentioned setting, CMER-CL trains the music and emotion encoders and the six FC layers by jointly minimising the CCA loss between and and the KL-divergence loss between and .
The following sub-sections describe data processing, encoding of music samples and emotions, and more details of the training process based on the composite loss.
3.1 Encoding music samples
Our CMER-CL approach uses a CNN-based VGGish model for audio feature extraction [VGGish]. VGGish first segments a music signal into non-overlapping frames of 0.96 seconds. After resampling them to 16 kHz mono, it extracts the log mel spectrogram of each frame. A CNN is then applied to the spectrogram of the th segment to extract a -dimensional acoustic feature . A sequence is fed into the music encoder.
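As a rough illustration of the framing step only, the sketch below splits a mono waveform sampled at 16 kHz into non-overlapping 0.96-second frames. The function name `split_into_frames` and the choice to drop a trailing partial frame are assumptions made for this example; they are not taken from the VGGish code, and the spectrogram and CNN stages are not shown.

```python
SAMPLE_RATE_HZ = 16_000   # input rate stated in the text
FRAME_SECONDS = 0.96      # non-overlapping frame length stated in the text


def split_into_frames(waveform, sample_rate=SAMPLE_RATE_HZ,
                      frame_seconds=FRAME_SECONDS):
    """Split a 1-D sequence of samples into non-overlapping frames.

    Assumption: a trailing partial frame is simply dropped.
    """
    frame_len = int(round(sample_rate * frame_seconds))  # 15360 samples
    n_frames = len(waveform) // frame_len
    return [waveform[i * frame_len:(i + 1) * frame_len]
            for i in range(n_frames)]


# A silent 30-second clip gives 31 full frames of 15360 samples each.
clip = [0.0] * (30 * SAMPLE_RATE_HZ)
frames = split_into_frames(clip)
print(len(frames), len(frames[0]))
```

Each frame would then be turned into a log mel spectrogram and passed through the CNN to obtain one acoustic feature vector per frame.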
For the music encoder, two types of neural networks were tested: an MLP and an RNN using bidirectional GRU [Cho_GRU, Schuster_Bidirectional-RNN]. When the former is used, a mean feature vector is computed by averaging the components of , and passed as input to the neural network, which performs several non-linear transformations to output . The RNN using bidirectional GRU extracts a hidden state by recursively aggregating in , and and from the previous and next time points. Compared to , can be seen as a higher-level feature that considers not only but also features in the past and the future. Then, an overall feature is formed by concatenating and . Finally, several FC layers are used to refine into a further higher-level feature .
3.2 Encoding arousal/valence intensities
The emotion encoder processes a sequence representing the valence and arousal at each frame, that is, . Our preliminary experiments showed the effectiveness of an RNN using bidirectional GRU as the emotion encoder regardless of the dataset. Therefore, the emotion feature vector is extracted in the same way as when an RNN is used as the music encoder.
3.3 Training with a composite loss
Let be a batch consisting of pairs of acoustic and arousal/valence sequences for associated music samples and emotions. And, is a set of feature pairs obtained by applying the music and emotion encoders to in . This set is converted into a set of embeddings to compute the CCA loss [Deepcca]. In addition, a set of pairs of multivariate Gaussian distributions is obtained for the KL-divergence loss [hama_kl-divergence]. Our composite loss combines and as follows:
where is a weight parameter to balance CCA and KL. Below, details of how to compute and are described.
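The weighting described above can be sketched as follows. The convex combination with weights `alpha` and `1 - alpha` is an assumption made for illustration (the text only states that a single weight parameter balances the two terms), and the names `composite_loss`, `cca_loss`, `kl_loss` and `alpha` are hypothetical.

```python
def composite_loss(cca_loss, kl_loss, alpha):
    """Weighted combination of the two loss terms.

    Assumption: the balance is a convex combination,
    alpha * CCA + (1 - alpha) * KL, with alpha in [0, 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is assumed to lie in [0, 1]")
    return alpha * cca_loss + (1.0 - alpha) * kl_loss


# alpha = 1 keeps only the CCA term, alpha = 0 keeps only the KL term.
print(composite_loss(2.0, 4.0, 0.25))
```

Under this assumption, the two single-loss settings of an ablation study correspond to the end points of the `alpha` range.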
3.3.1 Correlation-based embedding with CCA
We firstly provide general descriptions of a CCA-based embedding approach and then specialise it to embed music samples and emotions. Let be a random vector that is sampled from the probability distribution estimated using a set of samples , and be a random vector from the probability distribution estimated using . In addition, let us assume that and are weight vectors to project and into scalars, respectively. CCA optimises and so as to maximise the following correlation between and [Deepcca]:
where is the cross-covariance matrix computed from , and and are the covariance matrices for and , respectively. In Eq. (2) the quantity to maximise is invariant in scaling of and , so it is possible to focus on the problem where the denominator is equal to 1. In other words, the objective of CCA is to maximise the numerator in Eq. (2) subject to the constraints and .
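To make the scale-invariance argument concrete, the following sketch computes the sample correlation between two projected scalars and checks that rescaling the weight vectors by positive constants leaves it unchanged. The toy data and the names `projected_correlation`, `w_x` and `w_y` are illustrative assumptions and do not come from the paper.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pearson(a, b):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def projected_correlation(xs, ys, w_x, w_y):
    """Correlation between the scalar projections w_x.x and w_y.y."""
    return pearson([dot(w_x, x) for x in xs], [dot(w_y, y) for y in ys])


# Toy paired samples (purely illustrative).
xs = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]]
ys = [[0.5, 1.0], [1.5, 0.0], [2.0, 2.5], [3.5, 1.0]]
w_x, w_y = [0.3, 0.7], [0.6, 0.4]

rho = projected_correlation(xs, ys, w_x, w_y)
rho_scaled = projected_correlation(
    xs, ys, [10.0 * w for w in w_x], [0.1 * w for w in w_y])
print(abs(rho - rho_scaled) < 1e-9)
```

Because the value does not depend on the scale of the weight vectors, fixing the denominator to 1 loses no generality, which is what turns the ratio into a constrained maximisation of the numerator.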
In accordance with the basic CCA described above, the general CCA finds pairs of weight vectors to maximise the sum of correlations between and . In other words, letting be a matrix where each row is , forms a -dimensional embedding. Similarly, where each row is is used to create . From this perspective, the general CCA maximises the sum of correlations each computed for one dimension of and . Past work has shown that the batch optimisation of can be done by solving the following constrained optimisation problem [Deepcca]:
|subject to :|
where the trace operation is used to sum up the correlations on each dimension of and .
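For reference, the two CCA objectives discussed in this sub-section can be written in their standard textbook form. The symbols below ($w_x$, $w_y$, $W_x$, $W_y$, $\Sigma_{xy}$, $\Sigma_{xx}$, $\Sigma_{yy}$) are assumed names chosen for this sketch and may differ from the notation of Eqs. (2) and (3); the matrices $W_x$ and $W_y$ are taken to hold one weight vector per row, as described in the text.

```latex
% Basic CCA: correlation between the two projected scalars
\rho(w_x, w_y) =
  \frac{w_x^{\top} \Sigma_{xy} w_y}
       {\sqrt{w_x^{\top} \Sigma_{xx} w_x}\,\sqrt{w_y^{\top} \Sigma_{yy} w_y}},
\qquad \text{or equivalently} \qquad
\max_{w_x, w_y} \; w_x^{\top} \Sigma_{xy} w_y
\;\; \text{s.t.} \;\;
w_x^{\top} \Sigma_{xx} w_x = w_y^{\top} \Sigma_{yy} w_y = 1 .

% General CCA: sum of correlations over all embedding dimensions
\max_{W_x, W_y} \; \operatorname{tr}\!\left( W_x \Sigma_{xy} W_y^{\top} \right)
\quad \text{subject to} \quad
W_x \Sigma_{xx} W_x^{\top} = W_y \Sigma_{yy} W_y^{\top} = I .
```

The identity constraints play the same role as the unit-variance constraints of the scalar case while also making the embedding dimensions mutually uncorrelated.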
The CCA-based embedding approach described above can be applied to our case as follows: First, and correspond to random vectors that are sampled from the probability distributions estimated by and , respectively. Here, is the output of a FC layer taking as input in order to enhance the expressive power of . Also, , and in Eq. (3) are computed using and . As a result of the optimisation in Eq. (3), a music sample and emotion are embedded into -dimensional vectors and , respectively.
3.3.2 Distribution-based embedding with KL-divergence
Although the CCA loss allows us to project music samples and emotions into the embedding space while considering their correlation, it does not constrain how close a similar music sample and emotion should be positioned, nor how far apart a dissimilar music sample and emotion should be positioned. KL is used to train music and emotion encoders and FC layers to achieve the following conditions: 1) a music sample and its associated emotion are projected as multivariate Gaussian distributions and which are similar to each other; 2) a music sample and its uncorrelated emotion are projected as dissimilar distributions and (). Similarly, an emotion and its uncorrelated music sample are transformed into dissimilar distributions and .
We use the term positive pair to indicate a pair of and obtained for an associated music sample and emotion. In addition, a negative pair expresses a pair of and or a pair of and , which are computed for an uncorrelated music sample and emotion. Note that in our implementation a music sample and an emotion whose indices are the same form a positive pair, and any other pair is regarded as a negative pair.
In accordance with these definitions, the above-mentioned condition can be formulated using a triplet :
where represents the distance between two distributions. In addition, is a margin hyper-parameter which determines how far the difference between the distance for positive pair and the one for negative pair is allowed to be. Eq. (4) uses as an anchor and checks whether its distance to in the positive pair is sufficiently smaller than its distance to in a negative pair. Similarly, another triplet can define the distance condition using as an anchor:
where and are respectively abbreviated into and for simplicity’s sake. The first term in becomes zero if all the negative pairs defined using as an anchor lead to distances that are greater than the distance between the positive pair by more than . The second term also checks a similar distance condition as an anchor. This way examines the validity of relative to and defined for all the negative pairs.
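The margin-based conditions above can be sketched as a bidirectional hinge loss over a batch. This is a simplified illustration under stated assumptions: `dist[i][j]` is a hypothetical precomputed distance between the distribution of music sample `i` and that of emotion `j`, the same matrix is reused for both anchor directions (which ignores the asymmetry of the KL divergence), and violations are summed over all negatives.

```python
def triplet_ranking_loss(dist, margin):
    """Bidirectional hinge loss over a batch.

    dist[i][j]: distance between music sample i and emotion j.
    Diagonal entries are positive pairs; all others are negative pairs.
    """
    n = len(dist)
    total = 0.0
    for i in range(n):
        positive = dist[i][i]
        for j in range(n):
            if j == i:
                continue
            # Music sample i as anchor, emotion j as negative.
            total += max(0.0, margin + positive - dist[i][j])
            # Emotion i as anchor, music sample j as negative.
            total += max(0.0, margin + positive - dist[j][i])
    return total


# Negatives already farther than the positives by more than the margin.
well_separated = [[0.1, 2.0], [3.0, 0.2]]
print(triplet_ranking_loss(well_separated, margin=1.0))
```

A term contributes zero once the negative pair is farther than the positive pair by more than the margin, so only violating triplets produce gradients.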
To compute , the Kullback–Leibler (KL) divergence is employed as the distance between two multivariate Gaussian distributions and , and is computed as follows:
where Eq. (7) uses the KL divergence formulation for multivariate Gaussian distributions. To reduce the computational complexity for the computation of the KL-divergence, we assume that dimensions for these distributions are independent. This assumption follows the standard practice of the literature [Kingma_VAE, Sanchez_GMM]. Thus, and are replaced with vectors and . Here, and respectively consist of diagonal elements in and , and is the dimensionality of the embedding space. As a result, the KL divergence in Eq. (7) is simplified as follows:
Finally, KL is defined as the sum of for all the positive pairs in . Hence, the minimisation of KL can lead both music and emotion encoders and FC layers to learn parameters so that the KL-divergence between each positive pair is minimised while maximising the KL divergence between each negative pair.
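Under the independence assumption, the KL divergence between two Gaussians with diagonal covariance has a well-known closed form, sketched below in plain Python. The function and argument names are illustrative, and the direction KL(p || q) shown here is an assumption, since the KL divergence is not symmetric.

```python
import math


def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """KL(p || q) for Gaussians with diagonal covariance.

    Per dimension:
    0.5 * (log(var_q / var_p) + (var_p + (mu_p - mu_q)^2) / var_q - 1)
    """
    total = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        total += 0.5 * (math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0)
    return total


# Identical distributions have zero divergence.
print(kl_diag_gaussians([0.0, 1.0], [1.0, 2.0], [0.0, 1.0], [1.0, 2.0]))
# Unit-variance Gaussians whose means differ by 1 in one dimension: 0.5.
print(kl_diag_gaussians([0.0], [1.0], [1.0], [1.0]))
```

The cost is linear in the embedding dimensionality, whereas the full-covariance formula requires a determinant and a matrix inverse.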
3.4 Cross-modal music emotion recognition tasks
CMER-CL formulates each of M2E and E2M as a retrieval task. In M2E, the trained music model, consisting of a music encoder and three FC layers, is used to encode a query music sample into its correlation-based embedding and a multivariate Gaussian distribution . On the other hand, the trained emotion model, consisting of an emotion encoder and three FC layers, is used to convert each of the test emotions into its correlation-based embedding and a multivariate Gaussian distribution .
Under this setting, the similarity between the query music sample and the test emotion for M2E is computed as the weighted sum of the Pearson product-moment correlation coefficient between and and the negative KL divergence between and . defined in Eq. (1) is used to weight the correlation coefficient and the negative KL divergence by and , respectively. Test emotions are then sorted by decreasing similarity to the query music sample. The performance of M2E is evaluated by examining whether the test emotion relevant to the query music sample is ranked at a high position or not.
Similarly to M2E, E2M is performed by encoding a query emotion and test music samples with the trained emotion and music models, respectively. Then, test music samples are sorted by computing their similarities to the query emotion based on their correlation-based embeddings and multivariate Gaussian distributions. The rank of the music sample relevant to the query emotion is checked to measure the performance of E2M.
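The retrieval step can be sketched as below. This is a minimal illustration under several assumptions that are not confirmed by the text: the two weights are taken to be `alpha` and `1 - alpha`, the Gaussians are diagonal, the KL divergence is taken from the query to the candidate, and the dictionary keys `emb`, `mu` and `var` are hypothetical names.

```python
import math


def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def kl_diag(mu_p, var_p, mu_q, var_q):
    return sum(0.5 * (math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0)
               for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q))


def rank_candidates(query, candidates, alpha):
    """Return candidate indices sorted by decreasing similarity to the query."""
    scores = [
        alpha * pearson(query["emb"], c["emb"])
        - (1.0 - alpha) * kl_diag(query["mu"], query["var"], c["mu"], c["var"])
        for c in candidates
    ]
    return sorted(range(len(candidates)), key=lambda k: scores[k], reverse=True)


# Toy example: candidate 1 matches the query, candidate 0 does not.
query = {"emb": [1.0, 2.0, 3.0], "mu": [0.0, 0.0], "var": [1.0, 1.0]}
candidates = [
    {"emb": [3.0, 2.0, 1.0], "mu": [2.0, 2.0], "var": [1.0, 1.0]},
    {"emb": [2.0, 4.0, 6.0], "mu": [0.0, 0.0], "var": [1.0, 1.0]},
]
ranking = rank_candidates(query, candidates, alpha=0.5)
print(ranking)
```

The same function serves both directions: for M2E the query is a music sample and the candidates are emotions, and for E2M the roles are swapped.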
In this section, we evaluate our CMER-CL approach on two datasets: the MediaEval Database for Emotional Analysis in Music (DEAM) dataset [DEAM] and the PMEmo dataset [PMEmo]. We present the results of two experiments. The first compares the effectiveness of our composite loss to the CCA loss or the ranking loss with KL divergence both used alone in a context of cross-modal music emotion recognition (ablation study). The second compares our proposed cross-modal approach using the composite loss to one-way regression baselines for M2E and E2M.
In the following sub-sections, we first present the two benchmark datasets, the evaluation metrics used in our studies and the details of the hyper-parameters chosen for our models. Finally, we present the results of both experiments.
4.1.1 DEAM dataset
The DEAM dataset [DEAM] provides 1802 music samples, which are free audio source records, and their corresponding arousal/valence sequences where arousal and valence intensities lie in . Each music sample was annotated with arousal and valence intensities every 0.5 seconds by at least 5 subjects, whose annotations were collected via Amazon Mechanical Turk (MTurk). These intensities were projected into the range for each subject.
The dataset contains 1744 45-second-long music samples and 58 samples that have durations longer than 45 seconds. The authors of the dataset decided to discard the first 15 seconds of annotations after observing high instability due to a high variance in how music samples start. Because of this and the fact that most music samples last only 45 seconds, each music sample is normalised to have a length of 30 seconds by taking the segment starting at 15 seconds and ending at 45 seconds. In addition, for each of arousal and valence, an “average sequence” is created by computing the average value over all subjects at each time point. The average sequences for arousal and valence are then concatenated into a two-dimensional arousal/valence sequence. Finally, the 30-second segment in this sequence corresponding to the paired music sample is extracted.
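The averaging and cropping steps can be sketched as follows. The sketch assumes, for illustration only, that each subject's annotation is a list indexed from time 0 with one value every 0.5 seconds; the actual file layout of the dataset may differ, and all function names are hypothetical.

```python
def average_over_subjects(per_subject):
    """Average several equal-length annotation sequences point by point."""
    n_subjects = len(per_subject)
    return [sum(values) / n_subjects for values in zip(*per_subject)]


def crop(sequence, step_seconds=0.5, start_seconds=15.0, end_seconds=45.0):
    """Keep the part of the sequence between start and end (assumed t0 = 0)."""
    first = int(round(start_seconds / step_seconds))
    last = int(round(end_seconds / step_seconds))
    return sequence[first:last]


def arousal_valence_sequence(arousal_by_subject, valence_by_subject):
    """Build the two-dimensional arousal/valence sequence for one sample."""
    arousal = crop(average_over_subjects(arousal_by_subject))
    valence = crop(average_over_subjects(valence_by_subject))
    return [[a, v] for a, v in zip(arousal, valence)]


# Two subjects, 45 seconds annotated every 0.5 seconds (90 points each).
arousal = [[0.0] * 90, [1.0] * 90]
valence = [[-1.0] * 90, [0.0] * 90]
sequence = arousal_valence_sequence(arousal, valence)
print(len(sequence), sequence[0])
```

With a 0.5-second step, the 30-second segment corresponds to 60 two-dimensional points.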
4.1.2 PMEmo dataset
The PMEmo dataset [PMEmo] contains 794 music samples which are the chorus parts of high-quality popular songs gathered from the Billboard Hot 100, the iTunes Top 100 Songs (USA) and the UK Top 40 Singles Chart. Each music sample is annotated with arousal and valence intensities between 1 (low) and 9 (high) every 0.5 seconds, and then projected into the range . Similarly to the DEAM dataset, the first 15 seconds of annotations were discarded to take into account the large variance in the beginnings of music samples. Unlike for the DEAM dataset, music samples and associated arousal/valence sequences of the PMEmo dataset have variable lengths, with sequences ranging from 0.08 to 73.24 seconds (after discarding the first 15 seconds). We decided to select music samples with a total length of at least 7.0 seconds, resulting in a total of 701 pairs of a music sample and its associated arousal/valence intensities for evaluation. As for the DEAM dataset, the arousal/valence sequence is created for each music sample by computing the average arousal (or valence) intensity over all subjects at each time point. Moreover, we use the dataset called “PMEmo2019”, which is an updated version of the dataset provided by Zhang et al.
4.2 Evaluation metrics
We use Mean Reciprocal Rank (MRR) and Average Rank (AR) as evaluation metrics to measure the recognition performance of our CMER-CL approach. In MER, there is only one emotion associated with a music sample. Thus, we employ MRR and AR, which are calculated from the rank $r_i$ of the sample associated with a query $q_i$ in the list of recognised samples sorted in descending order of their similarities to $q_i$. Letting $Q$ be the number of queries, MRR and AR are formulated as $\mathrm{MRR} = \frac{1}{Q}\sum_{i=1}^{Q} \frac{1}{r_i}$ and $\mathrm{AR} = \frac{1}{Q}\sum_{i=1}^{Q} r_i$, respectively. In particular, MRR is the average of the reciprocals of the ranks $r_i$, and AR is the average of the ranks $r_i$. Therefore, the higher the MRR is and the lower the AR is, the better the performance is. In our experiments, we trained and evaluated all models in each configuration 10 times, and report the mean and standard deviation of both MRRs and ARs.
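The two metrics can be computed from per-query ranks as in the sketch below. The function names are illustrative, and the tie-handling rule (a tied candidate does not push the target down) is an assumption, since the text does not discuss ties.

```python
import numpy as np

def rank_of_target(similarities, target_index):
    """1-based rank of the target when candidates are sorted by
    descending similarity to the query."""
    similarities = np.asarray(similarities, dtype=float)
    # Count the candidates that are strictly more similar than the target.
    return int(np.sum(similarities > similarities[target_index])) + 1

def mrr_and_ar(ranks):
    """Return (MRR, AR) for a non-empty list of 1-based ranks."""
    ranks = np.asarray(ranks, dtype=float)
    if ranks.size == 0 or np.any(ranks < 1):
        raise ValueError("ranks must be a non-empty list of values >= 1")
    return float(np.mean(1.0 / ranks)), float(np.mean(ranks))
```

For instance, ranks of 1, 2 and 4 over three queries give an MRR of (1 + 0.5 + 0.25) / 3 and an AR of 7 / 3.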
4.3 Implementation details
4.3.1 Encoder architecture
The hyper-parameters of our music and emotion encoders (e.g. number of layers, number of units per layer, etc.) were chosen by grid search. We tested two types of encoders based on either an MLP or a bidirectional GRU. Their structure was chosen as follows. The MLP consists of five FC layers, each of which is a non-linear transformation using softplus, defined by $f(x) = \log(1 + e^{x})$, as the activation function. The layers have 256 units in the first layer, 512 units in the second and third layers, and 1024 units in the fourth and fifth layers. In addition, a dropout layer is included between each layer except the output layer.
The bidirectional GRU has a single layer with a 512-dimensional hidden state, is run in the forward and backward directions, and finally outputs a 1024-dimensional vector by combining the forward and backward hidden states. This 1024-dimensional hidden state is passed to five FC layers using softplus as the activation function. The number of units per FC layer was chosen as 512 for the first three layers, and 1024 for the last two. A dropout layer is also added after all layers except the bidirectional GRU and the output layers.
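To make the MLP structure concrete, here is a minimal NumPy forward pass with the layer widths listed above. It is only a sketch under stated assumptions: the weights are random rather than trained, dropout is omitted (it is inactive at inference time), and the 128-dimensional input in the usage note assumes averaged VGGish features.

```python
import numpy as np

LAYER_UNITS = (256, 512, 512, 1024, 1024)  # widths of the five FC layers

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return np.logaddexp(0.0, x)

def init_mlp(input_dim, units=LAYER_UNITS, seed=0):
    """Create random (weight, bias) pairs; the initialisation is illustrative."""
    rng = np.random.default_rng(seed)
    params, prev = [], input_dim
    for n in units:
        weight = rng.normal(0.0, np.sqrt(1.0 / prev), size=(prev, n))
        params.append((weight, np.zeros(n)))
        prev = n
    return params

def mlp_forward(x, params):
    """Apply each FC layer followed by softplus."""
    h = np.asarray(x, dtype=float)
    for weight, bias in params:
        h = softplus(h @ weight + bias)
    return h
```

Calling `mlp_forward(batch, init_mlp(128))` on a batch of shape `(n, 128)` returns an array of shape `(n, 1024)`.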
We tested various combinations of MLP and bidirectional GRU for the music and emotion encoders on the DEAM and PMEmo datasets, and found that, for the music encoder, the MLP performed best on the DEAM dataset while the bidirectional GRU was the best model on the PMEmo dataset. Furthermore, we found that the bidirectional GRU is the best emotion encoder on both datasets.
4.3.2 Training parameters
We used the same embedding dimension for both embedding spaces, and Adam [adam_optimisation] as the optimisation algorithm with an initial learning rate of 1e-5. The optimal value of the margin depends on the weight parameter $\lambda$ (in Eq. (1)), so the ablation study in Section 4.4 is conducted with the margin held fixed. In addition, the split proportion between the training and evaluation sets was set to 8:2. Therefore, the DEAM dataset is split into 1441 and 361 samples, while the PMEmo dataset is split into 560 and 141 samples. The models were trained for 5001 and 10001 epochs on the DEAM and PMEmo datasets, respectively. We implemented all the code using the TensorFlow library (version 1.15) on a machine with 64GB of RAM, an Intel i9-9900K CPU, an NVIDIA RTX 2080Ti GPU and CUDA version 10.0.
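The split sizes quoted above are consistent with rounding the training share down, as the following sketch shows. The rounding rule, the random shuffling and the function names are assumptions; the text only states the 8:2 proportion and the resulting counts.

```python
import numpy as np

def split_sizes(n_samples, train_parts=8, total_parts=10):
    """Sizes of an 8:2 split, rounding the training part down."""
    n_train = n_samples * train_parts // total_parts
    return n_train, n_samples - n_train

def random_split(n_samples, seed=0):
    """Shuffle sample indices and cut them into training and evaluation parts."""
    n_train, _ = split_sizes(n_samples)
    order = np.random.default_rng(seed).permutation(n_samples)
    return order[:n_train], order[n_train:]
```

Here `split_sizes(1802)` gives 1441 and 361 for DEAM (1744 + 58 samples), and `split_sizes(701)` gives 560 and 141 for PMEmo.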
4.4 Ablation studies
4.4.1 Superiority of Composite-Loss
Our CMER-CL approach uses as its loss function a combination of the CCA loss (CCA-Loss) between music and emotion embeddings and the ranking loss with KL-divergence (KL-Loss) between music and emotion distributions. Its performance was compared to the baseline cases where either only CCA-Loss or only KL-Loss was used as the loss function. The results obtained with these three loss functions are presented in Tables I and II for the DEAM and PMEmo datasets, respectively. For the composite loss, the weight parameter was set to 0.5.
For both datasets, the results show that MRR and AR are better using KL-Loss than CCA-Loss. This may be due to the fact that CCA-Loss alone only considers the correlation between music and emotion samples, and does not necessarily make sure that the corresponding samples are placed close to each other in the embedding space. On the other hand, models trained with KL-Loss produce embeddings so that positive samples are close to each other while negative samples are far away, thus taking into account not only corresponding samples but also unrelated ones. In addition, the recognition results are significantly improved when using the composite loss compared to using only the KL-Loss or CCA-Loss. This indicates that combining KL-Loss and CCA-Loss can simultaneously train similar multivariate Gaussian distributions to be close to each other and the embeddings to be highly correlated with each other to improve the recognition performance.
Furthermore, it can be noted that the recognition performances are significantly better on the PMEmo dataset than on DEAM. This could be attributed to the fact that PMEmo music samples are more standardised, for instance by including only the chorus parts of pop songs, which leads CMER-CL encoders to find more specialised features for the dataset they are trained on. In other words, since the DEAM music samples are more diverse than the PMEmo ones, CMER is likely to be a more difficult task on DEAM than on PMEmo. In particular, it is possible that the features learned on PMEmo are not as generic as the ones learned on DEAM.
4.4.2 Evaluation of the weight parameter
The previous results were obtained with $\lambda = 0.5$. We also carried out a study checking the MRR and AR obtained for values of $\lambda$ ranging from 0 to 1 in regular increments. The results of 10 training sessions on the DEAM and PMEmo datasets are shown in Fig. 3 and Fig. 4, respectively. It can be noted that at one end of this range the result is the same as using KL-Loss alone, and at the other end the result is the same as using CCA-Loss alone.
The results of combining CCA-Loss and KL-Loss show that the best value of $\lambda$ depends on the dataset and differs between DEAM and PMEmo. In addition, a grid search of the margin on each dataset using these values of $\lambda$ showed that the best margins are 1.0 for DEAM and 0.4 for PMEmo. The results obtained in our further experiments are reported with those values.
4.5 Comparison with the baseline models
To the best of our knowledge, no existing cross-modal MER method that can be directly compared to our CMER-CL has been proposed yet. In addition, all the existing approaches using continuous emotion models only perform M2E based on a regression approach to predict real-valued characteristics of the arousal/valence sequence (e.g., the average of arousal or valence values) for a given music sample [Schmidt_DL_reg, Weninger_DL_reg, Li_DL_DBLSTM_reg, Malik_DL_reg, Dong_DL_reg, Yang_modeling_reg, Yang_modeling_Ranking-based_reg, Yang_modeling_predDist_reg, Markov_modeling_reg, Fukayama_modeling_reg, Wang_modeling_reg, Wang_modeling_histogram_reg]. Moreover, no existing method can handle E2M to predict acoustic features of the music sample for a given arousal/valence sequence. Considering the aforementioned state of the current MER research, we define the following regression-based baselines to show the effectiveness of cross-modal recognition with CMER-CL.
M2E baselines: Two M2E baselines, RegBiGRU-M2E and RegMLP-M2E, train a regression model that analyses a query music sample and outputs a two-dimensional emotion vector representing the average arousal and valence values for this music sample. In particular, RegBiGRU-M2E predicts this vector by applying an RNN based on a bidirectional GRU to a sequence of feature vectors extracted by VGGish, while RegMLP-M2E employs an MLP model that uses the average feature vector of this sequence over time to predict it. Both baselines are trained to minimise the Mean Absolute Error (MAE) between the predicted vector and the ground-truth emotion vector associated with each training music sample.
In the evaluation process, the trained model is used to predict the emotion vector for each test music sample. We then check whether this prediction is similar to the ground-truth vector of the same sample. To this end, we compare the predicted vector to the ground-truth vectors of all the test music samples, using the Absolute Error (AE) between the predicted vector and each ground-truth vector as the measure of their similarity. Then, the ground-truth vectors are sorted in ascending order of their AEs to get the rank representing the position of the query's own ground-truth vector. Finally, this rank is used to calculate MRR and AR.
E2M baselines: Similarly to the M2E baselines, a bidirectional GRU model (RegBiGRU-E2M) and an MLP model (RegMLP-E2M) are used as E2M baselines to predict an audio feature vector. RegBiGRU-E2M and RegMLP-E2M take as input an arousal/valence sequence and the average of this sequence over time, respectively. RegBiGRU-E2M and RegMLP-E2M are trained to minimise the MAE between the audio feature vector predicted for a query emotion and the ground-truth audio feature vector, which is the average of a sequence of VGGish features over time. To compute MRR and AR as in M2E, the AEs of the predicted vector to all the test audio features are first computed. The test audio features are then ranked in order of ascending AEs to compute the rank of the ground-truth audio feature of the query emotion.
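The ranking procedure shared by the M2E and E2M baselines can be sketched as below. This is an illustrative implementation, not the authors' code: the AE between two vectors is assumed to be the sum of absolute differences over their dimensions, and ties are resolved in favour of the ground truth.

```python
import numpy as np

def ranks_from_absolute_error(predictions, ground_truths):
    """Rank each query's own ground truth among all test ground truths.

    predictions, ground_truths: arrays of shape (n_test, dim), where row i
    of ground_truths is the target of row i of predictions.
    Returns an array of 1-based ranks, one per query.
    """
    predictions = np.asarray(predictions, dtype=float)
    ground_truths = np.asarray(ground_truths, dtype=float)
    # ae[i, j] = sum over dimensions of |prediction_i - ground_truth_j|
    ae = np.abs(predictions[:, None, :] - ground_truths[None, :, :]).sum(axis=2)
    own_error = np.diag(ae)
    # Sorting by ascending AE: rank = 1 + number of strictly closer candidates.
    return (ae < own_error[:, None]).sum(axis=1) + 1
```

The returned ranks can then be averaged, directly or as reciprocals, to obtain AR and MRR.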
[Tables of baseline model settings. Columns: model name, epochs, units per FC layer, hidden state dimension.]
Tables V and VI show the comparison results between our CMER-CL and the above-mentioned baseline models on the DEAM and PMEmo datasets, respectively. As can be seen from these tables, the recognition performance of CMER-CL is significantly higher than that of the baseline models based on one-way regression of emotion or audio features. This highlights the superiority of cross-modal recognition over one-way regression.
5 Detailed analysis
The results shown before indicate the superiority of our CMER-CL compared to the tested alternatives. However, MRR and AR are global metrics that only depend on the rank of the music sample or emotion relevant to a query. It is also desirable to check the top-ranked music samples (or emotions), whose relevance to the query can be measured by neither MRR nor AR. For this, we compute the average cosine similarity, which averages the cosine similarities between the ‘overall’ arousal/valence vector (or ‘overall’ VGGish feature) associated with the query music sample (or query emotion) and the ones associated with the top 5% of music samples (or emotions) identified by CMER-CL. Here, an overall arousal/valence vector is the average of an arousal/valence sequence and an overall VGGish feature is the average of a sequence of VGGish features.
This metric allows us to get an idea of how relevant the top-ranked music samples (or emotions) are to a query, regardless of the rank of the music sample (or emotion) associated with the query. A high average cosine similarity means that CMER-CL can recognise music samples that express emotions similar to a query emotion, or emotions derived from music samples which are highly similar to a query music sample. In what follows, we especially present the analysis for E2M since the emotion for each music sample can be easily interpreted by considering its overall arousal/valence vector as a point in Russell’s Circumplex model [CircularModel_Russell]. It should be noted that arousal/valence intensities are in $[-1, 1]$ and $[0, 1]$ for DEAM and PMEmo respectively, meaning that average cosine similarities range between -1 and 1 on DEAM, and 0 and 1 on PMEmo.
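A possible way to compute this average cosine similarity for one query is sketched below. The function and parameter names are illustrative, and rounding the top 5% up to a whole number of candidates is an assumption, since the text does not say how the cut-off is rounded.

```python
import numpy as np

def average_cosine_similarity(query_vec, candidate_vecs, scores, top_fraction=0.05):
    """Mean cosine similarity between a query's overall vector and the
    overall vectors of its top-ranked candidates.

    query_vec: overall vector of the query, shape (dim,).
    candidate_vecs: overall vectors of all candidates, shape (n, dim).
    scores: similarity of each candidate to the query (higher is better).
    """
    query_vec = np.asarray(query_vec, dtype=float)
    candidate_vecs = np.asarray(candidate_vecs, dtype=float)
    scores = np.asarray(scores, dtype=float)
    # Keep at least one candidate; round the cut-off up (assumed rule).
    k = max(1, int(np.ceil(top_fraction * len(scores))))
    top = np.argsort(-scores)[:k]
    selected = candidate_vecs[top]
    norms = np.linalg.norm(selected, axis=1) * np.linalg.norm(query_vec)
    # Guard against zero-length vectors to avoid division by zero.
    cosines = (selected @ query_vec) / np.maximum(norms, 1e-12)
    return float(cosines.mean())
```

Because cosine similarity depends only on direction, vectors whose components are all non-negative (as on PMEmo) can never produce a negative value, which is why the range differs between the two datasets.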
Figs. 5(a) and 5(b) show the average cosine similarities for the DEAM and PMEmo datasets. The query emotions on the horizontal axis in panel (i) of Figs. 5(a) and 5(b) are sorted based on the rank of their relevant music sample. That is, for a query emotion on the left, its ground-truth music sample is ranked at a high position in the recognition result, meaning that the emotion expressed in the music sample is well recognised.
As shown in Figs. 5(a)-(i) and 5(b)-(i), average cosine similarities obtained on both DEAM and PMEmo are fairly high (close to 1), showing that the top identified music samples are relevant to most query emotions regardless of the ranks of their ground-truth music samples. The mean and standard deviation of these cosine similarities on DEAM and PMEmo support this observation. This indicates that even if the music sample for a query emotion was ranked at a low position, the top music samples recognised by CMER-CL express very similar emotions to the query. In addition, the box plots in Figs. 5(a)-(ii) and 5(b)-(ii) show the variations in the average cosine similarities. Here, at least 75 percent of all the average cosine similarities are higher than the 25th percentile (first quartile). The values of the 25th percentile for DEAM and PMEmo indicate that CMER-CL is able to robustly recognise music samples associated with emotions highly similar to a query. In other words, even if the music sample associated with the query emotion was ranked at a low position, the top 5% of recognised music samples still exhibit emotions close to the query. Nevertheless, average cosine similarities for some query emotions in DEAM are low, indicating that there is room for improvement for our approach.
A similar study computing the average cosine similarity between the audio feature of a query music sample and the audio features associated with the top 5% of recognised emotions for M2E showed similarly good performances, as reflected in the mean and standard deviation of the average cosine similarities on DEAM and PMEmo. Figures showing such average cosine similarities can be found on the website of CMER-CL (https://mu-lab.info/naoki_takashima/cmer-cl/-/tree/main/results).
6 Conclusion and Future Work
In this paper, we presented a Cross-modal Music Emotion Recognition using Composite Loss (CMER-CL) approach that can consider both general and specific emotions during the recognition process. CMER-CL simultaneously trains music and emotion embedding models by minimising a composite loss defined as a weighted sum of the CCA and KL divergence losses. As a result, two embedding spaces are learned, one maximising the correlation between music samples and their associated emotions and the other projecting them as similar multivariate Gaussian distributions to take the uncertainty of emotional intensities into account. The experimental results on two benchmark MER datasets, DEAM [DEAM] and PMEmo [PMEmo], show the superiority of CMER-CL over one-way regression baselines for both M2E and E2M, as well as the effectiveness of the composite loss over the CCA and KL divergence losses. In addition, a detailed analysis of the top recognised results demonstrates that CMER-CL can robustly recognise music samples that express emotions highly similar to a query emotion.
To further improve the performance of CMER-CL, we aim to extend the music and emotion encoders by pre-training them with self-supervised learning [contrastive_self-supervised_learning_survey], which can learn underlying feature representations using unlabelled data. In addition, we plan to adopt a self-attention layer [attention_is_all_you_need], which can capture long-term dependencies of features and has been reported to be superior to an RNN in various areas.
Finally, the code (and the instructions for data usage) used in this paper is available on our GitLab repository (https://mu-lab.info/naoki_takashima/cmer-cl), so that other researchers can more easily reproduce the results and extend the current CMER-CL.
This work has been supported in part by Japan Society for the Promotion of Science (JSPS) within Grant-in-Aid for Scientific Research (B) (19H04172).