
Multimodal Emotion Recognition with Modality-Pairwise Unsupervised Contrastive Loss

Emotion recognition is involved in several real-world applications. With an increase in available modalities, automatic understanding of emotions is being performed more accurately. The success of Multimodal Emotion Recognition (MER) primarily relies on the supervised learning paradigm. However, data annotation is expensive and time-consuming, and because emotion expression and perception depend on several factors (e.g., age, gender, culture), obtaining highly reliable labels is hard. Motivated by these issues, we focus on unsupervised feature learning for MER. We consider discrete emotions and use text, audio and vision as modalities. Our method, based on a contrastive loss between pairwise modalities, is the first such attempt in the MER literature. Our end-to-end feature learning approach has several differences (and advantages) compared to existing MER methods: i) it is unsupervised, so learning incurs no data labelling cost; ii) it does not require spatial data augmentation, modality alignment, a large batch size or a large number of epochs; iii) it applies data fusion only at inference; and iv) it does not require backbones pre-trained on the emotion recognition task. Experiments on benchmark datasets show that our method outperforms several baseline approaches and unsupervised learning methods applied in MER. In particular, it even surpasses a few supervised MER state-of-the-art methods.



I Introduction

Emotion is a key factor driving people’s actions and thoughts, and a fundamental part of human verbal and nonverbal communication. Automated emotion recognition is an important aspect of many applications, including social assistive robots [SPRING], smart systems for customer service [Burkhardt2006], health-care [dhuheir2021], education [Hammoumi2018], and automated-driving cars [Pavan2021]. However, it is a highly challenging problem due to the complex nature of emotion expression and perception, which are hard to generalize because they depend on several factors such as age [Demenescu2014], gender [Olderbak2019], cultural background [Engelmann2013], and personality traits [Furnes2019]. Furthermore, as humans can express their emotions across various modalities (e.g., language, facial expressions, gestures, and speech), it is essential to effectively model the interactions between these modalities, which contain complementary but also (possibly) redundant information [Baltrusaitis2019].

The majority of works have mainly concentrated on unimodal learning of emotions [beyan2021modeling, AbdullahAhmadAl20, ShirianTripathiAl21], i.e., processing a single modality. Although unimodal emotion recognition has produced breakthrough achievements, such models remain inadequate in some circumstances due to the aforementioned multimodal nature of emotion expression. On the other hand, multimodal emotion recognition (MER) inherits the challenges of multimodal machine learning, e.g., representing the data so as to exploit the complementarity and redundancy of modalities, data translation among modalities, co-learning, modality alignment (e.g., capturing temporal information) and data fusion (see


for details). As with most intelligent systems, advancements in deep learning have enhanced MER, particularly by exploiting the abundance of available data. Studies in this field (e.g., [TsaiMaAl20, ZhangLiangqingAl19, WenYouAl21, HoYangAl20]) have so far treated the learning process in a supervised way, and thus require intensive annotation labor.

This paper addresses the problem of perceived multimodal emotion recognition when the emotions are represented as discrete categories and, more importantly, we learn the features in an unsupervised fashion. Motivated by the fact that contrastive learning has shown accurate and robust performance in many domains (e.g., [chen2020simple, Rai_2021_CVPR]), we adapt the contrastive loss function [Luyu2020] to perform pairwise modality feature learning. To the best of our knowledge, this is the first time contrastive loss is adapted for MER. Our approach learns feature embeddings in an end-to-end fashion (see [DaiCahyawijayaAl21] for the definition), and differs from prior works in several aspects, which are described as follows.
i) Modality exploitation. Our method leverages different modalities in a contrastive learning framework. Given a data sample represented in terms of multiple modalities, our aim is to pull the embeddings of two modalities of the same sequence close to each other while pushing the embeddings of the same two modalities of different sequences apart. Note that the sequences that are being pushed apart can be from the same class. However, we do not use the class labels here, so we only aim to make the representations of the same sequence across modalities as similar (as close) as possible to each other.
ii) Data translation and co-learning. We contrast the feature embedding of one modality with that of another modality when both belong to the same data sample. This can be seen as analogous to performing data translation and ultimately co-learning. Unlike existing contrastive learning approaches (e.g., [Haocong2021, Rai_2021_CVPR]), we do not require spatial data augmentation (e.g., random crops, blurs or color distortions). Also, unlike approaches [chen2020simple, Chen2020, MaoLi2021] that rely on heavy data augmentation as well as large batch sizes and many epochs, our method is much more affordable.
iii) Modality alignment. The outputs of different sensors might have different (but fixed) sample rates. However, this does not hold for text, which makes obtaining word-aligned sequences far from straightforward [WenYouAl21]. Still, multimodal data alignment is an imperative step for several methods to perform effective MER (e.g., [Koromilas2021UnsupervisedML, ShenoySardanaAl20]), which makes the real-world application of such methods challenging. In contrast, our method does not require perfectly aligned modalities. We considered both aligned samples and a mixture of aligned/misaligned samples in our experiments (Sec. IV-A).
iv) Data fusion. It is applied here only at inference via the concatenation of learned feature representations. This is different from the MER state-of-the-art (SOTA) applying data fusion both in training and testing [RadoiBirhalaAl21, ShenoySardanaAl20, SongCaiAl21, MittalBhattacharyaAl20].
v) Data labelling. Our method is free from data labeling cost because it is an unsupervised feature learning approach. Note that a few unsupervised approaches exist in the same and/or related topics, e.g., speech emotion recognition [Neumann2019, Zhang2021], facial emotion recognition [Xiao2019], facial expression intensity estimation [AwiGra2018a], and multimodal sentiment and emotion analysis [Koromilas2021UnsupervisedML]. However, our method involves deep architectures that are either pre-trained on tasks different from emotion recognition (e.g., action recognition) or not pre-trained at all. This aspect makes it possible to apply the proposed method to related downstream tasks, e.g., multimodal sentiment analysis and social interaction analysis, without the need for customization. Some approaches (e.g., [Hu_2018_ECCV, Savchenko2021, Minji2020]), instead, can deliver the desired performance (e.g., outperforming the best of all methods of comparison time) if and only if they are pre-trained on large emotion datasets having the same emotion labels as in the test set.

To validate the effectiveness of our method, experiments were conducted on two multimodal emotion datasets. Results show that the proposed method outperforms prior unsupervised MER approaches and several baselines. Moreover, despite performing unsupervised feature learning, our method even surpasses some of the fully-supervised MER methods. To summarize, the main contributions of this study are: (1) presenting a novel unsupervised multimodal feature learning approach, (2) being the first study to adapt the contrastive loss for MER, and (3) improving the emotion recognition results compared to the unsupervised feature learning MER SOTA. The code of the proposed method is available at

Fig. 1: Summary of our approach. We first learn the multimodal features in an unsupervised fashion, then the downstream task (discrete emotion recognition) is performed. We jointly train the backbones of each possible pair of modalities using contrastive loss in order to predict the correct pairings of a batch of training examples. The final loss is the average of all losses calculated. During inference, the features are extracted before the projection layers (i.e., from the backbones) and concatenated, then fed to a linear classifier for emotion recognition.

II Related Work

Several methods for multimodal emotion recognition (MER) have been proposed, as detailed in the recent survey papers [SpezialettiPlacidiAl20, Sharma2021]. In this section, we summarize discrete MER research modeling text, visual and acoustic modalities, as we tested our method in that context. Early works adapt classifiers like SVMs, Linear and Logistic Regression [Castellano2008, Sikka2013] while, as bigger datasets were developed, deep learning architectures were also explored. For example, [RadoiBirhalaAl21] is based on CNNs, and [ShenoySardanaAl20, SongCaiAl21] use RNNs. Some recent studies [DelbrouckTitsAl20, TsaiMaAl20, WenYouAl21] adopt Transformers.

Ghaleb et al. [GhalebPopaAl19] apply deep metric learning in which an LSTM component models the variations of the emotions as a function of time. That is different from late fusion of modalities [RadoiBirhalaAl21, SongCaiAl21] or building temporal features to extract global information by assuming that emotions are expressed simultaneously [ShenoySardanaAl20]. Late fusion is commonly applied by concatenating the learned features of all modalities in [RadoiBirhalaAl21, SongCaiAl21] or with a pairwise scheme in [ShenoySardanaAl20]. Instead, the authors of M3ER [MittalBhattacharyaAl20] propose a data-driven multiplicative fusion method to combine the modalities, which learns to emphasize the more reliable cues and suppress the others by integrating Canonical Correlation Analysis as a pre-processing step. Differently, Zadeh et al. [Zadeh2018MultimodalLA] present Graph-MFN, which synchronizes the multimodal sequences by storing intra-modality and cross-modality interactions through time with a graph structure. Attention mechanisms have been exploited by several works as well [BeardDasAl18, ChoiSongAl18, DelbrouckTitsAl20, ChauhanAkhtarAl19, AkhtarChaudanAl20, ZhangLiangqingAl19, HoYangAl20, GhalebNiehuesAl20, DaiCahyawijayaAl21, KhareParthasarathyAl21]. For example, Dai et al. [DaiCahyawijayaAl21] present MESM, which is composed of a sparse cross-modal attention mechanism attached to the joint learning of multimodal features.

There are many attempts at applying end-to-end learning [RadoiBirhalaAl21, ShenoySardanaAl20, ChangSkarber21, HuynhVan2021], but only [DaiCahyawijayaAl21] compared a fully end-to-end method (defined as jointly optimizing the feature extraction and feature learning stages [DaiCahyawijayaAl21]) with two-phase pipelines (i.e., where feature extraction is independent from multimodal learning). Indeed, it is very common in the MER literature to apply the feature extraction step separately. This is performed on each modality by using hand-crafted formulations [TiwariRathodAl21, JaratrotkamjornChoksuriwong19, MittalBhattacharyaAl20, ShenoySardanaAl20, RadoiBirhalaAl21, Zadeh2018MultimodalLA] and/or deep learning architectures [GhalebPopaAl19, ShenoySardanaAl20, RadoiBirhalaAl21]. Examples of acoustic features include the Log-Mel spectrogram [RadoiBirhalaAl21], pitch and voiced/unvoiced segmenting features [ShenoySardanaAl20, Zadeh2018MultimodalLA, MittalBhattacharyaAl20], MFCCs [SongCaiAl21, ShenoySardanaAl20, Zadeh2018MultimodalLA, MittalBhattacharyaAl20], and features extracted from SoundNet [GhalebPopaAl19]. On the other hand, various backbones such as VGG16 [SongCaiAl21], I3D [GhalebPopaAl19] and FaceNet [GhalebPopaAl19], as well as facial features such as facial landmarks and facial action units extracted by OpenFace [Zadeh2018MultimodalLA, MittalBhattacharyaAl20], are among the most popular sources of visual features. For text, GloVe embeddings [pennington2014] have been frequently utilized [ShenoySardanaAl20, Zadeh2018MultimodalLA, MittalBhattacharyaAl20, DelbrouckTitsAl20, DelbrouckTitsAl20b, WenYouAl21, TsaiMaAl20], while Transformers are used as the backbone [DelbrouckTitsAl20, DelbrouckTitsAl20b, WenYouAl21, TsaiMaAl20, KhareParthasarathyAl21] or LSTMs are trained with the extracted word embeddings [Zadeh2018MultimodalLA, MittalBhattacharyaAl20].

Among the aforementioned approaches, [ChoiSongAl18, DelbrouckTitsAl20b] use text and audio, [JaratrotkamjornChoksuriwong19, GhalebNiehuesAl20, GhalebPopaAl19, RadoiBirhalaAl21, SongCaiAl21, ChangSkarber21, TiwariRathodAl21] use video and audio, and all others use text, audio and video together. It is worth noting that these techniques are all supervised. Recently, Khare et al. [KhareParthasarathyAl21] investigated the usage of large unlabeled multimodal datasets for pre-training a cross-modal transformer, which is then fine-tuned for the emotion recognition task. In detail, the VoxCeleb dataset [Chung2018VoxCeleb2DS], composed of 1.1 million videos that are associated with emotions [Albanie2018], is used to pre-train the multimodal transformer. Then, the decoder layer is removed, and an average pooling layer and additional fully connected layers are added to fine-tune the model for the emotion recognition task. Unlike [KhareParthasarathyAl21], we do not rely on auxiliary large-scale datasets to pre-train our model, and both the feature learning and inference are performed on the same datasets, which are much smaller than the VoxCeleb dataset [Chung2018VoxCeleb2DS]. Our learned features are frozen, i.e., we do not apply any fine-tuning as in [KhareParthasarathyAl21]. This is an important difference because some studies [UHAR_BMVC2021, li2021crossclr] have shown that, compared to using frozen features that are learned in an unsupervised fashion, fine-tuning can bring up to 17.5% improvement for the downstream task. However, following the fine-tuning approach would not keep the feature learning methodology “entirely unsupervised”, as it requires the labels of the downstream task. Moreover, our model is applicable with different modality combinations, whereas text is an anchor modality in [KhareParthasarathyAl21].

The MER literature is very limited in terms of fully unsupervised feature learning approaches. Very recently, a Convolutional Autoencoder architecture was presented in [Koromilas2021UnsupervisedML]. Despite being very different from our method in terms of architecture, [Koromilas2021UnsupervisedML] is still our “direct competitor” because it shares the following aspects with the proposed method: i) performing unsupervised feature learning without fine-tuning, ii) being independent of the number of modalities and modality combinations, and iii) not being task-specific.

III Our Approach

An overview of our approach is given in Fig. 1. First, the multimodal features are learned in an unsupervised way (Sec. III-B). Then, the downstream task (discrete emotion recognition) is performed (Sec. III-C). Sec. III-A describes the modalities and Sec. III-D presents the implementation details.

III-A Modalities

The modalities and backbones we utilize are described as follows.

Text. The word vectors are extracted from transcripts with the GloVe word embeddings [pennington2014], following the procedure in [Zadeh2018MultimodalLA]. As the backbone, we use the Transformer in [Vaswani2017], which is one of the SOTA architectures for language processing.

Visual. We rely on two sources of visual data. One of them is the facial images extracted by the MTCNN face detector [zhang2016joint] (unless faces are supplied by the dataset) from RGB video frames. As the backbone associated with the facial images, the R(2+1)D architecture [Tran2018] pre-trained on the Kinetics-400 dataset [Kay2017] is used. The other visual data is the facial landmarks detected by the method in [bulat2017far] (unless they are provided by the dataset used), and the associated backbone is the Spatial-Temporal Graph Convolutional Network (ST-GCN).

Acoustic. Mel-spectrograms are extracted with the same procedure and settings as in [Tachibana2018, DelbrouckTitsAl20, DelbrouckTitsAl20b] with the Librosa Python library [mcfee2015librosa], using 80 filter banks and selecting one frame for every 16 frames. The dimension of the mel-spectrograms is fixed to 128. We adapt the Temporal Convolutional Network (TCN) [Pariente2020Asteroid] such that it takes mel-spectrograms as the input.

As seen, each modality has its own backbone, chosen as a SOTA architecture for diverse applications of language, visual and acoustic data processing.

III-B Unsupervised Multimodal Feature Learning

The proposed method adds a separate multi-layer projection head on top of each backbone defined in Sec. III-A. All projection heads have the same structure: they are composed of fully-connected layers (FC), where the first layer is followed by a ReLU activation function (FC+ReLU+FC). This structure is motivated by SimCLR [chen2020simple], which shows that a nonlinear projection head contributes more to the performance than a linear projection head, and even more so compared to not including any projection layer.

We adopt CLIP-style training [radford2021learning], without using any labels of the downstream task (i.e., emotion recognition). Given a data sample represented by a sequence of observations in multiple modalities, our aim is to make the embeddings of two modalities of the same sequence (positives) close to each other, and to push the embeddings of the same two modalities of different sequences (negatives) apart from each other. This is repeated for all possible pairs of modalities. Notice that negative samples might belong to the same class (i.e., exhibit the same emotion). However, we assume here that the class labels are not available, and we resort to instance discrimination with contrastive learning, which encourages the model to produce invariant representations and align the latent spaces of all the modalities.

More formally, the contrastive loss function for a pair of modalities $(m_1, m_2)$ has the following form:

$$\ell_i^{(m_1, m_2)} = -\log \frac{\exp\left(\mathrm{sim}(z_i^{m_1}, z_i^{m_2})/\tau\right)}{\sum_{k=1}^{N} \mathbb{1}_{[k \neq i]} \exp\left(\mathrm{sim}(z_i^{m_1}, z_k^{m_2})/\tau\right)} \qquad (1)$$

where $z$ denotes the embedding after the projection, $i$, $k$ are indices of samples in the current batch of size $N$, $\tau$ is the temperature parameter (scalar), $\mathbb{1}_{[k \neq i]}$ is an indicator function evaluating to 1 iff $k \neq i$, and $\mathrm{sim}(u, v)$ denotes the dot product between $\ell_2$-normalized vectors $u$ and $v$ (i.e., cosine similarity). Eq. (1) is computed across all samples in the batch, resulting in $\mathcal{L}^{(m_1, m_2)}$. In addition, we minimize this loss for each possible pair of modalities. Notice that, since the negatives are drawn from only one modality (see the denominator in Eq. (1)), the loss is asymmetric, i.e., $\mathcal{L}^{(m_1, m_2)}$ is not equal to $\mathcal{L}^{(m_2, m_1)}$. Therefore, our final loss function (Eq. (2)) includes the loss obtained from all the permutations of two elements drawn with replacement from the set of modalities $\mathcal{M}$, denoted $\mathcal{P}$:

$$\mathcal{L} = \frac{1}{|\mathcal{P}|} \sum_{(m_1, m_2) \in \mathcal{P}} \mathcal{L}^{(m_1, m_2)} \qquad (2)$$

Note that we found empirically that only contrasting different modalities (i.e., when $m_1 \neq m_2$) produces better representations. In addition, we apply temporal augmentations (see Sec. III-D for details) to the sequences in order to avoid overfitting and improve performance.
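The pairwise loss described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: it assumes an InfoNCE-style form in which the positive is the same sample seen through the second modality, the negatives are the other samples of that second modality (i.e., the indicator excludes the anchor's own index), the per-sample terms are averaged over the batch, and the final loss is averaged over ordered pairs of distinct modalities. The function names `pair_loss` and `total_loss` are hypothetical.

```python
import math


def cosine_sim(u, v):
    """Dot product between L2-normalized vectors u and v."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def pair_loss(z_a, z_b, tau=0.07):
    """Batch mean of -log(exp(s_ii / tau) / sum_{k != i} exp(s_ik / tau)).

    z_a, z_b: lists of projected embeddings (one per sample) for two modalities.
    Assumption: the denominator excludes k == i, and the batch has >= 2 samples.
    """
    n = len(z_a)
    total = 0.0
    for i in range(n):
        pos = cosine_sim(z_a[i], z_b[i]) / tau
        neg = [cosine_sim(z_a[i], z_b[k]) / tau for k in range(n) if k != i]
        # Log-sum-exp trick keeps the exponentials numerically stable.
        m = max(neg)
        log_denom = m + math.log(sum(math.exp(x - m) for x in neg))
        total += log_denom - pos
    return total / n


def total_loss(embeddings, tau=0.07):
    """Average pair_loss over all ordered pairs of distinct modalities."""
    names = list(embeddings)
    pairs = [(a, b) for a in names for b in names if a != b]
    return sum(pair_loss(embeddings[a], embeddings[b], tau) for a, b in pairs) / len(pairs)


# Toy batch of 3 samples: the "aligned" modality matches, the "shuffled" one does not.
aligned = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shuffled = [aligned[1], aligned[2], aligned[0]]
loss_aligned = pair_loss(aligned, aligned)
loss_shuffled = pair_loss(aligned, shuffled)
```

In this toy batch the correctly paired embeddings yield a much lower loss than the shuffled ones, which is the behaviour the objective is meant to reward. Because the positive is left out of the denominator under this assumption, the loss value can be negative.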

III-C Discrete Emotion Recognition

Following common practice [chen2020simple, qian2021spatiotemporal], in order to perform the downstream task (i.e., discrete emotion recognition), we discard the projection layers (described in Sec. III-B) and use the 512-dimensional feature representation extracted from each backbone. The extracted features are concatenated (e.g., for 3 modalities, the combined vector holds 3×512 = 1536 features) and given to a prediction layer that shares the same design as the projection heads (i.e., FC+ReLU+FC) and whose output is the emotion classes. The aforementioned prediction layers are trained with the emotion labels using the cross-entropy loss and a variant of it (see Sec. IV-B for details).

III-D Implementation Details
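As a rough illustration of this inference path, the sketch below concatenates three 512-dimensional feature vectors and passes them through an FC+ReLU+FC head in plain Python. The hidden width (64), the random initialization, and the names `fuse`, `init_linear` and `predict_logits` are assumptions made for the example; the text does not specify them. The 8 output classes correspond to the RAVDESS label set.

```python
import random


def linear(x, weights, bias):
    """Fully-connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_linear(n_in, n_out, rng):
    """Hypothetical uniform initialization, scaled by the fan-in."""
    scale = 1.0 / n_in ** 0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def fuse(features_per_modality):
    """Late fusion at inference: plain concatenation of backbone features."""
    return [v for feats in features_per_modality for v in feats]


def predict_logits(features_per_modality, head):
    (w1, b1), (w2, b2) = head
    return linear(relu(linear(fuse(features_per_modality), w1, b1)), w2, b2)


rng = random.Random(0)
feature_dim, hidden_dim, num_classes, num_modalities = 512, 64, 8, 3
head = (
    init_linear(num_modalities * feature_dim, hidden_dim, rng),
    init_linear(hidden_dim, num_classes, rng),
)
# Stand-ins for the frozen backbone outputs of one sample.
features = [[rng.gauss(0.0, 1.0) for _ in range(feature_dim)] for _ in range(num_modalities)]
fused = fuse(features)
logits = predict_logits(features, head)
```

Only the head would be trained with labels in this setup; the backbone features stay frozen, which is what keeps the feature learning stage unsupervised.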

The training is performed with the SGD optimizer with a momentum of 0.9 and a weight decay of 0.001. All models are trained with a batch size of 32 (or 64), while the batch size of our downstream task is 64 (or 128). The learning rate is initialized as 0.001. We use a scheduler to vary the learning rate over the training process such that every 5 epochs for CMU-MOSEI [Zadeh2018MultimodalLA] and every 100 epochs for RAVDESS [Livingstone2018], we multiply the learning rate by 0.9 (notice that the RAVDESS dataset is much smaller than CMU-MOSEI). We do not apply any “spatial” data augmentation (e.g., random crops, blurs or color distortions), but data sampling can produce overlapping sequences. For example, a video segment covering one time interval and another video segment covering a partially overlapping interval can be used in the same training run. This is referred to as augmentation in the temporal dimension. We set the number of epochs to 2000, but we also define a patience parameter such that if the validation performance does not change after 100 consecutive epochs, we stop the training. In practice, the maximum number of epochs was never reached because the patience parameter stopped the training earlier. The temperature scalar is set to 0.07.
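The learning-rate schedule and the patience rule described above can be sketched as follows. This is an illustrative reading, not the authors' code: `learning_rate` and `stopping_epoch` are hypothetical names, and "does not change" is interpreted here as "does not improve on the best validation score so far".

```python
def learning_rate(epoch, base_lr=0.001, gamma=0.9, step=5):
    """Multiply the rate by gamma once every `step` epochs (step=100 for RAVDESS)."""
    return base_lr * gamma ** (epoch // step)


def stopping_epoch(val_scores, patience=100, max_epochs=2000):
    """Return the 0-based epoch index at which training would stop.

    Assumption: training stops once `patience` consecutive epochs pass
    without a new best validation score.
    """
    best = None
    stale = 0
    scores = val_scores[:max_epochs]
    for epoch, score in enumerate(scores):
        if best is None or score > best:
            best, stale = score, 0
        else:
            stale += 1
        if stale >= patience:
            return epoch
    return len(scores) - 1


# Toy run: the score peaks at epoch 0 and never improves again.
stop = stopping_epoch([0.5] + [0.4] * 10, patience=3)
```

With a patience of 3 the toy run stops at epoch 3; with the reported settings (patience 100, at most 2000 epochs) the same rule would end training well before the epoch cap whenever validation performance plateaus.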

Methods | Actor Split | Facial Images | Acoustics | Facial Landmarks | ACC (%)
Unimodal 60.80
Unimodal 58.50
Unimodal 62.05
Late Fusion 64.10
Attention Mec. 65.40
Ours 63.78
Ours 77.10
Ours 78.54
Unimodal 72.80
Unimodal 75.90
Unimodal 76.35
Late Fusion 80.72
Attention Mec. 81.80
Ours 80.32
Ours 89.50
Ours 93.17
TABLE I: Results of the proposed and baseline methods on RAVDESS dataset [Livingstone2018] in terms of accuracy (ACC).
Methods Happy Sad Anger Surprise Disgust Fear Overall
w-ACC F1 w-ACC F1 w-ACC F1 w-ACC F1 w-ACC F1 w-ACC F1 w-ACC F1
Late Fusion 59.71 60.17 54.17 27.97 54.58 34.58 50.01 3.31 54.29 34.10 54.92 22.83 54.60 30.50
Attention Mec. 61.27 61.61 55.80 36.09 54.92 37.06 50.34 5.66 55.84 44.15 57.25 43.71 55.90 38.00
Ours wout/ text 63.96 61.84 50.71 12.41 54.88 26.59 50.30 2.76 58.37 35.44 54.79 27.56 55.50 27.77
Ours 68.82 69.20 62.93 55.70 67.91 70.09 62.93 72.73 72.91 74.25 64.49 74.85 66.70 69.50
TABLE II: Results of the proposed and the baseline methods on CMU-MOSEI [Zadeh2018MultimodalLA] in terms of weighted accuracy (w-ACC) and F1 measure. wout/ text stands for the experiments when the text modality is not used while all other modalities are used.

IV Experiments and Results

IV-A Datasets and Evaluation Metrics

We used the speech part of the RAVDESS dataset [Livingstone2018], containing 2880 audio-visual recordings acted by 24 professional actors pronouncing two lexically identical statements. Each recording was labeled with one of eight categorical emotions (anger, happiness, disgust, fear, surprise, sadness, calmness and neutral), and the emotions were expressed at two intensity levels (normal or strong). RAVDESS is class-balanced except for the neutral class, which was elicited 50% less often than the other emotion classes. We adopted two cross-validation settings following the methods [GhalebPopaAl19, GhalebNiehuesAl20, RadoiBirhalaAl21, SongCaiAl21, ShirianTripathiAl21, BhavanChauhanAl19, BeardDasAl18, JaratrotkamjornChoksuriwong19, AbdullahAhmadAl20, TiwariRathodAl21]. The first setting considers the identities of the actors such that the training (validation) and the corresponding testing folds have no overlap in terms of actors (shown as actor-split=✓ hereafter). The second setting, instead, applies standard k-fold cross-validation (i.e., actor-split=✗). In both settings, k was set to 10 and the reported results are in terms of accuracy (ACC) averaged over the 10 folds, allowing fair comparisons with the MER SOTA [GhalebPopaAl19, GhalebNiehuesAl20, RadoiBirhalaAl21, SongCaiAl21, BeardDasAl18, JaratrotkamjornChoksuriwong19, TiwariRathodAl21]. As the same statements are repeated by the actors in the RAVDESS dataset [Livingstone2018], the proposed method (as well as the SOTA) is based only on the visual and acoustic modalities.
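The actor-split setting amounts to a group-wise k-fold split in which all recordings of an actor fall into the same fold. A minimal sketch, assuming a simple round-robin assignment of actors to folds (the actual fold assignment used in the experiments is not stated) and a hypothetical helper name `actor_split_folds`:

```python
def actor_split_folds(sample_actors, k=10):
    """Build k (train_idx, test_idx) pairs with no actor shared between them.

    sample_actors: actor id of each recording, in dataset order.
    Assumption: actors are assigned to folds round-robin by sorted id.
    """
    actors = sorted(set(sample_actors))
    fold_of_actor = {actor: idx % k for idx, actor in enumerate(actors)}
    folds = []
    for fold in range(k):
        test_idx = [i for i, a in enumerate(sample_actors) if fold_of_actor[a] == fold]
        train_idx = [i for i, a in enumerate(sample_actors) if fold_of_actor[a] != fold]
        folds.append((train_idx, test_idx))
    return folds


# Toy stand-in for the dataset: 24 actors with 5 recordings each.
sample_actors = [actor for actor in range(1, 25) for _ in range(5)]
folds = actor_split_folds(sample_actors, k=10)
```

With 24 actors and 10 folds, each test fold holds the recordings of two or three actors. The standard (actor-split=✗) setting would instead shuffle individual recordings, so the same actor can appear on both sides of a split.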

Methods | Actor Split | Feature Learning | ACC (%)
Human performance [Livingstone2018] - - 80.00
Ghaleb et al. [GhalebPopaAl19] Supervised 67.70
Ghaleb et al. [GhalebNiehuesAl20] Supervised 69.40
Ghaleb et al. [GhalebNiehuesAl20] (w/ATT) Supervised 76.30
Radoi et al. [RadoiBirhalaAl21] Supervised 78.70
Ours Unsupervised 78.54
Beard et al. [BeardDasAl18] Supervised 58.30
Song et al. [SongCaiAl21] Supervised 90.00
Tiwari et al. [TiwariRathodAl21] Supervised 93.30
Ours Unsupervised 93.17
TABLE III: Results of the proposed method and the SOTA MER methods tested on RAVDESS [Livingstone2018]. ATT stands for attention mechanism.

CMU-MOSEI [Zadeh2018MultimodalLA] is the largest multimodal in-the-wild dataset in the MER domain. It consists of more than 23K utterances, belonging to more than 1000 speakers, collected from YouTube videos. Each utterance is labeled with six emotions: happiness, sadness, anger, fear, disgust, and surprise, with a [0,3] Likert scale for the presence of each emotion class. Following [Zadeh2018MultimodalLA, DaiCahyawijayaAl21, DelbrouckTitsAl20, MittalBhattacharyaAl20, ShenoySardanaAl20, ChauhanAkhtarAl19, TsaiMaAl20, ZhangLiangqingAl19, WenYouAl21, HoYangAl20], the emotions were treated as either present or not present (i.e., binary classification), while more than one emotion can be present at the same time, making the task a multi-label problem. There exist approximately 3000 sequences that are not correctly aligned across the modalities. As our approach does not require strict data alignment, we used all sequences as supplied in the CMU-MOSEI SDK [CMU_MOSEI_SDK]. In other words, we did not apply any data cleaning, e.g., as in [DaiCahyawijayaAl21]. We also used the recommended dataset split and the evaluation metrics in [Zadeh2018MultimodalLA], namely weighted accuracy [Tong2017] (w-ACC) and F1-measure.
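For a single emotion class, the two metrics can be computed from binary labels as sketched below. Two assumptions are made here and should be checked against the cited references: an emotion is counted as present when its Likert score is above 0, and weighted accuracy is taken as the mean of the true-positive and true-negative rates. The helper names are hypothetical.

```python
def binarize(likert_score):
    """Assumption: an emotion is 'present' when its [0, 3] Likert score exceeds 0."""
    return 1 if likert_score > 0 else 0


def confusion_counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def weighted_accuracy(y_true, y_pred):
    """Mean of true-positive rate and true-negative rate (assumed definition)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return 0.5 * (tpr + tnr)


def f1_measure(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy example for one emotion class over six utterances.
y_true = [binarize(s) for s in [2.0, 0.33, 0.0, 0.0, 0.0, 0.0]]
y_pred = [1, 0, 0, 0, 0, 1]
w_acc = weighted_accuracy(y_true, y_pred)
f1 = f1_measure(y_true, y_pred)
```

Under this definition a classifier that always predicts "not present" scores 0.5 on weighted accuracy regardless of class imbalance, which is why the metric is preferred over plain accuracy for rare emotions such as surprise or fear.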

Methods Happy Sad Anger Surprise Disgust Fear Overall
w-ACC F1 w-ACC F1 w-ACC F1 w-ACC F1 w-ACC F1 w-ACC F1 w-ACC F1
Unsupervised Feature Learning Methods
CAE-LR [Koromilas2021UnsupervisedML] 64.70 65.60 53.20 55.60 61.80 61.90 57.10 70.70 69.00 70.10 60.40 69.20 61.03 65.52
Ours 68.82 69.20 62.93 55.70 67.91 70.09 62.93 72.73 72.91 74.25 64.49 74.85 66.70 69.50
Fully Supervised Methods
MESM [DaiCahyawijayaAl21] 64.10 72.30 63.00 46.60 66.80 49.30 65.70 27.20 75.60 56.40 65.80 28.90 66.80 46.80
Zhang et al. [ZhangLiangqingAl19] 71.70 64.30 66.60 62.30 72.50 64.60 67.00
FE2E [DaiCahyawijayaAl21] 65.40 72.60 65.20 49.00 67.00 49.60 66.70 29.10 77.70 57.10 63.80 26.80 67.60 47.40
Graph-MFN [Zadeh2018MultimodalLA] 66.30 66.30 60.40 66.90 62.60 72.80 53.70 85.50 69.10 76.60 62.00 89.90 62.35 76.33
Delbrouck et al. [DelbrouckTitsAl20] 64.00 67.90 74.70 86.10 83.60 84.00 76.72
Huynh et al. [HuynhVan2021] 62.70 63.00 54.40 69.70 59.60 74.30 50.60 85.70 66.00 81.30 52.90 86.40 57.70 76.73
Khare et al. [KhareParthasarathyAl21] 68.10 68.20 64.30 72.40 67.30 74.80 65.10 87.70 73.60 82.40 63.00 86.60 66.90 78.68
CIA [ChauhanAkhtarAl19] 51.90 71.30 61.80 72.90 67.40 74.70 58.20 86.00 74.10 81.80 63.90 87.80 62.88 79.08
Tsai et al. [TsaiMaAl20] 71.00 71.00 75.00 72.10 78.30 75.00 90.50 86.10 83.00 82.50 91.70 87.80 81.58 79.08
Wen et al. [WenYouAl21] 72.50 72.60 75.60 70.70 77.10 74.90 90.60 86.10 85.00 83.20 91.70 87.80 82.08 79.22
Shenoy et al. [ShenoySardanaAl20] 70.00 68.40 76.10 74.50 83.10 80.90 87.40 84.00 90.30 87.30 89.70 87.00 82.77 80.35
M3ER [MittalBhattacharyaAl20] 78.00 87.30 81.60 93.20 84.40 91.80 86.05
TABLE IV: Performance comparisons among the proposed method and the SOTA MER methods tested on CMU-MOSEI [Zadeh2018MultimodalLA] dataset. The results that our method surpasses are given in yellow.

IV-B Comparisons with the Baseline Methods

We compare the proposed approach with the following baseline methods. These baselines are all supervised: cross-entropy and binary cross-entropy losses were used for RAVDESS [Livingstone2018] and CMU-MOSEI [Zadeh2018MultimodalLA], respectively. The corresponding results are given in Tables I and II.
Unimodal Learning. Each modality was trained with its associated backbone (described in Sec. III-A) followed by two fully connected (FC) layers with a ReLU activation function. The best results were obtained with the following parameter settings. For acoustic data, the learning rate was initialized to 0.001 and decreased by multiplying it by 0.9 every 10 epochs. The batch size was 32 and the number of epochs was 100. For facial images, the learning rate was 0.01, the number of epochs was 150 and the momentum was 0.9. For facial landmarks, the learning rate was 0.001, the momentum was 0.9 and the number of epochs was set as in the proposed method, using the patience parameter.
Late Fusion. Recall that late fusion was applied by several SOTA methods, e.g., [RadoiBirhalaAl21, SongCaiAl21, ShenoySardanaAl20]. Given the modalities and the backbones described, we concatenated the feature embeddings of each modality, and fed them to a shallow network composed of two FC layers with a ReLU activation function. The batch size was 32, the number of epochs was set by the patience parameter, and the learning rate and momentum were 0.001 and 0.9, respectively.
Attention Mechanism. As mentioned in Sec. II, attention mechanisms have been frequently applied in MER, hence we adopted one as a baseline too. We first concatenated the feature embeddings obtained from each modality (512 features extracted from each backbone, as in our method) and then applied the multi-head attention mechanism of [Vaswani2017]. The batch size was 64, the learning rate was 0.001, and the number of epochs was set to 2000 with the patience parameter described in Sec. III-D. The same scheduler as in the proposed method was used.

As seen in Table I, our unsupervised feature learning method outperforms all of the supervised baselines when acoustic and facial landmarks are involved. It is notable that, in the visual domain, the facial landmarks are more effective than the facial images. Out of all baseline methods, late fusion and attention mechanism surpass the unimodal setups, while attention mechanism achieves slightly better results than the late fusion. Overall, all methods perform better in the actor-split=✗ setting compared to their actor-split=✓ counterpart. This is perhaps as a result of having more training data in the actor-split=✗ setting. With reference to Table I, we have further investigated the contribution of used modalities with respect to different emotions by inspecting the confusion matrices. Our observation is that there is no particular modality or a pair of modality which performs better for a specific emotion class(es).

Given the better performance of late fusion and the attention mechanism compared to unimodal learning in Table I, we carried them over to the CMU-MOSEI dataset [Zadeh2018MultimodalLA], where four modalities (text, facial images, acoustic and facial landmarks) are used. Additionally, to investigate the contribution of the text modality, we compare the results of the proposed method with its performance when the text is discarded (shown as wout/ text). The corresponding results can be seen in Table II. Our method outperforms the baselines for all emotion classes (especially for surprise) as well as on average. Also, the performance of our method does not fluctuate across emotion classes, suggesting that it generalizes better than the baseline methods. Overall, there is a drop of 11.2% and 41.73% in w-ACC and F1-measure, respectively, when the text modality is discarded from the pipeline of the proposed method, showing the positive contribution of the text modality.

IV-C Comparisons with the State-of-the-art Methods

We compare our approach with several SOTA MER methods. For RAVDESS [Livingstone2018], the performances are given in Table III. The fact that “human performance” is not 100% reflects the difficulty of the MER task. It is remarkable that our approach surpasses several supervised competitors [GhalebPopaAl19, GhalebNiehuesAl20, BeardDasAl18, SongCaiAl21] by a margin of 2-35%, despite working in a more difficult (unsupervised) setting. It also performs on par with the supervised approaches [RadoiBirhalaAl21, TiwariRathodAl21]. The results for CMU-MOSEI [Zadeh2018MultimodalLA] are given in Table IV. There exists a very recent unsupervised feature learning approach, CAE-LR [Koromilas2021UnsupervisedML], which was tested on CMU-MOSEI [Zadeh2018MultimodalLA] for multimodal sentiment analysis and achieved the best results compared to other unsupervised counterparts. Motivated by this, we adapted the authors’ code for MER. Instead of applying Logistic Regression, we performed Linear Evaluation [UHAR_BMVC2021], which is the common protocol for unsupervised learning when the downstream task is classification (note that we apply it for the proposed method as well, i.e., the prediction layer). For all emotion classes and overall, our method achieves much better results than CAE-LR [Koromilas2021UnsupervisedML], showing the effectiveness of the contrastive loss in the multimodal setting compared to convolutional autoencoders. It is worth noting that, on average, our method is better than several fully supervised techniques: MESM [DaiCahyawijayaAl21], FE2E [DaiCahyawijayaAl21], Graph-MFN [Zadeh2018MultimodalLA], [HuynhVan2021], CIA [ChauhanAkhtarAl19]. Considering that these methods integrate relatively complex supervised techniques (attention mechanisms, transformers, graphs), the better performance of our method is very promising.

V Conclusion

We presented an unsupervised multimodal feature learning approach, which was tested on discrete emotion recognition. Our method is a pioneer in the MER literature, being based on pairwise contrastive learning. Experiments show that our approach performs better than the supervised baselines and its unsupervised counterpart, while being competitive with several complex supervised SOTA methods and even surpassing a few. Being an unsupervised feature learning method, the proposed approach is transferable to other domains without retraining (not even tuning) the representation model itself.

The proposed method keeps the modality pairings the same for all data (i.e., emotions) and the way we learn the features gives equal importance to each modality. An alternative could be having different modality pairings for different emotion classes. This will be further investigated as future work.


This work was supported by the EU H2020 SPRING project (No. 871245) and by Fondazione VRT.