Interpretability for Multimodal Emotion Recognition using Concept Activation Vectors

by   Ashish Ramayee Asokan, et al.
PES University

Multimodal Emotion Recognition refers to the classification of input video sequences into emotion labels based on multiple input modalities (usually video, audio and text). In recent years, Deep Neural networks have shown remarkable performance in recognizing human emotions, and are on par with human-level performance on this task. Despite the recent advancements in this field, emotion recognition systems are yet to be accepted for real world setups due to the obscure nature of their reasoning and decision-making process. Most of the research in this field deals with novel architectures to improve the performance for this task, with a few attempts at providing explanations for these models' decisions. In this paper, we address the issue of interpretability for neural networks in the context of emotion recognition using Concept Activation Vectors (CAVs). To analyse the model's latent space, we define human-understandable concepts specific to Emotion AI and map them to the widely-used IEMOCAP multimodal database. We then evaluate the influence of our proposed concepts at multiple layers of the Bi-directional Contextual LSTM (BC-LSTM) network to show that the reasoning process of neural networks for emotion recognition can be represented using human-understandable concepts. Finally, we perform hypothesis testing on our proposed concepts to show that they are significant for interpretability of this task.




I Introduction

Footnote: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

Research on Machine Learning (ML) systems has grown rapidly in recent years, with applications ranging from day-to-day use cases such as personal assistants and search engines to highly regulated domains involving high-risk decision-making such as medical diagnosis and autonomous driving. The increasing availability of large databases and hardware resources to train such complex ML systems has resulted in state-of-the-art performance across a wide range of tasks. However, despite these advancements, ML systems still lack transparency, i.e., the internal reasoning process of these models is hidden from the user, which prevents humans from verifying the decisions made by these black box models [1]. Došilović et al.[2] highlight the fact that Deep Neural Networks (DNNs) are criticized for serving only as approximations of a decision-making system whose decisions cannot be trusted. Therefore, these black box models must satisfy several assurances such as justifiability, usability, reliability, etc., for a practicable ML system.

Interpretability for Machine Learning can be defined as the extent to which a model’s decisions can be consistently predicted or accounted for [3]. According to Carvalho et al., the taxonomy of interpretability methods is based on (i) when the methods are applicable (Pre-Model, In-Model, Post-Model), (ii) whether the model is trained with a complexity constraint or analysed post-training (Intrinsic vs Post hoc), and (iii) whether the interpretation is based on the model architecture (Model-Specific vs Model-Agnostic). There is often a trade-off between model complexity and model interpretability, i.e., the more complex a model is, the harder it is to interpret its decisions. This is especially the case with Intrinsic methods, where the model is trained with an additional complexity constraint to ensure effective interpretability, which affects model performance. In contrast, Post hoc and Post-Model methods provide interpretability after training, thereby ensuring no loss in performance.

With a better understanding of human emotions and the increasing availability of large emotion databases, emotion recognition has become an emerging research area in recent years. Emotions can be defined as a psycho-physiological process that is initiated by interaction with (or perception of) people or situations, with varying motivation and mood [4]. Emotion Recognition can be done using various modalities such as speech, text, EEG signals and facial expressions, among which facial expressions are more widely adopted due to easier availability of these datasets. Even though a large amount of work has been done in this field, emotion recognition is often challenging due to intra-class variance, i.e., variations in emotions among different ethnicities, cultures and age groups. In practice, it is observed that multimodal approaches are more robust to intra-class variance and are often adopted by clinicians and psychologists.


Multimodal Emotion Recognition finds its application in the healthcare industry to provide a preliminary assessment of a patient’s emotional state. Such systems have been used in a clinical setting for the diagnosis of medical conditions such as Schizophrenia and Autism [6]. Due to the limited exploration of interpretability for emotion recognition by prior work, we address this problem using Concept Activation Vectors (CAVs) to determine which concepts a model uses to recognize human emotions. Based on cues used by clinicians to recognize emotions, we define appropriate concepts with the publicly available IEMOCAP multimodal database and evaluate their significance on the Bi-Directional Contextual LSTM network. In summary, our contributions are as follows:

  • We extend the existing Testing with Concept Activation Vectors (TCAV) method to video, audio and text input, which is yet to be explored.

  • We propose novel human-understandable concepts for interpreting multimodal emotion recognition and evaluate the significance of the same.

II Related Works

This section provides an overview of the recent literature on Interpretable AI and Emotion Recognition.

II-A Interpretable AI

Interpretability aims to explain the reasoning process of DNNs through human-understandable terms to facilitate robustness and impartiality in decision-making. In addition to the broad classification of interpretability methods outlined in Sec. I, the sub-classes of methods also include Feature Attribution Methods, and Concept-based Methods, discussed in detail below.

II-A1 Feature Attribution Methods

Feature attribution methods attempt to explain each individual prediction by determining the effect (positive or negative) of each input feature on the prediction. Local Interpretable Model-Agnostic Explanations (LIME) [7] and SHapley Additive exPlanations (SHAP) [8] are some of the most well-known general feature attribution methods. LIME attempts to construct interpretable classifiers on a perturbed dataset to interpret a given model, and SHAP proposes a method to compute an additive feature attribution score with desirable properties. A special case of feature attribution is Pixel Attribution (Saliency Maps), which highlights relevant pixels for each individual prediction in image classification. The methods discussed here are Grad-CAM, SmoothGrad and Integrated Gradients. Grad-CAM [9] highlights the important regions of an input image for an individual prediction using the gradients of the final convolutional layer to generate an activation map. SmoothGrad [10] attempts to improve the visual quality of gradient-based sensitivity maps by averaging those of noisy versions of the input image. Integrated Gradients [11] provides pixel-level attribution by computing the path integral of the gradients between a baseline input and the regular input.
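As an illustration of the path-integral idea behind Integrated Gradients, the following is a minimal sketch in plain Python. It is not the implementation from [11]: the toy model, its analytic gradient and the function names are assumptions made for this example, and the integral is approximated with a midpoint Riemann sum over a straight-line path from the baseline to the input.

```python
from typing import Callable, List, Sequence


def integrated_gradients(
    grad_fn: Callable[[Sequence[float]], List[float]],
    x: Sequence[float],
    baseline: Sequence[float],
    steps: int = 50,
) -> List[float]:
    """Approximate Integrated Gradients with a midpoint Riemann sum."""
    total = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps  # midpoint of the s-th sub-interval of [0, 1]
        # Point on the straight line between the baseline and the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = grad_fn(point)
        total = [t + g for t, g in zip(total, grad)]
    # Average gradient along the path, scaled by (input - baseline).
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


# Hypothetical toy model F(x) = x0^2 + 3 * x1 with its analytic gradient.
def toy_model(x: Sequence[float]) -> float:
    return x[0] ** 2 + 3.0 * x[1]


def toy_grad(x: Sequence[float]) -> List[float]:
    return [2.0 * x[0], 3.0]


x = [2.0, 1.0]
baseline = [0.0, 0.0]
attributions = integrated_gradients(toy_grad, x, baseline)
```

For this toy model the attributions sum to the difference between the model output at the input and at the baseline, which is the completeness property that motivates the method.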

II-A2 Concept-based Methods

Concept-based methods aim to address interpretability in DNNs by extracting human-understandable concepts from a model’s latent representations. Liu et al.[12] propose a model distillation method based on unsupervised clustering that produces an Intrinsic (interpretable by design) surrogate model. Kim et al.[13] introduce Concept Activation Vectors (CAVs) that use directional derivatives to represent human-understandable concepts from a model’s activations and quantify the influence of a concept on the predictions of a single target class. Pfau et al.[14] build on TCAV by providing global and local conceptual sensitivities and accounting for the non-linear influence of concepts on a model’s predictions. Lucieri et al.[15] explore TCAV in the context of skin lesions classification using an InceptionV4 model built by the REasoning for COmplex Data (RECOD) Lab. Ghorbani et al.[16] propose Automatic Concept-based Explanations (ACE) - a novel method that uses image segmentation and clustering to extract visual concepts used by a model.

II-B Emotion AI

Emotion AI deals with the detection and interpretation of emotive channels involved in human communication. A considerable portion of Emotion AI research in recent years has dealt with performance improvements on the emotion recognition task through novel DNN architectures. Tripathi et al.[4] explore multimodal emotion recognition on the IEMOCAP database using speech, text and motion capture features. Mittal et al.[17] propose a novel fusion method to combine the face, text and speech modalities that is impervious to noise. Krishna et al.[18] propose a cross-modal attention mechanism that uses audio and text features for emotion recognition. Majumder et al.[19] and Poria et al.[20] explore hierarchical contextual feature extraction for emotion recognition, which we adopt in this work. Comprehensive lists of the architectures for Multimodal Emotion Recognition (MER) and Emotion Recognition in Conversation (ERC) are provided in the IEMOCAP MER Benchmark and the IEMOCAP ERC Benchmark respectively; our main focus is the MER task.

Interpretability for multimodal emotion recognition has been explored with intrinsic and post-hoc methods, primarily through EEG signals and speech input. Quing et al.[21] explore interpretable EEG-based emotion recognition using Emotional Activation Curves and evaluate their results on the DEAP and SEED datasets. Liu et al.[22] propose a Gated Bi-directional Alignment Network that effectively captures speech-text relations, and an interpretable Group Gated Fusion (GGF) layer that determines the significance of each modality through contribution weights. Mayou et al.[23] propose a SincNet-based network for emotion classification with EEG signals that is interpreted by inspecting the filters learned by the model. Nguyen et al.[24] introduce a novel DNN architecture for multimodal emotion recognition and use non-linear Gaussian Additive Models to interpret it.

A thorough survey of relevant literature showed that concept-based interpretation of multimodal emotion recognition is yet to be explored, and we attempt to address this gap by extending Concept Activation Vectors [13] to video, audio and textual data. We first define human-understandable concepts specific to emotion recognition based on inferences and observations from [6]. The CAVs are then fitted to the BC-LSTM model’s latent space at the chosen layers to compute the concept sensitivities and TCAV scores for each concept and each target emotion.

III Proposed Methodology

In this section, we discuss the feature extraction method used for multimodal emotion recognition and introduce our human-understandable concepts for interpreting DNNs with Concept Activation Vectors.

Fig. 1: (a) Overview of Methodology for TCAV on Emotion Recognition (b) Bottleneck layers chosen for TCAV on Emotion Recognition

III-A Feature Extraction

In this work, we adopt the feature extraction method described by Poria et al.[20]. It is carried out in 2 stages: Context-Independent Extraction that extracts the features for each input mode (audio, video and text) separately, and Context-dependent Extraction that learns features across utterances for both unimodal and multimodal emotion recognition.

III-A1 Context-Independent Feature Extraction for Unimodal Input

Feature extraction on the unimodal input is done independent of the surrounding utterances and without any contextual information (or dependency). The steps involved in feature extraction for each input mode are discussed in detail below:

  • Text Input. The text inputs used for textual feature extraction are the transcripts of the utterances. As stated by Poria et al., each utterance in a video is represented as a combination of the 300-dimensional word2vec vectors [25] of the words in the utterance. The CNN used for feature extraction consists of 2 convolution layers and a single max-pool layer. The result of the pooling operation is projected onto a dense layer whose output serves as the input textual features for emotion recognition.

  • Video Input. The feature extraction for visual input is done using a 3D-CNN, which is capable of learning features for each frame of the video as well as temporal features across video frames. The video input to the 3D-CNN has the dimensions $(c, f, h, w)$, where $c$ is the number of channels (3 for RGB), $h$ and $w$ are the dimensions of each frame, and $f$ is the total number of frames. The 3D-CNN consists of a convolution layer with a 3D filter and a maxpool layer, followed by a dense layer. The output of this 3D-CNN is a fixed-length vector that represents the utterance-level input visual features.

  • Audio Input. Audio feature extraction is done using the openSMILE open-source software [26], which automatically extracts essential audio features. openSMILE extracts several low-level features such as pitch, intensity, MFCC, etc. These features serve as the input audio features for the emotion recognition model.

III-A2 Context-Dependent Feature Extraction

The contextual features are extracted using the Contextual-LSTM architecture proposed by Poria et al., which is a part of their Bi-directional Contextual LSTM network (Fig. 1b). The intuition behind this architecture is that the surrounding utterances can provide essential information for the classification of the current utterance, thereby requiring a model that takes such dependencies into consideration. Let $x_t$ represent the input features for utterance $t$ and $h_t$ represent the output of the LSTM network for utterance $t$. The output $h_{t+1}$ for the next utterance depends on $x_{t+1}$ as well as the output $h_t$ of the previous LSTM step, which represents the learning of contextual information. This contextual-LSTM module is used for unimodal and multimodal feature extraction.
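The dependence of each utterance's output on its neighbours can be illustrated with a toy recurrence. The sketch below is an assumption-laden simplification, not the Contextual-LSTM of Poria et al.: it uses a plain tanh recurrent cell with hand-picked scalar weights (`w_x`, `w_h`) in place of an LSTM, and one scalar feature per utterance. It only shows how a forward and a backward pass give every utterance access to preceding and following context.

```python
import math
from typing import List, Tuple


def recurrent_pass(xs: List[float], w_x: float = 0.8, w_h: float = 0.5) -> List[float]:
    """Run h_t = tanh(w_x * x_t + w_h * h_{t-1}) over one video's utterances."""
    h_prev = 0.0
    outputs = []
    for x_t in xs:
        h_prev = math.tanh(w_x * x_t + w_h * h_prev)
        outputs.append(h_prev)
    return outputs


def bidirectional_pass(xs: List[float]) -> List[Tuple[float, float]]:
    """Pair each utterance's forward-context and backward-context outputs."""
    forward = recurrent_pass(xs)
    backward = recurrent_pass(xs[::-1])[::-1]  # process in reverse, then realign
    return list(zip(forward, backward))


utterances = [0.2, -1.0, 0.7]  # one hypothetical scalar feature per utterance
altered = [0.9, -1.0, 0.7]     # only the first utterance differs
out_a = bidirectional_pass(utterances)
out_b = bidirectional_pass(altered)
```

Changing the first utterance changes the forward output of the last utterance, while the backward output of the last utterance is unaffected because it has not yet seen any earlier utterance.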

III-B Interpretability using CAVs

III-B1 Testing with Concept Activation Vectors (TCAV)

To achieve interpretability in terms of human-understandable concepts, Kim et al.[13] proposed Concept Activation Vectors (CAVs), a linear interpretability method that represents a concept with a vector in a neural network’s activation space given a set of positive and negative examples representing the concept. Given a positive examples set $P_c$ and a negative examples set $N_c$ for a concept $c$, a binary classifier is trained to distinguish between the activations $\{f_l(x) : x \in P_c\}$ of the positive examples set and the activations $\{f_l(x) : x \in N_c\}$ of the negative examples set, where $f_l(x)$ represents the neural activation at layer $l$ of a network. The vector normal to the decision boundary of this binary classifier is the CAV $v_c^l$ for the concept $c$ at layer $l$. The Testing with CAV (TCAV) method introduces a metric known as the TCAV score, which represents an ML model’s sensitivity to a particular concept across all inputs of a class label. Given a concept $c$, the TCAV score at a layer $l$ for the examples $X_k$ belonging to class $k$ is given as:

$$\mathrm{TCAV}_{c,k,l} = \frac{\left|\{x \in X_k : S_{c,k,l}(x) > 0\}\right|}{\left|X_k\right|}$$

where $S_{c,k,l}(x)$ represents the directional derivative at layer $l$ for concept $c$ and class $k$, given by $S_{c,k,l}(x) = \nabla h_{l,k}(f_l(x)) \cdot v_c^l$ ($\nabla h_{l,k}(f_l(x))$ is the gradient of the class-$k$ output with respect to the activations at layer $l$). TCAV provides a quantitative measure of conceptual sensitivity across entire input classes and can be extended to input modes other than images.
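The two steps of TCAV, fitting a CAV and scoring a class, can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' code: the logistic-regression classifier trained by stochastic gradient descent, the function names `fit_cav` and `tcav_score`, and the toy 2-D activations and gradients are all assumptions for this example. The score is the fraction of class examples whose directional derivative along the CAV is positive; the per-example gradients are assumed to be precomputed by the network's framework.

```python
import math
from typing import List, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


def fit_cav(pos: List[Vector], neg: List[Vector], epochs: int = 200, lr: float = 0.1) -> Vector:
    """Fit a logistic-regression classifier on layer activations and
    return its unit-norm weight vector as the CAV."""
    w = [0.0] * len(pos[0])
    b = 0.0
    data = [(a, 1.0) for a in pos] + [(a, 0.0) for a in neg]
    for _ in range(epochs):
        for a, y in data:
            z = max(-30.0, min(30.0, dot(w, a) + b))  # clamp to keep exp() stable
            err = 1.0 / (1.0 + math.exp(-z)) - y
            w = [wi - lr * err * ai for wi, ai in zip(w, a)]
            b -= lr * err
    norm = math.sqrt(dot(w, w))
    return [wi / norm for wi in w]


def tcav_score(class_gradients: List[Vector], cav: Vector) -> float:
    """Fraction of class examples with a positive directional derivative."""
    sensitivities = [dot(g, cav) for g in class_gradients]
    return sum(s > 0 for s in sensitivities) / len(sensitivities)


# Hypothetical 2-D activations: the concept varies along the first axis.
positive_acts = [[1.0, 0.2], [1.2, -0.1], [0.8, 0.1], [1.1, -0.2]]
negative_acts = [[-1.0, 0.1], [-0.9, -0.2], [-1.2, 0.2], [-0.8, -0.1]]
cav = fit_cav(positive_acts, negative_acts)

# Hypothetical gradients of one class logit w.r.t. the layer activations.
gradients_for_class = [[1.0, 0.1], [0.5, -0.2], [2.0, 0.0], [-1.0, 0.1]]
score = tcav_score(gradients_for_class, cav)
```

Here three of the four toy gradients point along the concept direction, so the class receives a score of 0.75. In practice any linear classifier can supply the CAV; logistic regression is used here only because it is short to write out.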

III-B2 TCAV for Emotion Recognition

Here, we delineate the concepts used to interpret multimodal emotion recognition models with TCAV. We define a single concept for each of video, audio and text input modes - Variations in Physiognomy, Voice Pitch and Utterance Polarity, which are discussed below:

  • Variations in Physiognomy (Deviation from Neutral Expressions). Emotions such as anger and excitement capture extreme changes in facial expressions compared to the neutral resting face. They are characterized by changes in the facial features such as eye contact, lip movement, etc. According to Grabowski et al.[6], analysis of emotions in a valence/arousal spectrum allows for the distinction between neutral and extreme emotions. Specifically, emotions associated with high arousal are characterized by this concept. Positive examples are those utterances that show extreme variations in facial expressions and negative examples represent the neutral resting face.

  • Voice Pitch. Among the several sound features responsible for emotional prosody, pitch is one of the features essential for emotion recognition, which is defined as the relative highness or lowness of tone perceived by the human ear. Quinto et al.[27] hypothesise that high and low pitch are associated with specific emotions in the speech domain. For instance, anger and excitement are associated with a high pitch while sadness is associated with low pitch. In our work, we assume that anger, frustration and excitement are often associated with high pitch and thereby serve as positive examples for this concept, while negative examples have relatively lower pitch indicating emotions such as sadness or neutral.

  • Utterance Polarity. Since the text inputs used for emotion recognition are the transcripts for the utterances, we use the underlying sentiment (positive or negative) as a concept for interpretability. Each utterance from the input is assigned a polarity score from -1 to 1. Utterances associated with positive emotions such as happiness and excitement have positive polarity and utterances associated with negative emotions have negative utterance polarity.

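Input: Layer $l$, concept set $\mathcal{C}$, concept annotations set $A$, class labels $1, \dots, K$
1:  for $c = 1$ to $|\mathcal{C}|$ do
2:     $(P_c, N_c) \leftarrow$ {data for concept $c$}
3:     reshape the layer-$l$ activations of $P_c$ and $N_c$ from video level to utterance level
4:     $v_c^l \leftarrow$ {CAV}
5:     for $k = 1$ to $K$ do
6:        reshape the layer-$l$ activations of the class-$k$ examples from video level to utterance level
7:        compute the conceptual sensitivities $S_{c,k,l}$
8:        compute the TCAV score for concept $c$, class $k$ and layer $l$
9:     end for
10:  end for
Algorithm 1 TCAV for Emotion Recognition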

The sequence of steps involved in computing conceptual sensitivities for emotion recognition is outlined in Algo. 1. Given the set of concepts $\mathcal{C}$ and the concept annotations set $A$, we wish to compute the CAV $v_c^l$ for each concept along with the TCAV scores for each concept $c$, label $k$ and layer $l$. Similar to the original TCAV method, the activations of the concept examples at layer $l$ are extracted from the network and a binary classifier is fitted to these activations. However, there is an additional step while extending TCAV to emotion recognition, which is the conversion of video-level activations to utterance-level activations (Steps 3,6 in Algo. 1). The raw activations have the dimensions $(N, T, D)$, where $N$ is the number of videos, $T$ is the sequence length and $D$ is the number of features. These activations are reshaped to $(N \cdot T, D)$ so that the first dimension represents the number of utterances. Therefore, the reshaped activations represent the utterance-level activations of the model at layer $l$. A binary classifier is then fitted to the concept sets for each concept $c$ to obtain the CAV $v_c^l$. The conceptual sensitivities and TCAV scores are computed as explained in Sec. III.B.1.
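The conversion from video-level to utterance-level activations amounts to merging the first two axes of the activation array. A minimal sketch with nested Python lists is shown below; the function name `to_utterance_level` and the toy values are assumptions for illustration, and the sketch assumes every video has the same sequence length.

```python
from typing import List

Activations = List[List[List[float]]]  # indexed as [video][utterance][feature]


def to_utterance_level(video_level: Activations) -> List[List[float]]:
    """Merge the video and sequence axes so that each row is one utterance."""
    return [utterance for video in video_level for utterance in video]


# Toy activations: 2 videos, 3 utterances per video, 2 features per utterance.
video_level = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]],
]
utterance_level = to_utterance_level(video_level)
```

The result has 6 rows of 2 features each, with utterances kept in their original order within and across videos. With array libraries the same step is typically a single reshape call, for example `reshape(-1, D)` in NumPy.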

IV Experiments and Results

In this section, we cover the experimental setup used for training the multimodal emotion recognition model and interpreting the same using Concept Activation Vectors.

IV-A Model

To determine the influence of our concepts for emotion recognition, we use the Bi-directional Contextual LSTM network (Fig. 1b) introduced by Poria et al.[20], trained on the IEMOCAP multimodal database (Sec. IV.B). We choose this architecture because it is one of the few simple, speaker-independent multimodal architectures for emotion recognition, which makes interpreting its decisions more convenient. The current state-of-the-art methods [28] [29] for emotion recognition (in conversation) on IEMOCAP use speaker-specific components to enhance performance, which is outside the scope of our work. Contextual Hierarchical Fusion [19] extends the idea of contextual information to 3 hierarchical levels but provides only a marginal improvement over BC-LSTM, thereby making BC-LSTM the appropriate choice for our work. Bi-directional LSTMs are used here to account for contextual information from the preceding and following utterances for emotion classification. Fusion of the modalities is done hierarchically in 2 levels. Level 1 feeds the context-independent features to the contextual-LSTM module to extract context-sensitive unimodal features. These context-sensitive unimodal features are then concatenated and fed to the final contextual-LSTM module to extract context-sensitive multimodal features. For all 6 emotion labels of the IEMOCAP database, the results achieved by the BC-LSTM network on video input, on audio input, on text input and with all 3 inputs combined are in accordance with the results presented in the PapersWithCode IEMOCAP Benchmark.

Fig. 2: (a) Unimodal & Multimodal CAV accuracy for all 3 concepts. (b) TCAV scores for PT concept. (c) TCAV scores for VP concept. (d) TCAV scores for UP concept (Stars indicate insignificant concepts)

IV-B Dataset

The dataset used to train the BC-LSTM network and define human-understandable concepts for TCAV is the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database [30]. It consists of scripted acts and improvisations involving 10 speakers. Each video involves a conversation between 2 subjects divided into several utterances, and each of these utterances is associated with one of the following 6 emotion labels: happy (0), sad (1), neutral (2), angry (3), excited (4) and frustrated (5). To train the BC-LSTM network, we use a 70-30 split for the training and testing sets, i.e., the training set contains 121 videos (4290 utterances) and the testing set contains 31 videos (1208 utterances).

IV-C Experimental Setup

All the experiments are carried out on the BC-LSTM network with concept examples from the IEMOCAP database. The hierarchical components of the BC-LSTM architecture are trained sequentially and separately, i.e., the model is not trained in an "end-to-end" fashion. The unimodal contextual-LSTM modules are trained separately with the Adam optimizer and are frozen while training the multimodal contextual-LSTM module. Based on trials with all the layers of the model, we found that extracting the activations from the contextual-LSTM layer at the unimodal level and the Dense layer at the multimodal level gave the best results. The sample distribution for the 3 concepts is outlined in Table 1. The concept examples are selected from the top videos that have the maximum number of utterances from the emotion labels for positive and negative examples indicated in Table 1.

Concept                          No. of Samples      Labels of Samples
                                 +ve      -ve        +ve       -ve
Variations in Physiognomy (VP)   2200     2200       0,4,5     2
Polarity (UP)                    1361     792        0,2,4     1,3,5
Pitch (PT)                       620      1706       0 4 5 2
TABLE I: Sample distribution for our proposed concepts

Examples for the VP concept are collected solely based on the assumption from Sec III.B. The positive examples set consists of preprocessed utterances belonging to the happy, excited and frustrated emotion classes based on manual inspection of a small subset of videos from the IEMOCAP database. For the PT concept, we use Self-supervised Pitch Estimation proposed by Gfeller et al.[31] to estimate the pitch for every utterance from the concept set and assign positive/negative labels based on a threshold pitch value, i.e., the preprocessed utterance belongs to the positive examples set for PT if the pitch of the utterance exceeds the threshold. To compute the text polarity of utterances, we use the TextBlob Python library that assigns a text polarity of -1 to 1 for each utterance based on a weighted average sentiment score of the words in the utterance.
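The assignment of utterances to positive and negative concept sets can be sketched as simple thresholding. In the sketch below, the function names, the example values and the 200.0 pitch threshold are placeholders chosen for illustration and are not values from this work; the pitch and polarity scores are assumed to be precomputed (for polarity, for instance, through TextBlob's `sentiment.polarity` attribute). How utterances with a polarity of exactly 0 are handled is also an assumption: here they are left out of both sets.

```python
from typing import List, Sequence, Tuple


def split_by_pitch(pitches: Sequence[float], threshold: float) -> Tuple[List[int], List[int]]:
    """Return (positive, negative) utterance indices for the PT concept."""
    positive = [i for i, p in enumerate(pitches) if p > threshold]
    negative = [i for i, p in enumerate(pitches) if p <= threshold]
    return positive, negative


def split_by_polarity(polarities: Sequence[float]) -> Tuple[List[int], List[int]]:
    """Return (positive, negative) utterance indices for the UP concept.
    Utterances with zero polarity are excluded (an assumption of this sketch)."""
    positive = [i for i, p in enumerate(polarities) if p > 0]
    negative = [i for i, p in enumerate(polarities) if p < 0]
    return positive, negative


# Hypothetical per-utterance estimates.
pitches = [310.0, 120.0, 250.0, 180.0]
polarities = [0.6, -0.4, 0.0, 0.1]

pt_pos, pt_neg = split_by_pitch(pitches, threshold=200.0)  # placeholder threshold
up_pos, up_neg = split_by_polarity(polarities)
```

The returned indices can then be used to select the corresponding utterance-level activations when fitting the CAVs.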

To account for variations in the binary classifiers’ initialization and preprocessing of the concept examples [15], the training of the CAVs is repeated 30 times, resulting in 30 different vectors. We evaluate the statistical significance of our concepts by training 50 random CAVs for each layer, with labels assigned at random. We then perform a 2-tailed $t$-test on the TCAV score distributions of the random concepts and the proposed concepts at a significance level $\alpha$.

IV-D Results

Here, we discuss the quantitative evaluation of the CAVs for our proposed concepts through the classifier accuracies, TCAV scores and hypothesis tests for concept significance. Since there is no quantitative method to compare interpretability methods and due to the lack of results for concept-based interpretation of emotion recognition, we evaluate our concepts without any comparison to prior work.

Fig. 2a shows the test accuracies of the classifiers for the CAVs at the unimodal and multimodal levels of the BC-LSTM network trained on the IEMOCAP multimodal database. The overall accuracies are relatively low due to the fact that linear classifiers are used to define the concepts. The graph also shows that there is minimal variation in the classifiers’ accuracies, indicating that the 30 different vectors are consistent for each of the concepts.

IV-D1 Variations in Physiognomy (VP)

The examples used to represent VP are collected based on the general assumption that emotions showing medium to high arousal such as excitement, joy, and anger [6] display a greater deviation from the neutral resting face and can be distinctly identified through variations in visual features. Fig. 2c shows the TCAV scores for VP at the unimodal and multimodal bottleneck layers. We see that the highest scores are observed for the happy and excited classes, while the scores for the sad, angry and frustrated classes are relatively lower, which confirms the assumption stated above. However, the scores for the frustrated and neutral classes are not in line with this assumption. At the multimodal bottleneck, it is observed that VP has a strong influence on the happy and sad classes but a much lower influence on the rest of the classes. The consistently strong influence of VP on the happy class across the unimodal and multimodal bottlenecks is evidence that VP is essential for recognizing happiness. It is also observed that the scores for the sad, excited and frustrated classes are exceptions to the general assumption stated earlier. Since the concept set is not created based on a quantitative measure of variations in facial expressions, it is possible that the inconsistencies in TCAV scores are due to the nature of the concept set, given that it is only an approximation based on the general assumption. Another factor that could contribute to this discrepancy is the relatively poor performance of the BC-LSTM network on video input from IEMOCAP as mentioned in Sec. IV.A.

IV-D2 Voice Pitch (PT)

The pitch of an individual’s voice can be a strong indicator of the expression of specific emotions. Fig. 2b shows the TCAV scores for PT at the unimodal and multimodal bottlenecks. It is observed that PT has the highest influence (0.865 and 0.726) on the anger and frustration classes at the unimodal bottleneck. This observation is consistent with [27], in that the expression of such emotions is associated with high pitch and PT can be used as a distinguishing trait. This, however, is not true for the excitement class, which ideally is characterized by high pitch. At the multimodal level, it is observed that the happy, sad, neutral, and frustrated emotion classes have high scores for PT. This contradicts the general presumption that only emotions with high arousal are associated with high pitch and that pitch can be used as a distinguishing factor. This discrepancy could be due to the nature of the utterances found in the IEMOCAP database. It is observed that some of the utterances for emotions with medium arousal (happy, sad, etc.) have higher pitch than some of the utterances for emotions with high arousal.

IV-D3 Utterance Polarity (UP)

Fig. 2d shows the TCAV scores for UP at the unimodal and multimodal bottleneck layers. From the scores, it is evident that UP has a high influence on all the target emotions except the sad and neutral labels at the unimodal bottleneck, which is as expected. Phrases that depict emotions involving medium to high arousal (intensity) tend to have a high level of sentiment polarity compared to the neutral emotion. We observe that the influence of UP on the neutral and sad target emotions is relatively low, which is in line with common observations on emotion recognition using text. Emotions such as neutral and sadness are not as conveniently distinguishable as the other target emotions. Given a phrase from these emotion classes, the polarity is close to zero, which makes it difficult to differentiate utterances of the neutral and sad classes. At the multimodal bottleneck layer, the scores are negligible for the neutral and angry classes and significant for the happy and anger classes. This signifies that UP is insignificant towards the classification of examples into the sad, neutral, excited and frustrated emotion labels and that the VP and PT concepts play a more important role for these classes. Despite the high TCAV scores of UP for the sad and frustrated classes, hypothesis tests show that the concept is insignificant for these labels.

IV-D4 Hypothesis Testing

To test the significance of the proposed concepts for emotion recognition, we perform a 2-tailed $t$-test. We first generate 50 random concept sets from the training set and assign positive and negative labels at random to the activations from the unimodal and multimodal bottleneck layers. This is followed by fitting binary classifiers to these random concept sets to generate 50 random CAVs. We then conduct a 2-tailed $t$-test between the distribution of the TCAV scores for the proposed concepts and that of each of the 50 random concepts at a significance level $\alpha$. The null and alternate hypotheses are defined as follows:

$$H_0: \mu_r = \mu_c \qquad H_1: \mu_r \neq \mu_c$$

Here, $\mu_r$ represents the mean score of the random concepts distribution and $\mu_c$ represents the mean score of the proposed concepts score distributions. We consider a concept to be significant if the null hypothesis is rejected for 40 of the 50 random TCAV score distributions. It is observed (Fig. 2b, 2c, 2d) that the UP concept is significant for all emotion classes at the unimodal bottleneck and insignificant for the sad, neutral, excited and frustrated classes at the multimodal bottleneck. The VP concept is significant for all emotion classes at both the unimodal and multimodal bottlenecks. The PT concept is significant for all emotion classes at the unimodal level but insignificant for the neutral and excited emotions at the multimodal level.
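The significance criterion described above can be sketched with the Python standard library. Several parts of this sketch are assumptions: the p-value uses a Welch statistic with a large-sample normal approximation instead of the exact $t$ distribution (a statistics package such as SciPy's `scipy.stats.ttest_ind` would give the exact test), the `alpha=0.05` default is a placeholder rather than the value used in this work, the rule is read as "at least 40 rejections", and the score distributions are synthetic.

```python
import math
import random
from statistics import NormalDist, mean, variance
from typing import List, Sequence


def two_tailed_p(sample_a: Sequence[float], sample_b: Sequence[float]) -> float:
    """Welch statistic with a normal approximation to the two-tailed p-value."""
    se = math.sqrt(variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b))
    t_stat = (mean(sample_a) - mean(sample_b)) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))


def is_significant(
    concept_scores: Sequence[float],
    random_score_sets: List[Sequence[float]],
    alpha: float = 0.05,       # placeholder significance level
    min_rejections: int = 40,  # "40 of the 50" rule, read as a minimum
) -> bool:
    """Reject H0 against each random concept and apply the rejection-count rule."""
    rejections = sum(two_tailed_p(concept_scores, r) < alpha for r in random_score_sets)
    return rejections >= min_rejections


random.seed(0)
# Synthetic TCAV scores: 30 repetitions per concept, 50 random concepts.
random_sets = [[0.5 + random.uniform(-0.1, 0.1) for _ in range(30)] for _ in range(50)]
strong_concept = [0.8 + random.uniform(-0.05, 0.05) for _ in range(30)]
chance_concept = [0.5 + random.uniform(-0.1, 0.1) for _ in range(30)]

strong_result = is_significant(strong_concept, random_sets)
chance_result = is_significant(chance_concept, random_sets)
```

A concept whose scores sit well away from the random baseline is flagged as significant, while one drawn from the same distribution as the random concepts is not.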

V Conclusions and Future Work

Emotion AI has been widely used in critical domains such as medical diagnosis and security, and interpretability for emotion recognition will ensure the robustness and reliability of affective computing systems. To this end, we explore concept-based interpretation of emotion recognition through Concept Activation Vectors (CAVs) to quantify the influence of emotion-related concepts for a typical multimodal emotion recognition model. We define novel concepts based on existing Emotion AI literature, and analyse the relevance of the same. Through our results, it is evident that DNNs for emotion recognition make use of human-understandable concepts for classification, just like humans.

We further evaluate the significance of our concepts through hypothesis testing on the TCAV scores. The results show that the multimodal architecture makes use of specific concepts for specific emotion classes. There is no single concept that is significant for all the emotions, which is in line with pre-established notions for this task from the human perspective. Current literature on this task shows that most of the models trained on the IEMOCAP database tend to perform better on text input than the other two input modes, which could affect the interpretation of these models. Thus, one of the focuses for future work can be the interpretation of emotion recognition models that are independent of dataset bias.

In this work, we have explored the interpretability for emotion recognition on the BC-LSTM network, which is relatively simple compared to the state-of-the-art models. The TCAV method can be extended to more complex architectures to evaluate our concepts for these models. In addition to these improvements, the discovery of emotion-related concepts in an unsupervised setting can be a possible direction for future research. This can reduce human effort in annotating concept examples for emotion classification and enhance the interpretability of DNNs by allowing models to generate their own concepts.