M2Lens: Visualizing and Explaining Multimodal Models for Sentiment Analysis

07/17/2021 ∙ by Xingbo Wang, et al. ∙ Singapore Management University ∙ Carnegie Mellon University

Multimodal sentiment analysis aims to recognize people's attitudes from multiple communication channels such as verbal content (i.e., text), voice, and facial expressions. It has become a vibrant and important research topic in natural language processing. Much research focuses on modeling the complex intra- and inter-modal interactions between different communication channels. However, current multimodal models with strong performance are often deep-learning-based techniques and work like black boxes. It is not clear how models utilize multimodal information for sentiment predictions. Despite recent advances in techniques for enhancing the explainability of machine learning models, they often target unimodal scenarios (e.g., images, sentences), and little research has been done on explaining multimodal models. In this paper, we present an interactive visual analytics system, M2Lens, to visualize and explain multimodal models for sentiment analysis. M2Lens provides explanations on intra- and inter-modal interactions at the global, subset, and local levels. Specifically, it summarizes the influence of three typical interaction types (i.e., dominance, complement, and conflict) on the model predictions. Moreover, M2Lens identifies frequent and influential multimodal features and supports the multi-faceted exploration of model behaviors from language, acoustic, and visual modalities. Through two case studies and expert interviews, we demonstrate our system can help users gain deep insights into the multimodal models for sentiment analysis.




1 Related Work

This section discusses research relevant to our approach, including multimodal language analysis, post-hoc explainability techniques, and machine learning interpretation with visualization.

1.1 Multimodal Sentiment Analysis

Multimodal sentiment analysis is a vibrant topic in natural language processing (NLP). It automatically extracts people’s attitudes or affective states from multiple communication channels (e.g., text, voice, and facial expressions) and has various applications [zeng2019emoco, zeng2020emotioncues, hu2018multimodal]. The core challenge is modeling the complex intra-modal and inter-modal interactions that arise when multimodal features are fused.

Early work [ngiam2011multimodal, lazaridou2015combining] concatenated features from different modalities before feeding them into a learning model. Conversely, some work adopted late-fusion approaches that combine the decision values from individual unimodal models using a voting scheme [potamianos2003recent, nojavanasghari2016deep] or a learning model [glodek2011multiple, ramirez2011modeling]. However, these methods ignore the cross-modal interactions. To address such issues, some work explicitly computed the unimodal, bimodal, and trimodal features and fused them with tensor product [zadeh2017tensor, liu2018efficient] and dynamic routing [tsai2020multimodal]. Recently, neural network methods [zadeh2018multi, pham2019found, rajagopalan2016extending, chen2017multimodal, zadeh2018memory, tsai2019multimodal] have become popular for modeling the complex interplay between modalities. For example, researchers [rajagopalan2016extending, chen2017multimodal] have extended LSTM cells and gates to learn temporal interaction patterns among multimodal sequences. Pham et al. [pham2019found] proposed attention-based RNNs to learn multimodal representations with a cyclic translation loss among modalities. Zadeh et al. [zadeh2018memory] designed a multi-view gated memory unit that is controlled by neural networks and that stores and predicts temporal cross-modal interactions. Tsai et al. [tsai2019multimodal] utilized transformer attention mechanisms to learn both cross-modal alignment and interactions. Although neural networks greatly improve the performance over traditional methods, their complex architectures seriously reduce model interpretability. This paper presents an explanatory interface to diagnose black-box models for sentiment analysis tasks.
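The difference between the early-fusion and late-fusion strategies mentioned above can be illustrated with a minimal sketch. The function names, the toy unimodal model, and the use of simple averaging in place of a voting scheme are assumptions made for illustration; they are not taken from any of the cited systems.

```python
from typing import Callable, Dict, List

Vector = List[float]


def early_fusion(features: Dict[str, Vector]) -> Vector:
    """Concatenate per-modality feature vectors into one model input."""
    fused: Vector = []
    for modality in sorted(features):  # fixed order keeps the layout stable
        fused.extend(features[modality])
    return fused


def late_fusion(
    features: Dict[str, Vector],
    unimodal_models: Dict[str, Callable[[Vector], float]],
) -> float:
    """Average the decision values of independent unimodal models.

    Averaging is an illustrative stand-in for a voting scheme.
    """
    scores = [unimodal_models[m](features[m]) for m in sorted(features)]
    return sum(scores) / len(scores)


def toy_model(x: Vector) -> float:
    """Hypothetical unimodal model: the sum of its input features."""
    return sum(x)


features = {"language": [0.5, 1.0], "acoustic": [0.2], "visual": [-0.4, 0.1]}
models = {modality: toy_model for modality in features}

fused_input = early_fusion(features)        # one vector for a single model
late_score = late_fusion(features, models)  # one score from three models
```

Neither function models how a feature in one modality changes the meaning of a feature in another, which is the limitation the cross-modal methods above try to address.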

1.2 Post-hoc Explainability Techniques

Post-hoc explainability techniques interpret models after the training process [arrieta2020explainable, longo2020explainable]. They generally include model-specific and model-agnostic approaches [longo2020explainable]. Model-specific methods explain particular models ranging from shallow models [tolomei2017interpretable, haasdonk2005feature] to sophisticated neural networks [krakovna2016increasing, selvaraju2017grad]. In contrast, model-agnostic methods are flexible enough to be applied to any machine learning model. Here, we discuss two main types of model-agnostic approaches: explanation by simplification and feature relevance explanation [arrieta2020explainable].

For simplification techniques, researchers often built surrogate models (e.g., rule-based learners [4734030, johansson2004accuracy, ribeiro2018anchors], decision trees [bastani2017interpretability], and linear models [ribeiro2016should]) to imitate the original model behaviors with reduced complexity. One of the most representative methods is LIME [ribeiro2016should], which builds locally linear models to approximate individual predictions based on neighbors of instances of interest. Feature relevance explanation quantifies the feature contributions to model predictions. One popular example is SHAP [lundberg2017unified], whose mathematical root is the Shapley value [shapley1953value], a method from cooperative game theory. SHAP computes an additive importance score for each feature to describe its influence, given a prediction result. It has desirable properties (local accuracy, missingness, and consistency) and has been shown to align with human intuition. Other work used local gradients [robnik2008explaining], randomized feature permutations [henelius2014peek], or influence functions [koh2017understanding] to disclose feature relevance.
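To make the additive importance scores concrete, the sketch below computes exact Shapley values for a tiny, hypothetical value function over three modality-level "players". It enumerates all coalitions directly, which is only feasible for a handful of players; it is not the SHAP library and does not use its approximation algorithms. The value function `toy_value` and its coefficients are invented for illustration.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict, FrozenSet, List


def shapley_values(
    players: List[str], value: Callable[[FrozenSet[str]], float]
) -> Dict[str, float]:
    """Exact Shapley values by enumerating every coalition of the other players."""
    n = len(players)
    phi: Dict[str, float] = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):  # coalition sizes 0 .. n-1
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # marginal contribution of p when it joins coalition s
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


def toy_value(coalition: FrozenSet[str]) -> float:
    """Hypothetical model output when only the given modalities are present."""
    out = 0.0
    if "language" in coalition:
        out += 1.0
    if "acoustic" in coalition:
        out += 0.5
    if "language" in coalition and "visual" in coalition:
        out += 0.5  # interaction term shared between the two modalities
    return out


players = ["language", "acoustic", "visual"]
phi = shapley_values(players, toy_value)
```

In this toy case the interaction term is split evenly between the language and visual players, and the scores sum to the difference between the full-coalition output and the empty-coalition output, which is the local accuracy property mentioned above.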

However, the methods above are often used to interpret specific instances of one modality (e.g., sentences, images) and cannot be directly applied to multimodal sentiment analysis. This paper aims to fill the gap by enabling multi-level explanations of the learned intra- and inter-modal interactions at the global, subset, and local levels.

1.3 Machine Learning Interpretation With Visualization

With the increasing complexity of both data and machine learning models, various visual analytics systems have been proposed to assist in understanding the model behaviors. Besides measuring the model performance with computational metrics, users also need to explore when and why a model makes specific decisions [hohman2018visual]. One of the most common and important interpretation strategies in previous work is to reveal the relationship between the input data and model predictions [hohman2018visual, arrieta2020explainable]. They can be categorized into two groups: instance exploration and feature & subset exploration.

Instance visualization shows model behavior towards individual data samples. Amershi et al. [amershi2015modeltracker] presented ModelTracker to support performance debugging with a visual summary of binary classification instances. Ren et al. [ren2016squares] extended the performance visualization to multi-class scenarios with aligned vertical axis designs, while Kahng et al. [kahng2017cti] and Alsallakh et al. [bilal2017convolutional] adopted a matrix-like design for instance summary. Apart from visualizing instance distributions, Kulesza et al. [kulesza2015principles] built an exploratory debugging prototype to enable users to explain corrections back to models. In addition, there are tools [harley2015interactive, smilkov2017direct] that allow users to interactively probe models with provided inputs.

Feature and subset visualization investigates how to surface patterns in the groups of features [krause2014infuse, brooks2015featureinsight, krause2016interacting] and instances [zhang2018manifold, ahn2019fairsight, cabrera2019fairvis, wexler2019if] that affect model decisions. Brooks et al. [brooks2015featureinsight] developed FeatureInsight, which supports the feature ideation process with a visual summary of set errors. Krause et al. [krause2014infuse] enabled exploration of the predictive power of feature candidates across different feature selection algorithms. For specific applications in CV and NLP, features are often visualized as image patches [olah2017feature, selvaraju2017grad, springenberg2014striving] or text segments [karpathy2015visualizing, gehrmann2019gltr]. Besides, researchers built interactive tools to facilitate group-level exploration. Zhang et al. [zhang2018manifold] conducted feature attribution comparisons to inspect discrepancies across different data subsets. Some work [ahn2019fairsight, cabrera2019fairvis, wexler2019if] used fairness metrics to partition data into groups for model diagnosis.

However, these methods do not consider exploring multimodal features and determining how much they affect model decisions. Our system facilitates multi-faceted exploration of multimodal features and generates multi-level visual explanations on their influences.

2 Background

In multimodal sentiment analysis, a machine learning model predicts sentiment based on the visual, acoustic, and language features extracted from the raw video data. This section introduces the related background about multimodal datasets, feature engineering techniques, performance metrics, and intra- and inter-modal interactions.

2.1 Dataset

There is a wide range of multimodal datasets in the community. For example, IEMOCAP [busso2008iemocap] contains 151 videos of dialogues with different emotion labels. YouTube [morency2011towards] consists of videos of product reviews extracted from the social media website YouTube. Without loss of generality, our work focuses on the largest and most widely used benchmark dataset for multimodal sentiment analysis, i.e., CMU-MOSEI [zadeh2018multimodal]. It consists of 23,454 monologue movie review video clips from 1,000 speakers and 250 topics on YouTube. The sentiment of each video clip is labeled by three annotators on a Likert scale of [-3, 3], where 3 indicates strongly positive, -3 represents strongly negative, and 0 means neutral. Besides the sentiment label, each video is associated with information from the three communicative channels: transcripts for the language modality, facial expressions for the visual modality, and the voices of speakers for the acoustic modality.

2.2 Multimodal Feature Engineering

Prior research on multimodal models mostly uses different feature engineering techniques for all three modalities in sentiment analysis. Here, we follow the common practice of multimodal feature extraction (also provided by CMU-MOSEI). For language features, transcripts are encoded by high-dimensional word vectors. We leverage GloVe embeddings [pennington2014glove] to represent each word, where each word is transformed to a 300-dimensional vector. For the visual modality, most work focuses on facial expressions, which are often encoded by the Facial Action Coding System (FACS) [friesen1978facial]. FACS encodes the facial muscle movement with 35 facial action units. We deploy it to extract frame-level facial features. The acoustic features are engineered through a speech processing framework, COVAREP [degottex2014covarep]. The extracted features have 74 dimensions, and all of them are related to speech emotions and tones. To help users gain a quick overview of these fundamental features, we further group them into different classes, which will be introduced in subsubsection 4.2.2.

2.3 Metrics for Multimodal Sentiment Analysis

Prior work applies several metrics to evaluate the model performance for multimodal sentiment analysis, including mean absolute error, the correlation between the model predictions and human labels, F1 score, 7-class accuracy, and 2-class accuracy. Note that 7-class accuracy considers all of the sentiment scores in [-3, 3], while 2-class accuracy is a binary classification score that only evaluates whether a video clip is predicted as positive or negative.
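The metrics above can be sketched in a few lines of plain Python. Several details are assumptions rather than statements from the paper: continuous scores are rounded and clipped to the seven integer classes in [-3, 3], a score of zero is counted as non-negative for the binary metrics, and F1 is computed for the positive class only (published results often report a weighted F1 instead).

```python
from math import sqrt
from typing import Dict, List


def clip_round(score: float) -> int:
    """Map a continuous score to an integer class in [-3, 3].

    Python's round() rounds exact halves to the nearest even integer.
    """
    return max(-3, min(3, round(score)))


def sentiment_metrics(y_true: List[float], y_pred: List[float]) -> Dict[str, float]:
    n = len(y_true)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n

    # Pearson correlation between labels and predictions
    mean_t, mean_p = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    corr = cov / sqrt(var_t * var_p) if var_t > 0 and var_p > 0 else 0.0

    acc7 = sum(clip_round(t) == clip_round(p) for t, p in zip(y_true, y_pred)) / n

    # Assumption: zero is treated as non-negative for the binary split.
    true_pos = [t >= 0 for t in y_true]
    pred_pos = [p >= 0 for p in y_pred]
    acc2 = sum(t == p for t, p in zip(true_pos, pred_pos)) / n

    tp = sum(t and p for t, p in zip(true_pos, pred_pos))
    fp = sum((not t) and p for t, p in zip(true_pos, pred_pos))
    fn = sum(t and (not p) for t, p in zip(true_pos, pred_pos))
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0

    return {"mae": mae, "corr": corr, "acc7": acc7, "acc2": acc2, "f1": f1}


metrics = sentiment_metrics([3.0, -2.0, 0.0, 1.0], [2.0, -1.0, 0.4, -1.0])
```

The toy call shows why the two accuracies can diverge: a prediction of 2.0 for a label of 3.0 has the right polarity but falls into the wrong one of the seven classes.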

2.4 Intra- and Inter-modal Interactions

In practice, sentiment analysis relies on multimodal language signals (e.g., language, facial expressions, and tones). A successful multimodal sentiment analysis requires the understanding of the combinations of these signals, where two primary forms of interactions exist—intra- and inter-modal interactions [soleymani2017survey, baltruvsaitis2018multimodal].

When modeling intra- and inter-modal interactions, three typical situations arise [soleymani2017survey, baltruvsaitis2018multimodal, zadeh2018memory]:

  • One modality is dominant for sentiment analysis. For example, people may show agreement by nodding their heads, where the vision modality dominantly indicates their positive attitudes.

  • Multiple modalities complement each other when people are expressing their sentiment. For example, people’s positive attitudes in words can be enhanced by a happy tone.

  • Multiple modalities conflict with each other. For example, people may tell sad stories with smiles on their faces.

Researchers have tried to build models to analyze the situations above for better sentiment analysis. However, most state-of-the-art models are deep-learning-based techniques with little interpretability. Model developers and users are not aware of how exactly the model utilizes information in multiple modalities in situations of dominance, complement, or conflict. Explaining multimodal model behaviors not only provides insights into the multimodal language characteristics, but also reveals the model errors and inspires new model designs. In our work, we explicitly provide global explanations on intra- and inter-modal interactions with a compact visual summary. Specifically, we categorize instances into dominance, complement, and conflict groups based on the importance of each modality computed by SHAP [lundberg2017unified]. Furthermore, we summarize influential feature sets for each group with templates to provide finer-grained explanations on model behavior.

3 Design Requirements

Our goal is to develop a visual analytics system to help users (e.g., model developers and model users) understand and diagnose the behaviors of multimodal models for sentiment analysis. Similar to the general black-box explanation tools [amershi2015modeltracker, ren2016squares, krause2016interacting, zhang2018manifold], interpreting multimodal models helps target users gain insights into the connection between the model performance (e.g., model errors) and the characteristics of multimodal data. For example, model users can examine whether a model has a bias or poor performance on some types of data and further decide if it is a proper fit for target applications. Furthermore, given the critical aspects of multimodal sentiment analysis (in subsection 2.4), it is beneficial to explain the intra- and inter-modal relationships learned by the model. For instance, model developers can adjust the fusion weights of different modalities based on their relative importance to achieve better sentiment predictions. However, it is challenging to interpret multimodal models due to the high complexity of multimodal data and inter-modal relationships.

To understand users’ general needs and formulate design requirements, we surveyed prior visualization techniques for interpreting machine learning models [brooks2015featureinsight, amershi2015modeltracker, ren2016squares, krause2016interacting, krause2014infuse, zhang2018manifold, kahng2017cti, molnar2019, carvalho2019machine, arrieta2020explainable] and multimodal language analysis [baltruvsaitis2018multimodal, ngiam2011multimodal, tsai2019multimodal, tsai2020multimodal, zadeh2018multimodal, zadeh2017tensor]. Also, we worked closely with a researcher in NLP and multimodal machine learning (who is also a co-author of this paper) for about five months to collect his feedback and iteratively refine the design requirements. We summarize the design requirements as follows.

R1: Show the model performance. Performance metrics are crucial for guiding the model analysis [amershi2015modeltracker, ren2016squares]. They provide quantitative measures of how accurate the predictions are and can help users pinpoint where the model is likely to fail. The users often want to evaluate models at different levels:

  • Q1: What are the overall error distributions for model predictions?

  • Q2: What are the instances that are predicted with large/small errors?

R2: Reveal the contributions of modalities to the model predictions. Besides performance metrics, the system should provide global explanations on how the model generally works, especially when working with huge datasets [kahng2017cti, molnar2019, carvalho2019machine, arrieta2020explainable]. In multimodal sentiment analysis, intra- and inter-modal interactions are crucial for understanding the model behaviors  [baltruvsaitis2018multimodal, ngiam2011multimodal]. Thus, it is essential to summarize the influences of individual modalities and their interplay for predictions. Specifically, the system should help users answer the following questions:

  • Q3: How does each modality influence the model predictions? Displaying the contributions of each modality helps users prioritize their efforts in diagnosing a particular modality for model predictions  [zadeh2017tensor].

  • Q4: Which modalities dominate the model predictions? Also, which modalities complement or conflict with each other for model predictions? To better reveal the characteristics of multimodal interactions captured by the model, the system should further summarize the instances according to the interaction types  [zadeh2018multimodal, tsai2020multimodal, tsai2019multimodal]. Specifically, dominant, complementary, and conflicting modalities, which depict typical interaction types, are the targets for analysis.

  • Q5: How do dominant/complementary/conflicting modalities influence the model predictions? Besides recognizing the learned interaction types, it is also essential to connect them to the model predictions for a comprehensive understanding of model behaviors  [amershi2015modeltracker, ren2016squares, zhang2018manifold]. For example, the dominance of language modality can contribute to positive or negative sentiment for different instances.

R3: Identify the influences of multimodal features on the model predictions. With a global understanding of how the model works on individual modalities (R2), users need to drill down to a finer-level inspection of model behaviors. Feature-based exploration is a common and effective approach for explaining machine learning models [brooks2015featureinsight, krause2016interacting, krause2014infuse]. Accordingly, the system should connect high-level modality interactions with the corresponding multimodal features. For example, users may want to know when the language modality dominates the predictions and what words people use to express their sentiments.

  • Q6: What are the feature sets that significantly contribute to positive/negative sentiment predictions? Exploring all the features of instances individually is tedious given the high volume and dimensionality of multimodal data. Summarizing the set of features with a significant predictive contribution helps reduce the efforts in exploration  [brooks2015featureinsight, krause2014infuse]. In addition, it helps users develop a high-level concept about model predictions. For example, users may want to know what types of words or facial expressions are considered important to models when dealing with positive sentiment cases.

  • Q7: What features are considered important by the model? Are they plausible for prediction? To help users analyze the individual predictions, features with a significant influence on the model performance should be presented to users and allow them to judge whether they align well with the observation of the original data.

R4: Support multi-level and multi-faceted exploration of the multimodal model behaviors. Given the multimodal settings of sentiment analysis, the visualization should empower users to explore the relationships between the model and input data from multiple aspects (e.g., language, facial expressions). To facilitate a comprehensive understanding of multimodal models, explanations should be offered on different levels, including the influences of individual modalities and their interplay, and the importance of multimodal features.

4 M2Lens

Based on the derived design requirements (section 3), we develop a visual analytics system, M2Lens (see the teaser figure), for understanding and diagnosing how models utilize multimodal information for sentiment prediction. In this section, we first provide an overview of the system architecture. Then, we illustrate the methods for generating explanations of multimodal model behavior. Next, we describe the visual designs and interactions in detail.

4.1 System Overview

Figure 1 shows the system architecture. First, speakers’ opinion videos are transformed into visual, acoustic, and language features. The storage module saves users’ model and data with processed features. Then, the explanation engine inputs the features into the model and generates multi-level explanations of model behaviors based on the feature attribution methods (e.g., SHAP). The visual analysis module enables interactive exploration of the explanations through five main views.

Figure 1: M2Lens consists of a storage module, an explanation engine, and a visual analysis interface.

The User Panel is the entry point of the whole interface, where the descriptive statistics about the model performance and dataset (Q1) are shown. Then, the Summary View, Template View, Projection View, and Instance View provide multi-level model explanations from language, visual, and acoustic modalities (R4). The Summary View presents a global summary of the influences of individual modalities and their interplay for the sentiment predictions (R2). The Template View and Projection View complement each other for subset-level explanations (R3). Specifically, the Template View uses templates to summarize feature sets that frequently and significantly contribute to the model predictions. The Projection View supports the multi-faceted exploration of instances that have features of interest, along with their prediction errors. The Instance View summarizes instance-level prediction information (e.g., errors) (Q2) and offers local explanations on the importance of each modality and its features (R4). In addition, it aligns the audio and vision features with the spoken words and provides the corresponding raw video clips with feature annotations for further exploration.

4.2 Multi-level Explanations

To give users a comprehensive understanding of multimodal model behavior, we propose methods to generate global and subset-level explanations (R2, R3). They supplement the local explanations computed by feature attribution methods.

4.2.1 Global Explanations

Since the intra- and inter-modal interactions lie at the heart of multimodal sentiment analysis, they are essential for users to understand how the multimodal model utilizes the information from different modalities (i.e., language, audio, and vision) (R2). In our work, we characterize three typical types of interactions among modalities—dominance, complement, and conflict (details are in subsection 2.4).

Dominance suggests that the influence of one modality dominates the polarity (i.e., positive or negative) of a sentiment prediction. Complement indicates that two or all three modalities affect a model prediction in the same direction (i.e., positively or negatively). Conversely, conflict reveals that the influences of modalities differ from each other. According to the definitions above, we formulate a set of rules to identify them (Algorithm 1). Specifically, the influence of the interactions on the model output is based on the importance of each modality, which is the summation of the importance of all its features. Then, we extract and summarize the interactions with strong influences for all the predictions. The thresholds for our rules are determined by maximizing the distances between the interaction types while minimizing the average influences of interactions that do not belong to dominance, complement, or conflict (i.e., others). Here, the interaction types are those output by Algorithm 1 for all the instances, and the distance between two interaction types is the Euclidean distance between their average influences.

Algorithm 1 Rules for extracting important relationships of modalities.
Input: …; …;
Output: Label for the interaction types, …;
if … then
   /* important interactions */
   if … then
      …;
   else if … then
      …;
   else if … then
      …;
   else
      …;
else
   …;
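The kind of rule set that Algorithm 1 describes can be illustrated with a minimal sketch based only on the verbal definitions of dominance, complement, and conflict given above. The exact conditions and thresholds of Algorithm 1 are not reproduced here; the parameters `min_influence`, `dominance_ratio`, and `minor_share`, their default values, and the order of the checks are all hypothetical.

```python
from typing import Dict, List


def modality_importance(feature_attributions: Dict[str, List[float]]) -> Dict[str, float]:
    """Importance of a modality = sum of the attributions of all its features."""
    return {modality: sum(values) for modality, values in feature_attributions.items()}


def interaction_type(
    importance: Dict[str, float],
    min_influence: float = 0.1,    # hypothetical: below this, the instance is "others"
    dominance_ratio: float = 0.7,  # hypothetical: share of total magnitude needed to dominate
    minor_share: float = 0.1,      # hypothetical: modalities below this share are ignored
) -> str:
    magnitudes = {m: abs(v) for m, v in importance.items()}
    total = sum(magnitudes.values())
    if total <= min_influence:
        return "others"  # the interaction has too little influence to matter

    top = max(magnitudes, key=magnitudes.get)
    if magnitudes[top] / total >= dominance_ratio:
        return "dominance"

    # Only modalities with a non-negligible share take part in the sign check.
    active = [v for m, v in importance.items() if magnitudes[m] / total >= minor_share]
    signs = {v > 0 for v in active}
    if len(active) >= 2 and len(signs) == 1:
        return "complement"  # same direction
    if len(signs) == 2:
        return "conflict"    # opposite directions
    return "others"


example = modality_importance(
    {"language": [0.3, 0.2], "acoustic": [-0.4], "visual": [0.05]}
)
label = interaction_type(example)
```

In the example, the language and acoustic modalities have comparable magnitudes but opposite signs, so the instance falls into the conflict group under these assumed thresholds.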

4.2.2 Feature Templates

Compared with inspecting the impacts of individual features, exploring feature groups is more effective for analyzing complex model behaviors and data characteristics [krause2015supporting, kahng2017cti]. It helps users develop a mental model of the model decisions (Q6). For example, users can learn what types of words (e.g., adjectives) are considered important indicators of positive sentiment. To ease the exploration of the influences of high-dimensional features, we organize the model’s input features introduced in subsection 2.2 into several meaningful groups. Then, we summarize frequent and influential groups with compact templates (part C of the teaser figure).

To promote the understanding of model behaviors, we first identify several feature sets based on the sentence structures for the language modality, emotion-related features for the acoustic modality, and facial expressions for the visual modality:

  • Language: part of speech (POS; see https://universaldependencies.org/docs/u/pos/), e.g., noun, adjective, verb;

  • Audio: pitch, amplitude, glottal/voice quality, and phase;

  • Vision (i.e., Face): face parts (i.e., brow, eye, nose, lip, and chin), head movement, and face emotions.

For the language modality, POS features provide a compact summary of the structure of language use. They have been widely used as a probe for natural language models [ribeiro2020beyond, rogers2020primer, strobelt2017lstmvis]. The audio features are grouped according to a state-of-the-art speech processing framework, COVAREP [degottex2014covarep], and speech applications [rubin2013content, wang2020voicecoach]. These sets generally relate to the emotions and tones of speech. For face-related features, we divide them into the face parts, head movement, and face emotions. They are the representative components in the Facial Action Coding System (FACS) [ekman1997face] for describing facial expressions. For the mapping between low-level multimodal features and the feature sets, please refer to the supplementary material.

After grouping the low-level features for each modality, we construct templates for both the frequent feature sets (e.g., “ADJ”) and features (e.g., the word “good”) that have a strong influence on predictions (Q6). Specifically, we create itemsets of important features and feature sets for all predictions. Then, we build FP trees [han2004mining] to find frequent patterns within the itemsets. For example, if “PRON” and “PART” or the word “not” constantly appear, they will be recorded in the templates (part C of the teaser figure).
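The idea of mining frequent patterns from per-prediction itemsets can be sketched as follows. This is a brute-force count over small item combinations, used here as a simpler stand-in for the FP-tree approach named above; it yields the same frequent itemsets up to the chosen size, but without the efficiency of FP-growth. The function name, the `max_size` cap, and the toy itemsets are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def frequent_itemsets(
    transactions: List[List[str]], min_support: int, max_size: int = 3
) -> Dict[Tuple[str, ...], int]:
    """Count every item combination up to max_size and keep the frequent ones."""
    counts: Counter = Counter()
    for items in transactions:
        unique = sorted(set(items))  # sorted so each combination has one canonical form
        for size in range(1, max_size + 1):
            for combo in combinations(unique, size):
                counts[combo] += 1
    return {combo: c for combo, c in counts.items() if c >= min_support}


# Each itemset lists the influential features and feature sets of one prediction.
itemsets = [
    ["PRON", "PART", "not"],
    ["PRON", "PART", "not", "ADJ"],
    ["ADJ", "good"],
    ["PRON", "PART", "not"],
]
templates = frequent_itemsets(itemsets, min_support=3)
```

With a minimum support of three, the combination of “PRON”, “PART”, and “not” survives as a template candidate, while “ADJ” (two occurrences) does not.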

4.3 User Interface

Based on the generated explanations, the user interface of M2Lens facilitates multi-level exploration of model behavior from the perspective of language, acoustic, and visual modalities (R4). All the views are tightly integrated with interactions to ensure a smooth transition between different levels of explanations. They share the same color encoding scheme where dark red means strong positive sentiment and dark blue represents strong negative sentiment.

Figure 2: Design choices for the Summary View. A: An augmented Sankey diagram. B: Our current design of augmented tree-like layout.

4.3.1 Summary View

The Summary View presents an overview of the intra- and inter-modal interactions that are learned by the selected model in the User Panel (R2). The influences of individual modalities and their interplay are visualized in a three-layer augmented tree-like layout (Figure 2B).

Visual designs. In the parent node, a barcode chart and a line chart show the distributions of the ground truths and model prediction errors, respectively (Q1). The vertical height of the barcode represents the total number of instances, and the color displays the sentiment. Meanwhile, the horizontal position of the line chart suggests the absolute error, and the mean error is represented as a dashed line.

The second layer presents the importance of individual modalities in bee swarm plots (Q3). They are arranged according to the influences of modalities in descending order. For each node in the layer, a blue bar is put to the left, whose horizontal length summarizes the total influences of the modality. Besides, the dots in the bee swarm plot and their projections (i.e., the barcode below) demonstrate the distribution of the influences of that modality for all the instances. The color and horizontal position of the dots encode the importance values, while the two gray lines indicate the magnitude of mean absolute importance.

The last layer summarizes the information about the four types of interactions (subsection 4.2), where the most influential one is shown at the top (Q4, Q5). For each interaction, the horizontal range of all its charts marks the number of instances in that group. To better surface the patterns of how the combinations of modalities affect the model predictions, we put the data instances close to each other if all their three modalities share similar influence patterns. Specifically, the similarity is measured by the farthest distances among three modalities between the instances. Then, a line chart and a barcode chart at the top summarize the error and prediction patterns, which are similar to the parent node. In addition, three barcode charts are attached below to present the distribution of importance of all three modalities. Their vertical orders show the total influences of the corresponding modalities, which are summed up by the blue bars to the left. The color of the bars inside the barcodes represents the importance values.

Besides, between two neighboring layers, links are drawn from the parent nodes to their child nodes. The width of a link is proportional to the importance of the child node to the model predictions.

Design choices. We considered an alternative design (Figure 2A) based on the Sankey diagram to reveal the intra- and inter-modal interactions and their importance to the predictions. It consists of three parts: the ground truth information at the left, the influences of individual modalities at the center, and the inter-modal interactions at the right. The width of a flow is proportional to the importance of the target node of the flow. The barcode chart of each node further displays the importance distribution. In addition, the orange lines of the nodes show the error distribution to guide the exploration. However, one expert commented that it would be necessary to show more detailed information on each node, for example, which modalities dominate the predictions and how frequently. Therefore, we augment the nodes with graphs and further convert the Sankey diagram into a compact tree-like layout, which leads to the current design (Figure 2B).

4.3.2 Template View

To facilitate the exploration of feature sets and their influences, the Template View (Figure 1C) summarizes frequent and influential templates of multimodal features in a table (Q6).

Visual designs. The Template View has four columns describing the template types, support, importance, and predictions and errors (R1, Q6). The first column records the names of feature sets by default. If a feature set contains frequent and important features, a green bar is placed to the right, denoting the number of children of the feature set. Users can expand the corresponding row for details by clicking the accompanying icon. The second column displays the frequency of the templates. The third and fourth columns visualize the distribution of the templates’ importance and prediction information. They share the same visual representations as the Summary View (subsubsection 4.3.1). Users can sort the templates by support, importance, or error, which helps them prioritize their efforts in diagnosing complex model behavior.
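As a rough illustration of the sortable table described above, the sketch below models each template row with one aggregate value per column. The `TemplateRow` class, the `sort_templates` helper, and all numbers are hypothetical. The actual view shows distributions rather than single aggregates.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TemplateRow:
    name: str          # feature-set name, e.g. a part-of-speech pattern
    support: int       # number of instances that contain the template
    importance: float  # assumed aggregate: mean importance of the template
    error: float       # assumed aggregate: mean absolute prediction error


def sort_templates(rows: List[TemplateRow], key: str,
                   descending: bool = True) -> List[TemplateRow]:
    """Return the rows sorted by one of the sortable columns."""
    if key not in ("support", "importance", "error"):
        raise ValueError(f"cannot sort by {key!r}")
    return sorted(rows, key=lambda r: getattr(r, key), reverse=descending)


# Made-up values, used only to exercise the sorting.
rows = [
    TemplateRow("ADJ", support=120, importance=0.25, error=0.5),
    TemplateRow("PRON + PART", support=21, importance=-0.5, error=1.25),
    TemplateRow("Face Emotion", support=200, importance=0.125, error=0.25),
]
print([r.name for r in sort_templates(rows, "error")])
```

Sorting by error in descending order puts the templates most associated with mispredictions at the top, which is the ordering used during diagnosis.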

Figure 3: The glyph designs in the Projection View. A: Chernoff face glyph designs. The left one, with dark-colored rings and thick strokes of face parts, indicates intense facial movement, while the right one suggests little facial movement. B: Audio glyph designs. The left one, with big blue sectors, indicates high pitch, while the right one suggests low pitch.

4.3.3 Projection View

To further support the subset-level exploration of model behavior (R3, Q6), the Projection View (Figure 1D) connects the multimodal feature templates in the Template View with the instances. It allows users to examine detailed information (e.g., feature values, prediction errors) about features across the instances. For example, after users select the “ADJ” template in the Template View, they may wonder which adjectives are associated with large errors or with positive predictions. To answer this, they need to inspect the individual instances.

Visual designs. To summarize the feature sets of a group of instances, we project the high-dimensional features onto a 2D plane using t-SNE [maaten2008visualizing]. Thus, instances with similar features will be placed close to each other. Given textual, acoustic, and visual features are heterogeneous, we design three different glyphs to encode the feature sets of the instances. Users can switch between views to see the feature distribution of each modality. Moreover, to help diagnose the model behavior (e.g., errors), we add a heatmap as the background to display the distribution of prediction errors or template importance.

  • Language: since words already carry semantic meanings, we use them to represent the textual features. In addition, we add a circle for each word, whose color encodes the sentiment prediction.

  • Vision: our glyph designs for facial features (Figure 3A) are inspired by Chernoff face [chernoff1973use], which is popular for displaying facial expressions. However, the original Chernoff face cannot reflect information such as head movement. Therefore, we add three sticks around the face to indicate the head movement along the yaw, pitch, and roll axes, respectively. The outer ring encodes the whole face information (e.g., emotion in our case), where a dark color suggests large feature values. Moreover, the stroke width of the face parts (e.g., nose) and sticks encodes movement intensity. The sentiment prediction is revealed by the face’s background color.

  • Audio: to help understand acoustic features, we group them into higher-level classes (subsubsection 4.2.2). As shown in Figure 3B, each colored sector represents the features of a class, where the radius relates to feature values. The sectors at the front summarize the average values of normalized features, while the small ones at the back display detailed feature values of the classes. Additionally, the inner circle color shows the sentiment prediction.
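The background heatmap of prediction errors mentioned above can be approximated by binning the projected 2D points into a grid and averaging the error per cell. This is a simplified sketch. The grid-averaging approach, the `error_heatmap` name, and the sample coordinates are assumptions, and the system may instead use a smoother density estimate.

```python
from typing import List, Tuple


def error_heatmap(points: List[Tuple[float, float]], errors: List[float],
                  bins: int = 4) -> List[List[float]]:
    """Mean error per cell of a bins x bins grid spanning the points'
    bounding box. Empty cells are reported as 0.0."""
    grid_sum = [[0.0] * bins for _ in range(bins)]
    grid_cnt = [[0] * bins for _ in range(bins)]
    if not points:
        return grid_sum
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_min, y_min = min(xs), min(ys)
    x_span = (max(xs) - x_min) or 1.0  # avoid division by zero
    y_span = (max(ys) - y_min) or 1.0
    for (x, y), err in zip(points, errors):
        col = min(int((x - x_min) / x_span * bins), bins - 1)
        row = min(int((y - y_min) / y_span * bins), bins - 1)
        grid_sum[row][col] += err
        grid_cnt[row][col] += 1
    return [
        [grid_sum[r][c] / grid_cnt[r][c] if grid_cnt[r][c] else 0.0
         for c in range(bins)]
        for r in range(bins)
    ]


# Hypothetical projected coordinates and absolute prediction errors.
pts = [(0.0, 0.0), (0.1, 0.1), (1.0, 1.0), (0.9, 0.2)]
errs = [0.2, 0.4, 1.0, 0.6]
for row in error_heatmap(pts, errs, bins=2):
    print(row)
```

Cells with a high mean error correspond to the red areas that guide users toward problematic groups of instances.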

4.3.4 Instance View

The Instance View (Figure 1E) provides local explanations by visualizing the important multimodal features and the context (i.e., transcripts and videos) of individual instances (Q7).

Visual designs. The left column presents a visual summary of the influences of modalities on the model predictions, as well as the prediction errors. Users can sort the instances according to different criteria (e.g., error) at the header and prioritize their efforts in instance-level exploration. In each row, the horizontal axes demonstrate the sentiment range, where the prediction and ground truth are marked. Between the two values, the thick red line suggests the error. Below the prediction mark, three colored rectangles represent the aggregated feature importance values of the three modalities. The length and color of each rectangle encode the magnitude and sign of importance. For example, the modality with negative influences on the prediction will be encoded by a blue rectangle and placed at the right. In addition, the feature table below allows users to sort and search for the importance values of features or modalities.
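The per-row summary described above can be sketched as follows. The sketch assumes that a modality's aggregated importance is the sum of its feature importances, and that positive values are drawn in red (the text only states that negative values are blue). All names and numbers are hypothetical.

```python
from typing import Dict


def aggregate_by_modality(feature_importance: Dict[str, float],
                          modality_of: Dict[str, str]) -> Dict[str, float]:
    """Sum feature importances within each modality (assumed aggregation)."""
    totals = {"language": 0.0, "acoustic": 0.0, "visual": 0.0}
    for feature, value in feature_importance.items():
        totals[modality_of[feature]] += value
    return totals


def encode_rectangle(value: float) -> Dict[str, object]:
    """Length encodes magnitude; color encodes sign.
    Blue for negative follows the text; red for positive is an assumption."""
    return {"length": abs(value), "color": "blue" if value < 0 else "red"}


def prediction_error(prediction: float, ground_truth: float) -> float:
    """Length of the error line drawn between the two marks on the axis."""
    return abs(prediction - ground_truth)


# Made-up importances for one instance.
importance = {"not": -0.5, "like": 0.25, "pitch": -0.125, "smile": 0.25}
modality = {"not": "language", "like": "language",
            "pitch": "acoustic", "smile": "visual"}

totals = aggregate_by_modality(importance, modality)
for name, value in totals.items():
    print(name, encode_rectangle(value))
print("error:", prediction_error(prediction=-1.0, ground_truth=0.5))
```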

To promote a comprehensive understanding of the context of individual instances, the right column highlights the important features of the instances. Unlike intuitive texts, the acoustic and visual features are harder to recognize. Thus, we align them with the spoken words and draw the most important ones using orange lines. The lines above the words correspond to acoustic features, while the lines below represent the visual features. The vertical offset of the lines denotes the feature values, and hence the fluctuations indicate the feature variation. In addition, the backgrounds of texts or feature lines reflect the importance of multimodal features at a word level.

The Instance View also provides video context for instance-level exploration. When users click on the rows of the table, the corresponding video clips will pop up and play. To make the visual features more intuitive, the top-ranked facial features (sorted according to importance value) are highlighted with bounding boxes that cover the corresponding parts of the face. Users can further find the detailed facial action units and their concrete meanings by hovering on the boxes.

4.3.5 User Interactions

M2Lens provides a rich set of interactions, which help unify the different views and facilitate multi-level and multi-faceted exploration with details on demand.

Brushing. Users can brush the barcodes in the last layer of Summary View to emit a query on the specific data instances of an interaction type. Then, the Template View and Instance View will show the related templates and local explanations, respectively.

Clicking. Many interactions in the system can be triggered and undone by clicking. For example, clicking the table rows in the Template View will filter out the irrelevant instances in the Projection View and Instance View. Users can switch between feature projections of different modalities by clicking the radio buttons in the Projection View. When users click the table rows in the Instance View, the corresponding instances in the Projection View will be shown, and their video clips will pop up and play. In addition, users can click on the header of the Template View and Projection View to undo the previous selections.

Lasso and semantic zooming. To facilitate scalable exploration, users can use lasso or semantic zoom to focus on specific instances of interest in the Projection View. Then, the detailed information will be displayed in the Instance View.

Searching, sorting, and filtering. To narrow down the exploration space, users can sort and search for the instances or features in the table of Template View and Instance View. By adjusting the sliders in the Projection View, users can filter the instances according to the sentiment predictions and the feature importance of specific modalities.

5 Evaluation

In this section, we demonstrate how M2Lens helps users understand and diagnose multimodal models for sentiment analysis through two case studies and interviews with three domain experts (E1, E2, and E3) using the CMU-MOSEI dataset. E1 and E2 are NLP researchers who have multiple top research publications on multimodal language analysis (e.g., emotion recognition). E3 is a senior software engineer who has five years’ experience in developing affective computing applications. The two cases are discovered by E1 and E2 during the system exploration in the interviews. The detailed feedback from all the experts is also collected and summarized.

5.1 Case One: Multimodal Transformer

In the first case, the expert E1 explored and diagnosed a state-of-the-art model, Multimodal Transformer (MulT) [tsai2019multimodal], for sentiment analysis using the CMU-MOSEI dataset (subsection 2.1). MulT fuses multimodal inputs with cross-modal transformers for all pairs of modalities, which learn the mappings between the source modality and target modality (e.g., vision → text). Then, the results are passed to sequence models (i.e., self-attention transformers) for final predictions. All the multimodal features of the input data are aligned at the word level based on the word timestamps. Following the settings of previous work [tsai2019multimodal], we trained, validated, and evaluated MulT with the same data splits (training: 16,265, validation: 1,869, and testing: 4,643). The details of MulT are included in the supplementary material.

During the exploration, E1 observed that the language modality often dominates the predictions, and the model cannot handle the negations in sentiment analysis very well. He further investigated the dominance of visual modality, where “Joy” and “Sadness” (two facial emotions) frequently co-occur. It was thought to be caused by the intense facial muscle movement, which was also captured by the model.

5.1.1 Dominance of Language Modality

Global summary (R1, R2)  After selecting MulT and the validation set in the User Panel, E1 was interested in how individual modalities and their interplay contribute to the model predictions. By looking at the second layer of the Summary View (Figure 1B), E1 found that the language modality (indicated by the letter “L”) has the largest influence among the three modalities since it has the longest bar to the left and the widest range of dots in the bee swarm plot. On the contrary, the acoustic modality (indicated by the letter “A”), which ranks at the bottom, has the least influence. Then, E1 examined the last layer, where the dominance group with the widest barcode charts is shown at the top. Within the group, he discovered that the longest bars attach to the language modality, and the color of the prediction barcode aligns well with that of the language barcode. Thus, E1 concluded that the language also plays a leading role in the dominance relationship. Furthermore, he noticed that there is a group of dense blue bars at the end of the language barcode, where the errors are relatively large (as indicated by the yellow curve above the dashed line). He wondered what features or their combinations cause the high errors. Therefore, he brushed the corresponding area of the blue bars.

Subset exploration (R1, R3, R4)  The Template View (Figure 1C) lists all the frequent and important feature templates for the brushed instances in the Summary View. By sorting them in descending order of error, E1 found that “PRON + PART” appears at the top with one child feature. Then, he expanded the row and found that 21 instances contain the word “not”, where it negatively influences the predictions (blue dots in the bee swarm plot in the “importance” column). Next, he clicked “not” to see the details about this feature in the Projection View. Zooming in on the word “not”, he observed several similar negative words (e.g., “isn’t”, “wouldn’t”) (Figure 1D). They were all located in a red area, indicating large errors. E1 speculated that the model cannot deal well with negations. Subsequently, he lassoed these words to closely examine the corresponding instances in the Instance View.

Figure 4: Examples of double negations. “not…sin” (in A) and “not…bad” (in B) are considered as indicators for negative sentiment by the model. However, these phrases reflect sentiments that are slightly positive.

Instance exploration (R1, R3, R4)  To further evaluate how the model handles negations, E1 started with the instances with large errors in the table (Figure 1E). When exploring the top-listed examples, E1 observed that negations always have significant negative influences on the predictions, and the model fails to interpret the true sentiment. For example, E1 found a case where the language modality dominates the negative sentiment prediction, and the word “not” is highlighted in blue (Figure 1E). However, the true sentiment of this sentence is positive, as the opening phrase “I really like” demonstrates. The model fails to extract these keywords and instead relies on the negation (i.e., “not”) to predict a negative sentiment. Moreover, E1 noticed that when double negations appear in a sentence (Figure 4), the model tends to treat them separately and regards both of them as indicators for negative sentiment. In fact, these double negations reflect sentiments that are slightly positive.

5.1.2 Dominance of Visual Modality

Global summary (R1, R2)  E1 referred back to the “dominance” group in the Summary View, where a collection of red bars from the prediction barcode conform with the ones from the visual modality (highlighted red in Figure 1B). The visual modality dominates the predictions, and the error line chart above suggests a low error rate in contrast with the previous case in Sect. 5.1.1. Motivated by this observation, E1 brushed the red bars to investigate the patterns in the visual features.

Subset exploration (R1, R3, R4)  In the Template View, “Face Emotion” has the largest support (Figure 5A). After unfolding the row, E1 found that “Joy + Sadness” is a frequent and important combination. This made him curious about how a pair of contrary emotions can co-occur. After he clicked the template, the corresponding glyphs were highlighted in the Projection View (Figure 5B). Most of them are found outside the red area, which verifies that the instances with “Joy + Sadness” often have small prediction errors. He decided to inspect these instances.

Instance exploration (R1, R3, R4)  Browsing the instances and their videos in the Instance View, E1 saw that “Joy” and “Sadness” are often considered important visual features with positive influences. Additionally, E1 found that their co-occurrences may be due to the presence of intense and rich facial expressions in the videos. These expressions generally involve the movement of the related facial action units in “Joy” and “Sadness”. For example, after E1 clicked on the instances, he noticed that all the face parts (i.e., nose, eyes, brows, mouth, and chin) of the corresponding glyphs (Figure 5B) in the Projection View have thick strokes, which suggests intense movements. When he watched the original videos, the bounding boxes of “Joy” and “Sadness” always popped up as important visual features. Hovering on the boxes and examining the facial expressions and their explanations, E1 concluded that the extreme facial expressions triggered the movement of the action units in “Joy” and “Sadness”, and the model seemed to capture these important facial expressions.


During exploration, E1 discovered that MulT cannot handle double negations very well, though it is a state-of-the-art model. He commented that augmenting the data with double negation examples or preprocessing them into positive forms could further improve the performance.

Figure 5: “Joy + Sadness” co-occurrence patterns. A: “Joy + Sadness” is a frequent and important feature template in the table. B: The raw video information and corresponding glyphs of three representative instances of the “Joy + Sadness” template.

5.2 Case Two: EF-LSTM

In this case, the expert E2 explored the popular RNN-based model, EF-LSTM [hochreiter1997long], for multimodal sentiment analysis using the CMU-MOSEI dataset. The dataset setup and feature processing are the same as Case One (subsection 5.1). EF-LSTM concatenates textual, acoustic, and visual features at each word. Then, it uses an LSTM model to derive the input representations for the predictions. The details of the model are provided in the supplementary material.
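The early-fusion step described above, in which the three feature vectors of each word are concatenated before being fed to the LSTM, can be sketched as follows. The feature dimensions and values are made up, and the LSTM itself is omitted.

```python
from typing import List, Sequence


def early_fusion(text: Sequence[Sequence[float]],
                 audio: Sequence[Sequence[float]],
                 vision: Sequence[Sequence[float]]) -> List[List[float]]:
    """Concatenate word-aligned textual, acoustic, and visual features
    into one vector per word. The result is the LSTM input sequence."""
    if not (len(text) == len(audio) == len(vision)):
        raise ValueError("modalities must be aligned to the same number of words")
    return [list(t) + list(a) + list(v) for t, a, v in zip(text, audio, vision)]


# Two words with hypothetical 2-d text, 1-d audio, and 1-d vision features.
text_feats = [[0.1, 0.2], [0.3, 0.4]]
audio_feats = [[0.5], [0.6]]
vision_feats = [[0.7], [0.8]]

fused = early_fusion(text_feats, audio_feats, vision_feats)
print(len(fused), len(fused[0]))
```

Because the modalities are merged before any modeling takes place, the network sees a single undifferentiated vector per word, which is relevant to the expert's later remark about textual information loss.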

Through interactive explorations with M2Lens, E2 was surprised to find that EF-LSTM does not learn sentiment in text. Also, he noticed that the acoustic modality has the largest influence on the sentiment prediction results among the three modalities, and the voice pitch always plays a negative role in the sentiment predictions.

5.2.1 No Meaningful Information Learned in Text

Global summary (R2)  After selecting the validation set and EF-LSTM, E2 started with the Summary View to gain an overview of the impacts of the modalities (Figure 2B). By comparing the range of dots in the three bee swarm plots, he was surprised to find that the acoustic modality is the most influential, followed by the language modality. In addition, the language modality always exhibits a positive impact on the sentiment. These findings are quite counter-intuitive. Thus, E2 first explored text-related interactions by tracking the thickest links from the language modality to the third layer. He noticed that the “complement” group is shown at the top, and the text plays a leading role within the group. Then, he brushed the whole group to see textual feature patterns.

Subset exploration (R3, R4)  The strange thing is that no textual templates and text glyphs were spotted in the Template View and the Projection View, respectively. E2 suspected that the model does not learn any important language features (i.e., words) for sentiment analysis. Then, he referred to the Instance View to validate his doubt.

#1  It’s run by a fantastic team of professors; they are always available for you.
#2  (Umm) this movie was excellent.

Instance exploration (R1, R3, R4)  When exploring the instances in the Instance View, E2 found that the model fails to recognize potentially important words for sentiment analysis, such as “fantastic” (in line #1) and “excellent” (in line #2). Neither of them is highlighted with colors in the Instance Detail. E2 also noticed that every word of the sentences in the feature table has a uniformly low positive importance score (less than 0.1). This explains why the language modality always has positive influences and further suggests that the model does not capture the sentiment in text.

5.2.2 Negative Influences of Voice Pitch

Figure 6: Negative influences of voice pitch. A: “pitch” is the most frequent acoustic template, and it always has a negative impact (as indicated by the dots in the bee swarm plot). B: The selected group of instances with large pitch values (as indicated by the large radius of the blue sectors). C: Two high-error cases where the model captures the turning points of the pitch but wrongly associates pitch with negative influences.

Global summary (R1, R2)  E2 paid attention to the most influential modality (i.e., the acoustic modality) in the Summary View (Figure 2B), where a negatively-skewed distribution of dots was shown. In addition, he noticed that within the “conflict” group, the acoustic modality plays a negative role (blue bars) throughout the time. Thus, E2 brushed this group to investigate the negative influence of acoustic features.

Subset exploration (R1, R3, R4)  E2 found that “pitch” is the most frequent acoustic template in the Template View (Figure 6A). Moreover, E2 noticed that pitch always has a negative impact given the negatively-skewed distribution of dots in the third column. After clicking the row, he switched to the Projection View to see the pitch value distribution (Figure 6B). He discovered that the acoustic glyphs are spread along a left-slanting line, where the radius of the blue sectors (i.e., pitch values) generally increases from left to right. Then, he selected a group of instances with large pitch values at the right corner for further inspection.

Instance exploration (R1, R3, R4)  By browsing the instances and videos in the Instance Summary (Figure 6C), E2 observed that pitch is always the most important acoustic feature and is associated with negative influences. Although the model captures some important pitch variation signals in the videos, he believed that it is not reliable since it always regards pitch as a strong negative sentiment indicator, and he found many counterexamples. For example, in two cases (Figure 6C), he found that pitch ranks first with negative importance in the feature table. He also noticed that some backgrounds of the orange lines (i.e., pitch values) are colored light blue (i.e., negative). By examining the offsets of all the orange lines, he judged that the highlighted ones are the turning points of the pitch values, and he speculated that the model captures the important signals in audio. He further checked the original videos and verified the observations. However, the speakers sound high-spirited, so the pitch should reflect positive sentiment.


Through the case study, E2 found that EF-LSTM seems unable to capture the sentiment in text. He reasoned that the simple early feature fusion may lead to textual information loss. He speculated that some more advanced model designs (e.g., transformer) could be incorporated into the model to facilitate text understanding. Given the negative impacts of voice pitch, E2 thought that removing the pitch feature may increase the model accuracy.

5.3 Expert Interviews

We collected feedback from one-on-one interviews with the aforementioned three domain experts (E1, E2, E3). None of them had tried the system before the interviews. We first introduced the background and system designs. Then we asked the experts to use M2Lens to diagnose two state-of-the-art models (i.e., multimodal transformer and EF-LSTM) on the CMU-MOSEI dataset. After a 50-minute exploration, we collected their feedback about the system workflow, system designs, application scenarios, and improvement suggestions.

System workflow. All the experts confirmed the effectiveness of the system workflow of M2Lens in providing explanations for multimodal sentiment analysis models. They mentioned that they usually rely on performance metrics or instance-level feature importance measures for model evaluation, which do not provide many details and cannot support an in-depth analysis. Our system supplements them with global- and subset-level explanations, which facilitate a comprehensive and systematic understanding of model behaviors. E1 and E3 remarked that the interaction summaries (i.e., dominance, complement, and conflict) are impressive and very useful for revealing both the model behaviors and the multimodal data characteristics. E3 mentioned that if M2Lens shows that only some modalities are influential in predicting sentiment, he can consider reducing the number of modalities without losing much performance when deploying the model to low-end devices. E1 added that the feature templates help generalize the model error patterns. E2 summarized that the system assisted him in discovering interesting insights into the models. For example, he was surprised that EF-LSTM seems not to capture any sentiment information from the text.

Visual designs and interactions. Overall, the experts confirmed that the visualizations are useful and easy to understand, and that the interactions are smooth. The Summary View is most favored by the experts for a quick overview of the learned intra- and inter-modal interactions. The designs of the Projection View are also appreciated by the experts. E3 really liked the heatmap for showing the error and feature importance patterns. E1 thought the face glyphs are very intuitive, and the interactions such as lasso and zoom are really helpful for exploring a large amount of data. Moreover, he valued the video playback and the real-time highlighting of face parts for raw video browsing. Nevertheless, E1 and E2 said that the Instance View is a little complex because it visualizes a lot of information. Additionally, the experts responded that it took them a while (about 20 minutes) to fully grasp all the components and functions in the system.

Improvements. The experts offered constructive suggestions for improvements. E3 requested a bookmark function to save user interaction histories (e.g., selection of templates) for further review. E1 suggested that the system can add a comparison module for exploring and comparing different models at the same time. During the exploration, E2 and E3 observed that some large model errors are caused by dataset errors (e.g., a mismatch between the video and transcript). They recommended that the system should support correcting dataset errors.

6 Discussion

Here, we discuss M2Lens regarding generalizability, scalability, multi-level and multi-faceted exploratory analysis, and learning curve.

Generalizability. M2Lens was developed to visualize and explain multimodal models for sentiment analysis. We demonstrated our system through case studies on two state-of-the-art models using the CMU-MOSEI dataset. However, M2Lens can also be used to explain other multimodal models on different sentiment datasets based on the feature importance computed by post-hoc explainability techniques. Furthermore, the interaction types (i.e., dominance, complement, and conflict) and feature templates can summarize multimodal features from the global and subset levels in other multimodal language analyses. For example, for the multimodal emotion recognition task, the system can explain what are the dominant modalities when “angry” is predicted. The feature templates can summarize the frequent and influential feature sets for “angry” and facilitate the exploration of model behaviors.

Scalability. Our approach also has some scalability issues, which come from the automated algorithms and visual designs. The bottleneck of our computational cost is the feature attribution method. We use SHAP to compute the feature importance. It took about 25 minutes to process 2,000 instances of the CMU-MOSEI validation set. To speed up the process, we can employ techniques such as feature clustering, data sampling, and parallel computing. For the visual designs, visual clutter can occur in the Projection View, where multimodal instances are encoded with different glyphs. To mitigate this issue, M2Lens enables filtering instances according to the feature importance and sentiment predictions. Moreover, users can use semantic zoom to focus on instances of interest, which alleviates the overlapping issues.
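As a back-of-the-envelope illustration of the reported cost, 25 minutes for 2,000 instances is about 0.75 seconds per instance. The sketch below extrapolates from that figure. It assumes the cost scales linearly with the number of instances and that sampling and parallel workers give ideal speedups, none of which is measured in the paper. The `estimated_minutes` helper is hypothetical.

```python
def estimated_minutes(n_instances: int,
                      minutes_per_2000: float = 25.0,
                      sample_fraction: float = 1.0,
                      workers: int = 1) -> float:
    """Rough attribution time, assuming linear scaling in the number of
    instances and ideal speedup from sampling and parallel workers."""
    if not 0.0 < sample_fraction <= 1.0:
        raise ValueError("sample_fraction must be in (0, 1]")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    per_instance = minutes_per_2000 / 2000  # about 0.0125 min = 0.75 s
    return n_instances * sample_fraction * per_instance / workers


# The reported setting, then the 4,643-instance test split under the
# same (assumed) per-instance cost, alone and with four workers.
print(round(estimated_minutes(2000), 2))
print(round(estimated_minutes(4643), 2))
print(round(estimated_minutes(4643, workers=4), 2))
```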

Multi-level and multi-faceted exploratory analysis. M2Lens provides multi-level and multi-faceted explanations on the behaviors of multimodal models for sentiment analysis. A general workflow for our target users (e.g., model users and researchers) starts with the Summary View, where the global summary of the influences of individual modalities and their interplay is displayed. Then, users can specify an interaction type. Its influential and frequent multimodal features will be summarized in the Template View and Projection View. Users can examine their error and importance patterns, which helps prioritize their efforts for the instance exploration in the Instance View.

Learning curve. In the interviews, the experts pointed out that it took them some time (usually a 20-minute trial) before they could use our system smoothly, since it contains several components. However, they said that M2Lens is very helpful for exploring the models. Moreover, they derived comprehensive insights into the model behaviors and are eager to use M2Lens for model understanding and diagnosis in the future.

7 Conclusion and Future Work

In this paper, we presented M2Lens, a visual analytics system to help users understand and diagnose multimodal models for sentiment analysis. M2Lens provides multi-level explanations on model behaviors from language, acoustic, and visual modalities. It features an augmented tree-like layout for a global understanding of learned intra- and inter-modal interactions. Moreover, the feature templates and visualization glyphs of multimodal features facilitate the exploration of a group of frequent and influential feature sets. Through two case studies and expert interviews, we demonstrated that M2Lens can provide deep insights into state-of-the-art multimodal models for sentiment analysis.

In the future, we plan to enhance our system usability by adding functions such as model comparison and data error correction. Also, we would like to extend our system to other multimodal applications (e.g., emotion recognition). Further, more domain experts can be invited to validate the usability and effectiveness of M2Lens with more datasets and models for sentiment analysis.

The authors wish to thank anonymous reviewers for their feedback. This research was supported in part by grant FSNH20EG01 under Foshan-HKUST Projects.