Pairwise Emotional Relationship Recognition in Drama Videos: Dataset and Benchmark

by   Xun Gao, et al.

Recognizing the emotional state of people is a basic but challenging task in video understanding. In this paper, we propose a new task in this field, named Pairwise Emotional Relationship Recognition (PERR). This task aims to recognize the emotional relationship between two interactive characters in a given video clip, and it differs from traditional emotion and social relation recognition tasks. A variety of information, including character appearance, behaviors, facial emotions, dialogue, background music and subtitles, contributes differently to the final result, which makes the task more challenging but meaningful for developing more advanced multi-modal models. To facilitate the task, we develop a new dataset called Emotional RelAtionship of inTeractiOn (ERATO) based on dramas and movies. ERATO is a large-scale multi-modal dataset for the PERR task, containing 31,182 video clips totaling about 203 hours. Different from existing datasets, ERATO consists of interaction-centric videos with multiple shots, varied video lengths, and multiple modalities including visual, audio and text. As a minor contribution, we propose a baseline model built on a Synchronous Modal-Temporal Attention (SMTA) unit that fuses the multi-modal information for the PERR task. In contrast to other prevailing attention mechanisms, our proposed SMTA steadily improves performance by about 1%. We expect ERATO and the proposed SMTA to open up a new direction for the PERR task in video understanding and to further advance research on multi-modal fusion methodology.






1. Introduction

Sentiment analysis plays an important role in video understanding. Typical sentiment analysis involves tasks such as Facial Expression Recognition (FER) (Pantic et al., 2005; Mollahosseini et al., 2017; Dhall et al., 2013; Zafeiriou et al., 2016), which classifies facial expressions into certain categories; Group Emotion Recognition (GER) (Dhall et al., 2018, 2015a, 2016, 2012a), which predicts the overall emotional state of a group of people; and even Audience Affective Response Recognition, which predicts audiences' emotional states with Valence-Arousal measures when watching videos (Kossaifi et al., 2017; Baveye et al., 2015).

Herein we propose a new task named Pairwise Emotional Relationship Recognition (PERR), defined as identifying the category of the emotional relationship between two interactive characters in a given video clip. The emotional relationship can be described at two scales: a coarse one, including positive, neutral and negative, and a fine-grained one, including hostile, tense, mild, intimate and neutral. The given clips contain video, audio, as well as subtitle information. Fig. 1 gives an illustration of this task. A man and a woman are talking to each other by the bed in a sickroom. Although the woman appears to be ill, the man is comforting her, and the overall emotional relationship is intimate, judging by their interactions such as facial expressions, behaviors, dialogue and even the background music. Identifying the emotional relationship in such interactions can advance the understanding of video content, especially the evolution of character relations in the storyline, how the story unfolds, etc.

Figure 1. Intimate or hostile? Obviously, the woman’s facial expression is negative due to the illness. However, their emotional relationship is intimate (positive) because the man comforts the woman by mildly touching and talking. Best viewed in color.

The PERR task is substantially different from existing emotion- or relation-related recognition tasks such as FER, GER and Social Relation Detection. FER generally predicts a single person's facial expression, and GER aims to predict the overall emotional state of a group of people. Their labels are emotion categories or emotional polarities, not relationships. Social relationship generally refers to relations such as colleagues, couples, etc., while PERR aims at predicting emotional relationships such as intimate or hostile. The emotional relationship is relatively straightforward compared with such implicit social relationships (Gallagher and Chen, 2009; Goel et al., 2019; Li et al., 2017; Sun et al., 2017; Zhang et al., 2015). What is more, PERR is a multi-modal task originating from drama or movie videos. Because of multi-camera shooting and professional editing techniques, a video clip generally has multiple shots and overlapping appearances of characters, which makes the task more challenging.

To facilitate the PERR task, we develop a new dataset called Emotional RelAtionship of inTeractiOn (ERATO). ERATO is a large-scale, high-quality dataset with emotional relationship annotations. Specifically, it consists of more than 30,000 interactive clips extracted from movies or dramas, annotated with the two people in the interaction, the emotional relationship category, and subtitles. Consistent with the PERR task, the ERATO dataset is interaction-centric, multi-shot, multi-modal and high-quality. The data collection and annotation process are under strict quality control. All the above aspects make ERATO a reliable starting point and benchmark for the PERR task. Building on ERATO, we propose a simple but effective Synchronous Modal-Temporal Attention (SMTA) unit for the PERR task. SMTA extends the LSTM unit with an attention mechanism, such as dot-product attention (Wu et al., 2019), multi-head attention (Vaswani et al., 2017) or a Transformer encoder (Vaswani et al., 2017), to fuse the multi-modal and temporal context information at the same time.

The contributions of this work are two-fold. First, we introduce a new task called Pairwise Emotional Relationship Recognition (PERR) together with a new dataset called ERATO. PERR is the first task that attempts to predict the emotional relationship of a pairwise interaction. Second, as a starting point, we propose a benchmark method using a Synchronous Modal-Temporal Attention (SMTA) unit to fuse the multi-modal information in the task. Compared to prevailing attention mechanisms (Wu et al., 2019; Wang et al., 2018; Vaswani et al., 2017), our SMTA unit stably improves the performance of PERR on ERATO. Code and dataset are available at

2. Related Work

2.1. Emotion Recognition Datasets

Early datasets for FER such as JAFFE (Lyons et al., 1998), CK (Tian et al., 2001; Lucey et al., 2010), MMI (Pantic et al., 2005), and MultiPie (47) were captured in lab-controlled environments. Thus, datasets collected in the wild, which contain people's naturalistic emotional states (Kollias et al., 2018; Zafeiriou et al., 2016; 57; Dhall et al., 2011a, 2011b; Mollahosseini et al., 2016; McDuff et al., 2013; Benitez-Quiroz et al., 2016; Mollahosseini et al., 2017), have attracted much more attention. Specifically, AFEW (Dhall et al., 2011a) includes video clips extracted from movies, and SFEW (Dhall et al., 2011b) is built from static images taken from a subset of the AFEW video clips. They all use annotations of 7 emotional categories (6 basic emotions plus neutral). The group emotion datasets, i.e. HAPPEI/Group Affect Database (Dhall et al., 2015b, a), consist of images from social-networking websites such as Flickr and Facebook, annotated with the affect conveyed by the group of people in the image. Recently, video datasets for GER have been proposed, including CAER (Lee et al., 2019), VGAF (Sharma et al., 2019) and GroupWalk (Mittal et al., 2020).

However, the human emotional state is complex in the real world. The more descriptive measures, Valence and Arousal (Russell, 1980), were therefore introduced to emotion datasets. In ACM-faces (Panagakis et al., 2015), AffectNet (Mollahosseini et al., 2017) and AFEW-VA (Kossaifi et al., 2017), Valence-Arousal values were provided to describe facial expressions in still images. Other datasets (Dhall et al., 2012b; Cowie et al., 2000; Douglas-Cowie et al., 2007; Schröder et al., 2009) were created in laboratory or controlled environments using Valence-Arousal values at the level of video clips. LIRIS-ACCEDE (Baveye et al., 2015) annotated audience responses to movie clips using the Valence-Arousal description.

2.2. Social Relationship Datasets

Early studies on predicting social relationships (Gallagher and Chen, 2009; Goel et al., 2019; Li et al., 2017; Sun et al., 2017; Zhang et al., 2015) are based on images, including PIPA (Zhang et al., 2015), PISC (Li et al., 2017), etc. PIPA contains 16 social relations, and PISC has a hierarchy of 3 coarse-level and 6 fine-level relationships. Sun et al. (Sun et al., 2017) extended the PIPA dataset and grouped the 16 social relationships into 5 categories. Researchers have further constructed video-based datasets for social relation prediction. SRIV (Lv et al., 2018) contains about 3,000 video clips collected from 69 movies with 8 subjective relations. ViSR (Liu et al., 2019) defines 8 types of social relations derived from Bugental's domain-based theory (Bugental and Blunt, 2000) and contains over 8,000 video clips. MovieGraphs (Vicol et al., 2018) consists of 7,637 movie clips annotated with persons, objects, visual or inferred properties, and their interactions, including social relationships. In dramas, the social relationship may be implicit, and inferring it may require much additional information beyond the clip itself. The emotional relationship, however, is more directly observable from the characters' expressions, behavior, dialogue and background music.

2.3. Methods in Video Understanding

Herein, we survey multi-modal approaches for video understanding. Most current multi-modal methods are model-agnostic (D'Mello and Kory, 2015), and can be divided into early, late and hybrid fusion (Atrey et al., 2010). The difference between early and late fusion is whether the fusion happens at the feature level or the prediction level. For instance, Castellano et al. (Castellano et al., 2008) used early fusion to combine facial expressions, body gestures and speech for emotion recognition. In (Kukleva et al., 2020), the visual and textual modalities were aggregated after the per-modality predictions. While model-agnostic approaches are relatively easy to implement, they are generally not effective at mining multi-modal data.

Recently, model-based approaches have explicitly designed architectures around the properties of their tasks, with attention mechanisms as the main choice for fusing multi-modal information. The non-local operator (Wang et al., 2018) can capture long-range dependencies and has been validated on video action recognition. Alcázar et al. (Alcázar et al., 2020) applied the non-local operation to active speaker detection: the non-local operation assembled the audio-visual context while a subsequent LSTM fused the temporal context. Transformer and BERT models (Vaswani et al., 2017; Devlin et al., 2018) also perform well on multi-modal tasks in video understanding (Khan et al., 2021), such as multi-modal representation learning (Lu et al., 2019; Lee et al., 2020; Sun et al., 2019) and video captioning (Zhou et al., 2018). In VideoBERT (Sun et al., 2019) and ViLBERT (Lu et al., 2019), a modality-specific transformer encodes long-term dependencies for each modality, followed by a multi-modal transformer that exchanges information across the visual-linguistic modalities.

Since the proposed PERR task is a video-based multi-modal problem, fusion methods should consider two dimensions: different modalities and time dependency. In this respect, (Alcázar et al., 2020) dealt with modality fusion first and then considered time dependency, while (Sun et al., 2019; Lu et al., 2019) took the reverse order. To provide a baseline, we propose a method using a Synchronous Modal-Temporal Attention unit that considers the time dependency while fusing the multi-modal information.

3. PERR and ERATO Dataset

3.1. Problem Description

We aim to recognize the emotional relationship between two interactive characters in drama or movie videos. There is no existing category definition for emotional relationships. Motivated by related work on GER and existing research on human emotion, namely Emotional Intimacy (Dahms, 1972) and Emotional Climate (De Rivera, 1992), we propose two emotional relationship category definitions. The first is coarse: neutral, positive and negative. The second is fine-grained: mild, intimate, tense, hostile and neutral, categories commonly listed in the above literature. Mild and intimate, and tense and hostile, are the fine-grained labels for positive and negative in the coarse definition, as shown in Table 1.

For untrimmed videos such as dramas and movies, audiences can easily identify who the main interactive characters are in a certain period and further recognize whether they are intimate or hostile. This is much more challenging for machines, since it involves person detection, main character identification, person tracking, interaction detection, multi-modal feature extraction, multi-modal information fusion, etc. To focus attention on multi-modal fusion, we define the PERR task as follows: for a given interaction-centric video clip, the two main interacting characters and their textual dialogue are given; the goal is to classify the emotional relationship between these two characters into one of the defined emotional relationship classes.

Category | Description
Negative (Hostile) | Antagonistic relationship, usually with fierce quarrels or even physical conflict
Negative (Tense) | Disagreeing with each other or under argument
Positive (Mild) | Heart-warming, pleasant dialogue with smiles
Positive (Intimate) | Close interactions in an affectionate or loving manner, including intimate physical contact
Neutral | Normal dialogue without emotional disposition
Table 1. The category definition in PERR.
Figure 2. Statistics of the ERATO dataset: (a) distribution of clip duration; (b) distribution of shot number; (c) distribution of appearance ratio; (d) distribution of subtitle ratio.

3.2. Dataset Construction

Data Preparation. To extract pairwise interactive video clips, we select dramas and movies of different genres to ensure data diversity. In total, we have 487 dramas belonging to 8 genres, with 1,161 episodes. We first divide each video into multiple shots using PySceneDetect (1). Then, by combining continuous shots, each episode can be roughly divided into story segments as in (Tapaswi et al., 2014). Note that the resulting story segments generally have long durations containing many pieces of pairwise interaction. We manually locate the pairwise interactions within those story segments so that each video clip corresponds to one pairwise interaction.
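The shot-splitting step can be illustrated with a minimal sketch. A content-based detector such as PySceneDetect compares adjacent frames and starts a new shot when their difference exceeds a threshold; the toy detector below mimics that idea on precomputed frame-difference scores. The function names and the threshold value are hypothetical, not from the paper or the PySceneDetect API.

```python
# Illustrative shot-boundary detection via frame-difference thresholding.

def detect_shot_boundaries(frame_scores, threshold=30.0):
    """frame_scores[i] is a dissimilarity score between frame i and frame i-1
    (e.g., a mean histogram difference). Returns indices where a new shot is
    assumed to begin."""
    return [i for i, score in enumerate(frame_scores) if score > threshold]

def split_into_shots(num_frames, boundaries):
    """Turn boundary indices into (start, end) frame ranges, end exclusive."""
    starts = [0] + boundaries
    ends = boundaries + [num_frames]
    return list(zip(starts, ends))
```

Consecutive shots returned this way would then be merged into story segments before the manual interaction localization described above.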

After retrieving the pairwise interactions, we apply several basic models to extract basic elements and improve annotation efficiency. Specifically, to help annotators identify the two interacting characters, we first detect faces with MTCNN (Zhang et al., 2016) and track them according to discriminative face features extracted by ArcFace (Deng et al., 2019). We then drop clips that include only one character or more than 6 characters. The top 3 faces (if there are more than 2) with the largest occurrence counts are selected as candidates for the main interactive characters in the clip; all 3 are displayed so that annotators can manually select the main interactive character pair. Second, we use CTPN (Tian et al., 2016) and CRNN (Fu et al., 2017) to detect and recognize the text in the video subtitles. Duplicate subtitles in adjacent frames are removed based on Levenshtein distance (Levenshtein, 1966). This textual dialogue information is also provided to annotators so they can fix potential errors of the detection and recognition models. In this way, we produce more than 40,000 video clips, with candidate main interactive characters and candidate subtitles, for the following annotation.
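The subtitle de-duplication step can be sketched as a classic Levenshtein edit distance plus a filter that drops a line when it is nearly identical to the previously kept one. The `max_dist` threshold is an assumption; the paper does not state the cutoff it used.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic two-row dynamic-programming edit distance (Levenshtein, 1966)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def dedup_subtitles(lines, max_dist=2):
    """Drop a subtitle line if it is nearly identical to the previous kept one,
    as happens when the same caption spans several adjacent frames."""
    kept = []
    for line in lines:
        if not kept or levenshtein(kept[-1], line) > max_dist:
            kept.append(line)
    return kept
```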

Annotation. To guarantee the quality of the dataset, we adopt a series of quality control mechanisms during annotation, which is divided into a training period and an annotating period. Six paid annotators are involved. In the training period, we first provide annotation guidance and demos; the annotators then annotate training video clips and cross-check each other's annotations until reaching a consensus. In the annotating period, we adopt a multi-stage strategy, which is relatively time-consuming but ensures quality. First, annotators watch the videos and fine-tune the detected subtitles. This stage corrects potential errors of the text detection and recognition models and also gives annotators an overall impression of the clip's content. If the interaction pattern is too complicated (e.g., no salient interactive characters), annotators may discard the clip. Second, annotators select the two main interactive characters from the detected top 3 candidate face images while watching the video again. Third, based on the interaction between the paired characters, such as facial expressions, poses, dialogue, background music, etc., annotators choose the one of the 5 fine-grained tags that best reflects the emotional relationship. Throughout the annotating period, a weekly cross-check is carried out to maintain consensus. The percent agreement (Glen, 2016) was 88.3% and 82.6% for the coarse and fine-grained categories respectively in the early stage, and gradually increased to 94.8% and 90.1% respectively.
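Percent agreement as used above is simply the fraction of items on which two annotators assign the same label; a minimal sketch (function name is illustrative):

```python
def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators give the same label."""
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)
```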

3.3. Dataset Statistics

Duration Statistics. The ERATO dataset consists of 31,182 annotated video clips, lasting 203 hours in total, extracted from 487 dramas spanning 1,161 episodes. Fig. 2(a) depicts the histogram of clip durations. Clip length varies from 5 to 35 seconds, with an average of 23.5 seconds. Most clips last 27 to 35 seconds, which is believed to provide enough information to reflect the emotional relationship.

Category Distribution. Table 2 shows the distribution of the emotional relationship categories in ERATO. Overall, the neutral category contains more than half of the samples, so the category distribution in ERATO is imbalanced. The positive and negative relationships have close ratios, around 19-20% each, and the stronger categories, hostile and intimate, are rarer than the intermediate ones, i.e. mild and tense. Note that this category imbalance is unavoidable, as it also exists in real life. To reflect this, we propose to use Micro-F1 and Macro-F1 as the model evaluation metrics, described in Sec. 3.5.


Distribution of shot number. Multi-shot footage is a distinctive characteristic of TV shows and movies, caused by multi-camera shooting and professional editing techniques. From Fig. 2(b), the shot number of most video clips ranges from 1 to 9, with a maximum of 25. Frequent shot changes within a clip make the task more challenging, since the characters and scene change over time; more advanced methods that model this interactive phenomenon will be needed.

Appearance ratio of the paired characters. The appearance ratio is the ratio of the on-screen time (in seconds) of the two paired interactive characters, with the smaller value as the numerator and the larger as the denominator. Fig. 2(c) shows the histogram of appearance ratios. In about 10% of the clips, the main interactive characters have equal appearance time. In the rest, the appearance time is unequal, with ratios mostly ranging from 0.2 to 0.9. This inequality may require a model to explicitly consider the different weights or contributions of the two characters to the emotional relationship.
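The appearance ratio defined above can be written as a one-liner (a hypothetical helper, assuming per-character on-screen seconds have already been counted):

```python
def appearance_ratio(seconds_a: float, seconds_b: float) -> float:
    """On-screen appearance ratio of a character pair: the smaller occurrence
    time divided by the larger, so the value lies in (0, 1]."""
    lo, hi = sorted((seconds_a, seconds_b))
    return lo / hi
```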

Category | Number | Ratio
Negative (Tense) | 6088 | 19.52%
Negative (Hostile) | 1421 | 4.56%
Positive (Mild) | 3969 | 12.73%
Positive (Intimate) | 1211 | 3.88%
Neutral | 18493 | 59.31%
Table 2. Sample statistics per category.
Figure 3. Examples of different interaction modes in multi-shot video clips. The red and blue lines indicate the time spans of the two characters; each sub-caption gives the interaction mode and its percentage in ERATO.

Distribution of Interaction Mode. We also categorize the interactions into different modes by Intersection over Minimum (IoM). IoM is the intersection of the two characters' time spans divided by the smaller of the two spans. According to the value of IoM, we divide the interaction mode into three categories: IoM = 0, 0 < IoM < 1, and IoM = 1. As Fig. 3 shows, samples with IoM = 0 account for 9.81%, meaning the time spans of the paired characters do not overlap. The proportion of 0 < IoM < 1 is 37.87%, representing partial interaction between the two paired characters. For IoM = 1, the appearance time of one character is contained within that of the other; this accounts for about 52.32%.
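The IoM computation can be sketched as follows, assuming for simplicity that each character's appearance is a single contiguous time span (the function name and span representation are illustrative):

```python
def interaction_mode(span_a, span_b):
    """Classify a character pair's interaction by Intersection over Minimum
    (IoM). Spans are (start, end) times in seconds."""
    inter = max(0.0, min(span_a[1], span_b[1]) - max(span_a[0], span_b[0]))
    min_len = min(span_a[1] - span_a[0], span_b[1] - span_b[0])
    iom = inter / min_len
    if iom == 0:
        return "no-overlap"   # IoM = 0
    if iom < 1:
        return "partial"      # 0 < IoM < 1
    return "inclusive"        # IoM = 1
```

In ERATO a character may appear in several disjoint intervals, so the real computation would intersect interval sets rather than single spans.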

Distribution of subtitle text. Dialogue is an important source of information in ERATO. Fig. 2(d) shows the histogram of the ratio of dialogue duration to clip length for each video. This ratio is computed by counting the duration of the subtitles, based on the observation that dialogue and subtitles are synchronized. About 4.7% of the video clips lack dialogue, meaning the interaction happens only through the visual modality: facial expressions and behaviors. For the rest, the most frequent ratios lie between 50% and 70%. The imbalanced durations of the text and visual modalities in ERATO may encourage the development of multi-modal attention or alignment methods.

Figure 4. Overall framework of our proposed method. The structure of SMTA is shown in Fig. 5.

Training and test splitting. We randomly sample 20% of each category as the test set, so that the class-wise distributions of the training and test sets are consistent.
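This per-category sampling can be sketched as follows (a minimal illustration with hypothetical helper names; the actual splitting script is not shown in the paper):

```python
import random

def stratified_split(samples_by_class, test_frac=0.2, seed=0):
    """Sample test_frac of each category so the train/test class
    distributions match, as described for ERATO."""
    rng = random.Random(seed)
    train, test = [], []
    for label, items in samples_by_class.items():
        items = items[:]          # avoid mutating the caller's lists
        rng.shuffle(items)
        k = round(len(items) * test_frac)
        test.extend((x, label) for x in items[:k])
        train.extend((x, label) for x in items[k:])
    return train, test
```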

3.4. Dataset Properties

Interaction-centric. Interaction is the unique and fundamental difference between PERR and other emotion recognition tasks. Unlike tasks such as FER and GER, PERR focuses on interacting characters rather than the emotional state of a single character or a group of people. As shown in Fig. 3, the samples in ERATO have interactive patterns across multiple shots: one character appears in one shot while the other appears in a different shot, and sometimes both appear at the same time. This calls for new methods to handle such interactive features.

Multi-modality. ERATO is a video-based dataset and thus inherently multi-modal. It contains visual, audio and textual information with start and end timestamps, detected by models and refined by annotators. Most existing datasets for emotion recognition lack textual information. Besides facilitating the prediction of interactive emotional relationships, ERATO can push forward research on vision-language multi-modal fusion methods.

High-quality and Diverse. Besides the comprehensive quality control mechanism described in Sec. 3.2, the video clips have high resolution (e.g., 720p and 1080p) and range over many genres, including "crime", "war", "action", "romance", "fantasy", "life", "comedy", and "science fiction".

3.5. Evaluation

To evaluate performance on PERR, we employ two variants of the F1-score, Macro-F1 and Micro-F1, in view of the imbalanced categories. Macro-F1 is the arithmetic mean of the per-class F1-scores, giving equal weight to each class. Micro-F1 is the harmonic mean of precision and recall computed over all samples, without class weighting. Comparing the two metrics lets us evaluate model performance on the overall test set as well as on each class, especially the minority classes.
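A minimal reference implementation of the two metrics follows. Note that in the single-label multi-class setting used here, Micro-F1 reduces to plain accuracy.

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Compute (Micro-F1, Macro-F1) for a multi-class problem."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted p wrongly -> false positive for p
            fn[t] += 1   # missed true class t -> false negative for t
    per_class = []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        per_class.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    macro = sum(per_class) / len(per_class)
    micro = sum(tp.values()) / len(y_true)  # equals accuracy for single-label data
    return micro, macro
```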

4. Approach

In this section, we present a baseline method for recognizing the emotional relationship on ERATO. The proposed method uses a Synchronous Modal-Temporal Attention unit to carry out multi-modal fusion.

4.1. Overall structures

Following the definition of PERR in Sec. 3.1, we utilize visual, audio and textual features to predict the emotional relationship. Formally, for each video clip with class label $y$, denote by $x_t^m$ the feature of modality $m$ at time $t$ with dimension $d_m$, where $m \in \{1, \dots, M\}$ and $M$ is the number of modalities. The modalities we use are the characters' facial expressions, body gestures, the subtitle text features, and the audio features of the video. We extract a total of $T$ frames for each video clip. The details of multi-modal feature extraction are described in Sec. 5.1. Note that, as analyzed in Sec. 3.3, the features are not available at all times; we use zero-padding for the missing features.

Given the features $\{x_t^m\}$, we propose a framework consisting of two stages: multi-modal feature fusion and time-dependency fusion. As shown in Fig. 4, the multi-modal feature fusion stage takes the raw multi-modal features as input and outputs a fused feature representation at each time step for each modality, considering both the history and the other modalities. The second stage takes the modality-wise fused representations at each time step as input and outputs the final prediction.

4.2. Synchronous Modal-Temporal Attention

In the multi-modal fusion stage, we propose a Synchronous Modal-Temporal Attention (SMTA) unit that fuses the raw multi-modal features while considering the time dependency. SMTA is a stackable unit extending the LSTM cell; Fig. 5 gives its detailed structure. At timestep $t$, SMTA takes as input the raw features $x_t^1, \dots, x_t^M$ together with the hidden state $h_{t-1}$ and cell state $c_{t-1}$ from the previous step. Besides the modal-fused features $z_t^1, \dots, z_t^M$, it also outputs the updated hidden state $h_t$ and cell state $c_t$, which are the inputs of the SMTA unit at the next step. SMTA consists of two major components: an LSTM cell and a modal fusion component. The LSTM cell takes the hidden state $h_{t-1}$, the cell state $c_{t-1}$ and the sum of the raw features of all modalities at time $t$, i.e. $\sum_m x_t^m$, as input, and outputs the next hidden and cell states:

$$h_t, c_t = \mathrm{LSTM}\Big(\sum_{m=1}^{M} x_t^m,\; h_{t-1},\; c_{t-1}\Big)$$
The modal fusion component takes as input the raw modal features $x_t^1, \dots, x_t^M$ and the updated hidden state $h_t$ of the LSTM, and outputs the modal-fused feature representations $z_t^1, \dots, z_t^M$. We expect this module to consider both the modalities and the time dependency. To achieve this, we first stack all the inputs together, i.e. $S_t = [h_t; x_t^1; \dots; x_t^M]$. An attention matrix $A$ is then designed to represent the attention weights over all the modalities as well as the temporal context $h_t$. $A$ can be implemented in many ways, including dot-product attention, multi-head attention or a Transformer encoder; the implementation details are in the appendix and the results are in Table 3. We then obtain the intermediate fused features $\tilde{z}_t = A S_t$. Taking the vectors of $\tilde{z}_t$ corresponding to each $x_t^m$ and concatenating them with $x_t^m$ gives the final fused features $z_t^m = [\tilde{z}_t^m; x_t^m]$. Note that directly concatenating $x_t^m$ with the intermediate fused features follows the idea of residual learning in ResNet (12).
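A rough NumPy sketch of one SMTA fusion step with plain dot-product attention follows. The dimensions, the softmax normalization, and the exact residual concatenation are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def smta_fuse(h_t, xs):
    """One SMTA modal-fusion step (sketch). h_t: (d,) LSTM hidden state
    carrying temporal context; xs: list of M modality features, each (d,).
    Returns M fused features, each the attention output concatenated with
    the raw modality feature (residual-style)."""
    S = np.stack([h_t] + xs)                  # (M+1, d): context + modalities
    A = S @ S.T / np.sqrt(S.shape[1])         # scaled dot-product scores
    A = np.exp(A - A.max(axis=1, keepdims=True))
    A = A / A.sum(axis=1, keepdims=True)      # row-wise softmax -> attention matrix
    fused = A @ S                             # (M+1, d) intermediate fused features
    # keep the rows corresponding to the modalities (skip the context row)
    # and concatenate each with its raw feature
    return [np.concatenate([fused[m + 1], xs[m]]) for m in range(len(xs))]
```

In the paper's actual unit, the dot-product attention can be swapped for multi-head attention or a Transformer encoder, which is what the variants in Table 3 compare.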

Figure 5. The general structure of Synchronous Modal-Temporal Attention (SMTA) unit.

4.3. Temporal Fusion

Having obtained the final fused features $z_t^m$, we apply an LSTM to further fuse the temporal information of each fused modality:

$$h_t^m = \mathrm{LSTM}(z_t^m, h_{t-1}^m)$$

where $h_t^m$ is the hidden state. The final prediction for the clip is

$$\hat{y} = \mathrm{FC}([h_T^1; \dots; h_T^M])$$

where $h_T^m$ is the hidden state of modality $m$ at the final timestep $T$, and $\mathrm{FC}$ indicates a fully connected layer. In addition, we also calculate a prediction using each single modality, i.e.

$$\hat{y}^m = \mathrm{FC}(h_T^m)$$

The overall objective of the method can be written as follows:

$$\mathcal{L} = \mathrm{CE}(\hat{y}, y) + \sum_{m=1}^{M} \mathrm{CE}(\hat{y}^m, y)$$

where $\mathrm{CE}$ indicates the cross-entropy loss.
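The objective can be sketched numerically as a fused-prediction cross-entropy plus the auxiliary per-modality cross-entropies. Equal weighting of the terms is an assumption; the excerpt does not specify loss weights.

```python
import numpy as np

def cross_entropy(logits, label):
    """Numerically stable softmax cross-entropy for a single sample."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def perr_loss(fused_logits, per_modal_logits, label):
    """Overall objective (sketch): CE on the fused prediction plus CE on
    each single-modality prediction, equally weighted (assumption)."""
    loss = cross_entropy(fused_logits, label)
    for logits in per_modal_logits:
        loss += cross_entropy(logits, label)
    return loss
```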

5. Experiments

5.1. Implementation Details

Multi-modal Features. We extract visual, audio and textual features for the PERR task. The visual features include background features and person-wise features; the person-wise features consist of facial expression features, posture features, and the relative position to the other person. Together with the textual and audio features, there are 5 modal features as input. The detailed feature extraction is described in the appendix.

Method | 5 Category Micro-F1 | 5 Category Macro-F1 | 3 Category Micro-F1 | 3 Category Macro-F1
LSTM | 63.80 | 44.43 | 64.98 | 56.23
Dot-product (Wu et al., 2019) | 64.35 | 45.80 | 66.55 | 58.89
Non-local (Wang et al., 2018) | 63.72 | 47.16 | 67.49 | 59.74
Multi-head (Vaswani et al., 2017) | 64.11 | 47.15 | 67.30 | 60.20
Transformer (Vaswani et al., 2017) | 64.82 | 48.90 | 68.59 | 58.98
SMTA (Dot-product) | 65.45 | 47.92 | 69.18 | 60.20
SMTA (Non-local) | 65.33 | 48.00 | 69.66 | 60.80
SMTA (Multi-head) | 65.60 | 47.71 | 68.95 | 61.35
SMTA (Transformer) | 66.55 | 49.81 | 70.05 | 61.73
Table 3. Performance of different methods on ERATO.


The network is implemented in PyTorch (Paszke et al., 2019). We use Adam (Kingma and Ba, 2014) as the optimizer, decaying the learning rate by a factor of 0.1 every 15 epochs. The model is trained for 50 epochs with a mini-batch size of 16.

5.2. Results and Comparisons

Methods to compare. We compare two typical approaches with our proposed method. All methods use the same multi-modal features presented above to ensure fairness; the main difference lies in the first multi-modal fusion stage (the second part of Fig. 4). 1) Plain LSTM: this method uses only an LSTM layer without any attention mechanism as the first fusion stage. The LSTM cells take the summed raw multi-modal features as input, without modality-wise attention or temporal context. 2) Existing attention methods: these use attention modules (e.g., dot-product, non-local block, multi-head attention, transformer encoder) to compute attention weights on their own in the first fusion stage, without temporal context.

Results. Micro-F1 and Macro-F1 on ERATO for the coarse- and fine-grained categories are shown in Table 3. Our proposed SMTA with a transformer encoder as the attention module achieves the best results: Micro-F1 of 66.55% and Macro-F1 of 49.81% on the 5-category task, and Micro-F1 of 70.05% and Macro-F1 of 61.73% on the 3-category task. Compared with the LSTM baseline, SMTA with an attention mechanism improves every metric, with margins of about 5% on Macro-F1 and on the 3-category Micro-F1. The use of temporal context also increases Macro-F1 by about 1%, as shown in Table 3.
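The two metrics weight errors differently: micro-F1 pools all decisions, while macro-F1 averages per-class F1, so rare classes pull it down. A minimal single-label sketch (not the paper's evaluation code) makes the distinction explicit:

```python
from collections import Counter

def f1_scores(y_true, y_pred, labels):
    """Single-label multi-class F1. Micro-F1 pools all decisions
    (each sample counts equally), while Macro-F1 averages per-class
    F1 (each class counts equally), so Macro-F1 drops sharply when
    rare classes such as 'hostile' are predicted poorly."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    def f1(t, false_pos, false_neg):
        return 2 * t / (2 * t + false_pos + false_neg) if t else 0.0
    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(per_class.values()) / len(labels)
    return micro, macro, per_class

# Toy example with three relationship categories:
micro, macro, per_class = f1_scores(
    ["pos", "pos", "neg", "neu"],
    ["pos", "neg", "neg", "neu"],
    ["pos", "neg", "neu"])
```

In the single-label case micro-F1 equals accuracy (0.75 here), while macro-F1 is lower because the misclassified classes drag down the per-class average.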

Figure 6. Performance comparison between SMTA with transformer encoder and the LSTM baseline: (a) F1-scores on the 5-category task; (b) F1-scores on the 3-category task.
Index | Text | Audio | Visual: Background | Visual: Person-wise | Micro-F1 (5-cat) | Macro-F1 (5-cat) | Micro-F1 (3-cat) | Macro-F1 (3-cat)
------|------|-------|--------------------|---------------------|------------------|------------------|------------------|-----------------
1     |  ✓   |  ✓    |  ✓                 |                     | 63.80            | 44.91            | 66.55            | 56.56
2     |  ✓   |  ✓    |                    |  ✓                  | 66.27            | 48.93            | 69.48            | 60.52
3     |  ✓   |       |  ✓                 |  ✓                  | 64.94            | 47.02            | 67.96            | 58.87
4     |      |  ✓    |  ✓                 |  ✓                  | 65.19            | 46.24            | 68.24            | 58.55
5     |  ✓   |  ✓    |  ✓                 |  ✓                  | 66.55            | 49.81            | 70.05            | 61.73
Table 4. Evaluation of different modal features (✓ = modality used). Rows 1–4 each drop one modality; row 5 uses all of them.
Figure 7. Visualization of the attention weights for each modal feature, including the person-wise features, background (B), audio (A) and subtitle text (T). Intense dialogue biases the attention weights towards text and audio; in shots where human behavior matters, the weights on the character and background features increase.
Figure 8. Bad-case example: humans can naturally infer from the dialogue content that the emotional relationship is positive, but the model struggles because the visual cues point the other way and the textual evidence is implicit.

5.3. Analysis

How important are the paired interactive characters? The goal of our work is to recognize the pairwise emotional relationship, so intuitively the paired interactive characters should have a great impact on the results. Comparing the first and fifth rows of Table 4, removing the person-wise features decreases PERR performance sharply, by 5.17% Macro-F1 and 3.50% Micro-F1 on the 3-category subtask. The same drop appears on the 5-category subtask, indicating that the paired interactive characters play an important role in PERR.

Is multi-modality necessary for the PERR task? Multi-modal information is helpful for pairwise emotional relationship recognition: kissing is a visual cue, quarreling an audio one. Table 4 shows the results for different combinations of visual, audio and textual features; the best result uses all of them. Comparing each row of Table 4 with the final result, we can roughly rank the importance of each modality by the Macro-F1 gap on the 3-category subtask: person-wise features (5.17%), textual features (3.18%), audio features (2.86%) and background features (1.21%), which is intuitively consistent with how affection is expressed in dramas and movies.
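This ranking can be reproduced directly from the Table 4 numbers; the row-to-modality mapping below is inferred from the gaps quoted in the analysis:

```python
# Macro-F1 on the 3-category subtask, taken from Table 4
# (row 5 = all modalities; rows 1-4 each drop one modality).
full = 61.73
without = {
    "person-wise": 56.56,  # row 1
    "background": 60.52,   # row 2
    "audio": 58.87,        # row 3
    "text": 58.55,         # row 4
}
# Gap = drop in Macro-F1 when the modality is removed.
gaps = {m: round(full - v, 2) for m, v in without.items()}
ranking = sorted(gaps, key=gaps.get, reverse=True)
# ranking: person-wise (5.17) > text (3.18) > audio (2.86) > background (1.21)
```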

Does the temporal information affect PERR? As shown in Table 3, regardless of which attention mechanism is used, SMTA outperforms the corresponding attention-only variant by about 1%. In particular, combining temporal context with the transformer encoder (last row of Table 3) raises the 3-category Macro-F1 by 2.75% over the transformer encoder alone. The main reason might be that SMTA exploits temporal context early in the modal-fusion stage, allowing multi-modal information to be exchanged along the time dimension.

What is the role of the attention mechanism in PERR? The per-category improvement from the attention mechanism is shown in Fig. 6. From Fig. 6(a), every category except neutral improves significantly over the baseline model without attention. For harder categories with fewer samples, such as hostile and intimate, the relative increase is 29.7% for hostile and about 73.9% for intimate. On the 3-category subtask, SMTA gains on every class: 4.1% for neutral, 11.5% for negative and 18.6% for positive. Fig. 7 also shows the attention weight for each modality in the SMTA unit. When the dialogue is dense, the attention weights for text and audio grow large; when the shot focuses on a character's facial expression, the weight on the person-wise features increases. This illustrates how attention fuses the different modal features. More examples are given in the appendix.

Is SMTA applicable to other tasks? Our proposed SMTA model leverages modal and temporal information simultaneously in multi-modal learning. Since PERR is a new task and cannot be compared fairly against existing methods, we report results of SMTA on the LIRIS-ACCEDE (Baveye et al., 2015) dataset, a multi-modal video dataset for predicting audiences' emotional states in terms of Valence-Arousal metrics. Our model achieves the best performance compared with the state of the art: MSE of 0.066 and PCC of 0.454 on valence prediction. Detailed results are in the appendix.

Challenging Samples of PERR. Many issues remain unsolved for PERR, such as large variations in character appearance and the discovery of hidden but decisive clues. Fig. 8 gives an example where the current model makes a wrong prediction. In this clip, humans can easily infer from the dialogue that the emotional relationship is positive, yet the visual cues and background music are sad or repressive, and only a few key words in the dialogue strongly indicate the real relationship. Such cases are hard for current models, and we hope ERATO will help the community develop more sophisticated methods to solve them.

6. Conclusion

In this paper, we propose a new task named Pairwise Emotional Relationship Recognition (PERR) together with a dataset called ERATO. The task is to predict the emotional relationship between the two main characters in a video clip. Built on top of dramas and movies, ERATO provides interaction-centric video clips with multi-shots, varied video lengths and multiple annotated modalities including video, audio and text. We also propose a baseline model with a Synchronous Modal-Temporal Attention (SMTA) unit that handles multi-modal fusion and temporal relations at the same time. Comprehensive experiments with SMTA on ERATO validate the effectiveness of the method. However, there is still large room for improvement on PERR: how to explicitly model the interactive sequences of the two characters, whether matching the dialogue content against the characters' expressions helps, and how to better handle unbalanced modal information. We expect the PERR task to inspire further insights in both affective computing and multi-modal learning.


  • [1] Note: Cited by: §3.2.
  • J. L. Alcázar, F. Caba, L. Mai, F. Perazzi, J. Lee, P. Arbeláez, and B. Ghanem (2020) Active speakers in context. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12465–12474. Cited by: §2.3, §2.3.
  • P. K. Atrey, M. A. Hossain, A. E. Saddik, and M. S. Kankanhalli (2010) Multimodal fusion for multimedia analysis: a survey. Multimedia Systems 16 (6), pp. 345–379. Cited by: §2.3.
  • Y. Baveye, E. Dellandrea, C. Chamaret, and L. Chen (2015) LIRIS-accede: a video database for affective content analysis. IEEE Transactions on Affective Computing 6 (1), pp. 43–55. Cited by: §1, §2.1, §5.3.
  • C. F. Benitez-Quiroz, R. Srinivasan, and A. M. Martinez (2016) EmotioNet: an accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1.
  • Bugental and D. Blunt (2000) Acquisition of the algorithms of social life: a domain-based approach.. Psychological Bulletin. Cited by: §2.2.
  • G. Castellano, L. Kessous, and G. Caridakis (2008) Emotion recognition through multiple modalities: face, body gesture, speech. Affect and Emotion in Human-Computer Interaction. Cited by: §2.3.
  • R. Cowie, E. Douglas-Cowie, S. Savvidou*, E. McMahon, M. Sawey, and M. Schröder (2000) ’FEELTRACE’: an instrument for recording perceived emotion in real time. In ISCA tutorial and research workshop (ITRW) on speech and emotion, Cited by: §2.1.
  • S. K. D’Mello and J. Kory (2015) A review and meta-analysis of multimodal affect detection systems. Acm Computing Surveys 47 (3), pp. 1–36. Cited by: §2.3.
  • A. M. Dahms (1972) Emotional intimacy: overlooked requirement for survival. 1st edition, Pruett Pub. Co;, . External Links: ISBN 0871081849 Cited by: §3.1.
  • J. De Rivera (1992) Emotional climate: social structure and emotional dynamics.. In A preliminary draft of this chapter was discussed at a workshop on emotional climate sponsored by the Clark European Center in Luxembourg on Jul 12–14, 1991., Cited by: §3.1.
  • [12] (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision & Pattern Recognition, Cited by: §4.2.
  • J. Deng, J. Guo, N. Xue, and S. Zafeiriou (2019) Arcface: additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4690–4699. Cited by: §3.2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §2.3.
  • A. Dhall, R. Goecke, J. Joshi, J. Hoey, and D. Gedeon (2016) EmotiW 2016: video and group-level emotion recognition challenges. In the 18th ACM International Conference, Cited by: §1.
  • A. Dhall, R. Goecke, J. Joshi, M. Wagner, and T. Gedeon (2013) Emotion recognition in the wild challenge 2013. In Proceedings of the 15th ACM on International conference on multimodal interaction, Cited by: §1.
  • A. Dhall, J. Joshi, I. Radwan, and R. Goecke (2012a) Finding happiest moments in a social context. In Asian Conference on Computer Vision, Cited by: §1.
  • A. Dhall, J. Joshi, K. Sikka, R. Goecke, and N. Sebe (2015a) The more the merrier: analysing the affect of a group of people in images. In IEEE International Conference Workshops on Automatic Face and Gesture Recognition, Cited by: §1, §2.1.
  • A. Dhall, A. Kaur, R. Goecke, and T. Gedeon (2018) EmotiW 2018: audio-video, student engagement and group-level affect prediction. Cited by: §1.
  • A. Dhall, R. Goecke, and T. Gedeon (2015b) Automatic group happiness intensity analysis. IEEE Transactions on Affective Computing 6 (1), pp. 13–26. Cited by: §2.1.
  • A. Dhall, R. Goecke, S. Lucey, and T. Gedeon (2011a) Acted facial expressions in the wild database. Australian National University, Canberra, Australia, Technical Report TR-CS-11 2, pp. 1. Cited by: §2.1.
  • A. Dhall, R. Goecke, S. Lucey, and T. Gedeon (2011b) Static facial expression analysis in tough conditions: data, evaluation protocol and benchmark. In 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pp. 2106–2112. Cited by: §2.1.
  • A. Dhall, R. Goecke, S. Lucey, and T. Gedeon (2012b) Collecting large, richly annotated facial-expression databases from movies. IEEE Annals of the History of Computing 19 (03), pp. 34–41. Cited by: §2.1.
  • E. Douglas-Cowie, R. Cowie, I. Sneddon, C. Cox, O. Lowry, M. McRorie, J. Martin, L. Devillers, S. Abrilian, A. Batliner, et al. (2007) The humaine database: addressing the collection and annotation of naturalistic and induced emotional data. In International conference on affective computing and intelligent interaction, pp. 488–500. Cited by: §2.1.
  • X. Fu, E. Ch’ng, U. Aickelin, and S. See (2017) CRNN: a joint neural network for redundancy detection. In 2017 IEEE international conference on smart computing (SMARTCOMP), pp. 1–8. Cited by: §3.2.
  • A. C. Gallagher and T. Chen (2009) Understanding images of groups of people. IEEE. Cited by: §1, §2.2.
  • S. Glen (2016) Inter-rater reliability irr: definition, calculation. External Links: Link Cited by: §3.2.
  • A. Goel, K. T. Ma, and C. Tan (2019) An end-to-end network for generating social relationship graphs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11186–11195. Cited by: §1, §2.2.
  • S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah (2021) Transformers in vision: a survey. arXiv preprint arXiv:2101.01169. Cited by: §2.3.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §5.1.
  • D. Kollias, P. Tzirakis, M. A. Nicolaou, A. Papaioannou, G. Zhao, B. Schuller, I. Kotsia, and S. Zafeiriou (2018) Deep affect prediction in-the-wild: aff-wild database and challenge, deep architectures, and beyond. Cited by: §2.1.
  • J. Kossaifi, G. Tzimiropoulos, S. Todorovic, and M. Pantic (2017) AFEW-va database for valence and arousal estimation in-the-wild. Image and Vision Computing. Cited by: §1, §2.1.
  • A. Kukleva, M. Tapaswi, and I. Laptev (2020) Learning interactions and relationships between movie characters. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.3.
  • J. Lee, S. Kim, S. Kim, J. Park, and K. Sohn (2019) Context-aware emotion recognition networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10143–10152. Cited by: §2.1.
  • S. Lee, Y. Yu, G. Kim, T. Breuel, J. Kautz, and Y. Song (2020) Parameter efficient multimodal transformers for video representation learning. arXiv preprint arXiv:2012.04124. Cited by: §2.3.
  • V. I. Levenshtein (1966) Binary codes capable of correcting deletions, insertions, and reversals. In Soviet physics doklady, Vol. 10, pp. 707–710. Cited by: §3.2.
  • J. Li, Y. Wong, Q. Zhao, and M. S. Kankanhalli (2017) Dual-glance model for deciphering social relationships. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2650–2659. Cited by: §1, §2.2.
  • X. Liu, W. Liu, M. Zhang, J. Chen, and T. Mei (2019) Social relation recognition from videos via multi-scale spatial-temporal reasoning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition 2019, Cited by: §2.2.
  • J. Lu, D. Batra, D. Parikh, and S. Lee (2019) Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv preprint arXiv:1908.02265. Cited by: §2.3, §2.3.
  • P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, and I. Matthews (2010) The extended cohn-kanade dataset (ck+): a complete dataset for action unit and emotion-specified expression. In Computer Vision Pattern Recognition Workshops, Cited by: §2.1.
  • J. Lv, W. Liu, L. Zhou, B. Wu, and H. Ma (2018) Multi-stream fusion model for social relation recognition from videos. In International Conference on Multimedia Modeling, pp. 355–368. Cited by: §2.2.
  • M. Lyons, S. Akamatsu, M. Kamachi, and J. Gyoba (1998) Coding facial expressions with gabor wavelets. In Proceedings Third IEEE international conference on automatic face and gesture recognition, pp. 200–205. Cited by: §2.1.
  • D. Mcduff, R. Kaliouby, T. Senechal, M. Amr, J. Cohn, and R. Picard (2013) Affectiva-mit facial expression dataset (am-fed): naturalistic and spontaneous facial expressions collected. IEEE. Cited by: §2.1.
  • T. Mittal, P. Guhan, U. Bhattacharya, R. Chandra, A. Bera, and D. Manocha (2020) EmotiCon: context-aware multimodal emotion recognition using frege’s principle. Cited by: §2.1.
  • A. Mollahosseini, B. Hasani, and M. H. Mahoor (2017) Affectnet: a database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing 10 (1), pp. 18–31. Cited by: §1, §2.1, §2.1.
  • A. Mollahosseini, B. Hasani, M. J. Salvador, H. Abdollahi, D. Chan, and M. H. Mahoor (2016) Facial expression recognition from world wild web. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 58–65. Cited by: §2.1.
  • [47] (2010) Multi-pie. Image Vis Comput 28 (5), pp. 807–813. Cited by: §2.1.
  • Y. Panagakis, M. A. Nicolaou, S. Zafeiriou, and M. Pantic (2015) Robust correlated and individual component analysis. IEEE transactions on pattern analysis and machine intelligence 38 (8), pp. 1665–1678. Cited by: §2.1.
  • M. Pantic, M. Valstar, R. Rademaker, and L. Maat (2005) Web-based database for facial expression analysis. In Proc IEEE International Conference on Multimedia & Expo, Cited by: §1, §2.1.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) Pytorch: an imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703. Cited by: §5.1.
  • J. A. Russell (1980) A circumplex model of affect.. Journal of personality and social psychology 39 (6), pp. 1161. Cited by: §2.1.
  • M. Schröder, E. Bevacqua, F. Eyben, H. Gunes, D. Heylen, M. ter Maat, S. Pammi, M. Pantic, C. Pelachaud, B. Schuller, et al. (2009) A demonstration of audiovisual sensitive artificial listeners. In 2009 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops, pp. 1–2. Cited by: §2.1.
  • G. Sharma, S. Ghosh, and A. Dhall (2019) Automatic group level affect and cohesion prediction in videos. In 2019 8th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW), pp. 161–167. Cited by: §2.1.
  • C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid (2019) Videobert: a joint model for video and language representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7464–7473. Cited by: §2.3, §2.3.
  • Q. Sun, B. Schiele, and M. Fritz (2017) A domain based approach to social relation recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3481–3490. Cited by: §1, §2.2.
  • M. Tapaswi, M. Bauml, and R. Stiefelhagen (2014) Storygraphs: visualizing character interactions as a timeline. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 827–834. Cited by: §3.2.
  • [57] (2016) Tega: a social robot. In 2016 11th ACM/IEEE International Conference on Human-Robot Interaction (HRI), Cited by: §2.1.
  • Y. Tian, T. Kanade, and J. F. Cohn (2001) Recognizing action units for facial expression analysis. IEEE Transactions on pattern analysis and machine intelligence 23 (2), pp. 97–115. Cited by: §2.1.
  • Z. Tian, W. Huang, T. He, P. He, and Y. Qiao (2016) Detecting text in natural image with connectionist text proposal network. In European conference on computer vision, pp. 56–72. Cited by: §3.2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. arXiv preprint arXiv:1706.03762. Cited by: §1, §1, §2.3, Table 3.
  • P. Vicol, M. Tapaswi, L. Castrejon, and S. Fidler (2018) Moviegraphs: towards understanding human-centric situations from videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8581–8590. Cited by: §2.2.
  • X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7794–7803. Cited by: §1, §2.3, Table 3.
  • J. Wu, L. Wang, L. Wang, J. Guo, and G. Wu (2019) Learning actor relation graphs for group activity recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9964–9974. Cited by: §1, §1, Table 3.
  • S. Zafeiriou, A. Papaioannou, I. Kotsia, M. Nicolaou, and G. Zhao (2016) Facial affect ”in-the-wild”: a survey and a new database. In 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Cited by: §1, §2.1.
  • K. Zhang, Z. Zhang, Z. Li, and Y. Qiao (2016) Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters 23 (10), pp. 1499–1503. Cited by: §3.2.
  • N. Zhang, M. Paluri, Y. Taigman, R. Fergus, and L. Bourdev (2015) Beyond frontal faces: improving person recognition using multiple cues. IEEE Computer Society, pp. 4804–4813. Cited by: §1, §2.2.
  • L. Zhou, Y. Zhou, J. J. Corso, R. Socher, and C. Xiong (2018) End-to-end dense video captioning with masked transformer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8739–8748. Cited by: §2.3.