Affective Computing for Large-Scale Heterogeneous Multimedia Data: A Survey

The wide popularity of digital photography and social networks has generated a rapidly growing volume of multimedia data (i.e., image, music, and video), resulting in a great demand for managing, retrieving, and understanding these data. Affective computing (AC) of these data can help to understand human behaviors and enable wide applications. In this article, we survey the state-of-the-art AC technologies comprehensively for large-scale heterogeneous multimedia data. We begin this survey by introducing the typical emotion representation models from psychology that are widely employed in AC. We briefly describe the available datasets for evaluating AC algorithms. We then summarize and compare the representative methods on AC of different multimedia types, i.e., images, music, videos, and multimodal data, with the focus on both handcrafted features-based methods and deep learning methods. Finally, we discuss some challenges and future directions for multimedia affective computing.




1. Introduction

Users are increasingly recording their daily activities, sharing interesting experiences, and expressing personal viewpoints using mobile devices on social networks such as Twitter, Facebook, and Weibo. As a result, a rapidly growing volume of multimedia data (i.e., image, music, and video) has been generated, as shown in Figure 1, which results in a great demand for the management, retrieval, and understanding of these data. Most existing work on multimedia analysis focuses on the cognitive aspects, i.e., understanding the objective content, such as object detection in images (han2018advanced), speaker recognition in speech (hansen2015speaker), and action recognition in videos (herath2017going). Since what people feel has a direct influence on their decision making, affective computing (AC) of these multimedia data is of significant importance and has attracted increasing attention (chen2014object; zhang2016exploring; yao2019attention; zhan2019zero; zhao2019pdanet). For example, companies would like to know how customers evaluate their products so that they can improve their services (jansen2009twitter); depression and anxiety detection from social media can help understand psychological distress and thus potentially prevent suicidal actions (shen2017depression).

While sentiment analysis in text (pang2008opinion) has long been a standard task, AC from other modalities, such as image and video, has only recently begun to receive attention. In this article, we aim to review the existing AC technologies comprehensively for large-scale heterogeneous multimedia data, including image, music, video, and multimodal data.

Affective computing of multimedia (ACM) aims to recognize the emotions that a given stimulus is expected to evoke in viewers. Similar to other supervised learning tasks, ACM is typically composed of three steps: data collection and annotation, feature extraction, and learning a mapping between features and emotions (zhao2017approximating). One main challenge for ACM is the affective gap, i.e., “the lack of coincidence between the features and the expected affective state in which the user is brought by perceiving the signal” (hanjalic2006extracting). In the early stage, various hand-crafted features were designed to bridge this gap with traditional machine learning algorithms, while more recently researchers have focused on end-to-end deep learning from raw multimedia data to recognize emotions. Existing ACM methods mainly assign the dominant (average) emotion category (DEC) to an input stimulus, based on the assumption that different viewers react similarly to the same stimulus. This task can usually be formulated as a single-label learning problem.

However, emotions are influenced by subjective and contextual factors, such as educational background, cultural diversity, and social interaction (peng2015mixed; zhao2016predicting; yang2017learning). As a result, different viewers may react differently to the same stimulus, which creates the subjective perception challenge. Because of this perception inconsistency, simply predicting the DEC is insufficient for such a highly subjective variable. As stated in (zhao2016predicting), two kinds of ACM tasks can deal with the subjectivity challenge: predicting personalized emotion perception for each viewer, and assigning multiple emotion labels to each stimulus. For the latter, we can either assign multiple labels to each stimulus with equal importance using multi-label learning methods, or predict an emotion distribution, which captures the degree of each emotion (yang2017learning).
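To make the difference between these label types concrete, the following minimal sketch derives a dominant category, an emotion distribution, and a multi-label set from the same annotator votes. The function names, the example votes, and the 0.25 share threshold for the multi-label case are illustrative assumptions, not settings taken from any of the cited works.

```python
from collections import Counter

# Mikels' eight categories, used here only as an example label set.
EMOTIONS = ["amusement", "anger", "awe", "contentment",
            "disgust", "excitement", "fear", "sadness"]

def dominant_emotion(votes):
    """Single-label target (DEC): the most frequently chosen category.
    Ties are broken by first occurrence in `votes`."""
    return Counter(votes).most_common(1)[0][0]

def emotion_distribution(votes, categories=EMOTIONS):
    """Distribution target: share of votes per category (shares sum to 1)."""
    counts = Counter(votes)
    total = len(votes)
    return {c: counts[c] / total for c in categories}

def multi_label(votes, min_share=0.25, categories=EMOTIONS):
    """Multi-label target: every category whose vote share reaches
    `min_share` (a hypothetical threshold chosen for illustration)."""
    dist = emotion_distribution(votes, categories)
    return [c for c in categories if dist[c] >= min_share]

# Hypothetical votes from eight viewers for one image.
votes = ["awe", "awe", "contentment", "awe",
         "fear", "contentment", "awe", "contentment"]

print(dominant_emotion(votes))
print(emotion_distribution(votes))
print(multi_label(votes))
```

The single-label target discards the minority votes, whereas the distribution keeps the relative weight of every category. This is the information that distribution learning methods try to predict.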

In this article, we concentrate on surveying the existing ACM methods and analyzing potential research trends. Section 2 introduces the widely-used emotion representation models from psychology. Section 3 summarizes the available datasets for evaluating ACM tasks. Sections 4, 5, 6, and 7 survey the representative methods on AC of images, music, videos, and multimodal data, respectively, including both handcrafted features-based methods and deep learning methods. Section 8 provides some suggestions for future research, followed by the conclusion in Section 9.

To the best of our knowledge, this article is among the first that provide a comprehensive survey of affective computing of multimedia data from different modalities. Previous surveys mainly focus on a single modality, such as images (zhao2018affective; joshi:emotion-survey), speech (el2011survey), music (Kim2010; yang2012machine), video (wang2015video; Baveye2018), and multimodal data (soleymani2017survey). From this survey, readers can more easily compare the correlations and differences among different AC settings. We believe that this will be instrumental in generating novel research ideas.

2. Emotion Models from Psychology

There are two dominant emotion representation models deployed by psychologists: categorical emotions (CE) and dimensional emotion space (DES). CE models classify emotions into a few basic categories, such as happiness and anger. Some commonly used models include Ekman’s six basic emotions (ekman1992argument) and Mikels’s eight emotions (mikels2005emotional). When emotions are classified into positive and negative (polarity) (zhao2018personality; zhao2019personalized), sometimes including neutral, “emotion” is called “sentiment”. However, sentiment is usually defined as an attitude held toward an object (soleymani2017survey). DES models usually represent emotions as continuous coordinate points in a 3D or 2D Cartesian space, such as valence-arousal-dominance (VAD) (schlosberg1954three) and activity-temperature-weight (lee2011fuzzy). VAD is the most widely used DES model, where valence represents the pleasantness ranging from positive to negative, arousal represents the intensity of emotion ranging from excited to calm, and dominance represents the degree of control ranging from controlled to in control. Dominance is difficult to measure and is often omitted, leading to the commonly used two-dimensional VA space (hanjalic2006extracting).

The relationship between CE and DES and the transformation from one to the other are studied in (sun2009improved). For example, positive valence relates to a happy state, while negative valence relates to a sad or angry state. CE models are easier for users to understand and label, but the limited set of categories may not reflect the subtlety and complexity of emotions well. DES can flexibly describe detailed emotions with subtle differences, but it is difficult for users to distinguish absolute continuous values, which may also be problematic. CE and DES are mainly employed in classification and regression tasks, respectively, with discrete and continuous emotion labels. If we discretize DES into several constant values, we can also use it for classification (lee2011fuzzy). Ranking-based labeling can be applied to ease the difficulty raters have in comprehending DES.
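As an illustration of discretizing a dimensional space for classification, the sketch below maps a valence-arousal rating to one of four quadrant classes. The 9-point scale with midpoint 5, the class names, and the rule that a rating exactly at the midpoint counts as high are all assumptions made for this example. The cited works may discretize differently.

```python
def va_quadrant(valence, arousal, midpoint=5.0):
    """Map a (valence, arousal) rating to one of four coarse classes.

    Assumes both ratings share one scale whose neutral point is `midpoint`
    (5.0 for a 1-9 scale). Ratings at the midpoint are treated as "high".
    """
    high_valence = valence >= midpoint
    high_arousal = arousal >= midpoint
    if high_valence and high_arousal:
        return "positive-excited"
    if high_valence:
        return "positive-calm"
    if high_arousal:
        return "negative-excited"
    return "negative-calm"

# Hypothetical ratings on a 1-9 scale.
for rating in [(7.5, 8.0), (7.0, 2.0), (2.0, 8.0), (2.0, 2.0)]:
    print(rating, "->", va_quadrant(*rating))
```

Such a mapping trades the fine granularity of the continuous space for labels that a standard classifier can use directly.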

Although less explored in this context, one of the most well-known theories that explains the development of emotional experience is appraisal theory. According to this theory, cognitive evaluation or appraisal of a situation (or of content, in the case of multimedia) results in the emergence of emotions (ortony88emotion; citeulike:3014462). According to Ortony, Clore and Collins (OCC) (ortony88emotion), emotions are experienced following a scenario comprising a series of phases. First, there is a perception of an event, object, or action. Then, there is an evaluation of the event, object, or action according to personal wishes or norms. Finally, perception and evaluation result in a specific emotion or emotions arising. Certain appraisal dimensions such as novelty and complexity can be labeled and detected from content. For example, Soleymani2015 automatically recognized image novelty and complexity, which are related to interestingness. There are also domain-specific emotion taxonomies and scales. The Geneva Emotional Music Scale (Zentner2008) is a music-specific emotion model for describing emotions induced by music. It consists of a hierarchical structure with 45 emotions, nine emotional categories, and three superfactors that can describe emotion in music.

Model | Ref | Type | Emotion states/dimensions
Ekman | (ekman1992argument) | CE | happiness, sadness, anger, disgust, fear, surprise
Mikels | (mikels2005emotional) | CE | amusement, anger, awe, contentment, disgust, excitement, fear, sadness
Plutchik | (plutchik1980emotion) | CE | anger, anticipation, disgust, joy, sadness, surprise, fear, trust (each with 3 scales)
Clusters | (hu2007exploring) | CE | 29 discrete labels are consistently grouped into 5 clusters at a similar distance level
Sentiment | - | CE | positive, negative, (or neutral)
VA(D) | (schlosberg1954three) | DES | valence-arousal(-dominance)
ATW | (lee2011fuzzy) | DES | activity-temperature-weight
Table 1. Representative emotion models employed in ACM.

Another relevant concept worth mentioning is that emotion in response to multimedia can be expected, induced, or perceived. Expected emotion is the emotion that the multimedia creator intends to make people feel, perceived emotion is what people perceive as being expressed, while induced/felt emotion is the actual emotion that is felt by a viewer. Discussing the differences or correlations among various emotion models is outside the scope of this article. The typical emotion models that have been widely used in ACM are listed in Table 1.

3. Datasets

3.1. Datasets for AC of Images

The early datasets for AC of images mainly come from the psychology community and are small in scale. The International Affective Picture System (IAPS) is an image set that is widely used in psychology to evoke emotions (lang1997international). Each image, which depicts a complex scene, is associated with the mean and standard deviation (STD) of VAD ratings on a 9-point scale given by about 100 college students. The IAPSa dataset is a subset of 246 images selected from IAPS (mikels2005emotional), labeled by 20 undergraduate students. The Abstract dataset consists of 279 abstract paintings without contextual content. Approximately 230 people peer rated these paintings. The Artistic dataset (ArtPhoto) includes 806 artistic photographs from a photo sharing site (machajdik2010affective), with emotions determined by the artist who uploaded each photo. The Geneva affective picture database (GAPED) is composed of 520 negative, 121 positive, and 89 neutral images (dan2011geneva). These images are also rated with valence and arousal values, ranging from 0 to 100 points. The MART and devArt datasets each contain 500 abstract paintings, which are collected from the Museum of Modern and Contemporary Art of Trento and Rovereto (alameda2016recognizing) and the “DeviantArt” online social network (alameda2016recognizing), respectively.

Dataset | Ref | # Images | Type | # Annotators | Emotion model | Labeling | Labels
IAPS | (lang1997international) | 1,182 | natural | 100 (half f) | VAD | annotation | average
IAPSa | (mikels2005emotional) | 246 | natural | 20 (10f,10m) | Mikels | annotation | dominant
Abstract | (machajdik2010affective) | 279 | abstract | 230 | Mikels | annotation | dominant
ArtPhoto | (machajdik2010affective) | 806 | artistic | - | Mikels | keyword | dominant
GAPED | (dan2011geneva) | 730 | natural | 60 | Sentiment, VA | annotation | dominant, average
MART | (alameda2016recognizing) | 500 | abstract | 25 (11f,14m) | Sentiment | annotation | dominant
devArt | (alameda2016recognizing) | 500 | abstract | 60 (27f,33m) | Sentiment | annotation | dominant
Tweet | (borth2013large) | 603 | social | 9 | Sentiment | annotation | dominant
FlickrCC | (borth2013large) | 500,000 | social | - | Plutchik | keyword | dominant
Flickr | (yang2014your) | 301,903 | social | 6,735 | Ekman | keyword | dominant
Emotion6 | (peng2015mixed) | 1,980 | social | 432 | Ekman+neutral | annotation | distribution
FI | (you2016building) | 23,308 | social | 225 | Mikels | annotation | dominant
IESN | (zhao2016predicting) | 1,012,901 | social | 118,035 | Mikels, VAD | keyword | personalized
FlickrLDL | (yang2017learning) | 10,700 | social | 11 | Mikels | annotation | distribution
TwitterLDL | (yang2017learning) | 10,045 | social | 8 | Mikels | annotation | distribution
Table 2. Released and freely available datasets for AC of images, where ‘Ref’ is short for Reference, ‘# Images’ and ‘# Annotators’ respectively represent the total number of images and annotators (f: female, m: male), ‘Labeling’ represents the method to obtain labels, such as human annotation (annotation) and keyword searching (keyword), and ‘Labels’ means the detailed labels in the dataset, such as dominant emotion category (dominant), average dimension values (average), personalized emotion (personalized), and emotion distribution (distribution). A dash (-) marks a value that is not reported.

Recent datasets, especially the large-scale ones, are constructed using images from social networks. The Tweet dataset (Tweet) consists of 470 and 133 tweets for positive and negative sentiments, respectively (borth2013large). The FlickrCC dataset includes about 500k Flickr Creative Commons (CC) images which are generated based on 1,553 adjective noun pairs (ANPs) (borth2013large). The images are mapped to Plutchik’s Wheel of Emotions with 8 basic emotions, each with 3 scales. The Flickr dataset contains about 300k images (yang2014your), where the emotion category of an image is defined by the synonym word list that shares the most words with the adjectives in the image’s tags and comments. The FI dataset consists of 23,308 images which are collected from Flickr and Instagram by searching the emotion keywords (you2016building) and labeled by 225 Amazon Mechanical Turk (MTurk) workers. The number of images in each Mikels emotion category is larger than 1,000. The Emotion6 dataset (peng2015mixed) consists of 1,980 images collected from Flickr, with 330 images for each Ekman’s emotion category. Each image was scored by 15 MTurk workers to obtain the discrete emotion distribution information. The IESN dataset, which is constructed for personalized emotion prediction (zhao2016predicting), contains about 1M images from Flickr. Lexicon-based methods and VAD averaging (warriner2013norms) are applied to the text of uploaders’ metadata to obtain expected emotions and to viewers’ comments to obtain personalized emotions. There are 7,723 active users, each involved in more than 50 images. The DEC and emotion distribution for each image can also be easily obtained. The FlickrLDL and TwitterLDL datasets (yang2017learning) are constructed for discrete emotion distribution learning. The former is a subset of FlickrCC, which is labeled by 11 viewers. The latter consists of 10,045 images which are collected by searching various sentiment keywords on Twitter and labeled by 8 viewers. These datasets are summarized in Table 2.
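The idea of lexicon-based VAD averaging can be sketched as follows: each word of a comment is looked up in an affective lexicon, and the VAD values of the matched words are averaged. This is only a minimal illustration under stated assumptions. The lexicon below is a hypothetical three-word toy with placeholder values (not taken from the published norms), and the tokenization is deliberately naive. The actual IESN pipeline is described in (zhao2016predicting).

```python
import re

# Hypothetical mini-lexicon: word -> (valence, arousal, dominance), 1-9 scale.
# The numbers are illustrative placeholders, NOT values from published norms.
TOY_VAD_LEXICON = {
    "beautiful": (8.0, 5.0, 6.0),
    "calm": (7.0, 2.0, 6.0),
    "scary": (2.0, 7.0, 3.0),
}

def average_vad(text, lexicon=TOY_VAD_LEXICON):
    """Average the VAD values of all lexicon words found in `text`.

    Returns a (valence, arousal, dominance) tuple, or None when no word
    of the text appears in the lexicon.
    """
    words = re.findall(r"[a-z']+", text.lower())  # naive tokenization
    hits = [lexicon[w] for w in words if w in lexicon]
    if not hits:
        return None
    n = len(hits)
    return tuple(sum(h[i] for h in hits) / n for i in range(3))

print(average_vad("What a beautiful, calm lake!"))
print(average_vad("Taken with my old camera."))
```

Applying the same function to the comments of a single viewer would give a rough personalized VAD estimate for that viewer. Applying it to the uploader's title, tags, and description would give an estimate of the expected emotion.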

3.2. Datasets for AC of Music

A notable benchmark for music emotion recognition is the music mood classification (AMC) task, organized by the annual Music Information Retrieval Evaluation eXchange (MIREX) (mirex07). In the MIREX mood classification task, 600 songs were initially shared with the participants. Starting from 2013, 1,438 30-second excerpts from Korean pop songs have been added to MIREX. The MIREX benchmark aims to automatically classify songs into five emotion clusters derived from a cluster analysis of online tags. The emotional representation of the MIREX mood challenge has been debated in the literature because it originates from data rather than from the psychology of emotion. For example, in (laurier07mirex), semantic and acoustic overlaps were found between clusters. The MIREX mood challenge also considers only one label for the whole song and disregards the dynamic, time-evolving nature of music.

Computer Audition Lab 500 (CAL500) is a dataset of 500 popular songs which is labeled with multiple tags, including emotions (turnbull2007towards). The dataset was labeled in the lab by 66 labelers. The Soundtracks dataset for music and emotion (Eerola2011) contains instrumental music from the soundtracks of 60 movies. The expert annotators selected songs based on five basic discrete categories (anger, fear, sadness, happiness, and tenderness) and the dimensional VA representation of emotions. Although not developed with music content analysis in mind, the Database for Emotion Analysis using Physiological Signals (DEAP) (koelstra2012tac) also includes valence, arousal, and dominance ratings for 120 one-minute music video clips of western pop music. Each video clip is annotated by 14–16 participants who were asked to report their felt valence, arousal, and dominance on a 9-point scale. AMG1608 (chen15icassp) is another music dataset that contains arousal and valence ratings for 1,608 Western songs in different genres and is annotated through MTurk.

Music datasets with emotion labels usually consider one emotion label per song (static). The MoodSwings dataset (speck11ismir) was the first to annotate music dynamically over time. MoodSwings was developed by Schmidt et al. and includes 240 15-second excerpts of western pop songs with per-second valence and arousal labels, collected on MTurk. The MediaEval “Emotion in Music” challenge was organized from 2013 to 2015 within the MediaEval Multimedia Evaluation initiative (mediaeval), a community-driven benchmarking campaign dedicated to evaluating algorithms for social and human-centered multimedia access and retrieval. Unlike MIREX, the “Emotion in Music” task focused on dynamic emotion recognition in music, tracking arousal and valence over time (Soleymani1000songs; deam). The data from the MediaEval tasks were compiled in the MediaEval Database for Emotional Analysis in Music (DEAM), which is the largest available dataset with dynamic annotations, at 2Hz, with valence and arousal annotations for 1,802 songs and song excerpts licensed under Creative Commons licenses. PMEmo is a dataset of 794 songs with dynamic and static arousal and valence annotations in addition to electrodermal responses from ten participants (Zhang2018PMEmo).
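To illustrate how dynamic annotations relate to static ones, the sketch below aggregates per-annotator time series into a per-frame mean and standard deviation, and then collapses the mean curve into a single song-level value. It assumes that all annotators are sampled on the same time grid (e.g., one value every 0.5 s at 2Hz) and that a static label can be approximated by averaging over time. Both are simplifications for illustration, the example values are hypothetical, and this is not the procedure of any specific dataset.

```python
from statistics import mean, pstdev

def aggregate_dynamic(ratings):
    """Aggregate per-annotator sequences frame by frame.

    `ratings` is a list of equally long sequences, one per annotator.
    Returns (per-frame mean, per-frame population standard deviation).
    Note: zip() silently truncates to the shortest sequence.
    """
    frames = list(zip(*ratings))  # regroup values by time frame
    frame_mean = [mean(f) for f in frames]
    frame_std = [pstdev(f) for f in frames]
    return frame_mean, frame_std

def static_label(frame_mean):
    """Collapse a dynamic curve into one song-level value (simple average)."""
    return mean(frame_mean)

# Hypothetical arousal curves from two annotators, four frames each.
arousal = [
    [0.0, 0.2, 0.4, 0.6],
    [0.2, 0.2, 0.6, 0.8],
]
curve, spread = aggregate_dynamic(arousal)
print(curve, spread, static_label(curve))
```

The per-frame standard deviation gives a simple indication of how much annotators disagree at each moment, which a single static label cannot express.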

These datasets are summarized in Table 3. For a more detailed review of available music datasets with emotional labels, we refer the readers to (panda_thesis).

Dataset | Ref | # Songs | Type | # Annotators | Emotion model | Labeling | Labels
MIREX mood | (mirex07) | 2,038 | western and kpop | 2–3 | Clusters | annotation | dominant, distribution
CAL500 | (turnbull2007towards) | 500 | western | >2 | - | annotation | dominant
Soundtracks | (Eerola2011) | 110 | instrumental | 110 | self-defined, VA | annotation | distribution
MoodSwings | (speck11ismir) | 240 | western | Unknown | VA | annotation | distribution
DEAP | (koelstra2012tac) | 120 | western | 14–16 | VAD | annotation | average
AMG1608 | (chen15icassp) | 1,608 | western | 15 | VA | annotation | distribution
DEAM | (deam) | 1,802 | diverse | 5–10 | VA | annotation | average, distribution
PMEmo | (Zhang2018PMEmo) | 794 | western | 10 | VA | annotation | distribution
Table 3. Released and freely available datasets for music emotion recognition, where ‘# Songs’ and ‘# Annotators’ respectively represent the total number of songs and annotators per song, ‘Labeling’ represents the method to obtain labels, such as human annotation (annotation), and ‘Labels’ means the detailed labels in the dataset, such as dominant emotion category (dominant), average dimension values (average), personalized emotion (personalized), and emotion distribution (distribution). A dash (-) marks a value that is not reported.

3.3. Datasets for AC of Videos

The target of video affective content computing is to recognize the emotions evoked by videos. In this field, it is necessary to construct a large benchmark dataset with precise emotional tags. However, most existing studies evaluate their proposed methods on their own collected datasets. The scarce video resources in those self-collected datasets, combined with copyright restrictions, make it difficult for other researchers to reproduce existing work. Therefore, it is beneficial to summarize some publicly available datasets in this field. In general, publicly available datasets can be classified into two types: datasets consisting of videos only, such as movie clips or user-generated videos, and datasets including both videos and the audience’s information.

3.3.1. Datasets consisting of videos only

The LIRIS-ACCEDE dataset (dellandrea2019datasets) is one of the largest datasets in this area. Because it is collected under Creative Commons licenses, there are no copyright issues. The LIRIS-ACCEDE dataset is a living database in development. In order to fulfill requirements of different tasks, new data, features and tags are included. The LIRIS-ACCEDE dataset includes the Discrete LIRIS-ACCEDE collection and the Continuous LIRIS-ACCEDE collection in 2015 and was used for the MediaEval Emotional Impact of Movies tasks from 2015 to 2018.

The Discrete LIRIS-ACCEDE collection (baveye2015liris) includes 9,800 clips, which are derived from 40 feature films and 120 short films. Specifically, the majority of the 160 films are collected from the video platform VODO. The duration of all 160 films is about 73.6 hours in total. The 9,800 video clips last about 27 hours in total, and the duration of each clip is between 8 and 12 seconds, which is long enough for viewers to feel emotions. In this collection, all the 9,800 video clips are labeled with valence and arousal values.

The Continuous LIRIS-ACCEDE collection (baveye2015deep) differs from the Discrete LIRIS-ACCEDE collection in annotation type. Roughly speaking, the annotations for movie clips in the Discrete LIRIS-ACCEDE collection are global, meaning that a whole 8 to 12 second video clip is represented by a single value of valence and arousal. This annotation type limits the possibility of tracking emotions over time. To address this issue, 30 longer films are selected from the 160 films mentioned above. The total duration of the selected films is about 7.4 hours. Valence and arousal annotations are provided for each second of the films in the collection.

The MediaEval Affective Impact of Movies collections between 2015 and 2018 are used for the MediaEval Affective Impact of Movies tasks in each corresponding year. Specifically, the MediaEval 2015 Affective Impact of Movies task (sjoberg2015mediaeval) includes two sub-tasks: affect detection and violence detection. The Discrete LIRIS-ACCEDE collection was used as the development set, and 1,100 additional video clips were extracted from 39 new movies. All the newly collected data were shared under Creative Commons licenses. In addition, three values were used to label the 10,900 video clips: a binary signal representing the presence of violence, a class tag of the excerpt for felt arousal, and an annotation for felt valence.

The MediaEval 2016 Affective Impact of Movies Task (dellandrea2016mediaeval) also includes two sub-tasks: global emotion prediction and continuous emotion prediction. The Discrete LIRIS-ACCEDE collection and the Continuous LIRIS-ACCEDE collection were used as the development sets for the first and second sub-tasks, respectively. In addition, 49 new movies were chosen as the test sets: 1,200 short video clips were extracted from the new movies for the first sub-task, and 10 long movies were selected for the second sub-task. For the first sub-task, the tags include valence and arousal scores for each whole movie clip. For the second sub-task, valence and arousal scores are evaluated for each second of the movies.

The MediaEval 2017 Affective Impact of Movies Task (DBLP:conf/mediaeval/DellandreaH0BS17) focuses on long movies for two sub-tasks: valence/arousal prediction and fear prediction. The Continuous LIRIS-ACCEDE collection was selected as the development set, and an additional 14 new movies were collected as the test set. The annotations contain a valence value and an arousal value. In addition, for each 10-second segment, there is a binary value representing whether the segment is supposed to induce fear or not.

The MediaEval 2018 Affective Impact of Movies task (DBLP:conf/mediaeval/DellandreaH0BXS18) is also dedicated to valence/arousal prediction and fear prediction. The Continuous LIRIS-ACCEDE collection and the test set of the MediaEval 2017 Emotional Impact of Movies task were used as the development set. In addition, 12 other movies selected from the set of 160 movies mentioned in the Discrete LIRIS-ACCEDE part were used as the test set. Specifically, for the first sub-task, the annotations contain valence and arousal values for each second of the movies. For the second sub-task, the beginning and ending times of each sequence that induces fear are recorded.

The VideoEmotion dataset (jiang2014predicting) is a well-designed user-generated video collection. It contains 1,101 videos downloaded from web platforms, such as YouTube and Flickr. The annotations of the videos in this dataset are based on Plutchik’s wheel of emotions (plutchik1980emotion).

Both the YF-E6 Dataset and the VideoStory-P14 Dataset are introduced in (xu2016heterogeneous). To collect the YF-E6 emotion dataset, six basic emotion types were used as keywords to search videos on YouTube and Flickr, yielding 3,000 videos in total. Ten annotators then performed the labeling tasks. A video clip was added to the dataset only when its tags were more than 50 percent consistent. The final dataset includes 1,637 videos labeled with six basic emotion types. The VideoStory-P14 Dataset is based on the VideoStory dataset. Similar to the VideoEmotion Dataset, the keywords in Plutchik’s Wheel of Emotions were used in the search process during construction. In total, there are 626 videos in the VideoStory-P14 dataset, each with a unique emotion tag.
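A consistency filter of this kind can be sketched as follows. We interpret “more than 50 percent consistent” as “the most frequent tag is given by more than half of the annotators”. This reading is our assumption, as are the function names and the example data. The exact criterion is defined in (xu2016heterogeneous).

```python
from collections import Counter

def consensus_label(tags, threshold=0.5):
    """Return the most frequent tag if its share of all tags exceeds
    `threshold`, otherwise None (no sufficient agreement)."""
    label, count = Counter(tags).most_common(1)[0]
    return label if count / len(tags) > threshold else None

def filter_clips(annotations, threshold=0.5):
    """Keep only clips with a consensus label.

    `annotations` maps a clip id to the list of tags given by annotators.
    Returns a dict mapping each kept clip id to its consensus label.
    """
    kept = {}
    for clip_id, tags in annotations.items():
        label = consensus_label(tags, threshold)
        if label is not None:
            kept[clip_id] = label
    return kept

# Hypothetical tags from ten annotators for two clips.
annotations = {
    "clip_a": ["joy"] * 7 + ["fear"] * 3,       # 70% agreement, kept
    "clip_b": ["anger"] * 5 + ["sadness"] * 5,  # 50% agreement, dropped
}
print(filter_clips(annotations))
```

Filtering in this way improves label reliability at the cost of discarding ambiguous clips, which are exactly the cases where viewers disagree.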

3.3.2. Datasets including both videos and audience’s reactions

The DEAP dataset (koelstra2012tac) includes the EEG and peripheral physiological signals that are collected from 32 participants while watching 40 one-minute excerpts of music videos. In addition, frontal face videos were recorded for 22 of the 32 participants. Annotators labeled each video according to the level of like/dislike, familiarity, arousal, valence, and dominance. Though the DEAP dataset is publicly available, it should be noted that it does not include the actual videos because of licensing issues, but links to the videos are provided.

The MAHNOB-HCI dataset (soleymani2012multimodal) is a multimodal dataset including multi-class information recorded in response to affective video stimuli. In particular, speech, face videos, and eye gaze are recorded. In addition, two experiments were conducted to record both peripheral and central nervous system physiological signals from 27 subjects. In the first experiment, subjects were asked to report their emotional responses to 20 emotion-inducing videos, including the level of arousal, valence, dominance, and predictability, as well as emotion categories. In the second experiment, the participants evaluated whether they agreed with the displayed labels after watching short videos and images. The dataset is available for academic use through a web interface.

Dataset | Ref | # Clips | Hours | Type | # Annotators | Emotion model | Labeling | Labels
Discrete LIRIS-ACCEDE | (baveye2015liris) | 9,800 | 26.9 | film | - | VA | annotation | dominant
Continuous LIRIS-ACCEDE | (baveye2015deep) | 30 | 7.4 | film | 10 (7f,3m) | VA | annotation | average
MediaEval 2015 | (sjoberg2015mediaeval) | 1,100 | - | film | - | 3 discrete VA values | annotation | dominant
MediaEval 2016 | (dellandrea2016mediaeval) | 1,210 | - | film | - | VA | annotation | distribution, average
MediaEval 2017 | (DBLP:conf/mediaeval/DellandreaH0BS17) | 14 | 8 | film | - | VA, fear | annotation | average
MediaEval 2018 | (DBLP:conf/mediaeval/DellandreaH0BXS18) | 12 | 9 | film | - | VA, fear | annotation | average
VideoEmotion | (jiang2014predicting) | 1,101 | 32.7 | user-generated | 10 (5f,5m) | Plutchik | annotation | dominant
YF-E6 | (xu2016heterogeneous) | 1,637 | 50.9 | user-generated | 10 (5f,5m) | Ekman | annotation | dominant
VideoStory-P14 | (xu2016heterogeneous) | 626 | - | user-generated | - | Plutchik | keyword | dominant
DEAP | (koelstra2012tac) | 120 | 2 | music video | - | VAD | annotation | personalized
MAHNOB-HCI | (soleymani2012multimodal) | - | - | multiple types | - | VAD, Ekman+neutral | annotation | personalized
DECAF | (abadi2015decaf) | 76 | - | music video/movies | - | VAD | annotation | personalized
AMIGOS | (DBLP:journals/corr/CorreaASP17) | 20 | - | movies collection | - | VAD, Ekman | annotation | personalized
ASCERTAIN | (subramanian2018ascertain) | 36 | - | movies collection | 58 (21f,37m) | VA | annotation | personalized
Table 4. Released and freely available datasets for video emotion recognition, where ‘# Clips’ and ‘Hours’ respectively represent the total number and hours of video clips, ‘Type’ means the genre of the videos in the dataset, ‘Emotion model’ represents the labeling type, ‘Labeling’ represents the method to obtain labels, such as human annotation (annotation) and keyword searching (keyword), and ‘Labels’ means the detailed labels in the dataset, such as dominant emotion category (dominant), average dimension values (average), personalized emotion (personalized), and emotion distribution (distribution).

The DECAF dataset (abadi2015decaf) consists of Infra-red facial video signals, Electrocardiogram (ECG), Magnetoencephalogram (MEG), horizontal Electrooculogram (hEOG) and Trapezius Electromyogram (tEMG), recorded from 30 participants watching 36 movie clips and 40 one-minute music videos, which are derived from the DEAP dataset (koelstra2012tac). The subjective feedback is based on valence, arousal, and dominance space. In addition, time-continuous emotion annotations for movie clips are also included in the dataset.

The AMIGOS dataset (DBLP:journals/corr/CorreaASP17) includes multi-class affective data, namely the responses of individuals and groups of viewers to both short and long videos. EEG, ECG, GSR, frontal video, and full-body video were recorded in two experimental settings, i.e., 40 participants watching 16 short emotional clips and 4 long clips. The duration of each selected short video is between 51 and 150 seconds, and the duration of each long excerpt is about 20 minutes. Finally, participants annotated the affective level of valence, arousal, control, familiarity, liking, and basic emotions.

The ASCERTAIN dataset (subramanian2018ascertain) includes big-five personality scales and affective self-ratings of 58 users together with their EEG, ECG, GSR, and facial activity data. A total of 36 videos are used as stimuli, and the length of each video clip is between 51 and 128 seconds. It is the first physiological dataset that is useful for both affective and personality recognition.

The publicly available datasets for video affective content analysis are summarized in Table 4.

3.4. Datasets for AC of Multimodal Data

In addition to audiovisual content and viewers’ reactions, other modalities, such as language, also contain significant information for affective understanding of multimedia content.

Visual sentiment is the sentiment associated with the concepts depicted in images. Two datasets were developed through mining images associated with adjective-noun pair (ANP) representations that have affective significance (borth2013sentibank). ANPs in (borth2013sentibank) were generated by first using seed terms from Plutchik’s Wheel of Emotion (plutchik1980emotion) to query Flickr and YouTube. After mining the tags associated with visual content on YouTube and Flickr, adjective and noun candidates were identified through part-of-speech tagging. Adjectives and nouns were then paired to create ANP candidates, which were filtered by sentiment strength, named entities, and popularity. The Visual Sentiment Ontology (VSO) (borth2013sentibank) is the result of this process. SentiBank also resulted in the creation of a photo-tweet sentiment dataset, containing both visual and textual data with polarity labels collected on Amazon Mechanical Turk. This work was later extended to form a multilingual ANP set and its dataset (Jou:MVSO; dalmia2016columbia), containing 15,630 ANPs from 12 major languages and 7.37M images (senticart:icmr16). The My Reaction When (MRW) dataset contains 50,107 video-sentence pairs crawled from social media, depicting physical or emotional reactions to the situations described in sentences (Song_2019_CVPR). The GIFs are sourced from Giphy. Even though there are no emotion labels, the language and visual associations are mainly based on sentiment, which makes this dataset an interesting resource for affective content analysis.

CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) is a collection of multiple datasets for multimodal sentiment analysis and emotion recognition. This collection includes more than 23,500 sentence utterance videos from more than 1,000 people from YouTube (zadeh2018multi). All the videos are transcribed and aligned with audiovisual modalities. A multimodal multi-party dataset for emotion recognition in conversation (MELD) was primarily developed for emotion recognition in multiparty interactions (poria-etal-2019-meld). MELD contains visual, audio, and textual modalities and includes 13,000 utterances from 1,433 dialogues from the TV series Friends, with each utterance labeled with emotion and sentiment.

Dataset Ref #Samples Modalities Type Emotion model Labeling Labels
SentiBank tweet (borth2013sentibank) 603 images, text images Sentiment annotation dominant
MVSO (Jou:MVSO) 7.36M image, metadata photos Sentiment automatic average
CMU-MOSEI (zadeh2018multi) 23,500 video, audio, text YouTube videos Sentiment annotation average
MELD (poria-etal-2019-meld) 13,000 video, audio, text TV series Sentiment, Disc. annotation dominant
COGNIMUSE (Zlatintsi2017) 3.5h video, audio, text movies VA annotation, self-report dominant
VR (Li_VR) 73 video, audio VR videos VA self-report average
Table 5. Released and freely available datasets for multimodal emotion recognition. Disc. for MELD corresponds to six Ekman emotions in addition to neutral. ‘Labeling’ represents the method to obtain labels, such as human annotation (annotation), self-reported felt emotion (self-report), and keyword searching (keyword), ‘Labels’ means the detailed labels in the dataset, such as dominant emotion category (dominant), average dimension values (average), personalized emotion (personalized), and emotion distribution (distribution).

COGNIMUSE is a collection of videos annotated with sensory and semantic saliency, events, cross-media semantics, and emotions (Zlatintsi2017). A 3.5-hour subset extracted from movies, including the textual modality, is annotated on arousal and valence. Li_VR (Li_VR) collected a dataset of 360-degree virtual reality videos that can elicit different emotions. Even though the dataset consists of only 73 short videos, 183 s long on average, it is one of the first datasets for this kind of content, whose understanding remains limited. These multimodal datasets are summarized in Table 5.

4. Affective Computing of Images

In the early stages, AC researchers mainly worked on designing handcrafted features to bridge the affective gap. Recently, with the advent of deep learning, especially convolutional neural networks (CNNs), methods have shifted to end-to-end deep representation learning. Motivated by the fact that the perception of image emotions may depend on different types of features (zhao2014affective), some methods employ fusion strategies to jointly consider multiple features. In this section, we summarize and compare these methods. Please note that here we classify CNN features extracted directly from pre-trained deep models into the handcrafted features category.

4.1. Handcrafted Features-Based Methods for AC of Images

Low-level Features are difficult to be understood by viewers. These features are often directly derived from other computer vision tasks. Some widely extracted features include GIST, HOG2x2, self-similarity, and geometric context color histogram features as in (patterson2012sun), because of their individual power and distinct description of visual phenomena from a scene perspective.

Compared with the above generic features, some specific features derived from art theory and psychology have been designed. For example, machajdik2010affective (machajdik2010affective) extracted elements-of-art features, including color and texture. The MPEG-7 visual descriptors are employed in (lee2011fuzzy), which include four color-related descriptors and two texture-related descriptors. How shape features in natural images influence emotions is investigated in (lu2012shape) by modeling the concepts of roundness-angularity and simplicity-complexity. sartori2015s (sartori2015s) designed two kinds of visual features to represent different color combinations based on Itten’s color wheel.

Mid-level Features contain more semantics, are more easily interpreted by viewers than low-level features, and thus are more relevant to emotions. patterson2012sun (patterson2012sun) proposed to detect 102 attributes in 5 different categories, including materials, surface properties, functions or affordances, spatial envelope attributes, and object presence. Besides these attributes, eigenfaces that may contribute to facial images are also incorporated in (yuan2013sentribute). More recently, in (rao2016multi), SIFT features are first extracted as basic features, which are then encoded with bag-of-visual-words (BoVW) to represent the multi-scale blocks. Another mid-level representation is the latent topic distribution estimated by probabilistic latent semantic analysis.
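As a minimal sketch of the bag-of-visual-words encoding mentioned above, assuming local descriptors (e.g., SIFT) and a codebook have already been computed, each descriptor can be assigned to its nearest codeword and the assignments accumulated into a normalized histogram. The function name `bovw_histogram`, the toy 2-D descriptors, and the three-word codebook are illustrative assumptions and are not taken from (rao2016multi).

```python
from math import dist

def bovw_histogram(descriptors, codebook):
    """Return an L1-normalized histogram of nearest-codeword assignments."""
    counts = [0] * len(codebook)
    for d in descriptors:
        # Index of the codeword closest to this descriptor (Euclidean distance).
        nearest = min(range(len(codebook)), key=lambda k: dist(d, codebook[k]))
        counts[nearest] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(codebook)

# Toy data (hypothetical): real SIFT descriptors are 128-dimensional.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
descriptors = [(0.1, 0.0), (0.9, 0.1), (0.8, -0.1), (0.0, 1.2)]
histogram = bovw_histogram(descriptors, codebook)
```

In practice the codebook would typically be learned by clustering descriptors from a training set, and one histogram would be computed per image block.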

Harmonious composition is essential in an artwork. Several compositional features, such as low depth of field, are designed to analyze such characteristics of an image (machajdik2010affective). Based on the fact that figure-ground relationships, color patterns, shapes, and their diverse combinations are often jointly employed by artists to express emotions in their artworks, wang2013interpretable (wang2013interpretable) proposed to extract interpretable aesthetic features. Inspired by principles-of-art, zhao2014exploring (zhao2014exploring) designed corresponding mid-level features, including balance, emphasis, harmony, variety, gradation, and movement. For example, Itten’s color contrasts and the rate of focused attention are employed to measure emphasis.

Feature Ref Level Short description # Feat
LOW_C (patterson2012sun) low GIST, HOG2x2, self-similarity and geometric context color histogram features 17,032
Elements (machajdik2010affective) low color: mean saturation, brightness and hue, emotional coordinates, colorfulness, color names, Itten contrast, Wang’s semantic descriptions of colors, area statistics; texture: Tamura, Wavelet and gray-level co-occurrence matrix 97
MPEG-7 (lee2011fuzzy) low color: layout, structure, scalable color, dominant color; texture: edge histogram, texture browsing 200
Shape (lu2012shape) low line segments, continuous lines, angles, curves 219
IttenColor (sartori2015s) low color co-occurrence features and patch-based color-combination features 16,485
Attributes (patterson2012sun) mid scene attributes 102
Sentributes (yuan2013sentribute) mid scene attributes, eigenfaces 109
Composition (machajdik2010affective) mid level of detail, low depth of field, dynamics, rule of thirds 45
Aesthetics (wang2013interpretable) mid figure-ground relationship, color pattern, shape, composition 13
Principles (zhao2014exploring) mid principles-of-art: balance, contrast, harmony, variety, gradation, movement 165
BoVW (rao2016multi) mid bag-of-visual-words on SIFT, latent topics 330
FS (machajdik2010affective) high number of faces and skin pixels, size of the biggest face, amount of skin w.r.t. the size of faces 4
ANP (borth2013large) high semantic concepts based on adjective noun pairs 1,200
Expressions (yang2010exploring) high automatically assessed facial expressions (anger, contempt, disgust, fear, happiness, sadness, surprise, neutral) 8
Table 6. Summary of the hand-crafted features at different levels for AC of images. ‘# Feat’ indicates the dimension of each feature.

High-level Features that represent the semantic content contained in images can be easily understood by viewers. We can also readily recognize the emotions conveyed in images through these semantics. In the early years, simple semantic content, including the faces and skin contained in images, was extracted in (machajdik2010affective). For images that contain faces, facial expressions may directly determine the emotions. yang2010exploring (yang2010exploring) extracted 8 kinds of facial expressions as high-level features. They built compositional features of local Haar appearances by a minimum error based optimization strategy, which are embedded into an improved AdaBoost algorithm. For images in which no faces are detected, the expression is simply set to neutral. Finally, they generated an 8-dimensional vector with each element representing the number of corresponding facial expressions.
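The 8-dimensional expression feature described above can be sketched as a simple count vector. This is an illustrative reading of the text, assuming a face-expression detector has already produced one label per detected face; the names `EXPRESSIONS` and `expression_vector`, and the choice to count a face-free image as a single neutral entry, are assumptions rather than details confirmed by (yang2010exploring).

```python
# Expression order follows the list given for the 'Expressions' feature in Table 6.
EXPRESSIONS = ["anger", "contempt", "disgust", "fear",
               "happiness", "sadness", "surprise", "neutral"]

def expression_vector(detected):
    """Count detected facial expressions; an image without faces counts as neutral."""
    labels = detected if detected else ["neutral"]  # assumption: one neutral entry
    return [labels.count(name) for name in EXPRESSIONS]

# Hypothetical detector output for an image with three faces.
vector = expression_vector(["happiness", "happiness", "fear"])
```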

Ref Feature Fusion Learning Dataset Task Result
(machajdik2010affective) Elements, Composition, FS early NB IAPSa, Abstract, ArtPhoto cla 0.471, 0.357, 0.495
(lee2011fuzzy) MPEG-7 KNN unreleased cla 0.827
(lu2012shape) Shape, Elements early SVM, SVR IAPSa; IAPS cla; reg 0.314; V-1.350, A-0.912
(li2012context) Segmented objects SL IAPS, ArtPhoto cla 0.612, 0.610
(yuan2013sentribute) Sentributes SVM, LR Tweet cla 0.824
(wang2013interpretable) Aesthetics NB Abstract, ArtPhoto cla 0.726, 0.631
(zhao2014exploring) Principles SVM, SVR IAPSa, Abstract, ArtPhoto; IAPS cla; reg 0.635, 0.605, 0.669; V-1.270, A-0.820
(zhao2014affective) LOW_C, Elements, Attributes, Principles, ANP, Expressions graph MGL IAPSa, Abstract, ArtPhoto, GAPED, Tweet ret 0.773, 0.735, 0.658, 0.811, 0.701
(sartori2015s) IttenColor SL MART, devArt cla 0.751, 0.745
(rao2016multi) BoVW MIL IAPSa, Abstract, ArtPhoto cla 0.699, 0.636, 0.707
(alameda2016recognizing) IttenColor MC MART, devArt cla 0.728, 0.761
(zhao2016predicting) GIST, Elements, Attributes, Principles, ANP, Expressions graph RMTHG IESN cla_p 0.582
(zhao2015predicting) GIST, Elements, Principles - SSL Abstract dis_d 0.134

(zhao2017approximating; zhao2018discrete) GIST, Elements, Attributes, Principles, ANP, AlexNet weighted WMMSSL Abstract, Emotion6, IESN dis_d 0.482, 0.479, 0.478
(yang2017learning) ANP, VGG16 - ACPNN Abstract, Emotion6, FlickrLDL, TwitterLDL dis_d 0.480, 0.506, 0.469, 0.555
(zhao2017learning) GIST, Elements, Attributes, Principles, ANP, AlexNet weighted WMMCPNN Abstract, Emotion6, IESN dis_d 0.461, 0.464, 0.470
(zhao2017continuous) GIST, Elements, Attributes, Principles, ANP, AlexNet MTSSR IESN dis_c 0.436
Table 7. Representative work on AC of images using hand-crafted features, where ‘Fusion’ indicates the fusion strategy of different features, ‘cla, reg, ret, cla_p, dis_d, dis_c’ in the Task column are short for classification, regression, retrieval, personalized classification, discrete distribution learning, continuous distribution learning (the same below), respectively, ‘Result’ is the reported best accuracy for classification, mean squared error for regression, discounted cumulative gain for retrieval, F1 for personalized classification, and KL divergence for distribution learning (the first line (zhao2015predicting) is the result on sum of squared difference) on the corresponding datasets.

More recently, the semantic concepts are described by adjective noun pairs (ANPs) (borth2013large; Chen2014DeepSentiBank), which are detected by SentiBank (borth2013large) or DeepSentiBank (Chen2014DeepSentiBank). The advantage of ANPs is that they turn a neutral noun into a concept with strong emotions and make the concepts more detectable than nouns or adjectives individually. A 1,200-dimensional vector representing the probabilities of the ANPs can form a feature vector.

Table 6 summarizes the above-mentioned hand-crafted features at different levels for AC of images. Some recent methods also extracted CNN features from pre-trained deep models, such as AlexNet (zhao2017approximating; zhao2017learning) and VGGNet (yang2017learning).

To map the extracted handcrafted features to emotions, Machine Learning Methods are commonly employed. Typical learning models include Naive Bayes (NB) (machajdik2010affective; wang2013interpretable), support vector machine (SVM) (lu2012shape; yuan2013sentribute; zhao2014exploring), K-nearest neighbor (KNN) (lee2011fuzzy), sparse learning (SL) (li2012context; sartori2015s), logistic regression (LR) (yuan2013sentribute), multiple instance learning (MIL) (rao2016multi), and matrix completion (MC) (alameda2016recognizing) for emotion classification; support vector regression (SVR) (lu2012shape; zhao2014exploring) for emotion regression; and multi-graph learning (MGL) (zhao2014affective) for emotion retrieval.

Instead of assigning the DEC to an image, some recent methods began to focus on the perception subjectivity challenge, i.e., predicting personalized emotions for each viewer or learning emotion distributions for each image. The personalized emotion perceptions of a specified user after viewing an image are predicted in (zhao2016predicting; zhao2018predicting) in the context of online social networks. They considered different types of factors that may contribute to emotion recognition, including the images’ visual content, the social context related to the corresponding users, the emotions’ temporal evolution, and the images’ location information. To jointly model these factors, they proposed rolling multi-task hypergraph learning (RMTHG), which can also easily handle the data incompleteness issue.

Generally, the distribution learning task can be formulated as a regression problem, which differs slightly for different distribution categories (i.e., discrete or continuous). For example, if emotion is represented by CE, the regression problem targets predicting the discrete probability of each emotion category, with the probabilities summing to 1; if emotion is represented based on DES, the regression problem is typically transformed into predicting the parameters of specified continuous probability distributions. For the latter, we usually need to first determine the form of the continuous distribution, such as an exponential or Gaussian distribution. Some representative methods for distribution learning of discrete emotions include shared sparse learning (SSL) (zhao2015predicting), weighted multimodal SSL (WMMSSL) (zhao2017approximating; zhao2018discrete), augmented conditional probability neural network (ACPNN) (yang2017learning), and weighted multi-modal CPNN (WMMCPNN) (zhao2017learning). Both SSL and WMMSSL can only model one test image at a time, which is computationally inefficient. In contrast, once their parameters are learned, ACPNN and WMMCPNN can easily predict the emotion distribution of any test image. Based on the assumption that the VA emotion labels can be well modeled by a Gaussian mixture model (GMM) with 2 bidimensional components, zhao2017continuous (zhao2017continuous) proposed to learn continuous emotion distributions in VA space by multi-task shared sparse regression (MTSSR). Specifically, the parameters of the GMM are regressed, including the mean vector and covariance matrix of each of the 2 Gaussian components as well as the mixing coefficients.
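For the discrete case, a prediction is a probability vector over emotion categories, and it is commonly compared with the ground-truth distribution through the Kullback-Leibler (KL) divergence, the measure reported for distribution learning in Table 7. The sketch below is a minimal illustration; the helper names `normalize` and `kl_divergence`, the direction KL(ground truth || prediction), and the toy vote counts are assumptions made for the example.

```python
from math import log

def normalize(scores):
    """Scale non-negative scores so that they sum to 1."""
    total = sum(scores)
    return [s / total for s in scores]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same categories."""
    # Terms with p_i == 0 contribute nothing; q_i is floored to avoid division by zero.
    return sum(pi * log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical annotator votes over four emotion categories, and a model prediction.
ground_truth = normalize([6, 2, 1, 1])
prediction = normalize([4, 3, 2, 1])
divergence = kl_divergence(ground_truth, prediction)
```

A lower value indicates a closer match, and the divergence is zero only when the two distributions coincide.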

Table 7 summarizes some representative work based on hand-crafted features. Generally, high-level features (such as ANP) can achieve better recognition performance for images with rich semantics, mid-level features (such as Principles) are more effective for artistic photos, while low-level features (such as Elements) perform better for abstract paintings.

4.2. Deep Learning-Based Methods for AC of Images

To deal with the situation where images are weakly labeled, a potentially cleaner subset of the training instances is selected progressively in (you2015robust). First, an initial CNN model is trained on the training data. Second, training samples with distinct sentiment scores between the two classes are selected with a high probability, based on the prediction scores of the trained model on the training data itself. In (you2016building), AlexNet pre-trained on ImageNet is fine-tuned to classify emotions into 8 categories by changing the output size of the last layer of the CNN from 1000 to 8. Besides using the fully connected layer as the classifier, they also trained an SVM classifier on the features extracted from the second-to-last layer of the pre-trained AlexNet model.
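The sample-selection idea behind progressive training on weakly labeled data can be sketched as follows. This is a simplified, deterministic illustration: it keeps a sample when the gap between its two predicted class probabilities exceeds a threshold, whereas the text describes selection "with a high probability". The function name `select_confident`, the `margin` value, and the toy predictions are assumptions for the example.

```python
def select_confident(samples, probs, margin=0.5):
    """Keep samples whose two class probabilities differ by at least `margin`."""
    return [s for s, (p_neg, p_pos) in zip(samples, probs)
            if abs(p_pos - p_neg) >= margin]

# Hypothetical predictions of an initial model on its own training data:
# each tuple is (probability of negative, probability of positive).
samples = ["img_a", "img_b", "img_c"]
probs = [(0.9, 0.1), (0.55, 0.45), (0.2, 0.8)]
cleaner_subset = select_confident(samples, probs)
```

The retained subset would then be used to continue training the model.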

Multi-level deep representations (MldrNet) are learned in (rao2016learning) for image emotion classification. The input image is segmented into 3 levels of patches, which are fed to 3 different CNN models: AlexNet, an aesthetics CNN (ACNN), and a texture CNN (TCNN). The fused features are fed into multiple instance learning (MIL) to obtain the emotion labels. Building on MldrNet, zhu2017dependency (zhu2017dependency) proposed to integrate the different levels of features with a bidirectional GRU model (Bi-GRU) to exploit their dependencies. They generated two features from the Bi-GRU model and concatenated them as the final feature representation. To force the feature vectors extracted from images of the same category to be close and those from different categories to be far apart, they jointly optimized a contrastive loss together with the traditional cross-entropy loss.
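The joint objective can be illustrated with the widely used pairwise form of the contrastive loss, which pulls same-category features together and pushes different-category features apart up to a margin. The exact formulation and weighting used in (zhu2017dependency) are not given here, so the margin, the `weight` parameter, and the function names below are assumptions.

```python
from math import dist, log

def contrastive_loss(f1, f2, same_class, margin=1.0):
    """Pairwise contrastive loss on two feature vectors."""
    d = dist(f1, f2)
    # Same class: penalize distance. Different class: penalize closeness within the margin.
    return d ** 2 if same_class else max(0.0, margin - d) ** 2

def cross_entropy(probs, label):
    """Negative log-likelihood of the true label under predicted probabilities."""
    return -log(probs[label])

def joint_loss(f1, f2, same_class, probs, label, weight=1.0):
    """Cross-entropy for classification plus a weighted contrastive term."""
    return cross_entropy(probs, label) + weight * contrastive_loss(f1, f2, same_class)
```

For a pair of images, `joint_loss` would be evaluated on their feature vectors together with the classifier output for one of them; in training, the terms are summed over pairs in a batch.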

More recently, yang2018retrieving (yang2018retrieving) employed deep metric learning to explore the correlation of emotional labels with the same polarity, and proposed a multi-task deep framework to optimize both retrieval and classification tasks. By considering the relations among emotional categories in Mikels’ wheel, they jointly optimized a novel sentiment constraint with the cross-entropy loss. Extending triplet constraints to a hierarchical structure, the sentiment constraint employs a sentiment vector based on the texture information from the convolutional layer to measure the difference between affective images. In (yang2018weakly; she2019wscnet), a weakly supervised coupled convolutional neural network is proposed to exploit the discriminability of localized regions for emotion classification. Based on the image-level labels, a sentiment map is first detected in one branch with the cross spatial pooling strategy. The holistic and localized information is then combined in the other branch to conduct the classification task. The detected sentiment map can easily explain which regions of an image determine the emotions.

The above deep methods mainly focus on dominant emotion prediction. There is also some work on emotion distribution learning based on deep models. The first such work, a mixed bag of emotions (peng2015mixed), trains a deep CNN regressor (CNNR) for each emotion category in Emotion6 based on the AlexNet architecture. They changed the number of output nodes to 1 to predict a real value for each emotion category and replaced the Softmax loss with a Euclidean loss. To ensure that the probabilities sum to 1, they normalized the predicted probabilities of all emotion categories. However, CNNR has some limitations. First, the predicted probability cannot be guaranteed to be non-negative. Second, the probability correlations among different emotions are ignored, since the regressor for each emotion category is trained independently. yang2017joint (yang2017joint) designed a multi-task deep framework based on VGG16 by jointly optimizing the cross-entropy loss for emotion classification and the Kullback-Leibler (KL) divergence loss for emotion distribution learning. To adapt single-emotion datasets to the emotion distribution learning setting, they transformed each single label into an emotion distribution using emotion distances computed on Mikels’ wheel (zhao2016predicting; zhao2018predicting). By enlarging the set of training samples in this way, the method achieves state-of-the-art performance for discrete emotion distribution learning.
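Transforming a single label into an emotion distribution based on wheel distances can be sketched as below. This is only an illustration of the idea: the categories are indexed 0 to 7 in a hypothetical circular order, and the Gaussian-shaped decay with parameter `sigma` is an assumption, not the specific transformation used in (yang2017joint).

```python
from math import exp

def label_to_distribution(label, num_classes=8, sigma=1.0):
    """Spread a single label over classes arranged on a circle (hypothetical ordering)."""
    weights = []
    for k in range(num_classes):
        step = abs(k - label)
        d = min(step, num_classes - step)  # circular distance on the wheel
        weights.append(exp(-d * d / (2 * sigma ** 2)))
    total = sum(weights)
    return [w / total for w in weights]

# The labeled category receives the largest probability; neighbors receive less.
distribution = label_to_distribution(2)
```

Smaller values of `sigma` concentrate the mass on the labeled category, approaching the original one-hot label.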

Ref Base net Pre #Feat Cla Loss Dataset Task Result
(you2015robust) self-defined no 24 FlickrCC cla 0.781
(you2016building) AlexNet yes 4,096 SVM FI, IAPSa, Abstract, ArtPhoto cla 0.583, 0.872, 0.776, 0.737
(rao2016learning) AlexNet, ACNN, TCNN yes 4,096, 256, 4,096 MIL FI, IAPSa, Abstract, ArtPhoto, MART cla 0.652, 0.889, 0.825, 0.834, 0.764
(zhu2017dependency) self-defined no 512 contrastive FI, IAPSa, ArtPhoto cla 0.730, 0.902, 0.855
(yang2018retrieving) GoogleNet-Inception yes 1,024 sentiment FI, IAPSa, Abstract, ArtPhoto cla; ret 0.676, 0.442, 0.382, 0.400; 0.780, 0.819, 0.788, 0.704
(yang2018weakly; she2019wscnet) ResNet-101 yes 2,048 FI, Tweet cla 0.701, 0.814
(peng2015mixed) AlexNet yes 4,096 Euclidean Emotion6 dis_d 0.480
(yang2017joint) VGG16 yes 4,096 KL Emotion6, FlickrLDL, TwitterLDL dis_d 0.420, 0.530, 0.530
Table 8. Representative work on deep learning based AC of images, where ‘Pre’ indicates whether the network is pre-trained using ImageNet, ‘# Feat’ indicates the dimension of last feature mapping layer before the emotion output layer, ‘Cla’ indicates the classifier used after the last feature mapping with default Softmax, ‘Loss’ indicates the loss objectives (besides the common cross-entropy loss for classification), and ‘Result’ is the reported best accuracy for classification, discounted cumulative gain for retrieval, and KL divergence for distribution learning on the corresponding datasets.

The representative deep learning based methods are summarized in Table 8. The deep representation features generally perform better than the hand-crafted ones, which are intuitively designed for specific domains based on several small-scale datasets. However, how the deep features correlate to specific emotions is unclear.

5. Affective Computing of Music

Music emotion recognition (MER) strives to identify the emotion expressed by music and subsequently predict the listener’s felt emotion from acoustic content and music metadata, e.g., lyrics and genre. Emotional understanding of music has applications in music recommendation and is particularly useful for production music retrieval. An analysis of search queries from creative professionals showed that 80% contain emotional terms, demonstrating the prominence of emotions in that field (inskip2012). A growing body of work has tried to address emotional understanding of music from acoustic content and metadata (see (yang2012machine; Kim2010) for earlier reviews on this topic).

Earlier work on emotion recognition from music relied on extracting acoustic features similar to the ones used in speech analysis, such as audio energy and formants. Acoustic features describe attributes related to musical dimensions. Musical dimensions include melody, harmony, rhythm, dynamics, timbre (tone color), expressive techniques, musical texture, and musical form (panda_thesis), as shown in Table 9. Some also add energy as a musical feature that is important for MER (yang2012machine). Melody is a linear succession of tones and can be captured by features representing key, pitch, and tonality. Among others, chroma is often used to represent melodic features (yang2012machine). Harmony is how the combination of various pitches is processed during hearing. Understanding harmony involves chords or multiple notes played together. Examples of acoustic features capturing harmony include chromagram, key, mode, and chords (aljanaki2016emotion). Rhythm consists of repeated patterns of musical sounds, i.e., notes and pulses that can be described in terms of tempo and meter. Higher-tempo songs often induce higher arousal; fluent rhythm is associated with higher valence, and firm rhythm is associated with sad songs (yang2012machine). Mid-level acoustic features, such as onset rate, tempo, and beat histogram, can represent rhythmic characteristics of music. Dynamics of music involve the variation in softness or loudness of notes, which includes change of loudness (contrast) and emphasis on individual sounds (accent) (panda_thesis). Dynamics of music can be captured by changes in acoustic features related to energy, such as root mean square (RMS) energy. Timbre is the perceived sound quality of musical notes. Timbre is what differentiates different voices and instruments playing the same sound. Acoustic features capturing timbre describe sound quality (Yang2018) and include MFCC, spectral shape features (centroid, contrast, flatness), and zero crossing rate (aljanaki2016emotion). Expressive techniques are the way a musical piece is played, including tempo and articulation (panda_thesis). Acoustic features, such as tempo, attack slope, and attack time, can be used to describe this dimension. Musical texture is how rhythmic, melodic, and harmonic features are combined in music production (panda_thesis). It is related to the range of tones played at the same time. Musical form describes how a song is structured, such as introduction, verse, and chorus (panda_thesis). Energy, whose dynamics are described by the music dynamics features, is strongly associated with arousal perception.
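Two of the low-level descriptors named above, RMS energy and zero crossing rate, are simple enough to compute directly from a frame of audio samples. The following sketch uses only the standard library and treats a frame as a list of floats; the function names and the toy frame are illustrative assumptions, and toolboxes such as those discussed later in this section provide tuned implementations.

```python
from math import sqrt

def rms_energy(frame):
    """Root mean square energy of one frame of audio samples."""
    return sqrt(sum(x * x for x in frame) / len(frame))

def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

# Toy frame (hypothetical): a signal alternating between +1 and -1.
frame = [1.0, -1.0, 1.0, -1.0]
energy = rms_energy(frame)
zcr = zero_crossing_rate(frame)
```

Tracking how `rms_energy` changes across successive frames gives a rough view of musical dynamics, while a high `zero_crossing_rate` tends to indicate noisier or brighter content.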

Musical dimension Acoustic features
Melody pitch
Harmony chromagram, chromagram peak, key, mode, key clarity, harmonic change, chords
Rhythm tempo, beat histograms, rhythm regularity, rhythm strength, onset rate
Dynamics and loudness RMS energy, loudness, timbral width
Timbre MFCC, spectral shapes (centroid, shape, spread, skewness, kurtosis, contrast and flatness), brightness, rolloff frequency, zero crossing rate, spectral contrast, auditory modulation features, inharmonicity, roughness, dissonance, odd to even harmonic ratio
Musical form Similarity Matrix (similarity between all possible frames) (panda_thesis)
Texture attack slope, attack time
Table 9. Musical dimensions and acoustic features describing them.

There are a number of toolboxes available for extracting acoustic features from music that can be used for music emotion recognition. Music Analysis, Retrieval and Synthesis for Audio Signals (Marsyas) (tzanetakis2000marsyas) is an open source framework developed in C++ that supports extracting a large range of acoustic features with music information retrieval applications in mind, including time-domain zero-crossings, spectral centroid, rolloff, flux, and Mel-Frequency Cepstral Coefficients (MFCC). MIRToolbox (mirtoolbox) is an open source toolbox implemented in MATLAB for music information retrieval applications. MIRToolbox offers the ability to extract a comprehensive set of acoustic features at different levels, including features related to tonality, rhythm, and structure. Speech and music interpretation by large-space extraction, or OpenSMILE (eyben2010; eyben2013), is open source software developed in C++ with the ability to extract a large number of acoustic features for speech and music analysis in real time. LibROSA (mcfee2015librosa) is a Python package for music and audio analysis. It is mainly developed with music information retrieval applications in mind and supports importing from different audio sources and extracting musical features such as onsets, chroma, and tempo in addition to low-level acoustic features. ESSENTIA (ESSENTIA) is an open source library for audio analysis developed in C++ with a Python interface. ESSENTIA contains an extensive collection of algorithms supporting audio input/output functionality, standard digital signal processing blocks, statistical characterization of data, and a large set of spectral, temporal, tonal, and high-level music descriptors.

Music emotion recognition attempts either to classify songs or excerpts into categories (classification) or to estimate their expressed emotions on continuous dimensions (regression). The choice of machine learning model in music emotion recognition depends on the emotion representation used. Mood clusters (mirex07), dimensional representations such as arousal, tension, and valence, as well as music-specific emotion representations can be used. An analysis of the submissions to the MediaEval “Emotion in Music” task revealed that the use of deep learning accounted for superior emotion recognition performance much more than the choice of features (deam). Recent methods for emotion recognition in music rely on deep learning and often use spectrogram features that are converted to images (aljanaki2018data). aljanaki2018data (aljanaki2018data) proposed learning musically meaningful mid-level perceptual features that can describe emotions in music. They demonstrated that perceptual features such as melodiousness, modality, rhythmic complexity, and dissonance can describe a large portion of the emotional variance in music, both in dimensional representation and in MIREX clusters. They also trained a deep convolutional neural network to recognize these mid-level attributes. There has also been work attempting to use lyrics in addition to acoustic content for recognizing emotion in music (7536113). However, lyrics are copyrighted and not easily available, which hinders further work in this direction.

6. Affective Computing of Videos

Currently, the features used in affective video content analysis mainly fall into two categories (shukla2018multimodal; shukla2017evaluating). The first treats the video content as the stimulus and extracts features reflecting the emotions conveyed by the content itself. The second extracts features from the viewers. Features extracted from the video content are content-related features, and features formed from the signals of the viewers’ responses are viewer-related features.

6.1. Content-related Features

Generally speaking, video content consists of a series of ordered frames as well as the corresponding audio signals. Therefore, it is natural to extract features from these two modalities. The audiovisual features can further be divided into low-level and mid-level features according to their ability to describe the semantics of video content.

6.1.1. Low-level features

Commonly, the low-level features are directly computed from the raw visual and audio content, and usually carry no semantic information. As for visual content, color, lighting, and tempo are important elements that can endow the video with strong emotional rendering and give viewers direct visual stimuli. In many cases, computations are conducted over each frame of the video, and the per-frame results averaged over the whole video are taken as visual features. Specifically, the color-related features often contain the histogram and variance of color (shukla2018multimodal; chen2018identifying; zhu2019hybrid), the proportions of color (nemati2016incorporating; Niu2017TemporalFV), the number of white frames and fades (zhu2019hybrid), the grayness (chen2018identifying), darkness ratio, color energy (zhao2013flexible; Wang2015Emotion), brightness ratio and saturation (niu2017novel; Niu2017TemporalFV), etc. In addition, the differences between dark and light can be reflected by the lighting key, which is used to evoke emotions in video and draw the attention of viewers by creating an emotional atmosphere (nemati2016incorporating). As for the tempo-related features, according to movie grammar, properties of shots, such as shot change rate and shot length variance (niu2016novel; niu2017novel; chen2018identifying), can reinforce the expression of video. To better take advantage of the temporal information of the video, motion vectors have been computed as features in (zhong2019video). Since optical flow can characterize the influence of camera motions, the histogram of optical flow (HOF) has been computed as a feature in (yi2018multi). Additionally, yi2018multi (yi2018multi) traced motion key points at multiple spatial scales and computed the mean motion magnitude of each frame as features.
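The per-frame-then-average pattern and a tempo-related feature can be sketched as follows. Frames are represented as flat lists of grayscale intensities for simplicity. The shot-boundary rule (a mean absolute frame difference above a fixed `threshold`) is a deliberately crude assumption for illustration, not the detector used in the cited work, and both function names are hypothetical.

```python
def mean_brightness(frames):
    """Average pixel intensity per frame, then average over all frames."""
    per_frame = [sum(f) / len(f) for f in frames]
    return sum(per_frame) / len(per_frame)

def shot_change_rate(frames, threshold=50.0):
    """Fraction of consecutive frame pairs whose mean absolute difference exceeds `threshold`."""
    cuts = 0
    for prev, cur in zip(frames, frames[1:]):
        diff = sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)
        if diff > threshold:
            cuts += 1
    return cuts / (len(frames) - 1)

# Toy video (hypothetical): four 2-pixel frames with one abrupt change.
frames = [[10, 10], [12, 12], [200, 200], [198, 198]]
brightness = mean_brightness(frames)
change_rate = shot_change_rate(frames)
```

Other color-related features, such as saturation or darkness ratio, follow the same pattern with a different per-frame statistic.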

To represent audio content, pitch, zero crossing rate (ZCR), Mel frequency cepstrum coefficients (MFCC), and energy are the most popular features (zhao2013flexible; Wang2015Emotion; han2015arousal; nemati2016incorporating; acar2017comprehensive). In particular, MFCC (hu2016multi; zhu2016video; mcduff2017large; ben2018deep; yi2018multi; zhu2019hybrid) is frequently used to characterize emotions in video clips, and the derivatives and statistics (min, max, mean) of MFCC are also widely explored. As for pitch, (niu2016novel) shows that the pitch of sound is closely associated with some emotions, such as anger with higher pitch and sadness with lower standard deviation of pitch. A similar situation occurs for energy (niu2016novel; Niu2017TemporalFV). For example, the total energy of anger or happiness is higher than that of unexciting emotions. ZCR (Niu2017TemporalFV) is used to separate different types of audio signals, such as music, environmental sound, and human speech. Besides these frequently used features, audio flatness (zhu2019hybrid), spectral flux (zhu2019hybrid), delta spectrum magnitude, harmony (Niu2017TemporalFV; sivaprasad2018multimodal; zhu2019hybrid), band energy ratio, spectral centroid (hu2016multi; zhu2019hybrid), and spectral contrast (Niu2017TemporalFV) are also utilized.
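Two of the simplest audio descriptors named above, ZCR and energy, can be computed directly from the waveform. The sketch below is illustrative only: it assumes a mono signal given as a 1-D array, and the frame length and hop size are arbitrary defaults, not values taken from the cited works.

```python
import numpy as np

def zero_crossing_rate(signal):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.signbit(np.asarray(signal, dtype=float))
    return float(np.mean(signs[1:] != signs[:-1]))

def short_time_energy(signal, frame_len=1024, hop=512):
    """Mean squared amplitude of each (possibly overlapping) analysis frame."""
    signal = np.asarray(signal, dtype=float)
    starts = range(0, len(signal) - frame_len + 1, hop)
    return np.array([np.mean(signal[s:s + frame_len] ** 2) for s in starts])
```

Statistics of the per-frame energy (for example its mean or maximum over a clip) can then serve as clip-level features, in the same spirit as the MFCC statistics mentioned above.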

Evidently, the aforementioned features are mostly handcrafted. With the emergence of deep learning, features can be learned automatically by deep neural networks. Pre-trained convolutional neural networks (CNNs) are used to learn static representations from every frame or from selected key frames, while a long short-term memory (LSTM) network is exploited to capture the dynamic representations in videos.

For instance, in (xu2016heterogeneous), an AlexNet with seven fully-connected layers trained on 2600 ImageNet classes is used to learn features. A Convolutional Auto-Encoder (CAE) is designed in (pang2015mutlimodal) to ensure that the CNNs can extract the visual features effectively. ben2018deep (ben2018deep) first used the pre-trained ResNet-152 to extract feature vectors, which are then fed into an LSTM in temporal order to extract high-order representations. A pre-trained model, SoundNet, is utilized to learn audio features. Because the emotions expressed in a video are in many cases induced and communicated by the protagonist, the features of the protagonist are extracted from the key frames by a pre-trained CNN and used in video affective analysis in (zhu2016video; zhu2019hybrid). In addition to the protagonist, other objects in each frame also give insights into the emotional expression of the video. For example, shukla2018looking (shukla2018looking) removed the non-gaze regions from video frames (Eye ROI) and built a coarse-grained scene structure that retains gist information using a Gaussian filter. After these operations, the subsequent video affective analysis can pay more attention to important information and reduce unnecessary noise.

6.1.2. Mid-level features

Unlike low-level features, mid-level features often contain semantic information. For example, the EmoBase10 feature depicting audio cues is computed in (yi2018multi). hu2016multi (hu2016multi) proposed a method that combines audio and visual features to model the contextual structures of the key frames selected from a video, producing a so-called multi-instance sparse coding (MI-SC) for subsequent analysis. In addition, lexical features (muszynski2019recognizing) are extracted from the dialogues of speakers using a natural language toolkit. These features can reflect the emotional changes in videos and can also represent the overall emotional expression of a video. muszynski2019recognizing (muszynski2019recognizing) used aesthetic movie highlights related to occurrences of meaningful movie scenes to define some experts. The features produced by these experts are more knowledgeable and abstract for video affective analysis, especially for movies. HHTC features, which are computed on the basis of a combination of the Hilbert-Huang Transform of visual-audio signals and cross-correlation features, are proposed in (niu2017novel).

6.2. Viewer-related Features

Besides the content-related features, viewers’ facial expressions and the changes in physiological signals evoked by video content are the most common sources for extracting viewer-related features. mcduff2017large (mcduff2017large) coded the facial actions of viewers for further affective video analysis. Among various physiological signals, electrocardiography (ECG), galvanic skin response (GSR), and electroencephalography (EEG) are the most frequently used ones, and their statistical measures, such as mean, median, and spectral power bands, are often recommended as features. Wang2015Emotion (Wang2015Emotion) used canonical correlation analysis (CCA) to construct a new EEG feature with the assistance of its relationship to the video content. In (gui2018implicit), viewer-related features are extracted from the whole pupil dilation ratio time-series, without considering the differences in pupil diameter between the eyes, such as its average and standard deviation as global features and the four spectral power bands as local features.
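Statistical measures and spectral band powers of a physiological signal can be computed as in the following sketch. It is a minimal example under stated assumptions: the signal is a uniformly sampled 1-D array, the power spectrum is a plain FFT periodogram, and the band boundaries in `EXAMPLE_BANDS` are conventional EEG ranges chosen for illustration (the cited works may use different bands, in particular for pupil dilation).

```python
import numpy as np

# Illustrative EEG-style bands in Hz; an assumption, not taken from the cited works.
EXAMPLE_BANDS = {"theta": (4.0, 8.0), "alpha": (8.0, 13.0), "beta": (13.0, 30.0)}

def band_power(signal, fs, low, high):
    """Average periodogram power of `signal` in the [low, high) Hz band."""
    signal = np.asarray(signal, dtype=float)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    psd = np.abs(np.fft.rfft(signal)) ** 2 / len(signal)
    mask = (freqs >= low) & (freqs < high)
    return float(psd[mask].mean()) if mask.any() else 0.0

def physiological_features(signal, fs, bands=EXAMPLE_BANDS):
    """Global statistics plus one band-power feature per frequency band."""
    signal = np.asarray(signal, dtype=float)
    feats = {
        "mean": float(np.mean(signal)),
        "median": float(np.median(signal)),
        "std": float(np.std(signal)),
    }
    for name, (low, high) in bands.items():
        feats[f"power_{name}"] = band_power(signal, fs, low, high)
    return feats
```

A windowed estimator such as Welch's method would usually give smoother spectra on real recordings; the plain periodogram is used here only to keep the example short.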

In addition to viewers’ responses mentioned above, the comments or other textual information produced by viewers can also reflect their attitudes or emotional reactions toward the videos. In the light of this, it is reasonable to consider users’ textual comments or other textual information to extract features. In (nemati2016incorporating), the “sentiment analysis” module using Unigrams and Bigrams is built to learn comment-related features of the collected data according to the YouTube link provided by the DEAP dataset.
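Unigram and bigram features of a viewer comment can be counted as sketched below. The tokenizer and the function name are assumptions made for illustration; the exact preprocessing used in (nemati2016incorporating) is not specified here.

```python
import re
from collections import Counter

def ngram_counts(text, n_values=(1, 2)):
    """Count word n-grams (unigrams and bigrams by default) in a comment."""
    # Simple lowercase word tokenizer; real systems often use a dedicated tokenizer.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts
```

Bigrams matter for sentiment because they keep local context: in a comment such as "not good at all", the bigram "not good" carries a polarity that the unigram "good" alone would misrepresent.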

6.3. Machine Learning Methods

After feature extraction, a classifier or a regressor is used to obtain emotional analysis results. For classification, there are several frequently used classifiers, including support vector machines (SVM) (nemati2017evidential; acar2017comprehensive; yi2018multi; wang2017content; gupta2016quality; shukla2017evaluating), Naive Bayes (NB) (gupta2016quality; nemati2017evidential), Linear Discriminant Analysis (LDA) (shukla2018looking), logistic regression (LR) (xu5731visual), and ensemble learning (acar2017comprehensive).

Recent work shows that SVM-based methods are very popular for affective video content analysis due to their simplicity, max-margin training property, and use of kernels (wang2015video). For example, the work of yi2018multi (yi2018multi) demonstrated that linear SVM is more suitable for classification than RBM, MLP, and LR. In (shukla2017evaluating), LDA, linear SVM (LSVM), and radial basis SVM (RSVM) classifiers are employed in emotion recognition experiments, and the RSVM obtained the best F1 scores. In (gupta2016quality), both Naive Bayes and SVM are used as classifiers in unimodal and multimodal conditions. In the unimodal condition, NB is not better than SVM, and the fusion results showed that SVM is much better than NB in multimodal situations. However, SVM also has its shortcomings, such as the difficulty of selecting suitable kernel functions. Indeed, SVM is not always the best choice. In (acar2017comprehensive), the results demonstrated that ensemble learning outperforms SVM in terms of classification accuracy. Ensemble learning has attracted a lot of attention in many fields because of its accuracy, simplicity, and robustness. In addition, in (xu5731visual), LR is adopted as the classifier for its effectiveness and simplicity. In fact, LR is used frequently in many transfer learning tasks.

However, none of the classifiers mentioned above is able to capture temporal information. Some other methods try to exploit it. For example, gui2018implicit (gui2018implicit) combined SVM and LSTM to predict emotion labels. Specifically, global features and sequence features are proposed to represent the pupillary response signals. Then an SVM classifier is trained with the global features and an LSTM classifier is trained with the sequence features. Finally, a decision fusion strategy is proposed to combine these two classifiers.

Ref | Feature | Fusion | Learning | Dataset | Task | Result
(sivaprasad2018multimodal) | Mel frequency spectral; MFCC, Chroma and their derivatives and statistics; Audio compressibility; Harmonicity; Shot frequency; HOF and statistics; Histogram of 3D HSV and statistics; Video compressibility; Histogram of facial area | decision | LSTM | Dataset described by Malandrakis (5946961) | reg | : : : :
(gui2018implicit) | Average, standard deviation and four spectral power bands of pupil dilation ratio time-series | decision | SVM, LSTM | MAHNOB-HCI | cla | :0.730, :0.780
(ben2018deep) | Audio and visual deep features from pretrained model | feature | CNN, LSTM, SVM | PMIT | cla | :0.0.2122
(acar2017comprehensive) | MFCC; Color values; HoG; Dense trajectory descriptor; CNN-learned features | decision | CNN, SVM, Ensemble | DEAP | cla | : 0.81, 0.49
(mo2018novel) | HHTC features | feature | SVR | Discrete LIRIS-ACCEDE | reg | :0.294, : 0.290
(gupta2016quality) | Statistical measures (such as mean, median, skewness, kurtosis) for EEG data, power spectral features, ECG, GSR, Face/Head-pose | decision | SVM/NB | music excerpts (morreale2013robin) | cla | F1 (v: 0.59, 0.58, a: 0.60, 0.57)
(guo2019affective) | Time-span visual and visual features | feature | CNN, Opensmile toolbox | music excerpts (morreale2013robin) | cla | :0.082
(han2015arousal) | tempo; pitch; zero cross; roll off; MFCCs; Saturation; Color heat; Shot length feature; General preferences; Visual excitement; Motion feature; fMRI feature | feature | DBM, SVM | TRECVID | cla | -
(zhu2019hybrid) | Colorfulness; MFCC; CNN-learned features from the keyframes containing protagonist | decision | CNN, SVM, SVR | LIRIS-ACCEDE, PMSZU | cla/reg | -
(zhong2019video) | Multi-frame motion vectors | decision | CNN | SumMe, TVsum, Continuous LIRIS-ACCEDE | reg | -
(hu2016multi) | The median of the L values in Luv space; means and variances of components in HSV space; texture feature; mean and standard deviation of motions between frames in a short; MFCC; Spectral power; mean and variance of the spectral centroids; Time domain zero crossings rate; Multi-instance sparse coding | feature | SVM | Musk1, Musk2, Elephant, Fox, Tiger | cla | : 0.911, 0.906, 0.885, 0.627, 0.868
(nemati2016incorporating) | Lighting key; Color; Motion vectors; ZCR; energy; MFCC; pitch; Textual features | decision | SMO, Naive Bayes | DEAP | cla | F1:0.849 0.811 : 0.911 0.883
(chen2018identifying) | Key lighting; Grayness; Fast motion; Shot change rate; Shot length variation; MFCC; CNN-learned features; power spectral density; EEG; ECG; respiration; galvanic skin resistance | feature | SVM | DEAP | cla | :0.7 0.7 0.7125, :0.6876 0.7 0.8 F1 (A:0.664 0.687 0.789)
(nemati2017evidential) | MFCC; ZCR; energy; pitch, color histograms; lighting key; motion vector | decision | SVM, Naive Bayes | DEAP | cla | F1: 0.869, 0.846 : 0.925, 0.897
(shukla2017evaluating) | CNN feature, low-level audio visual features, EEG | decision | LDA, LSVM, RSVM | Dataset introduced by the authors | cla | -
(yi2018multi) | MKT; ConvNets feature; EmoLarge; IS13; MFCC; EmoBase10; DSIFT; HSH | decision | SVM, LR, RBM, MLP | MediaEval 2015, 2016 Affective Impact of Movies | cla | :0.574, :0.462
(shukla2018looking) | CNN feature | - | SVM, LDA | dataset in (shukla2017affect) | cla | -
(baveye2015deep) | CNN feature | - | SVR | LIRIS-ACCEDE | reg | : 0.021, : 0.027
Table 10. Representative work on AC of videos using various kinds of features. Depending on the work, the Result column reports the Pearson correlation coefficients of arousal and valence, the mean squared error of arousal and valence, the accuracy of arousal and valence, the average accuracy, or the mean average precision. ‘statistics’ means (min, max, mean).

A regressor is needed when mapping the extracted features to the continuous dimensional emotion space. Recently, one of the most popular regression methods is support vector regression (SVR) (mo2018novel; baveye2015deep; zhu2019hybrid). For example, in (baveye2015deep), video features such as audio, color, and aesthetic features are fed into SVR in the SVR-Standard experiment. In the SVR-Transfer learning experiment, the pre-trained CNN is treated as a feature extractor, and its outputs are used as the input to the SVR. The experimental results showed that SVR-Transfer learning outperforms the other methods. Indeed, the various kernel functions in SVR provide stronger adaptability.
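Predictions in the continuous valence-arousal space are commonly scored with the Pearson correlation coefficient, which is among the metrics of Table 10, and with the mean squared error. The following sketch shows both computations in plain Python; the function names are illustrative and the inputs are assumed to be equal-length sequences of real numbers with non-zero variance.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average squared difference between annotations and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def pearson_correlation(y_true, y_pred):
    """Pearson correlation coefficient between annotations and predictions."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)
```

The two metrics are complementary: a regressor with a constant offset can reach a perfect correlation while still having a non-zero squared error, so reporting both gives a fuller picture.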

6.4. Data Fusion

In general, there are two fusion strategies for multimodal information: feature-level fusion and decision-level fusion. Feature-level fusion means that the multimodal features are combined and then used as the input of a classifier or a regressor. Decision-level fusion combines the results of several classifiers, and the final result is computed according to the fusion method.

One way of feature-level fusion is implemented by feature accumulation or concatenation (ben2018deep; zhu2019hybrid; yi2018multi). In (ben2018deep), two feature vectors for visual and audio data are averaged as the global genre representations. In (zhu2019hybrid), multi-class features are concatenated to generate a high dimensional joint representation. Some machine learning methods are also employed to learn joint features (guo2019affective; han2015arousal; pandeya2019music; xing2019exploiting). In (guo2019affective), a two-branch network is used to combine the visual and audio features. The outputs of the two-branch network are then fed into a classifier, and the experiment results showed that the joint features outperform other methods. In (han2015arousal), the low-level audio-visual features and fMRI-derived features are fed into multimodal DBM to learn joint representations. The target of this method is to learn the relation between audio-visual features and fMRI-derived features. In (xing2019exploiting), PCA is used to learn the multimodal joint features. In (Wang2015Emotion), canonical correlation analysis (CCA) is used to construct a new video feature space with the help of EEG features and a new EEG feature space with the assistance of video content, so only one modality is needed to predict emotion during the testing process.
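The two simplest forms of feature-level fusion mentioned above, concatenation and averaging, can be sketched as follows. This is an illustration under assumptions: each modality is already summarized as one feature vector per clip, and the L2 normalization before concatenation is an added choice (to keep one modality from dominating by scale), not something prescribed by the cited works.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale a vector to unit Euclidean norm (left unchanged if it is all zeros)."""
    x = np.asarray(x, dtype=float)
    return x / max(np.linalg.norm(x), eps)

def fuse_by_concatenation(visual, audio):
    """Joint representation whose length is the sum of both feature lengths."""
    return np.concatenate([l2_normalize(visual), l2_normalize(audio)])

def fuse_by_averaging(visual, audio):
    """Element-wise mean; requires both vectors to have the same length."""
    visual = np.asarray(visual, dtype=float)
    audio = np.asarray(audio, dtype=float)
    if visual.shape != audio.shape:
        raise ValueError("averaging requires feature vectors of equal shape")
    return (visual + audio) / 2.0
```

Concatenation preserves every feature at the cost of a higher-dimensional input, whereas averaging keeps the dimension fixed but only makes sense when both modalities are embedded in a comparable space.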

By combining the results of different classifiers, decision-level fusion strategy is able to achieve better results (gui2018implicit; nemati2016incorporating; acar2017comprehensive; gupta2016quality; nemati2017evidential; sivaprasad2018multimodal). In (acar2017comprehensive), linear fusion and SVM-based fusion techniques are explored to combine outputs of several classifiers. Specifically, the output of each classifier has its own weight in linear fusion. The final result is the weighted sum of all outputs. In SVM-based fusion, the outputs of unimodal classifiers are concatenated together. And then the higher level representations for each video clip are fed into a fusion SVM to predict the emotion. Based on these results, linear fusion is better than SVM-based fusion. In (gupta2016quality; nemati2017evidential), linear fusion is also used to fuse the outputs of multiple classifiers. The differences among these linear fusion methods depend on the distribution of weights.
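Linear decision-level fusion, in which each classifier's output receives its own weight and the final result is the weighted sum, can be sketched as below. The function name is hypothetical, the inputs are assumed to be per-class scores from each unimodal classifier, and normalizing the weights to sum to one is an illustrative choice.

```python
def linear_fusion(scores_per_classifier, weights):
    """Fuse per-class scores of several classifiers by a weighted sum.

    Returns the fused score vector and the index of the winning class.
    """
    if len(scores_per_classifier) != len(weights):
        raise ValueError("one weight is required per classifier")
    total = float(sum(weights))
    norm_weights = [w / total for w in weights]  # weights sum to one
    n_classes = len(scores_per_classifier[0])
    fused = [
        sum(w * scores[c] for w, scores in zip(norm_weights, scores_per_classifier))
        for c in range(n_classes)
    ]
    label = max(range(n_classes), key=fused.__getitem__)
    return fused, label
```

The choice of weights is where linear fusion methods differ: with equal weights the classifiers vote symmetrically, while weights derived from validation accuracy let the more reliable modality dominate.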

6.5. Deep Learning Methods

Traditionally, video emotion recognition includes two steps, i.e., feature extraction and regression or classification. Because of the lack of consensus on the most relevant emotional features, we may not be able to extract the best features for the problem at hand. As a result, this two-step mode has hampered the development of affective video content analysis. To solve this problem, some methods based on end-to-end training frameworks have been proposed. khorrami2016deep (khorrami2016deep) combined a CNN and an RNN to recognize the emotional information of videos. In their method, a CNN is trained on facial images sampled from video frames to extract features, which are then fed into an RNN to perform continuous emotion recognition. In (huang2018end), a single network using ConvLSTM is proposed, where videos are input to the network and the predicted emotional information is output directly. However, due to the complexity of CNNs and RNNs, training these frameworks requires large amounts of data, whereas the samples in existing datasets for video affective content analysis are usually limited. This is why end-to-end methods are still less common than the traditional two-step methods, despite their potential.

7. Affective Computing of Multimodal Data

In this section, we survey work that analyzes multimodal data beyond audiovisual content. Most of the existing work on affective understanding of multimedia relies on one modality, even when additional modalities are available, for example in videos (jiang2014predicting). Earlier work on emotional understanding of multimedia used handcrafted features from different modalities that are fused at the feature or decision level (hanjalic2005affective; Arifin2008; Benini:2011; soleymani2009; Teixeira2012). More recent work mainly uses deep learning models (jiang2014predicting; Pang2015).

Language is a commonly used modality in addition to vision and audio. There is a large body of work on text-based sentiment analysis (pang2008opinion). Sentiment analysis from text is well established and is deployed at scale in industry in a broad set of applications involving opinion mining (soleymani2017survey). With the shift toward an increasingly multimodal social web, multimodal sentiment analysis is becoming more relevant. For example, vloggers post their opinions on YouTube, and photos commonly accompany user posts on Instagram and Twitter. Analyzing text for emotion recognition requires representing terms by features. Lexicon-based approaches are among the most popular methods for text-based emotion recognition. They use knowledge of the affect of words to estimate the affect of a document or other content. Linguistic Inquiry and Word Count (LIWC) is a well-known lexical tool that matches the terms in a document with its dictionary and generates scores along different dimensions, including affective and cognitive constructs such as “present focus” and “positive emotion” (liwc2007). The terms in each category are selected by experts and extensively validated on different content. AffectNet is another notable lexical resource, which includes a semantic network of 10,000 items with representations for “pleasantness”, “attention”, “sensitivity”, and “aptitude” (affectnet). The continuous representations can be mapped to 24 distinct emotions. DepecheMood is a lexicon created through a data-driven method by mining a news website annotated with its particular set of discrete emotions, namely, “afraid”, “amused”, “anger”, “annoyed”, “don’t care”, “happy”, and “inspired” (staiano2014depeche). DepecheMood is extended to DepecheMood++ by including Italian (depechemoodpp).
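The dictionary-matching idea behind lexicon-based tools can be illustrated with a toy example. The lexicon below is invented for this sketch and is far smaller than, and unrelated to, the actual LIWC, AffectNet, or DepecheMood resources; the scoring (fraction of tokens that fall into each category) is likewise only one simple possibility.

```python
import re
from collections import Counter

# Hypothetical toy lexicon; real lexical resources are much larger and validated.
TOY_LEXICON = {
    "positive emotion": {"happy", "love", "great"},
    "negative emotion": {"sad", "angry", "awful"},
}

def lexicon_scores(text, lexicon=TOY_LEXICON):
    """Score a document as the fraction of its tokens found in each category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = Counter()
    for token in tokens:
        for category, words in lexicon.items():
            if token in words:
                hits[category] += 1
    total = len(tokens)
    return {category: (hits[category] / total if total else 0.0)
            for category in lexicon}
```

Such scores can be used directly as text features alongside the audiovisual features, although plain word matching ignores negation and context, which is one motivation for the learned representations discussed next.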

More recent developments in text-based affective analysis are models powered by deep learning. Leveraging large-scale data, deep neural networks are able to learn representations that are relevant for affective analysis in language. Word embeddings are among the most common representations of language. Word embeddings, such as Word2Vec (word2vec) or GloVe (pennington2014glove), capture the context of a word by learning a vector representation that reflects semantic and syntactic similarities. More recently, representation learning models that can encode a whole sequence of terms (sentences, documents) have shown impressive performance in different language understanding tasks, including sentiment and emotion analysis. Bidirectional Encoder Representations from Transformers (BERT) (devlin-etal-2019-bert) is a method for learning a language model that can be trained on large amounts of data in an unsupervised manner. This pre-trained model is very effective in representing a sequence of terms as a fixed-length representation (vector). The BERT architecture is a multi-layer bidirectional Transformer network that encodes the whole sequence at once. BERT representations achieve state-of-the-art results in multiple natural language understanding tasks.

The audiovisual features that are used for multimodal understanding of affect are similar to the ones discussed in the previous sections. The main difference among multimodal models lies in their methods for multimodal fusion. Multimodal methods involve extracting features from multiple modalities, e.g., audiovisual, and training joint or separate machine learning models for fusion (tadas_mm_survey). Multimodal fusion can be done in model-based and model-agnostic ways. Model-agnostic fusion methods do not rely on a specific classification or regression method and include feature-level, decision-level, or hybrid fusion techniques. Model-based methods address multimodal fusion in model construction. Examples of model-based fusion methods include Multiple Kernel Learning (MKL) (Gonen2011), graphical models, such as Conditional Random Fields (Baltrusaitis2013), and neural networks (Rajagopalan2016; Nicolaou2011).

Pang2015 (Pang2015) used a Deep Boltzmann Machine (DBM) to learn a joint representation across text, vision, and audio to recognize expected emotions from social media videos. Each modality is separately encoded by stacking multiple Restricted Boltzmann Machines (RBMs), and the pathways are merged into a joint representation layer. The model was evaluated on recognizing eight emotion categories for 1,101 videos from (jiang2014predicting). muszynski2019recognizing (muszynski2019recognizing) studied perceived vs. induced emotion in movies. To this end, they collected additional labels on a subset of the LIRIS-ACCEDE dataset (baveye2015liris). They found that perceived and induced emotions do not always agree. Using multimodal Deep Belief Networks (DBNs), they demonstrated that fusing electrodermal responses with audiovisual content features improves the overall accuracy of emotion recognition.

In (sivaprasad2018multimodal), the authors performed regression to estimate intended arousal and valence levels (as judged by experts in (5946961)). LSTM recurrent neural networks are used for unimodal regression, and the modalities are fused via early and late fusion for audiovisual estimation, with late fusion achieving the best results.

Tarvainen2018 (Tarvainen2018) performed an in-depth analysis on how emotions are constructed in movies. They identified scene type as a major factor in emotions in movies. They then used content features to recognize emotions along three dimensions of hedonic tone (valence), energetic arousal (awake–tired) and tense arousal (tense–calm).

Bilinear fusion is a method proposed to model inter- and intra-modality interactions by performing an outer product between unimodal embeddings (Lin_2015_ICCV). zadeh2017tensor (zadeh2017tensor) extended this to a Tensor Fusion Network that models intra-modality and inter-modality dynamics in multimodal sentiment analysis. The Tensor Fusion Network includes modality embedding sub-networks, a tensor fusion layer that models the unimodal, bimodal, and trimodal interactions using a three-fold Cartesian product of the modality embeddings, and a final sentiment inference sub-network conditioned on the tensor fusion layer. The main drawback of such methods is the increase in the dimensionality of the resulting multimodal representation.
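The tensor fusion layer can be sketched as a three-way outer product of the modality embeddings. In the sketch below, a constant 1 is appended to each embedding before the product, which is how we understand the formulation in (zadeh2017tensor): it makes the unimodal and bimodal terms appear as sub-blocks of the trimodal tensor. The function name and the embedding sizes are illustrative, and the embedding and inference sub-networks are omitted.

```python
import numpy as np

def tensor_fusion(z_text, z_visual, z_audio):
    """Three-way outer product of modality embeddings, each extended by a constant 1.

    The resulting tensor contains unimodal, bimodal, and trimodal interaction terms.
    """
    t = np.concatenate([np.asarray(z_text, dtype=float), [1.0]])
    v = np.concatenate([np.asarray(z_visual, dtype=float), [1.0]])
    a = np.concatenate([np.asarray(z_audio, dtype=float), [1.0]])
    return np.einsum("i,j,k->ijk", t, v, a)
```

The dimensionality drawback is visible directly in the output shape: for embeddings of sizes 128, 32, and 32 (sizes chosen here only as an example), the fused tensor has 129 × 33 × 33 = 140,481 entries, compared with 192 for plain concatenation.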

8. Future Directions

Although remarkable progress has been made on affective computing of multimedia (ACM) data, there are still several open issues and directions that can boost the performance of ACM.

Multimedia Content Understanding. As emotions may be directly evoked in viewers by the multimedia content, accurately understanding what is contained in multimedia data can significantly improve the performance of ACM. Sometimes it is even necessary to analyze subtle details. For example, we may feel “amused” by a video of a laughing baby; but if the laugh comes from a negative character, we are more likely to feel “angry”. In such cases, besides the common property, such as “laugh”, we may need to further recognize the identity, such as “a lovely baby” or “an evil antagonist”.

Multimedia Summarization. Emotions can play a vital role in the selection of multimedia for the creation of summaries or highlights. This is an important application in the entertainment and sports industries (e.g., movie trailers, sports highlights). There has been some recent work in this direction where affect information from audiovisual cues has led to the successful creation of video summaries (Smith:augmenting; merler:highlights). In particular, the work reported in (Smith:augmenting) used audiovisual emotions in part to create an AI trailer for a 20th Century Fox film in 2016. Similarly, the AI Highlights system described in (merler:highlights) hinges on audiovisual emotional cues and has been successfully employed to create the official highlights at Wimbledon and the US Open since 2017. This is a very promising direction for affective multimedia computing, which can have a direct impact on real-world media applications.

Contextual Knowledge Modeling. The contextual information of a viewer watching some multimedia is very important. Similar multimedia data under different contexts may evoke totally different emotions. For example, we may feel “happy” when listening to a song about love at a wedding; but if the same song is played when two lovers are parting, it is more likely that we feel “sad”. The prior knowledge of viewers or multimedia data may also influence emotion perception. An optimistic viewer and a pessimistic viewer may have totally different emotions about the same multimedia data.

Group Emotion Clustering. It is too generic to simply recognize the dominant emotion, while it is too specific to predict personalized emotion. It would make more sense to model emotions for groups or cliques of viewers with similar interests and backgrounds. Clustering different viewers into corresponding groups possibly based on the user profiles may provide a feasible solution to this problem.

New AC Setting Adaptation. Because of domain shift (torralba2011unbiased), deep learning models trained on one labeled source domain may not work well on another unlabeled or sparsely labeled target domain, which results in the models’ low transferability to new domains. Exploring domain adaptation techniques that fit AC tasks well is worth investigating. One possible solution is to translate the source data to an intermediate domain that is indistinguishable from the target data while preserving the source labels (zhao2018emotiongan; zhao2019cycleemotiongan) using Generative Adversarial Networks (goodfellow2014generative; zhu2017unpaired). How to deal with some practical settings, such as multiple labeled source domains and emotion models’ homogeneity, is more challenging.

Regions-of-Interest Selection. Different regions of given multimedia data may contribute differently to emotion recognition. For example, the regions that contain the most important semantic information in images are more discriminative than the background, and some video frames are of no use for emotion recognition. Detecting and selecting the regions of interest may significantly improve both the recognition performance and the computational efficiency.

Viewer-Multimedia Interaction. Instead of direct analysis of the multimedia content or implicit consideration of viewers’ physiological signals (such as facial expressions, Electroencephalogram signals, etc.), joint modeling of both multimedia content and viewers’ responses may better bridge the affective gap and result in superior performances. We should also study how to deal with missing or corrupted data. For example, some physiological signals are unavailable during the data collection stage.

Affective Computing Applications. Although AC is claimed to be important in real-world applications, few practical systems have been developed due to the relatively low performance. With the availability of larger datasets and improvements in self-supervised and semi-supervised learning, we foresee the deployment of ACM in real-world applications. For example, in media analytics, content understanding methods will identify the emotional preferences of users and the emotional nuances of social media content to better target advertising efforts; in fashion recommendation, intelligent customer service, such as customer-multimedia interaction, can provide a better experience to customers; in advertisement, generating or curating multimedia that evokes strong emotions can attract more attention. We believe that emotional artificial intelligence will become a significant component of mainstream multimedia applications.

Benchmark Dataset Construction. Existing studies on ACM mainly adopt small-scale datasets or construct relatively larger-scale ones using keyword searching strategies without guaranteed annotation quality. To advance the development of ACM, a large-scale and high-quality dataset is urgently needed. It has been shown that there are three critical factors for dataset construction of ACM, i.e., the context of viewer response, personal variation among viewers, and the effectiveness and efficiency of corpus creation (soleymani2014corpus). In order to include a large number of samples, we may exploit online systems and crowdsourcing platforms to recruit large numbers of viewers with a representative spread of backgrounds to annotate multimedia and provide contextual information on their emotional responses. Since emotion is a subjective variable, personalized emotion annotation would make more sense, from which we can obtain the dominant emotion and the emotion distribution. Further, accurate understanding of multimedia content can boost the affective computing performance. Inferring emotional labels from social media users’ interaction with data, e.g., likes and comments, in addition to their spontaneous responses, e.g., facial expressions, where possible, will provide new avenues for enriching affective datasets.

9. Conclusion

In this article, we have surveyed affective computing (AC) methods for heterogeneous multimedia data. For each multimedia type, i.e., image, music, video, and multimodal data, we summarized and compared available datasets, handcrafted features, machine learning methods, deep learning models, and experimental results. We also briefly introduced the commonly employed emotion models and outlined potential research directions in this area. Although deep learning-based AC methods have achieved remarkable progress in recent years, an efficient and robust AC method that is able to obtain high accuracy under unconstrained conditions is yet to be achieved. With the advent of deep understanding of emotion evocation in brain science, accurate emotion measurement in psychology, and novel deep learning network architectures in machine learning, affective computing of multimedia data will remain an active research topic for a long time.

This work was supported by Berkeley DeepDrive, the National Natural Science Foundation of China (Nos. 61701273, 91748129), and the National Key R&D Program of China (Grant No. 2017YFC011300). The work of MS is supported in part by the U.S. Army. Any opinion, content or information presented does not necessarily reflect the position or the policy of the United States Government, and no official endorsement should be inferred.