Audio-Visual Sentiment Analysis for Learning Emotional Arcs in Movies

by Eric Chu, et al.

Stories can have tremendous power -- not only useful for entertainment, they can activate our interests and mobilize our actions. The degree to which a story resonates with its audience may be in part reflected in the emotional journey it takes the audience on. In this paper, we use machine learning methods to construct emotional arcs in movies, calculate families of arcs, and demonstrate the ability for certain arcs to predict audience engagement. The system is applied to Hollywood films and high quality shorts found on the web. We begin by using deep convolutional neural networks for audio and visual sentiment analysis. These models are trained on both new and existing large-scale datasets, after which they can be used to compute separate audio and visual emotional arcs. We then crowdsource annotations for 30-second video clips extracted from highs and lows in the arcs in order to assess the micro-level precision of the system, with precision measured in terms of agreement in polarity between the system's predictions and annotators' ratings. These annotations are also used to combine the audio and visual predictions. Next, we look at macro-level characterizations of movies by investigating whether there exist “universal shapes” of emotional arcs. In particular, we develop a clustering approach to discover distinct classes of emotional arcs. Finally, we show on a sample corpus of short web videos that certain emotional arcs are statistically significant predictors of the number of comments a video receives. These results suggest that the emotional arcs learned by our approach successfully represent macroscopic aspects of a video story that drive audience engagement. Such machine understanding could be used to predict audience reactions to video stories, ultimately improving our ability as storytellers to communicate with each other.



I Introduction

Theories of the origin and purpose of stories are numerous, ranging from story as “social glue” to escapist pleasure, practice for social life, and cognitive play [8, 12].

However, not all stories produce the same emotional response. Can we understand these differences in part through the lens of emotional arcs? Kurt Vonnegut, among others, once proposed the concept of “universal shapes” of stories, defined by the “Beginning–End” and “Ill Fortune–Great Fortune” axes. He argued nearly all stories could fit a core arc such as the classic “Cinderella” (rise-fall-rise) pattern [1].

It is well known that emotions play no small part in people’s lives. We have seen emotional narratives as a convincing medium for explaining the world we inhabit, enforcing societal norms, and giving meaning to our existence [21]. It has even been shown that emotions are the most important factor in how we make meaningful decisions [19].

There is also evidence that a story’s emotional content can explain the degree of audience engagement. The authors of [4] and [3] examined whether valence and emotionality were predictive of New York Times articles making the Times’ most e-mailed list. Ultimately, they found that emotional and positive media were more likely to be shared.

Motivated by the surge of video as a means of communication [14], the opportunities for machine modeling, and the lack of existing research in this area, we take steps towards tackling these questions by viewing movies through emotional arcs. We deliver the following contributions:

  • Datasets. We introduce a Spotify dataset containing over 600,000 audio samples and features that can be used for audio classification. We also collect annotation data regarding the sentiment of approximately 1000 30-second movie clips. Both datasets will be made publicly available.

  • Modeling emotional arcs. We train sentiment classifiers that can be used to compute audio and visual arcs. We also motivate a) the use of dynamic time warping with the Keogh lower bound to compare the shapes of arcs, and b) k-medoids as the algorithm for clustering.

  • Engagement analysis. We provide an example in which a movie’s arc can be a statistically significant predictor of the number of comments an online video receives.

II Related work

The work is most closely paralleled by [26], which analyzed books to state that “the emotional arcs of stories are dominated by six basic shapes.” Using text-based sentiment analysis, they use a singular value decomposition analysis to find a basis for arcs. The bases that explain the greatest amount of variance then form the basic shapes of stories.

In contrast to our computational approach, writers at Dramatica [2] have manually analyzed a number of books and films under the Dramatica theory of story, which has since been used to create software that can guide writers.

Research in sentiment analysis has primarily been text-based, with work ranging from short, sentence-length statements to long form articles [28, 24]. There has been comparatively little work on images, with the Sentibank visual sentiment concept dataset [7] being a prominent example.

Neural networks have been used in earlier research for tasks such as document recognition [18]. In recent years, deep neural networks have been successful in both the visual and audio domain, being used for image classification [17], speech recognition [13], and many other tasks.

Outside of emotional arcs, there exists other research that applies computational methods to understanding story. For instance, the M-VAD [29] and MovieQA [27] datasets include a combination of subtitles, scripts, and Described Video Service narrations for movies, enabling research on visual question-answering of plot.

III Overview

Figure 1 outlines the major pieces of this work. We note that modeling occurs at two scales – micro-level sentiment, performed on a slice of video such as a frame or snippet of audio, and macro-level emotional arcs.

Fig. 1: Overview

Reflecting this distinction, we first evaluate the ability to accurately extract micro-level emotional highs and lows, which we refer to as emotionally charged moments or emotional peaks and valleys. Specifically, we measure precision as the amount of agreement in polarity between the models’ sentiment predictions and annotators’ ratings. Second, at the macro-level, we evaluate arcs by a) clustering, and b) using the arcs for the engagement analysis. We note that the final engagement analysis uses only the visual arcs, as we are currently limited by the amount of ground truth data that allows us to combine audio and visual predictions.

IV Datasets

IV-A Videos – Films and Shorts Corpora

The system operates on two datasets collected for this work – a corpus of Hollywood films and a corpus of high-quality short films. We selected films because they are created to tell a story. That there exist common film-making techniques to convey plot and elicit emotional responses also suggests the possibility of finding families of arcs.

Including Vimeo shorts allows us to 1) find differences in storytelling that may exist between films and newer, emergent formats, 2) possibly serve as a gateway to modeling, more generally, the short form videos that commonly spread on social networks, and 3) conduct an engagement analysis using the online comments left on Vimeo.

The first dataset, the Films Corpora, consists of 509 Hollywood films. Notably, there is considerable overlap with the MovieQA [27] and M-VAD [29] datasets. The Shorts Corpora is a dataset of 1,326 shorts from the Vimeo channel ‘Short of the Week’. These shorts are collected by filmmakers and writers. A short can be 30 seconds to over 30 minutes long, with the median length being 8 minutes and 25 seconds.

IV-B Image sentiment – Sentibank dataset

We use the Sentibank dataset [7] of nearly half a million images, each labeled with one of 1,533 adjective-noun pairs. These pairs, such as “charming house” and “ugly fish”, are termed emotional biconcepts. Each biconcept is mapped to a sentiment value using the SentiWordnet lexicon.

IV-C Audio sentiment – Spotify dataset

We started with the Million Song Dataset [6], in which certain metadata, such as tempo or major/minor key, could in theory be used as a proxy for valence and other relevant target features. We also considered the Last.FM dataset of user-provided labels such as “happy” or “sad”. Unfortunately, this dataset is significantly smaller.

To address our needs, we turned to the Spotify API. The API provides not only 30-second samples for songs, but also audio features used by the company. These include valence, which measures the “musical positiveness conveyed by a track.” Other features include speechiness, liveness, etc.

V Methodology

V-A Constructing emotional arcs

Once the visual and audio models are trained for sentiment prediction (to be explained in Sections V-B and V-C), each is applied separately across the length of the movie. To construct the visual arc, we extract one frame per second of the movie, resize and center crop it to a fixed size, and then pass it through the sentiment classifier. To construct the audio arc, we extract sliding 20-second windows.

Fig. 2: Effect of smoothing values for arcs: no smoothing versus

While the left plot in Figure 2 shows that the macro-level shape is visible from the raw predictions, it is helpful to smooth these signals to produce clearer arcs by convolving each time series with a Hann window of size $w$. In downstream tasks, commonly used window sizes are 0.05, 0.1, and 0.2 * $n$, where $n$ is the length of the video. Example arcs are shown in Figure 3, with the audio arc bounded by the confidence intervals to be described in Section V-C3.

Fig. 3: Audio (yellow) and visual (blue) arcs for Her
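The smoothing step can be illustrated with a short sketch. It is not the authors' code: the function names, the default window fraction, and the choice to renormalize the window at the ends of the arc (instead of zero-padding) are assumptions made for illustration.

```python
import math

def hann_window(size):
    """Symmetric Hann window of `size` points, normalized to sum to 1."""
    if size < 3:
        return [1.0]  # too short to taper; behaves as the identity
    weights = [0.5 - 0.5 * math.cos(2 * math.pi * i / (size - 1)) for i in range(size)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth_arc(arc, fraction=0.1):
    """Convolve `arc` with a Hann window whose size is `fraction` of the arc length."""
    size = max(1, int(round(fraction * len(arc))))
    window = hann_window(size)
    half = len(window) // 2
    smoothed = []
    for t in range(len(arc)):
        acc, norm = 0.0, 0.0
        for j, w in enumerate(window):
            idx = t + j - half
            if 0 <= idx < len(arc):  # assumption: renormalize at the edges
                acc += w * arc[idx]
                norm += w
        smoothed.append(acc / norm)
    return smoothed
```

Calling `smooth_arc(raw_predictions, 0.1)` returns an arc of the same length; larger fractions give smoother, less detailed arcs.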

V-B Image modeling

The various effects of the visual medium have been well studied, ranging from the positive psychological effects of nature scenes [30] to the primacy of color, an effect so powerful that some filmmakers explicitly map color to target emotions in pre-production colorscripts [5]. We thus built models that take a frame as input.

V-B1 Model

We use a deep convolutional neural network based on the AlexNet architecture [17] to classify images. While a more state-of-the-art architecture would have higher accuracy, our focus was on building higher order arcs, for which this relatively simple model sufficed. However, we did use the PReLU activation unit, batch normalization, and ADAM for optimization to reflect recent advancements.

V-B2 Sentiment prediction

The network was trained using images with a sentiment greater than 0.5 as ‘positive’, and those less than -0.5 as ‘negative’. We used a learning rate of 0.01, a batch size of 128, and a batch normalization decay of 0.9. The performance is shown in Table I.

Accuracy Precision Recall F1
0.652 0.753 0.729 0.741
TABLE I: Performance of sentiment classifier

V-B3 Emotional biconcept prediction

Using only the sentiment value is useful for creating emotional arcs, but it also discards information. Thus, we trained a second network that treats the biconcepts as labels. This network proves useful in creating movie embeddings that broadly capture a movie’s emotional content. Details are discussed in section VI-C.

Only biconcepts with at least 125 images were used, leaving 880 biconcepts. The accuracy is shown in Table II. Top-$k$ accuracy is defined as the percent of images for which the true label was found in the top-$k$ predicted labels. We also show the top-$k$ accuracy defined by whether the true adj / noun was found in the top-$k$ predicted labels.

Top-k Acc.
Top-1 7.4%
Top-5 19.9%
Top-10 28.4%
(a) Predicting adj-noun pair
Top-k Match adj Match noun
Top-1 12.0% 15.1%
Top-5 30.2% 31.7%
Top-10 41.5% 40.7%
(b) Matching adj / noun in predicted adj-noun pair
TABLE II: Performance of emotional biconcept classifier
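The two accuracy definitions behind Table II can be sketched as follows. This is an illustration only: the representation of each prediction as a dict of label scores and the `adjective_noun` string encoding of biconcepts are assumptions, not details taken from the paper.

```python
def top_k_accuracy(scores, true_labels, k):
    """Fraction of images whose true label is among the k highest-scoring labels.

    scores: one dict per image, mapping label -> predicted score.
    """
    hits = 0
    for per_label, truth in zip(scores, true_labels):
        ranked = sorted(per_label, key=per_label.get, reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_labels)

def top_k_part_accuracy(scores, true_labels, k, part):
    """Top-k accuracy on only the adjective (part=0) or the noun (part=1).

    Assumes labels are encoded as 'adjective_noun' strings.
    """
    hits = 0
    for per_label, truth in zip(scores, true_labels):
        ranked = sorted(per_label, key=per_label.get, reverse=True)[:k]
        predicted_parts = {label.split("_")[part] for label in ranked}
        if truth.split("_")[part] in predicted_parts:
            hits += 1
    return hits / len(true_labels)
```

Matching only the adjective or the noun is a looser criterion than matching the full pair, which is why the numbers in part (b) of the table are higher than in part (a).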

V-C Audio modeling

Imagine ‘watching’ a movie with your eyes closed – you would likely still be able to pinpoint moments of suspense or sadness. While often secondary to the more obvious visual stimuli, sound and music can be played with or in contrast to the visual scene. With the idea that just a few seconds is enough to set the mood, we created a model for sentiment classification that operates on 20-second snippets of audio.

V-C1 Model

We represent each audio sample as a 96-bin mel-spectrogram. We adopt the architecture used for music tagging in [9], which uses five conv layers with ELU and batch normalization, followed by a fully connected layer.

V-C2 Sentiment prediction

We used all samples that have a valence either greater than 0.75 or less than 0.25, leaving 200,000 samples. The performance is shown in Table III.

Accuracy Recall Precision F1
0.896 0.871 0.931 0.900
TABLE III: Performance of audio sentiment classifier

V-C3 Uncertainty estimates

Unfortunately, we face the problem of covariate shift, where the audio found in movies will often contain sound not found in the song-based training set. For example, there may be significant sections of background noise, conversation, or silence.

To handle ‘unfamiliar’ inputs, we aim to produce confidence intervals for every prediction. While the softmaxed activations can be interpreted as probabilities, and hence a reflection of confidence, these probabilities can often be biased and require calibration [23]. We thus follow a method introduced in [11], which produces approximate uncertainty estimates for any dropout network by passing the input through the network multiple times with dropout enabled at test time, and using the standard deviation of the predictions to define a confidence interval around the mean of the predictions.
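The idea of repeated stochastic forward passes can be shown on a toy network. Everything concrete here is an assumption for illustration: the one-hidden-layer network, the weights, the number of passes, the dropout probability, and the multiplier `z` that turns the standard deviation into an interval are hypothetical and are not the audio model described above.

```python
import math
import random
import statistics

def predict_with_dropout(x, w_hidden, w_out, drop_prob, rng):
    """One stochastic forward pass of a tiny one-hidden-layer network."""
    hidden = []
    for weights in w_hidden:
        act = max(0.0, sum(w * xi for w, xi in zip(weights, x)))  # ReLU unit
        keep = rng.random() >= drop_prob  # dropout stays enabled at test time
        hidden.append(act / (1.0 - drop_prob) if keep else 0.0)
    logit = sum(w * h for w, h in zip(w_out, hidden))
    return 1.0 / (1.0 + math.exp(-logit))  # probability of the 'positive' class

def mc_dropout_interval(x, w_hidden, w_out, passes=50, drop_prob=0.5, z=1.96, seed=0):
    """Mean, standard deviation, and interval over repeated dropout passes."""
    rng = random.Random(seed)
    preds = [predict_with_dropout(x, w_hidden, w_out, drop_prob, rng) for _ in range(passes)]
    mean = statistics.fmean(preds)
    std = statistics.pstdev(preds)
    return mean, std, (mean - z * std, mean + z * std)
```

A wide interval signals an input on which the dropout passes disagree, which is the behavior one would hope for on audio unlike the training songs.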

V-D Finding families of emotional arcs

V-D1 Approach: k-medoids and dynamic time warping

A naive approach to clustering arcs could be to use a popular algorithm such as k-means [20] with a Euclidean metric to measure the distance between two arcs. However, this is a poor approach for our problem for two reasons:

  • Taking the mean of arcs can fail to find centroids that accurately represent the shapes in that cluster. Figure 4 shows a pathological example of when this occurs. The mean of the left two arcs has two peaks instead of one.

  • The Euclidean distance between two arcs doesn’t necessarily reflect the similarity of their shapes. While the left two arcs in Figure 4 are similar in shape (one large peak), their Euclidean distance may be quite large.

With these limitations in mind, we turn to k-medoids [15] with dynamic time warping (DTW) [10] as the distance function. K-medoids updates each medoid to the cluster member that minimizes the total distance to all other points in the cluster, so every medoid is an actual arc from the data. DTW is an effective measure of the distance between two time series that may operate at different time scales.

Fig. 4: Pathological example of how k-means and Euclidean distance fail for clustering emotional arcs
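A minimal k-medoids sketch over a precomputed distance matrix is given below. It is a plain alternating assign/update variant written for illustration; the initialization (the first `k` items unless told otherwise) and the stopping rule are assumptions, and any distance, including DTW, can be used to fill `dist`.

```python
def k_medoids(dist, k, initial_medoids=None, max_iter=100):
    """Cluster items given a symmetric pairwise distance matrix `dist`.

    Returns (medoids, labels), where medoids are item indices and labels[i]
    is the position in `medoids` of the medoid nearest to item i.
    """
    n = len(dist)
    medoids = list(initial_medoids) if initial_medoids is not None else list(range(k))
    for _ in range(max_iter):
        # Assignment step: each item joins its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i in range(n):
            nearest = min(medoids, key=lambda m: dist[i][m])
            clusters[nearest].append(i)
        # Update step: the new medoid is the member with the smallest total
        # distance to the other members (an empty cluster keeps its medoid).
        new_medoids = [
            min(members, key=lambda c: sum(dist[c][j] for j in members)) if members else m
            for m, members in clusters.items()
        ]
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    labels = [min(range(len(medoids)), key=lambda idx: dist[i][medoids[idx]]) for i in range(n)]
    return medoids, labels
```

Because each medoid is one of the input items, a cluster's representative is always a real arc, which avoids the averaged two-peak shape that a mean can produce.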

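Given two time series $Q = (q_1, \dots, q_n)$ and $C = (c_1, \dots, c_n)$ of length $n$, we construct an $n \times n$ matrix $D$, where $D_{i,j}$ contains the squared difference $(q_i - c_j)^2$ between $q_i$ and $c_j$. The DTW distance is the cost of the shortest path through this matrix, from $D_{1,1}$ to $D_{n,n}$. In Figure 4, for example, the DTW distance between the left two time series is 0.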
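The dynamic-programming computation can be sketched as below, with the two series called `q` and `c`. Two choices are assumptions rather than details from the text: the function returns the square root of the accumulated squared differences, and the optional `reach` argument restricts the path to cells with |i - j| <= reach.

```python
import math

def dtw_distance(q, c, reach=None):
    """DTW distance between sequences q and c with squared-difference local cost."""
    n, m = len(q), len(c)
    if reach is None:
        reach = max(n, m)  # no constraint on warping
    reach = max(reach, abs(n - m))  # the window must at least span the length gap
    inf = float("inf")
    # cost[i][j] = cheapest accumulated cost of aligning q[:i] with c[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - reach), min(m, i + reach) + 1):
            local = (q[i - 1] - c[j - 1]) ** 2
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])
```

For two arcs that each contain a single identical peak at different times, the unconstrained distance is 0, while a small `reach` makes the time shift count.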

V-D2 LB-Keogh for speed-up and better modeling

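Several techniques to speed up DTW center around creating a ‘warping window’ that limits the available paths through the matrix $D$. The Keogh lower bound creates upper and lower bounds that envelop the original time series $Q$, defined as:

$$U_i = \max(q_{i-r}, \dots, q_{i+r}), \qquad L_i = \min(q_{i-r}, \dots, q_{i+r})$$

where $r$ is a parameter, the reach, that controls the size of the window. Intuitively, this controls how much a time series is allowed to warp. The lower-bounded distance between $Q$ and $C$ is then given as:

$$\mathrm{LB\_Keogh}(Q, C) = \sqrt{\sum_{i=1}^{n} \begin{cases} (c_i - U_i)^2 & \text{if } c_i > U_i \\ (c_i - L_i)^2 & \text{if } c_i < L_i \\ 0 & \text{otherwise} \end{cases}}$$

If $C$ lies inside the envelope of $Q$, then the distance is 0.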

Importantly, this approach has ramifications beyond increased speed. Consider again the left two arcs in Figure 4. While both are characterized by a large peak, it’s possible that the second, ending on an emotional moment, has greater impact. Consequently, we would like to only allow warping to a certain extent. The window does exactly that.
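The envelope and the lower bound can be sketched in a few lines. This follows the Keogh lower bound in its commonly stated form; the function names and the truncation of the window at the ends of the series are assumptions for illustration.

```python
import math

def keogh_envelope(q, reach):
    """Upper and lower envelopes of q over a window of +/- reach samples."""
    upper, lower = [], []
    for i in range(len(q)):
        window = q[max(0, i - reach): i + reach + 1]  # truncated at the ends
        upper.append(max(window))
        lower.append(min(window))
    return upper, lower

def lb_keogh(q, c, reach):
    """Lower bound on the window-constrained DTW distance between q and c."""
    upper, lower = keogh_envelope(q, reach)
    total = 0.0
    for c_i, u_i, l_i in zip(c, upper, lower):
        if c_i > u_i:
            total += (c_i - u_i) ** 2  # c pokes above the envelope
        elif c_i < l_i:
            total += (c_i - l_i) ** 2  # c dips below the envelope
    return math.sqrt(total)
```

With a peak shifted by two samples, a reach of 2 keeps the second series inside the envelope (bound 0), while a reach of 1 does not, so a narrower window treats an early peak and a late peak as different shapes.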

V-D3 Practical notes

First, we exclude movies longer than a maximum length in seconds (10000 and 1800 for the Films and Shorts Corpora, respectively). After all, a 60-minute movie is hardly a ‘short’. Second, since we are interested in the overall shape of the arc, we z-normalize each emotional arc. Finally, we settled on a warping window size close to the value found to be optimal for a number of different tasks [25].
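Z-normalization removes an arc's offset and scale so that only its shape is compared. A minimal sketch, assuming the population standard deviation and mapping a constant arc to all zeros (both are illustrative choices):

```python
import statistics

def z_normalize(arc):
    """Rescale an arc to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(arc)
    std = statistics.pstdev(arc)
    if std == 0:
        return [0.0] * len(arc)  # a flat arc has no shape to preserve
    return [(v - mean) / std for v in arc]
```

Two arcs that differ only by a constant shift or a positive scale factor map to the same normalized arc.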

VI Evaluation – Micro-level Moments

We evaluated the system’s micro-level accuracy by its precision in extracting emotionally charged moments from the emotional arcs. Notably, collecting ground truth data also allows us to combine the audio and visual predictions.

Annotating an entire film, let alone multiple films, would be too costly and time intensive. We thus extracted video clips at peaks and valleys in the audio and visual arcs, hoping that these clips would lead to more interesting and informative annotations. Workers were asked to watch a clip and answer four questions regarding its emotional content. Each clip was annotated by three workers.

VI-A Crowdsourcing experiment

We chose the CrowdFlower platform for its simplicity and ease of use. Answers to Question 1 (How positive or negative is this video clip? 1 being most negative, 7 being most positive) are referred to as valence ratings. We define a positive rating as a valence rating greater than 4, and a positive clip as one with a mean rating greater than 4. Negative ratings and clips are similarly defined.
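These definitions translate directly into a small helper. The function name and the 'neutral' label for a mean of exactly 4 are assumptions; 'negative' is taken to mean below 4, mirroring the positive case.

```python
def clip_polarity(ratings, midpoint=4):
    """Label a clip from its valence ratings on the 1-7 scale.

    A clip is 'positive' if its mean rating is above the midpoint and
    'negative' if it is below; a mean equal to the midpoint is 'neutral'.
    """
    mean = sum(ratings) / len(ratings)
    if mean > midpoint:
        return "positive"
    if mean < midpoint:
        return "negative"
    return "neutral"
```

For example, three workers rating a clip 5, 6, and 3 give a mean of about 4.67, so the clip counts as positive even though one individual rating is negative.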

VI-B Precision of system

VI-B1 Defining precision

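We define the precision as:

$$\text{precision} = \frac{\#\{\text{peak clips labeled positive}\} + \#\{\text{valley clips labeled negative}\}}{\#\{\text{extracted clips}\}}$$
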
In other words, a clip was accurately extracted if a) it was extracted from a peak in either arc, and it was labeled as a positive clip, or b) it was extracted from a valley in either arc, and it was labeled as a negative clip.

VI-B2 Overall precision

Table IV lists the precision on both the full dataset and the full dataset with ambiguous clips (receiving both a positive and a negative rating) removed. We note that random chance would be 3/7 = 0.429. We argue that ambiguous clips should not be included, as it is unclear what their valence might be without more annotations. Further numbers are calculated with ambiguous clips removed.
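The precision computation, with and without ambiguous clips, can be sketched as follows. The tuple layout `(source, polarity, ambiguous)` is a hypothetical representation chosen for the example; `source` is 'peak' or 'valley', `polarity` is the clip label, and `ambiguous` marks clips that received both a positive and a negative rating.

```python
def precision(clips, drop_ambiguous=False):
    """Share of extracted clips whose label agrees with where they came from.

    clips: list of (source, polarity, ambiguous) tuples.
    A peak clip is correct if labeled 'positive'; a valley clip is correct
    if labeled 'negative'.
    """
    kept = [clip for clip in clips if not (drop_ambiguous and clip[2])]
    correct = sum(
        1
        for source, polarity, _ in kept
        if (source == "peak" and polarity == "positive")
        or (source == "valley" and polarity == "negative")
    )
    return correct / len(kept)
```

Dropping ambiguous clips shrinks the denominator, so precision can rise when the removed clips were mostly counted as errors, as in Table IV.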

Set Precision
All clips 0.642
No ambiguous clips 0.681
TABLE IV: Precision of clips: overall

VI-B3 Precision of audio

Next, we look at the precision of clips extracted from the audio arc, examining our hypothesis that predictions with smaller confidence intervals would be more accurate. This would help affirm our uncertainty estimates approach. Table V shows that smaller confidence intervals do indeed correspond with greater precision.

Stddev Audio-peak precision Audio-valley precision
[0, 0.02) 4 / 4 = 1.0 81 / 88 = 0.921
[0.02, 0.04) 11 / 11 = 1.0 38 / 56 = 0.679
[0.04, 0.06) 28 / 40 = 0.7 21 / 35 = 0.6
[0.06, 0.08) 39 / 60 = 0.65 13 / 21 = 0.619
[0.08, 0.1) 43 / 68 = 0.632 8 /23 = 0.615
TABLE V: Precision of clips extracted from audio emotional arc: smaller confidence intervals are more precise

VI-B4 Precision of various cuts

The precision on various subsets is shown in Table VI (a). We highlight that the visual-peaks have low precision. This is explored in the next section and used to motivate feature engineering for the combined model in Section VI-C.

Cut Precision
Audio-peaks 0.683
Audio-valleys 0.758
Visual-peaks 0.508
Visual-valleys 0.757
(a) Cuts
Genre Overall Visual-peak
Action 0.678 0.264
Science Fiction 0.699 0.333
Thriller 0.678 0.382
Adventure 0.726 0.443
Drama 0.660 0.520
Fantasy 0.769 0.590
Comedy 0.705 0.667
Animation 0.798 0.667
Family Film 0.760 0.722
Romance 0.678 0.757
Romantic Comedy 0.677 0.823
(b) Genres
TABLE VI: Precision of clips: cuts and genre

VI-B5 Precision by genre

Finally, we use [22] to tag each movie with genres. A subset of the results is shown in Table VI (b). The relatively poor precision of visual-peaks, as noted in the previous section, appears to be a product of poor precision on a number of genres. We also find a natural grouping of the genres when listed in this order. Genres with high visual-peak precision appear to be lighter films falling in the romance and family film genres.

Manual inspection of ‘incorrect’ visual-peak clips from the action-thriller genres shows many scenes with gore and death, images unlikely to be found in the Sentibank dataset, which was culled from publicly available images on Flickr.

VI-C Combined audio-visual model

We create a linear regression model to predict the mean valence rating assigned by the annotators. In addition to standard features related to the clip’s audio / visual valence (relative to the movie’s mean, relative to the movie’s max, etc.), we create two key features detailed below.

Peakiness. The peakiness function approximates the slope and mean around a given point of an arc (audio or visual), within a window around that point. It returns 4 values: a value proportional to the slope left of the point, the corresponding value right of the point, the mean value left of the point, and the mean value right of the point. This covers peaks, valleys, and inflection points. We use a fixed window size for our analyses.
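One way such a feature could be computed is sketched below. The exact formula is not given in the text, so this is a hypothetical version: slopes are approximated by end-point differences over the window, the means include the point itself, and the names `arc`, `i`, and `w` are assumptions.

```python
def peakiness(arc, i, w):
    """Slope and mean on each side of arc[i], using up to w samples per side.

    Returns (slope_left, slope_right, mean_left, mean_right). The slopes are
    end-point differences, i.e. proportional to the average slope.
    """
    left = arc[max(0, i - w): i + 1]   # samples up to and including i
    right = arc[i: i + w + 1]          # samples from i onwards
    slope_left = arc[i] - left[0]
    slope_right = right[-1] - arc[i]
    mean_left = sum(left) / len(left)
    mean_right = sum(right) / len(right)
    return slope_left, slope_right, mean_left, mean_right
```

Under this sketch a peak shows a positive left slope and a negative right slope, a valley shows the reverse, and an inflection point shows slopes of the same sign.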

Movie embedding. Motivated by the impact of genre in Section VI-B5, we sought to loosely summarize a movie’s emotional content. We represent each frame by the penultimate activation of the biconcept classifier described in Section V-B. Next, we average these activations across 10% chunks of the movie, resulting in a 10 × 2048 matrix. To translate these movie embeddings to features, we take the mean of each 2048-sized vector, ending with a final 10-dimensional feature vector. While not shown in the interest of space, clustering these movie embeddings shows correspondence between clusters and genres, with romance, adventure, fantasy, and animated films being clearly visible.
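The chunk-and-average step can be sketched as below. The function name and the integer chunk boundaries are assumptions; the activations are passed in as plain lists, one per frame, and in the setting described above each would have 2048 entries.

```python
def movie_embedding_features(frame_activations, chunks=10):
    """Summarize per-frame activation vectors into one value per chunk.

    frame_activations: list of equal-length vectors, one per frame, in order.
    Each chunk covers an equal share of the movie (10% for chunks=10).
    """
    n = len(frame_activations)
    dim = len(frame_activations[0])
    features = []
    for c in range(chunks):
        start = c * n // chunks
        end = max(start + 1, (c + 1) * n // chunks)  # keep every chunk non-empty
        chunk = frame_activations[start:end]
        # Average the frames of this chunk: one dim-sized vector per chunk.
        chunk_mean = [sum(frame[d] for frame in chunk) / len(chunk) for d in range(dim)]
        # Collapse that vector to a single feature by taking its mean.
        features.append(sum(chunk_mean) / dim)
    return features
```

The list of `chunk_mean` vectors corresponds to the movie embedding matrix, and the returned list is the compact feature vector derived from it.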

VI-D Precision of combined audio-visual model

The performance of the final combined model is shown in Table VII, along with various ablations of important features. Using all features, we achieve a precision of 0.894.

Feature set Overall Aud-peak Aud-valley Vis-peak Vis-valley
All features 0.894 0.940 0.884 0.872 0.886
No peakiness 0.815 0.836 0.828 0.765 0.824
No movie-embedding 0.784 0.869 0.786 0.722 0.752
TABLE VII: Precision of combined audio-visual model

VII Evaluation – Macro-level

Results shown here are based on the visual arcs, which we consider to be the primary medium. We could in theory use arcs constructed from the combined audio-visual model described in Section VI-C. Experiments clustering the combined arcs, however, produce largely indistinct clusters. This is a result of a) the sparsity of ground truth data covering the entire audio-visual space found in movies (e.g. no clips were extracted from moments that were neutral in both the audio and visual arcs), and b) simply the small size of the dataset, leading to inaccurate combining of audio and visual.

VII-A Cluster results

Figure 5 shows the results of the elbow method, which plots the number of clusters $k$ against the within cluster distance (WCD). We can see possible ‘elbows’ at $k$ = 5 and $k$ = 9. We briefly note that a k-means approach produced indistinct clusters and no discernible decrease in the WCD.

(a) Films Corpora
(b) Shorts Corpora
Fig. 5: Elbow plots for k-medoid clustering
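The quantity plotted in an elbow curve can be sketched as follows. The definition used here, the sum of each item's distance to its nearest medoid, is an assumption about how the WCD is computed, and the medoids for each number of clusters are taken as given.

```python
def within_cluster_distance(dist, medoids):
    """Sum over all items of the distance to the nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))

def elbow_curve(dist, medoids_by_k):
    """Map each number of clusters to its within cluster distance.

    medoids_by_k: dict mapping k -> list of medoid indices found for that k.
    """
    return {k: within_cluster_distance(dist, medoids)
            for k, medoids in sorted(medoids_by_k.items())}
```

An 'elbow' is a value of the cluster count after which adding more clusters reduces the WCD only slightly.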

Figure 6 shows one example clustering, representing five typical emotional arcs. Note that the steep inclines and declines at the start and end are artifacts of opening scenes and credits. Compared to the Films Corpora, typical arcs in the Shorts Corpora tend to be less complex, but also more extreme (e.g. the yellow arc that ends on a steady decline).

Fig. 6: Clustering on Shorts Corpora with $k$ = 5, $w$ = 0.1

VII-B Engagement analysis

We can now return to the question of whether a video’s emotional arc affects the degree of viewer engagement. We performed a small experiment on our Shorts Corpora by using a) metadata features, and b) categorical cluster assignment features (which family of visual arcs does this movie belong to) as inputs to a regression model that predicts the number of comments a video received on Vimeo.
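How the inputs to such a regression might be laid out is sketched below. The field names, the choice of duration and year as the metadata columns, and the use of cluster 0 as the reference category are assumptions; fitting the model and computing p-values would be left to a statistics library.

```python
def design_matrix(videos, k):
    """Build regression rows from metadata and a categorical cluster label.

    videos: list of dicts with 'duration', 'year', and 'cluster' in 0..k-1.
    Each row is [intercept, duration, year, one-hot for clusters 1..k-1];
    cluster 0 is the reference level, so it gets no column of its own.
    """
    rows = []
    for video in videos:
        one_hot = [1.0 if video["cluster"] == c else 0.0 for c in range(1, k)]
        rows.append([1.0, float(video["duration"]), float(video["year"])] + one_hot)
    return rows
```

Each one-hot coefficient then measures how comment counts for that family of arcs differ from the reference family, holding the metadata fixed.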

Nine models are created – one for each value of $k$ in [2,10]. Not surprisingly, the duration and year are often statistically significant predictors. However, three arcs are also statistically significant, each positively correlated with the number of comments. The first such arc, the yellow arc shown in Figure 6, fits the “Icarus” shape (rise-fall). The second ($k$ = 8) and third ($k$ = 10) arcs are characterized by a large peak near the end, with the former having some incline before the peak and the latter flat before the peak. In other words, they end with a bang. The results for $k$ = 8 are shown in Figure 7 and Table IX.

Fig. 7: TABLE IX. Engagement analysis for clustering. P-values less than 0.05 are bolded; less than 0.1 are italicized. Statistically significant predictive arc is shown.

VIII Concluding Remarks

We first developed methods for constructing and finding families of arcs. The crowdsourced annotation data prompts a number of possibilities for future work, such as dialogue-based arcs and plot-based sentiment modeling.

We were also able to show the predictive power of emotional arcs on a small subset of online Vimeo shorts. While intriguing, this was performed a) using only the visual arcs, and b) against a relatively simple metric. More data for the combined audio-visual model should generate more accurate arcs. It would also be interesting to see how, if at all, emotional features affect how videos propagate through social media sites like Twitter and Reddit.


  • [1] Kurt Vonnegut on the Shapes of Stories. Accessed: 2017-04-27.
  • [2] Dramatica: the next chapter in story development. Accessed: 2017-04-27.
  • [3] Katherine L Milkman and Jonah Berger. The science of sharing and the sharing of science. Proceedings of the National Academy of Sciences, 111(Supplement 4):13642–13649, 2014.
  • [4] Jonah Berger and Katherine L Milkman. What makes online content viral? Journal of marketing research, 49(2):192–205, 2012.
  • [5] Amid Amidi. The Art of Pixar: 25th Anniversary: The Complete Color Scripts and Select Art from 25 Years of Animation. Chronicle Books, 2015.
  • [6] Thierry Bertin-Mahieux, Daniel PW Ellis, Brian Whitman, and Paul Lamere. The million song dataset. In ISMIR, volume 2, page 10, 2011.
  • [7] Damian Borth, Rongrong Ji, Tao Chen, Thomas Breuel, and Shih-Fu Chang. Large-scale visual sentiment ontology and detectors using adjective noun pairs. In Proceedings of the 21st ACM international conference on Multimedia, pages 223–232. ACM, 2013.
  • [8] Brian Boyd. On the origin of stories. Harvard University Press, 2009.
  • [9] Keunwoo Choi, George Fazekas, and Mark Sandler. Automatic tagging using deep convolutional neural networks. arXiv preprint arXiv:1606.00298, 2016.
  • [10] Hui Ding, Goce Trajcevski, Peter Scheuermann, Xiaoyue Wang, and Eamonn Keogh. Querying and mining of time series data: experimental comparison of representations and distance measures. Proceedings of the VLDB Endowment, 1(2):1542–1552, 2008.
  • [11] Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.
  • [12] Jonathan Gottschall. The storytelling animal: How stories make us human. Houghton Mifflin Harcourt, 2012.
  • [13] Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, et al. Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567, 2014.
  • [14] Cisco Visual Networking Index. Cisco visual networking index: Forecast and methodology, 2010-2015. White Paper, CISCO Systems Inc, 9, 2011.
  • [15] Leonard Kaufman. Clustering by means of medoids. Statistical data analysis based on the L1-norm and related methods, 1987.
  • [16] Eamonn Keogh. Exact indexing of dynamic time warping. In Proceedings of the 28th international conference on Very Large Data Bases, pages 406–417. VLDB Endowment, 2002.
  • [17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [18] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [19] Jennifer S Lerner, Ye Li, Piercarlo Valdesolo, and Karim S Kassam. Emotion and decision making. Annual Review of Psychology, 66:799–823, 2015.
  • [20] James MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1, pages 281–297. Oakland, CA, USA., 1967.
  • [21] Douglas S Massey. A brief history of human society: The origin and role of emotion in social life. American Sociological Review, 67(1):1, 2002.
  • [22] David Bamman, Brendan O’Connor, and Noah A Smith. Learning latent personas of film characters. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), page 352, 2014.
  • [23] Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd international conference on Machine learning, pages 625–632. ACM, 2005.
  • [24] Bo Pang, Lillian Lee, et al. Opinion mining and sentiment analysis. Foundations and Trends® in Information Retrieval, 2(1–2):1–135, 2008.
  • [25] Chotirat Ann Ratanamahatana and Eamonn Keogh. Everything you know about dynamic time warping is wrong. In Third Workshop on Mining Temporal and Sequential Data. Citeseer, 2004.
  • [26] Andrew J Reagan, Lewis Mitchell, Dilan Kiley, Christopher M Danforth, and Peter Sheridan Dodds. The emotional arcs of stories are dominated by six basic shapes. EPJ Data Science, 5(1):31, 2016.
  • [27] Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. Movieqa: Understanding stories in movies through question-answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4631–4640, 2016.
  • [28] Mike Thelwall, Kevan Buckley, Georgios Paltoglou, Di Cai, and Arvid Kappas. Sentiment strength detection in short informal text. Journal of the American Society for Information Science and Technology, 61(12):2544–2558, 2010.
  • [29] Atousa Torabi, Christopher Pal, Hugo Larochelle, and Aaron Courville. Using descriptive video services to create a large data source for video annotation research. arXiv preprint arXiv:1503.01070, 2015.
  • [30] Roger S Ulrich. Visual landscapes and psychological well-being. Landscape research, 4(1):17–23, 1979.