Emotion Detection on TV Show Transcripts with Sequence-based Convolutional Neural Networks

by Sayyed M. Zahiri et al.
Emory University

While significant advances have been made in detecting emotions from speech and images, emotion detection on text is still under-explored and remains an active research field. This paper introduces a corpus for text-based emotion detection on multiparty dialogue as well as deep neural models that outperform existing approaches for document classification. We first present a new corpus that provides annotation of seven emotions on consecutive utterances in dialogues extracted from the show, Friends. We then suggest four types of sequence-based convolutional neural network models with attention that leverage the sequence information encapsulated in dialogue. Our best model shows accuracies of 37.9% and 54.0% for fine- and coarse-grained emotions, respectively. Given the difficulty of this task, this is promising.



1 Introduction

Human emotions have been widely studied in the realm of psychological and behavioral sciences as well as computer science Strapparava and Mihalcea (2008). A wide variety of research has been conducted on detecting emotions from facial expressions and audio waves Yu et al. (2001); Zeng et al. (2006); Lucey et al. (2010). The recent advent of natural language processing and machine learning has made the task of emotion detection on text possible; yet, since emotions are not necessarily conveyed in text, quantifying different types of emotions using text alone is generally challenging.

Another challenging aspect of this task is the lack of annotated datasets. A few datasets are publicly available Strapparava and Mihalcea (2007); Alm (2008); Mohammad and Bravo-Marquez (2017); Buechel and Hahn (2017); however, to further explore the feasibility of text-based emotion detection on dialogue, a more comprehensive dataset is desired. This paper presents a new corpus comprising transcripts of the TV show Friends, where each utterance is annotated with one of seven emotions: sad, mad, scared, powerful, peaceful, joyful, and neutral. Several annotation tasks are conducted through crowdsourcing to maintain a high-quality dataset. Dialogues from these transcripts include disfluencies, slang, metaphors, and humor, which make this task even more challenging. To the best of our knowledge, this is the largest text-based corpus providing fine-grained emotions for such long sequences of consecutive utterances in multiparty dialogue.

Convolutional neural networks (CNNs) have been popular for several document classification tasks. One of the major advantages of CNNs is their capability of extensive feature extraction through deep-layered convolutions. Nonetheless, CNNs are often not used for sequence modeling Waibel et al. (1989); LeCun et al. (1995); Gehring et al. (2017) because their basic architecture does not take previous sequence information into account. One common approach to alleviating this issue is to use recurrent neural networks (RNNs) Sutskever et al. (2014); Liu et al. (2016). However, RNNs typically run slower and require more training data to avoid overfitting. To exploit the sequence information embedded in our corpus while retaining the advantages of CNNs, sequence-based CNNs (SCNN) are proposed along with attention mechanisms, which guide the CNN to fuse features from the current state with features from the previous states. The contributions of this research are summarized as follows:

Speaker Utterance A1 A2 A3 A4
Monica He is so cute . So , where did you guys grow up ? Peaceful Joyful Joyful Joyful
Angela Brooklyn Heights . Neutral Neutral Neutral Neutral
Bob Cleveland . Neutral Neutral Neutral Neutral
Monica How , how did that happen ? Peaceful Scared Neutral Neutral
Joey Oh my god . Joyful Sad Scared Scared
Monica What ? Neutral Neutral Neutral Neutral
Joey I suddenly had the feeling that I was falling . But I ’m not . Scared Scared Scared Scared
Table 1: An example of the emotion annotation. A#: annotation from 4 different crowd workers.
  • We create a new corpus providing fine-grained emotion annotation on dialogue and give thorough corpus analytics (Section 3).

  • We introduce several sequence-based convolutional neural network models with attention to facilitate sequential dependencies among utterances (Section 4).

  • We give both quantitative and qualitative analyses that show the advantages of SCNN over the basic CNN as well as the advances in the attention mechanisms (Section 5).

2 Related Work

2.1 Text-based Emotion Detection

Text-based emotion detection is at an early stage in natural language processing, although it has recently drawn much attention. Researchers have employed three common methods for detecting emotions from text: keyword-based, learning-based, and hybrids of the two. In the first method, classification is done with the aid of emotional keywords. Strapparava and Valitutti (2004) categorized emotions by mapping keywords in sentences to lexical representations of affective concepts. Chaumartin (2007) performed emotion detection on news headlines. The performance of these keyword-based approaches has been rather unsatisfactory because the semantics of the keywords depend heavily on their contexts, and performance is significantly affected by the absence of those keywords Shaheen et al. (2014).

Two types of machine learning approaches have been used for the second method: supervised approaches, where training examples are used to classify documents into emotional categories, and unsupervised approaches, where statistical measures are used to capture the semantic dependencies between words and infer their relevant emotional categories. Chaffar and Inkpen (2011) detected emotions from several corpora collected from blogs, news headlines, and fairy tales using a supervised approach. Douiji et al. (2016) developed an unsupervised learning algorithm to detect emotions from YouTube comments. Their approach gave comparable performance to supervised approaches such as that of Chaffar and Inkpen (2011), in which support vector machines were employed for statistical learning.

The hybrid method attempts to take advantage of both the keyword-based and learning-based methods. Seol et al. (2008) used an ensemble of a keyword-based approach and knowledge-based artificial neural networks to classify emotions in dramas, novels, and public web diaries. Hybrid approaches generally perform well, although their architectures tend to be complicated to replicate.

2.2 Sequence Modeling on Dialogue

The tasks of state tracking and dialogue act classification are ongoing fields of research similar to our task. Shi et al. (2016) proposed a multichannel CNN for cross-language dialogue state tracking. Dufour et al. (2016) introduced a topic model that considered all information included in sub-dialogues to track the dialogue states introduced by the DSTC5 challenge Kim et al. (2016). Stolcke et al. (2000) proposed a statistical approach to modeling dialogue acts in human-to-human telephone conversations.

2.3 Attention in Neural Networks

Attention mechanisms have been widely employed in the field of computer vision Mnih et al. (2014); Xu et al. (2015) and have recently become popular in natural language processing as well. In particular, incorporating an attention mechanism has achieved state-of-the-art performance in machine translation and question answering tasks Bahdanau et al. (2014); Hermann et al. (2015); dos Santos et al. (2016). Our attention mechanism is distinguished from the previous work, which performs attention on two static embeddings, whereas our approach applies attention to static and dynamically generated embeddings.

3 Corpus

The Character Mining project provides transcripts of the TV show Friends; transcripts from all seasons of the show are publicly available in JSON (nlp.mathcs.emory.edu/character-mining). Each season consists of episodes, each episode contains scenes, and each scene includes utterances, where each utterance gives the speaker information. For this research, we take transcripts from the first four seasons and create a corpus by adding another layer of annotation with emotions. As a result, our corpus comprises 97 episodes, 897 scenes, and 12,606 utterances, where each utterance is annotated with one of seven emotions: the six primary emotions in Willcox (1982)'s feeling wheel, sad, mad, scared, powerful, peaceful, and joyful, plus a default emotion of neutral. Table 1 describes a scene containing seven utterances and their corresponding annotation from crowdsourcing.

3.1 Crowdsourcing

Pioneered by Snow et al. (2008), crowdsourcing has been widely used for the creation of many corpora in natural language processing. Our annotation tasks are conducted on Amazon Mechanical Turk. Each MTurk HIT shows a scene, where each utterance in the scene is annotated by four crowd workers who are asked to choose the most relevant emotion associated with that utterance. To assign a suitable budget to each HIT, the corpus is divided into four batches, where all scenes in each batch are restricted to [5, 10), [10, 15), [15, 20), and [20, 25] utterances and are budgeted at 10, 13, 17, and 20 cents per HIT, respectively. Each HIT takes about 2.5 minutes on average, and the entire annotation costs about $680. The annotation quality of 20% of each HIT is checked manually, and HITs with poor quality are re-annotated.

Type B1 B2 B3 B4 Total
Utterances 1,742 2,988 3,662 4,214 12,606
Scenes 241 249 216 191 897
Table 2: The total number of utterances and scenes in each annotation batch. B#: the batch number.

3.2 Inter-Annotator Agreement

Two kinds of measurements are used to evaluate the inter-annotator agreement (Table 3). First, Cohen's kappa is used to measure the agreement between two annotators, whereas Fleiss' kappa is used for three and four annotators. Second, the partial agreement (an agreement between any pair of annotators) is measured to illustrate the improvement from a fewer to a greater number of crowd workers. Across all annotator groups, kappa scores of around 14% are achieved. Such low scores are rather expected because emotion detection is highly subjective, so annotators often judge different emotions that are all acceptable for the same utterance. This may also be attributed to a limitation of our dataset; a higher kappa score could be achieved if the annotators were provided with a multimodal dataset (e.g., text, speech, image).

# of annotators 2 3 4
Kappa 14.18 14.21 14.34
Partial 29.40 62.20 85.09
Table 3: Inter-annotator agreement (in %).

While the kappa scores are not very compelling, the partial agreement scores show more promising results. The impact of a greater number of annotators is clearly illustrated in this measurement; over 70% of the annotations have no agreement with only two annotators, whereas 85% of the annotations find some agreement with four annotators. This implies that it is possible to improve the annotation quality by adding more annotators.
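The partial agreement statistic above can be reproduced with a few lines of code. The sketch below is illustrative only: the helper name `partial_agreement` is ours and the quadruple annotations are toy labels, not the actual corpus annotations. It counts the items on which at least one pair among the first k annotators chose the same label:

```python
from itertools import combinations

def partial_agreement(annotations, k):
    """Fraction of items on which at least one pair among the first k
    annotators agrees (the 'partial agreement' used in Table 3)."""
    agreed = 0
    for labels in annotations:
        subset = labels[:k]
        if any(a == b for a, b in combinations(subset, 2)):
            agreed += 1
    return agreed / len(annotations)

# Toy quadruple annotations for four utterances (hypothetical labels).
ann = [
    ("joyful", "joyful", "neutral", "sad"),
    ("neutral", "mad", "scared", "peaceful"),
    ("scared", "scared", "scared", "scared"),
    ("neutral", "joyful", "neutral", "neutral"),
]
print(partial_agreement(ann, 2))  # agreement within the first 2 annotators
print(partial_agreement(ann, 4))  # agreement between any pair of all 4
```

As in Table 3, adding annotators can only increase this statistic, since every pair available with two annotators is still available with four.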

It is worth mentioning that crowd workers were asked to choose from 37 emotions for the first 25% of the annotation, comprising the 36 secondary emotions from Willcox (1982) and the neutral emotion. However, vast disagreements were observed for this annotation, resulting in a Cohen's kappa score of 0.8%. Thus, we proceeded with the seven emotions described above, with the hope of going back to complete the annotation for fine-grained emotions later.

3.3 Voting and Ranking

We propose a voting/ranking scheme that allows us to assign appropriate labels to utterances with disagreed annotation. Given the quadruple annotation, we first divide the dataset into five folds (Table 4):

Fold Count Ratio
778 6.17
5,774 45.80
2,991 23.73
1,879 14.91
1,184 9.39
Table 4: Folds with their utterance counts and ratios (in %) for voting/ranking; folds are defined by the agreement patterns among the four annotations.

For the first three folds, the annotation receiving the majority vote is considered the gold label. The least absolute error (LAE) is then measured for each annotator by comparing one's annotation to the gold labels. For the last two folds, the annotation generated by the annotator with the minimum LAE is chosen as gold, which is reasonable since those annotators generally produce higher-quality annotation. With this scheme, 75.5% of the dataset can be deterministically assigned gold labels from voting, and the rest can be assigned by ranking.
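As a rough sketch of this voting/ranking scheme, the hedged snippet below applies majority voting first, ranks annotators by least absolute error against the voted gold labels, and falls back to the best-ranked annotator for items with no strict majority. The ordinal encoding of emotions used for LAE and the helper names (`assign_gold`, `majority`) are our assumptions for illustration, not the paper's implementation:

```python
from collections import Counter

EMOTIONS = ["neutral", "joyful", "peaceful", "powerful", "scared", "mad", "sad"]
IDX = {e: i for i, e in enumerate(EMOTIONS)}  # assumed ordinal encoding

def majority(labels):
    """Return the majority label, or None when no label wins outright."""
    (top, n1), *rest = Counter(labels).most_common()
    if not rest or n1 > rest[0][1]:
        return top
    return None

def assign_gold(dataset, n_annotators=4):
    # Pass 1 (voting): take the strict majority label where one exists.
    gold = [majority(labels) for labels in dataset]
    # Pass 2 (ranking): accumulate each annotator's absolute error
    # against the voted gold labels.
    errors = [0] * n_annotators
    for labels, g in zip(dataset, gold):
        if g is None:
            continue
        for a in range(n_annotators):
            errors[a] += abs(IDX[labels[a]] - IDX[g])
    best = min(range(n_annotators), key=lambda a: errors[a])
    # Pass 3: undecided items take the best-ranked annotator's label.
    return [g if g is not None else labels[best]
            for labels, g in zip(dataset, gold)]

data = [
    ("joyful", "joyful", "neutral", "joyful"),  # strict majority: joyful
    ("neutral", "mad", "scared", "peaceful"),   # no majority -> ranked annotator
    ("scared", "scared", "neutral", "neutral"), # 2-2 tie -> ranked annotator
]
print(assign_gold(data))
```

On this toy input, the first utterance is labeled by vote and the other two fall back to annotator 1, who has the lowest LAE on the voted item.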

3.4 Analysis

Table 5 shows the distribution of all emotions in our corpus. The two most dominant emotions, neutral and joyful, together comprise over 50% of the dataset, which is not balanced, although this is understandable given that the show is a comedy. However, when the coarse-grained emotions positive, negative, and neutral are considered, they yield about 40%, 30%, and 30% respectively, giving a more balanced distribution. The last column shows the ratio of annotations on which all four annotators agree. Only around 1% of the annotations show complete agreement for peaceful and powerful, which is reasonable since these emotions often get confused with neutral.

3 Emotions 7 Emotions Ratio 4-Agree
Neutral Neutral 29.95 7.8
Positive Joyful 21.85 9.4
Peaceful 9.44 1.0
Powerful 8.43 0.8
Negative Scared 13.06 3.8
Mad 10.57 7.7
Sad 6.70 4.3
Table 5: Emotion distribution ratios and the complete agreement ratios (in %).
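The 7-to-3 grouping behind Table 5 can be checked mechanically. In the sketch below, the `COARSE` mapping reflects the grouping shown in the table, while the variable names are ours:

```python
# Fine-grained ratios (%) taken from Table 5.
ratios = {"neutral": 29.95, "joyful": 21.85, "peaceful": 9.44, "powerful": 8.43,
          "scared": 13.06, "mad": 10.57, "sad": 6.70}

# Coarse-grained grouping as in Table 5.
COARSE = {"joyful": "positive", "peaceful": "positive", "powerful": "positive",
          "scared": "negative", "mad": "negative", "sad": "negative",
          "neutral": "neutral"}

coarse_totals = {}
for emo, r in ratios.items():
    coarse_totals[COARSE[emo]] = coarse_totals.get(COARSE[emo], 0.0) + r
print({k: round(v, 2) for k, v in coarse_totals.items()})
```

Summing the fine-grained ratios yields positive ≈ 39.72%, negative ≈ 30.33%, and neutral ≈ 29.95%, i.e. the roughly 40/30/30 split described above.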
Figure 1: Emotions of the main characters within a scene. Rows correspond to the main characters' emotions and columns show the utterance number. No talking occurs in the white regions.

Figure 1 illustrates how the emotions of the six main characters progress and are affected by other utterances as time elapses within a scene. It is clear that the emotion of the current speaker is often affected by the same speaker's previous emotions as well as the previous emotions of the other speakers participating in the dialogue.

Figure 2: Confusion matrix of corpus annotation. Each matrix cell contains the raw count.

Figure 2 shows the confusion matrix with respect to the annotation (dis)agreement. Rows correspond to the labels obtained using the voting scheme (Section 3.3), and columns represent the emotions selected by each of four annotators. The two dominant emotions, neutral and joyful, cause the most confusion for annotators whereas the minor emotions such as sad, powerful, or peaceful show good agreement on the diagonal.

3.5 Comparison

The ISEAR databank consists of 7,666 statements and six emotions (emotion-research.net/toolbox/toolboxdatabase.2006-10-13.2581092615), gathered in a study conducted by psychologists on 3,000 participants. The SemEval'07 Task 14 dataset was created from news headlines; it contains 250 sentences annotated with six emotions Strapparava and Mihalcea (2007). The WASSA'17 Task 1 dataset was collected from 7,097 tweets labeled with four emotions Mohammad and Bravo-Marquez (2017); participants in this shared task were expected to develop a model to detect the intensity of emotions in tweets. Our corpus is larger than most other text-based corpora and is the only one providing emotion annotation on dialogue sequences conveyed by consecutive utterances.

Figure 3: The overview of the sequence-based CNN using concatenation (SCNN), SM: Softmax.

4 Sequence-Based Convolutional Neural Networks (SCNN)

A unique aspect of our corpus is that it preserves the original sequence of utterances in each dialogue, which allows us to tackle this problem as a sequence classification task. This section introduces sequence-based CNN models that utilize the emotion sequence from the previous utterances to detect the emotion of the current utterance. Additionally, attention mechanisms are suggested for better optimization of these SCNN models.

4.1 Sequence Unification: Concatenation

We present a sequence-based CNN that leverages the sequence information to improve classification. Figure 3 describes our first SCNN model. The input to SCNN is an embedding matrix in which each row represents a token in the utterance; the number of rows is the maximum number of tokens in any utterance and the number of columns is the embedding size. At each time step, several region sizes are considered, each with a fixed number of filters. As a result of applying convolution and max-pooling, a univariate feature vector is generated for the current utterance.

In the next step, the dense feature vectors from the current utterance and the previous utterances within the same dialogue are concatenated column-wise. A 1-D convolution with a given stride and receptive field is then applied to the concatenated vector. As a result of this operation, features extracted from the current utterance are fused with the features associated with the previous utterances. In the final step, softmax is applied for the classification of the seven emotions.
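A minimal NumPy sketch of this forward pass is given below. It is not the paper's implementation: it uses a single region size, random untrained weights, and a filter count and stride contrived so that the strided 1-D convolution happens to emit seven values for the softmax; all function names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def utterance_features(E, W):
    """Convolution over the token dimension followed by max-pooling
    (one region size, f filters), yielding one dense vector per utterance."""
    s, d = E.shape                       # tokens x embedding size
    r, f = W.shape[0], W.shape[2]        # region size, number of filters
    maps = np.stack([E[i:i + r].ravel() @ W.reshape(r * d, f)
                     for i in range(s - r + 1)])   # (s-r+1, f) feature maps
    return maps.max(axis=0)              # max-pool over positions -> (f,)

def scnn_concat(utterances, W, W1d, n_prev=3):
    """Concatenate the last utterance's feature vector with those of the
    n_prev previous utterances, then apply a strided 1-D conv and softmax."""
    feats = [utterance_features(E, W) for E in utterances]
    pad = [np.zeros_like(feats[0])] * n_prev      # zero-pad a short history
    feats = pad + feats
    z = np.concatenate(feats[-(n_prev + 1):])     # current + previous vectors
    k = W1d.shape[0]                              # receptive field = stride
    h = np.array([z[i:i + k] @ W1d
                  for i in range(0, len(z) - k + 1, k)])
    e = np.exp(h - h.max())
    return e / e.sum()                            # softmax over the outputs

d, f = 200, 7                 # embedding size; 7 filters chosen for brevity
W = rng.normal(size=(2, d, f))        # one region size of 2 tokens
W1d = rng.normal(size=(4,))           # 1-D conv: receptive field 4, stride 4
utts = [rng.normal(size=(10, d)) for _ in range(4)]  # 4 utterances, 10 tokens
p = scnn_concat(utts, W, W1d)
print(p.shape, round(float(p.sum()), 6))   # prints: (7,) 1.0
```

With four 7-dimensional feature vectors concatenated into a 28-dimensional vector, a stride-4 sweep produces exactly seven outputs, one per emotion class here.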

4.2 Sequence Unification: Convolution

We refer to the model in Section 4.1 as SCNN. The second proposed model (Fig. 4) utilizes two separate 2-D convolutions for sequence unification. The input to the first convolution is the same embedding matrix as before. The input to the second convolution is another matrix, formed by row-wise concatenation of the dense vectors that the first convolution generates for the current utterance and the previous utterances. The second convolution has its own region sizes, with a fixed number of filters for each region size, and outputs another dense vector. The two convolutions are conceptually identical, although they have different region sizes, filters, and other hyper-parameters.

Figure 4: The SCNN model, SM: Softmax.

In the next step, the outputs of the two convolutions are concatenated column-wise, and the result is fed into a one-dimensional CNN to create a fused version of the feature vectors. Finally, the fused vector is passed to a softmax layer for classification. Note that the intuition behind the second convolution is to capture features from the emotion sequence, just as the first convolution captures features from n-grams.

Figure 5: The overview of SCNN model.

4.3 Attention Mechanism

We also equip both SCNN models with an attention mechanism, which allows them to learn which parts of the features should be attended to more. Essentially, this attention model is a weighted arithmetic sum over the current utterance's feature vector, where the weights are chosen based on the relevance of each element of that feature vector given the unified feature vectors from the previous utterances.

Figure 5 depicts attention on the concatenation-based SCNN. In this model, the current feature vector and the previous feature vectors are concatenated row-wise. An attention matrix is applied to the current feature vector, and the weights of this attention matrix are learned from the past feature vectors. Finally, 1-D convolution and softmax are applied to the result.

Figure 6: The SCNN model, SM:Softmax.

Figure 6 shows another attention model, based on the convolution-based SCNN. In this model, an attention vector with trainable weights is applied to the outputs of the two convolutions. Finally, 1-D convolution and softmax are applied to the output of the attention vector to complete the classification task. The multiplication signs in Figures 5 and 6 denote the matrix multiplication operator, and the superscript sign refers to the transpose operation.

Generally, the inputs to an attention mechanism have a fixed size, whereas in our model one input comes from dynamically generated embeddings of previous hidden layers and the other is the dense representation of the current utterance.
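The following sketch illustrates the general idea on toy tensors. Averaging the previous feature vectors into a single context vector is our simplification of the unification step, and all names and shapes are illustrative rather than the paper's:

```python
import numpy as np

rng = np.random.default_rng(1)

def attend(current, previous, A):
    """Re-weight the current utterance's feature vector using weights
    derived from the unified previous feature vectors."""
    context = previous.mean(axis=0)   # unify previous vectors (simplification)
    scores = A @ context              # trainable attention matrix A
    w = np.exp(scores - scores.max())
    w /= w.sum()                      # one softmax weight per feature element
    return w * current                # element-wise re-weighted features

f = 8                                 # feature size (illustrative)
A = rng.normal(size=(f, f))           # trainable in the real model
current = rng.normal(size=f)
previous = rng.normal(size=(3, f))    # three previous utterances
v = attend(current, previous, A)
print(v.shape)                        # prints: (8,)
```

At the first time step, `previous` would contain copies of the first utterance's own feature vector, matching the observation in Section 5.4 that attention only takes effect from the second time step onward.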

5 Experiments

5.1 Corpus

Our corpus is split into training, development, and evaluation sets that include 77, 11, and 9 episodes, respectively. Although episodes are randomly assigned, all utterances from the same episode are kept in the same set to preserve the sequence information. Further attempts are made to maintain similar ratios for each emotion across different sets. Table 6 shows the distributions of the datasets.

Set N J P W S M D Total
TRN 3,034 2,184 899 784 1,286 1,076 671 9,934
DEV 393 289 132 134 178 143 75 1,344
TST 349 282 159 145 182 113 98 1,328
Table 6: The number of utterances in each dataset. N: neutral, J: joyful, P: peaceful, W: powerful, S: scared, M: mad, D: sad.
Acc F1 Acc F1 Acc F1 Acc F1
1 38.30 24.1 36.5 21.00 38.30 26.60 38.10 25.70
2 38.80 23.0 37.01 20.00 38.60 28.00 38.70 25.60
3 39.10 25.74 37.00 21.00 39.70 28.50 38.46 26.15
4 37.40 22.35 36.30 20.00 38.00 26.80 38.16 25.20
5 38.20 24.1 37.20 22.00 38.90 27.76 38.90 28.20
Table 7: The performance of different models (in %) on the development set. Acc: accuracy for 7 classes; F1: macro-average F1-score for 7 classes. The first column indicates the number of previous utterances included; each Acc/F1 pair corresponds to one of the four SCNN variants.

5.2 Preprocessing

In this work, we utilize the Word2vec word embedding model introduced by Mikolov et al. (2013). The word embeddings are trained separately on Friends TV show transcripts, Amazon reviews, and New York Times and Wall Street Journal articles. We use a word embedding size of 200.

5.3 Models

We report the results of the following four models: the two SCNN models of Sections 4.1 and 4.2 and their attentive counterparts of Section 4.3. For each model, we collect results using varying numbers of previous utterances. For a better comparison, we also report the results of the base CNN Kim (2014) and RNN-CNN. RNN-CNN is our replication of the model proposed by Donahue et al. (2015) for fusing time-series information for visual recognition and description. The RNN-CNN we employ comprises a Long Short-Term Memory (LSTM) network; its input is the feature vectors generated by the CNN for all utterances in a scene, and we train RNN-CNN to tune all the hyper-parameters.

5.4 Results

Table 7 summarizes the overall performance of our proposed sequence-based CNN models on the development set. The first column indicates the number of previous utterances included in the model. We report both accuracy and F1-score; by considering false positives and false negatives, the F1-score is generally a better measure of performance on a corpus with unbalanced classes. Two of the models achieve their best results when the previous three utterances are included, and the other two when the previous five are considered. We also ran experiments considering more than five utterances; however, no significant improvement was observed.
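For reference, macro-average F1 weights each of the seven emotions equally, which is why it is preferred here over plain accuracy on this skewed label distribution. A plain-Python sketch (toy labels; the function name `macro_f1` is ours):

```python
def macro_f1(gold, pred, labels):
    """Macro-averaged F1: per-class F1 computed independently, then
    averaged, so minority emotions count as much as neutral/joyful."""
    f1s = []
    for c in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

gold = ["neutral", "neutral", "joyful", "sad", "mad"]
pred = ["neutral", "joyful", "joyful", "neutral", "mad"]
labels = ["neutral", "joyful", "peaceful", "powerful", "scared", "mad", "sad"]
print(round(macro_f1(gold, pred, labels), 3))
```

Note that classes never predicted (here, sad) contribute an F1 of zero, pulling the macro average down; accuracy alone would hide this.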

Acc-7 Acc-3 F1-7 F1-3
CNN 37.01 49.78 22.91 36.83
RNN-CNN 29.00 42.10 11.00 24.05
SCNN 37.35 53.20 25.06 38.00
SCNN 36.45 51.11 21.00 36.50
SCNN 37.90 54.00 26.90 39.25
SCNN 37.67 51.90 26.70 38.21
Table 8: The performance of different models (in %) on the evaluation set. Acc-7: accuracy for 7 classes; Acc-3: accuracy for 3 classes (Table 5); F1-7: macro-average F1-score for 7 classes; F1-3: macro-average F1-score for 3 classes.

Table 8 summarizes the overall performance of our models on the evaluation set. To compare our proposed models against baselines, we also include the performance of CNN and RNN-CNN. The accuracies and F1-scores are reported for both the 7-emotion and 3-emotion settings (Table 5), where the latter is comparable to a typical sentiment analysis task. For the models listed in Table 8, we choose the sequence lengths that performed best on the development set (three previous utterances for two of the SCNN variants, five for the other two). From Tables 7 and 8, we can see that the best SCNN variant outperformed all other listed models.

To fuse the generated dense feature vectors, we applied different combinations of regularized feature-fusion networks similar to those employed in Wu et al. (2015) and Bodla et al. (2017). However, for our task, none of these fusion networks performed better than the 1-D convolutional layer we utilized. It is worth mentioning that, at the first time step of both of our attentive models, the two inputs to the attention matrix/vector are two identical vectors (the first utterance's feature vector), which means the main impact of the attention mechanism starts from the second time step.

5.5 Analysis

Figure 7 shows the confusion matrix of gold labels and predictions for our best model. Almost all of the emotions are confused most with neutral. Peaceful has the highest rate of confusion with neutral: 30% of the examples in this class are confused with neutral, whereas joyful and powerful have the lowest confusion rates with neutral (13.8% and 20.4%, respectively).

Figure 7: Confusion matrix of the best model on the evaluation set. Each matrix cell contains the raw count.

To further explore the effect of sequence length, we divided the evaluation set into four batches based on the number of utterances in the scenes, as described in Section 3.1. After examining performance on the four batches, we noticed that the SCNN variants configured with three previous utterances performed better (roughly a 4% boost in F1-score) on the first two batches (i.e., scenes containing [5, 15) utterances) compared to other models such as the base CNN. It seems that in very long scenes, which usually contain more speakers and more transitions between speakers, our proposed models did not significantly outperform the base CNN.

During our experiments, we observed that RNN-CNN did not achieve compelling performance. Generally, complicated models with a greater number of hyper-parameters require a larger corpus for tuning. Given the relatively small size of our corpus for such a model, RNN-CNN overfitted rapidly in the very early epochs; it was essentially tuned to detect only the two most dominant classes. Also, one of the SCNN models did not outperform the base CNN, although it did after the attention mechanism was included. We believe the size of our corpus may be inadequate to train this model, which has more hyper-parameters than the other SCNN variant.

Figure 8 depicts a heat-map representation of the vector created by multiplying the current utterance's feature vector with the attention matrix. The heat-map includes the first eight consecutive utterances (rows) of a scene. Each row shows the importance of the current utterance (color-mapped in blue) as well as the previous three utterances (the last three columns) at that particular time step. At each time step, the current utterance has the largest value, which means our model attends more to the current utterance than to the previous ones. The attention matrix learns to assign weights to the previous utterances given the weight assigned to the current utterance.

Figure 8: The heat-map representation of the attended feature vector in the SCNN model. U#: utterance number.

For instance, utterances 7 and 8 attend more to their previous two utterances (which have similar emotions). Utterance 5 attends less to utterance 2, as the second utterance has a positive emotion. Similarly, utterance 2 attends more to its previous utterance, as they both have a positive emotion. Generally, when the emotion of the current utterance is neutral, the weights assigned to its previous utterances are relatively small and mostly similar.

6 Conclusion

In this work, we introduced a new corpus for the emotion detection task, gathered from spoken dialogues. We also proposed attentive SCNN models that incorporate the available sequence information. The experimental results showed that our proposed models outperform the base CNN. As annotating emotions from text is usually subjective, we plan to assign more annotators in the future to improve the quality of the current corpus. Also, to fully evaluate the performance of our proposed models, we intend to implement different combinations of attention mechanisms and to expand the corpus by annotating more seasons of the Friends TV show.


  • Alm (2008) Cecilia Ovesdotter Alm. 2008. Affect in Text and Speech. Ph.D. thesis, University of Illinois at Urbana-Champaign.
  • Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 .
  • Bodla et al. (2017) Navaneeth Bodla, Jingxiao Zheng, Hongyu Xu, Jun-Cheng Chen, Carlos Castillo, and Rama Chellappa. 2017. Deep heterogeneous feature fusion for template-based face recognition. In Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on. IEEE, pages 586–595.
  • Buechel and Hahn (2017) Sven Buechel and Udo Hahn. 2017. Emobank: Studying the impact of annotation perspective and representation format on dimensional emotion analysis. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. Association for Computational Linguistics, Valencia, Spain, pages 578–585. http://www.aclweb.org/anthology/E17-2092.
  • Chaffar and Inkpen (2011) Soumaya Chaffar and Diana Inkpen. 2011. Using a heterogeneous dataset for emotion analysis in text. In Canadian Conference on Artificial Intelligence. Springer, pages 62–67.
  • Chaumartin (2007) François-Régis Chaumartin. 2007. Upar7: A knowledge-based system for headline sentiment tagging. In Proceedings of the 4th International Workshop on Semantic Evaluations. Association for Computational Linguistics, pages 422–425.
  • Donahue et al. (2015) Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2015. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE conference on computer vision and pattern recognition. pages 2625–2634.
  • dos Santos et al. (2016) Cıcero Nogueira dos Santos, Ming Tan, Bing Xiang, and Bowen Zhou. 2016. Attentive pooling networks. CoRR, abs/1602.03609 .
  • Douiji et al. (2016) Yasmina Douiji, Hajar Mousannif, and Hassan Al Moatassime. 2016. Using youtube comments for text-based emotion recognition. Procedia Computer Science 83:292–299.
  • Dufour et al. (2016) Richard Dufour, Mohamed Morchid, and Titouan Parcollet. 2016. Tracking dialog states using an author-topic based representation. In Spoken Language Technology Workshop (SLT), 2016 IEEE. IEEE, pages 544–551.
  • Gehring et al. (2017) Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. 2017. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122 .
  • Hermann et al. (2015) Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems. pages 1693–1701.
  • Kim et al. (2016) Seokhwan Kim, Luis Fernando D’Haro, Rafael E Banchs, Jason D Williams, Matthew Henderson, and Koichiro Yoshino. 2016. The fifth dialog state tracking challenge. In Proceedings of the 2016 IEEE Workshop on Spoken Language Technology (SLT).
  • Kim (2014) Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
  • LeCun et al. (1995) Yann LeCun, Yoshua Bengio, et al. 1995. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks 3361(10):1995.
  • Liu et al. (2016) Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2016. Recurrent neural network for text classification with multi-task learning. arXiv preprint arXiv:1605.05101.
  • Lucey et al. (2010) Patrick Lucey, Jeffrey F Cohn, Takeo Kanade, Jason Saragih, Zara Ambadar, and Iain Matthews. 2010. The Extended Cohn-Kanade Dataset (CK+): A complete dataset for action unit and emotion-specified expression. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on. IEEE, pages 94–101.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems. pages 3111–3119.
  • Mnih et al. (2014) Volodymyr Mnih, Nicolas Heess, Alex Graves, et al. 2014. Recurrent models of visual attention. In Advances in neural information processing systems. pages 2204–2212.
  • Mohammad and Bravo-Marquez (2017) Saif M. Mohammad and Felipe Bravo-Marquez. 2017. WASSA-2017 shared task on emotion intensity. In Proceedings of the Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis (WASSA). Copenhagen, Denmark.
  • Seol et al. (2008) Young-Soo Seol, Dong-Joo Kim, and Han-Woo Kim. 2008. Emotion recognition from text using knowledge-based ANN. In Proceedings of ITC-CSCC. pages 1569–1572.
  • Shaheen et al. (2014) Shadi Shaheen, Wassim El-Hajj, Hazem Hajj, and Shady Elbassuoni. 2014. Emotion recognition from text based on automatically generated rules. In Data Mining Workshop (ICDMW), 2014 IEEE International Conference on. IEEE, pages 383–392.
  • Shi et al. (2016) Hongjie Shi, Takashi Ushio, Mitsuru Endo, Katsuyoshi Yamagami, and Noriaki Horii. 2016. A multichannel convolutional neural network for cross-language dialog state tracking. In Spoken Language Technology Workshop (SLT), 2016 IEEE. IEEE, pages 559–564.
  • Snow et al. (2008) Rion Snow, Brendan O’Connor, Daniel Jurafsky, and Andrew Y Ng. 2008. Cheap and fast—but is it good?: evaluating non-expert annotations for natural language tasks. In Proceedings of the conference on empirical methods in natural language processing. Association for Computational Linguistics, pages 254–263.
  • Stolcke et al. (2000) Andreas Stolcke, Klaus Ries, Noah Coccaro, Elizabeth Shriberg, Rebecca Bates, Daniel Jurafsky, Paul Taylor, Rachel Martin, Carol Van Ess-Dykema, and Marie Meteer. 2000. Dialogue act modeling for automatic tagging and recognition of conversational speech. Computational linguistics 26(3):339–373.
  • Strapparava and Mihalcea (2007) Carlo Strapparava and Rada Mihalcea. 2007. SemEval-2007 task 14: Affective text. In Proceedings of the 4th International Workshop on Semantic Evaluations. Association for Computational Linguistics, pages 70–74.
  • Strapparava and Mihalcea (2008) Carlo Strapparava and Rada Mihalcea. 2008. Learning to identify emotions in text. In Proceedings of the 2008 ACM symposium on Applied computing. ACM, pages 1556–1560.
  • Strapparava et al. (2004) Carlo Strapparava, Alessandro Valitutti, et al. 2004. WordNet Affect: an affective extension of WordNet. In LREC. Citeseer, volume 4, pages 1083–1086.
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems. pages 3104–3112.
  • Waibel et al. (1989) Alex Waibel, Toshiyuki Hanazawa, Geoffrey Hinton, Kiyohiro Shikano, and Kevin J Lang. 1989. Phoneme recognition using time-delay neural networks. IEEE transactions on acoustics, speech, and signal processing 37(3):328–339.
  • Willcox (1982) Gloria Willcox. 1982. The feeling wheel: A tool for expanding awareness of emotions and increasing spontaneity and intimacy. Transactional Analysis Journal 12(4):274–276.
  • Wu et al. (2015) Zuxuan Wu, Xi Wang, Yu-Gang Jiang, Hao Ye, and Xiangyang Xue. 2015. Modeling spatial-temporal clues in a hybrid deep learning framework for video classification. In Proceedings of the 23rd ACM International Conference on Multimedia. ACM, pages 461–470.
  • Xu et al. (2015) Kelvin Xu, Jimmy Ba, Ryan Kiros, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044.
  • Yu et al. (2001) Feng Yu, Eric Chang, Ying-Qing Xu, and Heung-Yeung Shum. 2001. Emotion detection from speech to enrich multimedia content. Advances in Multimedia Information Processing—PCM 2001, pages 550–557.
  • Zeng et al. (2006) Zhihong Zeng, Yun Fu, Glenn I Roisman, Zhen Wen, Yuxiao Hu, and Thomas S Huang. 2006. Spontaneous emotional facial expression detection. Journal of multimedia 1(5):1–8.