Multi-view Characterization of Stories from Narratives and Reviews using Multi-label Ranking

08/24/2019 ∙ by Sudipta Kar, et al. ∙ University of Houston

This paper considers the problem of characterizing stories by inferring attributes like theme and genre using the written narrative and user reviews. We experiment with a multi-label dataset of narratives representing the story of movies and a tagset representing various attributes of stories. To identify the story attributes, we propose a hierarchical representation of narratives that improves over the traditional feature-based machine learning methods as well as sequential representation approaches. Finally, we demonstrate a multi-view method for discovering story attributes from user opinions in reviews that are complementary to the gold standard data set.




1 Introduction

The availability of multiple streaming services such as Netflix, Hulu, and Amazon offers users a seemingly unlimited number of movies and TV shows to binge on. While having a large number of options to choose from is a plus, users can sometimes be overwhelmed by the number of options and need efficient mechanisms to identify the movies or TV shows that are most aligned with their preferences.

We designed an approach that performs a high level story understanding from plot synopses to describe movies. The output of our model is a list of tags. The underlying assumption is that movie synopses describe the most salient events and character traits in the movie. Movie plot synopses can be long, sometimes up to 10K words. So generating a list of tags from them can provide users sufficient information about what to expect from the movie and make the selection process less time consuming.

We also argue that content in user reviews for movies can provide information relevant to describing stories and, as we will show in the results, movie reviews contain attributes complementary to those in the gold standard tags. However, waiting for reviews to accumulate for a movie in order to generate tags is not practical. Therefore, we propose a system that predicts tags by jointly modelling movie synopses and reviews when reviews are available, but can still predict tags from the movie synopsis alone. Our model uses a gating mechanism to control how much weight should be given to the input from the reviews and from the movie synopsis.

Concretely, we develop a multi-view multi-label tag prediction system, which exploits the hierarchical representation of the texts from the synopses and reviews if available. We empirically demonstrate that user reviews and the gating mechanism combined with skip connections can benefit movie tag prediction performance. Furthermore, we extract another set of tags from the reviews that contains open set story attributes that the model was never trained to predict. Besides evaluating our system using multiple evaluation metrics, we facilitate the evaluation of multi-label classification from the perspective of ranking relevant classes at the top by proposing Multi-label Rank, a rank inspired metric for evaluating multi-label classification problems. Additionally, we show the effectiveness of our proposed method and novel ranking metric by using human judgement to evaluate the relevance of tags for a set of movies.

2 Background

Over the years, high-level narrative characterization approaches evolved around the problem of identifying genres Biber (1992); Kessler et al. (1997); Petrenz (2012); Worsham and Kalita (2018). Genre information is helpful but not very expressive most of the time, as it is a broad way to categorize items. Several recent attempts Gorinski and Lapata (2018); Kar et al. (2018) retrieve other attributes of stories, like mood, plot type, and possible feelings of consumers, in a supervised fashion where the number of predictable categories is larger than in the approaches for genre classification. Even though these systems can retrieve comparatively larger sets of story attributes, the predictable attributes are a closed set of tags, whereas in real life these attributes can be unlimited. User reviews can be helpful for discovering such an open set of story attributes.

There is a subtle distinction between the reviews of typical material products (e.g. phone, TV, furniture) and story-based items (e.g. literature, film, blog). In contrast to the usual aspect based opinions (e.g. battery, resolution, color), reviews of story-based items often contain end users’ feelings, important events of stories, or genre related information, which are abstract in nature (e.g. heart-warming, slasher, melodramatic) and do not have a very specific target aspect. Extraction of abstract opinions for stories has been approached by several works using reviews of movies Zhuang et al. (2006); Li et al. (2010) and books Lin et al. (2013). Such attempts are broadly divided into two categories. The first category deals with spotting words or phrases (excellent, fantastic, boring) used by people to express how they felt about the story, and the second category focuses on extracting important opinionated sentences from reviews and generating a summary. In our work, we focus on building a system that can spot opinions describing story attributes as a secondary functionality, where the primary task is to retrieve relevant attributes from the closed set of tags.

3 Dataset

Our starting data set is the MPST corpus Kar et al. (7-12). This corpus has around 15K movies and a tag set of close to 70 tags. To extract tags from reviews, we crawled reviews from IMDb. We collected up to 100 reviews per movie, selecting the reviews ranked as the most helpful. Out of the 15K movies in MPST, only 285 movies are missing reviews in IMDb.

Data Pre-processing After normalizing the texts into Unicode, we summarize the reviews using TextRank Mihalcea and Tarau (2004) (we used the implementation from the Gensim library) to remove redundant comments. Then we tokenize the data using an NLP library. To remove rare words and other noise, we retain the words that appear in at least ten synopsis and review texts. Additionally, we replace the numbers with a cc token. Through these steps we create a vocabulary of 42K word tokens. We represent the out of vocabulary words with a UNK token. For each movie sample, there are two text inputs (the written narrative and the summary of reviews). We use an empty string as the review text for the movies that do not have any review.
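
The vocabulary step described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the texts are already tokenized, and the function names (`normalize`, `build_vocab`, `encode`) and the regular expression used to detect numbers are hypothetical choices made for this example.

```python
import re
from collections import Counter

NUM_TOKEN = "cc"    # placeholder for numbers, as described in the text
UNK_TOKEN = "UNK"   # placeholder for out of vocabulary words

def normalize(tokens):
    """Replace tokens that look like numbers (assumed pattern) with NUM_TOKEN."""
    return [NUM_TOKEN if re.fullmatch(r"\d+([.,]\d+)*", t) else t for t in tokens]

def build_vocab(documents, min_doc_freq=10):
    """Keep the words that appear in at least `min_doc_freq` documents."""
    doc_freq = Counter()
    for tokens in documents:
        # Count each word once per document, not once per occurrence.
        doc_freq.update(set(normalize(tokens)))
    return {word for word, count in doc_freq.items() if count >= min_doc_freq}

def encode(tokens, vocab):
    """Map every word outside the vocabulary to UNK_TOKEN."""
    return [t if t in vocab else UNK_TOKEN for t in normalize(tokens)]
```

With such a helper, a synopsis and a review summary would each be passed through `encode` before being fed to the model, and an empty token list would stand in for a missing review.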

4 Methods

Given an input $X$, which represents the narrative and user review, with a tagset $Y$, we would like to model $P(Y \mid X)$. The goal of our model is to identify words and sentences that carry the most relevant attributes that can characterize the story. To do so, we encode each input with a hierarchical attention network, as it can create a hierarchical representation of documents for classification purposes by efficiently putting more weight on salient words and sentences. We enable the model to control the information flow from each input channel using a gating mechanism and estimate the likelihood of the tags through a layer of linear units. We aim at optimizing the model to compute probability values for the tags in such a way that the tags in $Y$ get ranked at the top by having higher probability scores.

Figure 1: Overall model architecture. We encode the narratives and reviews using the hierarchical attention network (HAN) in (a), which creates the narrative representation and the review representation. We use the sentence representations for the narratives to make sentence level tag predictions in (b) and compute a weighted sum using the sentence level attention weights from the HAN. A gated fusion of the narrative and review representations is concatenated with the weighted sentence level predictions to estimate the tag probabilities.

4.1 Hierarchical Representation of Narratives

We encode the narrative text input using a hierarchical attention network (HAN) based encoder similar to Yang et al. (2016). Different sentences of a narrative text have different roles in the overall story. For example, some sentences narrate the setting or background of a story, whereas some sentences describe different events and actions. In a similar manner, different words have different roles in a sentence. HAN allows the model to learn the importance of different words and sentences using bidirectional recurrent neural networks with an attention mechanism Bahdanau et al. (2015).

Bidirectional Sentence Encoder For a sentence $s_i$ having $L$ words, we create a matrix $E_i$ where $e_{it}$ is the vector representation for word $w_{it}$ in $s_i$. We use pre-trained word embeddings from GloVe Pennington et al. (2014) to initialize $E_i$. Two sets of LSTM Hochreiter and Schmidhuber (1997) units summarize the information from the sentence from the forward and backward directions and produce a hidden state vector $h_{it}$ for word $w_{it}$.

Sentence Attention

We force the model to learn to put more focus on the words that have a strong correlation with target attributes of stories using an attention mechanism. More specifically, given the hidden representation $(h_{i1}, \dots, h_{iL})$ of sentence $s_i$, we apply transformations whose parameters are learned during the training process to compute a word level weight vector $\alpha_i = (\alpha_{i1}, \dots, \alpha_{iL})$, where $\alpha_{it}$ indicates the importance of word $w_{it}$ in $s_i$. We use these weights to compute a weighted sum of the hidden vectors:

$$\hat{s}_i = \sum_{t=1}^{L} \alpha_{it} h_{it} \qquad (1)$$
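
As a rough illustration of attention-weighted pooling over word hidden states, the sketch below scores each hidden vector with a tanh layer, normalizes the scores with a softmax, and returns the weighted sum. The exact scoring function is an assumption made here (a common additive-attention form), and the parameters `weight` and `bias` are hypothetical stand-ins for the learned parameters.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden_states, weight, bias):
    """Return (attention weights, weighted sum of hidden vectors).

    hidden_states: list of L vectors (one per word).
    weight, bias: hypothetical learned parameters of the assumed scoring layer.
    """
    scores = [
        math.tanh(sum(w * h for w, h in zip(weight, vec)) + bias)
        for vec in hidden_states
    ]
    alphas = softmax(scores)
    dim = len(hidden_states[0])
    pooled = [
        sum(a * vec[d] for a, vec in zip(alphas, hidden_states))
        for d in range(dim)
    ]
    return alphas, pooled
```

The same routine could be reused at the sentence level, with sentence hidden states in place of word hidden states.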


Encoding Documents Not all the sentences in stories or reviews are equally worthwhile for inferring story traits. We enable the model to learn to identify and place more weight on the important sentences by using attention. This process is similar to encoding sentences using attention, where more weight is given to the important words. For a document with $M$ sentences, the sentence encoder creates an encoded 2D matrix of sentence vectors $(\hat{s}_1, \dots, \hat{s}_M)$. The document encoder uses a bidirectional LSTM followed by a sentence level attention module that computes an attention weight $\beta_i$ for each sentence $s_i$ from the corresponding hidden state $g_i$. We compute the weighted sum $\sum_{i=1}^{M} \beta_i g_i$, which represents the attention weighted encoded form of the narrative text. We also use this methodology for encoding the review texts.

Sentence Level Tag Prediction Besides computing the importance of the sentences in a narrative, we also want to model the specific role of the sentences in building up particular attributes of a story (e.g. what parts of a story make it suspenseful). For this reason, we experiment with adding a sub-module to the network that predicts the likelihood of the tags for each sentence and uses these predictions in the final prediction for the story. We also suspect that the combination of the document representation and sentence level predictions can enable the model to better discern the vital attributes of the stories by providing a global view of the story integrated with the fine-grained local information of the attributes. For a sentence $s_i$ in a narrative, we use the encoded representation to compute a sentence level tag prediction $p_i$. We obtain the aggregated prediction by computing the weighted sum $\sum_i \beta_i p_i$, which prioritizes the predictions for the more important sentences, where $\beta_i$ is the attention weight for sentence $s_i$ computed by the document encoder.
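
The aggregation of sentence level predictions amounts to an attention-weighted average of per-sentence tag scores. A minimal sketch, assuming the per-sentence predictions and the sentence attention weights are already available as plain lists (the function name is hypothetical):

```python
def aggregate_sentence_predictions(sentence_probs, sentence_attention):
    """Weighted sum of per-sentence tag predictions.

    sentence_probs: list of M vectors, one tag score vector per sentence.
    sentence_attention: list of M attention weights from the document encoder.
    """
    num_tags = len(sentence_probs[0])
    return [
        sum(beta * probs[k] for beta, probs in zip(sentence_attention, sentence_probs))
        for k in range(num_tags)
    ]
```

For example, with two sentences weighted 0.75 and 0.25, a tag scored 0.9 in the first sentence and 0.2 in the second receives an aggregated score of 0.725, so the more important sentence dominates.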

4.2 Information Controlling Gates

User reviews are different in nature from the narratives, and we hypothesize that the way they can contribute to predicting tags is also different. While important story events and settings found in the narratives can correlate with some tags, viewers’ reactions can correlate with complementary tags. We believe that learning to control the contribution of information encoded from narratives and reviews can improve the overall model performance. For instance, if the narrative is not descriptive enough to retrieve relevant tags but the reviews have adequate information, we want the model to use more information from the reviews. Hence we apply a gated fusion mechanism Ovalle et al. (2017) on top of the encoded narrative and review representations.
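
To make the idea of gated fusion concrete, the sketch below follows the general shape of a gated multimodal unit: each view is passed through a tanh layer, a sigmoid gate is computed from the concatenation of both views, and the gate interpolates between the two transformed views. This is an illustrative assumption about the form of the mechanism, not the paper's exact equations; the names `x_n`, `x_r`, `W_n`, `W_r`, and `W_z` are hypothetical.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def gated_fusion(x_n, x_r, W_n, W_r, W_z):
    """Fuse a narrative vector x_n and a review vector x_r.

    Returns (fused vector, gate values). A gate value near 1 favors the
    narrative view; a value near 0 favors the review view.
    """
    h_n = [math.tanh(v) for v in matvec(W_n, x_n)]
    h_r = [math.tanh(v) for v in matvec(W_r, x_r)]
    # The gate sees both views at once (concatenation of the two inputs).
    z = [1.0 / (1.0 + math.exp(-v)) for v in matvec(W_z, x_n + x_r)]
    fused = [g * a + (1.0 - g) * b for g, a, b in zip(z, h_n, h_r)]
    return fused, z
```

Under this formulation, inspecting the gate values per instance gives a direct reading of which view the model relied on.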

As the final prediction layer, we use a linear layer whose output dimension equals the size of the tagset, followed by a Softmax activation. The input to this layer is the gated combination of the encoded narrative and review, concatenated with the weighted prediction vector obtained from the sentence level predictions. We also experiment with skip connections from the narrative and review representations to the prediction layer, and we observe better performance on the validation set.

Implementation and Training

We implement the network using the PyTorch machine learning library. We use KL divergence as the loss function for the network and Stochastic Gradient Descent (SGD) to optimize the network parameters. We use a dropout rate of 40% between the layers and L2 regularization to prevent overfitting. We observe faster convergence using batch normalization after each layer.
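
For intuition about using KL divergence with a Softmax output in a multi-label setting, the sketch below computes the divergence between a target distribution and a predicted distribution. How the target distribution is built is not stated in the text; spreading the probability mass uniformly over the gold tags is an assumption made for this example, and the helper names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def target_distribution(tag_indices, num_tags):
    """Assumption: every gold tag receives an equal share of the mass."""
    mass = 1.0 / len(tag_indices)
    return [mass if k in tag_indices else 0.0 for k in range(num_tags)]

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted); terms with zero target mass contribute nothing."""
    return sum(
        t * math.log(t / max(p, eps))
        for t, p in zip(target, predicted)
        if t > 0.0
    )
```

The loss is zero only when the predicted distribution matches the target, which pushes probability mass toward the gold tags and therefore toward the top of the ranking.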

4.3 Extracting Attributes from Reviews

Figure 2: Illustration of the method for tag extraction from reviews using the importance score computed from the attention weights at the word and sentence levels. All the words to the left of the red vertical line are selected as the candidate words.

We observe that the model usually puts higher attention weights on the opinion words in the reviews. Therefore, we use the attention weights on words and sentences in reviews computed by the model to extract new tags not present in the tagsets we use for training the models. Given a review text, we run the model, which produces attention weight vectors at the word level and at the sentence level. For each word in each sentence of the review, we compute an importance score from the attention weight of the word, the attention weight of the sentence, and the number of words in the sentence.

The sentence length term helps to overcome the fact that word level attention scores are higher in shorter sentences. We rank the words in the reviews based on their importance scores and choose the first few words as the primary candidate tags, as shown in Figure 2. The idea is that the sorted scores create a downward slope and we stop at the point where the slope starts getting flat. We detect this point by computing the derivative at each point based on its four neighboring points, and we set a threshold of 0.005 based on our observations on the validation set. After selecting the candidates, we remove the duplicates and the words that are already in the predefined tagset to avoid redundancy. This process gives us a new tagset that is created from the opinions of users and is able to provide a new dimension to the characterization process by extracting tags foreign to the tags in the dataset.
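
The extraction procedure can be sketched as below. Two details are assumptions made for this example, since the text does not pin them down: the importance score is taken to be the product of word attention, sentence attention, and sentence length, and the local slope is estimated by a finite difference over a window of two points on each side. All function names are hypothetical.

```python
def importance_scores(review):
    """Score and rank words of a review.

    review: list of (sentence_attention, [(word, word_attention), ...]).
    Assumed score: word_attention * sentence_attention * sentence_length.
    """
    scored = []
    for sent_attn, words in review:
        for word, word_attn in words:
            scored.append((word, word_attn * sent_attn * len(words)))
    return sorted(scored, key=lambda item: item[1], reverse=True)

def cutoff_index(sorted_scores, threshold=0.005):
    """Index of the first point where the downward slope falls below threshold."""
    n = len(sorted_scores)
    for i in range(n):
        lo, hi = max(0, i - 2), min(n - 1, i + 2)
        if hi == lo:
            continue
        slope = (sorted_scores[lo] - sorted_scores[hi]) / (hi - lo)
        if slope < threshold:
            return i
    return n

def extract_tags(review, known_tags, threshold=0.005):
    """Candidate tags above the cutoff, without duplicates or known tags."""
    ranked = importance_scores(review)
    stop = cutoff_index([score for _, score in ranked], threshold)
    seen, tags = set(), []
    for word, _ in ranked[:stop]:
        if word not in known_tags and word not in seen:
            seen.add(word)
            tags.append(word)
    return tags
```

In practice one would likely also filter stop words before ranking; that step is omitted here because the text does not describe it.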

                    Validation                            Test
Method              MLR    F1@3   TL@3  F1@5   TL@5       MLR    F1@3   TL@3  F1@5   TL@5
Only Narrative
  Top N             85.23  29.70     3  31.50     5       85.66  29.70     3  28.40     5
  Random N          51.49   4.20    71   5.40    71       52.12   4.20    71   6.36    71
  Features          85.23  36.90    40  36.80    48       85.66  37.30    47  37.30    52
  CNN-FE            86.10  37.70    37  37.60    46       86.51  36.90    58  36.70    65
  HAN               90.94  38.89    38  38.79    51       90.85  38.25    41  38.29    49
  HAN +S            91.32  38.88    54  39.04    61       91.09  38.43    57  38.74    64
Narrative + Review
  Concat            93.36  42.46    60  42.30    67       93.19  41.96    57  41.74    65
  Concat            93.31  42.78    60  42.80    67       93.31  41.84    60  41.79    65
  Gate              93.39  43.60    53  42.89    62       93.23  42.28    53  42.04    59
  Gate +S           93.32  42.74    60  42.29    66       93.10  41.81    59  41.67    66
  Gate +K           93.50  43.36    61  43.06    68       93.42  42.92    60  42.34    66
  Gate +S +K        93.56  42.74    63  42.52    68       93.38  41.90    61  41.97    68
Table 1: Results obtained on the validation and test set using different methodologies on the narratives and after adding reviews with the narratives. MLR and TL stand for multi-label rank and tags learned respectively; F1@3 and F1@5 are micro-F1 on the top 3 and top 5 tags, and TL@3 and TL@5 are the tags learned in the same settings. Top N, Random N, Features, and CNN-FE are baselines. +S: sentence level prediction enabled. +K: skip connection enabled.

5 Experiments

We treat the problem as a mix of a ranking problem and a multi-label classification problem. Based on the predicted probabilities, we sort the tagset in descending order, so that the tags with higher probabilities are ranked at the top. Then, in different settings, we select the top N (N=3, 5) tags as the final tags to characterize each story.

5.1 Evaluation Metrics

We evaluate three aspects of the models: a) performance in ranking the most relevant tags at the top, b) F1 measure at the top N tags, and c) number of distinct tags predicted in the top N tags. Previous systems trained on the MPST corpus were evaluated by three metrics: micro-F1, tag recall (TR), and number of distinct tags learned (TL) for the test set. Here, micro-F1 gives a sense of the quality of the predictions, TR measures the average recall for each of the tags in the dataset, and TL helps to understand the systems’ ability to infer diverse attributes of movies regardless of the imbalance in the dataset. Considering the fact that all the target tags have similar weights, measuring the prediction quality with micro-F1 is constrained by the following limitations:

  • Considering the top N tags as the predicted tagset, micro-F1 automatically penalizes the samples whose target tagset size differs from N. For example, if N = 5 and a movie has seven target tags, our system will be penalized even if all 5 predicted tags are in the target tags.

  • Micro-F1 does not take the position of the predicted tags into consideration. But we argue that a metric that can reward a system for having target tags near the top of the list is more appropriate for this task.
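
The first limitation can be checked with a small worked example. The sketch below is a plain implementation of micro-F1 over the top N predictions (the function name and the toy tag names are made up for illustration). With N = 5, seven target tags, and all five predictions correct, precision is 1 and recall is 5/7, so F1 is 10/12, or about 0.83, even though no prediction is wrong.

```python
def micro_f1_at_n(predicted_rankings, target_sets, n):
    """Micro-averaged F1 when the top n ranked tags are taken as predictions."""
    tp = fp = fn = 0
    for ranking, targets in zip(predicted_rankings, target_sets):
        top = set(ranking[:n])
        tp += len(top & targets)
        fp += len(top - targets)
        fn += len(targets - top)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy instance: seven target tags, a ranking that puts five of them first.
targets = {"t1", "t2", "t3", "t4", "t5", "t6", "t7"}
ranking = ["t1", "t2", "t3", "t4", "t5", "x1", "x2", "t6", "t7"]
score = micro_f1_at_n([ranking], [targets], n=5)
```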

Hence, we propose a new metric called Multi-label Rank (MLR) that evaluates the prediction quality by considering the ranking of the target tags in the models’ predictions. The Multi-label Rank of a set of data samples is computed from the mean Rank Distance of each instance. The Rank Distance of a target tag is defined in terms of the ranking position of that tag in the predicted ranking of the label set and the size of the label set in the dataset.

This metric is helpful as it considers all the target tags as equally weighted and yields the maximum rank score if all of the target tags are located at the top positions in the predicted ranking, regardless of their order. So the objective for a model becomes to place all the target tags at the top of the predicted ranking, in any order, to achieve the maximum rank score.
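
One possible instantiation of a rank-based metric with these properties is sketched below. The specific Rank Distance used here (zero for a target tag ranked within the first k positions, where k is the number of target tags, and growing linearly with the position below that block, normalized by the label set size) and the scaling to 100 are assumptions made for illustration; they satisfy the properties stated above but are not guaranteed to match the paper's exact definition.

```python
def rank_distance(position, num_targets, num_labels):
    """Assumed form: 0 inside the top block, linear penalty below it."""
    return max(0, position - num_targets) / num_labels

def multi_label_rank(predicted_rankings, target_sets, num_labels):
    """Average of (1 - mean rank distance) over instances, scaled to 100."""
    scores = []
    for ranking, targets in zip(predicted_rankings, target_sets):
        position = {tag: i + 1 for i, tag in enumerate(ranking)}  # 1-based ranks
        mean_rd = sum(
            rank_distance(position[t], len(targets), num_labels) for t in targets
        ) / len(targets)
        scores.append(1.0 - mean_rd)
    return 100.0 * sum(scores) / len(scores)
```

Under this form, a ranking that places both target tags in the first two positions scores 100 in either order, while pushing one of them further down lowers the score in proportion to how far it falls.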

5.2 Baselines

We compare our results with the following baseline systems:

  • Top N This simple baseline ranks the tagset for a story based on the frequency of tags in the training set. The most frequent tags are considered as the most relevant ones.

  • Random N We randomly choose N tags as the relevant tags. This baseline shows strong diversity but poor accuracy.

  • Hand-crafted linguistic features Logistic regression based One-versus-Rest classifier trained on lexical, semantic, and sentiment features Kar et al. (7-12).

  • Convolutional neural network with flow of emotion (CNN-FE) Convolutional neural network based text encoder to extract features from written synopses and Bidirectional LSTMs to model the flow of emotions in the stories Kar et al. (2018).

For generating the predictions with the last two systems, we obtained the source codes from the authors.
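
The two simplest baselines are easy to reproduce. A minimal sketch, assuming the training tagsets are available as Python sets (the function names are hypothetical):

```python
import random
from collections import Counter

def top_n_baseline(training_tagsets, n):
    """Rank tags by training frequency and return the n most frequent ones."""
    counts = Counter(tag for tags in training_tagsets for tag in tags)
    return [tag for tag, _ in counts.most_common(n)]

def random_n_baseline(tagset, n, rng=None):
    """Pick n distinct tags uniformly at random from the tagset."""
    rng = rng or random.Random()
    return rng.sample(sorted(tagset), n)
```

The frequency baseline returns the same tags for every story, which explains why it learns only N distinct tags, whereas the random baseline eventually covers the whole tagset.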

6 Results

Quantitative Results We report the results of our experiments on the validation and test set in Table 1. In the experiments of predicting tags using only the narratives, multi-label rank (MLR) achieved by the HAN model is 90.94 on the validation set, which is higher than all the baseline systems. Additionally, we observe improvements in micro-F1 computed on the top N tags. Integrating the sentence level predictions with the document representation helps to improve MLR to 91.32 and also boosts the number of tags learned (TL). This supports our intuition that exploiting sentence level tag predictions can improve performance.

Fusing user reviews seems to be effective, as it improves the system performance in almost every aspect. Simply concatenating the review representation obtained by HAN with the document representation yields improvements in MLR (91.32 vs 93.36) and micro-F1 (38.88 vs 42.46). Using the gating mechanism instead of concatenation decreases TL, but enabling sentence level prediction restores it at the cost of a tiny drop of 0.07 in MLR. Adding skip connections to the network with sentence level prediction results in slight improvements in performance, and the model achieves the best MLR (93.56) on the validation set. But on the test set, the highest MLR (93.42) and micro-F1 (42.92) are achieved by only having the skip connections and not using sentence level predictions. Note that we discuss the micro-F1 on the top 3 tags.

Inspecting the Hierarchical Representation We analyze the reason behind the effectiveness of our proposed system by visualizing the attention weights at the sentence level and word level. In Figure 3, we visualize the attention weights for the movie August Rush for both the narrative and the review. We can see that the sentences in the synopsis that contain important story events (e.g. 16, 18, 21, 22) and the review sentences containing user opinions about the story (e.g. 44, 46, 51) received more weight. Similarly, within sentences, words related to important events and characters received more weight from the model, as did words in the review sentences that convey opinions about the movie. If we compare the tagsets provided in the caption with the highlighted words and sentences, we can conclude that the model is effectively modelling the correlations between the salient parts of the texts and the tags.

Figure 3: Small snippets from the narrative (left) and review (right) of the movie August Rush are shown with their sentence level and word level attention weights. Each sentence starts with the sentence number highlighted in blue. The strength of the blue highlight is proportional to the attention weight for the sentence. The strength of the word level yellow highlights is proportional to the attention weights for the corresponding words. The gold standard tags for this movie were thought-provoking, romantic, inspiring, and flashback, and our top five predictions are romantic, inspiring, flashback, philosophical, and sentimental. Some examples of the additional tags extracted from the reviews using the attention weights are masterpiece, intrigue, inspirational, transcendent, and magical.

Multi-label Rank Visualization

Epoch   MLR     F1
1       53.82   15.81
10      91.42   28.18
20      96.10   46.05
30      95.30   55.67
40      98.55   62.54
50      99.03   61.17
Table 2: Visualization of the Multi-label Rank metric. Each cell presents a plot of the predictions after a particular epoch. In each plot, each column represents one data instance and each row represents a rank position. The topmost cell in each column represents the class label having the highest probability score predicted by the model, which makes it the top ranked class. Similarly, the bottom-most cell is the lowest ranked class. If a cell represents a target class, the cell is colored blue.

We present a visualization of the training progress, with the Multi-label Rank (MLR) and micro-F1 for the top 3 predictions, in Table 2 for a small amount of data. The plots show that target tags are progressively moved to the top with each new training epoch. But these improvements in training performance are obscured by the F-measure: when most of the target tags are ranked at the top (after the 50th epoch), micro-F1 is only 61.17. The MLR scores, in contrast, reflect the ranking performance more intuitively, which can benefit multi-label classification problems by providing a better way of evaluating performance.

6.1 Human Evaluation

Despite observing better quantitative results with our approach, we perform a human evaluation to answer the following research questions: “How relevant are our predicted tags compared to a baseline system? And how useful are the tags for the end users to get a quick idea about a movie?” We select CNN-FE Kar et al. (2018) as the baseline system (we used the online demo system released by the authors to generate the tags) to compare the quality of the tagsets created by our system for 21 randomly sampled movies from the test set. For each movie, we instruct three human annotators to read the plot synopsis of the movie to understand the story. Then we show them two sets of tags for each movie and ask them to select the tags characterizing the story. In the first tagset, we mix the five tags created by the baseline system and the five tags from our system to avoid the possibility of bias towards any particular system. The second tagset is created from the tags extracted from the user reviews (Section 4.3). Then we ask for the annotators’ opinion on whether these tagsets would help them decide if they want to see the movie or not. Finally, we reveal the title of the movie to them and ask if they have watched the movie or not, and for their opinion on the relevance of watching the film to fully evaluate the helpfulness of the tags. For this process, 21 people completed the evaluation and each of them evaluated three movies, for a total of 63 annotations.

Results For each tag in the tagsets, we consider a tag as relevant if it is selected by at least two annotators out of three. With this criterion, our proposed system won (if annotators select more tags from System A than System B for a particular movie, we announce A as the winner) for 12 out of 21 movies (57%), the baseline system won for five movies (23%), and the number of relevant tags was equal for the remaining four movies (20%). For these 21 movies, 141 tags (114 distinct tags) were extracted from the reviews in total. Out of these, 24% of the tags were selected as relevant by all three annotators, 18% by two annotators, and 32% by one annotator. This shows that around 74% of the extracted tags were marked as relevant by at least one annotator. When we asked the annotators about the helpfulness of the tagsets in deciding whether to watch the movie or not, for the closed set tags, 4 out of the 63 total annotations rated them as not helpful, 29 as moderately helpful, and 30 as very helpful. For the open set tagsets created from the user reviews, 18 annotations rated them as not helpful, 29 as moderately helpful, and 16 as very helpful. From these observations, we can conclude that our system is better at finding relevant story attributes than the baseline system, and that in most cases the tags are helpful for the end users.
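
The relevance criterion and the per-movie comparison can be expressed in a few lines. In this sketch, `votes` maps each tag to the number of annotators who selected it; the function names and the toy inputs in the usage note are hypothetical.

```python
def relevant_tags(votes, min_votes=2):
    """Tags selected by at least `min_votes` annotators."""
    return {tag for tag, count in votes.items() if count >= min_votes}

def winner(system_a_tags, system_b_tags, votes, min_votes=2):
    """Compare two systems by their number of relevant tags for one movie."""
    relevant = relevant_tags(votes, min_votes)
    a = len(relevant & set(system_a_tags))
    b = len(relevant & set(system_b_tags))
    if a > b:
        return "A"
    if b > a:
        return "B"
    return "tie"
```

A tag produced by both systems counts for both, so shared tags never change the outcome for a movie; only the tags unique to one system can decide the winner.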

6.2 Information from Reviews and Narratives

Figure 4: Percentage of gates activated for narratives and reviews. More active gates indicate more importance of the source for certain tags.

By comparing the predictions made using only the narratives with those made with user reviews as an additional view, we try to understand the contribution of each view to identifying story attributes. We notice that using user reviews improved ranking performance for tags like non fiction, inspiring, haunting, and pornographic. In Figure 4, we observe that the percentage of activated gates for the reviews was higher than for the narratives for the instances having the mentioned tags. Such tags are more likely to be related to the visual experience or to be somewhat challenging to understand from written narratives only. For example, narratives are more important for characterizing adult comedy stories, but pornographic representation can be better identified by the viewers, and this information can be easily conveyed through their opinions in reviews.

6.3 Predictions for New Movies

In February 2019, we collected plot synopses of a few new movies (not present in the existing dataset) and generated a set of tags using our model. We present the predictions in Table 3. So far, we have seen tags assigned by users in IMDb for Aquaman only and, with the exception of one tag (cult), all other tags predicted by our system have been assigned to the movie by IMDb users. We will check again in the coming months to see what tags appear for the other movies. Although very small in scale, this experiment shows promising results.

Movie Title             Top 5 predictions
Happy Death Day 2U      paranormal, psychedelic, murder, horror, violence
Alita: Battle Angel     violence, psychedelic, sci-fi, murder, good versus evil
The Upside              comedy, entertaining, satire, humor, flashback
Miss Bala               murder, violence, revenge, neo noir, cult
Aquaman                 good versus evil*, action*, violence*, cult, murder*
Table 3: Predicted tags from our model for new movies released in February 2019, using the plot synopses. Tags marked with * are already found to be assigned by users in IMDb.

7 Conclusion

In this paper, we focused on characterizing stories using tags. We considered the problem as a multi-label classification problem where we want to rank the target tags at the top when sorted by the likelihood estimation scores in descending order. Apart from using traditional metrics, we proposed a new evaluation metric called Multi-label Rank that can be helpful for evaluating approaches to such problems, where a) the number of target labels varies across instances and is usually small compared to the size of the possible set of labels, and b) target tags are equally weighted. We utilized a hierarchical attention network that improved the performance of the tag prediction. We empirically demonstrated that utilizing user reviews can further improve the performance, and we experimented with several methods for combining user reviews with narratives. Finally, we developed a methodology to extract user opinions that are helpful for identifying complementary attributes of movies. We believe that this coarse story understanding approach can be extended to longer stories, i.e., entire books, and we are currently exploring this path in our ongoing work.


  • D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, External Links: Link Cited by: §4.1.
  • D. Biber (1992) The multi-dimensional approach to linguistic analyses of genre variation: an overview of methodology and findings. Computers and the Humanities 26 (5), pp. 331–345. External Links: ISSN 1572-8412, Document, Link Cited by: §2.
  • P. J. Gorinski and M. Lapata (2018) What’s This Movie About? A Joint Neural Network Architecture for Movie Content Analysis. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1770–1781. External Links: Document, Link Cited by: §2.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §4.1.
  • S. Kar, S. Maharjan, A. P. López-Monroy, and T. Solorio (7-12) MPST: a corpus of movie plot synopses with tags. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris, France (english). External Links: ISBN 979-10-95546-00-9 Cited by: §3, 3rd item.
  • S. Kar, S. Maharjan, and T. Solorio (2018) Folksonomication: predicting tags for movies from plot synopses using emotion flow encoded neural network. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 2879–2891. External Links: Link Cited by: §2, 4th item, §6.1.
  • B. Kessler, G. Nunberg, and H. Schutze (1997) Automatic detection of text genre. In 8th Conference of the European Chapter of the Association for Computational Linguistics, External Links: Link Cited by: §2.
  • F. Li, C. Han, M. Huang, X. Zhu, Y. Xia, S. Zhang, and H. Yu (2010) Structure-aware review mining and summarization. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pp. 653–661. External Links: Link Cited by: §2.
  • E. Lin, S. Fang, and J. Wang (2013) Mining online book reviews for sentimental clustering. In 2013 27th International Conference on Advanced Information Networking and Applications Workshops, Vol. , pp. 179–184. External Links: Document, ISSN Cited by: §2.
  • R. Mihalcea and P. Tarau (2004) TextRank: bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing. External Links: Link Cited by: §3.
  • J. E. A. Ovalle, T. Solorio, M. Montes-y-Gómez, and F. A. González (2017) Gated multimodal units for information fusion. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Workshop Track Proceedings, External Links: Link Cited by: §4.2.
  • J. Pennington, R. Socher, and C. Manning (2014) Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: §4.1.
  • P. Petrenz (2012) Cross-Lingual Genre Classification. In Proceedings of the Student Research Workshop at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 11–21. External Links: Link Cited by: §2.
  • J. Worsham and J. Kalita (2018) Genre Identification and the Compositional Effect of Genre in Literature. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 1963–1973. External Links: Link Cited by: §2.
  • L. Zhuang, F. Jing, and X. Zhu (2006) Movie Review Mining and Summarization. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management, CIKM ’06, New York, NY, USA, pp. 43–50. External Links: ISBN 1-59593-433-2, Link, Document Cited by: §2.

Appendix A Appendix

A.1 Detailed Results of the Human Evaluation

IMDb id and Title | Tags with Number of Votes
Kronk’s New Groove
B: feel-good, magical realism, cute, whimsical, prank
N: flashback, romantic, entertaining, prank, psychedelic
R: kronk, moralising, funny, cartoon, hilarious, gags
The Doom Generation
B: pornographic, adult comedy, neo noir, comic, blaxploitation
N: violence, murder, pornographic, sadist, cult
R: disturbing, deranged, tortured, erotic, violent, psychotic, goriest, sexual, kinky, weirdness, nihilistic, gay, homoerotic, irony, humourous, goth
Shock to the System
B: plot twist, murder, suspenseful, intrigue, neo noir
N: murder, queer, plot twist, flashback, romantic
R: whodunit, lesbian, lgbt, gay, vengeance
Say It Isn’t So
B: comedy, adult comedy, humor, dramatic, entertaining
N: comedy, romantic, humor, prank, entertaining
R: humour, funny
Eating Raoul
B: neo noir, adult comedy, humor, comedy, bleak
N: murder, adult comedy, pornographic, satire, comedy
R: violent, slapstick, humour, masochistic, bondage, kinky
Doomsday Gun
B: dramatic, historical, suspenseful, thought-provoking, neo noir
N: violence, intrigue, murder, flashback, alternate history
R: thriller, cynical, backstabbing, conspiracy, amusing, evil, chases, paradox, nightmare, doomsday, chilling, mi6
Saints and Soldiers
B: historical, action, dramatic, suspenseful, realism
N: violence, historical, murder, suspenseful, flashback
R: massacre, brutality, affirming, brotherhood, underbelly, christianity
Goes Hollywood
B: entertaining, humor, comic, psychedelic, horror
N: cult, flashback, comic, psychedelic, horror
R: scooby
Power Rangers Lost
B: good versus evil, sci-fi, fantasy, alternate history, comic
N: good versus evil, fantasy, violence, paranormal, psychedelic
R: mystical, mythic, cartoon, psycho, magical, funny
The Big Snit
B: thought-provoking, suspenseful, comic, paranormal, bleak
N: psychedelic, absurd, cult, philosophical, satire
R: surreal, absurdist, existential, cartoon, demented
Battle of Britain
B: historical, action, dramatic, thought-provoking, anti war
N: historical, flashback, violence, anti war, suspenseful
R: gripping, tragic, biographical, dogfights, sixties
La noche del terror ciego
B: suspenseful, paranormal, murder, violence, revenge
N: violence, murder, cruelty, cult, flashback
R: disturbing, satanic, gore, eroticism, lesbianism, visions, torture, tinged
A Time to Kill
B: revenge, suspenseful, murder, violence, neo noir
N: revenge, murder, violence, flashback, sadist
R: violent, crime, brutally, vengeance, vigilante, sadism, poetic, depraved, fictional
The Passion of the Christ
B: dramatic, thought-provoking, historical, suspenseful, allegory
N: violence, christian film, murder, flashback, avant garde
R: brutality, symbolism, slasher, treachery, enlightening, torture, lucid, occult, allusion, ironic
Vals Im Bashir
B: historical, thought-provoking, anti war, philosophical, alternate history
N: flashback, violence, storytelling, murder, psychedelic
R: nightmares, nightmare, surreal, escapist, surrealism, disturbing, witty
In the Name of the King: The Last Mission
B: action, fantasy, violence, good versus evil, historical fiction
N: violence, murder, good versus evil, revenge, flashback
R: antihero, magical, campiness, dungeon, rampage, cinematic
Deal of the Century
B: dramatic, suicidal, realism, humor, thought-provoking
N: absurd, comedy, satire, cult, humor
R: maniacal, pathos, symbolism
Evil Bong 2: King Bong
B: humor, clever, action, comic, thought-provoking
N: cult, comedy, violence, murder, revenge
R: humour, wicked, amusing, killer, evil, geeky, titular, laced, irreverence, homophobic
Cross Fire
B: suspenseful, murder, revenge, sadist, neo noir
N: murder, violence, suspenseful, revenge, flashback
R: gunfight, fistfights, classic
B: melodrama, romantic, flashback, intrigue, paranormal
N: murder, flashback, romantic, revenge, paranormal
R: thriller, nightmares, reincarnation, chilling, karz, melancholy
B: romantic, melodrama, historical fiction, queer, intrigue
N: romantic, revenge, murder, violence, flashback
R: cynicism, irony, cruel, liaisons, humour, brutality, ruthless
Table 4: Data from the human evaluation experiment. B denotes the tags predicted by the baseline system, N denotes the tags predicted by our new system, and R denotes the open-set tags extracted from the user reviews by our system. A superscript number after a tag gives the number of annotators who selected that tag as relevant to the story. We consider a tag relevant if it has at least two votes. Row markers indicate the instances where our system's predictions were more relevant than the baseline system's and the instances where the opposite held; in the remaining instances, the two systems tied. Annotators' feedback about the helpfulness of the tagsets (closed-set tags and open-set tags) is presented by emoticons on a three-level scale: very helpful, moderately helpful, and not helpful. The first three emoticons are the feedback from all the annotators on the tags from the baseline system and our system. The remaining three emoticons are the feedback on the tags extracted from the user reviews.
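
The relevance rule in the caption can be illustrated with a minimal Python sketch. The two-vote threshold comes from the caption; everything else is an assumption made for illustration. In particular, the caption does not say how "more relevant" was decided, so the sketch assumes the system with more relevant tags wins, and the function names and vote counts are hypothetical, not data from the experiment.

```python
from typing import Dict, List

MIN_VOTES = 2  # threshold stated in the caption: relevant if at least two votes


def relevant_tags(votes: Dict[str, int], min_votes: int = MIN_VOTES) -> List[str]:
    """Return the tags whose vote count meets the relevance threshold."""
    return [tag for tag, count in votes.items() if count >= min_votes]


def compare_systems(baseline: Dict[str, int], new: Dict[str, int]) -> str:
    """Compare two tag lists by their number of relevant tags.

    Assumption: the winner is the system with more relevant tags.
    The paper's caption does not spell out the exact criterion.
    """
    n_baseline = len(relevant_tags(baseline))
    n_new = len(relevant_tags(new))
    if n_new > n_baseline:
        return "new"
    if n_baseline > n_new:
        return "baseline"
    return "tie"


# Hypothetical vote counts, invented for this example only.
baseline_votes = {"tag_a": 3, "tag_b": 1, "tag_c": 0}
new_votes = {"tag_a": 3, "tag_d": 2, "tag_e": 1}

print(relevant_tags(baseline_votes))
print(relevant_tags(new_votes))
print(compare_systems(baseline_votes, new_votes))
```

With these invented counts the baseline has one relevant tag and the new system has two, so the comparison favors the new system under the assumed criterion.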