A multimodal movie review corpus for fine-grained opinion mining

02/26/2019 ∙ by Alexandre Garcia, et al. ∙ Télécom ParisTech

In this paper, we introduce a set of opinion annotations for the POM movie review dataset, composed of 1000 videos. The annotation campaign is motivated by the development of a hierarchical opinion prediction framework allowing one to predict the different components of the opinions (e.g. polarity and aspect) and to identify the corresponding textual spans. The resulting annotations have been gathered at two granularity levels: a coarse one (opinionated span) and a finer one (span of opinion components). We introduce specific categories in order to make the annotation of opinions easier for movie reviews. For example, some categories allow the discovery of user recommendation and preference in movie reviews. We provide a quantitative analysis of the annotations and report the inter-annotator agreement under the different levels of granularity. We thus provide the first set of ground-truth annotations which can be used for the task of fine-grained multimodal opinion prediction. We also show that a linear structured predictor learns meaningful features even for the prediction of scarce labels. Both the annotations and the baseline system will be made publicly available.


1 Introduction

Due to the expansion of e-commerce on the one hand and social networks on the other hand, opinionated content has become available on a range of products going from purchasable goods (Amazon, PriceMinister reviews) to tourism services (hotels and restaurants on TripAdvisor) and activities (Rotten Tomatoes, IMDb). Reviews often come in the form of a written commentary accompanied by one or more ratings summarizing the reviewer’s satisfaction level with respect to some aspects of the object being criticized. Even if these ratings provide a means to measure the global satisfaction of customers, this information cannot be directly used to understand the specific aspects of the product which require improvement. Different machine learning prediction tasks have been proposed to help understand customers’ satisfaction through the reviews they write.

Prediction of the global polarity of a sentence or textual span, possibly with different intensity levels (varying from very negative to very positive), has been addressed as a sentiment analysis task in (pang2002thumbs; maas2011learning; liu2012sentiment). Following this path, different studies have built coding schemes for describing and annotating further aspects of opinions. A common feature of these models is the definition of the functional components of an opinion and their properties (e.g. implicit or explicit) (wiebe2005annotating). The corresponding prediction task is commonly called fine-grained opinion mining or Aspect-Based Sentiment Analysis (ABSA) and consists in predicting the attitude of a speaker (named source) toward an object (named target).

The traditional approach developed in the ABSA task of the SEMEVAL campaign (pontiki2016semeval) consists of two steps. First, the system has to identify whether an opinion is expressed in a sentence, and to predict the corresponding target(s) (jakob2010extracting). Second, the polarity of each detected opinion expression is computed (wilson2005recognizing). More recently, the task of fine-grained opinion mining has been cast as a structured prediction problem where the different components of an opinion are predicted at the same time (garciaabstention; marcheggiani2014hierarchical). These structured models take advantage of the relationships existing between the different components of an opinion to help predict each of them. These methods rely on annotations gathered at the sentence level and possibly at the token level, while at the same time using review-level feedback such as star ratings. Unfortunately, the human interpretation of opinions expressed in the reviews is highly subjective, and the opinion aspects and their related polarities are sometimes expressed in an ambiguous way and are difficult to annotate (clavel2016sentiment; marcheggiani2014hierarchical). In the case of spoken language, this difficulty is even higher due to the lack of syntactic structure of some sentences and the presence of disfluencies that break the continuity of the discourse.

In this work, we propose flexible guidelines for the fine-grained annotation of opinion structures in the context of video-based movie reviews. The corresponding schema introduces links between coarse opinion recognition (at the review level) and the detection of token-level opinion functional components. This nested model ensures that the annotations are consistent at different levels of detail and can be used in joint prediction models (garciaabstention) to take into account the labeled information at each level. Since the working support of each annotator is a set of transcripts of spontaneous spoken reviews, the main difficulty is to provide guidelines that are flexible enough to match the structure of oral language while ensuring an acceptable agreement between multiple workers.

In Section 2, we present the previous studies concerning opinion annotation and especially the studies carried out on existing multimodal datasets. Then, we present the dataset we used (Section 3) and the protocol and the setting of our annotation campaign (Section 4). Finally, we present some results validating the dataset in Section 5.

2 Related work

The annotation of opinion in natural language is difficult due to the inherent subjectivity of the task and the need for a framework that ensures that different annotators work in a consistent way. An example of such a framework is the annotation scheme of the MPQA opinion corpus (news articles) (wiebe2005annotating), which relies on the annotation of private state frames, i.e. textual spans that describe a mental state of the author. In the case of an opinion, a frame can describe the target (what the private state is about), the source or holder (who is expressing the opinion) and other characteristics such as polarity, intensity or attitude. In (toprak2010sentence), the authors improve the annotation scheme for consumer reviews by splitting it into two successive steps where the polarity and the relevance to the topic of the sentences are first examined and then the different opinion components are identified. They also go beyond the annotation of private state frames and explicitly introduce some new labels: is reference and modifiers that link the different opinion components together. In this paper, we take a step in the same direction by proposing a fine-grained annotation of opinion components, and we propose a new, more flexible setting for the annotation of multimodal movie reviews.

Regarding multimodal review corpora, even though no fine-grained annotation of these datasets currently exists, different related annotation tasks have been proposed. Among these efforts, the ICT-MMMO corpus (wollmer2013youtube) consists of 370 movie review videos for which an annotator has given an overall label (positive, negative or neutral) to describe the viewpoint of the reviewer. The recent CMU Multimodal SDK (zadeh2018multi) provides a ready-to-use setting for building multimodal predictors based on opinionated or emotionally colored content. In the CMU-MOSI dataset (zadeh2016multimodal), 93 videos have been gathered and annotated at the segment level in terms of intensity of the opinion expressed. In their case, opinion is defined as a subjective segment for which a categorical label between 1 and 5 is given. This representation is restrictive since it provides no information about the target of the expressed opinion. Besides, it provides no information on the cues that have been used in order to choose a particular intensity. For the present first annotation campaign of fine-grained opinion in multimodal movie reviews, we use the Persuasive Opinion Multimedia (POM) dataset (park2014computational), which consists of 1000 video-based movie reviews that were originally annotated in terms of persuasiveness of each speaker. In the next section, we present the different features of the POM database that led us to select it.

3 The video opinion movie corpus

Our annotation campaign focuses on the identification of the opinions expressed in the POM dataset. In each video, a single speaker in frontal view gives his/her opinion on a movie that he/she has seen. The corpus contains 372 unique speakers and 600 unique movie titles. It was originally built in order to analyze the persuasiveness of the speakers, and no attention has so far been given to the content of the reviews themselves. We expect however that the use of multimodal data can be of interest when predicting polarized content. Figure 1 shows examples where it is clear that the visual content may be crucial to disambiguate the polarity of some reviews (for example in the hard case of irony).

Figure 1: Examples of frames taken from different videos of the dataset illustrating the visual expression of opinions: (a) negative opinion, (b) another negative opinion, (c) positive opinion, (d) neutral opinion.

This dataset has been chosen for running an annotation campaign for the following reasons:

1) The restricted setting helps the target identification: the documents contain opinionated content and are focused on a single type of target (here movie aspects) which makes it easier to build a typology of the possible targets for the target annotation task;

2) It provides an illustration of spontaneous spoken expressions of opinions in a multimodal context: the reviews are based on spoken language for which the video is also available, in contrast to previous sentiment analysis studies based on phone calls (clavel2013spontaneous). As a consequence, the annotation of the transcript is harder than for classical written language, especially at a fine-grained level;

3) We can build a hierarchical representation of opinions: other auxiliary labels are available such as star ratings given by the reviewer, sentence-based summary and persuasiveness. The fine-grained annotations can be used as intermediate representations to help predict these values (garciaabstention).

The POM dataset also provides a manual transcription for each review that we used in our annotation campaign. It contains 1000 reviews for which the average number of sentences per review is 15.1 and the average number of tokens per sentence is 22.5. In its current version, this dataset only contains annotations performed at the review level. Indicators of the persuasiveness of the speaker are available (professionalism, quality of argumentation …). Among the available data, the authors of (park2014computational) asked the annotators to evaluate the polarity of each review by guessing its corresponding five-level star rating. The results in Table 1 show that the reviews are strongly polarized, which indicates the presence of clear opinion expressions.

| Star rating | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Number of occurrences | 253 | 200 | 61 | 133 | 353 |

Table 1: Distribution of the star ratings at the review level
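As a quick check of the polarization claim, the counts of Table 1 can be turned into proportions. The short sketch below does only this arithmetic; grouping 1-2 stars as negative and 4-5 stars as positive is an assumption made here for illustration, not a grouping taken from the dataset documentation.

```python
# Star-rating counts copied from Table 1 (review level).
star_counts = {1: 253, 2: 200, 3: 61, 4: 133, 5: 353}

total = sum(star_counts.values())

# Assumed grouping (illustrative only): 1-2 stars = negative, 4-5 stars = positive.
negative = star_counts[1] + star_counts[2]
middle = star_counts[3]
positive = star_counts[4] + star_counts[5]

shares = {
    "negative": negative / total,
    "middle": middle / total,
    "positive": positive / total,
}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Under this grouping, only 61 of the 1000 reviews fall in the middle of the scale.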

In the next section, we detail our setup for the fine-grained opinion annotation of the POM dataset.

4 Annotation

4.1 Opinion definition

Following the path of previous opinion annotation studies (langlet2017web) and based on appraisal theory (martin2003language), we define an opinion as the expression of a judgement of quality or value of an object. This definition makes it possible to represent an opinion (here called attitude) as an evaluation (positive or negative) by a holder (for example the person who expresses her opinion) of a target (for example a service or a product). In the case of movie reviews, the opinion holder is the reviewer himself most of the time, but some exceptions exist. For example, in the sentence "my children like the characters of this cartoon", the holder is ’children’. The target component is defined here as a part of a hierarchically defined set of aspects (Wei2010) which covers the subparts of the object examined (here movies). Finally, the polarity component indicates whether the evaluation is positive or negative. In what follows, we define an opinion as an expression for which these 3 components exist and are not ambiguous. The present definition does not include: i) emotions without any target (Munezero2014AreTD) such as in the sentence "I was so scared", and ii) polar facts (jakob2010extracting), which denote facts that can be objectively verified but indirectly carry an evaluation, such as in "What a surprise he plays the bad guy once again". In Section 4.3, we provide guidelines to handle these cases in the annotation process.

4.2 Fine-grained annotation strategy

We want to build a set of annotations that identifies the grounds on which the opinions of the reviewer are perceived by an annotator, both at the expression and at the token levels. We expect that better localizing the words which are responsible for the expression of an opinion may help find the visual/audio features that carry the polarity information. Annotating this data is challenging due to the specific language structures of oral speech and the presence of disfluencies. We propose a two-level annotation method in order to (1) obtain a consistent identification of the opinion expressed in a sentence and the words responsible for this identification and (2) provide accessible guidelines to the annotators when the lack of grammatical structure of the sentences makes it difficult to delimit the phrases. For this second reason we define the expression level as ’the smallest span of words that contains all the words necessary for the recognition of an opinion’. These boundaries are in practice very flexible and might be very different from one annotator to the other.

Once an opinion is identified at the expression level, the annotator is asked in a second step to highlight its different components based on the tokens located inside the previously chosen boundaries. In what follows, we refer to this step as the token-level annotation. It consists in selecting the group of tokens indicating the target, polarity and holder of the opinion. In this case multiple spans can be responsible for the identification of each component. The instruction in such cases is to pick all the relevant spans for polarity tokens and only the most explicit one for target tokens. As an example, in the sentence "It’s the best movie I’ve seen", the selected polarity token is best, the holder token is I and the target token is movie since it is more explicit than It, which requires anaphora resolution to be understood.

In the end, we provide a dataset with the following features:

Span-level annotation:
- Opinion targets and polarities are annotated at the expression level.
- For each segment, the targets are categorized in a predefined set adapted to the context of movie reviews.
- The corresponding polarities are then categorized on a five-level intensity scale.

Token-level annotation:
- The words which led to the choice of the target category and polarity intensity are specifically annotated.

In the next section we study the difficulties specific to the corpus used.

4.3 Annotation challenges and guidelines

We have previously highlighted the specificities of the dataset, namely the oral nature of the discourse and especially the presence of disfluencies and ungrammatical phrases. For these reasons, defining precisely the textual span corresponding to an opinion is difficult. We tackled this issue by providing a rule of thumb to the annotators. Some difficulties remain, owing to the non-professional nature of the movie reviews: not only do the reviewers give their opinion about the movie itself, but they also take into account the background of the viewer and tend to give some advice. For this reason, the reviewers regularly give a recommendation for the viewers that are likely to enjoy the movie being examined. In this case the opinion of the reviewer him/herself toward the movie is unclear, as can be seen in the sentence: "This movie is perfect for kids". Consequently, we have asked the annotators to indicate whenever this type of sentence appears, in order to avoid adding the complexity of a dedicated treatment. This annotation takes the form of a boolean variable attached to an expression, as shown in Figure 2.

A second case is the comparison between the movie reviewed and other ones, such as the different elements of a saga or even related movies (such as movies with some actors in common or the same director). When this happens, the choice of the target of the opinion becomes ambiguous in sentences such as: "Obviously Harry Potter 1 is better than this one.". The comparison label dedicated to handling these cases is also defined in Figure 2.

Finally, some sentences may contain some polarized content conveying the attitude of the reviewer without holding an explicit target. Others may have no target at all when they consist of a sentiment expression. Such sentences have been referred to in previous work as Speaker’s emotional state (mohammad2016practical) or polar fact (jakob2010extracting). Since these sentences are hard to annotate (both in terms of target choice and boundary selection), we ask the annotators to specifically identify them using the sentiment tag. This enables us to separately treat the sentences in which the target is known but does not appear, as for example in "I must say that what I heard sounded good.", where the target is obviously the music even if it is not stated, and the sentences in which the target is really ambiguous or nonexistent.

These three labels are incorporated in the annotation tool in the form of selectable boolean variables tied to the span-level annotation. When at least one of the 3 labels recommendation, comparison, sentiment is active, we do not ask the annotators to perform the second step of token-level annotation since we do not consider these spans as real opinions.
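To make the two-level scheme concrete, the sketch below shows one possible in-memory representation of a span-level opinion, its three boolean labels and its token-level components. All class and field names are hypothetical, and the integer polarity encoding is an assumption; the sketch mirrors the textual description and not the export format of the annotation tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# A token span is a (start, end) pair of token indices, end exclusive.
Span = Tuple[int, int]


@dataclass
class OpinionAnnotation:
    """One span-level opinion with optional token-level components.

    Hypothetical structure written for illustration only.
    """
    span: Span                    # expression-level boundaries
    target_category: str          # e.g. "Overall", "Screenplay"
    polarity: int                 # assumed encoding: -2 (very negative) .. +2 (very positive)
    recommendation: bool = False  # the three boolean span-level labels
    comparison: bool = False
    sentiment: bool = False
    polarity_tokens: List[Span] = field(default_factory=list)  # all relevant spans
    target_tokens: Optional[Span] = None                       # most explicit span only
    holder_tokens: Optional[Span] = None

    def needs_token_level(self) -> bool:
        """Token-level annotation is skipped when any of the three labels is active."""
        return not (self.recommendation or self.comparison or self.sentiment)


# Example sentence from Section 4.2, with an assumed tokenization.
tokens = ["It", "'s", "the", "best", "movie", "I", "'ve", "seen"]
opinion = OpinionAnnotation(
    span=(0, 8),
    target_category="Overall",
    polarity=2,
    polarity_tokens=[(3, 4)],  # "best"
    target_tokens=(4, 5),      # "movie", preferred over the anaphoric "It"
    holder_tokens=(5, 6),      # "I"
)

# "This movie is perfect for kids": flagged as a recommendation,
# so no token-level components are requested.
advice = OpinionAnnotation(span=(0, 6), target_category="Overall",
                           polarity=1, recommendation=True)

print(opinion.needs_token_level(), advice.needs_token_level())
```

Keeping the flags on the span object makes the rule "no token-level step for flagged spans" a single check, which is convenient when filtering the data for a token-level predictor.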

4.4 Annotation schema

The annotation campaign has been run on a remotely hosted platform running the Webanno tool (de2016web). This choice was motivated by the simplicity of the configuration of multiple tag layers and the possibility of performing this configuration online. When logged into the platform, an annotator can select a transcript of a movie review assigned to him/her and each annotation added is automatically saved.

The annotation task is split in two consecutive subtasks described in Figure 2.

Figure 2: Annotation schema

We additionally asked the annotators to identify the name of the movie reviewed when available.

The scheme is a coarse-to-fine annotation where the worker has to successively identify the textual spans containing an opinion; identify the corresponding target, then the polarity; and finally select the words that guided his/her choice. The possible labels for the categorization tasks are defined in advance:

The taxonomy of targets is derived from the one of (zhuang2006movie) and corresponds to the hierarchy reported in Table 2. Once the target is identified, the corresponding polarity is also chosen on a five-level scale, from very negative to very positive.

| Movie Elements | Movie People | Support |
|---|---|---|
| Overall | Producer | Price |
| Screenplay | Actor / actress | Availability |
| Character design | Composer / singer / soundmaker | Other |
| Vision and special effects | Director | |
| Music and sound effects | Other people involved in movie making | |
| Atmosphere and mood | | |

Table 2: Predefined targets for movie review opinion annotation

In the context of this paper, we will only report results concerning Movie Elements to focus the discussion on a reduced set of labels. The results concerning Movie People and Support will be provided on the page of the dataset.

We detail the experimental protocol in the next section.

4.5 Protocol

We provided examples of annotated reviews in the annotation guide and trained three recruited workers on 150 reviews before beginning the annotation campaign. Then each of the 850 remaining reviews was annotated once by one of the workers. Each annotator was given access to a remotely hosted Webanno server where he/she could log in and annotate the transcripts of the reviews via a parameterized interface. Note that due to the explicitness of the reviews, we only provided the transcripts of the videos to the annotators, who did not have to watch the videos (but were aware of the oral nature of the original content). An annotated review taken from the annotation guide is given below in Figure 3:

Figure 3: Extract from the annotation of the review of the movie Cheaper by the Dozen

Since the tasks have been shared among different workers, an issue is the variability of the annotations. In the next section we focus on issues raised by the multi-annotator setting.

5 Validation of the annotation

We examine the quality of the annotation by two means: using a measure of the inter-annotator agreement on a data subset; and performing a study of the most influential linguistic features used by a structured linear model on the whole annotated corpus.

5.1 Inter-annotator agreement

We measure the inter-annotator agreement by computing Cohen’s kappa coefficient on two groups of 25 reviews, each annotated by two different annotators. We only gathered double annotations on a small subset of the dataset for annotation cost reasons. Since we are in a multilabel setting (spans of different opinions can overlap), we compute an agreement for each label: the compared objects are binary sequences labeled as 1 if the label is active and 0 otherwise. We denote by the letter A (resp. B) the reviews annotated by annotators 1 and 3 (resp. 2 and 3) and report the results at the span and sentence levels in Table 3 and Table 4.
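The per-label agreement described above can be sketched as follows. This is a minimal pure-Python implementation of the standard Cohen's kappa for two aligned binary sequences; the function name and the toy sequences are illustrative and are not taken from the annotation data.

```python
from typing import Sequence


def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa for two aligned binary (0/1) label sequences."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of positions with identical labels.
    p_observed = sum(1 for x, y in zip(a, b) if x == y) / n
    # Chance agreement computed from each annotator's marginal frequencies.
    p_a1 = sum(a) / n
    p_b1 = sum(b) / n
    p_expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    if p_expected == 1.0:
        # Both annotators used the same single label everywhere: kappa is undefined.
        return float("nan")
    return (p_observed - p_expected) / (1 - p_expected)


# Toy example: one binary sequence per annotator for a single label,
# with one entry per token (span level) or per sentence (sentence level).
annotator_1 = [1, 1, 0, 0, 1, 0, 0, 0]
annotator_2 = [1, 0, 0, 0, 1, 0, 1, 0]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Note that with this definition kappa is exactly 0 whenever one annotator never uses a label that the other annotator uses at least once, and it is undefined when neither of them uses it.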

| Target | Span level (A) | Span level (B) | Sentence level (A) | Sentence level (B) |
|---|---|---|---|---|
| Atmosphere and mood | 0.00 (69) | 0.00 (149) | 0.00 (12) | -0.01 (19) |
| Character design | 0.00 (12) | 0.00 (33) | 0.00 (1) | 0.00 (4) |
| Music and Sound effects | 0.48 (78) | - (0) | 0.57 (7) | - (0) |
| Overall | 0.32 (1818) | 0.46 (2268) | 0.41 (188) | 0.55 (201) |
| Screenplay | 0.23 (194) | 0.14 (187) | 0.23 (16) | 0.32 (18) |
| Vision and Special effect | 0.08 (120) | 0.32 (43) | 0.25 (8) | 0.50 (4) |

Table 3: Cohen’s kappa at the span and sentence level for the target annotations and total number of segments annotated by the two workers (in parentheses)

| Polarity | Span level (A) | Span level (B) | Sentence level (A) | Sentence level (B) |
|---|---|---|---|---|
| Negative | 0.30 (928) | 0.41 (1210) | 0.51 (106) | 0.64 (141) |
| Positive | 0.22 (675) | 0.34 (792) | 0.59 (145) | 0.55 (83) |
| Mixed - Neutral | 0.00 (47) | 0.40 (205) | 0.00 (5) | 0.53 (22) |
| Opinion presence | 0.37 (1650) | 0.52 (2207) | 0.44 (256) | 0.58 (246) |

Table 4: Cohen’s kappa at the span and sentence level for the polarity annotations and total number of segments annotated by the two workers (in parentheses)

We additionally defined a global kappa which indicates the confidence with which an opinion can be recognized. We refer to this row as Opinion presence (last row of Table 4). The corresponding kappas indicate moderate agreement (landis1977measurement), which is very encouraging for subjective phenomena such as opinions.

Regarding the target, the distribution of labels is imbalanced: the Overall label is strongly dominant whereas Character design and Music and Sound effects are very rare. Drawing conclusions on the rare labels is impossible, but we still observe that some moderate agreement can be measured for the Overall class at the sentence level.

Concerning the polarity annotations, the labels are slightly better balanced, leading to higher confidence in the results. Relaxing the annotation to the sentence level raises the agreement from low to moderate, which indicates that the low results at the span level are caused by the absence of hard annotation guidelines for the identification of the span boundaries.

5.2 Study of linguistic features using a CRF-based model

Since the results provided by the previous measures of inter-annotator agreement are not relevant for rare labels due to the size of the used sample, we additionally train a linear structured prediction model for the task of opinion classification both at the token and the sentence level. By taking as input features the tokens themselves, we show that the learned model focuses on relevant vocabulary even for rare labels.

We first consider the task of aspect and polarity prediction based on the span-level annotations: we take as input features the sum of the one-hot encodings of each word and of the ones situated in a 5-token window. Each output object is a sequence of labels (one per token) corresponding to the span-level annotation previously described. Next, we treat the same task at the sentence level. The input features consist of the sum of the one-hot encodings of each token in the sentence, and the output representation is built in the following way: we omit the polarity intensity information and introduce a Mixed class indicating whether a sentence contains both positive and negative opinions. We also include sentences containing only neutral opinions in this class. Otherwise, if the sentence contains at least one positive (respectively negative) opinion, it is labeled as positive (respectively negative).
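The input and output representations described above can be sketched as follows. Both function names are hypothetical; reading the "5-token window" as two tokens on each side of the current one is an assumption, as is the "no_polarity" label name used for sentences without any opinion.

```python
from collections import Counter
from typing import Dict, List


def window_features(tokens: List[str], half_width: int = 2) -> List[Dict[str, float]]:
    """Per-token bag of words: summed one-hot encodings over a token window.

    half_width=2 gives a 5-token window centred on each token (assumed reading).
    """
    features = []
    for i in range(len(tokens)):
        lo = max(0, i - half_width)
        hi = min(len(tokens), i + half_width + 1)
        counts = Counter(tokens[lo:hi])
        features.append({word: float(c) for word, c in counts.items()})
    return features


def sentence_label(opinion_polarities: List[str]) -> str:
    """Collapse the polarities of the opinions found in a sentence into one label.

    Each entry is "positive", "negative" or "neutral" (intensity already dropped).
    """
    kinds = set(opinion_polarities)
    if not kinds:
        return "no_polarity"
    # Mixed: both polarities present, or only neutral opinions.
    if kinds == {"neutral"} or {"positive", "negative"} <= kinds:
        return "mixed"
    if "positive" in kinds:
        return "positive"
    return "negative"


feats = window_features(["a", "really", "great", "great", "movie", "."])
labels = [
    sentence_label([]),
    sentence_label(["neutral"]),
    sentence_label(["positive", "neutral"]),
    sentence_label(["negative", "positive"]),
    sentence_label(["negative"]),
]
print(feats[2])
print(labels)
```

A list of string-to-float dictionaries is used for the features because this sparse format is easy to pass to a CRF toolkit; tokens are deliberately not lowercased, since capitalized tokens appear among the reported features.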

A linear Conditional Random Field (CRF) (lafferty2001conditional) model is trained for each label using the python-crfsuite library (https://python-crfsuite.readthedocs.io/en/latest/). We discarded the 150 texts used for the training of the annotators and split the remaining 850 texts into 5 folds. We tuned the parameters to optimize the macro-F1 score by cross-validation. We report the F1 score for each label averaged over the 5 folds in Table 5.
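The evaluation protocol can be illustrated with a small pure-Python sketch of the fold split and of the per-label and macro-averaged F1 scores. The round-robin fold assignment and all function names are assumptions made for illustration; the CRF training step itself is left out.

```python
from typing import List, Sequence


def make_folds(n_items: int, n_folds: int = 5) -> List[List[int]]:
    """Assign item indices to folds round-robin (deterministic, no shuffling)."""
    return [list(range(k, n_items, n_folds)) for k in range(n_folds)]


def f1_for_label(gold: Sequence[str], pred: Sequence[str], label: str) -> float:
    """F1 score of one label, treating it as the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold: Sequence[str], pred: Sequence[str], labels: Sequence[str]) -> float:
    """Unweighted mean of the per-label F1 scores."""
    return sum(f1_for_label(gold, pred, lab) for lab in labels) / len(labels)


folds = make_folds(850, 5)

# Toy predictions, invented for illustration.
gold = ["pos", "pos", "neg", "none", "none", "neg"]
pred = ["pos", "neg", "neg", "none", "pos", "neg"]
score = macro_f1(gold, pred, ["pos", "neg", "none"])
print([len(f) for f in folds], round(score, 3))
```

Macro-averaging gives rare labels the same weight as frequent ones, which is why it is a natural tuning criterion when the label distribution is as imbalanced as it is here.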

| Label | Sentence level | Span level |
|---|---|---|
| Positive | 0.67 (2218) | 0.39 (26071) |
| Negative | 0.56 (1795) | 0.26 (22988) |
| Mixed | 0.11 (299) | - |
| No polarity | 0.87 (8737) | 0.92 (243850) |

Table 5: F1 score for token and sentence level polarity prediction and corresponding number of occurrences in the dataset

The reported scores are obtained both at the token level and at the sentence level. A crucial aspect is their dependency on the number of examples of each label. The predictions obtained for rare labels such as Mixed exhibit high precision and low recall. This behavior is due to the presence of specific vocabularies for which the predictor is guaranteed to accurately predict the polarity. We can display the vocabulary on which the model bases its predictions by analyzing the weights learned by our model. Let $x = (x_1, \dots, x_T)$ and $y = (y_1, \dots, y_T)$ be two sequences of vectors of length $T$, where $x_t$ is the input feature vector of token $t$ and $y_t$ encodes its label. A linear-chain conditional random field parameterizes the conditional distribution in the form:

$$p_\theta(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{t=1}^{T} \sum_{k} \theta_k f_k(y_{t-1}, y_t, x_t) \right)$$

where $\theta$ is the vector of learned weights, $Z(x)$ is an input-dependent normalization term and $\{f_k\}$ is a set of feature functions. In the setting described here, these feature functions can be grouped in two categories: (i) output-output feature functions, which do not depend on input data, and (ii) input-output feature functions of the form:

$$f_{w,l}(y_t, x_t) = \mathbb{1}[y_t = l] \, x_{t,w}$$

where $x_{t,w}$ is the component of $x_t$ associated with word $w$ and $l$ is a label. We only consider input-output feature functions and report the couples $(w, l)$ with the highest weights in Table 6. We consider these weights as scores since couples with higher values tend to increase the likelihood of the sequence $y$.
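Ranking the (token, label) couples by learned weight can be sketched as below. The function name and the toy weights are invented for illustration and are not the values learned in this study; with python-crfsuite, a dictionary of this shape could for instance be read from `Tagger.info().state_features`, which is an assumption about tooling rather than a detail stated here.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def top_tokens_per_label(
    weights: Dict[Tuple[str, str], float], k: int = 3
) -> Dict[str, List[str]]:
    """Return, for each label, the k tokens with the highest input-output weight."""
    by_label = defaultdict(list)
    for (token, label), weight in weights.items():
        by_label[label].append((weight, token))
    return {
        # Sort by decreasing weight, breaking ties alphabetically.
        label: [tok for _, tok in sorted(pairs, key=lambda p: (-p[0], p[1]))[:k]]
        for label, pairs in by_label.items()
    }


# Toy weights keyed by (token, label), for illustration only.
toy_weights = {
    ("great", "Positive"): 2.1,
    ("good", "Positive"): 1.7,
    ("the", "Positive"): 0.1,
    ("bad", "Negative"): 1.9,
    ("not", "Negative"): 1.2,
    ("great", "Negative"): -1.5,
}
top = top_tokens_per_label(toy_weights, k=2)
print(top)
```

Negative weights, such as the one attached to the couple ("great", "Negative") in the toy data, lower the score of a label and therefore never reach the top of the ranking.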

| Label | Sentence level | Span level |
|---|---|---|
| Mixed | ’okay’, ’average’ | - |
| Positive | ’hilarious’, ’amazing’ | ’great’, ’cool’, ’good’ |
| Negative | ’disappointing’, ’disappointed’, ’boring’ | ’terrible’, ’not’, ’bad’ |
| No polarity | ’Thanks’, ’Thank’, ’review’ | punctuation, ’but’, ’and’ |

Table 6: Highest score input features for polarity label prediction at the sentence and the token level

The top scored vocabulary raises two remarks:

1) The polarized sentences and spans are mainly recognized through evaluative adjectives which are obviously linked to the corresponding label.

2) The absence of polarity is treated in a different way at the sentence and span level.

At the sentence level, the absence of polarity is systematic in sentences that introduce or conclude the review. The displayed vocabulary is characteristic of concluding sentences. At the span level, the punctuation and conjunctions which separate different opinions play an important role. These tokens receive a high score since they appear specifically at the boundary of an opinion.

Finally we train a model for sentence-level target prediction and report the results in Table 7:

| Corresponding label (F1 score) | Highest score tokens |
|---|---|
| Overall (0.63) | ’boring’, ’disappointing’, ’great’, ’awesome’, ’terrible’, ’wonderful’ |
| Screenplay (0.35) | ’plot’, ’storyline’, ’story’, ’interesting’, ’predictable’ |
| Vision and Special effect (0.26) | ’beautifully’, ’animation’, ’effects’, ’cinematography’, ’visually’, ’graphics’, ’picture’ |
| Music and Sound effects (0.04) | ’soundtrack’, ’music’, ’bands’, ’quality’, ’song’, ’sound’, ’good’ |
| Character design (0.09) | ’characters’, ’character’, ’oh’, ’awful’, ’portrayal’, ’spinning’ |
| Atmosphere and mood (0.35) | ’funny’, ’fun’, ’hilarious’, ’funniest’, ’cheesy’, ’laughing’ |

Table 7: Highest score input features for aspect prediction at the sentence level

Once again the low results are characterized by a low recall: a few words characterizing the presence of the target appear in the top score vocabulary, but as the score decreases, some non-characteristic words quickly appear (’good’, ’great’ for Music and Sound effects, ’oh’, ’awful’, ’He’ for Character design). These labels are specifically hard to predict due to the diversity of the vocabulary involved and the low number of examples available.

The Overall category is characterized by polarity words only. This is consistent with our annotation instructions: an opinion is labeled as Overall if it targets the overall movie or if no category in the proposed hierarchy fits the opinion expressed. As a consequence, the Overall category is characterized by polarity words that indicate an opinion but do not indicate a specific aspect.

6 Conclusion and perspectives

In this paper, we have presented the protocol and results of a fine-grained opinion annotation campaign for spoken language, based on a multimodal movie review dataset. The resulting annotations show low inter-annotator agreement at the token level but achieve better values when the annotation granularity is relaxed to the sentence level. In addition, the linear structured predictor learns meaningful features even for the prediction of scarce labels. In future work, we plan to jointly use the different levels of granularity (from the review to the token level) in a hierarchical prediction framework in order to increase the accuracy of the predictor on each individual task.

References