SubjQA: A Dataset for Subjectivity and Review Comprehension

04/29/2020, by Johannes Bjerva et al. (Københavns Universitet and Megagon Labs)

Subjectivity is the expression of internal opinions or beliefs which cannot be objectively observed or verified, and has been shown to be important for sentiment analysis and word-sense disambiguation. Furthermore, subjectivity is an important aspect of user-generated data. In spite of this, subjectivity has not been investigated in contexts where such data is widespread, such as in question answering (QA). We therefore investigate the relationship between subjectivity and QA, while developing a new dataset. We compare and contrast with analyses from previous work, and verify that findings regarding subjectivity still hold when using recently developed NLP architectures. We find that subjectivity is also an important feature in the case of QA, albeit with more intricate interactions between subjectivity and QA performance. For instance, a subjective question may or may not be associated with a subjective answer. We release an English QA dataset (SubjQA) based on customer reviews, containing subjectivity annotations for questions and answer spans across 6 distinct domains.




1 Introduction

Subjectivity is ubiquitous in our use of language (Banfield, 1982; Quirk et al., 1985; Wiebe et al., 1999; Benamara et al., 2017), and is therefore an important aspect to consider in Natural Language Processing (NLP). For example, subjectivity can be associated with different senses of the same word.

For instance, "boiling" is objective in the context of hot water, but subjective in the context of a person boiling with anger (Wiebe and Mihalcea, 2006). The same applies to sentences in discourse contexts (Pang and Lee, 2004). While early work has shown subjectivity to be an important feature for low-level tasks such as word-sense disambiguation and sentiment analysis, subjectivity in NLP has not been explored in many contexts where it is prevalent.

In recent years, there has been renewed interest in areas of NLP for which subjectivity is important, and a specific topic of interest is question answering (QA). This includes work on aspect extraction (Poria et al., 2016), opinion mining (Sun et al., 2017), and community question answering (Gupta et al., 2019). Many of these QA systems are based on representation learning architectures. However, it is unclear whether the findings of previous work on subjectivity still apply to such architectures, including transformer-based language models (Devlin et al., 2018; Radford et al., 2019).

The interactions between QA and subjectivity are even more relevant today as users' natural search criteria have become more subjective. Their questions can often be answered by online customer reviews, which tend to be highly subjective as well. Although QA over customer reviews has gained traction recently with the availability of new datasets and architectures (Gupta et al., 2019; Grail and Perez, 2018; Fan et al., 2019; Xu et al., 2019a; Li et al., 2019), these are agnostic with respect to how subjectivity is expressed in the questions and the reviews. Furthermore, the datasets are either too small (< 2000 questions) or have target-specific question types (e.g., yes-no). Consequently, most QA systems are only trained to find answers in factual data, such as Wikipedia articles and news (Rajpurkar et al., 2018; Reddy et al., 2019; Joshi et al., 2017; Trischler et al., 2017).

In this work, we investigate the relation between subjectivity and question answering (QA) in the context of customer reviews. As no such QA dataset exists, we construct a new dataset, SubjQA. In order to capture subjectivity, our data collection method builds on recent developments in opinion extraction and matrix factorization, instead of relying on the linguistic similarity between the questions and the reviews (Gupta et al., 2019). SubjQA includes over 10,000 English examples spanning 6 domains that cover both products and services. We find that a large percentage of both questions and answers in SubjQA are subjective: 73% of the questions and 74% of the answers. Experiments show that existing QA systems trained to find factual answers struggle to understand subjective questions and reviews. For instance, fine-tuning BERT (Devlin et al., 2018), a state-of-the-art QA model, yields strong results on SQuAD (Rajpurkar et al., 2016), but achieves a substantially lower average score across the different domains of SubjQA.

We develop a subjectivity-aware QA model by extending an existing model in a multi-task learning paradigm. The model is trained to predict the subjectivity label and the answer span simultaneously, and does not require subjectivity labels at test time. We find that our QA model improves over the baseline on average across the different domains of SubjQA.


Our contributions are as follows:

  • We release a challenging QA dataset with subjectivity labels for questions and answers, spanning 6 domains;

  • We investigate the relationship between subjectivity and a modern NLP task;

  • We develop a subjectivity-aware QA model;

  • We verify the findings of previous work on subjectivity, using recent NLP architectures.

2 Subjectivity

Written text, as an expression of language, contains information on several linguistic levels, many of which have been thoroughly explored in NLP. (Subjectivity is not restricted to written texts, although we focus on this modality here.) For instance, both the semantic content of text and the (surface) forms of words and sentences, as expressed through syntax and morphology, have been at the core of the field for decades. However, another level of information can be found when trying to observe or encode the so-called private states of the writer (Quirk et al., 1985). Examples of private states include the opinions and beliefs of a writer, which are, by definition, not available for verification or objective observation. It is this type of state which is referred to as subjectivity (Banfield, 1982; Banea et al., 2011).

Whereas subjectivity has been investigated in isolation, it can be argued that subjectivity is only meaningful given sufficient context. In spite of this, most previous work has focused on annotating words (Heise, 2001), word senses (Durkin and Manning, 1989; Wiebe and Mihalcea, 2006), or sentences (Pang and Lee, 2004), with the notable exception of Wiebe et al. (2005), who investigate subjectivity in phrases in the context of a text or conversation. The absence of work investigating broader contexts can perhaps be attributed to the relatively recent emergence of models in NLP which allow for contexts to be incorporated efficiently, e.g. via architectures based on transformers (Vaswani et al., 2017).

As subjectivity relies heavily on context, and we have access to methods which can encode such context, what then of access to data which encodes subjectivity? We argue that in order to fully investigate research questions dealing with subjectivity in contexts, a large-scale dataset is needed. We choose to frame this as a QA dataset, as it not only offers the potential to investigate interactions in a single contiguous document, but also allows interactions between contexts, where parts may be subjective and other parts may be objective. Concretely, one might seek to investigate the interactions between an objective question and a subjective answer.

3 Data Collection

Figure 1: Our data collection pipeline

We found two limitations of existing datasets and collection strategies that motivated us to create a new QA dataset for understanding subjectivity in QA. First, data collection methods (Gupta et al., 2019; Xu et al., 2019a) often rely on the linguistic similarity between the questions and the reviews (e.g., information retrieval). However, subjective questions may not always use the same words or phrases as the review. Consider the examples below. The answer span 'vegan dishes' is semantically similar to the first question. The answer to the second, more subjective question has little linguistic similarity to the question.

Example 1

Q: Is the restaurant vegan friendly?

Review: …many vegan dishes on its menu.

Q: Does the restaurant have a romantic vibe?

Review: Amazing selection of wines, perfect for a date night.

Secondly, existing review-based datasets are small and not very diverse in terms of question topics and types (Xu et al., 2019b; Gupta et al., 2019). We therefore consider reviews about both products and services from 6 different domains, namely TripAdvisor, Restaurants, Movies, Books, Electronics, and Grocery. We use the data of Wang et al. (2010) for TripAdvisor, and Yelp data for Restaurants, restricting ourselves to the subsets for which an open-source opinion extractor was available (Li et al., 2019). For the remaining domains, we use the data of McAuley and Yang (2016), which contains reviews from product pages spanning multiple categories. We target categories that had more opinion expressions than others, as determined by an opinion extractor.

Figure 1 depicts our data collection pipeline, which builds upon recent developments in opinion extraction and matrix factorization. An opinion extractor is crucial to identify subjective or opinionated expressions, which IR-based methods cannot. Matrix factorization, on the other hand, helps identify which of these expressions are related, based on their co-occurrence in the review corpora instead of their linguistic similarity. To the best of our knowledge, we are the first to explore such a method to construct a challenging subjective QA dataset.

Given a review corpus, we extract opinions about various aspects of the items being reviewed (Opinion Extraction). Consider the following review snippets and extractions.

Example 2

Review: ..character development was quite impressive.

Extraction: ‹‘impressive’, ‘character development’›

Review: 3 stars for good power and good writing.

Extraction: ‹‘good’, ‘writing’›

In the next (Neighborhood Model Construction) step, we characterize the items being reviewed and their subjective extractions using latent features. In particular, we use matrix factorization techniques (Riedel et al., 2013) to construct a neighborhood model via a set of weights, where each weight corresponds to a directed association strength between two extractions. For instance, the extractions ‹‘impressive’, ‘character development’› and ‹‘good’, ‘writing’› in Example 2 could have a high similarity score. This neighborhood model forms the core of data collection. We select a subset of extractions as topics (Topic Selection) and ask crowd workers to translate them into natural language questions (Question Generation). For each topic, a subset of its neighbors, and reviews which mention them, are selected (Review Selection). In this manner, question-review pairs are generated based on the neighborhood model.

Finally, we present each question-review pair to crowdworkers who highlight an answer span in the review. Additionally, they provide subjectivity scores for both the questions and the answer span.

3.1 Opinion Extraction

An opinion extractor processes all reviews and finds extractions ‹X, Y› where X represents an opinion expressed on aspect Y. Table 1 shows sample extractions from different domains. We use OpineDB (Li et al., 2019), a state-of-the-art opinion extractor, for restaurants and hotels. For other domains where OpineDB was not available, we use the syntactic extraction patterns of Abbasi Moghaddam (2013).

Domain | Opinion Span | Aspect Span
Restaurants | huge | lineup
Hospitality | no | free wifi
Books | hilarious | book
Movies | not believable | characters
Electronics | impressive | sound
Grocery | high | sodium level
Table 1: Example extractions from different domains
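As an illustration of the pattern-based approach, the following is a toy sketch of opinion extraction over POS-tagged tokens: an adjective (optionally negated) followed by a noun yields an ‹opinion, aspect› pair. The simplified POS tags and the pattern itself are illustrative stand-ins; the paper uses OpineDB and the patterns of Abbasi Moghaddam (2013), which are considerably richer.

```python
# Toy pattern-based opinion extractor: scan (word, POS) pairs for
# [not]? ADJ+ NOUN+ sequences and emit <opinion, aspect> extractions.
# POS tags here are simplified stand-ins for a real tagger's output.

def extract_opinions(tagged_tokens):
    """tagged_tokens: list of (word, pos) pairs.
    Returns a list of (opinion_span, aspect_span) extractions."""
    extractions = []
    i = 0
    while i < len(tagged_tokens):
        word, _ = tagged_tokens[i]
        neg = word.lower() == "not"          # optional negation marker
        j = i + 1 if neg else i
        adjs = []
        while j < len(tagged_tokens) and tagged_tokens[j][1] == "ADJ":
            adjs.append(tagged_tokens[j][0])
            j += 1
        nouns = []
        while j < len(tagged_tokens) and tagged_tokens[j][1] == "NOUN":
            nouns.append(tagged_tokens[j][0])
            j += 1
        if adjs and nouns:
            opinion = ("not " if neg else "") + " ".join(adjs)
            extractions.append((opinion, " ".join(nouns)))
            i = j
        else:
            i += 1
    return extractions

tagged = [("impressive", "ADJ"), ("sound", "NOUN"),
          ("and", "CONJ"), ("not", "PART"),
          ("believable", "ADJ"), ("characters", "NOUN")]
print(extract_opinions(tagged))
```

On the toy input this recovers the Electronics and Movies extractions from Table 1.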

3.2 Neighborhood Model Construction

We rely on matrix factorization to learn dense representations for items and extractions, and to identify similar extractions. As depicted in Figure 2, we organize the extractions into a matrix M, where each row corresponds to an item being reviewed and each column corresponds to an extraction. The value M_ij denotes the frequency of extraction j in reviews of item i. Given M and a latent feature size k, we obtain extraction embeddings using non-negative matrix factorization. Concretely, each value M_ij is approximated by the dot product of two latent vectors of size k:

M_ij ≈ a_i · e_j,

where a_i is the latent vector of item i and e_j is the embedding of extraction j.
For each extraction, we find its neighbors based on the cosine similarity of their embeddings. (Details about hyper-parameters are included in the Appendix.)

Figure 2: Learning representations of extractions via non-negative matrix factorization
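The factorization step above can be sketched with scikit-learn's NMF; the item-extraction counts, the extraction names, and the latent size k below are illustrative, not the paper's actual corpus or hyper-parameters.

```python
# Minimal sketch of the neighborhood-model construction: factorize the
# item-by-extraction frequency matrix M with non-negative matrix
# factorization, take extraction embeddings from the factor H, and
# score neighbors by cosine similarity.
import numpy as np
from sklearn.decomposition import NMF
from sklearn.metrics.pairwise import cosine_similarity

extractions = ["<impressive, character development>", "<good, writing>",
               "<huge, lineup>", "<no, free wifi>"]
# 5 items x 4 extractions; the first two extractions co-occur on the
# same (book-like) items, the last two on different items.
M = np.array([[3, 2, 0, 0],
              [4, 3, 0, 0],
              [2, 4, 0, 0],
              [0, 0, 5, 1],
              [0, 0, 2, 4]], dtype=float)

k = 2  # latent feature size (a hyper-parameter)
model = NMF(n_components=k, init="nndsvda", random_state=0, max_iter=500)
model.fit_transform(M)          # item embeddings (unused here)
E = model.components_.T         # (n_extractions, k) extraction embeddings

sim = cosine_similarity(E)      # pairwise neighbor scores
print(round(sim[0, 1], 2), round(sim[0, 2], 2))
```

Co-occurring extractions end up as close neighbors even when they share no words, which is exactly the property the pipeline relies on instead of linguistic similarity.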

3.3 Topic and Review Selection

We next identify a subset of extractions to be used as topics for the questions. In order to maximize the diversity and difficulty in the dataset, we use the following criteria developed iteratively based on manual inspection followed by user experiments.

  1. Cosine Similarity: We prune neighbors of an extraction which have low cosine similarity (below a threshold). Irrelevant neighbors can lead to noisy topic-review pairs which would be marked non-answerable by the annotators.

  2. Semantic Similarity: We prune neighbors that are linguistically similar (above a similarity threshold, computed using GloVe embeddings provided by spaCy), as they yield easy topic-review pairs.

  3. Diversity: To promote diversity in topics and reviews, we select extractions which have many neighbors.

  4. Frequency: To ensure selected topics are also popular, we select a topic if: a) its frequency is higher than the median frequency of all extractions, and b) it has at least one neighbor that is more frequent than the topic itself.

We pair each topic with reviews that mention one of its neighbors. The key benefit of a factorization-based method is that it is not based solely on linguistic similarity, and thus forces a QA system to understand subjectivity in questions and reviews.
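The selection criteria above can be sketched as a simple filter; the thresholds here are illustrative placeholders (the paper's actual cut-offs are in its Appendix), and the candidate data is toy input.

```python
# Sketch of the topic-selection filters: prune irrelevant and
# near-paraphrase neighbors, then keep diverse, popular topics.
from statistics import median

COS_MIN, SEM_MAX, MIN_NEIGHBORS = 0.4, 0.8, 2  # hypothetical thresholds

def select_topics(candidates, freq):
    """candidates: {topic: [(neighbor, cos_sim, sem_sim), ...]}
    freq: {extraction: corpus frequency}. Returns {topic: neighbors}."""
    med = median(freq.values())
    selected = {}
    for topic, nbrs in candidates.items():
        # 1) prune low-cosine neighbors, 2) prune near-paraphrases
        kept = [(n, c, s) for n, c, s in nbrs
                if c >= COS_MIN and s <= SEM_MAX]
        # 3) diversity: require enough surviving neighbors
        if len(kept) < MIN_NEIGHBORS:
            continue
        # 4) frequency: popular topic with a more frequent neighbor
        if freq[topic] > med and any(freq[n] > freq[topic] for n, _, _ in kept):
            selected[topic] = [n for n, _, _ in kept]
    return selected

freq = {"good writing": 11, "great plot": 12, "nice cover": 3,
        "boring story": 10}
candidates = {
    "good writing": [("great plot", 0.7, 0.5), ("boring story", 0.6, 0.4),
                     ("nice cover", 0.2, 0.1)],
    "nice cover":   [("great plot", 0.5, 0.3), ("good writing", 0.5, 0.2)],
}
print(select_topics(candidates, freq))
```

In the toy run, the frequent topic with relevant, non-paraphrase neighbors survives, while the rare one is pruned by the frequency criterion.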

3.4 Question Generation

Each selected topic is presented to a human annotator together with a review that mentions that topic. We ask the annotator to write a question about the topic that can be answered by the review. For example, ‹‘good’, ‘writing’› could be translated to “Is the writing any good?” or “How is the writing?”.

3.5 Answer-Span and Subjectivity Labeling

Lastly, we present each question and its corresponding review to human annotators (crowdworkers), who provide a subjectivity score for the question on a 1 to 5 scale, based on whether it seeks an opinion (e.g., “How good is this book?”) or factual information (e.g., “Is this a hard-cover?”). Additionally, we ask them to highlight the shortest answer span in the review, or to mark the question as unanswerable. They also provide subjectivity scores for the answer spans. We provide details of our neighborhood model construction and crowdsourcing experiments in the Appendix.

4 Dataset Analysis

In this section, we analyze the questions and answers to understand the properties of our SubjQA dataset. We present the dataset statistics in Section 4.1. We then analyze the diversity and difficulty of the questions. We also discuss the distributions of subjectivity and answerability in our dataset. Additionally, we manually inspect 100 randomly chosen questions from the development set in Section 4.3 to understand the challenges posed by subjectivity of the questions and/or the answers.

Domain Train Dev Test Total
TripAdvisor 1165 230 512 1686
Restaurants 1400 267 266 1683
Movies 1369 261 291 1677
Books 1314 256 345 1668
Electronics 1295 255 358 1659
Grocery 1124 218 591 1725
Table 2: No. of examples in each domain split.

4.1 Data Statistics

Table 2 summarizes the number of examples we collected for different domains. To generate the train, development, and test splits, we partition the topics into training (80%), dev (10%) and test (10%) sets. We partition the questions and reviews based on the partitioning of the topics.
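The topic-level partitioning described above can be sketched as follows; the toy questions and topics are illustrative. Splitting by topic rather than by question ensures no topic leaks across train, dev, and test.

```python
# Sketch of the 80/10/10 topic-level split: topics are partitioned
# first, and every question inherits the split of its topic.
import random

def split_by_topic(questions, seed=0):
    """questions: list of (question, topic) pairs."""
    topics = sorted({t for _, t in questions})
    rng = random.Random(seed)
    rng.shuffle(topics)
    cut1, cut2 = int(0.8 * len(topics)), int(0.9 * len(topics))
    assign = {t: ("train" if i < cut1 else "dev" if i < cut2 else "test")
              for i, t in enumerate(topics)}
    splits = {"train": [], "dev": [], "test": []}
    for q, t in questions:
        splits[assign[t]].append(q)
    return splits, assign

questions = [("q%d" % i, "topic%d" % (i % 10)) for i in range(50)]
splits, assign = split_by_topic(questions)
print({name: len(qs) for name, qs in splits.items()})
```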

Domain Review len Q len A len % answerable
TripAdvisor 187.25 5.66 6.71 78.17
Restaurants 185.40 5.44 6.67 60.72
Movies 331.56 5.59 7.32 55.69
Books 285.47 5.78 7.78 52.99
Electronics 249.44 5.56 6.98 58.89
Grocery 164.75 5.44 7.25 64.69
Table 3: Domain statistics. Lengths are in number of tokens.

4.2 Difficulty and Diversity of Questions

Domain # questions # aspects % boolean Q
TripAdvisor 1411 171 16.13
Restaurants 1553 238 17.29
Movies 1556 228 15.56
Books 1517 231 16.90
Electronics 1535 314 14.94
Grocery 1333 163 14.78
Table 4: Diversity of questions and topics
Figure 3: The distribution of prefixes of questions. The outermost ring shows unigram prefixes (e.g., 57.9% questions start with how). The middle and innermost rings correspond to bigrams and trigrams, respectively.
Reasoning Percent. Example
Lexical 18% Q: How small was the hotel bathroom?
R: …Bathroom on the small side with older fixtures…
Paraphrase 28% Q: How amazing was the end?
R: …The ending was absolutely awesome, it makes the experience not so …
Indirect 43% Q: How was the plot of the movie?
R: …simply because there’s so much going on, so much action, so many complex ..
Insufficient 11% Q: How do you like the episode?
R: For a show that I think was broadcast in HighDef, it seems impossible that the…
Table 5: Types of reasoning required for the various domains.

As can be seen in Table 3, reviews in different domains tend to vary in length. Answer spans tend to be 6-7 tokens long, compared to 2-3 tokens in SQuAD. Furthermore, the average linguistic similarity between the questions and the answer spans was low: 0.7705, computed using word2vec. These characteristics of SubjQA contribute to making it an interesting and challenging QA dataset.

Table 4 shows the number of distinct questions and topics in each domain. On average, we collected 1500 questions covering 225 aspects per domain. We also automatically categorize the boolean questions based on a lexicon of question prefixes. Unlike other review-based QA datasets (Gupta et al., 2019), SubjQA contains more diverse questions, the majority of which are not yes/no questions. The questions are also linguistically varied, as indicated by the trigram prefixes of the questions (Figure 3). Most of the frequent trigram prefixes in SubjQA (e.g., how is the, how was the, how do you) are almost entirely missing in SQuAD and Gupta et al. (2019). The diversity of questions in SubjQA demonstrates challenges unique to the dataset.
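The automatic boolean-question categorization can be sketched as a prefix check; the lexicon below is a small illustrative subset, not the paper's full prefix list.

```python
# Sketch of the boolean (yes/no) question tagger: a question is tagged
# boolean if it starts with a prefix from a small lexicon.
BOOLEAN_PREFIXES = ("is ", "are ", "was ", "were ", "does ", "do ",
                    "did ", "can ", "could ", "will ", "would ",
                    "has ", "have ", "should ")

def is_boolean_question(question):
    """True if the (lowercased) question starts with a yes/no prefix."""
    return question.lower().lstrip().startswith(BOOLEAN_PREFIXES)

print(is_boolean_question("Is the restaurant vegan friendly?"))  # True
print(is_boolean_question("How is the writing?"))                # False
```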

4.3 Data Quality Assessment

We randomly sample 100 answerable questions to manually categorize them according to their reasoning types. Table 5 shows the distribution of the reasoning types and representative examples. As expected, since a large fraction of the questions are subjective, they cannot be simply answered using a keyword-search over the reviews or by paraphrasing the input question. Answering such questions requires a much deeper understanding of the reviews. Since the labels are crowdsourced, a small fraction of the answer spans are noisy.

We also categorized the answers based on answer-types. We observed that 64% of the answer spans were independent clauses (e.g., the staff was very helpful and friendly), 25% were noun phrases (e.g., great bed) and 11% were incomplete clauses/spans (e.g., so much action). This supports our argument that often subjective questions cannot be answered simply by an adjective or noun phrase.

4.4 Answerability and Subjectivity

The dataset construction relies on a neighborhood model generated automatically using factorization, which captures co-occurrence signals instead of linguistic signals. Consequently, the generated dataset is not guaranteed to only contain answerable questions. About 65% of the questions in the dataset are answerable from the reviews (see Table 7). However, unlike Gupta et al. (2019), we do not predict answerability using a classifier; the answerability labels are provided by the crowdworkers instead, and are therefore more reliable.

Table 7 shows the subjectivity distribution in questions and answer spans across different domains. A vast majority of the questions we collected are subjective, which is not surprising since we selected topics from opinion extractions. A large fraction of the subjective questions (70%) were also answerable from their reviews.

We also compare the subjectivity of questions with the subjectivity of answers. As can be seen in Table 6, the subjectivity of an answer is strongly correlated with the subjectivity of the question. Subjective questions often have answers that are also subjective. Similarly, factual questions, with few exceptions, have factual answers. This indicates that a QA system must understand how subjectivity is expressed in a question to correctly find its answer. Most domains have around 75% subjective questions. However, the BERT-QA model fine-tuned on each domain achieves 80% F1 on subjective questions in movies and books, but only 67-73% F1 on subjective questions in grocery and electronics. Future QA systems for user-generated content, such as for customer support, should therefore model subjectivity explicitly.

subj. Q fact. Q
subj. A 79.8% 1.31%
fact. A 1.29% 17.58%
Table 6: Subjectivity distribution in SubjQA.
Domain % subj. Q % answerable % subj. A
TripAdvisor 74.49 83.20 75.20
Restaurants 76.11 65.72 76.29
Movies 74.41 62.09 74.59
Books 75.77 58.86 75.35
Electronics 69.80 65.37 69.98
Grocery 73.21 70.22 73.15
Table 7: Statistics on subjective Q, answerability, and subjective A per domain in SubjQA.

5 Subjectivity Modeling

We now turn to experiments on subjectivity, first investigating claims made by previous work, and whether they still hold when using recently developed architectures, before investigating how to model subjectivity in QA.

5.1 Subjectivity in Sentiment Analysis

Pang and Lee (2004) have shown that subjectivity is an important feature for sentiment analysis: sorting sentences by their estimated subjectivity scores, and only using the top N such sentences, allows for a more efficient and better-performing sentiment analysis system than considering both subjective and objective sentences equally. We first investigate whether the same findings hold true when subjectivity is estimated using transformer-based architectures. Our setup is based on a pre-trained uncased BERT model. Following the approach of Devlin et al. (2018), we take the final hidden state corresponding to the special [CLS] token of an input sequence as its representation. We then predict the subjectivity of the sentence by passing its representation through a feed-forward neural network, optimized with SGD. As a baseline, we compare this with the subjectivity scores of TextBlob, a sentiment lexicon-based method; we consider sentences with a high TextBlob subjectivity score (above a threshold) as subjective.

We evaluate the methods on the subjectivity data from Pang and Lee (2004) and the subjectivity labels made available in our dataset (SubjQA). Unsurprisingly, a contextually-aware classifier vastly outperforms a word-based classifier, highlighting the importance of context in subjectivity analysis (see Table 8). Furthermore, predicting subjectivity in SubjQA is more challenging than in IMDB, because SubjQA spans multiple domains.

Method | IMDB | SubjQA
Word-based (TextBlob) | 61.90 | 57.50
BERT fine-tuned | 88.20 | 62.77
Table 8: Subjectivity prediction accuracies on IMDB data (Pang and Lee, 2004) and our dataset (SubjQA).

We further investigate whether our subjectivity classifier helps with the sentiment analysis task. We implement a sentiment analysis classifier which takes the special [CLS] token of an input sequence as the representation. We train this classifier by replicating the conditions described in Pang and Lee (2004). As shown in Figure 4, giving the sentiment classifier access to the sentences a contextually-aware subjectivity classifier deems subjective improves performance on sentiment analysis, outperforming baselines that use all sentences or only objective sentences.
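The Pang and Lee style pipeline can be sketched as: score each sentence for subjectivity, keep the top-N most subjective sentences, and feed only those to the sentiment classifier. The toy opinion-word scorer below is a hypothetical stand-in for the BERT-based subjectivity classifier described above.

```python
# Sketch of subjectivity-filtered sentiment analysis: rank sentences by
# a subjectivity score and keep only the top N for the downstream
# sentiment classifier.

def top_n_subjective(sentences, subjectivity_score, n):
    """Return the n sentences with the highest subjectivity score."""
    ranked = sorted(sentences, key=subjectivity_score, reverse=True)
    return ranked[:n]

# Toy scorer: fraction of opinion words (purely illustrative).
OPINION_WORDS = {"great", "awful", "amazing", "boring", "loved"}

def toy_score(sentence):
    words = sentence.lower().split()
    return sum(w in OPINION_WORDS for w in words) / len(words)

review = ["The film runs 120 minutes.",
          "I loved the amazing soundtrack.",
          "It was released in 2004."]
print(top_n_subjective(review, toy_score, 1))
```

Swapping `toy_score` for a trained classifier's probability gives the pipeline evaluated in Figure 4.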

Figure 4: Sentiment Analysis accuracy using top N subj. sentences (blue), top N fact. sentences (orange dashed), compared to the all sentences baseline (black).

5.2 Subjectivity-Aware QA Model

Given the importance of subjectivity in other NLP tasks, we investigate whether it is also an important feature for QA, using SubjQA. We approach this by implementing a subjectivity-aware QA model as an extension of one of our baseline models in a multitask learning (MTL) paradigm (Caruana, 1997). One advantage of using MTL is that we do not need access to subjectivity labels at test time, as would be the case if we required subjectivity labels as a feature for each answer span. We base our model on FastQA (Weissenborn et al., 2017). Each input paragraph is encoded with a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) over a sequence of word embeddings and contextual features. This encoding, H, is passed through a hidden layer and a non-linearity:

H' = tanh(W H)
We extend this implementation by adding two hidden layers of task-specific parameters (V_1, V_2) associated with a second learning objective, predicting the subjectivity label from the shared encoding h described above:

y_subj = softmax(V_2 tanh(V_1 h))
In training, we randomly sample between the two tasks (QA and Subjectivity classification).
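The task-sampling scheme above can be sketched as follows; the two step functions are stand-ins for the real FastQA span-prediction and subjectivity updates, and only the alternation logic is the point here.

```python
# Sketch of the multi-task training loop: at each step, randomly pick
# either the QA task or the subjectivity task and run that task's
# update on the shared encoder.
import random

def mtl_train(steps, qa_step, subj_step, p_qa=0.5, seed=0):
    """qa_step / subj_step: callables performing one update and
    returning a loss. Returns the (task, loss) history."""
    rng = random.Random(seed)
    history = []
    for _ in range(steps):
        task = "qa" if rng.random() < p_qa else "subjectivity"
        loss = qa_step() if task == "qa" else subj_step()
        history.append((task, loss))
    return history

# Dummy step functions standing in for real gradient updates.
hist = mtl_train(4, qa_step=lambda: 1.0, subj_step=lambda: 0.5)
print(hist)
```

Because sampling happens per step, the shared parameters see both objectives interleaved, and no subjectivity labels are needed at test time.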

Figure 5: F1 scores of pre-trained out-of-the-box models on different domains in SubjQA.
Fact. A Subj. A Fact. Q Subj. Q Overall
F1 E F1 E F1 E F1 E F1 E
Tripadvisor 17.50 20.88 1.28 7.43 18.85 21.60 1.16 7.37 1.01 7.42
Restaurants 10.36 12.38 8.37 11.49 13.85 15.77 8.19 11.07 5.71 8.65
Movies 14.49 14.63 5.17 8.02 14.27 14.41 5.44 8.28 3.08 5.84
Books 13.95 14.10 7.18 9.82 14.68 14.83 7.05 9.68 4.06 6.67
Electronics 14.15 18.70 0.28 7.06 13.29 18.22 0.40 7.24 -0.01 7.26
Grocery 9.69 11.74 -0.16 3.75 10.71 12.32 -0.48 3.41 -1.57 2.20
Average 13.35 15.40 3.69 7.93 14.27 16.19 3.63 7.84 2.05 6.34
Table 9: MTL gains/losses over the fine-tuning condition (F1 and Exact match), across subj./fact. QA.

5.3 Baselines

We use four pre-trained models to investigate how their performance on SubjQA compares with a factual dataset, SQuAD (Rajpurkar et al., 2016), created using Wikipedia. Specifically, we evaluate BiDAF (Seo et al., 2017), FastQA (Weissenborn et al., 2017), JackQA (Weissenborn et al., 2018), and BERT (Devlin et al., 2018; BERT-Large, Cased, Whole Word Masking), all pre-trained on SQuAD. Additionally, we fine-tune the models on each domain in SubjQA.

Figure 6: Gain in F1 with models fine-tuned on different domains over the pre-trained model.

Figure 5 shows the F1 scores of the pre-trained models; Exact match scores are reported in Appendix A.1. Pre-trained models achieve F1 scores as high as 92.9% on SQuAD. On the other hand, the best model achieves an average F1 of only 30.5% across all domains, and at best 36.5% on any given domain of SubjQA. The difference in performance can be attributed both to the difference in domain (Wikipedia vs. customer reviews) and to how subjectivity is expressed across the different domains.

Figure 6 shows the absolute gains in F1 scores of models fine-tuned on specific domains, over the pre-trained model. After fine-tuning on each domain, the best model achieves an average F1 of 74.1% across the different domains, with a minimum of 63.3% and a maximum of 80.5% on any given domain. While fine-tuning significantly boosts the F1 scores in each domain, they are still lower than the F1 scores on the SQuAD dataset. We argue that this is because the models are agnostic about subjective expressions in questions and reviews. To validate our hypothesis, we compare the gain in F1 scores of the BERT model on subjective questions and factual questions. We find that the difference in F1 gains is as high as 23.4% between factual and subjective questions. F1 gains differ by as much as 23.0% for factual vs. subjective answers.

5.4 Subjectivity-Aware Modeling

After fine-tuning on each domain in the MTL setting, the subjectivity-aware model achieves an average F1 of 76.3% across the different domains, with a minimum of 58.8% and a maximum of 82.0% on any given domain. Results from the subjectivity-aware model are shown in Table 9. Under both the F1 and the Exact match metrics, incorporating subjectivity in the model as an auxiliary task boosts performance across all domains. Although there are also gains for subjective questions and answers, it is noteworthy that the highest gains are found for factual questions and answers. This can be explained by the fact that existing techniques are already tuned for factual questions: our MTL extension helps in identifying factual questions, which further improves the results. However, even if subjective questions are identified, the system is still not tuned to adequately deal with this input.

6 Related Work

We are witnessing an exponential rise in user-generated content. Much of this data contains subjective information, ranging from personal experiences to opinions about specific aspects of a product. This information is useful for supporting decision making in product purchases. However, subjectivity has largely been studied in the context of sentiment analysis (Hu and Liu, 2004) and opinion mining (Blair-Goldensohn et al., 2008), with a focus on text polarity. There is renewed interest in incorporating subjective opinion data into general data management systems (Li et al., 2019; Kobren et al., 2019) and in providing an interface for querying subjective data. These systems employ trained components for extracting opinion data, labeling it, and even responding to user questions.

In this work, we revisit subjectivity in the context of review QA. McAuley and Yang (2016) and Yu et al. (2012) also use review data, leveraging question types and aspects to answer questions. However, no prior work has modeled subjectivity explicitly using end-to-end architectures.

Furthermore, none of the existing review-based QA datasets are targeted at understanding subjectivity. This can be attributed to how these datasets are constructed. Large-scale QA datasets, such as SQuAD (Rajpurkar et al., 2016), NewsQA (Trischler et al., 2017), and CoQA (Reddy et al., 2019), are based on factual data. We are the first to attempt to create a review-based QA dataset for the purpose of understanding subjectivity.

7 Conclusion

In this paper, we investigate subjectivity in question answering, by leveraging end-to-end architectures. We release SubjQA, a question-answering corpus which contains subjectivity labels for both questions and answers. The dataset allows i) evaluation and development of architectures for subjective content, and ii) investigation of subjectivity and its interactions in broad and diverse contexts. We further implement a subjectivity-aware model and evaluate it, along with 4 strong baseline models. We hope this dataset opens new avenues for research on end-to-end architectures for querying subjective content, and for research into subjectivity in NLP in general.


References

  • S. Abbasi Moghaddam (2013) Aspect-based opinion mining in online reviews. Ph.D. Thesis, Applied Sciences: School of Computing Science.
  • C. Banea, R. Mihalcea, and J. Wiebe (2011) Multilingual sentiment and subjectivity analysis. Multilingual Natural Language Processing 6, pp. 1–19.
  • A. Banfield (1982) Unspeakable sentences: the sentence representing non-reflective consciousness and the absence of the narrator. Routledge.
  • F. Benamara, M. Taboada, and Y. Mathieu (2017) Evaluative language beyond bags of words: linguistic insights and computational applications. Computational Linguistics 43 (1), pp. 201–264.
  • S. Blair-Goldensohn, K. Hannan, R. McDonald, T. Neylon, G. Reis, and J. Reynar (2008) Building a sentiment summarizer for local service reviews.
  • R. Caruana (1997) Multitask learning. Machine Learning 28 (1), pp. 41–75.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805.
  • K. Durkin and J. Manning (1989) Polysemy and the subjective lexicon: semantic relatedness and the salience of intraword senses. Journal of Psycholinguistic Research 18 (6), pp. 577–612.
  • M. Fan, C. Feng, M. Sun, P. Li, and H. Wang (2019) Reading customer reviews to answer product-related questions. In Proceedings of the 2019 SIAM International Conference on Data Mining, SDM 2019, pp. 567–575.
  • Q. Grail and J. Perez (2018) ReviewQA: a relational aspect-based opinion reading dataset. arXiv:1810.12196.
  • M. Gupta, N. Kulkarni, R. Chanda, A. Rayasam, and Z. C. Lipton (2019) AmazonQA: a review-based question answering task. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, pp. 4996–5002.
  • D. R. Heise (2001) Project Magellan: collecting cross-cultural affective meanings via the internet. Electronic Journal of Sociology.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
  • M. Hu and B. Liu (2004) Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 168–177.
  • M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer (2017) TriviaQA: a large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, pp. 1601–1611.
  • A. Kobren, P. Bario, O. Yakhnenko, J. Hibschman, and I. Langmore (2019) Constructing high precision knowledge bases with subjective and factual attributes. arXiv:1905.12807.
  • Y. Li, A. Feng, J. Li, S. Mumick, A. Y. Halevy, V. Li, and W. Tan (2019) Subjective databases. PVLDB 12 (11), pp. 1330–1343.
  • J. J. McAuley and A. Yang (2016) Addressing complex and subjective product-related queries with customer reviews. In Proceedings of the 25th International Conference on World Wide Web, WWW 2016, pp. 625–635.
  • B. Pang and L. Lee (2004) A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, ACL 2004, pp. 271–278.
  • S. Poria, E. Cambria, and A. F. Gelbukh (2016) Aspect extraction for opinion mining with a deep convolutional neural network. Knowledge-Based Systems 108, pp. 42–49.
  • R. Quirk, S. Greenbaum, G. Leech, and J. Svartvik (1985) A comprehensive grammar of the English language. Longman, New York.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI Technical Report.
  • P. Rajpurkar, R. Jia, and P. Liang (2018) Know what you don't know: unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, pp. 784–789.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, pp. 2383–2392.
  • S. Reddy, D. Chen, and C. D. Manning (2019) CoQA: a conversational question answering challenge. TACL 7, pp. 249–266.
  • S. Riedel, L. Yao, A. McCallum, and B. M. Marlin (2013) Relation extraction with matrix factorization and universal schemas. In Proceedings of NAACL-HLT 2013, pp. 74–84.
  • M. J. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi (2017) Bidirectional attention flow for machine comprehension. In 5th International Conference on Learning Representations, ICLR 2017.
  • S. Sun, C. Luo, and J. Chen (2017) A review of natural language processing techniques for opinion mining systems. Information Fusion 36, pp. 10–25.
  • A. Trischler, T. Wang, X. Yuan, J. Harris, A. Sordoni, P. Bachman, and K. Suleman (2017) NewsQA: a machine comprehension dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP, Rep4NLP@ACL 2017, pp. 191–200.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
  • H. Wang, Y. Lu, and C. Zhai (2010) Latent aspect rating analysis on review text data: a rating regression approach. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 783–792.
  • D. Weissenborn, P. Minervini, I. Augenstein, J. Welbl, T. Rocktäschel, M. Bosnjak, J. Mitchell, T. Demeester, T. Dettmers, P. Stenetorp, and S. Riedel (2018) Jack the Reader: a machine reading framework. In Proceedings of ACL 2018, System Demonstrations, pp. 25–30.
  • D. Weissenborn, G. Wiese, and L. Seiffe (2017) FastQA: a simple and efficient neural architecture for question answering. arXiv:1703.04816.
  • J. Wiebe, R. F. Bruce, and T. P. O'Hara (1999) Development and use of a gold-standard data set for subjectivity classifications. In 27th Annual Meeting of the Association for Computational Linguistics, ACL 1999.
  • J. Wiebe and R. Mihalcea (2006) Word sense and subjectivity. In Proceedings of COLING/ACL 2006.
  • J. Wiebe, T. Wilson, and C. Cardie (2005) Annotating expressions of opinions and emotions in language. Language Resources and Evaluation 39 (2-3), pp. 165–210.
  • H. Xu, B. Liu, L. Shu, and P. S. Yu (2019a) Review conversational reading comprehension. arXiv:1902.00821.
  • H. Xu, B. Liu, L. Shu, and P. S. Yu (2019b) BERT post-training for review reading comprehension and aspect-based sentiment analysis. arXiv:1904.02232.
  • J. Yu, Z. Zha, and T. Chua (2012) Answering opinion questions on products by exploiting hierarchical organization of consumer reviews. In Proceedings of EMNLP-CoNLL 2012, pp. 391–401.

Appendix A Appendices

a.1 Additional Experimental Results

Figure 7 shows the Exact scores achieved by the pre-trained out-of-the-box models on the various domains in SubjQA. Figure 8 shows the gains in Exact scores obtained by fine-tuning the models on each domain.

Figure 7: Exact scores of pre-trained out-of-the-box models on different domains.
Figure 8: Gain in Exact scores with models fine-tuned on different domains.

a.2 Neighborhood Model Construction

To construct the matrix for factorization, we focus on frequently reviewed items and frequent extractions. In particular, we consider items with more than 10,000 reviews and extractions expressed in more than 5,000 reviews. Once the matrix is constructed, we factorize it using a non-negative matrix factorization method, with 20 as the dimension of the extraction embedding vectors.
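The factorization step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses a randomly generated item-by-extraction count matrix as stand-in data, and standard multiplicative-update NMF (one common choice of non-negative factorization; the paper does not specify its solver).

```python
import numpy as np

def nmf(X, k=20, iters=200, eps=1e-9, seed=0):
    """Non-negative matrix factorization via multiplicative updates.

    Returns W (n_items x k) and H (k x n_extractions) with X ~= W @ H.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(iters):
        # Standard Lee-Seung updates; eps avoids division by zero.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical stand-in for the real matrix: rows are frequently reviewed
# items, columns are frequent extractions, entries are occurrence counts.
rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(50, 30)).astype(float)

W, H = nmf(counts, k=20)   # k=20 matches the embedding dimension above
extraction_emb = H.T       # one 20-dimensional embedding per extraction
```

In practice a library implementation (e.g. scikit-learn's `NMF`) would replace the hand-rolled updates; the point is only that each extraction ends up with a 20-dimensional non-negative embedding.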

In the next step, we construct the neighborhood model by finding the top-10 neighbors of each extraction, based on the cosine similarity between the extraction's embedding and the candidate neighbors' embeddings. We further select topics from the extractions and prune the neighbors based on the criteria described earlier.
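The neighbor-finding step amounts to a cosine-similarity nearest-neighbor lookup over the extraction embeddings. A minimal sketch, using random embeddings as placeholder input:

```python
import numpy as np

def top_k_neighbors(embeddings, k=10):
    """Return, for each row, the indices of its k most cosine-similar rows."""
    # Normalize rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.maximum(norms, 1e-12)
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)  # exclude each extraction itself
    # Sort similarities in descending order and keep the top k indices.
    return np.argsort(-sims, axis=1)[:, :k]

# Placeholder extraction embeddings (30 extractions, 20 dimensions).
rng = np.random.default_rng(0)
emb = rng.random((30, 20))
neighbors = top_k_neighbors(emb, k=10)  # shape: (30, 10)
```

The subsequent topic selection and pruning operate on these neighbor lists and are omitted here, since they depend on criteria defined earlier in the paper.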

a.3 Crowdsourcing Details

Figure 9 illustrates the instructions shown to crowdworkers for the question generation task. Figure 10 shows the interface for the answer-span collection and subjectivity labeling tasks. Workers assign a subjectivity score (1–5) to each question and to the selected answer span, and can also indicate that a question cannot be answered from the given review.

Figure 9: The instructions shown to crowdworkers for the question writing task.
Figure 10: The interface for the answer-span collection and subjectivity labeling tasks.