Review Conversational Reading Comprehension

02/03/2019 ∙ by Hu Xu, et al. ∙ University of Illinois at Chicago

Seeking information about products and services is an important activity of online consumers before making a purchase decision. Inspired by recent research on conversational reading comprehension (CRC) on formal documents, this paper studies the task of leveraging knowledge from a huge amount of reviews to answer multi-turn questions from consumers or users. Questions spanning multiple turns in a dialogue enable users to ask more specific questions that are hard to express in a single question, as in traditional machine reading comprehension (MRC). In this paper, we first build a dataset and then propose a novel task-adaptation approach to encoding the formulation of the CRC task into a pre-trained language model. This task-adaptation approach is unsupervised and can greatly enhance performance on the end CRC task, which has only limited supervision. Experimental results show that the proposed approach is highly effective and achieves performance competitive with a supervised approach. We plan to release the datasets and the code in May 2019.


1 Introduction

Seeking information to assess whether some products or services suit one’s needs is a vital activity for consumer decision making. In online businesses, one major hindrance is that customers have limited access to answers to their specific questions or concerns about products and user experiences. Given the ever-changing environment of products and services, it is very hard, if not impossible, to pre-compile an up-to-date knowledge base to answer user questions as in KB-QA Kwok et al. (2001); Fader et al. (2014); Yin et al. (2015); Xu et al. (2016). As a compromise, community question-answering (CQA) McAuley and Yang (2016) is leveraged to enable existing customers or sellers to answer customer questions. However, one obvious drawback of this approach is that many questions are not answered, and even when they are, the answers and any follow-up questions are delayed, which makes it unsuitable for interactive QA. Although existing studies have used information retrieval (IR) techniques McAuley and Yang (2016); Yu and Lam (2018) to identify a whole review as an answer to a question, reading a whole review is time-consuming, and the approach has difficulty answering questions over multiple turns.

A Laptop Review:
I purchased my Macbook Pro Retina from
my school since I had a student discount ,
but I would gladly purchase it from Amazon
for full price again if I had too . The Retina
is great , its amazingly fast when it
boots up because of the SSD storage
and the clarity of the screen is amazing as well .
Turns of Questions from a Customer:
Q1: how is retina display ?
Q2: speed of booting up ?
Q3: why ?
Q4: what ’s the capacity of that ? (NO ANSWER)
Q5: is the screen clear ?
Table 1: An example of conversational review reading comprehension: a dialogue with 5 turns of questions from a customer, together with the review; the answers are textual spans in the review.

Inspired by recent research in Conversational Reading Comprehension (CRC) Reddy et al. (2018); Choi et al. (2018), we explore the possibility of turning reviews into a source of valuable knowledge about user experiences and providing a natural way of answering customers’ multi-turn questions in a dialogue setting. The conversational setting of machine reading comprehension (MRC) enables more specific questions and allows customers to either omit or co-reference information in context. In the laptop example shown in Table 1, a customer asks 5 turns of questions based on the context. The customer first asks an opinion question about the aspect “retina display” of a to-be-purchased laptop. Then the customer carries (and omits) the question type opinion from the first question to the second and asks about a second aspect, “boot-up speed”. For the third question, the customer carries over the aspect of the second question but changes the question type to opinion explanation. Later, the customer co-references the aspect “SSD” from the previous answer and asks for the capacity (a sub-aspect) of the “SSD”. Unfortunately, there is no answer in this review for the fourth question, so the review may respond “I don’t know”. But the customer can still ask about other aspects, as in the fifth question. We formally define this problem as follows and call it review conversational reading comprehension (RCRC).

Problem Definition: Given a review that consists of a sequence of tokens d = (d_1, ..., d_n), a history of past questions and answers as the context C = (q_1, a_1, ..., q_{k-1}, a_{k-1}), and the current question q_k, find a sequence of tokens (a textual span) a_k = (d_s, ..., d_e) in d that answers q_k based on C, where 1 ≤ s ≤ n, 1 ≤ e ≤ n, and s ≤ e, or return NO ANSWER if the review does not contain any answer for q_k.
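
To make the setting concrete, a single RCRC instance built from the review in Table 1 could be represented as follows. This is a hypothetical sketch: the field names and the example answers are ours for illustration and are not the dataset's actual schema.

# A hypothetical RCRC instance (field names and answers are illustrative only).
rcrc_example = {
    "review": "I purchased my Macbook Pro Retina from my school since I had "
              "a student discount , but I would gladly purchase it from Amazon "
              "for full price again if I had too . The Retina is great , its "
              "amazingly fast when it boots up because of the SSD storage and "
              "the clarity of the screen is amazing as well .",
    "context": [                                # past (question, answer) turns
        ("how is retina display ?", "great"),
        ("speed of booting up ?", "amazingly fast"),
    ],
    "current_question": "why ?",
    "answer": "because of the SSD storage",     # a textual span in the review
    # an unanswerable question (e.g., Q4 in Table 1) would carry NO ANSWER instead
}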

RCRC is a novel QA task that requires understanding both the current question q_k and the dialogue context C. Compared with traditional single-turn MRC, the key challenge is how to understand the context C and the current question q_k, given that q_k may involve co-reference or context carryover.

To the best of our knowledge, there are no existing review datasets for RCRC. We first build a dataset based on laptop and restaurant reviews from SemEval 2016 Task 5 (http://alt.qcri.org/semeval2016/task5/). We choose this dataset to better align with existing research on review-based tasks in sentiment analysis. Each review is annotated with a few dialogues focusing on some topics. Note that although each dialogue is annotated on a single review, a trained RCRC model can potentially be deployed over an open set of reviews Chen et al. (2017), where the context may contain answers from different reviews. Given the wide spectrum of domains in online business (e.g., thousands of categories on Amazon.com) and the prohibitive cost of annotation, the dataset is designed to have limited supervision, as in other tasks of sentiment analysis.

We adopt BERT (Bidirectional Encoder Representations from Transformers Devlin et al. (2018)) as our base model, since its variants achieve dominant performance on MRC Rajpurkar et al. (2016, 2018) and CRC Reddy et al. (2018) tasks. However, BERT is designed to learn features for a wide spectrum of NLP tasks (see https://gluebenchmark.com/leaderboard) from a large number of training examples (the number of training examples for MRC and CRC is typically more than 100K). The task-awareness of BERT can be hindered by the weak supervision of our dataset. To resolve this challenge, we introduce a novel pre-tuning stage between pre-training and end-task fine-tuning for BERT. The pre-tuning stage is formulated in a fashion similar to the RCRC task but requires no annotated RCRC data, only domain QA pairs (from CQA) and reviews, which are readily available online McAuley and Yang (2016). We bring certain characteristics of the RCRC task (its inputs and outputs) into pre-tuning to encourage BERT’s weights to be prepared for understanding the current question and locating the answer if one exists. The proposed pre-tuning step is general and can potentially be used in MRC or CRC tasks in other domains.

The main contributions of this paper are as follows. (1) It proposes a practical new task on reviews that allows multi-turn conversational QA. (2) To address this problem, an annotated dataset is created. (3) It proposes a pre-tuning stage to learn task-aware representations. Experimental results show that the proposed approach achieves competitive performance even compared with a supervised approach trained on large-scale data.

2 Related Works

MRC (or CRC) has been studied in many domains with formal written texts, e.g., Wikipedia (WikiReading Hewlett et al. (2016), SQuAD Rajpurkar et al. (2016, 2018), WikiHop Welbl et al. (2018), DRCD Shao et al. (2018), QuAC Choi et al. (2018), HotpotQA Yang et al. (2018)), fictional stories (MCTest Richardson et al. (2013), CBT Hill et al. (2015), NarrativeQA Kočiskỳ et al. (2018)), general Web documents (MS MARCO Nguyen et al. (2016), TriviaQA Joshi et al. (2017), SearchQA Dunn et al. (2017)) and news articles (NewsQA Trischler et al. (2016), CNN/Daily Mail Hermann et al. (2015), and RACE Lai et al. (2017)). Recently, CRC Reddy et al. (2018); Huang et al. (2018); Zhu et al. (2018) has gained increasing popularity as it allows natural multi-turn questions. Examples are QuAC Choi et al. (2018) and CoQA Reddy et al. (2018). CoQA is built from multiple sources, such as Wikipedia, Reddit, News, Mid/High School Exams, Literature, etc. To the best of our knowledge, CRC has not been applied to reviews, which are primarily subjective. Our dataset is compatible with the format of the CoQA dataset, so all CoQA-based models can be easily adapted to it. Answers in our dataset are intended to be extractive (similar to SQuAD Rajpurkar et al. (2016, 2018)) rather than abstractive (generative) (as in MS MARCO Nguyen et al. (2016) and CoQA Reddy et al. (2018)), because we believe online businesses are cost-sensitive, and relying on extracted human-written text as answers is more reliable than machine-generated answers.

Traditionally, knowledge bases (KBs) (such as Freebase Dong et al. (2015); Xu et al. (2016); Yao and Van Durme (2014) or DBpedia Lopez et al. (2010); Unger et al. (2012)) have been used for question answering Yu and Lam (2018). However, the ever-changing environment of online businesses, where new products and services appear constantly, makes it prohibitive to build a high-quality KB covering all new products, services, and subjective experiences from customers. Community QA (CQA) is widely adopted by online businesses McAuley and Yang (2016) to help users get answers to their questions. However, since the answers are written by humans, it often takes a long time for a question to be answered, if it is answered at all, as discussed in the introduction. There is existing research that aligns reviews to questions in CQA as an information retrieval task McAuley and Yang (2016); Yu and Lam (2018), but a whole review is hard to read and not suitable for follow-up questions. We make novel use of CQA data for CRC (and potentially MRC); such data plays a significant role in encouraging domain representation learning on questions and contexts, which is largely ignored in existing research on MRC (or CRC).

3 Preliminary

In this section, we briefly review BERT (Bidirectional Encoder Representations from Transformers Devlin et al. (2018)), which is one of the key innovations in unsupervised contextualized representation learning Peters et al. (2018); Howard and Ruder (2018); Radford et al. (2018); Devlin et al. (2018). The idea behind these innovations is that although the word embedding layer Mikolov et al. (2013); Pennington et al. (2014) is trained from large-scale corpora, relying on the limited supervised data of end tasks to train the contextualized representation is insufficient. Unlike ELMo Peters et al. (2018) and ULMFiT Howard and Ruder (2018), which are designed to provide additional features for an end task, BERT adopts a fine-tuning approach that requires almost no task-specific architecture design, relying instead on the parameter-intensive BERT model itself. As such, BERT requires pre-training on large-scale data (Wikipedia articles) to fit its large number of parameters, in exchange for avoiding hand-crafted architectures for specific end tasks that encode a human’s understanding of the data of those tasks.

One training example of BERT is formulated as ([CLS], x_{1:j}, [SEP], x_{j+1:n}, [SEP]), where [CLS] and [SEP] are special tokens and x_{1:n} is a document split into two sides of sentences, x_{1:j} and x_{j+1:n}. The key performance gain of BERT comes from two novel pre-training objectives: the masked language model (MLM) and next sentence prediction.
Masked Language Model enables learning bidirectional language models and essentially encourages a BERT model to predict randomly masked words given their contexts. This is crucial for RCRC. For example, a training example can be “its amazingly [MASK] when it boots up because of the [MASK] storage”. These two [MASK]s encourage BERT to guess that the first mask could be “fast” and the second mask could be “SSD”, so as to learn some common knowledge about the aspects of laptops and their potential opinions.
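
For illustration, the random masking behind MLM can be sketched as follows. This is a simplified sketch of ours: real BERT pre-training also keeps or randomly replaces a fraction of the selected tokens rather than always masking them.

import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Randomly replace tokens with [MASK] for masked-language-model training."""
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)      # the model must predict the original token
        else:
            masked.append(tok)
            labels.append(None)     # no prediction needed at this position
    return masked, labels

# e.g. mask_tokens("its amazingly fast when it boots up".split())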
Next Sentence Prediction further encourages BERT to learn inter-sentence representations by predicting whether two sides around the first [SEP] are from the same document or not. We remove this objective in our pre-tuning as the text format is different from BERT pre-training (discussed in the next Section).

In summary, we can see that pre-trained BERT severely lacks RCRC task-awareness: there is no formulation for the context C, the current question q_k, or possible answer spans, and Wikipedia contains almost no questions or domain knowledge about online businesses. We resolve these issues in the next section.

4 Task-awareness Pre-tuning

To address the limitation of BERT on task-awareness, we introduce an intermediate pre-tuning stage between BERT pre-training and fine-tuning on RCRC. This works in a similar spirit to the motivation behind BERT (or any other pre-trained language model): it is insufficient to learn the end-task definition (or setting) solely from the limited supervised data of that task. Task-awareness is determined by the inputs and outputs of RCRC, which suggests two directions for pre-tuning: (1) understanding the text inputs, including both the domains and the text format (e.g., contexts and current questions); and (2) understanding the goal of RCRC, i.e., producing either a text span or NO ANSWER. As such, we first define the textual format shared by RCRC and BERT pre-tuning in Section 4.1, describe pre-tuning data generation in Section 4.2, and introduce an auxiliary pre-tuning objective in Section 4.3.

4.1 Textual Format

Inspired by the recent implementation of DrQA for CoQA Reddy et al. (2018) and BERT for SQuAD, we formulate an input example x for pre-tuning (or RCRC) from the context C = (q_1, a_1, ..., q_{k-1}, a_{k-1}), the current question q_k, and the review d as follows:

[CLS] [Q] q_1 [A] a_1 ... [Q] q_{k-1} [A] a_{k-1} [Q] q_k [SEP] d [SEP]

where past QA pairs in C are concatenated and separated by the two special tokens [Q] and [A], and then concatenated with the current question q_k to form the left side of BERT’s input; the right side is the review document. This format is used for both pre-tuning and RCRC fine-tuning. Note that the answer to a past question with no answer in the context is written as the single word “unknown”. Although this format is simple and intuitive for a human to read, BERT’s pre-trained weights have no notion of the semantics behind it (e.g., where the current question is, how many turns the context has, and where the previous turn is), let alone the special tokens [Q] and [A], which never appear during BERT pre-training.
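
A minimal sketch of how such an input token sequence could be assembled is shown below; the helper function and its signature are our own illustration, not part of the paper's released code.

def build_input(context, current_question, review_tokens):
    """Assemble the left side (dialogue context + current question) and the
    right side (review) into one token sequence for BERT.

    context: list of (question_tokens, answer_tokens) pairs from past turns;
             an unanswerable past turn uses the single word "unknown" as answer.
    """
    left = ["[CLS]"]
    for q_toks, a_toks in context:
        left += ["[Q]"] + q_toks + ["[A]"] + a_toks
    left += ["[Q]"] + current_question + ["[SEP]"]
    return left + review_tokens + ["[SEP]"]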

4.2 Pre-tuning Data Generation

Based on the format defined in Section 4.1, we observe that getting BERT familiar with domain reviews is as easy as continually training BERT on reviews. However, enabling BERT to understand the context and the current question is more challenging, as BERT’s pre-training data contains almost no questions. To resolve this issue, we combine QA pairs (from CQA data) and reviews to formulate the pre-tuning examples, as shown in Algorithm 1. Note that these two kinds of data are often readily available across a wide range of products on Amazon.com and Yelp.com.

Input : Q: a set of QA pairs;
        R: a set of reviews;
        h: maximum number of turns in the context.
Output : T: pre-tuning data.
 1  T ← ∅
 2  for each QA pair (q, a) ∈ Q do
 3      x ← ([CLS])
 4      c ← a random integer drawn from [0, h]
 5      for t ← 1, ..., c do
 6          draw a QA pair (q′, a′) from Q \ {(q, a)}
 7          x ← x ⊕ ([Q], q′, [A], a′)
 8      end for
 9      x ← x ⊕ ([Q], q, [SEP])
10      draw a review r = (r_1, ..., r_m) with m sentences from R
11      if rand() < 0.5 then
12          â ← a
13      else
14          â ← a random answer a′ ≠ a drawn from Q
15      end if
16      s ← 0; e ← 0
17      p ← a position drawn uniformly from {0, 1, ..., m}
18      insert â into r after sentence r_p to obtain r̃
19      x ← x ⊕ r̃ ⊕ ([SEP])
20      if â = a then
21          update (s, e) to the start and end token positions of â in x
22      end if
23      T ← T ∪ {(x, (s, e))}
24  end for
Algorithm 1 Data Generation Algorithm

To ensure that the topic of a pre-tuning example is consistent between QAs and reviews, we assume QA pairs and reviews are organized under each entity (a laptop or a restaurant in our experiments) that customers focus on. The inputs to Algorithm 1 are a set Q of QA pairs and a set R of reviews belonging to the same entity, and the maximum number of turns h in the context, which is the same as in the RCRC dataset. The output is the pre-tuning data T, initialized in Line 1, where each example is denoted as (x, (s, e)). Here x is the input example and (s, e) are the two pointers for the auxiliary objective (discussed in Section 4.3). Given a QA pair (q, a) in Line 2, we first build the left side of the input example in Lines 3-9. After initializing the input x in Line 3, we randomly determine the number of context turns c in Line 4 and concatenate these turns of QA pairs in Lines 5-8, where drawing from Q \ {(q, a)} ensures the current QA pair is not chosen. In Line 9, we concatenate the current question q. Lines 10-23 build the right side of the input example x and the output pointers. In Line 10, we randomly draw a review r with m sentences. To challenge the pre-tuning stage to discover the semantic relatedness between q and a (as required by the auxiliary objective), we decide at random whether the right side contains the true answer a (Line 12) or a fake, randomly drawn answer (Line 14), and the two pointers s and e are initialized in Line 16. Then, we insert the chosen answer chunk â into the review r at one of the m+1 locations between sentences, picked at random (Lines 17-18), which gives us the augmented review r̃ with |r| + |â| tokens. If â is the true answer a, we further update s and e to point to the chunk boundaries of a (Lines 20-22); otherwise, BERT should detect that the random answer does not belong to the right side, and both pointers point to [CLS] (s = e = 0). Finally, the example is aggregated into T in Line 23.

Algorithm 1 is run multiple times to allow for enough samples of the data. As we can see, although labeled training examples for RCRC are expensive to obtain, harvesting a large amount of pre-tuning data is easy. Following the success of BERT, we still randomly mask some words in each example to learn contextualized representations on domain texts.
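
A simplified Python sketch mirroring Algorithm 1 is given below. The function name, the token-list representation of QA pairs and reviews, and the 50/50 positive/negative split are assumptions of ours, not details taken from the released code.

import random

def generate_pretuning_data(qa_pairs, reviews, max_turns):
    """Build pre-tuning examples from domain QA pairs and reviews.

    qa_pairs: list of (question_tokens, answer_tokens) pairs for one entity.
    reviews:  list of reviews, each a list of sentences (token lists).
    Returns a list of (tokens, (s, e)); s/e point to the inserted true-answer
    chunk, or both point to [CLS] (position 0) for a negative example.
    """
    data = []
    for q, a in qa_pairs:
        # Left side: randomly chosen past turns plus the current question.
        left = ["[CLS]"]
        turns = random.randint(0, max_turns)
        others = [p for p in qa_pairs if p != (q, a)]
        for cq, ca in random.sample(others, min(turns, len(others))):
            left += ["[Q]"] + cq + ["[A]"] + ca
        left += ["[Q]"] + q + ["[SEP]"]

        # Right side: a review with the true answer or a random one inserted.
        sents = random.choice(reviews)
        positive = random.random() < 0.5                 # assumed 50/50 split
        chunk = a if positive else random.choice([x for _, x in qa_pairs if x != a])
        pos = random.randint(0, len(sents))              # one of m+1 insertion points
        before = [tok for sent in sents[:pos] for tok in sent]
        after = [tok for sent in sents[pos:] for tok in sent]
        right = before + chunk + after

        if positive:
            s = len(left) + len(before)                  # start of the answer chunk
            e = s + len(chunk) - 1                       # end of the answer chunk
        else:
            s = e = 0                                    # both pointers on [CLS]
        data.append((left + right + ["[SEP]"], (s, e)))
    return data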

4.3 Auxiliary Objective

Besides adapting the input to the domains and the RCRC task, it is also desirable for pre-tuning to adapt BERT to the goal of RCRC, which is to predict a token span or NO ANSWER. Thus, besides MLM from BERT, we introduce an auxiliary objective called answer chunk detection that aligns pre-tuning with RCRC, except that we only predict the token span of an answer chunk drawn from CQA. Further, this task prepares BERT for predicting NO ANSWER on a review by detecting a randomly drawn negative answer.

Let BERT(·) denote BERT’s transformer model. We first obtain the hidden representation of the input x as h = BERT(x). The hidden representation is then passed to two separate dense layers followed by softmax functions: l_1 = softmax(W_1 h + b_1) and l_2 = softmax(W_2 h + b_2), where W_1, W_2 ∈ R^{r_h}, b_1, b_2 ∈ R, and r_h is the size of the hidden dimension (e.g., 768 for the base model). Training involves minimizing the averaged cross entropy on the two pointers s and e generated by Algorithm 1:

L = -(1/2) (y_s · log l_1 + y_e · log l_2),

where y_s and y_e are one-hot vectors representing the starting and ending positions s and e. For a positive example (with the true answer a randomly inserted into the review), s and e are expected to fall on the review side, i.e., after the position of the first [SEP]. For a negative example (with a random answer that is not a mixed into the review), s = e = 0 indicates that both pointers must point to [CLS].
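
A rough PyTorch sketch of this objective is shown below; the module name is ours and hidden is assumed to be BERT's final-layer output, so this is an illustration rather than the authors' released implementation.

import torch
import torch.nn as nn

class AnswerChunkDetector(nn.Module):
    """Two dense layers that point to the start and end of the answer chunk."""
    def __init__(self, hidden_size=768):
        super().__init__()
        self.start = nn.Linear(hidden_size, 1)
        self.end = nn.Linear(hidden_size, 1)

    def forward(self, hidden, s, e):
        # hidden: (batch, seq_len, hidden_size) from BERT's transformer
        start_logits = self.start(hidden).squeeze(-1)   # (batch, seq_len)
        end_logits = self.end(hidden).squeeze(-1)       # (batch, seq_len)
        loss_fn = nn.CrossEntropyLoss()
        # s, e: (batch,) positions from Algorithm 1; 0 means [CLS] / no answer
        loss = (loss_fn(start_logits, s) + loss_fn(end_logits, e)) / 2
        return loss, start_logits, end_logits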

After pre-tuning, we fine-tune on the RCRC task in a similar fashion to the auxiliary objective, except this time there is no need to perform MLM.

5 Experiments

We aim to answer the following research questions (RQs) in the experiments:
RQ1: What is the performance of BERT compared against the CoQA baselines?
RQ2: In an ablation over different applications of BERT, what is the performance gain of pre-tuning?
RQ3: What is the performance of pre-tuning compared to using (large-scale) supervised data?

5.1 Pre-tuning Datasets

To be consistent with existing research on review-based tasks such as sentiment analysis, we adopt SemEval 2016 Task 5 as the review source for RCRC (we do not use SemEval 2014 Task 4 or SemEval 2015 Task 12 because they do not have review-level data), which contains two domains: laptop and restaurant. We then collect reviews and QA pairs for these two domains. For the laptop domain, we collect reviews from He and McAuley (2016) and QA pairs from Xu et al. (2018), both under the laptop category of Amazon.com. We exclude products in the test data of our RCRC dataset to make sure the test data is not used to tune any model parameters. This gives us 113,728 laptop reviews and 18,589 QA pairs. For the restaurant domain, we collect reviews from the Yelp dataset challenge (https://www.yelp.com/dataset/challenge) but crawl QA pairs from Yelp.com (https://www.yelpblog.com/2017/02/qa). We select restaurants with at least 100 reviews, as other restaurants seldom have any QA pairs. This yields 753,096 restaurant reviews and 15,457 QA pairs.

To compare with a supervised pre-tuning approach, we further leverage the CoQA dataset Reddy et al. (2018). It comes with 7,199 documents (passages) and 108,647 QA pairs of supervised training data, with domains covering Children’s Stories, Literature, Mid/High School Exams, News, and Wikipedia.

5.2 Datasets

To the best of our knowledge, there are no existing datasets for RCRC. We keep the training/testing split of the SemEval 2016 Task 5 datasets and annotate dialogues of QAs on each review. To ensure our questions are realistic, annotators are first asked to read CQAs from the pre-tuning data. Each dialogue is annotated to focus on certain topics of a review. Textual spans are kept as short as possible while remaining human-readable. No-answer questions are also annotated; these have certain topical connections with nearby questions or answers. Annotators are encouraged to label about 2 dialogues per testing review to obtain enough testing examples, and 1 dialogue per training review to ensure good coverage of reviews. Each question is shortened as much as possible to omit information already present in past turns. The annotated data follows the format of CoQA Reddy et al. (2018) to facilitate future research. The statistics of the dataset are shown in Table 2. We split 20% of the training reviews as the validation set for each domain.

Training Laptop Restaurant
# of reviews 446 350
# of dialogues 509 382
# of questions 1680 1485
% of no answers 24% 24%
Testing Laptop Restaurant
# of reviews 79 90
# of dialogues 179 163
# of questions 807 801
% of no answers 26.7% 27.9%
Table 2: Statistics of Datasets.

5.3 Compared Methods

We compare the following methods by training or fine-tuning on our RCRC dataset. All baselines are run with their default hyper-parameters.
DrQA is a CRC baseline that comes with the CoQA dataset (https://github.com/stanfordnlp/coqa-baselines). Note that this implementation of DrQA differs from DrQA for SQuAD Chen et al. (2017) in that it is modified to support no-answer questions by appending a special token unknown to the end of the document, so a predicted span on unknown indicates NO ANSWER. This baseline answers research question RQ1.
DrQA+CoQA is the above baseline pre-tuned on the CoQA dataset and then fine-tuned on our dataset. We use this baseline to show that even DrQA pre-trained on CoQA is sub-optimal for RCRC. This baseline answers RQ1 and RQ3.
BERT is the vanilla BERT model directly fine-tuned on our dataset. We use this baseline for an ablation study on the effectiveness of pre-tuning. All BERT variants are used to answer RQ2.
BERT+review first tunes BERT on domain reviews using the same objectives as BERT pre-training and then fine-tunes on our dataset. We use this baseline to show that a simple domain adaptation of BERT is not sufficient.
BERT+CoQA first fine-tunes BERT on the supervised CoQA data and then fine-tunes on our dataset. We use this baseline to show that pre-tuning is very competitive even compared with models trained on large-scale supervised data. This also answers RQ3.
BERT+Pre-tuning first pre-tunes BERT as proposed and then fine-tunes on our dataset.

5.4 Hyper-parameters and Evaluation Metrics

We choose the BERT base model for pre-tuning and fine-tuning, which has 12 layers, 768 hidden dimensions, and 12 attention heads (in the transformer), with 110M parameters in total. We cannot use the BERT large model because it does not fit into our GPU memory for training. We set the maximum sequence length to 256 with a batch size of 16. We perform pre-tuning for 10k steps, as further increasing the pre-tuning steps does not yield better results. We fine-tune for 6 epochs, though most runs converge within 3 epochs thanks to the pre-trained/pre-tuned weights of BERT. Results are reported as averages over 3 runs of fine-tuning (3 different random seeds for batch generation).

To be consistent with existing research, we leverage the same evaluation script as CoQA (with the minimum necessary changes to allow the script to handle new domains). Similar to the evaluation of SQuAD 2.0, the CoQA script reports turn-level Exact Match (EM) and F1 scores over all turns in all dialogues. EM requires the answer to exactly match the human-annotated answer span. The F1 score is the average of the individual answers’ F1 scores; it is typically higher than EM and is the main metric.
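
For intuition, turn-level EM and token-level F1 can be computed roughly as in the sketch below; this is a simplified illustration of ours, while the official CoQA script additionally normalizes punctuation and articles and averages over multiple reference answers.

from collections import Counter

def exact_match(pred, gold):
    """1.0 if the predicted answer string equals the gold answer, else 0.0."""
    return float(pred.strip().lower() == gold.strip().lower())

def f1_score(pred, gold):
    """Token-overlap F1 between a predicted and a gold answer string."""
    pred_toks, gold_toks = pred.lower().split(), gold.lower().split()
    if not pred_toks or not gold_toks:
        return float(pred_toks == gold_toks)
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)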

5.5 Result Analysis

Domain Laptop Rest.
Methods EM F1 EM F1
DrQA Reddy et al. (2018) 28.5 36.6 41.6 50.3
DrQA+CoQA Reddy et al. (2018) 40.4 51.4 47.7 58.5
BERT 38.57 48.67 46.87 55.07
BERT+review 34.53 43.83 47.23 53.7
BERT+CoQA(supervised) 47.1 58.9 56.57 67.97
BERT+Pre-tuning 46.0 57.23 54.57 64.43
Table 3: Results of RCRC on EM (Exact Match) and F1.

As shown in Table 3, BERT+Pre-tuning has significant performance gains over the baselines. To answer RQ1, we see that BERT is better than the DrQA baseline from CoQA. To answer RQ2, we notice that BERT+Pre-tuning brings about a 9% performance gain over vanilla BERT. Note that directly using review documents to continually pre-train BERT does not yield better results for BERT+review. We suspect that RCRC still requires a certain degree of general language understanding and that BERT+review suffers from (catastrophic) forgetting Kirkpatrick et al. (2017) of the strengths of BERT. To answer RQ3, we notice that large-scale supervised CoQA data boosts the performance of both DrQA and BERT. However, our pre-tuning stage still achieves competitive performance while requiring no annotation at all.

6 Conclusions

In this paper, we propose a novel task called review conversational reading comprehension (RCRC). We investigate the possibility of interactive question answering that uses reviews as a source of knowledge about user experiences. We first build a dataset derived from popular review datasets for sentiment analysis. To address the limited supervision caused by the prohibitive cost of annotation, we introduce a novel pre-tuning stage that performs task-adaptation on top of a pre-trained language model. This pre-tuning stage can potentially be used for any MRC or CRC task, since it requires no annotation, only large QA and review corpora that are readily available online. Experimental results show that the pre-tuning approach is highly effective: it outperforms existing baselines and is highly competitive with supervised baselines trained on a large-scale dataset.

References