Seeking information to assess whether a product or service suits one’s needs is a vital activity in consumer decision making. In online businesses, one major hindrance is that customers have limited access to answers to their specific questions or concerns about products and user experiences. Given the ever-changing environment of products and services, it is very hard, if not impossible, to pre-compile an up-to-date knowledge base to answer user questions as in KB-QA Kwok et al. (2001); Fader et al. (2014); Yin et al. (2015); Xu et al. (2016). As a compromise, community question-answering (CQA) McAuley and Yang (2016) is leveraged to enable existing customers or sellers to answer customer questions. However, one obvious drawback of this approach is that many questions are never answered, and even when they are, the answers and any follow-up questions are delayed, which makes CQA unsuitable for interactive QA. Although existing studies have used information retrieval (IR) techniques McAuley and Yang (2016); Yu and Lam (2018) to identify a whole review as an answer to a question, reading a whole review is time-consuming and the approach has difficulty answering questions in multiple turns.
Table 1: An example in the laptop domain: a review and five turns of questions from a customer.
|A Laptop Review:|
|I purchased my Macbook Pro Retina from|
|my school since I had a student discount ,|
|but I would gladly purchase it from Amazon|
|for full price again if I had too . The Retina|
|is great , its amazingly fast when it|
|boots up because of the SSD storage|
|and the clarity of the screen is amazing as well .|
|Turns of Questions from a Customer:|
|$q_1$: how is retina display ?|
|$q_2$: speed of booting up ?|
|$q_3$: why ?|
|$q_4$: what ’s the capacity of that ? (NO ANSWER)|
|$q_5$: is the screen clear ?|
Inspired by recent research in Conversational Reading Comprehension (CRC) Reddy et al. (2018); Choi et al. (2018), we explore the possibility of turning reviews into a source of valuable knowledge about user experiences and of providing a natural way to answer customers’ multi-turn questions in a dialogue setting. The conversational setting of machine reading comprehension (MRC) enables more specific questions and allows customers to either omit or co-reference information in the context. As an example in the laptop domain shown in Table 1, a customer may ask 5 turns of questions based on the context. The customer first asks an opinion question targeting an aspect, “retina display”, of a to-be-purchased laptop. Then the customer carries over (and omits) the question type, opinion, from the first question to the second and asks about a second aspect, “boot-up speed”. For the third question, the customer carries over the aspect of the second question but changes the question type to opinion explanation. Later, the customer can co-reference the aspect “SSD” from the previous answer and ask for the capacity (a sub-aspect) of the “SSD”. Unfortunately, this review contains no answer to the fourth question, so the system may reply “I don’t know”. But the customer can still ask about other aspects, as in the fifth question. We formally define this problem as follows and call it review conversational reading comprehension (RCRC).
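Problem Definition: Given a review that consists of a sequence of $n$ tokens $d = (d_1, \dots, d_n)$, a history of past $k-1$ questions and answers as the context $C = (q_1, a_1, q_2, a_2, \dots, q_{k-1}, a_{k-1})$, and the current question $q_k$, find a sequence of tokens (a textual span) $a = (d_s, \dots, d_e)$ in $d$ that answers $q_k$ based on $C$, where $1 \le s \le n$, $e \le n$, and $s \le e$, or return NO ANSWER ($s, e = 0$) if the review does not contain any answer for $q_k$.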
RCRC is a novel QA task that requires understanding both the current question and the dialogue context. Compared to traditional single-turn MRC, the key challenge is how to understand the context together with the current question, given that the question may require co-reference resolution or context carryover.
To the best of our knowledge, there are no existing review datasets for RCRC. We first build a dataset based on laptop and restaurant reviews from SemEval 2016 Task 5 (http://alt.qcri.org/semeval2016/task5/). We choose this dataset to better align with existing research on review-based tasks in sentiment analysis. Each review is annotated with a few dialogues focusing on some topics. Note that although each dialogue is annotated on a single review, a trained RCRC model can potentially be deployed on an open set of reviews Chen et al. (2017), where the context may contain answers from different reviews. Given the wide spectrum of domains in online business (e.g., thousands of categories on Amazon.com) and the prohibitive cost of annotation, the dataset is designed to have limited supervision, as in other tasks of sentiment analysis.
We adopt BERT (Bidirectional Encoder Representations from Transformers Devlin et al. (2018)) as our base model since its variants achieve dominant performance on MRC Rajpurkar et al. (2016, 2018) and CRC Reddy et al. (2018) tasks. However, BERT is designed to learn features for a wide spectrum of NLP tasks (https://gluebenchmark.com/leaderboard) with a large number of training examples (for example, the number of training examples for MRC and CRC is typically more than 100K). The task-awareness of BERT can be hindered by the weak supervision of the dataset. To resolve this challenge, we introduce a novel pre-tuning stage between pre-training and end-task fine-tuning for BERT. The pre-tuning stage is formulated in a similar fashion to the RCRC task but requires no annotated RCRC data, only domain QA pairs (from CQA) and reviews, which are readily available online McAuley and Yang (2016). We bring certain characteristics of the RCRC task (inputs/outputs) to pre-tuning to encourage BERT’s weights to be prepared for understanding the current question and locating the answer if one exists. The proposed pre-tuning step is general and can potentially be used in MRC or CRC tasks in other domains.
The main contributions of this paper are as follows. (1) It proposes a practical new task on reviews that allows multi-turn conversational QA. (2) To address this problem, an annotated dataset is first created. (3) It then proposes a pre-tuning stage to learn task-aware representations. Experimental results show that the proposed approach achieves competitive performance even when compared with a supervised approach trained on large-scale data.
2 Related Works
MRC (or CRC) has been studied in many domains with formal written texts, e.g., Wikipedia (WikiReading Hewlett et al. (2016), SQuAD Rajpurkar et al. (2016, 2018), WikiHop Welbl et al. (2018), DRCD Shao et al. (2018), QuAC Choi et al. (2018), HotpotQA Yang et al. (2018)), fictional stories (MCTest Richardson et al. (2013), CBT Hill et al. (2015), NarrativeQA Kočiskỳ et al. (2018)), general Web documents (MS MARCO Nguyen et al. (2016), TriviaQA Joshi et al. (2017), SearchQA Dunn et al. (2017)) and news articles (NewsQA Trischler et al. (2016), CNN/Daily Mail Hermann et al. (2015), and RACE Lai et al. (2017)). Recently, CRC Reddy et al. (2018); Huang et al. (2018); Zhu et al. (2018) has gained increasing popularity as it allows natural multi-turn questions. Examples are QuAC Choi et al. (2018) and CoQA Reddy et al. (2018). CoQA is built from multiple sources, such as Wikipedia, Reddit, News, Mid/High School Exams, Literature, etc. To the best of our knowledge, CRC has not been used on reviews, which are primarily subjective. Our dataset is compatible with the format of the CoQA dataset, so all CoQA-based models can be easily adapted to it. Answers in our dataset are intended to be extractive (similar to SQuAD Rajpurkar et al. (2016, 2018)) rather than abstractive (generative) (such as in MS MARCO Nguyen et al. (2016) and CoQA Reddy et al. (2018)) because we believe online businesses are cost-sensitive, so relying on human-written answers is more reliable than relying on machine-generated answers.
Traditionally, knowledge bases (KBs) (such as Freebase Dong et al. (2015); Xu et al. (2016); Yao and Van Durme (2014) or DBpedia Lopez et al. (2010); Unger et al. (2012)) have been used for question answering Yu and Lam (2018). However, in the ever-changing environment of online businesses, new products and services appear constantly, making it prohibitive to build a high-quality KB that covers all new products, services, and subjective experiences from customers. Community QA (CQA) is widely adopted by online businesses McAuley and Yang (2016) to help users get answers to their questions. However, since the answers are written by humans, it often takes a long time to get a question answered, or the question is never answered at all, as we discussed in the introduction. Some studies align reviews to questions in CQA as an information retrieval task McAuley and Yang (2016); Yu and Lam (2018), but a whole review is hard to read and not suitable for follow-up questions. We use CQA data in a novel way for CRC (or potentially for MRC): it plays a significant role in encouraging domain representation learning on questions and contexts, which is largely ignored in existing research on MRC (or CRC).
In this section, we briefly review BERT (Bidirectional Encoder Representations from Transformers Devlin et al. (2018)), which is one of the key innovations of unsupervised contextualized representation learning Peters et al. (2018); Howard and Ruder (2018); Radford et al. ; Devlin et al. (2018). The idea behind these innovations is that although the word embedding Mikolov et al. (2013); Pennington et al. (2014) layer is trained from large-scale corpora, relying on the limited supervised data from end tasks to train the contextualized representation is insufficient. Unlike ELMo Peters et al. (2018) and ULMFiT Howard and Ruder (2018), which are designed to provide additional features for an end task, BERT adopts a fine-tuning approach that requires almost no task-specific architecture design but relies on a parameter-intensive model in BERT itself. As such, BERT requires pre-training on large-scale data (Wikipedia articles) to fit its many parameters, in exchange for the hand-crafted architecture designs for specific end tasks that would otherwise carry human understanding of the data of those tasks.
One training example of BERT is formulated as $([\text{CLS}], x_{1:j}, [\text{SEP}], x_{j+1:n}, [\text{SEP}])$, where [CLS] and [SEP] are special tokens and $x_{1:n}$ is a document split into two sides of sentences, $x_{1:j}$ and $x_{j+1:n}$.
The key performance gain of BERT comes from two novel pre-training objectives: masked language model (MLM) and next sentence prediction.
Masked Language Model enables learning bidirectional language models and essentially encourages a BERT model to predict randomly masked words given their contexts. This is crucial for RCRC. For example, an example can be “its amazingly [MASK] when it boots up because of the [MASK] storage”. These two [MASK] tokens encourage BERT to guess that the first mask could be “fast” and the second mask could be “SSD”, so as to learn some common knowledge about aspects of laptops and their potential opinions.
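The masking step can be illustrated with a minimal sketch. It is a simplification under stated assumptions: the function name `mask_tokens`, the whitespace tokenization, and the masking probability are hypothetical choices made here for illustration, and every selected token is replaced by [MASK], whereas BERT's actual procedure also sometimes keeps the token or substitutes a random one.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with [MASK].

    Returns the masked sequence and a dict mapping each masked
    position to the original token the model should predict.
    Simplified: selected tokens are always replaced by [MASK].
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets[i] = tok  # prediction target for position i
        else:
            masked.append(tok)
    return masked, targets

if __name__ == "__main__":
    review = "its amazingly fast when it boots up because of the SSD storage"
    masked, targets = mask_tokens(review.split(), mask_prob=0.3, seed=1)
    print(" ".join(masked))
    print(targets)
```

Which positions end up masked depends on the seed; the point is only that the prediction targets are the original words at the masked positions.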
Next Sentence Prediction further encourages BERT to learn inter-sentence representations by predicting whether two sides around the first [SEP] are from the same document or not. We remove this objective in our pre-tuning as the text format is different from BERT pre-training (discussed in the next Section).
In summary, we can see that the pre-trained BERT severely lacks RCRC task-awareness: it has no formulation for the context, the current-turn question, or possible answer spans, and Wikipedia contains almost no questions or domain knowledge about online businesses. We resolve these issues in the next section.
4 Task-awareness Pre-tuning
To address the limitation of BERT on task-awareness, we introduce an intermediate stage of pre-tuning between BERT pre-training and fine-tuning on RCRC. This works in a similar spirit to the invention of BERT (or any other pre-trained language model) because it is also insufficient to learn the end-task definition (or setting) solely from the limited supervised data of that task. The task-awareness is determined by the inputs and outputs of RCRC, which introduce two directions for pre-tuning: (1) understanding the text inputs, including both domains and text formats (e.g., contexts and current questions); and (2) understanding the goal of RCRC, which includes both predicting a text span and predicting no answer. As such, we first define the textual format that is shared by both RCRC and BERT pre-tuning in Section 4.1. We then describe pre-tuning data generation in Section 4.2 and introduce an auxiliary pre-tuning objective in Section 4.3.
4.1 Textual Format
Inspired by the recent implementation of DrQA for CoQA Reddy et al. (2018) and BERT for SQuAD, we formulate an input example $x$ for pre-tuning (or RCRC) from the context $C$, the current question $q_k$, and the review $d$ as follows:
$$x = (\text{[CLS]}\ \text{[Q]}\ q_1\ \text{[A]}\ a_1\ \dots\ \text{[Q]}\ q_{k-1}\ \text{[A]}\ a_{k-1}\ \text{[Q]}\ q_k\ \text{[SEP]}\ d_{1:n}\ \text{[SEP]}),$$
where the past QA pairs $q_1, a_1, \dots, q_{k-1}, a_{k-1}$ in $C$ are concatenated and separated by two special tokens [Q] and [A], and then concatenated with the current question $q_k$ as the left side of BERT; the right side is the review document $d_{1:n}$. This format is used for both pre-tuning and RCRC task fine-tuning. Note that the answer to a question with no answer in the context is written as a single word, “unknown”. One can observe that although this format is simple and intuitive for humans to read, BERT’s pre-trained weights carry no knowledge of the semantics behind it (e.g., where the current question is, how many turns are in the context, and where the previous turn is), let alone of the special tokens [Q] and [A], which never appear during BERT pre-training.
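The textual format can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the function name `build_input` is hypothetical, whitespace splitting stands in for BERT's WordPiece tokenizer, and the exact placement of [CLS] and of the [Q] marker before the current question is an assumption based on the description above.

```python
def build_input(history, question, review_tokens):
    """Assemble one input example.

    history:       list of (question, answer) strings from past turns;
                   unanswerable turns carry the answer "unknown"
    question:      the current question string
    review_tokens: the review as a list of tokens
    """
    left = ["[CLS]"]
    for past_q, past_a in history:
        # Past turns are marked by the special tokens [Q] and [A].
        left += ["[Q]"] + past_q.split() + ["[A]"] + past_a.split()
    # The current question closes the left side.
    left += ["[Q]"] + question.split() + ["[SEP]"]
    # The review document forms the right side.
    return left + list(review_tokens) + ["[SEP]"]

if __name__ == "__main__":
    history = [("how is retina display ?", "great")]
    review = "The Retina is great , its amazingly fast".split()
    print(" ".join(build_input(history, "speed of booting up ?", review)))
```

With an empty history, the left side reduces to [CLS], [Q], the current question, and [SEP], which corresponds to a single-turn MRC input.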
4.2 Pre-tuning Data Generation
Based on the format defined in Section 4.1, we can observe that getting BERT to be familiar with domain reviews is as easy as continuing to train BERT on reviews. However, enabling BERT to understand the context and the current question is more challenging, as the pre-training data of BERT contains almost no questions. To resolve this issue, we combine QA pairs (from CQA data) and reviews to formulate the pre-tuning examples, as shown in Algorithm 1. Note that these two kinds of data are often readily available across a wide range of products on Amazon.com and Yelp.com.
To ensure that the topic of a pre-tuning example is consistent between QA pairs and reviews, we assume that QA pairs and reviews are organized under each entity (a laptop or a restaurant in our experiment) that customers focus on. The inputs to Algorithm 1 are a set of QA pairs and a set of reviews belonging to the same entity, together with the maximum number of turns in the context, which is the same as in the RCRC dataset. The output is the pre-tuning data, initialized in Line 1, where each example pairs an input example with the two pointers for the auxiliary objective (discussed in Section 4.3). Given a QA pair in Line 2, we first build the left side of the input example in Lines 3-9. After initializing the input in Line 3, we randomly determine the number of turns used as context in Line 4 and concatenate these turns of QA pairs in Lines 5-8, ensuring that the current QA pair is not chosen. In Line 9, we concatenate the current question. Lines 10-23 build the right side of the input example and the output pointers. In Line 10, we randomly draw a review and split it into sentences. To challenge the pre-tuning stage to discover the semantic relatedness between the current question and its answer (as required by the auxiliary objective), we first decide whether the right side contains the true answer to the current question (Line 16) or a fake random answer (Lines 11-12). We also introduce two pointers, initialized in Lines 13 and 17. Then, we insert the chosen answer into the review at a location randomly picked from the candidate locations in Lines 19-20, which gives us the augmented review. For a true answer, we further update the two pointers so that they point to the chunk boundaries of the inserted answer. Otherwise, BERT should detect that there is no true answer on the right side and point to [CLS]. Finally, examples are aggregated in Line 25.
Algorithm 1 is run multiple times to allow for enough sampling of the data. As we can see, although labeled training examples for RCRC are expensive to obtain, harvesting a large amount of pre-tuning data is easy. Following the success of BERT, we still randomly mask some words in each example to learn contextualized representations on domain texts.
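The data generation procedure described above can be sketched as follows. This is a reconstruction from the prose, not the authors' Algorithm 1: the function name `generate_pretuning_data`, the 50% chance of a negative example, the choice of sentence boundaries as insertion locations, and the inclusive end pointer are all assumptions made for illustration. Questions, answers, and review sentences are assumed to be pre-tokenized lists of tokens.

```python
import random

def generate_pretuning_data(qa_pairs, reviews, max_turns, seed=0):
    """Build (input_tokens, (start, end)) pre-tuning examples.

    qa_pairs:  list of (question_tokens, answer_tokens) for one entity
    reviews:   list of reviews, each a list of sentences (token lists)
    max_turns: maximum number of past turns placed in the context
    """
    if len(qa_pairs) < 2:
        raise ValueError("need at least two QA pairs to sample context turns")
    rng = random.Random(seed)
    data = []
    for idx, (question, answer) in enumerate(qa_pairs):
        others = [p for j, p in enumerate(qa_pairs) if j != idx]
        # Left side: random past turns followed by the current question.
        x = ["[CLS]"]
        for _ in range(rng.randint(0, max_turns)):
            ctx_q, ctx_a = rng.choice(others)
            x += ["[Q]"] + ctx_q + ["[A]"] + ctx_a
        x += ["[Q]"] + question + ["[SEP]"]
        # Right side: a random review with an answer chunk inserted.
        review = rng.choice(reviews)
        positive = rng.random() < 0.5
        chunk = answer if positive else rng.choice(others)[1]
        loc = rng.randint(0, len(review))  # insertion point between sentences
        before = [tok for sent in review[:loc] for tok in sent]
        after = [tok for sent in review[loc:] for tok in sent]
        if positive:
            start = len(x) + len(before)
            end = start + len(chunk) - 1  # inclusive end pointer
        else:
            start = end = 0  # both pointers on [CLS]: no answer
        x += before + chunk + after + ["[SEP]"]
        data.append((x, (start, end)))
    return data
```

In a positive example the two pointers delimit the inserted answer chunk; in a negative example both point to position 0, the [CLS] token.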
4.3 Auxiliary Objective
Besides adapting the input to domains and the RCRC task, it is also desirable to allow pre-tuning to adapt BERT to the goal of the RCRC task, which is to predict a token span or NO ANSWER. In addition to MLM from BERT, we introduce an auxiliary objective called answer chunk detection, which trains BERT in a similar fashion to RCRC, except that we only predict the token span of an answer chunk from CQA. Further, this objective challenges BERT to be prepared for predicting NO ANSWER from a review by detecting a negative, randomly drawn answer.
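Let $\text{BERT}(\cdot)$ be BERT’s transformer model. We first obtain the hidden representation of BERT as $h = \text{BERT}(x) \in \mathbb{R}^{r_h \times |x|}$, where $|x|$ is the length of the input example $x$. Then the hidden representation is passed to two separate dense layers followed by softmax functions: $l_1 = \text{softmax}(W_1 \cdot h + b_1)$ and $l_2 = \text{softmax}(W_2 \cdot h + b_2)$, where $W_1, W_2 \in \mathbb{R}^{r_h}$, $b_1, b_2 \in \mathbb{R}$, and $r_h$ is the size of the hidden dimension (e.g., 768 for $\text{BERT}_{\text{BASE}}$). Training involves minimizing the averaged cross entropy on the two pointers $u$ and $v$ generated in Algorithm 1:
$$\mathcal{L} = -\frac{\sum \mathbb{I}(u) \log l_1 + \sum \mathbb{I}(v) \log l_2}{2},$$
where $\mathbb{I}(u)$ and $\mathbb{I}(v)$ are one-hot vectors representing the starting and ending positions. For a positive example (with the true answer randomly inserted in the review), $u$ and $v$ are expected to satisfy $p < u < |x|$ and $u \le v < |x|$, respectively, where $p$ is the position of the first [SEP]. For a negative example (with a random answer, not the true one, mixed into the review), $u, v = 0$ indicates that the two pointers must point to [CLS].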
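As a worked illustration of the objective, the sketch below computes the averaged cross entropy of a start pointer and an end pointer from per-token hidden vectors. It is a pure-Python sketch under stated assumptions: the names `pointer_loss`, `dense_scores`, and the weight arguments are hypothetical, each dense layer maps a token's hidden vector to one scalar score, and the softmax is taken over token positions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dense_scores(hidden, weight, bias):
    """Map each token's hidden vector to one scalar score."""
    return [sum(h_i * w_i for h_i, w_i in zip(h, weight)) + bias
            for h in hidden]

def pointer_loss(hidden, w_start, b_start, w_end, b_end, start, end):
    """Averaged cross entropy of the start and end pointers.

    hidden: list of per-token hidden vectors (seq_len x hidden_size)
    start, end: gold pointer positions; (0, 0) means NO ANSWER ([CLS])
    """
    p_start = softmax(dense_scores(hidden, w_start, b_start))
    p_end = softmax(dense_scores(hidden, w_end, b_end))
    return -(math.log(p_start[start]) + math.log(p_end[end])) / 2.0
```

With all-zero weights, both distributions are uniform over the sequence, so the loss equals the logarithm of the sequence length regardless of where the gold pointers are; training lowers the loss by concentrating probability on the gold positions.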
After pre-tuning, we fine-tune on the RCRC task in a similar fashion to the auxiliary objective, except this time there is no need to perform MLM.
We aim to answer the following research questions (RQs) in the experiment:
RQ1: What is the performance of using BERT compared against CoQA baselines?
RQ2: Upon ablation studies of different applications of BERT, what is the performance gain of pre-tuning?
RQ3: What is the performance of pre-tuning compared to using (large-scale) supervised data?
5.1 Pre-tuning datasets
To be consistent with existing research on review-based tasks such as sentiment analysis, we adopt SemEval 2016 Task 5 as the review source for RCRC, which contains two domains: laptop and restaurant (we do not use SemEval 2014 Task 4 or SemEval 2015 Task 12 because they do not have review-level data). Then we collect reviews and QA pairs for these two domains. For the laptop domain, we collect the reviews from He and McAuley (2016) and QA pairs from Xu et al. (2018), both under the laptop category of Amazon.com. We exclude products in the test data of our dataset to make sure the test data is not used to tune any model parameters. This gives us 113,728 laptop reviews and 18,589 QA pairs. For the restaurant domain, we collect reviews from the Yelp dataset challenge (https://www.yelp.com/dataset/challenge) and crawl QA pairs from Yelp.com (https://www.yelpblog.com/2017/02/qa). We select restaurants with at least 100 reviews, as other restaurants seldom have any QA pairs. This yields 753,096 restaurant reviews and 15,457 QA pairs.
To compare with a supervised pre-tuning approach, we further leverage the CoQA dataset Reddy et al. (2018). It comes with 7,199 documents (passages) and 108,647 QA pairs of supervised training data across the domains of Children’s Story, Literature, Mid/High School, News, and Wikipedia.
To the best of our knowledge, there are no existing datasets for RCRC. We keep the training/testing split of the SemEval 2016 Task 5 datasets and annotate dialogues of QAs on each review. To ensure our questions are real-world questions, annotators are first asked to read CQAs from the pre-tuning data. Each dialogue is annotated to focus on certain topics of a review. The textual spans are kept as short as possible while remaining human-readable. No-answer questions are also annotated; these have certain topical connections with the nearby questions or answers. Annotators are encouraged to label about 2 dialogues per testing review to get enough testing examples, and about 1 dialogue per training review to obtain good coverage of reviews. Each question is shortened as much as possible to omit information that already exists in the past turns. The annotated data is in the format of CoQA Reddy et al. (2018) to help future research. The statistics of the dataset are shown in Table 2. We split off 20% of the training reviews as the validation set for each domain.
Table 2: Statistics of the dataset.
|Training||Laptop||Restaurant|
|# of reviews||446||350|
|# of dialogues||509||382|
|# of questions||1680||1485|
|% of no answers||24%||24%|
|Testing||Laptop||Restaurant|
|# of reviews||79||90|
|# of dialogues||179||163|
|# of questions||807||801|
|% of no answers||26.7%||27.9%|
5.3 Compared Methods
We compare the following methods by training/fine-tuning on our dataset.
All the baselines are run using their default hyper-parameters.
DrQA is a CRC baseline that comes with the CoQA dataset (https://github.com/stanfordnlp/coqa-baselines). Note that this implementation of DrQA is different from DrQA for SQuAD Chen et al. (2017) in that it is modified to support no-answer questions by appending a special token, unknown, to the end of the document. A predicted span containing unknown therefore indicates NO ANSWER. This baseline answers the research question RQ1.
DrQA+CoQA is the above baseline pre-tuned on the CoQA dataset and then fine-tuned on our dataset. We use this baseline to show that even DrQA pre-trained on CoQA is sub-optimal for RCRC. This baseline is used to answer RQ1 and RQ3.
BERT is the vanilla BERT model directly fine-tuned on our dataset. We use this baseline for the ablation study on the effectiveness of pre-tuning. All of the BERT variants below are used to answer RQ2.
BERT+review first tunes BERT on domain reviews using the same objectives as BERT pre-training and then fine-tunes on our dataset. We use this baseline to show that a simple domain adaptation of BERT is not sufficient.
BERT+CoQA first fine-tunes BERT on the supervised CoQA data and then fine-tunes on our dataset. We use this baseline to show that pre-tuning is very competitive even compared with models trained on large-scale supervised data. This also answers RQ3.
BERT+Pre-tuning first pre-tunes BERT as proposed and then fine-tunes on our dataset.
5.4 Hyper-parameters and Evaluation Metrics
We choose the BERT base model as our pre-tuning and fine-tuning model, which has 12 layers, 768 hidden dimensions, and 12 attention heads (in the transformer), for a total of 110M parameters. We cannot use the BERT large model as we cannot fit it into our GPU memory for training. We set the maximum length to 256 with a batch size of 16. We perform pre-tuning for 10k steps, as further increasing the number of pre-tuning steps does not yield better results. We fine-tune for 6 epochs, though most runs converge within 3 epochs due to the pre-trained/tuned weights of BERT. Results are reported as averages of 3 runs of fine-tuning (3 different random seeds for tuning batch generation).
To be consistent with existing research, we leverage the same evaluation script as CoQA (we made only the minimum changes necessary for the evaluation script to handle new domains). Similar to the evaluation of SQuAD 2.0, the CoQA script reports turn-level Exact Match (EM) and F1 scores for all turns in all dialogues. EM requires the answer to exactly match the human-annotated answer span as a string. The F1 score is the average of the F1 scores of individual answers; it is typically higher than EM and is the major metric.
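The two metrics can be illustrated with a simplified sketch. It is not the CoQA evaluation script: the function names are hypothetical, and normalization is reduced to lowercasing and whitespace splitting, whereas the official script applies further normalization.

```python
from collections import Counter

def exact_match(prediction, gold):
    """1.0 if the normalized prediction equals the gold answer, else 0.0."""
    return float(prediction.lower().split() == gold.lower().split())

def f1_score(prediction, gold):
    """Token-level F1 between a predicted span and a gold span."""
    pred = prediction.lower().split()
    ref = gold.lower().split()
    if not pred or not ref:
        # Two empty answers agree; otherwise there is no overlap.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def average(scores):
    """Turn-level average over all questions in all dialogues."""
    return sum(scores) / len(scores)
```

For example, predicting "the SSD storage" against the gold span "SSD storage" gives an EM of 0 but an F1 of 0.8 (precision 2/3, recall 1), which shows why F1 is typically higher than EM.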
5.5 Result Analysis
|Methods||Laptop EM||Laptop F1||Restaurant EM||Restaurant F1|
|DrQA Reddy et al. (2018)||28.5||36.6||41.6||50.3|
|DrQA+CoQA Reddy et al. (2018)||40.4||51.4||47.7||58.5|
As shown in Table 3, BERT+Pre-tuning has significant performance gains over many baselines. To answer RQ1, we can see that BERT is better than the DrQA baseline from CoQA. To answer RQ2, we notice that BERT+Pre-tuning yields a performance gain of about 9%. Note that directly using review documents to continue pre-training BERT does not yield better results for BERT+review. We suspect that the RCRC task still requires a certain degree of general language understanding and that BERT+review suffers from (catastrophic) forgetting Kirkpatrick et al. (2017) of the strengths of BERT. To answer RQ3, we notice that large-scale supervised CoQA data can boost the performance of both DrQA and BERT. However, our pre-tuning stage still has competitive performance, and it requires no annotation at all.
In this paper, we propose a novel task called review conversational reading comprehension (RCRC). We investigate the possibility of interactive question answering by using reviews as knowledge of user experiences. We first build a dataset, which is derived from popular review datasets for sentiment analysis. To resolve the issue of limited supervision introduced by the prohibitive cost of annotation, we introduce a novel pre-tuning stage to perform task adaptation from a language model. This pre-tuning stage can potentially be used for any MRC or CRC task, given that it requires no annotation and only large QA and review corpora available online. Experimental results show that the pre-tuning approach is highly effective: it outperforms existing baselines and is highly competitive with supervised baselines trained on a large-scale dataset.
- Chen et al. (2017) Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading wikipedia to answer open-domain questions. arXiv preprint arXiv:1704.00051.
- Choi et al. (2018) Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. 2018. Quac: Question answering in context. arXiv preprint arXiv:1808.07036.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Dong et al. (2015) Li Dong, Furu Wei, Ming Zhou, and Ke Xu. 2015. Question answering over freebase with multi-column convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 260–269.
- Dunn et al. (2017) Matthew Dunn, Levent Sagun, Mike Higgins, V Ugur Guney, Volkan Cirik, and Kyunghyun Cho. 2017. Searchqa: A new q&a dataset augmented with context from a search engine. arXiv preprint arXiv:1704.05179.
- Fader et al. (2014) Anthony Fader, Luke Zettlemoyer, and Oren Etzioni. 2014. Open question answering over curated and extracted knowledge bases. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1156–1165. ACM.
- He and McAuley (2016) Ruining He and Julian McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In proceedings of the 25th international conference on world wide web, pages 507–517. International World Wide Web Conferences Steering Committee.
- Hermann et al. (2015) Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1693–1701.
- Hewlett et al. (2016) Daniel Hewlett, Alexandre Lacoste, Llion Jones, Illia Polosukhin, Andrew Fandrianto, Jay Han, Matthew Kelcey, and David Berthelot. 2016. Wikireading: A novel large-scale language understanding task over wikipedia. arXiv preprint arXiv:1608.03542.
- Hill et al. (2015) Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. 2015. The goldilocks principle: Reading children’s books with explicit memory representations. arXiv preprint arXiv:1511.02301.
- Howard and Ruder (2018) Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 328–339.
- Huang et al. (2018) Hsin-Yuan Huang, Eunsol Choi, and Wen-tau Yih. 2018. Flowqa: Grasping flow in history for conversational machine comprehension. arXiv preprint arXiv:1810.06683.
- Joshi et al. (2017) Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. 2017. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551.
- Kirkpatrick et al. (2017) James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, page 201611835.
- Kočiskỳ et al. (2018) Tomáš Kočiskỳ, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. 2018. The narrativeqa reading comprehension challenge. Transactions of the Association of Computational Linguistics, 6:317–328.
- Kwok et al. (2001) Cody Kwok, Oren Etzioni, and Daniel S Weld. 2001. Scaling question answering to the web. ACM Transactions on Information Systems (TOIS), 19(3):242–262.
- Lai et al. (2017) Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. Race: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683.
- Lopez et al. (2010) Vanessa Lopez, Andriy Nikolov, Marta Sabou, Victoria Uren, Enrico Motta, and Mathieu d’Aquin. 2010. Scaling up question-answering to linked data. In International Conference on Knowledge Engineering and Knowledge Management, pages 193–210. Springer.
- McAuley and Yang (2016) Julian McAuley and Alex Yang. 2016. Addressing complex and subjective product-related queries with customer reviews. In Proceedings of the 25th International Conference on World Wide Web, pages 625–635. International World Wide Web Conferences Steering Committee.
- Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
- Nguyen et al. (2016) Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. Ms marco: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543.
- Peters et al. (2018) Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
- Radford et al. Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training.
- Rajpurkar et al. (2018) Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable questions for squad. arXiv preprint arXiv:1806.03822.
- Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
- Reddy et al. (2018) Siva Reddy, Danqi Chen, and Christopher D Manning. 2018. Coqa: A conversational question answering challenge. arXiv preprint arXiv:1808.07042.
- Richardson et al. (2013) Matthew Richardson, Christopher JC Burges, and Erin Renshaw. 2013. Mctest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 193–203.
- Shao et al. (2018) Chih Chieh Shao, Trois Liu, Yuting Lai, Yiying Tseng, and Sam Tsai. 2018. Drcd: a chinese machine reading comprehension dataset. arXiv preprint arXiv:1806.00920.
- Trischler et al. (2016) Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2016. Newsqa: A machine comprehension dataset. arXiv preprint arXiv:1611.09830.
- Unger et al. (2012) Christina Unger, Lorenz Bühmann, Jens Lehmann, Axel-Cyrille Ngonga Ngomo, Daniel Gerber, and Philipp Cimiano. 2012. Template-based question answering over rdf data. In Proceedings of the 21st international conference on World Wide Web, pages 639–648. ACM.
- Welbl et al. (2018) Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association of Computational Linguistics, 6:287–302.
- Xu et al. (2018) Hu Xu, Sihong Xie, Lei Shu, and Philip S. Yu. 2018. Dual attention network for product compatibility and function. In Proceedings of AAAI Conference on Artificial Intelligence (AAAI).
- Xu et al. (2016) Kun Xu, Siva Reddy, Yansong Feng, Songfang Huang, and Dongyan Zhao. 2016. Question answering on freebase via relation extraction and textual evidence. arXiv preprint arXiv:1603.00957.
- Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600.
- Yao and Van Durme (2014) Xuchen Yao and Benjamin Van Durme. 2014. Information extraction over structured data: Question answering with freebase. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 956–966.
- Yin et al. (2015) Jun Yin, Xin Jiang, Zhengdong Lu, Lifeng Shang, Hang Li, and Xiaoming Li. 2015. Neural generative question answering. arXiv preprint arXiv:1512.01337.
- Yu and Lam (2018) Qian Yu and Wai Lam. 2018. Aware answer prediction for product-related questions incorporating aspects. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 691–699. ACM.
- Zhu et al. (2018) Chenguang Zhu, Michael Zeng, and Xuedong Huang. 2018. Sdnet: Contextualized attention-based deep network for conversational question answering. arXiv preprint arXiv:1812.03593.