* Work done during Mengxi Wei’s internship at Alibaba Group.
Knowledge Base Question Answering (KBQA) aims to build systems that automatically respond to natural language questions with natural answers, using information in knowledge bases. It has long been an active research topic. Traditionally, systems are built as a pipeline of components, including natural language understanding (NLU), KB retrieval, reasoning, and answer generation. The components can be based on either rules or statistical models [2, 6, 26, 28, 7].
One limitation of the pipeline approach is that it requires substantial resources to scale up or adapt to different domains. Moving the pipeline to new domains often means preparing new sets of rules and/or annotating new data to train different statistical models, both of which are difficult and time-consuming.
End-to-end KBQA approaches [24, 8] have attracted interest in recent years because they can be trained directly on QA pairs collected from the web. Such pairs are easier to acquire than, e.g., the NLU frame annotations required by pipeline KBQA systems. User-generated QA data are usually abundant and have facilitated the training of several successful KBQA models [24, 8].
However, the quality of user-generated question-answer pairs is not guaranteed. Figure 1 is a screenshot from a community QA website. We observe two problems: 1) irrelevant answers, e.g. the second answer is a vulgar joke that does not contain any information that we can find in a reasonable KB; and 2) inconsistent answers, e.g. the third answer in Figure 1 covers more complete information in the KB than the first. It is desirable to filter out noisy responses and to promote high quality answers.
In this paper, we attempt to address these challenges by organizing data differently from previous work. Instead of training on QA pairs (as most end-to-end KBQA systems do to the best of our knowledge), we organize answers to the same question into bags and train with the multi-instance learning principle. This formulation allows us to select and weight answers according to consensus among the answers in the same bag.
2 Related Work
Prior work has explored KBQA using neural networks. GenQA [24] is among the first in this strand of research to retrieve facts from the KB and generate answers from these facts. Our model follows the approach of CoreQA [8], which improves GenQA by allowing multiple attentive reads from both the KB and the question.
A related but different question answering approach is based on Machine Reading Comprehension (MRC) [22, 25]. Recent MRC research utilizes relation graphs among entities to integrate evidence in text [20]. Our work differs from MRC in that the knowledge graph in KBQA is manually curated, while MRC extracts the answer from free text.
Multi-instance learning [4] is a variant of supervised learning where inputs are bags of instances. One successful application in NLP is distant supervision of relation extractors, either by learning from only some of the instances [18, 27] or by assigning different weights to the instances under mechanisms such as selective attention [13]. Our task is a generation problem, to which classification techniques developed for relation extraction cannot be applied directly.
We use curriculum learning to schedule training instances in our method. Curriculum learning [1] is a paradigm that schedules simple training examples during early epochs of training and gradually adds hard examples to the process. It has found successful applications in multi-task NLP [16] and KBQA [Liu et al.2018].
3 End-to-end KBQA
The KBQA task takes a question and a KB as input and outputs a response based on the information in the KB.
Our baseline KBQA model follows the CoreQA [8] approach. The model encodes the question as $M_Q$ with a bi-directional LSTM and the KB as $M_{KB}$ by concatenating the embeddings of the subject-predicate-object triples. The model decodes with an LSTM decoder that generates words from the output vocabulary and copies or retrieves words from $M_Q$ and $M_{KB}$, in similar fashion to the pointer-generator [19].
3.1 Question Encoder
The question encoder encodes a question $Q$ into a vector encoding $\mathbf{q}$ and builds a short term memory $M_Q$ that covers every word in the question.
The question is a word sequence $Q = (x_1, \dots, x_n)$ of length $n$. For a word $x_i$, we use a bi-directional LSTM encoder to obtain its forward state $\overrightarrow{h}_i$ and its backward state $\overleftarrow{h}_i$. The short term memory of the sentence is the concatenated forward and backward states of the words, i.e. $M_Q = \{h_1, \dots, h_n\}$, where $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$. The question is encoded by the concatenation of the last states in both directions, i.e. $\mathbf{q} = [\overrightarrow{h}_n; \overleftarrow{h}_1]$.
3.2 KB Encoder
The KB encoder encodes a KB (a set of relation triples) into a short term memory $M_{KB}$.
The KB consists of relation triples of type <s, p, o>, in which s (subject) is the entity, p (predicate) is the name of the relation, and o (object) is the value of the relation. For example, <Cao Cao, nickname, A-man> is a fact about the entity Cao Cao, stating that his nickname is A-man.
Denoting the embeddings of s, p, and o as $e_s$, $e_p$, and $e_o$ respectively, we represent a fact with the concatenation of these three embeddings: $f = [e_s; e_p; e_o]$. We consider fact representations as the short term memory of the KB, i.e. $M_{KB} = \{f_1, \dots, f_m\}$, where $m$ is the number of facts.
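As a minimal sketch of the fact representation described above, the snippet below concatenates subject, predicate, and object embeddings into one vector per triple. The function name `encode_kb`, the 2-dimensional toy embeddings, and the second triple are illustrative assumptions, not details of the actual model, which learns its embeddings.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def encode_kb(triples: List[Triple],
              emb: Dict[str, List[float]]) -> List[List[float]]:
    """Return one fact vector per triple: the concatenation [e_s; e_p; e_o]."""
    return [emb[s] + emb[p] + emb[o] for s, p, o in triples]


# Toy 2-dimensional embeddings (hypothetical values, for illustration only).
toy_emb = {
    "Cao Cao": [0.1, 0.2],
    "nickname": [0.3, 0.4],
    "A-man": [0.5, 0.6],
    "Mengde": [0.7, 0.8],
}
kb = [("Cao Cao", "nickname", "A-man"), ("Cao Cao", "nickname", "Mengde")]

# The short term KB memory is simply the list of fact vectors.
kb_memory = encode_kb(kb, toy_emb)
```

Each fact vector has three times the embedding dimension, and the memory holds one vector per fact.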
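3.3 Decoder
The decoder is an LSTM [9] network that generates answers from the question encoding $\mathbf{q}$ while attending to the short term question memory $M_Q$ and the short term KB memory $M_{KB}$. The output at time $t$ is a mixture of three modes: prediction (pr), copy (cp), and retrieve (re), as in Eq. (1).
$$p(y_t \mid \cdot) = p_m(\mathrm{pr} \mid \cdot)\, p_{pr}(y_t \mid \cdot) + p_m(\mathrm{cp} \mid \cdot)\, p_{cp}(y_t \mid \cdot) + p_m(\mathrm{re} \mid \cdot)\, p_{re}(y_t \mid \cdot) \tag{1}$$
where $\cdot$ abbreviates the conditioning context, $p_{pr}$, $p_{cp}$, and $p_{re}$ are the prediction mode, copy mode, and retrieval mode output distributions respectively, and $p_m$ is the mode selector, implemented as a 2 layer NN with softmax activation.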
The state $s_t$ of the RNN is updated with the previous state $s_{t-1}$, the previously generated word $y_{t-1}$, and the attention context based on attentive reads of the question and the KB.
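The mixture of modes can be illustrated with a small sketch: each mode proposes a distribution over words, and the mode selector decides how much each mode contributes. The function `mix_modes` and all probabilities below are hypothetical; in the model these quantities come from neural networks.

```python
from typing import Dict


def mix_modes(mode_probs: Dict[str, float],
              mode_dists: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Combine per-mode word distributions, weighted by the mode selector."""
    mixed: Dict[str, float] = {}
    for mode, p_mode in mode_probs.items():
        for word, p_word in mode_dists[mode].items():
            mixed[word] = mixed.get(word, 0.0) + p_mode * p_word
    return mixed


# Hypothetical numbers for a single decoding step.
mode_probs = {"pr": 0.2, "cp": 0.3, "re": 0.5}
mode_dists = {
    "pr": {"is": 0.9, "the": 0.1},        # generate from the vocabulary
    "cp": {"Cao": 0.5, "nickname": 0.5},  # copy a word from the question
    "re": {"A-man": 0.8, "Mengde": 0.2},  # retrieve an object from the KB
}
p_y = mix_modes(mode_probs, mode_dists)
```

Because the mode selector and each mode distribution sum to one, the mixed distribution `p_y` also sums to one; here the retrieved word "A-man" receives the largest share.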
3.3.1 Prediction mode
The prediction mode generates new words from the vocabulary, considering attentive reads from the question and the KB memory, as in Eq. (2).
$$p_{pr}(y_t \mid \cdot) = f_{pr}(s_t, c^{Q}_t, c^{KB}_t) \tag{2}$$
where $s_t$ is the current state of the decoder, $c^{Q}_t$ is the attentive read from $M_Q$, and $c^{KB}_t$ is the attentive read from $M_{KB}$.
3.3.2 Copy mode
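The copy mode measures the probability of copying a word from the question, as in Eq. (3).
$$p_{cp}(y_t \mid \cdot) = f_{cp}(s_t, c^{Q}_t, hist^{Q}_t) \tag{3}$$
where $s_t$ is the decoder state, $c^{Q}_t$ is the attentive read of the question memory, and $hist^{Q}_t$ is the accumulated attention history on the question.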
3.3.3 Retrieval mode
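The retrieval mode measures the probability of retrieving the predicate value from the KB, as in Eq. (4).
$$p_{re}(y_t \mid \cdot) = f_{re}(s_t, c^{KB}_t, hist^{KB}_t) \tag{4}$$
where $s_t$ is the decoder state, $c^{KB}_t$ is the attentive read of $M_{KB}$, and $hist^{KB}_t$ is the accumulated attention history on the KB triples.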
In the three modes, $f_{pr}$, $f_{cp}$, and $f_{re}$ are 2 layer MLPs with softmax activation.
CoreQA is trained with the negative log likelihood loss. Given $N$ questions $Q_1, \dots, Q_N$ and answers $A_1, \dots, A_N$, the negative log likelihood loss sums over all answers, as in Eq. (5).
$$\mathcal{L}_{NLL} = -\sum_{i=1}^{N} \sum_{t=1}^{T_i} \log p(y^{(i)}_t \mid y^{(i)}_{<t}, Q_i, M_{KB}) \tag{5}$$
where $y^{(i)}_t$ is the $t$-th word in the $i$-th answer and $T_i$ is the length of the gold answer. The loss inherently assumes that all answers in the dataset are of the same quality, which is often not the case, as shown by the example in Figure 1.
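To make the equal-quality assumption concrete, here is a small sketch of a negative log likelihood that sums over every answer with the same weight. The per-token probabilities are hypothetical stand-ins for the decoder's probabilities of the gold words, and the function names are illustrative.

```python
import math
from typing import List


def answer_nll(token_probs: List[float]) -> float:
    """NLL of one answer, given the model probability of each gold token."""
    return -sum(math.log(p) for p in token_probs)


def dataset_nll(all_token_probs: List[List[float]]) -> float:
    """Sum the NLL over all answers; every answer counts equally."""
    return sum(answer_nll(probs) for probs in all_token_probs)


# Hypothetical per-token probabilities for two answers.
probs = [[0.5, 0.5], [0.25]]
loss = dataset_nll(probs)
```

A noisy answer that the model cannot explain contributes to this sum exactly like a good one, which is the behavior the bag-level losses in the next section try to avoid.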
4 Multi-Instance KBQA
4.1 Question Bags
We start to depart from prior end-to-end KBQA efforts as we organize question-answer pairs into question bags. A bag consists of a question and every answer to that question in the dataset. We perform instance selection or weighting on the bag level, following the principle of multi-instance learning.
Consider the question bag in Figure 2. There are four user-generated answers, A1–A4, to the question “What is the nickname of Cao Cao?” A3 and A4 try to answer the question directly with relevant knowledge. A2 covers two possible values (A-man, Mengde) of the KB predicate. A1 is an uninformative answer. By organizing Q and A1–A4 into a bag, we select or weight the answer instances according to their relevance to the question, so that the model learns from A2–A4 but ideally not from A1.
We define a QA bag to be a tuple $B_i = (Q_i, \mathcal{A}_i)$, where $\mathcal{A}_i = \{A_{i1}, \dots, A_{in_i}\}$ is the set of answers corresponding to the question $Q_i$, and propose new loss functions to select or weight the answers within a bag.
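Organizing a flat list of QA pairs into bags is a simple grouping step. The sketch below assumes questions are matched by exact string equality, and the example pairs are hypothetical; a real pipeline may need to normalize questions first.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_bags(qa_pairs: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group answers by question; each (question, answers) entry is one bag."""
    bags: Dict[str, List[str]] = defaultdict(list)
    for question, answer in qa_pairs:
        bags[question].append(answer)
    return dict(bags)


# Hypothetical QA pairs for illustration.
qa_pairs = [
    ("What is the nickname of Cao Cao?", "Check the history book yourself"),
    ("What is the nickname of Cao Cao?", "A-man"),
    ("How large is the screen of this phone?", "5.99 inches"),
]
bags = build_bags(qa_pairs)
```

The first question yields a multi-instance bag with two answers, and the second yields a single-instance bag.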
4.2 Answer Selection
Similar to distantly supervised relation extractors, we first make the assumption that at least one answer in the question bag is reasonable. Accordingly, instead of summing the loss over all answers, we only train on one answer per bag which is the easiest to learn. As is shown in Figure 2(a), only the loss on one of the answers will be back-propagated, so uninformative answers like “Check the history book yourself” will not affect training, as they do not utilize KB information and are harder to generate.
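Given $N$ QA bags $B_1, \dots, B_N$, we define the minimum bag loss in Eq. (6).
$$\mathcal{L}_{min} = \sum_{i=1}^{N} \ell_i \tag{6}$$
For each bag, we only calculate the loss on the one answer that is closest to the network output, as in Eq. (7).
$$\ell_i = \min_{1 \le j \le n_i} \left( -\frac{1}{lp(A_{ij})} \sum_{t=1}^{T_{ij}} \log p(y^{(ij)}_t \mid y^{(ij)}_{<t}, Q_i, M_{KB}) \right) \tag{7}$$
where $n_i$ is the number of answers in bag $B_i$, $y^{(ij)}_t$ is the $t$-th word of answer $A_{ij}$, and $T_{ij}$ is its length. Note that we add a length normalization term, $lp(A_{ij})$, so that we do not unjustly penalize long answers; the form of the term is inspired by [23].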
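The effect of length normalization on answer selection can be seen in a small sketch. The normalizer used here, answer length raised to a power `alpha`, is an assumed stand-in and not necessarily the term used in the experiments; the per-token probabilities are also hypothetical.

```python
import math
from typing import List


def normalized_nll(token_probs: List[float], alpha: float = 1.0) -> float:
    """Answer NLL divided by len**alpha (assumed normalizer; alpha=0 disables it)."""
    nll = -sum(math.log(p) for p in token_probs)
    return nll / (len(token_probs) ** alpha)


def select_answer(bag: List[List[float]], alpha: float = 1.0) -> int:
    """Index of the answer in the bag with the smallest normalized loss."""
    losses = [normalized_nll(a, alpha) for a in bag]
    return losses.index(min(losses))


def min_bag_loss(bag: List[List[float]], alpha: float = 1.0) -> float:
    """Bag loss: only the easiest answer in the bag contributes."""
    return min(normalized_nll(a, alpha) for a in bag)


# One bag with a short, poorly predicted answer and a long, well predicted one.
bag = [[0.1], [0.5, 0.5, 0.5, 0.5]]
without_norm = select_answer(bag, alpha=0.0)  # raw NLL favors the short answer
with_norm = select_answer(bag, alpha=1.0)     # normalization favors the long one
```

Without normalization the one-word answer wins simply because it has fewer terms in its sum, even though every one of its tokens is predicted worse.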
4.3 Answer Weighting
Bag level minimum loss considers one instance from each bag. We also attempt to weight instances in a bag according to their relevance to the question, so that every answer can contribute to the training process, as in Figure 2(b).
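The weight of an instance in a bag is then calculated based on consensus among the answers, as in Eq. (8).
$$\mathcal{L}_{w} = -\sum_{i=1}^{N} \sum_{j=1}^{n_i} w_{ij} \sum_{t=1}^{T_{ij}} \log p(y^{(ij)}_t \mid y^{(ij)}_{<t}, Q_i, M_{KB}) \tag{8}$$
where $j$ is the instance index within bag $B_i$, i.e. $1 \le j \le n_i$, $y^{(ij)}_t$ is the $t$-th word of answer $A_{ij}$, and $T_{ij}$ is its length. $w_{ij}$ is the weight for the $j$-th instance in bag $B_i$ (explained in the following paragraphs) and is normalized in this work.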
We weight the answers by their similarity to other answers in the same bag, assuming that an unusual answer to a question is likely to be an outlier. Specifically, we train a two-class Chinese InferSent [3] model that predicts whether two answers come from the same bag, and encode each answer in a bag with InferSent. We then calculate the cosine similarity among the answers. The weight of an answer is its similarity to its nearest neighbor: denoting the InferSent encoding of an answer $A$ as $\mathrm{enc}(A)$, we set $w_{ij} = \max_{j' \neq j} \cos(\mathrm{enc}(A_{ij}), \mathrm{enc}(A_{ij'}))$ in Eq. (8), where $A_{ij}$ and $A_{ij'}$ are answers in bag $B_i$ and $j' \neq j$.
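The nearest-neighbor weighting can be sketched as follows. The 2-dimensional vectors stand in for sentence encodings and are hypothetical; giving a single-answer bag a weight of 1.0 is also an assumption made here for completeness.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def content_weights(encodings: List[Sequence[float]]) -> List[float]:
    """Weight each answer by its similarity to its nearest neighbor in the bag."""
    weights = []
    for j, enc in enumerate(encodings):
        sims = [cosine(enc, other)
                for k, other in enumerate(encodings) if k != j]
        weights.append(max(sims) if sims else 1.0)  # assumed default for 1-answer bags
    return weights


# Hypothetical encodings: the first answer is an outlier, the rest agree.
encodings = [[1.0, 0.0], [0.0, 1.0], [0.1, 1.0], [0.0, 2.0]]
weights = content_weights(encodings)
```

The outlier receives the smallest weight in the bag, while answers that have a close neighbor receive weights near one.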
Content weighting does not take KB information into account. To remedy this, we weight an answer instance by the importance of the KB entities mentioned in the answer. We first measure the importance of an entity by its frequency in a bag. Consider the example in Figure 2. Denoting “A-man” as $e_1$ and “Mengde” as $e_2$, we have entity counts $c(e_1) = 3$ (as “A-man” appears in A2, A3, and A4) and $c(e_2) = 1$ (as “Mengde” appears in A2). We then score an answer by the sum of the weights of the entities that occur in it, i.e. $w_{ij} = \sum_{e \in A_{ij}} c(e)$ in Eq. (8), favoring answers that mention more important KB entities.
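The entity-count example can be reproduced with a short sketch. The answer strings are hypothetical stand-ins for A1–A4, entity mentions are detected by plain substring matching (an assumption), and the resulting weights are left unnormalized.

```python
from collections import Counter
from typing import List


def kb_weights(answers: List[str], entities: List[str]) -> List[int]:
    """Score each answer by the summed bag frequency of the entities it mentions."""
    # Entity importance: number of answers in the bag that mention the entity.
    counts = Counter(e for a in answers for e in entities if e in a)
    return [sum(counts[e] for e in entities if e in a) for a in answers]


# Hypothetical answer texts for the bag in the running example.
answers = [
    "Check the history book yourself",     # A1: mentions no KB entity
    "His nicknames are A-man and Mengde",  # A2: mentions both entities
    "A-man",                               # A3
    "People called him A-man",             # A4
]
weights = kb_weights(answers, ["A-man", "Mengde"])
```

With counts of 3 for “A-man” and 1 for “Mengde”, the uninformative answer scores 0, the answer covering both values scores 4, and the two remaining answers score 3 each.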
4.4 Curriculum Learning
Questions in real-world datasets can have either one or multiple answers. We schedule training under the curriculum learning principle, as illustrated in Figure 2(c). Specifically, assume that we have both single- and multi-instance bags and train for $K$ iterations, and that the current iteration is $k$. We always use the single-instance bags, but sample from the multi-instance bags with a probability that grows with $k$ in each iteration, so that we warm up training with single-instance bags in early iterations. Note that this differs from [Liu et al.2018]: they schedule single answers based on perceived difficulty, whereas we schedule question bags based on the number of answers within, which naturally indicates bag ambiguity.
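A possible implementation of this schedule is sketched below. The linear ramp in `multi_bag_prob` is an assumption chosen for illustration, since any probability that increases over training gives the same warm-up behavior; the bag representation and function names are likewise hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Bag = Tuple[str, List[str]]  # (question, answers)


def multi_bag_prob(k: int, total: int) -> float:
    """Assumed linear schedule: 0 at the first iteration, 1 at the last."""
    return min(1.0, max(0.0, k / total))


def bags_for_iteration(bags: Sequence[Bag], k: int, total: int,
                       rng: random.Random) -> List[Bag]:
    """Keep every single-instance bag; sample multi-instance bags with prob p."""
    p = multi_bag_prob(k, total)
    chosen = []
    for bag in bags:
        if len(bag[1]) == 1 or rng.random() < p:
            chosen.append(bag)
    return chosen


all_bags = [("q1", ["a"]), ("q2", ["a", "b"]), ("q3", ["a", "b", "c"])]
rng = random.Random(0)
early = bags_for_iteration(all_bags, k=0, total=100, rng=rng)
late = bags_for_iteration(all_bags, k=100, total=100, rng=rng)
```

At the first iteration only the single-instance bag is used, and by the final iteration every bag is included.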
5 Experiments
5.1 Data Collection
We create a new dataset (PQA) in this paper. PQA combines user-generated QA pairs with merchant-created product KBs that have a relatively stable schema. We believe that it reflects many real-world KBQA use cases.
5.1.1 CQA
CQA is collected from an encyclopedic community QA forum. The QA pairs are first collected from the forum, and the KB is constructed automatically. The questions and answers are then grounded to the KB.
5.1.2 PQA: Data collection and processing
PQA is a community product QA dataset that we collect from an e-commerce website. We focus on the mobile phone product domain because its products have a relatively stable KB schema: most products share a number of “core” predicates, such as display_size, cpu_model, and internal_storage. This is typical for product QA applications, but not the case in CQA.
We pre-select a number of popular mobile phone products and collect both the product KB and the community QA pairs regarding the products.
Filtering and preprocessing
Product questions can either be about facts (“How large is the screen of this phone?”) or opinions (“Does the screen look better than that of an iPhone?”). This paper works on factual questions only, so we build a TextCNN [10] classifier to filter the data. We manually annotate 1,000 questions to train this binary classifier for factual question detection.
We preprocess the KB to remove unit words from predicate values (e.g. “5.99 inch” → “5.99”). We then ground the KB to QA pairs by string matching.
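The two preprocessing steps can be sketched as follows. The regular expression and the helper names `strip_unit` and `ground` are illustrative assumptions, as is the storage value in the toy KB; the actual unit list and matching rules may differ.

```python
import re
from typing import Dict, List, Tuple


def strip_unit(value: str) -> str:
    """Keep the leading number of a value such as '5.99 inch'; otherwise keep as is."""
    m = re.match(r"\s*(\d+(?:\.\d+)?)\s*[^\d\s]+\s*$", value)
    return m.group(1) if m else value.strip()


def ground(answer: str, kb: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (predicate, value) pairs whose cleaned value occurs in the answer."""
    return [(p, v) for p, v in kb.items() if strip_unit(v) in answer]


# Hypothetical product KB and answer.
product_kb = {"display_size": "5.99 inch", "internal_storage": "64GB"}
matches = ground("The screen is 5.99 inches.", product_kb)
```

Stripping the unit lets the value match the answer even though the answer writes “inches” rather than “inch”; values that do not start with a number are left unchanged.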
Differences from CQA
PQA is different from CQA in four aspects: 1) smaller size, 2) stable KB schema, 3) the existence of “advertising” answers submitted by the merchants which are often irrelevant to the question, and 4) the KB attached to PQA consists of manually curated product properties, instead of automatically extracted triples. These features are more realistic for the product QA use case. We evaluate our method on both data sets to more comprehensively measure its performance.
5.2 Experimental Settings
Our baseline is a re-implementation of CoreQA [8] based on the released code.
5.3 Multi-Instance Experiments
We first evaluate the proposed methods on the multi-instance portion of the CQA encyclopedic QA dataset released by [8], as our method is designed to work with questions that have multiple answers. The dataset consists of 64K bags for training/validation and 16K bags for testing. Each question bag has 3.2 answers on average. As shown in Table 1, both answer selection (Selection) and weighting (Weight:KB) obtain better accuracy than the baseline (CoreQA). Instance selection leads to a 17.11 absolute point improvement in accuracy, but no improvement in ROUGE. We hypothesize that instance selection limits generation naturalness, because the models are in effect trained on fewer examples. Answer weighting achieves a better balance between adequacy and fluency, improving accuracy by 9.87 absolute points and ROUGE by 1.53 absolute points. We use KB weighting in the following experiments.
As an ablation study, we also measure the performance of answer selection without length normalization (-LengthNrm) and of answer weighting using content instead of KB information (Weight:Con). The results confirm the necessity of performing length normalization and of utilizing KB information in answer weighting.
5.4 Mixed-Instance Experiments
We next experiment on complete real-world datasets that contain both single- and multi-instance bags. In addition to CQA, we also evaluate on a product QA dataset (PQA) that we collect from a large e-commerce website, with 87K question bags for training/validation and 22K bags for testing. Half of the questions have multiple answers, with 2.4 answers on average. Unlike CQA, whose KB is extracted automatically, the PQA KB is composed of real product properties on the website and is closer to the production KBQA setting.
In Table 2, the first line is the result reported by [8]. CoreQA is our re-implementation of [8], Weight is trained with KB-based instance weighting and is identical to the Weight:KB setting in Table 1, and Curriculum trains first on single-instance bags and gradually adds multi-instance bags (cf. § 4.4).
On both datasets, instance weighting significantly improves the accuracy scores and curriculum learning further improves both accuracy and naturalness. On the CQA public dataset, instance weighting with a curriculum schedule leads to 13.98 and 1.58 absolute points improvement on accuracy and ROUGE respectively. The trend is similar for the PQA dataset, showing that the method works for different domains and use cases.
6 Conclusions and Future Work
We trained end-to-end KBQA models with multi-instance learning principles. We showed that selecting and weighting answers to the same question helps reduce noise in the training data and boosts both output adequacy and naturalness.
Our approach is independent of the underlying QA model. In the future, we plan to integrate our approach with more QA models and explore more ways to utilize the information in the bag.
References
- [1] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48.
- [2] Daniel G Bobrow, Ronald M Kaplan, Martin Kay, Donald A Norman, Henry Thompson, and Terry Winograd. 1977. Gus, a frame-driven dialog system. Artificial intelligence, 8(2):155–173.
- [3] Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680.
- [4] Thomas G Dietterich, Richard H Lathrop, and Tomás Lozano-Pérez. 1997. Solving the multiple instance problem with axis-parallel rectangles. Artificial intelligence, 89(1-2):31–71.
- [5] Yao Fu and Yansong Feng. 2018. Natural answer generation with heterogeneous memory. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 185–195.
- [6] Ralph Grishman. 1979. Response generation in question-answering systems. In 17th Annual Meeting of the Association for Computational Linguistics.
- [7] Dilek Hakkani-Tür, Asli Celikyilmaz, Yun-Nung Chen, Jianfeng Gao, Li Deng, and Ye-Yi Wang. 2016. Multi-domain joint semantic frame parsing using bi-directional RNN-LSTM. In Interspeech, pages 715–719.
- [8] Shizhu He, Cao Liu, Kang Liu, and Jun Zhao. 2017. Generating natural answers by incorporating copying and retrieving mechanisms in sequence-to-sequence learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 199–208.
- [9] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Comput., 9(8):1735–1780, November.
- [10] Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751.
- [11] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- [12] Chin-Yew Lin and Franz Josef Och. 2004. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04).
- [13] Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. 2016. Neural relation extraction with selective attention over instances. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2124–2133.
- [14] Yankai Lin, Haozhe Ji, Zhiyuan Liu, and Maosong Sun. 2018. Denoising distantly supervised open-domain question answering. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1736–1745.
- [Liu et al.2018] Cao Liu, Shizhu He, Kang Liu, and Jun Zhao. 2018. Curriculum learning for natural answer generation. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, pages 4223–4229.
- [15] Andrea Madotto, Chien-Sheng Wu, and Pascale Fung. 2018. Mem2Seq: Effectively incorporating knowledge bases into end-to-end task-oriented dialog systems. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1468–1478.
- [16] Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. 2018. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730.
- [17] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.
- [18] Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling relations and their mentions without labeled text. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 148–163.
- [19] Abigail See, Peter Liu, and Christopher Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Association for Computational Linguistics.
- [20] Linfeng Song, Zhiguo Wang, Mo Yu, Yue Zhang, Radu Florian, and Daniel Gildea. 2018. Exploring graph-structured passage representation for multi-hop reading comprehension with graph neural networks. arXiv preprint arXiv:1809.02040.
- [21] Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. 2016. Modeling coverage for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 76–85, Berlin, Germany, August.
- [22] Wei Wang, Ming Yan, and Chen Wu. 2018. Multi-granularity hierarchical attention fusion networks for reading comprehension and question answering. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1705–1714.
- [23] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
- [24] Jun Yin, Xin Jiang, Zhengdong Lu, Lifeng Shang, Hang Li, and Xiaoming Li. 2016. Neural generative question answering. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 2972–2978.
- [25] Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V Le. 2018. QANet: Combining local convolution with global self-attention for reading comprehension. arXiv preprint arXiv:1804.09541.
- [26] John M Zelle and Raymond J Mooney. 1996. Learning to parse database queries using inductive logic programming. In Proceedings of the national conference on artificial intelligence (AAAI), pages 1050–1055.
- [27] Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. 2015. Distant supervision for relation extraction via piecewise convolutional neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1753–1762.
- [28] Luke S Zettlemoyer and Michael Collins. 2005. Learning to map sentences to logical form: structured classification with probabilistic categorial grammars. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, pages 658–666.