Multi-Instance Learning for End-to-End Knowledge Base Question Answering

by Mengxi Wei, et al.

End-to-end training has been a popular approach for knowledge base question answering (KBQA). However, real-world applications often contain answers of varied quality for users' questions, and it is not appropriate to treat all available answers to a user question equally. This paper proposes a novel approach based on multiple instance learning to address the problem of noisy answers by exploring consensus among answers to the same question when training end-to-end KBQA models. In particular, the QA pairs are organized into bags with dynamic instance selection and different options for instance weighting. Curriculum learning is utilized to select instance bags during training. On the public CQA dataset, the new method significantly improves both entity accuracy and the Rouge-L score over a state-of-the-art end-to-end KBQA baseline.




1 Introduction

Knowledge Base Question Answering (KBQA) aims to build systems that automatically respond to natural language questions with natural answers using information in knowledge bases.* It has been an active research topic for a long time [2]. Traditionally, systems are built as a pipeline of components, including NLU, KB retrieval, reasoning, and answer generation. The components can be based on either rules or statistical models [2, 6, 26, 28, 7].

* Work done during Mengxi Wei's internship at Alibaba Group.

Figure 1: Screenshot of a community QA website.

One limitation of the pipeline approach is that it requires much resource to scale up or adapt to different domains. Moving the pipeline to new domains often means preparing new sets of rules and/or annotating new data to train different statistical models, both of which are difficult and time-consuming.

End-to-end KBQA approaches [24, 8] have been attracting interests in recent years, because they can be trained directly with QA pairs collected from the web. These are easier to acquire than e.g. NLU frame annotations for pipeline KBQA systems. User-generated QA data are usually abundant and have facilitated training of several successful KBQA models [24, 8].

However, the quality of user-generated question-answer pairs is not guaranteed. Figure 1 is a screenshot from a community QA website. We observe two problems: 1) irrelevant answers, e.g. the second answer is a vulgar joke that does not contain any information that we can find in a reasonable KB; and 2) inconsistent answers, e.g. in Figure 1, the third answer covers more complete information in the KB than the first. It is desirable to filter out noisy responses and to promote high-quality answers.

In this paper, we attempt to address these challenges by organizing data differently from previous work. Instead of training on QA pairs (as most end-to-end KBQA systems do to the best of our knowledge), we organize answers to the same question into bags and train with the multi-instance learning principle. This formulation allows us to select and weight answers according to consensus among the answers in the same bag.

Figure 2: Proposed multi-instance KBQA approaches.

2 Related Work

A number of recent works [24, 5, 14, 15, Liu et al. 2018] have explored KBQA using neural networks. GenQA [24] is among the first in this strand of research; it retrieves facts from the KB and generates answers from these facts. Our model follows the approach of CoreQA [8], which improves GenQA by allowing multiple attentive reads from both the KB and the question.

A related, but different, question answering approach is based on Machine Reading Comprehension (MRC; [22, 25]). Recent MRC research utilizes relation graphs among entities to integrate evidence in text [20]. Our work differs from MRC in that the knowledge graph in KBQA is manually curated, while MRC extracts the answer from free text.

Multi-instance learning [4] is a variant of supervised learning where inputs are bags of instances. One successful application in NLP is distant supervision of relation extractors, either by learning from only some of the instances [18, 27] or by assigning different weights to the instances under mechanisms such as selective attention [13]. Our task is a generation problem, where classification techniques developed for relation extraction cannot be applied directly.

We use curriculum learning to schedule training instances in our method. Curriculum learning [1] is a paradigm that schedules simple training examples during early epochs of training and gradually adds hard examples to the process. It has found successful applications in multi-task NLP [16] and KBQA [Liu et al. 2018].

3 End-to-end KBQA

The KBQA task takes a question and a KB as input and outputs a response based on the information in the KB.

Our baseline KBQA model follows the CoreQA [8] approach. The model encodes the question into a vector $q$ and a short-term memory $M_Q$ with a bi-directional LSTM, and encodes the KB into a short-term memory $M_{KB}$ by concatenating the embeddings of the subject-predicate-object triples. The model decodes with an LSTM decoder that generates from the output vocabulary and copies/retrieves from $M_Q$ and $M_{KB}$, in similar fashion to the pointer-generator [19].

3.1 Question Encoder

The question encoder encodes a question into a vector encoding $q$ and builds a short-term memory $M_Q$ that covers every word in the question.

The question is a word sequence $(x_1, \dots, x_n)$ of length $n$. For a word $x_t$, we use a bi-directional LSTM encoder to obtain its forward state $\overrightarrow{h}_t$ and its backward state $\overleftarrow{h}_t$. The short-term memory of the sentence is the concatenated forward and backward states of the words, i.e. $M_Q = \{h_1, \dots, h_n\}$, where $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$. The question is encoded by the concatenation of the last states in both directions, i.e. $q = [\overrightarrow{h}_n; \overleftarrow{h}_1]$.
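The memory layout above can be sketched as follows. A toy elementwise recurrence stands in for the real LSTM cell, and the function names are illustrative, not from the paper:

```python
def step(state, emb):
    # Toy recurrence (decaying average), a stand-in for an LSTM cell update.
    return [0.5 * s + 0.5 * e for s, e in zip(state, emb)]

def encode_question(embeddings, dim):
    fwd, state = [], [0.0] * dim
    for e in embeddings:                          # left-to-right pass
        state = step(state, e)
        fwd.append(state)
    bwd, state = [None] * len(embeddings), [0.0] * dim
    for t in reversed(range(len(embeddings))):    # right-to-left pass
        state = step(state, embeddings[t])
        bwd[t] = state
    # Short-term memory M_Q: per-word concatenation h_t = [fwd_t; bwd_t].
    M_Q = [f + b for f, b in zip(fwd, bwd)]
    # Question encoding q: last states of both directions concatenated.
    q = fwd[-1] + bwd[0]
    return M_Q, q
```

The point of the sketch is only the bookkeeping: the memory keeps one concatenated state per word, while the question vector concatenates the final state of each direction.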

3.2 KB Encoder

The KB encoder encodes a KB (a set of relation triples) into a short-term memory $M_{KB}$.

The KB consists of relation triples of the form <s, p, o>, in which s (subject) is the entity, p (predicate) is the name of the relation, and o (object) is the value of the relation. For example, <Cao Cao, nickname, A-man> is a fact about the entity Cao Cao, stating that his nickname is A-man.

Denoting the embeddings of s, p, and o as $e_s$, $e_p$, and $e_o$ respectively, we represent a fact with the concatenation of these three embeddings: $f = [e_s; e_p; e_o]$. We take the fact representations as the short-term memory of the KB, i.e. $M_{KB} = \{f_1, \dots, f_m\}$, where $m$ is the number of facts.
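As a minimal sketch, with a toy embedding table standing in for learned embeddings:

```python
def encode_kb(triples, embed):
    # M_KB holds one concatenated vector f = [e_s; e_p; e_o] per fact.
    return [embed[s] + embed[p] + embed[o] for s, p, o in triples]

embed = {"Cao Cao": [0.1, 0.2], "nickname": [0.3, 0.4], "A-man": [0.5, 0.6]}
M_KB = encode_kb([("Cao Cao", "nickname", "A-man")], embed)
# One fact, dimension 3 * 2 = 6.
```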

3.3 Decoder

The decoder is an LSTM [9] network that generates answers from the question encoding $q$ while attending to the short-term question memory $M_Q$ and the short-term KB memory $M_{KB}$. The output at time $t$ is a mixture of three modes: prediction (pr), copy (cp), and retrieve (re), as in Eq. (1).

$p(y_t \mid s_t) = p(pr \mid s_t)\, p_{pr}(y_t \mid s_t) + p(cp \mid s_t)\, p_{cp}(y_t \mid s_t) + p(re \mid s_t)\, p_{re}(y_t \mid s_t) \quad (1)$

where $p_{pr}$, $p_{cp}$, and $p_{re}$ are the prediction mode, copy mode, and retrieval mode output distributions respectively, and $p(\cdot \mid s_t)$ is the mode selector, implemented as a 2-layer NN with softmax activation.

The state $s_t$ of the RNN is updated with the previous state $s_{t-1}$, the previously generated word $y_{t-1}$, and the attention context based on attentive reads of the question and the KB.
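The three-mode mixture of Eq. (1) can be sketched as follows. The per-mode distributions are toy dicts standing in for the decoder's outputs, and all names are illustrative:

```python
def mix_modes(mode_probs, p_pr, p_cp, p_re):
    # Combine the three mode distributions, gated by the selector's
    # probabilities, into one distribution over the output space.
    out = {}
    for dist, weight in ((p_pr, mode_probs["pr"]),
                         (p_cp, mode_probs["cp"]),
                         (p_re, mode_probs["re"])):
        for word, p in dist.items():
            out[word] = out.get(word, 0.0) + weight * p
    return out

p = mix_modes({"pr": 0.5, "cp": 0.3, "re": 0.2},
              p_pr={"a": 1.0},       # generate from the vocabulary
              p_cp={"cao": 1.0},     # copy from the question
              p_re={"A-man": 1.0})   # retrieve from the KB
```

Because the selector's probabilities sum to one and each mode distribution sums to one, the mixture is again a valid distribution.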

3.3.1 Prediction mode

The prediction mode generates new words from the vocabulary, considering attentive reads from the question and the KB memory, as in Eq. (2).

$p_{pr}(y_t \mid s_t) = \mathrm{softmax}\big(g_{pr}([s_t; c_t^{Q}; c_t^{KB}])\big) \quad (2)$

where $s_t$ is the current state of the decoder, $c_t^{Q}$ is the attentive read from $M_Q$, and $c_t^{KB}$ is the attentive read from $M_{KB}$.

3.3.2 Copy mode

The copy mode measures the probability of copying word $x_j$ from the question, as in Eq. (3).

$p_{cp}(y_t = x_j \mid s_t) = \mathrm{softmax}_j\big(g_{cp}([s_t; h_j; c_t^{Q}; a_t^{Q}])\big) \quad (3)$

where $c_t^{Q}$ is the attentive read of the question memory $M_Q$ and $a_t^{Q}$ is the accumulated attention history on the question [21].

3.3.3 Retrieval mode

The retrieval mode measures the probability of retrieving the predicate value $o_j$ from the KB, as in Eq. (4).

$p_{re}(y_t = o_j \mid s_t) = \mathrm{softmax}_j\big(g_{re}([s_t; f_j; c_t^{KB}; a_t^{KB}])\big) \quad (4)$

where $c_t^{KB}$ is the attentive read of $M_{KB}$ and $a_t^{KB}$ is the accumulated attention history on the KB triples.

In the three modes, $g_{pr}$, $g_{cp}$, and $g_{re}$ are 2-layer MLPs with softmax activation.

CoreQA is trained with the negative log likelihood loss. Given questions $\{Q_i\}_{i=1}^{N}$ and answers $\{Y_i\}_{i=1}^{N}$, the negative log likelihood loss sums over all answers, as in Eq. (5).

$\mathcal{L} = -\sum_{i=1}^{N} \sum_{t=1}^{T_i} \log p\big(y_t^{(i)} \mid y_{<t}^{(i)}, Q_i, KB\big) \quad (5)$

where $y_t^{(i)}$ is the $t$-th word in the $i$-th answer and $T_i$ is the length of the gold answer. The loss inherently assumes that all answers in the dataset are of the same quality, which is often not the case, as shown by the example in Figure 1.
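The uniform treatment of answers in Eq. (5) can be sketched as follows, with `token_probs` standing in for the decoder's per-token probabilities (names are illustrative):

```python
import math

def answer_nll(token_probs):
    # Negative log likelihood of one answer under the decoder.
    return -sum(math.log(p) for p in token_probs)

def corpus_loss(all_answer_probs):
    # Every answer contributes equally, regardless of its quality --
    # the assumption the multi-instance reformulation later relaxes.
    return sum(answer_nll(probs) for probs in all_answer_probs)
```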

4 Multi-Instance KBQA

4.1 Question Bags

We start to depart from prior end-to-end KBQA efforts as we organize question-answer pairs into question bags. A bag consists of a question and every answer to that question in the dataset. We perform instance selection or weighting on the bag level, following the principle of multi-instance learning.

Consider the question bag in Figure 2. There are four user-generated answers, A1–A4, towards the question “What is the nickname of Cao Cao”? A3 and A4 try to answer the question directly with relevant knowledge. A2 covers two possible values (A-man, Mengde) of the KB predicate. A1 is an uninformative answer. By organizing Q and A1-A4 into a bag, we select or weight the answer instances according to their relevance to the question, so that the model learns from A2-A4 but ideally not A1.

We define a QA bag to be a tuple $(Q, \mathcal{A})$, where $\mathcal{A} = \{A_1, \dots, A_k\}$ is the set of answers corresponding to the question $Q$, and propose new loss functions to select or weight the answers within a bag.

4.2 Answer Selection

Similar to distantly supervised relation extractors, we first make the assumption that at least one answer in the question bag is reasonable. Accordingly, instead of summing the loss over all answers, we only train on the one answer per bag that is the easiest to learn. As shown in Figure 2(a), only the loss on one of the answers is back-propagated, so uninformative answers like "Check the history book yourself" will not affect training, as they do not utilize KB information and are harder to generate.

Given QA bags $\{(Q_i, \mathcal{A}_i)\}_{i=1}^{N}$, we define the minimum bag loss in Eq. (6).

$\mathcal{L}_{\min} = \sum_{i=1}^{N} \min_{A \in \mathcal{A}_i} \ell(Q_i, A) \quad (6)$

For each bag, we only calculate the loss on the one answer that is closest to the network output, as in Eq. (7).

$\ell(Q_i, A) = \frac{-\sum_{t=1}^{|A|} \log p(y_t \mid y_{<t}, Q_i, KB)}{lp(A)} \quad (7)$

Note that we add a length normalization term $lp(A)$ so that we do not unjustly penalize long answers. Inspired by [23], we set $lp(A) = \frac{(5 + |A|)^{\alpha}}{(5 + 1)^{\alpha}}$ (the value of $\alpha$ is tuned in our experiments).
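A minimal sketch of the length-normalized minimum bag loss, with `token_probs` standing in for decoder probabilities; the value of `alpha` here is an arbitrary illustrative choice, not the paper's tuned setting:

```python
import math

def length_penalty(length, alpha):
    # Length normalization in the style of [23].
    return (5.0 + length) ** alpha / (5.0 + 1.0) ** alpha

def answer_loss(token_probs, alpha):
    nll = -sum(math.log(p) for p in token_probs)
    return nll / length_penalty(len(token_probs), alpha)

def min_bag_loss(bags, alpha=0.6):
    # Per bag, only the easiest-to-learn answer's loss is kept.
    return sum(min(answer_loss(a, alpha) for a in bag) for bag in bags)
```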

4.3 Answer Weighting

The bag-level minimum loss considers only one instance from each bag. We also attempt to weight the instances in a bag according to their relevance to the question, so that every answer can contribute to the training process, as in Figure 2(b).

The weight of an instance in a bag is then calculated based on consensus among the answers, as in Eq. (8).

$\mathcal{L}_{w} = -\sum_{i=1}^{N} \sum_{j=1}^{k_i} \tilde{w}_{ij} \sum_{t} \log p\big(y_t^{(ij)} \mid y_{<t}^{(ij)}, Q_i, KB\big) \quad (8)$

where $j$ is the instance index within bag $i$, i.e. $1 \le j \le k_i$; $w_{ij}$ is the weight for the $j$-th instance in bag $i$ (explained in the following paragraphs); and $\tilde{w}_{ij}$ is normalized by $\tilde{w}_{ij} = w_{ij} / \sum_{j'=1}^{k_i} w_{ij'}$ in this work.
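The normalization and weighting in Eq. (8) can be sketched for a single bag as follows (names illustrative, `token_probs` standing in for decoder probabilities):

```python
import math

def weighted_bag_loss(answers, weights):
    # Normalize the raw instance weights within the bag.
    total = sum(weights)
    norm = [w / total for w in weights]
    # Each answer's NLL is scaled by its normalized weight.
    loss = 0.0
    for token_probs, w in zip(answers, norm):
        loss += w * -sum(math.log(p) for p in token_probs)
    return loss
```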

Content weighting.

We weight the answers by their similarity to the other answers in the same bag, assuming that an unusual answer to a question is likely to be an outlier. Specifically, we train a two-class Chinese InferSent [3] model that predicts whether two answers come from the same bag, and encode each answer in a bag with InferSent. We calculate cosine similarity among the answers; the weight of an answer is its similarity to its nearest neighbor. Specifically, denoting the InferSent encoding of an answer $A_{ij}$ as $v_{ij}$, we set $w_{ij} = \max_{j' \ne j} \cos(v_{ij}, v_{ij'})$ in Eq. (8), where $A_{ij}$ and $A_{ij'}$ are answers in bag $i$ and $j' \ne j$.
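A minimal sketch of the nearest-neighbor weighting, with plain vectors standing in for the InferSent encodings:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def content_weights(encodings):
    weights = []
    for j, v in enumerate(encodings):
        # Similarity to the nearest *other* answer; outliers score low.
        weights.append(max(cosine(v, u)
                           for k, u in enumerate(encodings) if k != j))
    return weights
```

Two mutually similar answers reinforce each other, while an answer unlike anything else in the bag receives a low weight.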

KB weighting.

Content weighting does not take KB information into account. To remedy this, we weight an answer instance by the importance of the KB entities mentioned in the answer. We first measure the importance of an entity by its frequency in a bag. Consider the example in Figure 2. Denoting "A-man" as $e_1$ and "Mengde" as $e_2$, we have entity counts $c(e_1) = 3$ (as "A-man" appears in A2, A3, and A4) and $c(e_2) = 1$ (as "Mengde" appears in A2). We then score an answer by the sum of the weights of the entities that occur in it, i.e. $w_{ij} = \sum_{e \in A_{ij}} c(e)$ in Eq. (8), favoring answers that mention more important KB entities.
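The entity-count weighting can be sketched on the running A-man/Mengde example (the answer strings below are paraphrases for illustration):

```python
from collections import Counter

def kb_weights(bag_answers, kb_entities):
    # Bag-level frequency of each KB entity across the answers.
    counts = Counter(e for ans in bag_answers
                     for e in kb_entities if e in ans)
    # An answer's weight is the summed count of the entities it mentions.
    return [sum(counts[e] for e in kb_entities if e in ans)
            for ans in bag_answers]

bag = ["Check the history book yourself",   # A1: mentions no KB entity
       "A-man, also known as Mengde",       # A2
       "His nickname is A-man",             # A3
       "A-man"]                             # A4
w = kb_weights(bag, ["A-man", "Mengde"])
# c(A-man) = 3 and c(Mengde) = 1, so w == [0, 4, 3, 3]
```

The uninformative A1 gets weight zero, while A2, which covers both predicate values, gets the highest weight.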

4.4 Curriculum Learning

Questions in real-world datasets can have either one or multiple answers. We schedule training under the curriculum learning principle, as illustrated in Figure 2(c). Specifically, assuming that we have both single- and multi-instance bags, train for $T$ iterations, and the current iteration is $t$, we always use the single-instance bags but sample from the multi-instance bags with probability $t/T$ in each iteration, so that we warm up training with single-instance bags in early iterations. Note that this is different from [Liu et al. 2018], as they schedule single answers based on perceived difficulty, while we schedule question bags based on the number of answers within, which naturally indicates bag ambiguity.
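A sketch of the schedule, under the assumption that the multi-instance sampling probability grows linearly over training (names are illustrative):

```python
import random

def iteration_bags(single_bags, multi_bags, t, T, rng=random):
    # Single-instance bags are always used.
    chosen = list(single_bags)
    # Multi-instance bags are admitted with a probability that grows
    # with the training progress t / T (assumed linear ramp).
    p = t / T
    chosen += [b for b in multi_bags if rng.random() < p]
    return chosen
```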

5 Experiments

5.1 Data Collection

We experiment with two datasets in this paper: CQA and PQA. CQA [8] is an open domain community QA dataset on encyclopedic knowledge that is used by a number of KBQA systems [8, Liu et al.2018].

We create a new dataset (PQA) in this paper. PQA is a combination of user-generated QA pairs and merchant-created product KBs with relatively stable schemas. We believe that it reflects many real-world KBQA use cases.

5.1.1 CQA

CQA is collected from an encyclopedic community QA forum [8]. The QA pairs are first collected from the forum and the KB is constructed automatically. The questions and answers are then grounded to the KB.

5.1.2 PQA: Data collection and processing


PQA is a community product QA dataset we collect from an e-commerce website. We focus on the mobile phone product domain, because products have relatively stable KB schema: most products share a number of “core” predicates, such as display_size, cpu_model, and internal_storage. This is typical for product QA applications, but not the case in CQA.

We pre-select a number of popular mobile phone products and collect both the product KB and the community QA pairs regarding the products.

Filtering and preprocessing

Product questions can be either about facts ("How large is the screen of this phone?") or opinions ("Does the screen look better than that of an iPhone?"). This paper works on factual questions only, so we build a TextCNN [10] classifier to filter the data. We manually annotate 1,000 questions to train a binary classifier that obtains F1 on factual question detection.

We preprocess the KB to remove unit words (e.g. "5.99 inch" → "5.99") from predicate values. We then ground the KB to the QA pairs by string matching.
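The unit-stripping step might look as follows; the unit list is illustrative, not the one actually used:

```python
import re

# Hypothetical unit vocabulary for the mobile phone domain.
UNITS = ("inch", "GB", "mAh")

def strip_units(value):
    # Drop a trailing unit word (and surrounding spaces) from a value.
    return re.sub(r"\s*(?:%s)\s*$" % "|".join(UNITS), "", value)

strip_units("5.99 inch")  # -> "5.99"
```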

Differences from CQA

PQA is different from CQA in four aspects: 1) smaller size; 2) stable KB schema; 3) the existence of "advertising" answers submitted by the merchants, which are often irrelevant to the question; and 4) the KB attached to PQA consists of manually curated product properties instead of automatically extracted triples. These features are more realistic for the product QA use case. We evaluate our method on both datasets to measure its performance more comprehensively.

5.2 Experimental Settings

Our baseline is a re-implementation of CoreQA [8] based on the released code.

We follow the parameter settings of [8] for the most part: we use an embedding size of 200, a KB state size of 200, and an RNN state size of 600, and optimize with Adam [11]. Embeddings are initialized randomly.

Following [Liu et al.2018], we report Accuracy, BLEU(-2) [17], and Rouge(-L) [12] to evaluate the adequacy and fluency of our outputs.

Model          Acc.    BLEU    Rouge
CoreQA         50.58   23.78   37.35
Selection      67.69   28.35   35.86
 -LengthNrm    52.30   23.58   33.16
Weight:KB      60.45   32.49   38.88
Weight:Con     56.75   28.92   37.05
Table 1: KBQA results: multi-instance bags on CQA.

5.3 Multi-Instance Experiments

We first evaluate the proposed methods on the multi-instance portion of the CQA encyclopedic QA dataset released by [8], as our method is designed to work with questions with multiple answers. The dataset consists of 64K bags for training/validation and 16K bags for testing. Each question bag has 3.2 answers on average. As shown in Table 1, both answer selection (Selection) and weighting (Weight:KB) obtain better accuracy than the baseline (CoreQA). Instance selection leads to a 17.11 absolute point improvement in accuracy, but no improvement on Rouge. We hypothesize that instance selection limits generation naturalness, because the models are in effect trained on fewer examples. Answer weighting achieves a better balance between adequacy and fluency, improving accuracy by 9.87 absolute points and Rouge by 1.53 absolute points. We use KB weighting in the following experiments.

We also measure the performance of answer selection without length normalization (-LengthNrm) and of answer weighting using content instead of KB information (Weight:Con) as an ablation study. The results confirm the necessity of performing length normalization and of utilizing KB information in answer weighting.



CQA
Model          Acc.    BLEU    Rouge
[8]            56.6    -       -
CoreQA         57.58   27.76   37.67
Weight         70.18   28.88   36.78
Curriculum     71.56   27.21   39.25

PQA
Model          Acc.    BLEU    Rouge
CoreQA         52.50   18.63   32.55
Weight         59.92   22.02   32.73
Curriculum     64.27   22.85   33.47
Table 2: KBQA results: complete data on CQA/PQA.

5.4 Mixed-Instance Experiments

We next experiment on complete real-world datasets that contain both single- and multi-instance bags. In addition to CQA, we also evaluate on a product QA dataset (PQA) that we collect from a large e-commerce website, with 87K question bags for training/validation and 22K bags for testing. Half of the questions have multiple answers, with 2.4 answers on average. Unlike CQA, which automatically extracts the KB, the PQA KB is composed of real product properties on the website and is closer to the production KBQA setting.

In Table 2, the first line is the result reported by [8]. CoreQA is our re-implementation of [8], Weight is trained with KB-based instance weighting and is identical to the Weight:KB setting in Table 1, and Curriculum trains first on single instance bags and gradually adds multi-instance bags (cf. § 4.4).

On both datasets, instance weighting significantly improves the accuracy scores and curriculum learning further improves both accuracy and naturalness. On the CQA public dataset, instance weighting with a curriculum schedule leads to 13.98 and 1.58 absolute points improvement on accuracy and ROUGE respectively. The trend is similar for the PQA dataset, showing that the method works for different domains and use cases.

6 Conclusions and Future Work

We trained end-to-end KBQA models with multi-instance learning principles. We showed that selecting and weighting answers to the same question helps reduce noise in the training data and boosts both output adequacy and naturalness.

Our approach is independent of the underlying QA model. In the future, we plan to integrate our approach with more QA models and to explore more ways to utilize the information in the bag.


  • [1] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48.
  • [2] Daniel G Bobrow, Ronald M Kaplan, Martin Kay, Donald A Norman, Henry Thompson, and Terry Winograd. 1977. Gus, a frame-driven dialog system. Artificial intelligence, 8(2):155–173.
  • [3] Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680.
  • [4] Thomas G Dietterich, Richard H Lathrop, and Tomás Lozano-Pérez. 1997. Solving the multiple instance problem with axis-parallel rectangles. Artificial intelligence, 89(1-2):31–71.
  • [5] Yao Fu and Yansong Feng. 2018. Natural answer generation with heterogeneous memory. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 185–195.
  • [6] Ralph Grishman. 1979. Response generation in question - answering systems. In 17th Annual Meeting of the Association for Computational Linguistics.
  • [7] Dilek Hakkani-Tür, Asli Celikyilmaz, Yun-Nung Chen, Jianfeng Gao, Li Deng, and Ye-Yi Wang. 2016. Multi-domain joint semantic frame parsing using bi-directional RNN-LSTM. In Interspeech, pages 715–719.
  • [8] Shizhu He, Cao Liu, Kang Liu, and Jun Zhao. 2017. Generating natural answers by incorporating copying and retrieving mechanisms in sequence-to-sequence learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 199–208.
  • [9] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Comput., 9(8):1735–1780, November.
  • [10] Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751.
  • [11] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [12] Chin-Yew Lin and Franz Josef Och. 2004. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04).
  • [13] Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. 2016. Neural relation extraction with selective attention over instances. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 2124–2133.
  • [14] Yankai Lin, Haozhe Ji, Zhiyuan Liu, and Maosong Sun. 2018. Denoising distantly supervised open-domain question answering. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1736–1745.
  • [Liu et al.2018] Cao Liu, Shizhu He, Kang Liu, and Jun Zhao. 2018. Curriculum learning for natural answer generation. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, pages 4223–4229.
  • [15] Andrea Madotto, Chien-Sheng Wu, and Pascale Fung. 2018. Mem2Seq: Effectively incorporating knowledge bases into end-to-end task-oriented dialog systems. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1468–1478.
  • [16] Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. 2018. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730.
  • [17] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.
  • [18] Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling relations and their mentions without labeled text. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 148–163.
  • [19] Abigail See, Peter Liu, and Christopher Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Association for Computational Linguistics.
  • [20] Linfeng Song, Zhiguo Wang, Mo Yu, Yue Zhang, Radu Florian, and Daniel Gildea. 2018. Exploring graph-structured passage representation for multi-hop reading comprehension with graph neural networks. arXiv preprint arXiv:1809.02040.
  • [21] Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. 2016. Modeling coverage for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 76–85, Berlin, Germany, August.
  • [22] Wei Wang, Ming Yan, and Chen Wu. 2018. Multi-granularity hierarchical attention fusion networks for reading comprehension and question answering. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1705–1714.
  • [23] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
  • [24] Jun Yin, Xin Jiang, Zhengdong Lu, Lifeng Shang, Hang Li, and Xiaoming Li. 2016. Neural generative question answering. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 2972–2978.
  • [25] Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V Le. 2018. QANet: Combining local convolution with global self-attention for reading comprehension. arXiv preprint arXiv:1804.09541.
  • [26] John M Zelle and Raymond J Mooney. 1996. Learning to parse database queries using inductive logic programming. In Proceedings of the National Conference on Artificial Intelligence (AAAI), pages 1050–1055.
  • [27] Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. 2015. Distant supervision for relation extraction via piecewise convolutional neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1753–1762.
  • [28] Luke S Zettlemoyer and Michael Collins. 2005. Learning to map sentences to logical form: structured classification with probabilistic categorial grammars. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, pages 658–666.