Lifelong and Interactive Learning of Factual Knowledge in Dialogues

07/31/2019 ∙ by Sahisnu Mazumder, et al. ∙ University of Illinois at Chicago 1

Dialogue systems are increasingly using knowledge bases (KBs) storing real-world facts to help generate quality responses. However, as the KBs are inherently incomplete and remain fixed during conversation, it limits dialogue systems' ability to answer questions and to handle questions involving entities or relations that are not in the KB. In this paper, we make an attempt to propose an engine for Continuous and Interactive Learning of Knowledge (CILK) for dialogue systems to give them the ability to continuously and interactively learn and infer new knowledge during conversations. With more knowledge accumulated over time, they will be able to learn better and answer more questions. Our empirical evaluation shows that CILK is promising.




1 Introduction

Dialogue systems, including question-answering (QA) systems, are now commonly used in practice. Early such systems were built mainly based on rules and information retrieval techniques Banchs and Li (2012); Ameixa et al. (2014); Lowe et al. (2015); Serban et al. (2015). Recent deep learning models Vinyals and Le (2015); Xing et al. (2017); Li et al. (2017c) learn from large corpora. However, since they do not use explicit knowledge bases (KBs), they often suffer from generic and dull responses Xing et al. (2017); Young et al. (2018). KBs have been used to deal with the problem Ghazvininejad et al. (2018); Le et al. (2016); Young et al. (2018); Long et al. (2017); Zhou et al. (2018). Many task-oriented dialogue systems Eric and Manning (2017); Madotto et al. (2018) also use KBs to support information-seeking conversations.

One major shortcoming of existing systems that use KBs is that the KBs are fixed once the dialogue systems are deployed. However, it is almost impossible for the initial KBs to contain all possible knowledge that the user may ask, not to mention that new knowledge appears constantly. It is thus highly desirable for dialogue systems to learn by themselves while in use, i.e., learning on the job in lifelong learning Chen and Liu (2018). Clearly, the system can (1) extract more knowledge from the Web or other sources, and (2) learn directly from users during conversations. This paper focuses on the latter and makes an attempt to propose an engine for Continuous and Interactive Learning of Knowledge (CILK) to give the dialogue system the ability to acquire/learn new knowledge from the user during conversation. Specifically, it focuses on learning new knowledge interactively from the user when the system is unable to answer a user’s WH-question. The acquired new knowledge makes the system better able to answer future user questions, and no longer be limited by the fixed knowledge provided by the human developers.

The type of knowledge that the CILK engine focuses on is facts that can be expressed as triples $(h, r, t)$, meaning that the head entity $h$ and the tail entity $t$ are linked by the relation $r$. An example of a fact is (Boston, LocatedInCountry, USA), meaning that Boston is located in the USA. This paper only develops the core engine. It does not study other dialogue functions like response generation, semantic parsing, fact extraction from user utterances, entity linking, etc., which have been studied extensively before and are assumed to be available for use. Thus, this paper works only with structured queries $(h, r, ?)$, e.g., (Boston, LocatedInCountry, ?) meaning “In what country is Boston located?”, or $(?, r, t)$, e.g., (?, PresidentOf, USA) meaning “Who is the President of the USA?” It assumes that a semantic parser is available that can convert natural language queries from users into query triples. Similarly, it assumes an information extraction tool like OpenIE Angeli et al. (2015) is employed to extract facts as triples $(h, r, t)$ from the user’s utterances during conversation. Building a full-fledged dialogue system that can also learn during conversation is a huge undertaking and is out of the scope of this paper. We thus only investigate the core knowledge learning engine here. We also assume that the user has good intentions (i.e., the user answers questions with 100% conformity about the veracity of his/her facts) (footnote 1), but is not omniscient (as opposed to the teacher-student learning setup).

Footnote 1: We envision that the proposed engine is incorporated into a dialogue system in a multi-user environment. The system can perform cross-verification with other users by asking them whether the knowledge (facts) from a user is correct.

Problem Definition: Given a user query / question $(h, r, ?)$ [or $(?, r, t)$], where $r$ and $h$ (or $t$) may not be in the KB (i.e., unknown), our goal is two-fold: (i) answering the user query, or rejecting it (leaving it unanswered) when the correct answer is believed not to exist in the KB, and (ii) learning / acquiring some knowledge (facts) from the user to help the answering task. We only focus on the setting where the query cannot be answered directly with the current KB and needs inference over existing facts, since for a structured query it is trivial to retrieve the answer if the answer triple is already in the KB. We further distinguish two types of queries: (1) closed-world queries, where $h$ (or $t$) and $r$ are known to the KB, and (2) open-world queries, where either one or both of $h$ (or $t$) and $r$ are unknown to the KB.

It is easy to see that the problem is essentially a lifelong learning problem Chen and Liu (2018), where each query to be processed is a task and the knowledge gained is retained in the KB. To process a new query/task, the knowledge learned and accumulated from the past queries can be leveraged.

For each new open-world query, the proposed approach works in two steps:

Step 1 - Interact with the user: It converts open-world queries (2) to closed-world queries (1) by asking the user questions related to $h$ (or $t$) and $r$ to make them known to the KB (added to the KB). The reason for the conversion will be clear below. The user's answers, called supporting facts (SFs), are the new knowledge to be added to the KB. This step is also called interactive knowledge learning. Note, closed-world queries (1) do not need this step.

Step 2 - Infer the query answer: It solves closed-world queries (1) by inferring the query answer. The main idea is to use each entity $e$ in the KB to form a candidate triple $(h, r, e)$ (or $(e, r, t)$), which is then scored. The entity with the highest score is predicted as the answer of the query.

Scoring each candidate is modeled as a knowledge base completion (KBC) problem Lao and Cohen (2010); Bordes et al. (2011). KBC aims to infer new facts (knowledge) from existing facts in a KB and is defined as a link prediction problem: Given a query triple $(h, r, ?)$ [or $(?, r, t)$], it predicts a tail entity $t$ [head entity $h$] which makes the query triple true and thus should be added to the KB. KBC makes the closed-world assumption that $h$, $r$ and $t$ are all known to exist in the KB Lao et al. (2011); Bordes et al. (2011, 2013); Nickel et al. (2015). This is not suitable for knowledge learning in conversations because in a conversation, the user can ask or say anything, which may contain entities and relations that are not in the KB. CILK removes the closed-world assumption and allows $h$ (or $t$) and/or $r$ to be unknown (not in the KB). Step 1 above basically asks the user questions to make $h$ (or $t$) and/or $r$ known to the KB. Then, an existing KBC model can be applied as a query inference model to retrieve an answer entity from the KB.

USER: (Boston, LocatedInCountry, ?) “In what Country is Boston located?” [Query]
CILK: I do not know what “located in Country” means. Can you provide me an example? [Ask for Clue]
USER: (London, LocatedInCountry, UK) “London is located in UK.” [SF1]
CILK: Got it. Can you tell me a fact about “Boston”? [Ask for Entity Fact]
USER: (Harvard University, UniversityLocatedIn, Boston) “Harvard University is located in Boston.” [SF2]
CILK: (Boston, LocatedInCountry, USA) “Boston is located in USA.” [Answer]
Figure 1: An example of interactive learning and inference. Note that CILK only works with triples. Each triple above is assumed to be extracted from the sentence after it. Ask for Clue and Ask for Entity Fact are interaction query types, discussed in Sec. 3. SF denotes supporting fact.

Figure 1 shows an example. CILK acquires supporting facts SF1 and SF2 to accomplish the goal of knowledge learning and utilizes these pieces of knowledge along with existing KB facts to answer the user query (i.e., to infer over the query relation “LocatedInCountry”). CILK aims to achieve these two sub-goals. The new knowledge (SFs) is added to the KB for future use (footnote 2). We evaluate CILK using two real-world KBs, Nell and WordNet, and obtain promising results.

Footnote 2: The inferred query answer is not added to the KB as it may be incorrect. But it can be added in a multi-user environment through cross-verification (see footnote 1 and Sec. 4).

2 Related Work

To the best of our knowledge, no existing system can perform the proposed task. We reported preliminary research in Mazumder et al. (2018).

CILK is related to interactive language learning Wang et al. (2016, 2017), which is mainly about language grounding, not about knowledge learning. Li et al. (2017a, b) and Zhang et al. (2017) train chatbots using human teachers who can ask and answer the chatbot questions. Ono et al. (2017), Otsuka et al. (2013), Ono et al. (2016) and Komatani et al. (2016) allow a system to ask the user whether its prediction of the category of a term is correct or not. Compared to these works, CILK performs interactive knowledge learning and inference (over existing and acquired knowledge) while conversing with users after the dialogue system has been deployed (i.e., learning on the job Chen and Liu (2018)) without any teacher supervision or help.

NELL Mitchell et al. (2015) updates its KB using facts extracted from the Web (complementary to our work). We do not do Web fact extraction.

KB completion (KBC) has been studied in recent years Lao et al. (2011); Bordes et al. (2011, 2015); Mazumder and Liu (2017), but these methods mainly handle facts with known entities and relations. Neelakantan et al. (2015) work on fixed unknown relations with known embeddings, but do not allow unknown entities. Xiong et al. (2018) also deal with queries involving unknown relations, but with known entities in the KB. Shi and Weninger (2018) handle unknown entities by exploiting an external text corpus. None of the KBC methods perform conversational knowledge learning like CILK.

3 Proposed Technique

As discussed in Sec. 1, given a query $(e, r, ?)$ [or $(?, r, e)$] (footnote 3) from the user, CILK interacts with the user to acquire supporting facts to answer the query. Such an interactive knowledge learning and inference task is realized by the cooperation of three primary components of CILK: the Knowledge base (KB) $\mathcal{K}$, the Interaction Module $\mathcal{I}$ and the Inference Model $\mathcal{M}$. The interaction module decides whether to ask or not and formulates questions to ask the user for supporting facts. The acquired supporting facts are added to the KB and used in training the Inference Model, which then performs inference over the query (i.e., answers the query).

Footnote 3: Either $e$ or $r$ or both may not exist in the KB.

In the following subsections, we formalize the interactive knowledge learning problem (Sec. 3.1), describe the Inference Model (Sec. 3.2) and discuss how CILK interacts and processes a query from the user (Sec. 3.3).

3.1 Problem Formulation

CILK’s KB $\mathcal{K}$ is a triple store {$(h, r, t)$} $\subseteq \mathcal{E} \times \mathcal{R} \times \mathcal{E}$, where $\mathcal{E}$ is the entity set and $\mathcal{R}$ is the relation set. Let $q$ be a query of the form $(e, r, ?)$ [or $(?, r, e)$] issued to CILK, where $e$ is termed the query entity and $r$ the query relation. If $e \notin \mathcal{E}$ and/or $r \notin \mathcal{R}$ (we also say $e, r \notin \mathcal{K}$), we call $q$ an open-world query. Otherwise, $q$ is referred to as a closed-world query, i.e., both $e$ and $r$ exist in $\mathcal{K}$. Given $\mathcal{K}$ and a query $q$, the query inference task is defined as follows: If $q$ is of the form $(e, r, ?)$, the goal is to predict a tail entity $t \in \mathcal{E}$ such that $(e, r, t)$ holds. We call such $q$ a tail query. If $q$ is of the form $(?, r, e)$, the goal is to predict a head entity $h \in \mathcal{E}$ such that $(h, r, e)$ holds. We call such $q$ a head query. In the open-world setting, it is quite possible that the answer entity $t$ (for a tail query) or $h$ (for a head query) does not exist in the KB (in $\mathcal{E}$). In such cases, the inference model $\mathcal{M}$ cannot find the true answer. We thus further extend the goal of the query inference task to either finding the answer entity $t$ ($h$) for $q$ or rejecting $q$ to indicate that the answer does not exist in $\mathcal{E}$.
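To make the open-world / closed-world distinction concrete, the following minimal Python sketch classifies a structured query against a small triple store. The `Query` type, the use of `None` for the “?” slot and the helper names are illustrative assumptions, not part of CILK's actual implementation.

```python
from typing import NamedTuple, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


class Query(NamedTuple):
    """Structured query; None stands for the unknown "?" slot (illustrative)."""
    head: Optional[str]
    relation: str
    tail: Optional[str]


def query_entity(q: Query) -> Optional[str]:
    """Return the entity that is given in the query (head or tail)."""
    return q.head if q.head is not None else q.tail


def is_tail_query(q: Query) -> bool:
    """A tail query (e, r, ?) asks for the missing tail entity."""
    return q.tail is None


def is_open_world(q: Query, kb: Set[Triple]) -> bool:
    """True if the query entity and/or the query relation is absent from the KB."""
    entities = {h for h, _, _ in kb} | {t for _, _, t in kb}
    relations = {r for _, r, _ in kb}
    return query_entity(q) not in entities or q.relation not in relations


kb = {("London", "LocatedInCountry", "UK")}
# "Boston" is unknown to this toy KB, so the first query is open-world.
print(is_open_world(Query("Boston", "LocatedInCountry", None), kb))
# Both "London" and "LocatedInCountry" are known, so this one is closed-world.
print(is_open_world(Query("London", "LocatedInCountry", None), kb))
```

Only open-world queries would trigger the interaction step; closed-world queries can go straight to inference.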

Given an open-world (head / tail) query $q$ from user $u$, CILK interacts with $u$ to acquire a set of supporting facts (SFs) [i.e., a set of clue triples $C_r$ involving query relation $r$ and/or a set of entity fact triples $F_e$ involving query entity $e$] for learning $r$ and $e$ (discussed in Sec 3.3). In Figure 1, (London, LocatedInCountry, UK) is a clue of query relation “LocatedInCountry” and (Harvard University, UniversityLocatedIn, Boston) is an entity fact involving query entity “Boston”. In this interaction process, CILK decides which questions to ask the user for knowledge acquisition over multiple dialogue turns (see Figure 1). This is step 1 as discussed in Sec. 1 and will be further discussed in Sec. 3.3.

Once SFs are gathered, CILK uses $\mathcal{M}$ to infer $q$, which is step 2 in Sec. 1 and will be detailed in Sec. 3.2. We refer to the whole interaction process involving multi-turn knowledge acquisition followed by the query inference step as a dialogue session. In summary, CILK is assumed to operate in multiple dialogue sessions with different users, acquiring knowledge in each session and thereby continuously learning new knowledge over time.

3.2 Inference Model

Given a query $q$, the Inference Model $\mathcal{M}$ attempts to infer $q$ by predicting the answer entity from $\mathcal{E}$. In particular, it selects each entity $e_i \in \mathcal{E}$ and forms $|\mathcal{E}|$ candidate triples {$d_1$, …, $d_{|\mathcal{E}|}$}, where $d_i$ is of the form $(e, r, e_i)$ for a tail query [or $(e_i, r, e)$ for a head query], and then scores each $d_i$ to quantify the relevance of $e_i$ as an answer to $q$. The top-ranked entity $\hat{e}$ is returned as the predicted answer of $q$. We deal with the case of query rejection by $\mathcal{M}$ later.

We use the neural knowledge base embedding (KBE) approach Bordes et al. (2011, 2013); Yang et al. (2014) to design $\mathcal{M}$. Given a KB represented as a triple store, a neural KBE method learns to encode relational information in the KB using low-dimensional representations (embeddings) of entities and relations and uses the learned representations to predict the correctness of unseen triples. In particular, the goal is to learn representations for entities and relations such that valid triples receive high scores (or low energies) and invalid triples receive low scores (or high energies) defined by a scoring function $S(\cdot)$. The embeddings can be learned via a neural network. In a typical (linear) KBE model, given a triple $(h, r, t)$, the input entities $h$, $t$ and relation $r$ correspond to high-dimensional vectors (either “one-hot” index vectors or “n-hot” feature vectors) $x_h$, $x_t$ and $x_r$ respectively, which are then projected into low-dimensional vectors $v_h$, $v_t$ and $v_r$ using an entity embedding matrix $W_E$ and a relation embedding matrix $W_R$, as given by $v_h = W_E^\top x_h$, $v_t = W_E^\top x_t$ and $v_r = W_R^\top x_r$. The scoring function $S(h, r, t)$ is then used to compute a validity score of the triple.

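Any KBE model can be used for learning $\mathcal{M}$. For evaluation, we adopt DistMult Yang et al. (2014) for its state-of-the-art performance over many other KBE models Kadlec et al. (2017). The scoring function of DistMult is defined as follows:

$$S(h, r, t) = v_h^\top \, \mathrm{diag}(v_r) \, v_t$$

where $\mathrm{diag}(v_r)$ is the diagonal matrix in $\mathbb{R}^{n \times n}$ for $n$-dimensional embeddings.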
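As a small illustration of candidate scoring and ranking with a DistMult-style score, the sketch below computes the bilinear-diagonal product of the head, relation and tail embeddings and ranks every KB entity as a tail candidate. The toy embeddings and the function names are assumptions made for this example; a real model would learn the embeddings from the KB.

```python
from typing import Dict, List, Sequence


def distmult_score(v_h: Sequence[float], v_r: Sequence[float], v_t: Sequence[float]) -> float:
    """DistMult score: v_h^T diag(v_r) v_t, i.e. a three-way element-wise product summed up."""
    return sum(a * b * c for a, b, c in zip(v_h, v_r, v_t))


def rank_tail_candidates(
    head: str,
    relation: str,
    ent_emb: Dict[str, List[float]],
    rel_emb: Dict[str, List[float]],
) -> List[str]:
    """Score (head, relation, e) for every entity e and return entities by descending score."""
    scores = {
        e: distmult_score(ent_emb[head], rel_emb[relation], v_e)
        for e, v_e in ent_emb.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy 2-dimensional embeddings (made up for illustration only).
ent_emb = {"Boston": [1.0, 0.5], "USA": [2.0, 0.0], "UK": [0.0, 1.0]}
rel_emb = {"LocatedInCountry": [1.0, 1.0]}
print(rank_tail_candidates("Boston", "LocatedInCountry", ent_emb, rel_emb))
```

Because the relation acts only through a diagonal matrix, the score is symmetric in head and tail, which is a known property of DistMult.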

The parameters of $\mathcal{M}$, i.e., $W_E$ and $W_R$, are learned by minimizing a margin-based ranking objective $\mathcal{L}$, which encourages the scores of positive triples to be higher than those of negative triples:

$$\mathcal{L} = \sum_{d^+ \in D^+} \; \sum_{d^- \in D^-} \max\{\, S(d^-) - S(d^+) + 1, \; 0 \,\}$$

where $D^+$ is a set of triples observed in $\mathcal{K}$, treated as positive triples. $D^-$ is a set of negative triples obtained by corrupting either the head entity or the tail entity of each +ve triple $(h, r, t)$ in $D^+$ by replacing it with a randomly chosen entity $h'$ and $t'$ respectively from $\mathcal{E}$ such that the corrupted triples $(h', r, t)$, $(h, r, t') \notin \mathcal{K}$. Note, $\mathcal{M}$ is trained continuously by sampling a set of +ve triples and correspondingly constructing a set of -ve triples as the KB expands with acquired supporting facts, to improve its inference capability over new queries (involving new query relations and entities). Thus, the embedding matrices $W_E$ and $W_R$ also grow linearly over time.
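The negative sampling and the ranking objective described above can be sketched as follows. This is a minimal illustration, assuming a margin of 1 and a hinge term for every (positive, negative) score pair; the function names and the 50/50 choice between corrupting the head and the tail are assumptions of this sketch.

```python
import random
from typing import List, Sequence, Set, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def corrupt(triple: Triple, entities: Sequence[str], known: Set[Triple], rng: random.Random) -> Triple:
    """Replace the head or the tail with a random entity until the result is not a known fact.

    Note: this loops forever if every possible corruption is already in `known`.
    """
    h, r, t = triple
    while True:
        e = rng.choice(entities)
        candidate = (e, r, t) if rng.random() < 0.5 else (h, r, e)
        if candidate not in known:
            return candidate


def margin_ranking_loss(pos_scores: List[float], neg_scores: List[float], margin: float = 1.0) -> float:
    """Sum of hinge terms max(0, margin - S(pos) + S(neg)) over all score pairs."""
    return sum(max(0.0, margin - p + n) for p in pos_scores for n in neg_scores)


rng = random.Random(0)
known = {("Boston", "LocatedInCountry", "USA")}
negative = corrupt(("Boston", "LocatedInCountry", "USA"), ["Boston", "USA", "UK"], known, rng)
print(negative)
print(margin_ranking_loss([1.0], [0.5]))  # positive beats negative by less than the margin
```

The loss is zero once every positive triple outscores every negative triple by at least the margin.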

Rejection in KB Inference. For a query with no answer entity existing in $\mathcal{K}$, CILK attempts to reject the query from being answered. To decide whether to reject the query or not, CILK maintains a threshold buffer $\mathcal{T}$ that stores entity- and relation-specific prediction thresholds and updates it continuously over time, as described below.

Besides the dataset $D_{tr}$ for training $\mathcal{M}$, CILK also creates a validation dataset $D_{vd}$, consisting of a set of validation query tuples of the form $(q, E^+, E^-)$. Here, $q$ is either a head or tail query involving query entity $e$ and relation $r$, $E^+$ = {$e^+$} is the set of positive (true answer) entities in $\mathcal{K}$ and $E^-$ = {$e^-$} is the set of negative entities randomly sampled from $\mathcal{E}$ such that $E^+ \cap E^- = \emptyset$.

Let $D^e_{vd}$ be the validation query tuple set involving entity $e$ and $D^r_{vd}$ be the validation query tuple set involving relation $r$. Then, we compute $\mathcal{T}[z]$ (i.e., the prediction threshold for $z$, where $z$ is either $e$ or $r$) as the average of the mean scores of triples involving +ve entities and the mean scores of triples involving -ve entities, computed over all $q$ in $D^z_{vd}$, given by

$$\mathcal{T}[z] = \frac{1}{|D^z_{vd}|} \sum_{(q, E^+, E^-) \in D^z_{vd}} \frac{\mu^+_q + \mu^-_q}{2}$$

where $\mu^+_q = \frac{1}{|E^+|} \sum_{e^+ \in E^+} S(d^+)$ and $\mu^-_q = \frac{1}{|E^-|} \sum_{e^- \in E^-} S(d^-)$. Here, $d^+ = (e, r, e^+)$ if $q$ is a tail query and $d^+ = (e^+, r, e)$ if $q$ is a head query. $d^-$ can be explained in a similar way.

Given a head or tail query $q$ involving query entity $e$ and relation $r$, we compute the prediction threshold for $q$ as $\mathcal{T}[q] = \max\{\mathcal{T}[e], \mathcal{T}[r]\}$.

Inference Decision Making. If $\hat{e}$ is the predicted answer entity by $\mathcal{M}$ for query $q$ and $S(q, \hat{e}) \geq \mathcal{T}[q]$, CILK responds to the user with answer $\hat{e}$. Otherwise, $q$ gets rejected.
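The threshold computation and the answer-or-reject decision can be sketched in a few lines of Python. The sketch assumes that validation scores are already available as per-query lists of positive and negative scores, that the query threshold is the larger of the entity and relation thresholds, and that a prediction is accepted when its score reaches that threshold; the function names are hypothetical.

```python
from statistics import mean
from typing import List, Optional, Tuple

# One validation query contributes (scores of +ve triples, scores of -ve triples).
ValidationScores = List[Tuple[List[float], List[float]]]


def prediction_threshold(validation: ValidationScores) -> float:
    """Average, over validation queries, of the midpoint between mean +ve and mean -ve scores."""
    return mean([(mean(pos) + mean(neg)) / 2 for pos, neg in validation])


def decide(best_entity: str, best_score: float, t_entity: float, t_relation: float) -> Optional[str]:
    """Return the predicted entity if its score reaches max(T[e], T[r]); otherwise reject (None)."""
    threshold = max(t_entity, t_relation)
    return best_entity if best_score >= threshold else None


t_e = prediction_threshold([([4.0, 2.0], [0.0, 2.0]), ([1.0], [0.0])])
t_r = prediction_threshold([([2.0], [0.0])])
print(t_e, t_r)
print(decide("USA", 1.5, t_e, t_r))  # score clears both thresholds
print(decide("USA", 1.0, t_e, t_r))  # score falls below the entity threshold
```

Taking the maximum of the two thresholds makes rejection more likely than using either threshold alone, which trades some answered queries for fewer wrong answers.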

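Input: query $q_j$ = $(e, r, ?)$ or $(?, r, e)$ issued by user $u$ at session-$j$; $\mathcal{K}_j$: CILK’s KB at session-$j$; $\mathcal{P}_j$: Performance Buffer at session-$j$; $\mathcal{T}_j$: Threshold Buffer at session-$j$; $\mathcal{M}_j$: trained Inference Model at session-$j$;
$\alpha$: probability of treating an acquired supporting fact as a training triple;
$\rho$: % of entities or relations in $\mathcal{K}_j$ that belong to the diffident set.

Output: $\hat{e}$: predicted entity as answer of query $q_j$ in session-$j$.

1:  if $r \notin \mathcal{K}_j$ or IsDiffident($r$, $\mathcal{P}_j$, $\rho$) then
2:      $C_r$ ← AskUserforCLUE($r$) {acquire supporting facts to learn $r$’s embedding}
3:  end if
4:  if $e \notin \mathcal{K}_j$ or IsDiffident($e$, $\mathcal{P}_j$, $\rho$) then
5:      $F_e$ ← AskUserforEntityFacts($e$) {acquire supporting facts to learn $e$’s embedding}
6:  end if
7:  if $C_r \neq \emptyset$ then
8:      Add clue triples from $C_r$ into $\mathcal{K}_j$ and randomly mark $\alpha$% of $C_r$ as training triples and (1-$\alpha$)% as validation triples respectively in $\mathcal{K}_{j+1}$.
9:  end if
10:  if $F_e \neq \emptyset$ then
11:      Add fact triples from $F_e$ into $\mathcal{K}_j$ and randomly mark $\alpha$% of these triples as training triples and (1-$\alpha$)% as validation triples.
12:  end if
13:  $D^r_{tr}$, $D^r_{vd}$ ← SampleTripleSet($\mathcal{K}_{j+1}$, $r$)
14:  $D^e_{tr}$, $D^e_{vd}$ ← SampleTripleSet($\mathcal{K}_{j+1}$, $e$)
15:  $\mathcal{M}_{j+1}$ ← TrainInfModel($\mathcal{M}_j$, $D^r_{tr} \cup D^e_{tr}$)
16:  $\mathcal{P}_{j+1}$, $\mathcal{T}_{j+1}$ ← UpdatePerfandThreshBuffer($\mathcal{M}_{j+1}$, $D^r_{vd} \cup D^e_{vd}$, $\mathcal{P}_j$, $\mathcal{T}_j$)
17:  $\hat{e}$ ← PredictAnswerEntity($\mathcal{M}_{j+1}$, $q_j$, $\mathcal{T}_{j+1}$)
Algorithm 1 CILK Knowledge Learning and Inference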

3.3 Working of CILK

Given a query $q$ involving unknown query entity $e$ and/or relation $r$, CILK has to ask the user to provide supporting facts to learn embeddings of $e$ and $r$ in order to infer $q$. However, the user in a given session can only provide very few supporting facts, which may not be sufficient for learning good embeddings of $e$ and $r$. Moreover, to accumulate a sufficiently good validation dataset for learning $\mathcal{T}[e]$ and $\mathcal{T}[r]$, CILK needs to gather more triples from users involving $e$ and $r$. But asking for SFs for any entity and/or relation can be annoying to the user and is also unnecessary if CILK has already learned good embeddings of that entity and/or relation (i.e., CILK has performed well in predicting the true answer entity for queries involving that entity and/or relation in past dialogue sessions with other users). Thus, besides the unknown ones, it is more reasonable to ask for SFs only for those known entities and/or relations for which CILK is not confident about performing inference accurately.

To minimize the rate of user interaction and justify the knowledge acquisition process, CILK uses a performance buffer $\mathcal{P}$ to store the performance statistics of CILK in past dialogue sessions. We use Mean Reciprocal Rank (MRR) to measure the performance of $\mathcal{M}$ (discussed in Sec. 4.1). In particular, $\mathcal{P}[e]$ and $\mathcal{P}[r]$ denote the avg. MRR achieved by $\mathcal{M}$ while answering queries involving $e$ and $r$ respectively, evaluated on the validation dataset $D_{vd}$. At the end of each dialogue session, CILK detects the set of bottom $\rho$% query relations and entities in $\mathcal{P}$ based on MRR scores evaluated on the validation dataset. We call these sets the diffident relation and entity sets respectively for the next dialogue session. If the query relation and/or entity issued in the next session belongs to the diffident relation or entity set, CILK asks the user for supporting facts (footnote 4). Otherwise, it proceeds with inference, answering or rejecting the query.

Footnote 4: Note, if $e$ (unknown) or $r$ appears for the first time in a user query, then it cannot be in the diffident set. But the system has to ask the user a question by default.
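A minimal sketch of how a diffident set could be derived from the performance buffer is given below. It assumes the buffer is a dictionary from an entity or relation name to its average validation MRR and that the bottom percentage is rounded down to a whole number of items; both choices, like the function names, are assumptions of this example.

```python
import math
from typing import Dict, Set


def diffident_set(perf: Dict[str, float], rho: float) -> Set[str]:
    """Return the bottom rho% of items (entities or relations) by average MRR."""
    k = math.floor(len(perf) * rho / 100)  # rounding down is an assumption
    ranked = sorted(perf, key=perf.get)    # ascending: worst performers first
    return set(ranked[:k])


def should_ask_user(item: str, known: Set[str], diffident: Set[str]) -> bool:
    """Ask for supporting facts if the item is unknown to the KB or currently diffident."""
    return item not in known or item in diffident


perf = {"r1": 0.9, "r2": 0.1, "r3": 0.5, "r4": 0.7, "r5": 0.3}
diffident = diffident_set(perf, rho=40)
print(sorted(diffident))
print(should_ask_user("r9", set(perf), diffident))  # unseen relation
print(should_ask_user("r1", set(perf), diffident))  # known and well learned
```

An item seen for the first time has no entry in the buffer, so it is handled by the "unknown" branch rather than by the diffident set.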

Algorithm 1 shows the interactive knowledge learning and inference process of CILK on a query $q_j$ in a given dialogue session-$j$. Let $\mathcal{K}_j$, $\mathcal{P}_j$, $\mathcal{T}_j$ and $\mathcal{M}_j$ be the current version of the KB, performance buffer, threshold buffer and inference model of CILK at the point when session-$j$ starts. Then, the interactive knowledge learning and inference proceeds as follows:

  If $r \notin \mathcal{K}_j$ or $r$ is diffident in $\mathcal{P}_j$, the interaction module $\mathcal{I}$ of CILK asks the user to provide clue(s) involving $r$ [Line 1-3]. Similarly, if $e \notin \mathcal{K}_j$ or $e$ is diffident in $\mathcal{P}_j$, $\mathcal{I}$ asks the user to provide entity fact(s) involving $e$ [Line 4-6].

  If the user provides $C_r$ and/or $F_e$, $\mathcal{I}$ augments $\mathcal{K}_j$ with triples from $C_r$ and $F_e$ respectively and expands $\mathcal{K}_j$ to $\mathcal{K}_{j+1}$ [Line 7-12]. In this process, $\alpha$% of the triples in $C_r$ and $F_e$ are randomly marked as training triples and the rest (1-$\alpha$)% are marked as validation triples while storing them in $\mathcal{K}_{j+1}$.

  Next, a set of training triples $D^r_{tr}$, $D^e_{tr}$ and a set of validation triples $D^r_{vd}$, $D^e_{vd}$ are sampled randomly from $\mathcal{K}_{j+1}$ involving $r$ and $e$ respectively [Line 13-14] for training and evaluating $\mathcal{M}_j$. While sampling, we keep the ratio of the number of training triples to that of validation triples fixed to maintain a fixed training and validation set distribution. The size of the training set is capped (tuned based on real-time training requirements).

  Next, $\mathcal{M}_j$ is trained with $D_{tr} = D^r_{tr} \cup D^e_{tr}$ and gets updated to $\mathcal{M}_{j+1}$ [Line 15]. Note that training with $D_{tr}$ encourages $\mathcal{M}$ to learn the embeddings of both $e$ and $r$ before inferring $q_j$. Then, we evaluate $\mathcal{M}_{j+1}$ with $D_{vd} = D^r_{vd} \cup D^e_{vd}$ in order to update the performance buffer $\mathcal{P}_j$ into $\mathcal{P}_{j+1}$ and the threshold buffer $\mathcal{T}_j$ into $\mathcal{T}_{j+1}$ [Line 16]. Finally, $\mathcal{M}_{j+1}$ is invoked by CILK to either infer $q_j$, predicting an answer entity $\hat{e}$ from $\mathcal{K}_{j+1}$ [Line 17], or reject $q_j$ to indicate that the true answer does not exist in $\mathcal{K}_{j+1}$. Note, CILK trains $\mathcal{M}_j$ and infers $q_j$ [Line 13-17] only if $e, r \in \mathcal{K}_{j+1}$.
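The step that stores acquired supporting facts and marks each one as a training or validation triple can be sketched as below. The sketch treats the marking as an independent random draw per fact with a given training probability, which is one reading of the algorithm's input description; the function name and the returned pair of lists are assumptions of this example.

```python
import random
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def mark_supporting_facts(
    facts: List[Triple], train_prob: float, rng: random.Random
) -> Tuple[List[Triple], List[Triple]]:
    """Split acquired supporting facts into training and validation triples.

    Each fact becomes a training triple with probability `train_prob`,
    otherwise it is kept as a validation triple.
    """
    train, valid = [], []
    for fact in facts:
        (train if rng.random() < train_prob else valid).append(fact)
    return train, valid


facts = [
    ("London", "LocatedInCountry", "UK"),
    ("Harvard University", "UniversityLocatedIn", "Boston"),
]
train, valid = mark_supporting_facts(facts, train_prob=0.9, rng=random.Random(0))
print(len(train), len(valid))
```

With only a handful of facts per session, the realized split can differ a lot from the target ratio; the ratio is only met on average across many sessions.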

4 Experiments

As indicated earlier, the proposed CILK system is best used in a multi-user environment, so it naturally observes many more query triples (hence, accumulates more facts) from different users over time. Presently CILK fulfills its knowledge learning requirement by only adding the supporting facts into the KB. The predicted query triples are not added as they are unverified knowledge. However, in practice, CILK can store these predicted triples in the KB as well after checking their correctness through cross-verification, by asking other users in some future related conversations. Note that CILK may not be able to verify its prediction with the same user who asked the query $q$ because he/she may not know the answer(s) for $q$. However, CILK can acquire the correct answer(s) of $q$ by asking some other user in a future related conversation. At that point, CILK can incorporate the verified answer triple into its KB and also train itself using that triple. We do not address the issue here.

4.1 Evaluation Setup

Evaluation of CILK with real users in a crowd-source based setup would be very difficult to conduct and prohibitively time-consuming (and expensive) as it needs a large number of real-time and continuous user interaction. Thus, we design a simulated interactive environment for the evaluation.

KB Statistics                                                    WordNet                Nell
# Relations (original KB / base KB)                              18 / 12                150 / 142
# Entities (original KB / base KB)                               13,595 / 13,150        11,443 / 10,547
# Triples (original KB / base KB)                                53,573 / 33,159        66,529 / 51,252
# Test relations (kwn / unk)                                     18 (12 / 6)            25 (17 / 8)
# Initial train / initial valid / test (or query) triples        29846 / 3323 / 1180    46056 / 5196 / 1250
Test (or query) triples statistics [$(e, r, ?)$ or $(?, r, e)$]
% triples with only $e$ unk                                      8.05                   19.36
% triples with only $r$ unk                                      30.25                  21.84
% triples with both $e$ and $r$ unk                              5.25                   10.16
Table 1: Dataset statistics [kwn = known, unk = unknown]

We create a simulated user (a program) to interact with CILK, where the simulated user issues a query to CILK and CILK answers the query. The (simulated) user has (1) a knowledge base ($\mathcal{K}_u$) for answering questions from CILK, and (2) a query dataset ($D_q$) from which the user issues queries to CILK (footnote 5). Here, $D_q$ consists of a set of structured query triples of the form $(e, r, ?)$ and $(?, r, e)$ readable by CILK. In practice, the user only issues queries to CILK, but cannot evaluate the performance of the system unless the user knows the answer. To evaluate the performance of CILK on $D_q$ in the simulated setting, we also collect the answer set for each query $q \in D_q$ (discussed shortly).

Footnote 5: Using $\mathcal{K}_u$ and $D_q$, we can create simulated dialogues as well. Utterances in a dialogue can be created using a language template for each triple. Likewise, extraction of triples from utterances can be done using templates as well.

As CILK is supposed to perform continuous online knowledge acquisition and learning, we evaluate its performance on the streaming query dataset. We assume that CILK has been deployed with an initial knowledge base ($\mathcal{K}_b$) and the inference model $\mathcal{M}$ has been trained over all triples in $\mathcal{K}_b$ for a given number of epochs. We call $\mathcal{K}_b$ the base KB of CILK, which serves as its knowledge base at the time point when our evaluation starts. The training process of $\mathcal{M}$ using triples in $\mathcal{K}_b$ is referred to as the initial training phase of CILK from here onwards. In the initial training phase, we randomly split the $\mathcal{K}_b$ triples into a set of training triples $D_{tr}$ and a set of validation triples $D_{vd}$ with a 9:1 ratio and train $\mathcal{M}$ with $D_{tr}$. $D_{vd}$ is used to tune model hyper-parameters and to populate the initial performance and threshold buffers $\mathcal{P}$ and $\mathcal{T}$ respectively. The KB, the inference model and both buffers get updated continuously afterwards, in the online training and evaluation phase (with newly acquired triples), during interaction with the simulated user.

WordNet
Model       | Rel-K / Ent-K      | Rel-K / Ent-UNK    | Rel-UNK / Ent-K    | Rel-UNK / Ent-UNK  | Overall
            | MRR   H@1    H@10  | MRR   H@1    H@10  | MRR   H@1    H@10  | MRR   H@1    H@10  | MRR   H@1    H@10
EntTh-BTr   | 0.46  34.57  57.23 | 0.04   3.50   4.38 | 0.20  16.21  25.80 | 0.07   4.83   8.06 | 0.33  25.03  40.89
RelTh-BTr   | 0.45  12.71  16.32 | 0.04   7.89   7.89 | 0.21  12.30  16.51 | 0.07   9.67   9.67 | 0.33  12.09  15.39
MinTh-BTr   | 0.45  33.81  57.99 | 0.03   2.63   3.50 | 0.22  15.93  28.05 | 0.07   4.84   8.06 | 0.33  24.43  41.91
MaxTh-BTr   | 0.45  34.72  56.87 | 0.04   5.26   6.14 | 0.20  15.92  25.79 | 0.07   6.45   9.67 | 0.33  25.27  40.95
MaxTh-EntTr | 0.42  26.07  42.74 | 0.26  19.29  22.80 | 0.19  11.79  15.17 | 0.23  17.74  20.96 | 0.33  20.77  31.60
MaxTh-RelTr | 0.45  34.48  55.93 | 0.003  2.63   3.51 | 0.13  11.25  18.01 | 0.11   8.06  16.13 | 0.30  23.46  38.09

Nell
Model       | Rel-K / Ent-K      | Rel-K / Ent-UNK    | Rel-UNK / Ent-K    | Rel-UNK / Ent-UNK  | Overall
            | MRR   H@1    H@10  | MRR   H@1    H@10  | MRR   H@1    H@10  | MRR   H@1    H@10  | MRR   H@1    H@10
EntTh-BTr   | 0.37  26.80  47.28 | 0.06   4.47   7.22 | 0.15   9.58  19.97 | 0.04   1.64   7.36 | 0.22  16.18  29.78
RelTh-BTr   | 0.37  17.01  25.05 | 0.06   3.78   4.13 | 0.16   8.72  17.67 | 0.03   3.28   4.92 | 0.23  11.35  17.49
MinTh-BTr   | 0.37  26.63  47.30 | 0.06   5.33   8.60 | 0.15  10.24  23.21 | 0.03   1.64   5.72 | 0.23  16.41  30.57
MaxTh-BTr   | 0.37  27.57  47.58 | 0.06   4.30   7.57 | 0.16  10.69  19.61 | 0.03   4.92   8.20 | 0.23  17.16  30.03
MaxTh-EntTr | 0.34  21.82  42.65 | 0.13   3.95   7.91 | 0.22  16.48  20.56 | 0.06   4.06   4.06 | 0.24  15.46  27.44
MaxTh-RelTr | 0.37  26.60  47.07 | 0.04   3.44   5.85 | 0.20  12.18  17.67 | 0.06   3.28  10.67 | 0.23  16.67  29.29
Table 2: Comparison of predictive performance of various versions of CILK. For each KB dataset, we compare the first four (Threshold) variants denoted as “X-BTr” and last three (dataset sampling strategy) variants denoted as “MaxTh-X” and marked the highest H@1 and H@10 values (among each of the groups of four and three) in bold. Thus, some columns have at max. two values marked bold (due to the two comparison groups). MaxTh-BTr in the table is the version of CILK proposed in Sec. 3.

The relations and entities in $\mathcal{K}_b$ are regarded as known relations and known entities to CILK until the evaluation starts. Thus, the initial inference model is trained and validated with triples involving only known relations and known entities (in $\mathcal{K}_b$). During the online training and evaluation phase, CILK faces queries (from $D_q$) involving both known and unknown relations and entities. More specifically, if a relation (entity) appearing in a query exists in $\mathcal{K}_b$, we consider that query relation (entity) a known query relation (entity). Otherwise, it is referred to as an unknown query relation (entity).

We create the simulated user’s KB $\mathcal{K}_u$, the base KB ($\mathcal{K}_b$) and the query dataset $D_q$ from two standard KB datasets: (1) WordNet Bordes et al. (2013) and (2) Nell Gardner et al. (2014). From each KB dataset, we first build a fairly large triple store and use it as the original KB ($\mathcal{K}_{org}$) and then create $\mathcal{K}_u$ of the user, the base KB ($\mathcal{K}_b$) of CILK and $D_q$ from $\mathcal{K}_{org}$, as discussed below (Table 1 shows the results).

Simulated User, Base KB Creation and Query Dataset Generation. In Nell, we found 150 relations with triples, and we randomly selected 25 relations for $D_q$. We shuffle the list of 25 relations, select 34% of them as unknown relations and consider the rest (66%) as known relations.

For each known relation $r$, we randomly shuffle the list of distinct triples for $r$, choose (maximum) 250 triples and randomly select 20% as test triples. We add a randomly chosen subset of the rest of the triples, along with the leftovers (not in the list of 250), into $\mathcal{K}_{org}$, and the other subset is added to $\mathcal{K}_u$ (to provide supporting facts involving poorly learned known relations and/or entities, if asked [see Sec 3.3]).

For each unknown relation $r$, we remove all triples of $r$ from $\mathcal{K}_{org}$, randomly choose 20% of them and reserve them as query triples for unknown $r$. The remaining 80% of the triples of unknown $r$ are added to $\mathcal{K}_u$ (for providing clues). In this process, we also make sure that the query instances involving unknown $r$ are excluded from $\mathcal{K}_u$. Thus, the user cannot provide the query triple itself as a clue to CILK (during inference), which also simulates the case that the user does not know the answer to the issued query. Note, if the user cannot provide a clue for an unknown query relation or a fact for an unknown query entity (not likely), CILK will not be able to correctly answer the query.

              WordNet                               Nell
Model         Pr(pred | AE)   Pr(Reject | ¬AE)      Pr(pred | AE)   Pr(Reject | ¬AE)
EntTh-BTr     0.85            0.24                  0.82            0.15
RelTh-BTr     0.20            0.92                  0.26            0.72
MinTh-BTr     0.90            0.18                  0.86            0.10
MaxTh-BTr     0.83            0.33                  0.72            0.31
Table 3: Performance of CILK Threshold variants on Rejection and prediction decisions. Here, AE (¬AE) means the true answer entity exists (does not exist) in the KB. “Pr(pred | AE)” means the probability of predicting an answer, given the true answer exists in the KB. “Pr(Reject | ¬AE)” means the probability of rejecting the query, given the true answer does not exist in the KB.

At this point, $D_q$ consists of query triples involving both known and unknown relations, but all known entities. To create queries in $D_q$ having unknown entities, we randomly choose 20% of the entities in $D_q$ triples, remove all triples involving those entities from $\mathcal{K}_{org}$ and add them to $\mathcal{K}_u$. Now, $\mathcal{K}_{org}$ gets reduced to $\mathcal{K}_b$ (base KB). Next, for each query triple $(h, r, t) \in D_q$, we convert the triple into a head query $(?, r, t)$ [or a tail query $(h, r, ?)$] by randomly deleting the head or tail entity. We also collect the answer set for each $q \in D_q$ based on observed triples in $\mathcal{K}_{org}$ for CILK evaluation. Note, the generated query triples (with answer entity) in $D_q$ are not directly in $\mathcal{K}_b$ or $\mathcal{K}_u$.

The WordNet dataset being small, we use all its 18 relations for creating $\mathcal{K}_u$, $\mathcal{K}_b$ and $D_q$, following the same procedure as for Nell. As mentioned earlier, the triples in $\mathcal{K}_b$ are randomly split into 90% training and 10% validation datasets for simulating the initial training phase of CILK.
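The conversion of a held-out triple into a head or tail query, together with the collection of its answer set, can be sketched as follows. The tuple layout with `None` for the deleted slot, the equal probability of deleting the head or the tail, and the function names are assumptions of this illustration.

```python
import random
from typing import Optional, Set, Tuple

Triple = Tuple[str, str, str]                         # (head, relation, tail)
Query = Tuple[Optional[str], str, Optional[str]]      # None marks the deleted "?" slot


def to_query(triple: Triple, rng: random.Random) -> Query:
    """Turn a triple into a head query (?, r, t) or a tail query (h, r, ?) at random."""
    h, r, t = triple
    if rng.random() < 0.5:
        return (None, r, t)
    return (h, r, None)


def answer_set(query: Query, triples: Set[Triple]) -> Set[str]:
    """Collect every entity that correctly fills the deleted slot according to `triples`."""
    q_h, q_r, q_t = query
    if q_h is None:
        return {h for h, r, t in triples if r == q_r and t == q_t}
    return {t for h, r, t in triples if r == q_r and h == q_h}


observed = {
    ("Boston", "LocatedInCountry", "USA"),
    ("Chicago", "LocatedInCountry", "USA"),
}
query = to_query(("Boston", "LocatedInCountry", "USA"), random.Random(0))
print(query, answer_set(query, observed))
```

A head query over a many-to-one relation, as in this toy data, naturally has several correct answers, which is why an answer set rather than a single answer is collected.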

Hyper-parameter Settings. Embedding dimensions of entity and relations are empirically set as 250 for WordNet and Nell, initial training epochs as 100 for WordNet (140 for Nell), training batch size 128, as 500, as 50, , , random seed as 1000, 4 negative triples generated per positive triple, online training epoch as 5 (2) for each closed (open) world query processing, and learning rate 0.001 for both KB datasets. L2-regularization parameter set as 0.001. Adam optimizer is used for optimization.

Compared Models. Since no existing work solves our proposed problem, we compare various versions of CILK, constructed based on different types of prediction threshold for query rejection (Sec. 3.2) and different sampling strategies for the online training and validation datasets [see Lines 13-14 of Algorithm 1], as discussed below:

  CILK variants based on prediction threshold types, namely EntTh-BTr, RelTh-BTr, MinTh-BTr and MaxTh-BTr  (see Table 2). For EntTh-BTr, we use , for RelTh-BTr, we use , for MinTh-BTr, we use and MaxTh-BTr uses as proposed in Sec 3.2. Here, “BTr” indicates that the CILK variant samples triples involving both query entity and relation from KB to build and .

 CILK variants based on dataset sampling strategies: MaxTh-BTr (as explained above), MaxTh-EntTr and MaxTh-RelTr (see Table 2). Given the query entity and the query relation, MaxTh-EntTr samples only triples involving the query entity, and MaxTh-RelTr samples only triples involving the query relation, to build the online training and validation datasets. Note that if the sampled training (validation) dataset is empty, CILK skips the online training (validation) step for that session.
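The three sampling strategies can be illustrated with a short sketch. It is a simplified reading of the description above, not the authors' implementation: the function name `sample_online_triples`, the `max_samples` cap, and the interpretation of "both" as the union of entity-related and relation-related triples are assumptions made for this example.

```python
import random
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def sample_online_triples(
    kb: List[Triple],
    query_entity: str,
    query_relation: str,
    strategy: str,
    max_samples: int,
    rng: random.Random,
) -> List[Triple]:
    """Sample KB triples for online training according to the chosen strategy."""

    def involves_entity(t: Triple) -> bool:
        return query_entity in (t[0], t[2])

    def involves_relation(t: Triple) -> bool:
        return t[1] == query_relation

    if strategy == "BTr":      # triples involving the query entity or the query relation
        pool = [t for t in kb if involves_entity(t) or involves_relation(t)]
    elif strategy == "EntTr":  # only triples involving the query entity
        pool = [t for t in kb if involves_entity(t)]
    elif strategy == "RelTr":  # only triples involving the query relation
        pool = [t for t in kb if involves_relation(t)]
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    # An empty result signals that online training should be skipped for this session.
    return rng.sample(pool, min(max_samples, len(pool)))
```

With this reading, EntTr discards relation-only triples and RelTr discards entity-only triples, while BTr keeps both kinds.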

(#C, #EF)    WordNet                    Nell
             MRR     H@1     H@10       MRR     H@1     H@10
(1, 1)       0.30    22.09   37.83      0.23    16.89   31.14
(1, 2)       0.32    23.00   39.25      0.25    18.11   31.30
(1, 3)       0.33    25.27   40.95      0.23    17.16   30.03
(1, 3)-U     0.31    23.52   38.15      0.21    15.77   28.64
(2, 2)       0.32    23.43   39.05      0.23    16.82   30.33
Table 4: Overall performance of MaxTh-BTr (CILK), varying the maximum number of clues (#C) and entity facts (#EF) acquired from the user per dialogue session (if asked by the interaction module).

Evaluation Metrics. We use two common KBE evaluation metrics: mean reciprocal rank (MRR) and Hits@k (H@k). MRR is the average inverse rank of the top-ranked true answer entity over all queries Bordes et al. (2013). Hits@k is the proportion of test queries for which the true answer entity appears in the top-k (ranked) predictions. Higher MRR and Hits@k indicate better performance.
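The two metrics can be computed as in the following minimal sketch. The function names and the input layout (a ranked candidate list plus a set of true answers per query) are assumptions for illustration, and Hits@k is returned as a fraction rather than a percentage.

```python
from typing import List, Sequence, Set, Tuple


def reciprocal_rank(ranked: Sequence[str], answers: Set[str]) -> float:
    """Return 1 / rank of the highest-ranked true answer, or 0.0 if none is ranked."""
    for rank, candidate in enumerate(ranked, start=1):
        if candidate in answers:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average the reciprocal rank over all (ranked, answers) pairs."""
    return sum(reciprocal_rank(r, a) for r, a in queries) / len(queries)


def hits_at_k(queries: List[Tuple[Sequence[str], Set[str]]], k: int) -> float:
    """Fraction of queries with at least one true answer among the top-k candidates."""
    hits = sum(1 for r, a in queries if any(c in a for c in r[:k]))
    return hits / len(queries)
```

For example, a query whose first true answer is ranked second contributes 0.5 to the MRR sum, counts as a miss for Hits@1 and as a hit for Hits@10.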

4.2 Results and Analysis

For evaluation on a given KB (WordNet or Nell), we randomly generate a chronological ordering of all query instances in the query dataset, which are fed to the trained CILK (after the initial training phase is over) in a streaming fashion, and then evaluate CILK on the overall query dataset. The average test query processing time of CILK is 1.25 sec (on an Nvidia Titan RTX GPU). While evaluating a query, if its true answer does not exist in the KB and CILK rejects the query, we consider it a correct prediction. For such a query, the reciprocal rank (RR) cannot be computed. Thus, we exclude it when computing MRR but count it when computing Hits@k.
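The treatment of correctly rejected queries can be made concrete with a small sketch. The `QueryResult` record and the `evaluate` function are hypothetical names introduced here. The handling of a query that is rejected even though its answer exists in the KB is not specified in the text, so scoring it as a miss with RR = 0 is an assumption of this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass
class QueryResult:
    ranked: Optional[List[str]]  # ranked candidates, or None if the query was rejected
    answers: Set[str]            # true answer entities
    answer_in_kb: bool           # whether a true answer exists in the current KB


def evaluate(results: List[QueryResult], k: int) -> Tuple[float, float]:
    """Return (MRR, Hits@k in percent) with the correct-rejection rule applied."""
    rr_values: List[float] = []
    hits = 0
    for res in results:
        if res.ranked is None:
            if not res.answer_in_kb:
                hits += 1              # correct rejection: a hit, but no RR is defined
            else:
                rr_values.append(0.0)  # ASSUMPTION: wrong rejection is scored as a miss
            continue
        rank = next(
            (i for i, c in enumerate(res.ranked, start=1) if c in res.answers), None
        )
        rr_values.append(1.0 / rank if rank is not None else 0.0)
        if rank is not None and rank <= k:
            hits += 1
    mrr = sum(rr_values) / len(rr_values) if rr_values else 0.0
    hits_pct = 100.0 * hits / len(results) if results else 0.0
    return mrr, hits_pct
```

Because correct rejections enter only the Hits@k count, the two metrics are averaged over different numbers of queries.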

% Test Data Observed    WordNet                    Nell
                        MRR     H@1     H@10       MRR     H@1     H@10
Overall Performance
50%                     0.37    27.50   47.19      0.29    20.77   38.87
100%                    0.37    27.67   46.71      0.29    20.82   38.65
On Open-world Queries
50%                     0.16    11.87   20.11      0.09    4.81    16.47
100%                    0.18    12.90   22.91      0.13    8.58    19.54
Table 5: Performance of MaxTh-BTr (CILK) on test queries observed over time, given the model has made a prediction.

Table 2 shows the performance of the CILK variants on the query dataset, evaluated in terms of MRR, H@1 and H@10 for both KBs. We present the overall result on the whole query dataset as well as results on subsets of the query dataset, denoted as (Rel-X, Ent-Y), where X and Y can be either known (‘K’) or unknown (‘UNK’), ‘Rel’ denotes the query relation and ‘Ent’ denotes the query entity. For example, (Rel-K, Ent-UNK) denotes the subset of the query dataset that contains query triples involving only known query relations and unknown query entities (with respect to the base KB). For all variants, we fix the maximum number of clue triples and entity fact triples provided by the simulated user for each query (when asked) as 1 and 3, respectively.

From Table 2, we see that MaxTh-BTr (the version of CILK described in Sec. 3) achieves the best overall results among all variants on both KB datasets. Among the threshold versions, MaxTh-BTr and MinTh-BTr perform better than the rest. The relatively poor result of RelTh-BTr shows that the threshold strategy plays a vital role in performance. Among the dataset sampling strategies, MaxTh-BTr again performs better than the other versions. Because triples involving both the query entity and the query relation are selected for online training in MaxTh-BTr, CILK is trained specifically on relevant (query-specific) triples before the query is answered. For the other variants, either triples involving the query relation (for MaxTh-EntTr) or triples involving the query entity (for MaxTh-RelTr) are discarded, causing a drop in performance.

In Table 3, we compare the CILK threshold variants by how often each one predicts an answer when the true answer exists in its current KB, given by Pr(pred | AE), and how often it rejects the query when the true answer does not exist, given by Pr(Reject | ¬AE). For both datasets, EntTh-BTr tends to predict more and reject less, whereas RelTh-BTr is more cautious in prediction. MinTh-BTr is the least cautious of all. MaxTh-BTr adopts the best of both worlds (EntTh-BTr and RelTh-BTr), showing a moderate strategy in prediction and rejection behavior.
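The two conditional rates reported in Table 3 can be estimated from logged decisions, as in this minimal sketch. The function name `decision_rates` and the input layout, a list of `(answer_in_kb, predicted)` boolean pairs with `predicted=False` meaning the query was rejected, are assumptions made for illustration.

```python
from typing import List, Tuple


def decision_rates(decisions: List[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Return (Pr(pred | AE), Pr(Reject | not AE)) from (answer_in_kb, predicted) pairs."""
    # Split the decisions by whether the true answer exists in the KB.
    with_answer = [predicted for in_kb, predicted in decisions if in_kb]
    without_answer = [predicted for in_kb, predicted in decisions if not in_kb]

    # Rate of predicting when an answer exists; NaN if no such query was seen.
    pr_pred = sum(with_answer) / len(with_answer) if with_answer else float("nan")
    # Rate of rejecting when no answer exists; NaN if no such query was seen.
    pr_reject = (
        sum(1 for p in without_answer if not p) / len(without_answer)
        if without_answer
        else float("nan")
    )
    return pr_pred, pr_reject
```

A variant that always predicts would score 1.0 on the first rate and 0.0 on the second, which is why the two numbers need to be read together.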

Table 4 shows the performance of MaxTh-BTr when varying the maximum number of clue triples and entity fact triples provided by the user (when asked). Comparing (1, 1), (1, 2) and (1, 3), we see a clear improvement in MaxTh-BTr as the number of (acquired) entity fact triples increases, especially for WordNet. This shows that if the user interacts more and provides more information for a given query, CILK can gradually improve its performance over time [i.e., with more accumulated triples in its KB]. For Nell, (1, 2) improves over (1, 1) on all three metrics and is the best variant overall, whereas (1, 3) improves over (1, 1) only in H@1. Comparing (1, 3) and (2, 2) for both KBs, we see that acquiring more entity facts contributes more to the overall performance improvement than acquiring more clues. This is because a past query relation is more likely to reappear in a future query than a past query entity, so CILK can gradually learn a relation embedding with fewer clues per query, unlike an entity embedding. (1, 3)-U denotes the setup where CILK asks for clues or entity facts only if the query triple has an unknown entity and/or relation, i.e., we disable the use of the performance buffer (see Sec. 3.3). Without sufficient training triples to learn an unknown query relation and entity, the overall performance degrades. This shows the importance and effectiveness of the performance buffer in improving the performance of CILK with limited user interactions.

In Table 5, we show the performance of MaxTh-BTr on (predicted) test queries over time. The overall performance changes only marginally between 50% and 100% of the test data. For open-world queries, however, there is a substantial improvement (e.g., MRR rises from 0.16 to 0.18 on WordNet and from 0.09 to 0.13 on Nell), as CILK acquires relatively more facts for open-world queries than for closed-world ones.

5 CILK: Use Cases in Dialogue Systems

There are many applications for CILK. Conversational QA systems Kiyota et al. (2002); Bordes et al. (2014), conversational recommendation systems Anelli et al. (2018); Zhang et al. (2018), information-seeking conversational agents Yang et al. (2018), etc., that deal with real-world facts, are all potential use cases for CILK.

Recently, Young et al. (2018); Zhou et al. (2018) showed that dialogue models augmented with commonsense facts improve dialogue generation performance. It is quite apparent that continuous knowledge learning using CILK can help these models grow their KBs over time and thereby improve their response generation quality.

The proposed version of CILK has been designed based on a set of assumptions (see Sec. 1) to reduce the complexity of the modeling. For example, we do not handle intentional or unintentional injection of false knowledge by users, which could corrupt the system’s KB. We also do not deal with fact extraction errors of the peripheral information extraction module or query parsing errors of the semantic parsing modules, which can affect the knowledge learning of CILK. We believe these are separate research problems and are out of the scope of this work. In the future, we plan to model an end-to-end approach to knowledge learning in which all peripheral components of CILK can be jointly learned with CILK itself. We also plan to address the cold start problem, which arises when a new relation is first added to the KB and there is little training data for it.

Clearly, CILK does not learn all forms of knowledge. For example, it does not learn new concepts and topics, user traits and personality, or speaking styles. Learning these also forms a part of our future work.

6 Conclusion

In this paper, we proposed CILK, a continuous (or lifelong) and interactive knowledge learning engine for dialogue systems. When the system is unable to answer a WH-question from the user (given its existing KB), CILK exploits the situation by asking the user for some knowledge and using it to infer the answer to the query. We evaluated the engine on two real-world factual KB datasets and observed promising results. This also shows the potential of CILK to serve as a factual knowledge learning engine for future conversational agents.


Acknowledgments

This work was partially supported by a grant from the National Science Foundation (NSF IIS 1838770) and a research gift from Northrop Grumman.


References

  • Ameixa et al. (2014) David Ameixa, Luisa Coheur, Pedro Fialho, and Paulo Quaresma. 2014. Luke, I am your father: dealing with out-of-domain requests by using movies subtitles. In International Conference on Intelligent Virtual Agents, pages 13–21. Springer.
  • Anelli et al. (2018) Vito Walter Anelli, Pierpaolo Basile, Derek Bridge, Tommaso Di Noia, Pasquale Lops, Cataldo Musto, Fedelucio Narducci, and Markus Zanker. 2018. Knowledge-aware and conversational recommender systems. In ACM RecSys.
  • Angeli et al. (2015) Gabor Angeli, Melvin Johnson Premkumar, and Christopher D Manning. 2015. Leveraging linguistic structure for open domain information extraction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 344–354.
  • Banchs and Li (2012) Rafael E Banchs and Haizhou Li. 2012. Iris: a chat-oriented dialogue system based on the vector space model. In Proceedings of the ACL 2012 System Demonstrations, pages 37–42. ACL.
  • Bordes et al. (2014) Antoine Bordes, Sumit Chopra, and Jason Weston. 2014. Question answering with subgraph embeddings. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 615–620.
  • Bordes et al. (2015) Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. 2015. Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075.
  • Bordes et al. (2013) Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in neural information processing systems.
  • Bordes et al. (2011) Antoine Bordes, Jason Weston, Ronan Collobert, and Yoshua Bengio. 2011. Learning structured embeddings of knowledge bases. In Twenty-Fifth AAAI Conference on Artificial Intelligence.
  • Chen and Liu (2018) Zhiyuan Chen and Bing Liu. 2018. Lifelong machine learning. Morgan & Claypool Publishers.
  • Eric and Manning (2017) Mihail Eric and Christopher D Manning. 2017. Key-value retrieval networks for task-oriented dialogue. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue.
  • Gardner et al. (2014) Matt Gardner, Partha Talukdar, Jayant Krishnamurthy, and Tom Mitchell. 2014. Incorporating vector space similarity in random walk inference over knowledge bases. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 397–406.
  • Ghazvininejad et al. (2018) Marjan Ghazvininejad, Chris Brockett, Ming-Wei Chang, Bill Dolan, Jianfeng Gao, Wen-tau Yih, and Michel Galley. 2018. A knowledge-grounded neural conversation model. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Kadlec et al. (2017) Rudolf Kadlec, Ondrej Bajgar, and Jan Kleindienst. 2017. Knowledge base completion: Baselines strike back. Proceedings of the 2nd Workshop on Representation Learning for NLP, ACL.
  • Kiyota et al. (2002) Yoji Kiyota, Sadao Kurohashi, and Fuyuko Kido. 2002. Dialog navigator: A question answering system based on large text knowledge base. In Proceedings of the 19th international conference on Computational linguistics, pages 1–7. ACL.
  • Komatani et al. (2016) Kazunori Komatani, Tsugumi Otsuka, Satoshi Sato, and Mikio Nakano. 2016. Question selection based on expected utility to acquire information through dialogue. In International Workshop on Spoken Dialogue Systems (IWSDS).
  • Lao and Cohen (2010) Ni Lao and William W Cohen. 2010. Relational retrieval using a combination of path-constrained random walks. Machine learning, pages 53–67.
  • Lao et al. (2011) Ni Lao, Tom Mitchell, and William W Cohen. 2011. Random walk inference and learning in a large scale knowledge base. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 529–539. ACL.
  • Le et al. (2016) Phong Le, Marc Dymetman, and Jean-Michel Renders. 2016. LSTM-based mixture-of-experts for knowledge-aware dialogues. arXiv preprint arXiv:1605.01652.
  • Li et al. (2017a) Jiwei Li, Alexander H Miller, Sumit Chopra, Marc’Aurelio Ranzato, and Jason Weston. 2017a. Dialogue learning with human-in-the-loop. International Conference on Learning Representations.
  • Li et al. (2017b) Jiwei Li, Alexander H Miller, Sumit Chopra, Marc’Aurelio Ranzato, and Jason Weston. 2017b. Learning through dialogue interactions. International Conference on Learning Representations.
  • Li et al. (2017c) Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, and Dan Jurafsky. 2017c. Adversarial learning for neural dialogue generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2157–2169.
  • Long et al. (2017) Yinong Long, Jianan Wang, Zhen Xu, Zongsheng Wang, Baoxun Wang, and Zhuoran Wang. 2017. A knowledge enhanced generative conversational service agent. In Proceedings of the 6th Dialog System Technology Challenges (DSTC6) Workshop.
  • Lowe et al. (2015) Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. 2015. The Ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 285–294.
  • Madotto et al. (2018) Andrea Madotto, Chien-Sheng Wu, and Pascale Fung. 2018. Mem2Seq: Effectively incorporating knowledge bases into end-to-end task-oriented dialog systems. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 1468–1478.
  • Mazumder and Liu (2017) Sahisnu Mazumder and Bing Liu. 2017. Context-aware path ranking for knowledge base completion. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 1195–1201. AAAI Press.
  • Mazumder et al. (2018) Sahisnu Mazumder, Nianzu Ma, and Bing Liu. 2018. Towards a continuous knowledge learning engine for chatbots. arXiv preprint arXiv:1802.06024.
  • Mitchell et al. (2015) T Mitchell, W Cohen, E Hruschka, P Talukdar, J Betteridge, A Carlson, B Dalvi, M Gardner, B Kisiel, J Krishnamurthy, et al. 2015. Never-ending learning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 2302–2310. AAAI Press.
  • Neelakantan et al. (2015) Arvind Neelakantan, Benjamin Roth, and Andrew McCallum. 2015. Compositional vector space models for knowledge base completion. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 156–166.
  • Nickel et al. (2015) Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. 2015. A review of relational machine learning for knowledge graphs. Proceedings of the IEEE, pages 11–33.
  • Ono et al. (2017) Kohei Ono, Ryu Takeda, Eric Nichols, Mikio Nakano, and Kazunori Komatani. 2017. Lexical acquisition through implicit confirmations over multiple dialogues. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue.
  • Ono et al. (2016) Kohei Ono, Ryu Takeda, Eric Nichols, Mikio Nakano, and Kazunori Komatani. 2016. Toward lexical acquisition during dialogues through implicit confirmation for closed-domain chatbots. In Proceedings of Second Workshop on Chatbots and Conversational Agent Technologies (WOCHAT).
  • Otsuka et al. (2013) Tsugumi Otsuka, Kazunori Komatani, Satoshi Sato, and Mikio Nakano. 2013. Generating more specific questions for acquiring attributes of unknown concepts from users. In 14th Annual SIGDIAL Meeting on Discourse and Dialogue.
  • Serban et al. (2015) Iulian Vlad Serban, Ryan Lowe, Peter Henderson, Laurent Charlin, and Joelle Pineau. 2015. A survey of available corpora for building data-driven dialogue systems. arXiv preprint arXiv:1512.05742.
  • Shi and Weninger (2018) Baoxu Shi and Tim Weninger. 2018. Open-world knowledge graph completion. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Vinyals and Le (2015) Oriol Vinyals and Quoc Le. 2015. A neural conversational model. arXiv preprint arXiv:1506.05869.
  • Wang et al. (2016) Sida Wang, Percy Liang, and Christopher D Manning. 2016. Learning language games through interaction. In 54th Annual Meeting of the Association for Computational Linguistics, pages 2368–2378. ACL.
  • Wang et al. (2017) Sida I Wang, Samuel Ginn, Percy Liang, and Christopher D Manning. 2017. Naturalizing a programming language via interactive learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 929–938.
  • Xing et al. (2017) Chen Xing, Wei Wu, Yu Wu, Jie Liu, Yalou Huang, Ming Zhou, and Wei-Ying Ma. 2017. Topic aware neural response generation. In Thirty-First AAAI Conference on Artificial Intelligence.
  • Xiong et al. (2018) Wenhan Xiong, Mo Yu, Shiyu Chang, Xiaoxiao Guo, and William Yang Wang. 2018. One-shot relational learning for knowledge graphs. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1980–1990.
  • Yang et al. (2014) Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2014. Embedding entities and relations for learning and inference in knowledge bases. International Conference on Learning Representations.
  • Yang et al. (2018) Liu Yang, Minghui Qiu, Chen Qu, Jiafeng Guo, Yongfeng Zhang, W Bruce Croft, Jun Huang, and Haiqing Chen. 2018. Response ranking with deep matching networks and external knowledge in information-seeking conversation systems. In ACM SIGIR.
  • Young et al. (2018) Tom Young, Erik Cambria, Iti Chaturvedi, Hao Zhou, Subham Biswas, and Minlie Huang. 2018. Augmenting end-to-end dialogue systems with commonsense knowledge. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Zhang et al. (2017) Haichao Zhang, Haonan Yu, and Wei Xu. 2017. Listen, interact and talk: Learning to speak via interaction. arXiv preprint arXiv:1705.09906.
  • Zhang et al. (2018) Yongfeng Zhang, Xu Chen, Qingyao Ai, Liu Yang, and W Bruce Croft. 2018. Towards conversational search and recommendation: System ask, user respond. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 177–186. ACM.
  • Zhou et al. (2018) Hao Zhou, Tom Young, Minlie Huang, Haizhou Zhao, Jingfang Xu, and Xiaoyan Zhu. 2018. Commonsense knowledge aware conversation generation with graph attention. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 4623–4629. AAAI Press.