Unsupervised Learning of KB Queries in Task Oriented Dialogs

Task-oriented dialog (TOD) systems converse with users to accomplish a specific task. This requires the system to query a knowledge base (KB) and use the retrieved results to fulfil user needs. Predicting KB queries is crucial: incorrect queries can lead to severe under-performance. KB queries are usually annotated in real-world datasets and are learnt using supervised approaches to achieve acceptable task completion. This need for query annotations prevents TOD systems from easily adapting to new domains. In this paper, we propose the novel problem of learning end-to-end TOD systems using dialogs that do not contain KB query annotations. Our approach first learns to predict the KB queries using reinforcement learning (RL) and then learns the end-to-end system using the predicted queries. However, predicting the correct query in TOD systems is uniquely plagued by correlated attributes, in which, due to data bias, certain attributes always occur together in the KB. This prevents the RL system from generalising, and accuracy suffers as a result. We propose Correlated Attributes Resilient RL (CARRL), a modification to the RL gradient estimation, which mitigates the problem of correlated attributes and predicts KB queries better than existing weakly supervised approaches. Finally, we compare the performance of our end-to-end system trained using predicted queries to a system trained using annotated gold queries.



1 Introduction

Task-oriented dialog (TOD) systems converse with users to accomplish a specific task by interacting with a knowledge base (KB). For example, a restaurant reservation system Henderson et al. (2014b) interacts with a KB that contains restaurants and their attributes, such as name, rating and address. Traditional TOD systems Williams and Young (2007) follow a pipeline architecture where the system is split into various modules. As each of these modules is trained separately, they require intermediate states to be hand-crafted and each utterance in the dialog to be annotated with these states.

To eliminate these hand-crafted states and reduce the amount of annotation, end-to-end TOD approaches were proposed Bordes and Weston (2017); Madotto et al. (2018); Reddy et al. (2019); Raghu et al. (2019). Figure 1 shows an example dialog used for training such end-to-end approaches. In the example, the user asks for a restaurant with certain attributes. The agent then requests the missing attributes. Once all the attributes are collected, the agent frames a KB query. She then suggests a highly rated restaurant based on the retrieved results, followed by its phone number. In addition to the user and agent utterances, the dialog is annotated with a KB query that fetches the knowledge required to accomplish the task. We refer to this approach of learning TOD systems from KB-query-annotated dialogs as aTOD.

Figure 1: Examples of dialogs used for training aTOD and uTOD systems. The table in the example shows the results retrieved by the KB query at turn 2.

In real-world scenarios, human agents chat with users on messaging platforms. When the need to query the KB arises, the agent usually fires the query on a back-end KB application and uses the retrieved results to formulate a response on the messaging platform. Dialogs retrieved from such platforms contain only the user and agent utterances; the KB queries typically go undocumented. As aTOD approaches require datasets to provide the precise queries along with the dialog logs, additional human annotation is required to learn aTOD systems.

In response, we define a novel problem of learning an end-to-end task-oriented dialog system where the dialogs do not contain KB query annotations. We refer to this problem as uTOD – TOD using unannotated dialogs. As a first step towards solving this task, we design a neural model which predicts the turn at which the KB query has to be made. Subsequently, our system learns to predict the KB query using reinforcement learning and then learns to generate agent responses. Additionally, we identify two datasets that can be used to evaluate a uTOD system's ability to predict KB queries and generate responses.

A natural source of weak supervision for predicting KB queries in dialogs is the set of KB entities present in the subsequent dialog. A correct KB query should fetch all the entities used in the subsequent dialog exchanges. For example, in Figure 1, the query should be generated such that it returns Peking restaurant and its phone number. If entities are not fetched accurately, the response generator is forced to memorize the missing entities and hence fails to generalize. In addition, the weak supervision should ensure that the results retrieved by the KB query are precise: if the query returns many more entities than necessary, the response generator has to deal with the additional responsibility of filtering the incorrect ones from the results. We use this insight to define our reward function for query generation. Unfortunately, natural extensions of existing weakly supervised approaches, like RL, fail to learn a good KB query predictor for our task. This is because KBs used for TOD usually have correlated attributes. For example, in a restaurant domain, most of the Japanese restaurants are expensive. Such correlation between cuisine and price causes queries framed using just the cuisine and queries framed using both cuisine and price to return similar results. This confuses typical RL agents, which are then unable to discern the right query in a given context. In response, we modify the gradient estimates for RL to make it resilient to these correlated attributes. The proposed correlated attributes resilient RL (CARRL) approach learns to predict KB queries better than existing weakly supervised approaches.

To summarize, we make the following contributions:

  1. We define a novel problem of learning TOD with unannotated dialogs (uTOD).

  2. We propose a novel correlated attributes resilient RL (CARRL) approach for predicting KB queries using weak supervision.

  3. We show that our proposed CARRL approach outperforms existing weakly supervised approaches for KB query prediction.

We will release our code and datasets for further use by the research community.

Figure 2: The architecture of existing TOD systems (aTOD).

2 Related Work

In this section, we first position our work in the area of task-oriented dialogs. We then discuss existing weakly supervised approaches for learning KB queries from natural language.

Task Oriented Dialogs: TOD systems can be divided into two categories: traditional and end-to-end trainable systems. Traditional dialog systems Wen et al. (2017); Williams et al. (2017) use hand-crafted states and intermediate supervision on dialogs. End-to-end TOD systems Madotto et al. (2018); Reddy et al. (2019); Wu et al. (2019); Raghu et al. (2019) do not require such hand-crafted states and intermediate supervision, but they do require annotations of the precise KB query on dialogs. While existing approaches require some form of annotation on dialogs, our approach can be trained using unannotated dialogs.

Dhingra et al. (2017) bypassed formulating a KB query by inducing a soft posterior distribution over the full KB to indicate entities of interest. Eric et al. (2017) proposed an approach that stores the entire KB in a key-value structured memory, assuming the KB is small. Placing the entire KB in memory does not scale when the KB is large, and it also makes inferring over the KB results harder for the downstream response generator. Unlike approaches that store the entire KB in memory, we generate KB queries using weak supervision, making our approach scalable.

Weak Supervision: Weakly supervised approaches alleviate the need for gold queries when converting natural language to logical forms. These approaches are trained using reinforcement learning, where the REINFORCE algorithm Williams (1992) is used to estimate the gradients of the expected reward. REINFORCE suffers from a cold start problem when the policy is randomly initialized. To mitigate the problem, Liang et al. (2017) used iterative maximum likelihood to search for good queries and used them to bootstrap the training. Liang et al. (2018) systematically search for high reward queries, i.e., queries with non-zero rewards, and use them to bootstrap REINFORCE. Weakly supervised approaches have also been proposed for sequential question answering Guo et al. (2018); Iyyer et al. (2017). Unlike these applications, to the best of our knowledge, we are the first to use weak supervision for generating KB queries in task-oriented dialogs.

3 Background

In this section, we first discuss existing aTOD systems Madotto et al. (2018); Reddy et al. (2019); Wu et al. (2019); Raghu et al. (2019) that require annotated dialogs. We also briefly discuss MAPO Liang et al. (2018), the weakly supervised approach on which CARRL is built.

TOD with Annotations: A typical architecture used by aTOD systems is shown in Figure 2. We represent a dialog between a user (u) and an agent (s) as {(u_1, s_1), ..., (u_n, s_n)}, where n denotes the number of turns in the dialog. Let k be the turn at which the KB query is made and E be the set of results returned by the KB query. When the user utterance u_t is fed to the system, the external memory contains the dialog history (u_1, s_1, ..., u_{t-1}, s_{t-1}) and the KB results E. The contents of the external memory and the current user utterance together are usually referred to as the dialog context c. The context encoder generates a representation of the dialog context, which is then consumed by the response decoder. The response decoder is usually a sequence decoder which generates the response word-by-word.

For our experiments we use the context encoder, external memory and response decoder proposed in BossNet Raghu et al. (2019). The BossNet encoder is a multi-hop attention based encoder Sukhbaatar et al. (2015). The external memory is a bag-of-sequences memory, with each utterance and KB result encoded using a bi-directional GRU. The BossNet decoder is a copy-augmented decoder with the ability to copy words from the memory. The decoder computes a generate distribution over words in the decode vocabulary and a copy distribution over words in the memory using hierarchical attention. Finally, a soft gate See et al. (2017) combines the two distributions. The network is trained using the standard cross-entropy loss.
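As an illustration, the gate-based combination of the two distributions can be sketched as follows; the toy distributions, word lists and fixed gate value are ours for illustration, not BossNet's learned quantities:

```python
def combine_distributions(p_gen_vocab, p_copy_memory, gate):
    """Blend a generate distribution over the decode vocabulary with a
    copy distribution over memory words using a scalar soft gate in [0, 1].
    gate is the probability mass given to generating from the vocabulary;
    (1 - gate) is the mass given to copying from the external memory."""
    combined = {}
    for word, p in p_gen_vocab.items():
        combined[word] = combined.get(word, 0.0) + gate * p
    for word, p in p_copy_memory.items():
        combined[word] = combined.get(word, 0.0) + (1.0 - gate) * p
    return combined

# Toy example: the decode vocabulary cannot produce the KB entity
# "peking_restaurant", but the copy distribution over memory can.
p_vocab = {"the": 0.6, "restaurant": 0.4}
p_copy = {"peking_restaurant": 0.9, "restaurant": 0.1}
out = combine_distributions(p_vocab, p_copy, gate=0.3)
```

With a low gate value, most probability mass flows to words copied from the memory, which is how out-of-vocabulary KB entities end up in responses.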


Liang et al. (2018) use systematic search to discover high reward queries and use them to bootstrap the training process. Systematic search efficiently explores the search space by sampling previously unexplored queries from the policy. When a query with non-zero reward is sampled, it is added to a buffer containing all the high reward queries. The high reward queries in the buffer are used in every epoch of training to compute the gradient estimates in REINFORCE.
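A minimal sketch of this buffer-based exploration, with `policy_sample` and `reward_fn` as hypothetical stand-ins for the trained policy and the weak-supervision reward:

```python
def systematic_explore(policy_sample, reward_fn, buffer, seen, n_samples):
    """Sketch of high-reward query discovery: sample queries from the
    policy, skip ones explored before, and keep those with non-zero
    reward in a buffer used later to bootstrap REINFORCE."""
    for _ in range(n_samples):
        query = policy_sample()
        if query in seen:          # only previously unexplored queries
            continue
        seen.add(query)
        if reward_fn(query) > 0:   # non-zero reward -> high-reward buffer
            buffer.add(query)
    return buffer
```

In actual training the buffer contents would be mixed with on-policy samples when estimating the gradient; here only the exploration step is shown.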

4 Our Proposed uTOD System

In this section, we first describe the architecture of the proposed uTOD system. We then describe the phase-wise curriculum used for training the system.

4.1 Architecture

Figure 3: The architecture of the proposed system.

The proposed uTOD system can learn to converse using just unannotated dialogs. Figure 3 shows the architecture of the uTOD system. Augmenting any existing aTOD system (whose architecture has an external memory, a context encoder and a response decoder) with a decoder selector and a KB query decoder provides the ability to learn to converse without the need for annotations. The decoder selector is a binary classifier which decides whether a KB query is to be generated at a given turn based on the dialog context c. The KB query decoder (described in Section 5) is a copy-augmented sequence decoder that generates the query one word at a time. When a query is generated, it is fired on the KB and the results are populated in the external memory.

4.2 Phase-wise Curriculum

We use a phase-wise curriculum to train the overall uTOD system. Before we start the training, each dialog must be annotated with the turn k at which the KB query is expected to be made. As there are no gold labels available to train such a classifier, we heuristically label as k the first turn at which the agent response contains a KB entity that was never seen in the dialog so far. For example, in Figure 1, k = 2 is the first turn at which the agent response contains a KB entity (Peking restaurant) that was never used in the dialog so far. Since there are no gold KB queries either, the query decoder is trained using weak supervision provided by the entities used in the subsequent dialog.
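The heuristic labeling above can be sketched as follows, assuming a simplified dialog format of (user utterance, agent utterance) pairs; the function and variable names are ours:

```python
def heuristic_query_turn(dialog, kb_entities):
    """Label the first turn whose agent response mentions a KB entity
    not seen earlier in the dialog; this serves as the turn k at which
    the KB query is assumed to be made."""
    seen = set()
    for turn, (user, agent) in enumerate(dialog, start=1):
        seen.update(user.lower().split())
        agent_words = agent.lower().split()
        for word in agent_words:
            if word in kb_entities and word not in seen:
                return turn            # first unseen KB entity -> query turn
        seen.update(agent_words)
    return None                        # no KB entity is ever introduced

# Toy dialog mirroring Figure 1: the entity appears in the second
# agent response, so the heuristic labels k = 2.
dialog = [("i want chinese food", "which price range ?"),
          ("moderate please", "peking_restaurant is a nice place")]
k = heuristic_query_turn(dialog, {"peking_restaurant"})
```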

The training curriculum is divided into three phases. In phase-1, the KB query decoder is trained using weak supervision. From each dialog, the pair of the dialog context at turn k and the entities in the subsequent dialog is used to train the KB query decoder. The query decoder is trained using RL, with the reward computed as a function of the entities retrieved by the generated query and the entities in the subsequent dialog.

In phase-2, the response prediction pipeline is trained by constructing (dialog context, response) pairs from dialogs. For pairs whose response occurs after the KB query (i.e., at turns t > k), the external memory is pre-loaded with the KB results obtained by querying the KB with the query predicted in phase-1.

Finally, in phase-3 we train the decoder selector. We use the dialog context at various turns as input. The context at turn k is assigned a label of one, and the contexts at all other turns a label of zero. Phase-2 pre-trains the context encoder used by the decoder selector.

5 KB Query Decoder

Our main technical contribution is a KB query decoder that uses a weakly supervised RL based approach. After describing the formulation, we discuss correlated attributes in TOD KBs and how they affect the learning process. We finally propose a modification to the RL based approach that is resilient to correlated attributes. We refer to this solution as Correlated Attributes Resilient RL (CARRL).

5.1 Weakly Supervised KB Query Decoder

The KB query decoder generates the KB query using the dialog context c = (u_1, s_1, ..., u_k) at turn k, such that the query retrieves the KB entities required to generate the subsequent agent responses. A natural approach to train the query decoder with weak supervision is reinforcement learning Liang et al. (2017, 2018). Here the query decoder is a policy network π_θ that takes the dialog context c as input and generates a KB query a such that the query, when executed, returns a set of KB entities e. The KB query is represented as a sequence of tokens, and hence is generated auto-regressively as follows:

π_θ(a | c) = ∏_{j=1}^{|a|} π_θ(a_j | a_{<j}, c)    (1)

If the query a fails to retrieve the KB entities required to generate the subsequent agent responses, the response decoder will be forced to memorize the KB entities rather than inferring them from the retrieved results. This memorization results in poor generalization, as the overall system will be unable to handle unseen entities. To prevent memorization, we propose a full recall reward that is better suited for KB query prediction, given by

R(a, e) = |e| / |E_a|  if e ⊆ E_a,  and 0 otherwise,

where E_a is the set of entities retrieved by the query a and e is the set of entities used in the subsequent dialog. The full recall reward ensures that a query is encouraged only if it retrieves all the entities needed by the subsequent conversation, while its precision term favors queries that retrieve little else, thus helping the system learn to generalize better.
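Under one consistent reading of this reward (full recall required, precision rewarded; the paper's exact form may differ), a minimal implementation looks like:

```python
def full_recall_reward(retrieved, required):
    """Reward a query only if it retrieves every entity used in the
    subsequent dialog (full recall); among such queries, prefer precise
    ones by scoring with the precision of the result set."""
    retrieved, required = set(retrieved), set(required)
    if not retrieved or not required.issubset(retrieved):
        return 0.0                            # missing entities -> zero reward
    return len(required) / len(retrieved)     # precision of the result set
```

A query missing even one needed entity gets zero, while an over-broad query with full recall is penalized in proportion to the extra entities it drags in.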

As there are no gold KB queries available, the parameters θ of the policy π_θ are optimized by maximizing the expected reward J(θ) = E_{a ~ π_θ(a|c)} [R(a, e)], where the reward function R(a, e) measures how well the KB query retrieved the entities e necessary to generate the subsequent agent responses. Using N queries a^1, ..., a^N sampled i.i.d. from the current policy π_θ, the gradient estimate can be expressed as

∇_θ J(θ) ≈ (1/N) Σ_{i=1}^{N} R(a^i, e) ∇_θ log π_θ(a^i | c)    (2)
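The estimator above can be illustrated for a toy softmax policy over a handful of candidate queries, using the closed-form gradient of the log-softmax; all names here are illustrative:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reinforce_gradient(logits, rewards, samples):
    """REINFORCE estimate of the expected-reward gradient for a toy
    categorical policy over candidate queries. For a softmax policy,
    grad log pi(i) = onehot(i) - pi."""
    pi = softmax(logits)
    grad = [0.0] * len(logits)
    for i in samples:                      # indices of sampled queries
        for j in range(len(logits)):
            glog = (1.0 if j == i else 0.0) - pi[j]
            grad[j] += rewards[i] * glog
    return [g / len(samples) for g in grad]
```

With a uniform policy and only the first query rewarded, the estimate pushes the first logit up and the others down, as expected.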

                 (a)                                   (b)
User Intent      user needs a restaurant that serves   user needs a Japanese restaurant in
                 Chinese with moderate price range     moderate or expensive price range
KB Queries       1. cuisine=chinese, price=moderate    1. cuisine=japanese, price=expensive
                 2. cuisine=chinese                    2. cuisine=japanese, price=(expensive OR moderate)
                 3. price=moderate                     3. cuisine=japanese
                                                       4. price=moderate
Table 1: Summary of dialogs and a few examples of high reward KB queries (in simplified syntax) of the corresponding dialogs discovered using systematic search.

5.2 Correlated Attributes in KB

To understand the correlation, we divide KB queries into two types: (1) partial intent queries and (2) complete intent queries. Partial intent queries capture only a part of the user's intent, whereas complete intent queries capture all the attributes in the user's intent. Column (a) in Table 1 shows three queries. The first is a complete intent query, which captures all the attributes in the user intent. The last two are partial intent queries, as each captures just one of the two attributes in the user intent. As seen in the table, for a given dialog there can be multiple partial intent queries that fetch non-zero rewards, but there will be just one complete intent query (with a non-zero reward). Moreover, a specific partial intent query can achieve non-zero rewards for multiple dialogs. For example, the query "price=moderate" fetches non-zero rewards for both examples in Table 1. As REINFORCE optimizes the policy to maximize the expected reward, the policy is likely to learn partial intent queries.

Partial intent queries, despite being incomplete, often receive rewards that are close or equal to those of complete intent queries. For example, suppose the KB has 8 Chinese restaurants, of which 7 are in the moderate price range and 1 is expensive. The results of the partial intent query "cuisine=chinese" would contain just one additional restaurant compared to the complete intent query "cuisine=chinese, price=moderate", so the reward function would return almost identical scores. This becomes extreme when the presence or absence of an attribute makes no difference at all. For example, suppose all the Japanese restaurants in the KB are expensive. Then both "cuisine=japanese, price=expensive" and "cuisine=japanese" return the same set of results and so receive the exact same reward. When such correlation in the KB is high, the problem of accurately learning complete intent queries with RL is further compounded.
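The Japanese-restaurant example can be reproduced on a toy KB to see why REINFORCE receives no signal to prefer the complete intent query; the KB rows and query helper are ours:

```python
# Toy KB: every Japanese restaurant is expensive, so dropping the price
# attribute from the query changes nothing about the result set.
KB = [
    {"name": "r1", "cuisine": "japanese", "price": "expensive"},
    {"name": "r2", "cuisine": "japanese", "price": "expensive"},
    {"name": "r3", "cuisine": "chinese",  "price": "moderate"},
]

def run_query(kb, **conditions):
    """Return the names of KB rows matching all attribute=value conditions."""
    return {row["name"] for row in kb
            if all(row[k] == v for k, v in conditions.items())}

complete = run_query(KB, cuisine="japanese", price="expensive")
partial = run_query(KB, cuisine="japanese")
# Identical result sets -> identical rewards, so REINFORCE cannot tell
# the complete intent query apart from the partial one.
```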

5.3 Correlated Attributes Resilient RL

To counter the attribute correlation in the KB and the problem of partial intent queries, we modify the REINFORCE algorithm as follows. Let {a^1, ..., a^L} be a set of queries sampled from the policy π_θ. For brevity, we write the reward R(a^i, e) as R_i. The modified reward function for these queries is

R̃_i = exp(R_i / τ) / Σ_{j=1}^{L} exp(R_j / τ)

In the limiting case when τ → 0, we obtain the following modified reward function:

R̃_i = 1 if R_i = max_j R_j, and 0 otherwise    (3)
Due to correlated attributes in the KB, partial intent queries often receive rewards that are quite close to that of the complete intent query. The modified reward function ensures that the complete intent query is pushed to the top, even if the gap is negligible. The modified reward in (3) can then be used to compute the gradient estimates in (2). We refer to this approach for estimating gradients as CARRL.
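A sketch of one consistent reading of the modified reward (a temperature-controlled softmax over the sampled queries' rewards, hardening to an argmax as the temperature goes to zero; the paper's exact parameterization may differ):

```python
import math

def carrl_rewards(rewards, tau):
    """Sharpen the rewards of L sampled queries with a softmax at
    temperature tau, amplifying even tiny margins between the complete
    intent query and near-tied partial intent queries."""
    m = max(rewards)
    exps = [math.exp((r - m) / tau) for r in rewards]   # shift for stability
    total = sum(exps)
    return [e / total for e in exps]

def carrl_rewards_limit(rewards):
    """Limiting case tau -> 0: only the highest-reward query keeps reward."""
    best = max(rewards)
    return [1.0 if r == best else 0.0 for r in rewards]
```

Even when a partial intent query scores 0.95 against the complete intent query's 1.0, a small temperature concentrates essentially all of the modified reward on the complete intent query.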

The proposed CARRL is effective against the partial intent query problem only if the complete intent query is present among the top-L queries sampled from π_θ. To ensure this, we first follow the training procedure of MAPO Liang et al. (2018). Once MAPO converges, we use the proposed CARRL objective to estimate the gradients.

After MAPO converged, we noticed that in many cases, even though the complete intent queries had higher rewards than the partial intent queries, they were not learnt by the policy. Applying our proposed objective pushes the complete intent queries up, and hence once CARRL converges, many more complete intent queries are ranked at the top.

6 Experimental Setup

We perform experiments on two task-oriented dialog datasets: CamRest Wen et al. (2016) and DSTC2 Henderson et al. (2014a). Both datasets contain KB query annotations. We remove these annotations from the dialogs and use the resulting unannotated dialogs to evaluate the proposed problem. However, the original query annotations are used to evaluate the performance of the various approaches for training the KB query decoder. Table 2 summarizes the statistics of the datasets.

CamRest676 is a human-human dialog dataset in the restaurant reservation domain, collected using the Wizard-of-Oz framework. As CamRest was designed for dialog state tracking, we use the version converted to the end-to-end TOD format by Raghu et al. (2019) (https://github.com/dair-iitd/BossNet). All dialogs in the dataset have exactly one KB query annotated.

DSTC2 is the dataset used for the Dialog State Tracking Challenge. It is a human-bot dialog dataset, also in the restaurant reservation domain, and was likewise designed for dialog state tracking. Bordes and Weston (2017) converted the dataset into a format suitable for evaluating end-to-end TOD agents. We filtered out dialogs containing more than one KB query.

                    CamRest   DSTC2
Train Dialogs         406      1279
Val Dialogs           135       324
Test Dialogs          135      1051
Avg. no. of turns    4.06      7.94
Table 2: Statistics of CamRest and DSTC2 datasets.

6.1 Algorithms

As we are the first to propose the problem of learning TOD systems with unannotated dialogs, there are no existing baselines. So, we compare the performance of the proposed Correlated Attributes Resilient RL (CARRL) approach for training the KB query decoder against existing weakly supervised algorithms:

REINFORCE Williams (1992): uses on-policy samples to estimate the gradient.

MAPO Liang et al. (2018): uses on-policy samples and a buffer of non-zero reward queries explored using systematic search to compute the gradient estimates.

Supervised: uses the gold KB queries as direct supervision. It is used to measure the performance of aTOD systems, and provides an upper bound.

Our primary goal is to build a uTOD system and compare its performance to an aTOD system. To refer to a uTOD system trained using a specific weakly supervised algorithm, we use the algorithm name in superscript. For example, a uTOD system trained using MAPO is referred to as uTOD^MAPO.

            Accuracy                Total Test Rewards         PIQ Ratio
            DSTC2      CamRest      DSTC2         CamRest      DSTC2      CamRest
REINFORCE   0.00       0.00         0.00          0.00         0.00       0.00
MAPO        0.42±0.10  0.35±0.05    114.82±13.9   17.48±1.77   0.46±0.07  0.51±0.06
CARRL       0.63±0.03  0.51±0.03    155.66±6.54   21.58±1.07   0.21±0.03  0.26±0.04
Supervised  0.71±0.04  0.50±0.05    162.07±6.23   19.60±1.96   0.20±0.03  0.21±0.07
Table 3: Performance of CARRL and other weakly supervised approaches on CamRest and DSTC2 on 10 runs. PIQ (Partial Intent Query) ratio is the fraction of generated queries that captured only a part of the user request.
                 CamRest                              DSTC2
                 Ent. F1     Ent. F1    BLEU          Ent. F1     Ent. F1    BLEU
                 Non-Infor.  All                      Non-Infor.  All
uTOD^REINFORCE   0.08±0.03   0.35±0.03  14.09±1.63    0.40±0.02   0.44±0.04  47.40±3.23
uTOD^MAPO        0.24±0.03   0.41±0.03  15.65±0.83    0.49±0.01   0.40±0.01  47.63±1.43
uTOD^CARRL       0.29±0.02   0.45±0.02  15.72±1.26    0.55±0.02   0.45±0.03  49.30±1.25
Table 4: Performance of various uTOD systems on 10 runs. An oracle decoder selector was used by all systems.

6.2 Evaluation Metrics

In this section, we discuss the metrics used for evaluating the decoder selector, KB query decoder and the overall uTOD system.

Decoder Selector: The decoder selector is a binary classifier that indicates whether a KB query should be made at a given turn based on the context. Since we assume each dialog contains just one KB query, during test time the classifier should predict false until turn k and true at turn k. A sequence of correct predictions is necessary for the decoder selector to be effective on a single dialog. To capture this behaviour, we compute dialog accuracy, in which the classifier is correct only if it predicts false for all turns before k and true at turn k. For each dialog we also compute the turn difference: the absolute difference between the first turn at which the classifier predicts true and the gold turn at which the KB query is made in the annotated dialog. The smaller the average turn difference, the better the classifier.
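The two metrics can be sketched for a single dialog as follows (the prediction format is ours):

```python
def dialog_accuracy_and_turn_diff(predictions, gold_turn):
    """Evaluate one dialog's decoder-selector predictions.
    predictions[t] is True if the classifier fires a KB query at turn
    t+1. The dialog is counted correct only if the classifier predicts
    False for all turns before the gold turn and True exactly at it;
    turn difference is |first predicted query turn - gold turn|."""
    first_true = next((t + 1 for t, p in enumerate(predictions) if p), None)
    correct = (first_true == gold_turn and
               not any(predictions[:gold_turn - 1]))
    turn_diff = abs(first_true - gold_turn) if first_true is not None else None
    return correct, turn_diff
```

Dialog accuracy averages the boolean over dialogs; the average turn difference averages the second value.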

KB query decoder: The algorithms used for training the KB query decoder are evaluated based on the accuracy with which they generate the gold KB queries. Since both original datasets have KB queries annotated, we use these annotations to compute the accuracy.

uTOD system: uTOD and aTOD systems are evaluated based on their ability to generate valid responses. We use BLEU Papineni et al. (2002) and entity F1 to measure the similarity between predicted and gold responses. BLEU measures the overlap of n-grams between the predicted and the gold responses and has become a popular measure for comparing task-oriented dialog systems. BLEU assigns equal weight to entity words and non-entity words in the response. Since predicting the required entities is the goal of TOD, we also use entity F1 to better compare entity prediction performance. Entity F1 is computed using the macro precision and macro recall. (Raghu et al. (2019) and Madotto et al. (2018) report the micro average of recall as entity F1.)
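One consistent reading of this macro-averaged entity F1 (the exact handling of responses with no gold entities is our assumption):

```python
def entity_f1(predicted, gold):
    """Macro-averaged entity F1 over (predicted, gold) entity-set pairs:
    average precision and recall per response, then combine into F1."""
    precisions, recalls = [], []
    for pred, ref in zip(predicted, gold):
        pred, ref = set(pred), set(ref)
        if not ref:
            continue                   # skip responses with no gold entities
        hit = len(pred & ref)
        precisions.append(hit / len(pred) if pred else 0.0)
        recalls.append(hit / len(ref))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)   # macro precision
    r = sum(recalls) / len(recalls)         # macro recall
    return 2 * p * r / (p + r) if (p + r) else 0.0
```

This differs from the micro-averaged recall reported as entity F1 in some prior work, which pools entity counts across all responses before averaging.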

6.3 Human Evaluation

Liu et al. (2016) have shown that metric-based evaluations of response prediction in dialogs are not strongly correlated with human judgements. So, we collect human judgements to compare the quality of the responses generated by aTOD and uTOD systems. Given a dialog context and the corresponding KB results, we collect the relevance of a response with respect to the dialog context on a scale of 0-2. As our focus is to evaluate the ability of an agent to query the KB and effectively use the retrieved results to generate responses, we only collect judgements for responses that occur after the KB query. We sampled 100 random dialog contexts from the CamRest dataset and collected judgements for 3 systems, namely uTOD^MAPO, uTOD^CARRL and aTOD.

6.4 Training

We implemented our system using TensorFlow Abadi et al. (2016). The word embeddings and weights were initialized using a standard normal distribution (mean 0, variance 1). We trained the network using the Adam optimizer Kingma and Ba (2014) and applied gradient clipping with a clip value of 40. Since our network is built on top of BossNet, we used the best performing hyper-parameters reported by Raghu et al. (2019) for each dataset. The total accumulated validation reward is used as the early stopping criterion for the KB query decoder, dialog accuracy for the decoder selector, and BLEU for the response decoder.

7 Experiments

Our experiments evaluate three research questions.

  1. How well does CARRL predict queries compared to other weakly supervised approaches?

  2. How accurate is the decoder selector?

  3. How does the overall performance of various uTOD systems compare to an aTOD system?

7.1 CARRL Performance

Table 3 reports the KB query prediction accuracy, total test reward and fraction of partial intent queries generated by CARRL and other approaches on the CamRest and DSTC2 datasets. For a fair comparison, we predict the query at the same turn at which the gold query was annotated. CARRL outperforms MAPO on both datasets. As mentioned in Section 5.2, due to the correlated attributes in the KB, MAPO generates a large number of partial intent queries (PIQs), almost 50%, whereas CARRL counters this using the approach proposed in Section 5.3. On both datasets, CARRL shows an almost 50% drop in PIQs compared to MAPO.

The failure of REINFORCE highlights that using just the on-policy samples to compute gradient estimates is inadequate for our problem. Both CARRL and MAPO use non-zero reward queries discovered by systematic search, along with on-policy samples, to compute gradient estimates. This makes RL explore useful parts of the policy space, leading to a much better model.

Table 4 shows the performance of the overall system with the various query decoders and an oracle decoder selector (which uses gold labels during test). The aim of this experiment is to analyse the effect of the proposed query predictors on the overall system. The REINFORCE query predictor was unable to predict any valid (partial or complete) queries during test, but still achieved an entity F1 comparable to the other approaches. On further inspection, we found that responses contain two types of entities: informable entities (IEs) and non-informable entities (NIEs) Henderson et al. (2014a). IEs are entities that users can specify to describe their needs, and NIEs are entities that the agent retrieves from the KB to satisfy those needs. For example, in the restaurant suggestion task, cuisine and location are IEs, while restaurant name and phone number are NIEs. Retrieving the necessary NIEs is crucial to the success of the overall system. On CamRest, REINFORCE exhibits quite a low NIE F1.

On DSTC2, REINFORCE achieves a reasonable entity F1 in both cases. On further analysis, we found that when the KB results in the external memory are empty during training, the response decoder is forced to memorize (i.e., generate instead of copy) the entities. As DSTC2 has a large overlap of non-informable entities across train, dev and test, REINFORCE achieves a reasonable non-informable entity F1 by memorizing them. Systems that memorize entities fail to generalize when the KB is changed, as in the disentanglement evaluation of Raghu et al. (2019). To summarize, the uTOD system trained using CARRL achieves better performance than uTOD systems trained using other weakly supervised approaches.

7.2 Decoder Selector Performance

          Dialog Accuracy     Avg. Turn Difference
          w/o PT    w/ PT     w/o PT    w/ PT
CamRest   0.47      0.54      0.72      0.56
DSTC2     0.21      0.28      2.26      2.20
Table 5: Performance of the decoder selector with and without pre-training (PT).

The decoder selector predicts the turn at which the KB query has to be generated. Table 5 shows the dialog accuracy and the average turn difference of the decoder selector on CamRest and DSTC2. Pre-training the context encoder with the task of response generation in phase-2 improves the dialog accuracy by 7 points on both datasets.

The errors on CamRest are due to inherent ambiguity in the dataset's annotations. For example, when the user requests a particular cuisine, say cuisine x, and x is present in the KB, the KB query is made at that turn. But if x is not present in the KB, the agent requests an alternate option without explicitly making a query; once the user suggests an alternative, the KB query is then made. The errors on DSTC2 are also due to ambiguity in the annotations. As DSTC2 consists of speech transcripts of human-bot dialogs, the bot often generates a response rather than a KB query because of misinterpretation or speech recognition errors.

7.3 End-to-End uTOD Performance

             CamRest                              DSTC2
             Ent. F1     Ent. F1    BLEU          Ent. F1     Ent. F1    BLEU
             Non-Infor.  All                      Non-Infor.  All
decoder selector trained using gold labels
uTOD^MAPO    0.23±0.04   0.41±0.03  14.90±0.65    0.47±0.02   0.43±0.04  45.77±2.98
uTOD^CARRL   0.27±0.03   0.44±0.02  14.96±1.31    0.52±0.01   0.44±0.04  46.46±2.59
decoder selector trained using heuristic labels
uTOD^MAPO    0.18±0.01   0.37±0.02  13.44±0.73    0.47±0.04   0.40±0.03  42.64±1.59
uTOD^CARRL   0.21±0.02   0.39±0.03  13.93±0.80    0.50±0.06   0.40±0.04  45.02±2.24
aTOD         0.25±0.03   0.42±0.04  15.55±1.05    0.51±0.02   0.45±0.02  47.59±1.20
Table 6: Performance of various end-to-end uTOD and aTOD systems on 10 runs.

Table 6 reports the performance of the end-to-end uTOD and aTOD systems. The uTOD systems are trained in two settings: (1) with the decoder selector trained on gold labels, and (2) with the decoder selector trained on heuristic labels as described in Section 4.2. The latter setting relies only on the heuristic for labels and is therefore trained on fully unannotated dialogs. The uTOD system trained using unannotated data performs better than uTOD, and is not far behind the aTOD system trained using annotated dialogs.

The numbers reported in Table 6 for the aTOD system are not comparable with the published results of Raghu et al. (2019) due to a fundamental difference in the experimental setup. In published work, aTOD systems are evaluated with the results of the gold KB query placed in the dialog context at all turns, i.e., even if the aTOD system fails to generate the correct KB query at a turn, the results of the correct query are used to predict the subsequent agent responses. As our problem is to learn a TOD system from unannotated data, we use the results of the generated KB query, even if it is incorrect, to predict the subsequent agent responses.

Human Evaluation: We collected human judgements on the relevance of responses generated by three end-to-end systems. On a scale of 0-2, uTOD received a score of 0.39, uTOD a score of 0.53, and aTOD a score of 0.57. This experiment, run on 100 random samples from CamRest, further supports the claim that the performance of the uTOD system is not far from that of the aTOD system.

KB Results (Restaurant | Food | Area | Pricerange)
frankie_and_bennys | italian | south | expensive
User-1 i am looking for an expensive restaurant in the south part of town .
Agent-1 there is five restaurants to your request .
User-2 i would like italian food .
Gold-Response frankie_and_bennys serves italian food in the south part of town . is there anything else i can help you with ?
Gold-Query food=italian, area=south, pricerange=expensive
Supervised-Response frankie_and_bennys is located in the south part of town and is in the expensive would you like their phone number
Supervised-Query food=italian, area=south, pricerange=dontcare
MAPO-Response la_margherita is italian in the south part of town
MAPO-Query food=italian, area=dontcare, pricerange=dontcare
CARRL-Response frankie_and_bennys is an expensive in the south part of town . would you like their phone number
CARRL-Query food=italian, area=south, pricerange=expensive
Table 7: Responses and queries generated by uTOD (CARRL, MAPO) and aTOD systems on a dialog from CamRest. For simplicity, only the fields used in the dialog are mentioned in the KB results. Entities are italicized.

7.4 Qualitative Evaluation

We qualitatively compare the performance of uTOD, uTOD and aTOD using the example in Table 7. The example demonstrates the ability of CARRL to generate a query that captures the complete user intent. The overall system also generates an acceptable response for the given dialog context. Even though the uTOD system does not generate all the entities in the gold response, it captures the correct non-informable entity. The example also shows that there are cases where CARRL generates better KB queries than the supervised learner. Despite capturing only two of the three attributes from the context, aTOD was able to retrieve the correct entity. This shows that the response decoder can fix some mistakes made by the query decoder.

The example also demonstrates the issue of MAPO generating queries that capture only part of the user intent. As only one of the three attributes was captured, the query would have retrieved a large number of results from the KB to populate the external memory. Unlike in the aTOD case, the query decoder's mistake was too costly for the uTOD system's response decoder to fix.
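Why a partial query floods the external memory can be seen with a small sketch. This is a hypothetical illustration, not the paper's query executor: we assume the KB is a list of slot-value rows and that the value "dontcare" acts as a wildcard, as in the queries of Table 7.

```python
# Hypothetical sketch of executing a slot-value KB query where the
# value "dontcare" acts as a wildcard. A partial query like MAPO's
# (two dontcare slots) matches many more rows than CARRL's full query.
def execute_query(kb, query):
    def matches(row):
        return all(v == "dontcare" or row.get(k) == v
                   for k, v in query.items())
    return [row for row in kb if matches(row)]
```

On a toy KB with several Italian restaurants, CARRL's query (food=italian, area=south, pricerange=expensive) pins down a single row, while MAPO's query (food=italian, area=dontcare, pricerange=dontcare) returns them all, flooding the external memory that the response decoder must then copy from.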

8 Conclusion

We propose the novel problem of learning task-oriented dialog systems from unannotated dialogs (uTOD), enabling them to adapt easily to new domains. We design an architecture that can augment existing TOD models which require annotated dialogs (aTOD). We propose a novel correlated attributes resilient RL (CARRL) approach for predicting KB queries using weak supervision, countering the effect of correlated attributes in TOD KBs. We show that CARRL outperforms existing weakly supervised approaches on KB query prediction. We will release our code and datasets for use by the research community.

9 Acknowledgments

This work is supported by IBM AI Horizons Network grant, an IBM SUR award, grants by Google, Bloomberg and 1MG, and a Visvesvaraya faculty award by Govt. of India. We thank Microsoft Azure sponsorships, and the IIT Delhi HPC facility for computational resources. We thank Gaurav Pandey, Danish Contractor and Sachindra Joshi for their comments on an earlier version of this work.


  • M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. (2016) TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283. Cited by: §6.4.
  • A. Bordes and J. Weston (2017) Learning end-to-end goal-oriented dialog. In International Conference on Learning Representations, Cited by: §1.
  • B. Dhingra, L. Li, X. Li, J. Gao, Y. Chen, F. Ahmed, and L. Deng (2017) Towards end-to-end reinforcement learning of dialogue agents for information access. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 484–495. Cited by: §2.
  • M. Eric, L. Krishnan, F. Charette, and C. D. Manning (2017) Key-value retrieval networks for task-oriented dialogue. In Dialog System Technology Challenges, Saarbrücken, Germany, August 15-17, 2017, pp. 37–49. Cited by: §2.
  • D. Guo, D. Tang, N. Duan, M. Zhou, and J. Yin (2018) Dialog-to-action: conversational question answering over a large-scale knowledge base. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 2942–2951. External Links: Link Cited by: §2.
  • M. Henderson, B. Thomson, and J. D. Williams (2014a) The second dialog state tracking challenge. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pp. 263–272. Cited by: §6, §7.1.
  • M. Henderson, B. Thomson, and S. Young (2014b) Word-based dialog state tracking with recurrent neural networks. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pp. 292–299. Cited by: §1.
  • M. Iyyer, W. Yih, and M. Chang (2017) Search-based neural structured learning for sequential question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1821–1831. Cited by: §2.
  • D. P. Kingma and J. L. Ba (2014) Adam: a method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, Cited by: §6.4.
  • C. Liang, J. Berant, Q. Le, K. D. Forbus, and N. Lao (2017) Neural symbolic machines: learning semantic parsers on freebase with weak supervision. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 23–33. Cited by: §3, §5.1.
  • C. Liang, M. Norouzi, J. Berant, Q. V. Le, and N. Lao (2018) Memory augmented policy optimization for program synthesis and semantic parsing. In Advances in Neural Information Processing Systems, pp. 9994–10006. Cited by: §5.1, §6.1.
  • C. Liu, R. Lowe, I. Serban, M. Noseworthy, L. Charlin, and J. Pineau (2016) How not to evaluate your dialogue system: an empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2122–2132. Cited by: §6.3.
  • A. Madotto, CS. Wu, and P. Fung (2018) Mem2Seq: effectively incorporating knowledge bases into end-to-end task-oriented dialog systems. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Cited by: §1, §2, §3, footnote 2.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Cited by: §6.2.
  • D. Raghu, N. Gupta, et al. (2019) Disentangling language and knowledge in task-oriented dialogs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 1239–1255. Cited by: §1, §2, §3, §3, §7.1, §7.3, footnote 2.
  • R. G. Reddy, D. Contractor, D. Raghu, and S. Joshi (2019) Multi-level memory for task oriented dialogs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 3744–3754. Cited by: §1, §2, §3.
  • A. See, P. J. Liu, and C. D. Manning (2017) Get to the point: summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1073–1083. Cited by: §3.
  • S. Sukhbaatar, J. Weston, R. Fergus, et al. (2015) End-to-end memory networks. In Advances in neural information processing systems, pp. 2440–2448. Cited by: §3.
  • T. Wen, D. Vandyke, N. Mrkšíc, M. Gašíc, L. Rojas-Barahona, P. Su, S. Ultes, and S. Young (2017) A network-based end-to-end trainable task-oriented dialogue system. In 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017-Proceedings of Conference, Vol. 1, pp. 438–449. Cited by: §2.
  • T. Wen, M. Gasic, N. Mrkšić, L. M. Rojas Barahona, P. Su, S. Ultes, D. Vandyke, and S. Young (2016) Conditional generation and snapshot learning in neural dialogue systems. In EMNLP, Austin, Texas, pp. 2153–2162. External Links: Link Cited by: §6.
  • J. D. Williams, K. Asadi, and G. Zweig (2017) Hybrid code networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 665–677. Cited by: §2.
  • J. D. Williams and S. Young (2007) Partially observable Markov decision processes for spoken dialog systems. Computer Speech & Language 21 (2), pp. 393–422. Cited by: §1.
  • R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (3-4), pp. 229–256. Cited by: §2, §6.1.
  • C. Wu, R. Socher, and C. Xiong (2019) Global-to-local memory pointer networks for task-oriented dialogue. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §2, §3.