Meta Dialogue Policy Learning

by Yumo Xu, et al.

Dialog policy determines the next-step actions for agents and hence is central to a dialogue system. However, when migrated to novel domains with little data, a policy model can fail to adapt due to insufficient interactions with the new environment. We propose Deep Transferable Q-Network (DTQN) to utilize shareable low-level signals between domains, such as dialogue acts and slots. We decompose the state and action representation space into feature subspaces corresponding to these low-level components to facilitate cross-domain knowledge transfer. Furthermore, we embed DTQN in a meta-learning framework and introduce Meta-DTQN with a dual-replay mechanism to enable effective off-policy training and adaptation. In experiments, our model outperforms baseline models in terms of both success rate and dialogue efficiency on the multi-domain dialogue dataset MultiWOZ 2.0.




1 Introduction

Task-oriented dialogue systems aim to assist users in efficiently accomplishing daily tasks such as booking a hotel or reserving dinner at a restaurant. Complex systems like Alexa and Siri often contain thousands of task domains. However, a successful model on one task often requires hundreds or thousands of carefully labelled domain-specific dialogues, which consumes a large amount of human effort. Therefore, how to quickly adapt an existing dialogue system to new domains with only a handful of training samples is an essential problem in task-oriented dialogue.

In this paper, we investigate dialogue policy, also known as dialogue management, which lies at the center of a task-oriented dialogue system. Dialogue policy determines the next-step action of the agent given the dialogue state and the user's goals. As a dialogue is composed of multiple turns, the feedback on a dialogue policy's decision is often delayed until the end of the conversation. Therefore, Reinforcement Learning (RL) is usually leveraged to improve the efficiency and success rate of dialogue policy learning (deepdynaq, ).

There have been a number of methods applying dialogue policy in multi-domain settings (peng2017composite, ; lipton2018bbq, ; lee2019convlab, ). These models usually employ an all-in-one multi-hot representation of dialogue states: the state embedding vector is a concatenation of multiple segments, each a multi-hot vector for the states in one domain. However, when unseen domains appear at inference time, the parameters corresponding to their dialogue acts and slots have never been optimized. This significantly limits the adaptation performance of policy models.

To alleviate this problem, we note that there is often shareable low-level information between different domains. For instance, suppose the source domain is taxi-booking and the target domain is hotel-booking. Although the two domains have different ontologies, both domains share certain dialogue slots (e.g. start time and location) and dialogue acts (e.g. request and inform). These shared concepts bear a lot of similarities both in textual representation and corresponding agent policies. Thus, it is feasible to transfer domain knowledge via these commonalities in ontologies.

To this end, we propose a Deep Transferable Q-Network (DTQN), based on Deep Q-Network (DQN) (atari, ) in reinforcement learning, which learns to predict accurate Q-function values given dialogue states and system actions. In DTQN, we factorize the dialogue state space into a set of lower-level feature spaces. Specifically, we hierarchically model cross-domain relations at domain-level, act-level and slot-level. State representations are then composed of several shareable sub-embeddings. For instance, slots like start time in different domains will now share the same slot-level embedding. Furthermore, instead of treating actions as independent regression classes as in DQN, we decompose the dialogue action space and our model learns to represent actions based on common knowledge between domains.
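To make the factorization concrete, the sketch below composes features from embedding tables shared across domains. The ontology, embedding size, and function names are all illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # embedding size (illustrative)

# Shared embedding tables, reused by every domain (made-up entries).
slot_emb = {s: rng.normal(size=DIM) for s in ["start_time", "location", "price"]}
act_emb = {a: rng.normal(size=DIM) for a in ["request", "inform"]}
dom_emb = {d: rng.normal(size=DIM) for d in ["taxi", "hotel"]}

def encode(domain, act, slots):
    """Compose a feature from shared domain-, act- and slot-level embeddings."""
    avg_slots = np.mean([slot_emb[s] for s in slots], axis=0)
    return np.concatenate([dom_emb[domain], act_emb[act], avg_slots])

# "request(start_time, location)" in taxi vs. hotel: the act and slot
# components are shared, so only the leading domain segment differs.
taxi = encode("taxi", "request", ["start_time", "location"])
hotel = encode("hotel", "request", ["start_time", "location"])
assert np.allclose(taxi[DIM:], hotel[DIM:])
```

Because the act- and slot-level parameters are updated whenever either domain is trained, knowledge learned in one domain transfers to the other through these shared sub-embeddings.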

To adapt DTQN to few-shot learning scenarios, we leverage the meta-learning framework. Meta-learning aims to guide the model to rapidly learn from new environments with only a few labelled samples (maml, ; rakelly2019efficient, ). Previously, meta-learning has been successfully employed in the Natural Language Generation (NLG) module of dialogue systems (daml, ). However, NLG is supervised learning by nature. Comparatively, there has been little work on applying meta-learning to dialogue policy, as applying RL under meta-learning, a.k.a. meta-RL, is known to be a much harder problem than meta supervised learning (metarl, ).

Confirming this difficulty, we train the DTQN model under the Model-Agnostic Meta-Learning (MAML) framework (maml, ), and find through experiments that canonical MAML fails to make the policy model converge: the task training phase leverages off-policy learning, while the task evaluation and meta-adaptation phases employ an on-policy strategy. The model therefore initially receives very sparse reward signals, especially on complex composite-domain tasks. As a result, the dialogue agent is prone to overfitting the on-policy data and getting stuck at a local minimum in the policy space.

Therefore, we further propose Meta-DTQN with a dual-replay mechanism. To support effective off-policy learning in meta dialogue policy optimization, we construct a task evaluation memory to cache dialogue trajectories and prefill it with rule-based experiences in task evaluation. This dual-replay strategy ensures the consistency of off-policy strategy in both meta-training and meta-adaptation, and provides richer dialogue trajectory records to enhance the quality of the learned policy model. Empirical results show that the dual-replay mechanism can effectively increase the success rate of DTQN while reducing the dialogue length, and Meta-DTQN with dual replay outperforms strong baselines on the multi-domain task-oriented dialogue dataset MultiWOZ 2.0 (budzianowski2018multiwoz, ).

2 Related Work

Dialogue Policy Learning Dialogue policy, also known as the dialogue manager, is the controlling module in task-oriented dialogue that determines the agent's next action. Early work on dialogue policy was built on handcrafted rules (ruledm, ). As the outcome of a dialogue does not emerge until the end of the conversation, dialogue policy is often trained via Reinforcement Learning (RL) (rlbook, ). For instance, deep RL is proven useful for strategic conversations (drldm, ), and a sample-efficient online RL algorithm is proposed to learn from only a few hundred dialogues (sampleefficientdm, ). Towards more effective completion of complex tasks, hierarchical RL is employed to learn a multi-level policy either through temporal control (peng2017composite, ) or subgoal discovery (tang2018subgoal, ). Model-based RL also helps a dialogue agent plan for the future during conversations (deepdynaq, ). While RL for multi-domain dialogue policy learning has attracted increasing attention from researchers, dialogue policy transfer remains under-studied.

Meta-Learning Meta-learning is a framework for adapting models to new tasks with a small amount of data (oneshotlearning, ). It can be achieved either by finding an effective prior as initialization for new task learning (oneshotlearning, ), or by a meta-learner that optimizes the model to quickly adapt to new domains (grant2018recasting, ). In particular, the model-agnostic meta-learning (MAML) (maml, ) framework applies to any optimizable system. It ties the model's performance to its adaptability to new systems, so that the resulting model can achieve maximal improvement on new tasks after a small number of updates.

In dialogue systems, meta-learning has been applied to response generation. The domain adaptive dialog generation method (DAML) (daml, ) is an end-to-end dialogue system that can adapt to new domains with a few training samples. It places the state encoder and response generator into the MAML framework to learn general features across multiple tasks.

3 Problem Formulation

Reinforced Dialogue Agent

Task-oriented dialogue management is usually formulated as a Markov Decision Process (MDP): a dialogue agent interacts with a user through sequential actions, based on the observed dialogue states, to fulfill the target conversational goal. At step $t$, given the current dialogue state $s_t$, the agent selects a system action $a_t$ based on its policy $\pi$, i.e., $a_t = \pi(s_t)$, and receives a reward $r_t$ from the environment.¹ The expected total reward of taking action $a_t$ under the state $s_t$ is defined as the Q-function:

$$Q^{\pi}(s_t, a_t) = \mathbb{E}\left[\sum_{k=t}^{T} \gamma^{k-t} r_k \,\middle|\, s_t, a_t\right], \tag{1}$$

where $T$ is the maximum number of turns in the dialogue and $\gamma \in [0, 1]$ is a discount factor. The policy is trained to find the optimal Q-function $Q^{*}$ so that the expected total reward at each state is maximized. The optimal policy then greedily acts as $\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)$.

¹ Reward measures the degree of success of a dialogue. In ConvLab (lee2019convlab, ), for example, success leads to a reward of $2T$, where $T$ is the maximum number of turns in a dialogue (set to 40 by default), and failure to a reward of $-T$. To encourage shorter dialogues, the agent also receives a reward of $-1$ at each turn.
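The turn-level reward scheme described in the footnote can be sketched as follows. The exact constants ($+2T$ on success, $-T$ on failure, $-1$ per turn, $T = 40$) are our reading of the ConvLab defaults, so treat them as assumptions:

```python
def dialogue_reward(success, n_turns, max_turn=40):
    """Total reward of a finished dialogue under the assumed ConvLab-style
    scheme: -1 per turn to encourage brevity, plus a terminal bonus of
    2 * max_turn on success or a penalty of -max_turn on failure."""
    step_penalty = -1 * n_turns
    outcome = 2 * max_turn if success else -max_turn
    return step_penalty + outcome

# A successful 5-turn dialogue: -5 + 80 = 75.
assert dialogue_reward(True, 5) == 75
# A failed dialogue that ran the full 40 turns: -40 - 40 = -80.
assert dialogue_reward(False, 40) == -80
```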

To better explore the action space, an $\epsilon$-greedy policy is employed to select the action given the state: with probability $\epsilon$, a random action is chosen; with probability $1 - \epsilon$, the greedy action is taken. Here, the Q-function is modeled by a Deep Q-Network (DQN) (mnih2015human, ) with parameters $\theta$. To train this network, state-action transitions $(s, a, r, s')$ are stored in a replay buffer $\mathcal{D}$. At each training step, a batch of samples is drawn from $\mathcal{D}$ to update the policy network via the 1-step temporal difference (TD) error, implemented with a mean-square error loss:

$$\mathcal{L}(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\left[\left(r + \gamma \max_{a'} Q_{\theta'}(s', a') - Q_{\theta}(s, a)\right)^{2}\right], \tag{2}$$

where $Q_{\theta'}$ is the target network, whose parameters $\theta'$ are only periodically replaced by $\theta$ to stabilize training.
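The $\epsilon$-greedy selection and the 1-step TD loss above can be sketched as follows. The linear Q-networks and all sizes are toy stand-ins, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(1)

def epsilon_greedy(q_values, epsilon):
    """Explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def td_loss(q_net, target_net, batch, gamma=0.99):
    """Mean-squared 1-step TD error over a replay minibatch."""
    errs = []
    for s, a, r, s_next, done in batch:
        # Bootstrap from the frozen target network unless the episode ended.
        target = r if done else r + gamma * np.max(target_net(s_next))
        errs.append((target - q_net(s)[a]) ** 2)
    return float(np.mean(errs))

# Toy linear Q-networks over 3 actions and 4 state features.
W = rng.normal(size=(3, 4))
Wt = W.copy()  # target network: a frozen copy, synced periodically
q_net = lambda s: W @ s
target_net = lambda s: Wt @ s

batch = [(rng.normal(size=4), 0, 1.0, rng.normal(size=4), False)]
loss = td_loss(q_net, target_net, batch)
assert loss >= 0.0
```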

Environment and Domain

The dialogue environment typically includes a database that can be queried by the system, and a user-simulator that mimics human actions to interact with the agent. At the beginning of a conversation, the user-simulator specifies a dialogue goal, and the agent is optimized to accomplish it. Dialogue goals are generated from one or multiple domain(s). For instance, in the benchmark multi-domain dialogue dataset MultiWOZ (budzianowski2018multiwoz, ), there are a total of 7 domains and 25 domain compositions. Each domain composition consists of one or more domains, e.g., {hotel} and {hotel, restaurant, taxi}. We split all domains into source domains and target domains to fit the meta-learning scenario (see Section 5 for details).

State Representation

We show the dialogue state representation for classic DQN in Figure 1(A). After receiving a system action , the environment responds with a user action, which is then fed into a dialogue state tracker (DST) to update the dialogue agenda. The DST maintains the entire dialogue records with a state dictionary, and the DQN has a state encoder to embed the dictionary into a state vector. In detail, this state encoder represents states with multi-hot state vectors including six primary feature categories (lee2019convlab, ), e.g., request and inform. As shown in the bottom-left corner of Figure 1(A), each category is encoded as the concatenation of a few domain-specific multi-hot vectors from its relevant domains, and the concatenation of the six category representations forms a binary state representation (see Appendix A for details).
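A minimal sketch of such a concatenated multi-hot encoding is given below, with a made-up two-domain, two-category ontology (the real feature categories and slots differ):

```python
import numpy as np

# Illustrative ontology: each category holds per-domain slot lists.
ontology = {
    "inform":  {"hotel": ["name", "area", "price"], "taxi": ["destination"]},
    "request": {"hotel": ["phone"],                 "taxi": ["car_type"]},
}

def multi_hot_state(active):
    """Concatenate one multi-hot segment per (category, domain) pair.

    `active` maps (category, domain) -> set of slots currently filled.
    Slots of unseen domains always stay zero, which is exactly why this
    representation cannot transfer to a new domain.
    """
    segments = []
    for cat, domains in ontology.items():
        for dom, slots in domains.items():
            filled = active.get((cat, dom), set())
            segments.append(np.array([float(s in filled) for s in slots]))
    return np.concatenate(segments)

s = multi_hot_state({("inform", "hotel"): {"area", "price"}})
# [name, area, price | destination | phone | car_type]
assert s.tolist() == [0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
```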

We argue that two major issues in the classic DQN system prohibit its generalization to unseen domains: (1) the input states adopt multi-hot representations where no inter-state relation is considered and (2) given the state input, actions in different domains are modeled as independent regression classes. However, there is a considerable amount of domain knowledge that can be shared across actions and states, e.g., both taxi-booking and hotel-reserving tasks share dialogue slots such as start time and location and dialogue acts such as request. These types of information elicit similar text representation and policy handling.

4 Framework

4.1 Deep Transferable Q-Network

Figure 1: Framework of (A) classic DQN for Dialogue Policy Learning and (B) our Deep Transferable Q-Network (DTQN) for Cross-Domain Dialogue Policy Learning.

To enable effective knowledge learning and transfer across different domains, we reformulate cross-domain dialogue policy learning as a state-action matching problem. As shown in Figure 1(B), we propose DTQN, a Deep Transferable Q-Network that jointly optimizes the policy network and domain knowledge representations.

Driven by the structure of dialogue knowledge, we assume that the dialogue state space and the system action space can be factorized into a set of lower-level feature spaces. Based on this hypothesis, we aim to model cross-domain relations at three levels in DTQN: domain-level, act-level and slot-level. To this end, we hierarchically decompose the states and actions into four embedding subspaces shared across all dialogue sessions: domains $\mathcal{E}_d$, dialogue acts $\mathcal{E}_t$, slots $\mathcal{E}_s$, and values $\mathcal{E}_v$. Both states and actions are encoded by joining different sets of subspace embeddings.

We retain the existing categorization of dialogue state features mentioned in Section 3, considering its effectiveness in dialogue management (deepdynaq, ; lee2019convlab, ). We first represent each feature category as a dense vector $f_i$ and then concatenate the category feature vectors into the state representation $s$:

$$s = [f_1; f_2; \ldots; f_C], \tag{4}$$

where $C$ is the number of feature categories. Note that each $f_i$ consists of domain-specific features, each corresponding to one domain. In the classic DQN state representation (lee2019convlab, ), few features are shared across domains. As a result, an agent cannot generalize its policy from source domains to a target domain whose state space remains unseen and whose action space is mostly unexplored. Besides, the length of the state representation grows linearly with the number of domains.

Here, we propose to use a fixed-length state vector $f_i$ to represent the $i$-th feature category. To do so, we use cross-domain features to aggregate state information from different domains. In detail, the $j$-th domain-specific component of $f_i$ is denoted by $f_i^{j}$. For example, for the inform category,

$$f_i^{j} = [e_d^{j}; \bar{v}^{j}; b^{j}],$$

where $e_d^{j}$ denotes the embedding of domain $j$, $\bar{v}^{j}$ is the average of the inner products between the general slot embeddings and their value embeddings, and the binary feature $b^{j}$ tracks whether the corresponding domain is active, i.e., whether its essential domain slots are already filled.

To obtain the fixed-length representation for a feature category, we aggregate its domain features from all relevant domains via a non-linear transformation with a residual connection:

$$f_i = \sum_{j} \left( \sigma(W_g f_i^{j}) + f_i^{j} \right),$$

where $W_g$ projects the domain-specific features into a feature space shared across domains, and we acquire the final state representation via Equation (4).
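The aggregation step can be sketched as below. The ReLU non-linearity, the sum pooling over domains, and all dimensions are assumptions for illustration; the key property shown is that the output length does not depend on how many domains contribute:

```python
import numpy as np

rng = np.random.default_rng(2)
DIM = 6  # domain-feature size (kept equal to the output so the residual adds)

W_g = rng.normal(size=(DIM, DIM)) * 0.1  # shared cross-domain projection

def aggregate(domain_feats):
    """Fixed-length category vector from a variable number of domains.

    Project each domain-specific feature into a shared space, apply a
    non-linearity, add a residual connection, and pool over domains
    (the ReLU and sum pooling are our assumptions).
    """
    return sum(np.maximum(W_g @ f, 0.0) + f for f in domain_feats)

two = aggregate([rng.normal(size=DIM) for _ in range(2)])
five = aggregate([rng.normal(size=DIM) for _ in range(5)])
# Output length is independent of the number of contributing domains.
assert two.shape == five.shape == (DIM,)
```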

Different from DQN, which encodes only dialogue states and incorporates no prior information about actions, we explicitly model the structural information of system actions with an action encoder in DTQN to maximize knowledge sharing across domains. Action encoding follows a procedure analogous to state encoding, except that it does not use the value space $\mathcal{E}_v$. For each system action $a$, the domains that contain this action form a set $D_a$. We encode its $j$-th domain feature ($j \in D_a$) as $x_a^{j} = [e_d^{j}; e_t; \bar{e}_s]$, where $e_t$ is the embedding of the dialogue act, e.g., request or booking, and $\bar{e}_s$ is the average of the slot embeddings. We then aggregate the domain features, as in state encoding, to obtain the system action embedding $x_a$.

All embedding tables are shared between the state and action encoders. We stack all action vectors and denote the resulting action matrix by $A$, which is then used to produce the Q-values:

$$Q(s, \cdot) = A W s,$$

where $W$ is a parameter matrix.
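The bilinear state-action matching above amounts to one matrix chain per state; all dimensions below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
N_ACTIONS, D_ACT, D_STATE = 5, 8, 12

A = rng.normal(size=(N_ACTIONS, D_ACT))  # stacked action embeddings
W = rng.normal(size=(D_ACT, D_STATE))    # learned bilinear parameter matrix
s = rng.normal(size=D_STATE)             # encoded dialogue state

# One Q-value per action: each row of A is matched against the state.
q_values = A @ W @ s
assert q_values.shape == (N_ACTIONS,)

best_action = int(np.argmax(q_values))
assert 0 <= best_action < N_ACTIONS
```

Because unseen actions are composed from shared embeddings, a new domain's actions get meaningful rows in $A$ without any domain-specific regression head.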

4.2 Meta Reinforcement Learning with Dual Replay

1:  function MetaPolicyLearning
2:      Initialize policy network $Q_\theta$ and target network $Q_{\theta'}$
3:      Initialize experience replay memories $\mathcal{D}^{tr}$ and $\mathcal{D}^{ev}$ using Replay Buffer Spiking (RBS)    ▷ Dual replay
4:      Set source domains $S$ and gather the domain compositions $C$, where each $c \in C$ satisfies $c \subseteq S$
5:      for each meta-training iteration do    ▷ Outer loop for meta-training
6:          Generate $N$ dialogue goals $\{g_i\}$ from $S$ or $C$    ▷ Single or composite domain
7:          Initialize meta-training loss $\mathcal{L}^{meta} \leftarrow 0$
8:          for $i = 1, \ldots, N$ do    ▷ Inner loop for task data collection and training
9:              $\theta_i \leftarrow \theta$ and load agent with $\theta_i$
10:             EnvInteract($g_i$, $\theta_i$, $\mathcal{D}^{tr}$)    ▷ Task training data collection
11:             Sample random minibatches of $(s, a, r, s')$ from $\mathcal{D}^{tr}$
12:             Update $\theta_i$ via $k$-step minibatch SGD
13:             EnvInteract($g_i$, $\theta_i$, $\mathcal{D}^{ev}$)    ▷ Task evaluation data collection
14:             Sample random minibatches of $(s, a, r, s')$ from $\mathcal{D}^{ev}$
15:             Forward pass with the minibatches and obtain $\mathcal{L}^{ev}_{g_i}$
16:             $\mathcal{L}^{meta} \leftarrow \mathcal{L}^{meta} + \mathcal{L}^{ev}_{g_i}$
17:          end for
18:          Load agent with $\theta$ and update $\theta$ with respect to $\mathcal{L}^{meta}$ via minibatch SGD
19:          Every $Z$ steps reset $\theta' \leftarrow \theta$    ▷ Target network update
20:      end for
21:  end function
Algorithm 1: Meta Dialogue Policy Learning
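Algorithm 1 can be rendered as the following Python skeleton. Every component here (environment interaction, the update rules, the batch size) is a stub or an assumption; the point is only the control flow and the dual-replay bookkeeping:

```python
import random
from collections import deque

class Memory:
    """Fixed-capacity replay buffer."""
    def __init__(self, cap=50_000):
        self.buf = deque(maxlen=cap)
    def push(self, *transitions):
        self.buf.extend(transitions)
    def sample(self, k):
        return random.sample(list(self.buf), min(k, len(self.buf)))

d_train, d_eval = Memory(), Memory()  # dual replay buffers

def meta_policy_learning(theta, tasks, inner_update, eval_loss,
                         outer_update, env_interact, n_outer=1):
    for _ in range(n_outer):                           # outer loop
        meta_loss = 0.0
        for goal in tasks:                             # inner loop over tasks
            theta_i = dict(theta)                      # start from current params
            d_train.push(*env_interact(goal, theta_i))
            theta_i = inner_update(theta_i, d_train.sample(16))
            d_eval.push(*env_interact(goal, theta_i))  # cache eval trajectories
            meta_loss += eval_loss(theta_i, d_eval.sample(16))
        theta = outer_update(theta, meta_loss)         # meta update
    return theta

# Smoke run with stub components (all of them illustrative no-ops).
theta = meta_policy_learning(
    theta={"w": 0.0},
    tasks=["goal_a", "goal_b"],
    inner_update=lambda th, batch: th,
    eval_loss=lambda th, batch: 1.0,
    outer_update=lambda th, loss: th,
    env_interact=lambda goal, th: [(0, 0, 0.0, 0)],
)
assert theta == {"w": 0.0}
```

Note how the evaluation trajectories are pushed into their own buffer rather than being consumed once, which is the difference between this loop and on-policy MAML.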

To adapt the Q-network to few-shot learning scenarios, we propose to use a meta-learning framework (maml, ) and present an instantiation of this framework with DTQN as the policy network $Q_\theta$ and target network $Q_{\theta'}$. Algorithm 1 shows the pseudocode of our methodology for meta dialogue policy learning.

At the beginning of each outer loop of meta-training, we first sample $N$ dialogue goals as training tasks. In the $i$-th inner-loop step, the agent interacts with the environment on task $g_i$ to collect trajectories and stores them in the replay buffer $\mathcal{D}^{tr}$ (see Appendix B for details of the function EnvInteract). Then, we sample from $\mathcal{D}^{tr}$ a minibatch of experiences of task $g_i$. The loss function $\mathcal{L}_{g_i}$ is from Equation (2). We compute the task-specific updated parameters $\theta_i$ from $\theta$:

$$\theta_i = \theta - \alpha \nabla_{\theta} \mathcal{L}_{g_i}(\theta),$$

where $\alpha$ is the task learning rate.
With the updated parameters $\theta_i$, the agent interacts with the environment and obtains the evaluation trajectories of task $g_i$. According to MAML (maml, ), the task evaluation loss $\mathcal{L}^{ev}_{g_i}$ should be directly used to update $\theta$ with the meta learning rate $\beta$:

$$\theta \leftarrow \theta - \beta \nabla_{\theta} \sum_{i=1}^{N} \mathcal{L}^{ev}_{g_i}(\theta_i).$$

However, this on-policy learning suffers from very sparse rewards, especially at the initial learning stage. This is due to the inherent difficulties of cross-domain dialogue learning: i) the state-action space to explore is much larger, and ii) the conversation required to complete the task is often longer (peng2017composite, ). As a result, the dialogue agent is prone to overfitting the on-policy data and getting stuck at a local minimum in the policy space.
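The inner/outer update can be made concrete with a toy scalar example. The quadratic task losses, the learning rates, and the first-order treatment of the outer gradient (taken at $\theta_i$, ignoring second derivatives) are all invented for illustration:

```python
# Toy first-order MAML step on two scalar tasks with different optima.

def loss(theta, target):   # task loss L(theta) = (theta - target)^2
    return (theta - target) ** 2

def grad(theta, target):   # dL/dtheta = 2 * (theta - target)
    return 2.0 * (theta - target)

alpha, beta = 0.25, 0.1
theta = 0.0
tasks = [1.0, -1.0]        # each task pulls theta toward its own optimum

outer_grad = 0.0
for t in tasks:
    theta_i = theta - alpha * grad(theta, t)  # inner (task) update
    outer_grad += grad(theta_i, t)            # evaluation gradient at theta_i
theta = theta - beta * outer_grad             # meta update

# task +1: theta_i = 0.5, eval grad = 2*(0.5 - 1) = -1
# task -1: theta_i = -0.5, eval grad = 2*(-0.5 + 1) = +1
# The two pulls cancel, so the symmetric initialization is already optimal.
assert theta == 0.0
```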

To alleviate this problem, we propose a dual-replay framework to support efficient off-policy learning in meta-RL. Apart from the main replay buffer $\mathcal{D}^{tr}$ used for meta-training, we construct a task evaluation memory $\mathcal{D}^{ev}$. We note that it is essential to separate $\mathcal{D}^{tr}$ and $\mathcal{D}^{ev}$, since the task evaluation buffer serves the evaluation of each task and should not be seen during task training.

Moreover, we adopt a variant of imitation learning, Replay Buffer Spiking (RBS) (lipton2016efficient, ), to warm up the learning process. Before our agent interacts with the environment, we employ a rule-based agent crafted for MultiWOZ to initialize both $\mathcal{D}^{tr}$ and $\mathcal{D}^{ev}$. Then, in steps 13-16 of Algorithm 1, we collect new trajectories with our agent and push them into $\mathcal{D}^{ev}$. We uniformly sample from $\mathcal{D}^{ev}$ a minibatch, which can be a mixture of on-policy and relevant off-policy data, to calculate the task evaluation loss $\mathcal{L}^{ev}_{g_i}$. As a result, $\theta$ is updated as:

$$\theta \leftarrow \theta - \beta \nabla_{\theta} \sum_{i=1}^{N} \mathcal{L}^{ev}_{g_i}(\theta_i; \mathcal{D}^{ev}).$$
At test time, for an unseen domain, we adopt a similar off-policy approach for meta-adaptation. This train-test consistency circumvents the known difficulty of on-policy meta-adaptation after off-policy meta-training (rakelly2019efficient, ). In fact, classic MAML for RL can be seen as a special case of our dual-replay architecture, obtained by restricting the task evaluation memory $\mathcal{D}^{ev}$ to only the current on-policy trajectories.

5 Experiment

5.1 Setup

(S = Success rate in %, R = average Reward, T = average Turns)

| System | Hotel (S / R / T) | Train (S / R / T) | Police (S / R / T) | Average (S / R / T) |
|---|---|---|---|---|
| Few-Shot Models | | | | |
| Dqn-1k | 0.00 / -55.70 / 17.71 | 2.15 / -56.02 / 20.60 | 100.00 / 76.62 / 5.38 | 14.66 / 8.58 / 13.68 |
| Dtqn-1k | 53.90 / 15.06 / 11.62 | 61.00 / 24.84 / 10.36 | 100.00 / 79.28 / 2.72 | 71.63 / 39.73 / 8.23 |
| Adaptive Models | | | | |
| VanillaDqn | 28.10 / -20.18 / 15.89 | 0.00 / -59.00 / 21.00 | 24.30 / -25.95 / 17.11 | 17.47 / -35.04 / 18.00 |
| Dqn | 36.00 / -9.70 / 14.90 | 32.20 / -14.53 / 15.17 | 100.00 / 76.62 / 5.38 | 56.07 / 17.46 / 11.82 |
| Dtqn | 62.70 / 27.99 / 9.25 | 82.65 / 54.80 / 6.38 | 100.00 / 79.24 / 2.76 | 81.78 / 54.01 / 6.13 |
| Meta-Dtqn-Sr | 47.50 / 5.86 / 13.14 | 50.35 / 10.26 / 12.16 | 100.00 / 79.28 / 2.72 | 65.95 / 31.80 / 9.34 |
| Meta-Dtqn | 61.90 / 26.28 / 10.00 | 87.45 / 61.44 / 5.50 | 100.00 / 79.24 / 2.76 | 83.12 / 55.65 / 6.09 |

Table 1: System performance in the single-domain setting on 2,000 dialogues in the target domains.

(S = Success rate in %, R = average Reward, T = average Turns)

| System | Hotel (S / R / T) | Train (S / R / T) | Average (S / R / T) |
|---|---|---|---|
| Few-Shot Models | | | |
| Dqn-1k | 0.00 / -50.38 / 12.38 | 2.15 / -56.02 / 20.60 | 1.08 / -53.20 / 16.49 |
| Dtqn-1k | 2.55 / -47.77 / 12.84 | 8.90 / -45.96 / 18.64 | 5.73 / -46.87 / 15.74 |
| Adaptive Models | | | |
| VanillaDqn | 0.05 / -58.93 / 20.99 | 0.00 / -59.00 / 21.00 | 0.03 / -58.97 / 21.00 |
| Dqn | 3.90 / -53.74 / 20.42 | 9.85 / -45.52 / 19.34 | 6.88 / -49.63 / 19.88 |
| Dtqn | 4.15 / -53.32 / 20.30 | 15.40 / -37.93 / 18.41 | 9.78 / -45.63 / 19.36 |
| Meta-Dtqn-Sr | 1.15 / -57.41 / 20.79 | 4.35 / -53.01 / 20.23 | 2.75 / -55.21 / 20.51 |
| Meta-Dtqn | 11.45 / -43.68 / 19.42 | 19.30 / -32.81 / 17.97 | 15.38 / -38.25 / 18.70 |

Table 2: System performance in the composite-domain setting on 2,000 dialogues in the target domains. We show results for Hotel and Train only, as Police has only single-domain dialogue goals.

Dataset and task settings

We use the benchmark multi-domain dialogue dataset MultiWOZ 2.0 (budzianowski2018multiwoz, ) for evaluation. We adopt attraction, restaurant, taxi and hospital as source domains for training (source task size ), and use hotel, train and police as target domains for adaptation. This split ensures that both the train and test splits contain domains with various frequency levels (see Appendix C for details). We propose two experiment settings: single-domain and composite-domain. In the single-domain setting, agents are trained and tested with only single-domain dialogue goals. In the composite-domain setting, for each task in meta-training, we first select a seed domain $d$ and then sample a domain composition that contains $d$. The trained model is then adapted and evaluated on various domain compositions containing $d$.


We developed several baseline task-oriented dialogue systems: Dqn, standard deep Q-learning with binary state representations, and Dtqn, our proposed model without the meta-learning framework. We also build VanillaDqn, a version without Replay Buffer Spiking (RBS) (lipton2016efficient, ), to show the warm-up effect of adapting from rule-based off-policy data. In addition, we build Meta-Dtqn-Sr with only a single replay buffer to show the effect of the proposed dual-replay mechanism. During adaptation to target domains, we simulate the data-scarcity scenario by using only 1,000 frames (i.e., 10% of the training data). Besides, to examine the effect of the two-stage training-and-adaptation paradigm, we also report results for two few-shot models, Dqn-1k and Dtqn-1k, trained from scratch with the 1,000 frames in the target domains.

Implementation Details

We developed all agent variants based on ConvLab (lee2019convlab, ). We used a batch size of 16 for both training and adaptation, and set the sizes of the training replay buffer $\mathcal{D}^{tr}$ and the evaluation replay buffer $\mathcal{D}^{ev}$ to 50,000. We initialized the replay buffers with Replay Buffer Spiking (RBS) (lipton2016efficient, ) during the first 1,000 episodes of meta-training, the first 10 episodes of single-domain adaptation, and the first 50 episodes of composite-domain adaptation (see Appendix D for details).

5.2 Evaluation Results

(S = Success rate in %, R = average Reward, T = average Turns)

| System | Single (S / R / T) | Composite (S / R / T) | Average (S / R / T) |
|---|---|---|---|
| Dqn | 90.20 / 65.98 / 4.26 | 40.40 / -3.00 / 13.48 | 65.30 / 31.49 / 8.87 |
| Dtqn | 91.70 / 68.13 / 3.91 | 74.85 / 42.92 / 8.90 | 83.28 / 55.53 / 6.41 |
| Meta-Dtqn-Sr | 83.80 / 57.00 / 5.57 | 33.95 / -11.81 / 14.55 | 58.88 / 22.59 / 10.06 |
| Meta-Dtqn | 93.00 / 69.84 / 3.76 | 80.30 / 49.95 / 8.41 | 86.65 / 59.90 / 6.09 |

Table 3: System performance on 2,000 dialogues in the training domains.
Figure 2: Development success rate (top) and training loss (bottom) of Meta-Dtqn with evaluation replay buffers of different sizes on composite-domain tasks. Shaded areas denote variance.

Table 1 shows that our models (Meta-Dtqn and Dtqn) considerably outperform baseline systems on single-domain tasks in hotel and train. Dialogue tasks in police are relatively easy to accomplish: even Dqn-1k, trained from scratch with only 1,000 frames, achieves complete success there, whereas it fails on all tasks in hotel. Also note that Dtqn-1k significantly outperforms Dqn-1k across all domains, which demonstrates the effectiveness of modeling the dependency between the state and action spaces. Besides, the performance gain from meta-training is most significant in the train domain (4.8% absolute in success rate), which can be attributed to the similarity of the state and action spaces between train and the source domain taxi.

Table 2 shows the adaptation results on the composite-domain setting, which is a much harder dialogue task. Here, Meta-Dtqn has a clear advantage over other agents on both hotel and train, showing that meta-learning can boost the quality of agents adapted with a small amount of data in complex dialogue tasks (see Appendix E for dialogue examples). Table 3 lists the performance of various models when evaluating on source domains. Here, meta-learning can also help to achieve better results, and the gain is larger in the more complex composite-domain settings.

It is worth noting that on all tasks, Meta-Dtqn shows superior results than its single-replay counterpart, Meta-Dtqn-Sr, and the performance gap is particularly large on composite-domain dialogue tasks where the agent is more prone to suffer from initial reward sparsity.

Effects of dual replay We further investigate the effects of the proposed dual-replay method. In Figure 2, we show the performance of our model with task evaluation memories of varied sizes. We start with pure on-policy evaluation, i.e., a memory whose size equals the batch size, and experiment with buffer sizes of 16, 1,000, 3,000, 5,000, 10,000, and 50,000. As shown, when the replay buffer is relatively small, the success rate fails to improve. We argue that this optimization difficulty is due to overfitting to on-policy data with sparse rewards at the beginning of the learning phase. This is corroborated by the loss curve: the training loss abruptly drops from high values (100-500) to extremely low values (less than 10) soon after the RBS warm-up phase. When the evaluation memory size increases, the model is able to escape from the local minimum and continues to be optimized.

Figure 3: Performance of Meta-Dtqn with adaptation data of varied sizes on composite tasks.

Effects of adaptation data size In addition, we show how the size of adaptation data affects the agent’s performance on target domains in Figure 3. We test Meta-Dtqn with adaptation data ranging from 100 frames (1% of the training data) to 2,500 frames (25% of the training data). As shown, the agent performance positively correlates with the amount of data available in the target domain. Note that as one episode has on average 10 frames, and we adopt RBS for the first 50 episodes, the agent is adapted only with off-policy rule-based experiences when the number of frames is less than 500. Therefore, the large performance gap between 500 and 1,000 frames indicates that our model can considerably benefit from a very small amount of on-policy data.

6 Conclusion

Dialogue policy is the central controller of a dialogue system and is usually optimized via reinforcement learning. However, it often suffers from insufficient training data, especially in multi-domain scenarios. In this paper, we propose the Deep Transferable Q-Network (DTQN) to share multi-level information between domains such as slots and acts. We also modify the meta-learning framework MAML and introduce a dual-replay mechanism. Empirical results show that our method outperforms traditional deep reinforcement learning models without domain knowledge sharing, in terms of both success rate and length of dialogue. As future work, we plan to generalize our method to more meta-RL applications in multi-domain and few-shot learning scenarios.

Broader Impact

Our work can contribute to dialogue research and applications, especially in new domains with scant training data. Our framework helps models quickly adapt to unseen domains to bootstrap applications. The outcome is more effective and efficient dialogue agents that facilitate everyday activities.

However, one needs to be cautious when collecting dialogue data, which may raise privacy issues. Anonymization methods must be used to protect personal privacy.