
Semi-Supervised Dialogue Policy Learning via Stochastic Reward Estimation

by   Xinting Huang, et al.

In task-oriented dialogue systems, dialogue policy optimization often obtains feedback only upon task completion. This is insufficient for training intermediate dialogue turns, since supervision signals (or rewards) are provided only at the end of dialogues. To address this issue, reward learning has been introduced to learn from state-action pairs of an optimal policy to provide turn-by-turn rewards. This approach requires complete state-action annotations of human-to-human dialogues (i.e., expert demonstrations), which is labor-intensive. To overcome this limitation, we propose a novel reward learning approach for semi-supervised policy learning. The proposed approach learns a dynamics model as the reward function, which models dialogue progress (i.e., state-action sequences) based on expert demonstrations, either with or without annotations. The dynamics model computes rewards by predicting whether the dialogue progress is consistent with expert demonstrations. We further propose to learn action embeddings for better generalization of the reward function. The proposed approach outperforms competitive policy learning baselines on MultiWOZ, a benchmark multi-domain dataset.



1 Introduction

Task-oriented dialogue systems complete tasks for users, such as making a restaurant reservation or finding attractions to visit, in multi-turn dialogues gao2018neural; sun2016contextual; sun2017collaborative. Dialogue policy is a critical component in both the conventional pipeline approach young2013pomdp and recent end-to-end approaches zhao2019rethinking. It decides the next action that a dialogue system should take at each turn. Given the sequential nature of this decision making, dialogue policy is usually learned via reinforcement learning su2015reward; peng2018deep; zhang2019budgeted. Specifically, dialogue policy is learned by maximizing accumulated rewards over interactions with an environment (i.e., actual users or a user simulator). Earlier work peng2018deep commonly uses handcrafted rewards for policy learning, assigning a small negative penalty at each turn and a large positive or negative reward when the task succeeds or fails. However, such a reward setting provides little supervision signal at any turn other than the last, which causes the sparse reward problem and may result in poorly learned policies takanobu2019guided.
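The handcrafted reward scheme described above can be sketched as follows. This is a minimal illustration, not the setting used in any cited work: the function names and the values `turn_penalty=-1.0` and `final_bonus=40.0` are assumptions chosen only to show why the signal is sparse.

```python
def handcrafted_reward(is_last_turn: bool, task_success: bool,
                       turn_penalty: float = -1.0,
                       final_bonus: float = 40.0) -> float:
    """Reward for a single turn under a handcrafted scheme (values are assumed)."""
    if not is_last_turn:
        # Every intermediate turn receives the same small penalty,
        # regardless of whether the chosen action was good or bad.
        return turn_penalty
    # Only the final turn carries information about task completion.
    return final_bonus if task_success else -final_bonus


def dialogue_return(num_turns: int, task_success: bool) -> float:
    """Sum of per-turn rewards over a whole dialogue."""
    return sum(
        handcrafted_reward(t == num_turns - 1, task_success)
        for t in range(num_turns)
    )
```

Under these assumed values, a successful five-turn dialogue returns 36.0 and a failed one returns -44.0; all intermediate turns contribute the same constant, which is the sparsity the paragraph above refers to.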

| Field | Content |
|---|---|
| User-side utterance | I would like moderate price range please. |
| Dialogue state annotation | Restaurant: {food=modern european, price range=moderate} |
| System-side utterance | I found de luca cucina and riverside brasserie. does either of them sound good for you? |
| System action annotation | restaurant-inform:{name=de luca cucina, name=riverside brasserie} |

Table 1: State Action Annotation and Utterance Example

To address this problem, reward function learning that relies on expert demonstrations has been introduced takanobu2019guided; li2019dialogue. Specifically, state-action sequences generated by an optimal policy (i.e., expert demonstrations) are collected, and a reward function is learned to give high rewards to state-action pairs that better resemble the behaviors of the optimal policy. In this way, turn-by-turn rewards estimated by the reward function can be provided to learn the dialogue policy. Obtaining expert demonstrations is critical to reward function learning. Since it is impractical to assume that an optimal policy is always available, a common and reasonable approach is to treat the decisions made in human-human dialogues as optimal behaviors. To support reward function learning, human-human dialogues need to be annotated in the form of state-action pairs derived from textual utterances. Table 1 illustrates an example of a human-human dialogue and its state-action annotation. However, obtaining such annotations requires extensive effort and cost. Besides, a reward function based on state-action pairs might cause unstable policy learning, especially with a limited amount of annotated dialogues yang2018unsupervised.

To address the above issues, we propose to learn dialogue policies in a semi-supervised setting where the system actions of expert demonstrations only need to be partially annotated. We propose to use an implicitly trained stochastic dynamics model as the reward function, replacing the conventional reward function that is restricted to state-action pairs. Dynamics models describe sequential progress using a combination of stochastic and deterministic states in a latent space, which promotes effective tracking and forecasting minderer2019unsupervised; sun2019stochastic; Wang2019EnhancingIS. In our scenario, we train the dynamics model to describe the dialogue progress of expert demonstrations. The main rationale is that the reward function should give high rewards to actions that lead to dialogue progress similar to that in expert demonstrations. This is because dialogue progress at the early stage strongly influences subsequent progress, and the latter directly determines whether the task can be completed. Since learning the dynamics model maps observations to latent states and then reasons over those latent states, we are no longer restricted to fully annotated dialogues. Using a dynamics model as the reward function also promotes more stable policy learning.

Learning the dynamics model in the text space is, however, prone to compounding errors due to the complexity and diversity of language. We tackle this challenge by learning the dynamics model in an action embedding space that encodes the effect of system utterances on dialogue progress. We achieve action embedding learning by incorporating an embedding function into a generative-model framework for semi-supervised learning kingma2014semi. We observe that system utterances with comparable effects on dialogue progress lead to similar state transitions huang2019mala. Therefore, we formulate the generative model to describe the state transition process. Using the generative model, we enrich the expert dialogues (either fully or partially annotated) with action embeddings to learn the dynamics model. Moreover, we also consider scenarios where both state and action annotations are absent in most expert dialogues, referred to as unlabeled dialogues. To extend the proposed approach to such scenarios, we further propose to model dialogue progress using action sequences and reformulate the generative model accordingly.

Our contributions are summarized as follows:

  • To the best of our knowledge, we are the first to approach semi-supervised dialogue policy learning.

  • We propose a novel reward estimation approach to dialogue policy learning which relieves the requirement of extensive annotations and promotes stable learning of dialogue policy.

  • We propose an action embedding learning technique to effectively train the reward estimator from either partially labeled or unlabeled dialogues.

  • We conduct extensive experiments on the benchmark multi-domain dataset. Results show that our approach consistently outperforms strong baselines coupled with semi-supervised learning techniques.

2 Preliminaries

For task-oriented dialogues, a dialogue policy $\pi(a \mid s)$ decides an action $a \in \mathcal{A}$ based on the dialogue state $s \in \mathcal{S}$ at each turn, where $\mathcal{A}$ and $\mathcal{S}$ are the predefined sets of all actions and states, respectively. Reinforcement learning is commonly applied to dialogue policy learning, where the dialogue policy model is trained to maximize cumulative rewards through interactions with environments (i.e., users):

$$\max_{\pi} \; \mathbb{E}_{\tau \sim \pi} \left[ R(\tau) \right] \qquad (1)$$

where $\tau$ represents a sampled dialogue, and $R(\tau)$ is the numerical reward obtained in this dialogue. Instead of determining $R$ via heuristics, recent reward learning approaches train a reward function $r_\theta(s, a)$ to assign a numerical reward to each state-action pair. The reward function is learned from expert demonstrations $\mathcal{D}$, which are dialogues sampled from an optimal policy in the form of state-action pairs. Adversarial learning is usually adopted to enforce higher rewards for state-action pairs from expert demonstrations and lower rewards for those sampled from the learning policy fu2017learning:

$$\max_{\theta} \; \mathbb{E}_{(s,a) \sim \mathcal{D}} \left[ r_\theta(s,a) \right] - \mathbb{E}_{(s,a) \sim \rho_\pi} \left[ r_\theta(s,a) \right] \qquad (2)$$

where $\pi$ is the current dialogue policy, and $\rho_\pi$ is the distribution of dialogues generated with $\pi$. In this way, the dialogue policy and reward function are iteratively optimized, which requires great training effort and might lead to unstable learning results yang2018unsupervised. Moreover, such a reward learning approach requires complete dialogue state and system action annotations of expert demonstrations, which are expensive to obtain.
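The adversarial idea described above, rewarding expert state-action pairs more than policy-generated ones, can be illustrated numerically. The sketch below assumes the objective is the gap between the mean reward on expert samples and the mean reward on policy samples; the function name and this exact form are illustrative assumptions rather than the formulation of any cited method.

```python
from statistics import mean
from typing import Sequence


def adversarial_reward_objective(expert_rewards: Sequence[float],
                                 policy_rewards: Sequence[float]) -> float:
    """Gap between mean reward on expert pairs and on policy-generated pairs.

    A reward function trained adversarially is pushed to make this gap large,
    while the policy is trained to make its own samples look like expert ones.
    """
    return mean(expert_rewards) - mean(policy_rewards)
```

Because the policy term is re-estimated every time the policy changes, the reward function and the policy have to be updated alternately, which is the source of the training cost mentioned above.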

3 Proposed Model

Figure 1: Overall framework of the proposed approach

3.1 Overview

We study the problem of semi-supervised dialogue policy learning. Specifically, we consider the setting in which expert demonstrations consist of a small number of fully labeled dialogues $\mathcal{D}_F$ and partially labeled dialogues $\mathcal{D}_P$. For each fully annotated dialogue $\tau_i$ in $\mathcal{D}_F$, complete annotations are available: $\tau_i = \{(s_t, a_t, u_t)\}_{t=1}^{n}$, where $u_t$ is the system utterance at turn $t$. Meanwhile, each partially labeled dialogue $\tau_j$ in $\mathcal{D}_P$ only has state annotations and system utterances: $\tau_j = \{(s_t, u_t)\}_{t=1}^{n}$.

Figure 1 illustrates the overall framework of the proposed approach. Rewards are estimated by a dynamics model that consumes action embeddings $f_E(a_t)$. Every action in the set $\mathcal{A}$ is mapped to a fixed-length embedding via a learnable embedding function $f_E$. To obtain the action embeddings for the partially labeled dialogues $\mathcal{D}_P$, which have no action annotations, we first predict the action via a prediction model $f_A$ and then transform the predicted actions to embeddings. To obtain effective action embeddings, we design a state-transition-based objective to jointly optimize $f_E$ and $f_A$ via variational inference (Sec. 3.2). After obtaining the action embeddings, the dynamics model is learned by fitting the expert demonstrations enriched by action embeddings. Rewards are then estimated as the conditional probability of the action given the current dialogue progress encoded in latent states (Sec. 3.3). We also extend the above approach to unlabeled dialogues where both state and action annotations are absent (Sec. 3.4).

3.2 Action Learning via Generative Models

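We aim to learn the prediction model $f_A$ and the action embeddings using both the fully labeled dialogues $\mathcal{D}_F$ and the partially labeled dialogues $\mathcal{D}_P$. We formulate the action prediction model as $f_A(a_t \mid u_t, s_t, s_{t+1})$, which takes as input the system utterance $u_t$ and its corresponding state transition $(s_t, s_{t+1})$. We then introduce a mapping function $f_E: \mathcal{A} \rightarrow \mathcal{E}$, where $\mathcal{E}$ is the action embedding space later used for learning the dynamics model.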

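We train the prediction model with a variational inference approach based on a semi-supervised variational autoencoder (Semi-VAE) kingma2014semi. Semi-VAE describes the data generation process of feature-label pairs $(x, y)$ via latent variables $z$ as:

$$\log p_\theta(x, y) = \log \int p_\theta(x, y, z)\, dz \qquad (3)$$

where $p_\theta$ is a generative model parameterised by $\theta$, and the class label $y$ is treated as a latent variable for unlabeled data. Since the log-likelihood in Eqn. 3 is intractable, its variational lower bound for unlabeled data is instead optimized as:

$$\log p_\theta(x) \geq \mathbb{E}_{q_\phi(y \mid x)} \left[ \mathcal{L}(x, y) \right] + \mathcal{H}\left(q_\phi(y \mid x)\right) = \mathcal{U}(x) \qquad (4)$$

where $q_\phi(z \mid x, y)$ and $q_\phi(y \mid x)$ are inference models for the latent variables $z$ and $y$ respectively, which have a factorised form $q_\phi(y, z \mid x) = q_\phi(z \mid x, y)\, q_\phi(y \mid x)$; $\mathcal{H}$ denotes causal entropy; $\mathcal{L}(x, y)$ is the variational bound for labeled data, and is formulated as:

$$\mathcal{L}(x, y) = \mathbb{E}_{q_\phi(z \mid x, y)} \left[ \log p_\theta(x \mid y, z) \right] + \log p(y) - \mathrm{KL}\left(q_\phi(z \mid x, y) \,\|\, p(z)\right) \qquad (5)$$

where KL is the Kullback-Leibler divergence, and $p(y)$, $p(z)$ are the prior distributions of $y$, $z$.

The generative model $p_\theta$ and the inference models $q_\phi(z \mid x, y)$ and $q_\phi(y \mid x)$ are optimized using both the labeled subset $p_l$ and the unlabeled subset $p_u$ with the objective:

$$\max_{\theta, \phi} \; \mathbb{E}_{(x, y) \sim p_l} \left[ \mathcal{L}(x, y) \right] + \mathbb{E}_{x \sim p_u} \left[ \mathcal{U}(x) \right] \qquad (6)$$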
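To make the treatment of unlabeled data concrete, the sketch below computes a Semi-VAE-style bound for one unlabeled example: the expectation of the labeled bound under the label inference model, plus the entropy of that model. It assumes the standard form from the Semi-VAE literature; the function names and the idea of passing precomputed per-class labeled bounds are illustrative assumptions.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def unlabeled_bound(class_probs: Sequence[float],
                    labeled_bounds: Sequence[float]) -> float:
    """Variational bound for an unlabeled example.

    class_probs[k]    : inferred probability of label k for this example
    labeled_bounds[k] : labeled-data bound evaluated as if the label were k
    """
    expected = sum(q * l for q, l in zip(class_probs, labeled_bounds))
    # The entropy term rewards the inference model for staying uncertain
    # when the labeled bounds do not clearly favor one label.
    return expected + entropy(class_probs)
```

When the inference model is certain about one label, the entropy vanishes and the bound reduces to the labeled bound for that label.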


Semi-Supervised Action Prediction

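We now describe the learning of the action prediction model using semi-supervised expert demonstrations. We extend the semi-supervised VAE by modeling the generation process of state transitions. State transition information is indicative for action prediction and is available in both fully and partially labeled demonstrations. Thus we choose to describe the generation process of state transitions, and the optimization objective is formulated as:

$$\log p_\theta(s_{t+1} \mid s_t) = \log \sum_{a_t} \int p_\theta(s_{t+1}, a_t, z \mid s_t)\, dz \qquad (7)$$

For partially labeled dialogues, we treat action labels as latent variables and use the action prediction model $f_A(a_t \mid u_t, s_t, s_{t+1})$ to infer the value (which is denoted as $f_A(a_t \mid \cdot)$ later for simplicity). The variational bound of Eqn. 7 is derived as:

$$\log p_\theta(s_{t+1} \mid s_t) \geq \mathbb{E}_{f_A(a_t \mid \cdot)} \left[ \mathcal{L}(s_t, a_t, s_{t+1}) \right] + \mathcal{H}\left(f_A(a_t \mid \cdot)\right) = \mathcal{U}(s_t, u_t, s_{t+1}) \qquad (8)$$

where $\mathcal{L}(s_t, a_t, s_{t+1})$ is the variational bound for demonstrations with action labels and is derived as:

$$\mathcal{L}(s_t, a_t, s_{t+1}) = \mathbb{E}_{q_\phi(z \mid s_t, a_t, s_{t+1})} \left[ \log p_\theta(s_{t+1} \mid s_t, a_t, z) \right] + \log p(a_t) - \mathrm{KL}\left(q_\phi(z \mid s_t, a_t, s_{t+1}) \,\|\, p(z)\right) \qquad (9)$$

where $q_\phi(z \mid s_t, a_t, s_{t+1})$ is the inference model for the latent variable $z$. Lastly, we use fully annotated samples to form a classification loss:

$$\mathcal{L}_{cls} = -\mathbb{E}_{(u_t, s_t, a_t, s_{t+1}) \sim \mathcal{D}_F} \left[ \log f_A(a_t \mid u_t, s_t, s_{t+1}) \right] \qquad (10)$$

The overall objective includes the losses of fully and partially labeled demonstrations:

$$\min \; \mathcal{L}_{cls} - \sum_{\mathcal{D}_F} \mathcal{L}(s_t, a_t, s_{t+1}) - \sum_{\mathcal{D}_P} \mathcal{U}(s_t, u_t, s_{t+1}) \qquad (11)$$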


Action Embeddings Learning

We then incorporate the action embedding function into the semi-supervised action prediction approach developed above. The reason for introducing action embeddings is to make the learning of the reward estimator more efficient and robust. Specifically, prediction errors of the action prediction model might impair the learning of the reward estimator, especially in our semi-supervised scenarios where fully labeled dialogues are limited. By mapping actions to an embedding space, ‘wrongly predicted’ partially labeled demonstrations can still provide useful knowledge, and thus we can achieve better generalization over actions for reward estimation.

To this end, we consider the inference steps in the semi-supervised learning process and utilize the ones that involve action labels, i.e., the inference models for the latent variables $a_t$ and $z$. We first specify how the action prediction model is modified to include action embeddings. Inspired by chandak2019learning, we model action selection using a Boltzmann distribution for stability during training:

$$f_A(a_t \mid u_t, s_t, s_{t+1}) = \frac{\exp\left(f_E(a_t)^\top g(u_t, s_t, s_{t+1}) / \lambda\right)}{\sum_{a' \in \mathcal{A}} \exp\left(f_E(a')^\top g(u_t, s_t, s_{t+1}) / \lambda\right)} \qquad (12)$$

where $\lambda$ is a temperature parameter, and $g$ is a function that maps the input into hidden states of the same dimension as the action embeddings. We also modify the inference model for the latent variable $z$ by incorporating action embeddings:

$$q_\phi(z \mid s_t, f_E(a_t), s_{t+1}) \qquad (13)$$

After optimizing the action prediction model and the action embedding function jointly using the objective function in Eqn. 11, we use action embeddings to enrich the expert demonstrations. For fully labeled dialogues, we map the given system action labels to the corresponding embeddings and obtain $\{(s_t, f_E(a_t))\}$. For partially labeled dialogues, we first infer the action $\hat{a}_t$ using the prediction model $f_A$ applied to $(u_t, s_t, s_{t+1})$, and map the inferred action to its embedding to obtain $\{(s_t, f_E(\hat{a}_t))\}$.
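The Boltzmann action selection described above can be sketched as a temperature-scaled softmax over the similarity between a hidden vector and each action embedding. This is a minimal illustration under assumptions: the similarity is taken to be a dot product, and the names `hidden`, `action_embeddings`, and `temperature` are hypothetical stand-ins for the encoder output, the embedding table, and the temperature parameter.

```python
import math
from typing import List, Sequence


def boltzmann_action_probs(hidden: Sequence[float],
                           action_embeddings: Sequence[Sequence[float]],
                           temperature: float = 1.0) -> List[float]:
    """Probability of each action given a hidden vector of the same dimension."""
    scores = [
        sum(h * e for h, e in zip(hidden, emb)) / temperature
        for emb in action_embeddings
    ]
    # Subtract the maximum score before exponentiating for numerical stability.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]
```

Actions with nearby embeddings receive similar probabilities, so a mispredicted label tends to land on an action whose embedding is close to the correct one. A lower temperature concentrates the distribution on the best-matching action.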

3.3 Reward Estimation by Dynamics Model

We aim to learn a reward estimator based on action representations obtained from the action learning module. To achieve more stable reward estimation than adversarial reward learning, we propose a reward estimator based on dialogue progress. Dialogue progress describes how user goals are achieved through multistep interactions and can be modeled as dialogue state transitions. We argue that an action should be given a higher reward when it leads to dialogue progress (i.e., state transitions) similar to that of expert demonstrations. To this end, we learn a model that explicitly captures dialogue progress without the negative sampling required by adversarial learning, and rewards are estimated as the probabilities the model assigns to the taken actions.

To model dialogue progress, we use a variational recurrent neural network (VRNN). We use a stochastic dynamics model because of the ‘one-to-many’ nature of task-oriented dialogues. Specifically, both the user and the dialogue system have multiple feasible options for proceeding with the dialogue, which requires modeling uncertainty. Thus, by adding latent random variables to an RNN architecture, VRNN can model dialogue progress better than deterministic dialogue state tracking.

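VRNN has three types of variables: the observations $x_t$ (here, the action embeddings), the stochastic state $z_t$, and the deterministic hidden state $h_t$, which summarizes previous stochastic states $z_{\leq t}$ and previous observations $x_{\leq t}$. We formulate the prior stochastic states to be conditioned on previous timesteps through the hidden state $h_{t-1}$:

$$z_t \sim p(z_t \mid h_{t-1}) = \varphi^{prior}(h_{t-1}) \qquad (14)$$

We obtain posterior stochastic states by incorporating the observation at the current step, i.e., the action embedding $x_t = f_E(a_t)$:

$$z_t \sim q(z_t \mid x_t, h_{t-1}) = \varphi^{enc}(x_t, h_{t-1}) \qquad (15)$$

Predictions are made by decoding latent states, including both the stochastic and the deterministic ones:

$$x_t \sim p(x_t \mid z_t, h_{t-1}, s_t) = \varphi^{dec}(z_t, h_{t-1}, s_t) \qquad (16)$$

And lastly the deterministic states are updated as:

$$h_t = \varphi^{rnn}(x_t, z_t, h_{t-1}, s_t) \qquad (17)$$

where $\varphi^{prior}$, $\varphi^{enc}$, $\varphi^{dec}$, and $\varphi^{rnn}$ are all implemented as neural networks. Note that we also make the prediction and recurrence steps condition on the dialogue state $s_t$ to provide more information.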

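We train the VRNN by optimizing the evidence lower bound (ELBO):

$$\max \; \sum_{t} \Big( \mathbb{E}_{q(z_t \mid x_t, h_{t-1})} \left[ \log p(x_t \mid z_t, h_{t-1}, s_t) \right] - \mathrm{KL}\left(q(z_t \mid x_t, h_{t-1}) \,\|\, p(z_t \mid h_{t-1})\right) \Big) \qquad (18)$$

The rewards are estimated as the conditional probability given the hidden state of the VRNN, which encodes the current dialogue progress:

$$r(s_t, a_t) = p\left(f_E(a_t) \mid z_t, h_{t-1}, s_t\right) \qquad (19)$$

where $p(f_E(a_t) \mid z_t, h_{t-1}, s_t)$ is the probability given to the selected action based on the decoding step of the VRNN (Eqn. 16). The larger this conditional probability is, the more closely the dialogue progress that this action leads to resembles the expert demonstrations. The proposed reward estimation is agnostic to the choice of policy learning algorithm, and various approaches (e.g., Deep Q-learning, Actor-Critic) can be used by plugging the estimated rewards into the policy learning objective (Eqn. 1).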
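The two quantities discussed above, the per-step ELBO used for training and the reward used for policy learning, can be sketched in a few lines once distributional forms are fixed. The sketch assumes diagonal Gaussians for the prior, the posterior, and the decoder, which is a common choice for VRNNs but is an assumption here; the reward is scored as a log-density rather than a probability, and all function names are hypothetical.

```python
import math
from typing import Sequence


def gaussian_log_prob(x: Sequence[float], mean: Sequence[float],
                      std: Sequence[float]) -> float:
    """Log-density of x under a diagonal Gaussian."""
    return sum(
        -0.5 * math.log(2.0 * math.pi) - math.log(s) - 0.5 * ((xi - m) / s) ** 2
        for xi, m, s in zip(x, mean, std)
    )


def gaussian_kl(q_mean: Sequence[float], q_std: Sequence[float],
                p_mean: Sequence[float], p_std: Sequence[float]) -> float:
    """KL(q || p) between two diagonal Gaussians."""
    return sum(
        math.log(ps / qs) + (qs ** 2 + (qm - pm) ** 2) / (2.0 * ps ** 2) - 0.5
        for qm, qs, pm, ps in zip(q_mean, q_std, p_mean, p_std)
    )


def step_elbo(x, dec_mean, dec_std, q_mean, q_std, p_mean, p_std) -> float:
    """Single-sample ELBO term for one turn: reconstruction minus KL."""
    return gaussian_log_prob(x, dec_mean, dec_std) - gaussian_kl(
        q_mean, q_std, p_mean, p_std)


def reward(action_embedding, dec_mean, dec_std) -> float:
    """Reward for a taken action: log-density of its embedding under the decoder."""
    return gaussian_log_prob(action_embedding, dec_mean, dec_std)
```

Under this sketch, an action whose embedding lies close to what the decoder expects from the current dialogue progress gets a higher reward than one that lies far away, with no adversarial discriminator involved.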

3.4 Expanding to Unlabeled Corpus

We further describe how to extend the proposed model, including the action learning and reward estimation modules, to utilize unlabeled expert demonstrations. Formally, we consider the setting in which we have fully labeled dialogues $\mathcal{D}_F$ and unlabeled dialogues $\mathcal{D}_U$. For each dialogue $\tau_i$ in $\mathcal{D}_U$, only textual conversations are provided and neither state nor action labels are available: $\tau_i = \{(c_t, u_t)\}_{t=1}^{n}$, where $c_t$ is the context and consists of the dialogue history of both user and system utterances.

In the absence of dialogue state information, we formulate the action prediction model as $f_A(a_t \mid u_{t-1}, u_t, u_{t+1})$. This formulation can be considered an application of Skip-Thought kiros2015skip, which originally utilizes contextual sentences as supervision signals. In our scenarios, we instead utilize the previous and next system utterances to provide more indicative information for action prediction.

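We also build the joint learning of the action prediction model and the action embeddings on the semi-supervised VAE framework. Instead of modeling state transitions, we model the process of response generation to fully utilize unlabeled dialogues:

$$\log p_\theta(u_t \mid c_t) = \log \sum_{a_t} \int p_\theta(u_t, a_t, z \mid c_t)\, dz \qquad (20)$$

System action labels are treated as latent variables for unlabeled dialogues, and the variational bound is derived as:

$$\log p_\theta(u_t \mid c_t) \geq \mathbb{E}_{f_A(a_t \mid \cdot)} \left[ \mathcal{L}(c_t, a_t, u_t) \right] + \mathcal{H}\left(f_A(a_t \mid \cdot)\right) \qquad (21)$$

where $\mathcal{L}(c_t, a_t, u_t)$ is the variational bound for fully labeled dialogues:

$$\mathcal{L}(c_t, a_t, u_t) = \mathbb{E}_{q_\phi(z \mid c_t, a_t, u_t)} \left[ \log p_\theta(u_t \mid c_t, a_t, z) \right] + \log p(a_t) - \mathrm{KL}\left(q_\phi(z \mid c_t, a_t, u_t) \,\|\, p(z)\right) \qquad (22)$$

The objective to jointly train the prediction model and action embeddings is the same as Eqn. 11, where the terms for fully and partially labeled dialogues are replaced with the ones in Eqn. 22 and 21, respectively. Such an extension also enables effective semi-supervised learning when expert demonstrations include all types of labeled dialogues: fully labeled $\mathcal{D}_F$, partially labeled $\mathcal{D}_P$, and unlabeled $\mathcal{D}_U$. We notice that the posterior approximation $q_\phi$ and the action embedding function $f_E$ can be shared between the processes of state transition and response generation. Thus, by treating semi-supervised learning in $\mathcal{D}_F$ and $\mathcal{D}_P$ as auxiliary constraints, learning over the unlabeled corpus can also benefit from dialogue state information.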

4 Experiments

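Settings: (a) $\mathcal{D}_F$ (5%) + $\mathcal{D}_P$ (95%); (b) $\mathcal{D}_F$ (10%) + $\mathcal{D}_P$ (90%); (c) $\mathcal{D}_F$ (20%) + $\mathcal{D}_P$ (80%).

| Category | Model | (a) Entity-F1 | (a) Success | (a) Turns | (b) Entity-F1 | (b) Success | (b) Turns | (c) Entity-F1 | (c) Success | (c) Turns |
|---|---|---|---|---|---|---|---|---|---|---|
| Handcrafted | PPO | 41.8 | 34.1 | 13.3 | 45.3 | 36.7 | 12.5 | 50.6 | 41.2 | 11.2 |
| Reward Learning | ALDM | 38.7 | 35.6 | 15.2 | 42.1 | 38.6 | 14.9 | 44.9 | 42.1 | 13.7 |
| Reward Learning | GDPL | 49.5 | 47.5 | 12.8 | 54.9 | 53.2 | 12.1 | 60.4 | 59.1 | 10.8 |
| Semi-VAE Enhanced | SS-PPO | 45.2 | 36.2 | 13.6 | 47.4 | 37.2 | 12.4 | 53.1 | 43.6 | 11.5 |
| Semi-VAE Enhanced | SS-ALDM | 39.6 | 38.8 | 14.7 | 44.7 | 43.8 | 13.2 | 47.8 | 51.3 | 12.4 |
| Semi-VAE Enhanced | SS-GDPL | 53.7 | 51.2 | 11.1 | 61.3 | 58.4 | 10.5 | 66.5 | 68.7 | 9.2 |
| Proposed | SS-VRNN | 68.7 | 63.2 | 9.4 | 75.1 | 68.5 | 8.6 | 77.3 | 72.4 | 8.2 |
| Proposed | Act-GDPL | 70.6 | 65.6 | 9.5 | 78.8 | 71.1 | 8.4 | 80.9 | 78.0 | 8.2 |
| Proposed | Act-VRNN | 76.2 | 72.7 | 9.1 | 83.0 | 81.8 | 8.0 | 85.5 | 86.7 | 7.9 |

Table 2: Semi-Supervised Policy Learning Results (fully labeled $\mathcal{D}_F$ and partially labeled $\mathcal{D}_P$)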

To show the effectiveness of the proposed model (denoted as Act-VRNN), we experiment on a multi-domain dialogue environment under semi-supervised setting (Sec. 4.1). We compare against state-of-the-art approaches, and their variants enhanced by semi-supervised learning techniques (Sec. 4.2). We analyze the effectiveness of action learning and reward estimation of Act-VRNN under different supervision ratios (Sec. 4.3).

4.1 Settings

We use MultiWOZ budzianowski2018multiwoz, a multi-domain human-human conversational dataset in our experiments. It contains in total 8438 dialogues spanning over seven domains, and each dialogue has 13.7 turns on average. MultiWOZ also contains a larger dialogue state and action space compared to former datasets such as movie-ticket booking dialogues li2017end, and thus it is a much more challenging environment for policy learning. To use MultiWOZ for policy learning, a user simulator that initializes a user goal at the beginning and interacts with dialogue policy is required. For a fair comparison, we adopt the same procedure as Takanobu et al. (takanobu2019guided) to train the user simulator based on auxiliary user action annotations provided by ConvLab lee2019convlab.

To simulate semi-supervised policy learning, we remove system action and dialogue state annotations to obtain partially labeled and unlabeled expert demonstrations, respectively. Fully labeled expert demonstrations are randomly sampled from all training dialogues with different ratios (5%, 10%, and 20% in our experiments). Note that the absence of action or state annotations only applies to expert demonstrations, while interactions between the policy and the user simulator are at the dialogue-act level, as in takanobu2019guided, and are not affected by the semi-supervised setting.

We use a three-layer transformer vaswani2017attention with a hidden size of 128 and 4 heads as our base model for action embedding learning, i.e., the function in Eqn. 12 that maps the input to hidden states. We use grid search to find the best hyperparameters for the models. We choose the action embedding dimensionality among {50, 75, 100, 150, 200}, the stochastic latent state size in VRNN among {16, 32, 64, 128, 256}, and the deterministic latent state size among {25, 50, 75, 100, 150}.
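The search space described above can be enumerated directly. The sketch below lists every combination of the three hyperparameter ranges given in the text and picks the best one according to a caller-supplied scoring function. The dictionary keys and the `evaluate` callback are hypothetical; the text does not say how each configuration was scored.

```python
from itertools import product
from typing import Callable, Dict, List

EMBEDDING_DIMS = [50, 75, 100, 150, 200]
STOCHASTIC_SIZES = [16, 32, 64, 128, 256]
DETERMINISTIC_SIZES = [25, 50, 75, 100, 150]


def grid() -> List[Dict[str, int]]:
    """All hyperparameter combinations in the search space."""
    return [
        {"embedding_dim": e, "stochastic_size": z, "deterministic_size": h}
        for e, z, h in product(EMBEDDING_DIMS, STOCHASTIC_SIZES, DETERMINISTIC_SIZES)
    ]


def grid_search(evaluate: Callable[[Dict[str, int]], float]) -> Dict[str, int]:
    """Return the configuration with the highest score under `evaluate`."""
    return max(grid(), key=evaluate)
```

With five candidate values per hyperparameter, the full grid contains 5 × 5 × 5 = 125 configurations.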

We use Entity-F1 and Success Rate to evaluate dialogue task completion. Entity-F1 computes the F1 score based on whether the requested information and indicated constraints from users are satisfied. Compared to the inform rate and match rate used by Budzianowski et al. (budzianowski2018multiwoz), Entity-F1 considers both informed and requested entities at the same time and balances recall and precision. Success rate indicates the ratio of successful dialogues, where a dialogue is regarded as successful only if all of its informed and requested entities are matched. We use Turns to evaluate the cost of task completion, where a lower number indicates that the policy performs tasks more efficiently.
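The two task-completion metrics can be sketched over sets of entities. This is a simplified illustration, not the evaluation script used in the experiments: entities are represented as plain strings, and a dialogue is assumed to be successful when every gold entity appears among the predicted ones.

```python
from typing import Iterable, Set, Tuple


def entity_f1(predicted: Set[str], gold: Set[str]) -> float:
    """Harmonic mean of precision and recall over entity sets."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2.0 * precision * recall / (precision + recall)


def success_rate(dialogues: Iterable[Tuple[Set[str], Set[str]]]) -> float:
    """Fraction of dialogues whose gold entities are all matched.

    Each item is a (predicted, gold) pair of entity sets.
    """
    pairs = list(dialogues)
    if not pairs:
        return 0.0
    successes = sum(1 for predicted, gold in pairs if gold <= predicted)
    return successes / len(pairs)
```

Success is all-or-nothing per dialogue, whereas Entity-F1 gives partial credit, which is why the two numbers can diverge for the same policy.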

We compare Act-VRNN with three policy learning baselines: (1) PPO schulman2017proximal using a handcrafted reward setting; (2) ALDM liu2018adversarial; (3) GDPL takanobu2019guided. We further consider using semi-supervised techniques to enhance the baselines under the semi-supervised setting, and denote them as SS-PPO, SS-ALDM, and SS-GDPL. Specifically, we first train a prediction model based on semi-supervised VAE kingma2014semi, and use the prediction results as action annotations for expert demonstrations. (Footnote 1: We also experimented with the pseudo-label approach lee2013pseudo, and the empirical results were worse than Semi-VAE. Thus, we only report the Semi-VAE enhancement results in the table for simplicity.) We also compare the full model Act-VRNN with its two variants: (1) SS-VRNN uses a VRNN that consumes predicted action labels instead of action embeddings; (2) Act-GDPL feeds expert demonstrations enriched by action embeddings to the same reward function as GDPL.

4.2 Overall Results

Table 2 shows that our proposed model consistently outperforms the other models in the setting that uses fully and partially annotated dialogues ($\mathcal{D}_F$ and $\mathcal{D}_P$). Act-VRNN improves task completion (measured by Entity-F1 and Success) while requiring less cost (measured by Turns). For example, Act-VRNN (81.8) outperforms SS-GDPL (60.4) by 35.4% under Success when having 10% fully annotated dialogues, and requires the fewest turns. Meanwhile, we find that both action learning and the dynamics model are essential to the superiority of Act-VRNN. For example, Act-VRNN achieves 19.8% and 11.2% improvements over SS-VRNN and Act-GDPL, respectively, under Success when having 20% fully annotated dialogues. This validates that the learned action embeddings capture similarities among actions well, and that VRNN is able to exploit such similarities for reward estimation.
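The percentages quoted above are relative improvements, not differences in percentage points. A minimal sketch of the arithmetic, using the Success scores at 20% fully annotated dialogues from Table 2 (the function name is illustrative):

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Success at 20% fully annotated dialogues (Table 2).
act_vrnn, ss_vrnn, act_gdpl = 86.7, 72.4, 78.0
over_ss_vrnn = round(relative_improvement(act_vrnn, ss_vrnn), 1)
over_act_gdpl = round(relative_improvement(act_vrnn, act_gdpl), 1)
```

Here `over_ss_vrnn` is 19.8 and `over_act_gdpl` is 11.2, matching the improvements reported for SS-VRNN and Act-GDPL.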

We further find that the improvements brought by Semi-VAE enhancement are limited for the baselines, especially when the ratio of fully annotated dialogues is low. For example, SS-PPO and SS-GDPL achieve 6% and 7% improvements over their counterparts under Success when having 5% fully annotated dialogues. Similar results are also observed for the pseudo-label approach. In general, the pseudo-label methods are outperformed by their Semi-VAE counterparts and are even worse than the baselines without enhancement when the ratio of fully annotated dialogues is low. For example, in the setting with fully and partially labeled dialogues ($\mathcal{D}_F$ + $\mathcal{D}_P$), pseudo-label enhanced PPO performs worse than PPO under Entity-F1 when the ratio of fully annotated dialogues is 5% and 10% (37.2 vs 41.8, 39.2 vs 45.3), and only achieves a slight gain when the ratio is 20% (51.0 vs 50.6). This is largely because the prediction accuracy of Semi-VAE and the pseudo-label approach might be low with a small amount of fully annotated dialogues, and the expert dialogues with mispredicted actions impair the reward function learning of the baselines. Act-VRNN overcomes this challenge with the generalization ability brought by modeling dialogue progress in an action embedding space for reward estimation.

The results for policy learning using unlabeled dialogues ($\mathcal{D}_U$) are shown in Table 3. We consider two settings: (1) having fully labeled and unlabeled dialogues, i.e., $\mathcal{D}_F$ + $\mathcal{D}_U$; (2) having all three types of dialogues, i.e., $\mathcal{D}_F$ + $\mathcal{D}_P$ + $\mathcal{D}_U$. We can see that Act-VRNN significantly outperforms the baselines in both settings. For example, in setting $\mathcal{D}_F$ + $\mathcal{D}_U$, Act-VRNN outperforms SS-GDPL by 43% and 44% under Entity-F1 and Success, respectively. Similar results are also observed in setting $\mathcal{D}_F$ + $\mathcal{D}_P$ + $\mathcal{D}_U$. We further find that SS-VRNN outperforms Act-GDPL in these two settings while the results are the opposite in setting $\mathcal{D}_F$ + $\mathcal{D}_P$; we discuss this in detail in the following section. By comparing the results of Act-VRNN and the baselines across these two settings, we can see that Act-VRNN better exploits the additional partially labeled dialogues. For example, SS-GDPL only achieves a 2.3% improvement under Success, while Act-VRNN achieves more than 5%.

4.3 Discussions

We first study the effects of the action learning module in Act-VRNN. We compare Act-VRNN with SS-VRNN, and their counterparts that do not use the state-transition-based objective in semi-supervised learning (i.e., optimizing Eqn. 3 instead of Eqn. 7). These two variants are denoted as Act-VRNN (no state) and SS-VRNN (no state). For a thorough investigation, under each setting, we further show the performance on dialogues spanning different numbers of domains. Dialogues spanning more domains are considered more difficult. The results under two supervision ratio settings are shown in Fig. 2(a) and Fig. 2(b). We can see that Act-VRNN outperforms the other variants in each configuration, especially on dialogues that include more than one domain. This is largely because the learned action embeddings effectively discover the similarities between actions across domains, and thus lead to better generalization of reward estimation. We further find that the state-transition-based objective we formulated fits well with the VRNN-based reward estimator. Both Act-VRNN and SS-VRNN achieve performance gains when optimized with state transitions.

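| Supervision | Model | Entity-F1 | Success | Turns |
|---|---|---|---|---|
| $\mathcal{D}_F$ (10%) + $\mathcal{D}_U$ (90%) | ALDM | 40.0 | 34.9 | 15.9 |
| | SS-PPO | 44.7 | 33.8 | 12.9 |
| | SS-ALDM | 42.1 | 36.4 | 14.9 |
| | SS-GDPL | 56.3 | 50.2 | 11.8 |
| | SS-VRNN | 74.1 | 67.1 | 9.1 |
| | Act-GDPL | 72.9 | 66.7 | 8.5 |
| | Act-VRNN | 80.6 | 72.4 | 8.4 |
| $\mathcal{D}_F$ (10%) + $\mathcal{D}_P$ (10%) + $\mathcal{D}_U$ (80%) | ALDM | 41.7 | 35.2 | 15.7 |
| | SS-PPO | 44.9 | 34.6 | 12.8 |
| | SS-ALDM | 42.5 | 40.1 | 14.7 |
| | SS-GDPL | 57.1 | 51.4 | 10.7 |
| | SS-VRNN | 75.6 | 67.9 | 8.8 |
| | Act-GDPL | 73.3 | 67.1 | 8.5 |
| | Act-VRNN | 81.1 | 76.3 | 8.2 |

  • Note that PPO and GDPL achieve the same results as $\mathcal{D}_F$ (10%) + $\mathcal{D}_P$ (90%) in Table 2 since they can only utilize dialogues in $\mathcal{D}_F$.

Table 3: Semi-Supervised Policy Learning Results (fully labeled $\mathcal{D}_F$, partially labeled $\mathcal{D}_P$, and unlabeled $\mathcal{D}_U$)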

Last, we study the effects of the dynamics-model-based reward function in Act-VRNN. We consider four different models as the reward function: (1) our full dynamics model VRNN; (2) a dynamics model with only deterministic states (Eqn. 17); (3) a dynamics model with only stochastic states (Eqn. 15); (4) GDPL. All four models are learned from the action embeddings produced by the action learning module. The results under $\mathcal{D}_F$ + $\mathcal{D}_P$ and $\mathcal{D}_F$ + $\mathcal{D}_U$ are shown in Fig. 3(a) and Fig. 3(b), respectively. We can see that both stochastic and deterministic states in VRNN are important, since VRNN outperforms its two variants and GDPL in each configuration. We further find that the contributions of stochastic and deterministic states may vary across settings. For example, VRNN (stochastic only) consistently outperforms VRNN (deterministic only) in $\mathcal{D}_F$ + $\mathcal{D}_U$, while the opposite is observed in $\mathcal{D}_F$ + $\mathcal{D}_P$ when the ratio of $\mathcal{D}_F$ is over 20%. This is largely because modeling dialogue progress using stochastic states provides more stable estimation with fewer supervision signals, while incorporating deterministic states leads to more precise estimation when more information from expert demonstrations is available.


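Figure 2: Effects of action learning (fully labeled $\mathcal{D}_F$ and partially labeled $\mathcal{D}_P$). Panels: (a) $\mathcal{D}_F$ (5%) + $\mathcal{D}_P$ (95%); (b) $\mathcal{D}_F$ (20%) + $\mathcal{D}_P$ (80%).

Figure 3: Effects of dynamics model. Panels: (a) $\mathcal{D}_F$ + $\mathcal{D}_P$; (b) $\mathcal{D}_F$ + $\mathcal{D}_U$.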

5 Related Work

Reward learning aims to provide more effective and sufficient supervision signals for dialogue policy. Early studies focus on learning a reward function from external evaluations, e.g., user experience feedback gavsic2013line, objective ratings su2015reward; ultes2017reward, or a combination of multiple evaluations su2016line; Chen2019MiningUR. These approaches often assume a human-in-the-loop setting where interactions with real users are available during training, which is expensive and difficult to scale. As more large-scale, high-quality dialogue corpora become available (e.g., MultiWOZ budzianowski2018multiwoz), recent years have seen a growing interest in learning reward functions from expert demonstrations. Most recent approaches apply inverse reinforcement learning techniques to dialogue policy learning takanobu2019guided; li2019dialogue. These all require complete state-action annotations for expert demonstrations. We aim to overcome this limitation in this study.

Semi-supervised learning aims to utilize unlabeled data to boost model performance, and is studied in computer vision Iscen2019LabelPF, item ranking Park2019AdversarialSA; huang2019carl, and multi-label classification miyato2015distributional; wang2018kdgan; wang2019adversarial. Many studies apply semi-supervised VAE kingma2014semi to different classification tasks, e.g., sentiment analysis xu2017variational; li-etal-2019-semi-supervised and text matching shen2018deconvolutional; choi-etal-2019-cross. While these works focus on prediction accuracy, we aim to enrich expert demonstrations via semi-supervised learning.

6 Conclusions

We study the problem of semi-supervised policy learning and propose Act-VRNN to provide more effective and stable reward estimates. We formulate a generative model to jointly infer action labels and learn action embeddings. We design a novel reward function that first models dialogue progress and then estimates action rewards by determining whether the action leads to progress similar to that of expert dialogues. The experimental results confirm that Act-VRNN achieves better task completion than the state of the art in two settings that consider partially labeled or unlabeled dialogues. For future work, we will explore scenarios in which annotations are absent for all expert dialogues.


We would like to thank Xiaojie Wang for his help. This work is supported by Australian Research Council (ARC) Discovery Project DP180102050, and China Scholarship Council (CSC).