Task-oriented dialogue systems complete tasks for users, such as making a restaurant reservation or scheduling a meeting, in a multi-turn conversation [gao2018neural, sun2016contextual, sun2017collaborative]. Recently, end-to-end approaches based on neural encoder-decoder structure have shown promising results [wen2017network, madotto2018mem2seq]. However, such approaches directly map plain text dialogue context to responses (i.e., utterances), and do not distinguish two basic components for response generation: dialogue planning and surface realization. Here, dialogue planning means choosing an action (e.g., to request information such as the preferred cuisine from the user, or provide a restaurant recommendation to the user), and surface realization means transforming the chosen action into natural language responses. Studies show that not distinguishing these two components can be problematic since they have a discrepancy in objectives, and optimizing decision making on choosing actions might adversely affect the generated language quality [yarats2018hierarchical, zhao2019rethinking].
Table 1: Example system utterances with ground-truth intentions, auto-encoding latent actions, and the proposed semantic latent actions.

Domain: Hotel
(a) Was there a particular section of town you were looking for?
(b) Which area would you like the hotel to be located at?

Domain: Attraction
(c) Did you have a particular type of attraction you were looking for?
(d) Great, what are you interested in doing or seeing?

System intention (ground truth action): (a) & (b): request(area); (c) & (d): request(type)
Latent action (auto-encoding approach): (a): [0,0,0,1,0]; (b): [0,1,0,0,0]; (c): [0,0,0,1,0]; (d): [0,0,0,0,1]
Semantic latent action (proposed): (a) & (b): [0,0,0,1,0]; (c) & (d): [0,0,0,0,1]
To address this problem, conditioned response generation that relies on action representations has been introduced [wen2015semantically, chen2019semantically]. Specifically, each system utterance is coupled with an explicit action representation, and responses with the same action representation convey similar meaning and represent the same action. In this way, response generation is decoupled into two consecutive steps, and each component of conditioned response generation (i.e., dialogue planning or surface realization) can optimize for a different objective without impinging on the other. Obtaining action representations is critical to conditioned response generation. Recent studies adopt variational autoencoders (VAEs) to obtain low-dimensional latent variables that represent system utterances in an unsupervised way. Such an auto-encoding approach cannot effectively handle diverse surface realizations, especially when there exist multiple domains (e.g., hotel and attraction). This is because the latent variables learned in this way mainly rely on lexical similarity among utterances instead of capturing the underlying intentions of those utterances. In Table 1, for example, system utterances (a) and (c) convey different intentions (i.e., request(area) and request(type)), but may receive the same auto-encoding-based latent action representation since they share similar wording.
To address the above issues, we propose a multi-stage approach to learn semantic latent actions that encode the underlying intention of system utterances rather than their surface realization. The main idea is that system utterances with the same underlying intention (e.g., request(area)) lead to similar dialogue state transitions. This is because dialogue states summarize the dialogue progress towards task completion, and a dialogue state transition reflects how the intention of the system utterance influences this progress at the current turn. To encode underlying intentions into semantic latent actions, we formulate a loss based on whether the utterances reconstructed by the VAE cause state transitions similar to those of the input utterances. To distinguish the underlying intentions among utterances more effectively, we further develop a regularization based on the similarity of the state transitions resulting from two system utterances.
Learning semantic latent actions requires dialogue state annotations. In many domains there are simply no such annotations, because they require extensive human effort and are expensive to obtain. We tackle this challenge by transferring the knowledge of learned semantic latent actions from annotation-rich domains (i.e., source domains) to those without state annotations (i.e., target domains). We transfer this knowledge progressively, starting with actions that exist in both the source and target domains, e.g., Request(Price) in both the hotel and attraction domains. We call such actions shared actions, and actions that exist only in the target domain domain-specific actions. We observe that system utterances with shared actions lead to similar state transitions despite belonging to different domains. Following this observation, we find and align the shared actions across domains. With the action-utterance pairs gathered from this shared-action alignment, we train a network to predict the similarity of the resulting dialogue state transitions, taking only the text of system utterances as input. We then use such similarity predictions as supervision to better learn semantic latent actions for all utterances with domain-specific actions.
Our contributions are summarized as follows:
We are the first to address the problem of cross-domain conditioned response generation without requiring action annotation.
We propose a novel latent action learning approach for conditioned response generation which captures underlying intentions of system utterances beyond surface realization.
We propose a novel multi-stage technique to extend the latent action learning to cross-domain scenarios via shared-action aligning and domain-specific action learning.
We conduct extensive experiments on two multi-domain human-to-human conversational datasets. The results show the proposed model outperforms the state-of-the-art on both in-domain and cross-domain response generation settings.
2 Related Work
2.1 Controlled Text Generation
Controlled text generation aims to generate responses with controllable attributes. Many studies focus on controllable attributes for open-domain dialogues, e.g., style [yang2018unsupervised], sentiment [shen2017style], and specificity [zhang2018learning]. Different from the open-domain setting, the controllable attributes for task-oriented dialogues are usually system actions, since it is important that system utterances convey clear intentions. Based on handcrafted system actions obtained from a domain ontology, action-utterance pairs are used to learn semantically conditioned language generation models [wen2015semantically, chen2019semantically]. Since building action sets and collecting action labels for system utterances requires extensive effort, recent years have seen a growing interest in learning utterance representations in an unsupervised way, i.e., latent action learning [zhao2018unsupervised, zhao2019rethinking]. Latent action learning adopts a pretraining phase that represents each utterance as a latent variable using a reconstruction-based variational autoencoder [yarats2018hierarchical]. The obtained latent variables, however, mostly reflect lexical similarity and lack sufficient semantics about the intention of system utterances. We utilize dialogue state information to enhance the semantics of the learned latent actions.
2.2 Domain Adaptation for Task-oriented Dialogues
Domain adaptation aims to adapt a trained model to a new domain with a small amount of new data. It has been studied in computer vision [saito2017asymmetric], item ranking [wang2018joint, huang2019carl], and multi-label classification [wang2018kdgan, wang2019adversarial, sun2019internet]. For task-oriented dialogues, early studies focus on domain adaptation for individual components, e.g., intention determination [chen2016zero], dialogue state tracking [mrkvsic2015multi], and dialogue policy [mo2018personalizing, yin2018context]. Two recent studies investigate end-to-end domain adaptation. DAML [qian2019domain] adopts model-agnostic meta-learning to learn a seq-to-seq dialogue model on target domains. ZSDG [zhao2018zero] conducts adaptation based on action matching, and uses partial target-domain system utterances as domain descriptions. These end-to-end domain adaptation methods are either difficult to adopt for conditioned generation or require full annotation of system actions. We aim to address these limitations in this study.
Let D be a set of dialogue data, where each dialogue contains T turns {(c_t, x_t)}, with c_t and x_t the dialogue context and system utterance at turn t, respectively. The context c_t consists of the dialogue history of user utterances and system utterances up to turn t. Latent action learning aims to map each system utterance x_t to a representation z_t, where utterances with the same representation express the same action. The representation can take various forms, e.g., one-hot [wen2015semantically], multi-way categorical, or continuous [zhao2019rethinking]. We use the one-hot representation for its simplicity, although the proposed approach easily extends to other representation forms.
We obtain the one-hot representation via VQ-VAE, a discrete latent VAE model [van2017neural]. Specifically, an encoder E encodes an utterance x as a hidden vector h = E(x) of dimension d, and a decoder G reconstructs the original utterance from the latent representation. The difference from a standard VAE lies in that, between E and G, we build a discretization bottleneck using a nearest-neighbor lookup on an embedding table e: we obtain the latent index by finding the embedding vector in e with the closest Euclidean distance to h, i.e., idx = argmin_k ||h - e_k||. The learned latent z is a one-hot vector that has a 1 only at index idx. All components, including E, G, and the embedding table e, are jointly trained using an auto-encoding objective.
The structure of VQ-VAE is illustrated in Fig. 1(a), where the three components are marked in grey color.
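The discretization bottleneck above reduces to a nearest-neighbor lookup followed by a one-hot encoding. The following minimal Python sketch (the function name, list-based vectors, and the toy 5-entry codebook are our own illustrative assumptions, not the paper's implementation) shows the idea:

```python
import math

def nearest_code(h, codebook):
    """Snap encoder output h to the closest codebook embedding
    (Euclidean distance) and return its index plus a one-hot latent."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    idx = min(range(len(codebook)), key=lambda k: dist(h, codebook[k]))
    one_hot = [1 if k == idx else 0 for k in range(len(codebook))]
    return idx, one_hot

# Toy codebook: five 2-d embedding vectors (hypothetical values)
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
idx, z = nearest_code([0.9, 0.1], codebook)  # closest to [1.0, 0.0]
```

During training, gradients cannot flow through the argmin, which is why VQ-VAE uses a straight-through estimator; the sketch only covers the forward lookup.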
4 Proposed Model
To achieve better conditioned response generation for task-oriented dialogues, we propose multi-stage adaptive latent action learning (MALA). Our proposed model works for two scenarios: (i) For domains with dialogue state annotations, we utilize these annotations to learn semantic latent actions to enhance the conditioned response generation. (ii) For domains without state annotations, we transfer the knowledge of semantic latent actions learned from the domains with rich annotations, and thus can also enhance the conditioned response generation for these domains.
The overall framework of MALA is illustrated in Fig. 1. The proposed model is built on a VQ-VAE that contains the encoder, embedding table, and decoder. Besides the auto-encoding objective, we design a pointwise loss and a pairwise loss to enforce that the latent actions reflect the underlying intentions of system utterances. For domains with state annotations (see Fig. 1a), we train forward and inverse state tracking models to measure state transitions, and use them to develop the pointwise and pairwise losses (Sec. 4.2). For domains without state annotations (see Fig. 1b), we develop a pairwise loss based on the state tracking models from the annotation-rich domains. This loss measures state transitions for a cross-domain utterance pair, and thus can find and align shared actions across domains (Sec. 4.3). We then train a similarity prediction network to substitute for the state tracking models, taking only the raw text of utterances as input. We use its predictions as supervision to form pointwise and pairwise losses (see Fig. 1c), and thus obtain semantic latent actions for domains without state annotations (Sec. 4.4).
4.2 Stage-I: Semantic Latent Action Learning
We aim to learn semantic latent actions that align with the underlying intentions of system utterances. To capture the underlying intention effectively, we utilize dialogue state annotations and regard utterances that lead to similar state transitions as having the same intention. We train dialogue state tracking models to measure whether any two utterances lead to a similar state transition. We apply this measurement in (i) a pointwise manner, i.e., between a system utterance and its reconstructed counterpart from the VAE, and (ii) a pairwise manner, i.e., between two system utterances.
Dialogue State Tracking
Before presenting the proposed pointwise measure, we first briefly introduce the dialogue state tracking task. Dialogue states (also known as dialogue beliefs) take the form of predefined slot-value pairs. Dialogues with state (i.e., belief) annotations are represented as tuples (c_t, x_t, b_t), where b_t is the dialogue state at turn t, a binary vector over all J slot-value pairs. Dialogue state tracking (DST) is a multi-label learning task that models the conditional distribution P(b_t | c_t, x_t). Using the dialogue state annotations, we first train a state tracking model with the following cross-entropy loss:
Here the scoring function can be implemented in various ways, e.g., with self-attention models [zhong2018global] or an encoder-decoder [wu2019transferable].
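The DST objective is a multi-label cross-entropy over slot-value pairs. As a rough sketch (the function name, per-slot averaging, and toy values are our own assumptions, not the paper's exact formulation):

```python
import math

def dst_loss(pred, gold, eps=1e-10):
    """Multi-label cross-entropy for dialogue state tracking:
    `pred` holds one probability per slot-value pair,
    `gold` the corresponding 0/1 state annotation."""
    total = 0.0
    for p, y in zip(pred, gold):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(pred)

# Toy example: three slot-value pairs, two of them active
loss = dst_loss([0.9, 0.2, 0.8], [1, 0, 1])
```

A perfect prediction drives the loss towards zero, which is what makes the tracker usable later as a fixed measuring device for state transitions.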
With the trained state tracking model, we now measure whether the reconstructed utterance can lead to a similar dialogue state transition from turn t-1 to t (i.e., in forward order). We formulate this measure as a cross-entropy loss between the original state and the model outputs when the system utterance in the inputs is replaced with the reconstructed utterance.
Here the reconstructed utterance is sampled from the decoder output. Note that once the state tracking model finishes training, its parameters are no longer updated; it is only used for training the components of the VAE, i.e., the encoder, the decoder, and the embedding table. To obtain gradients for these components during back-propagation, we apply a continuous approximation trick [yang2018unsupervised]. Specifically, instead of feeding sampled discrete utterances to the state tracking model, we sample from a Gumbel-softmax distribution [jang2016categorical]. In this way, the decoder output becomes a sequence of probability vectors, and we can use standard back-propagation to train the generator:
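The continuous approximation can be sketched with a generic Gumbel-softmax sampler; the function name and temperature value below are illustrative, not the paper's code:

```python
import math, random

def gumbel_softmax(logits, tau=1.0, rng=random):
    """Draw a soft (differentiable) sample from a categorical
    distribution: a probability vector that concentrates on a
    single index as the temperature tau approaches zero."""
    # Gumbel(0, 1) noise via inverse transform sampling
    gumbels = [-math.log(-math.log(rng.random() + 1e-20) + 1e-20)
               for _ in logits]
    scores = [(l + g) / tau for l, g in zip(logits, gumbels)]
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# One soft token sample over a 3-word vocabulary (toy logits)
sample = gumbel_softmax([2.0, 0.5, -1.0], tau=0.5)
```

Feeding such probability vectors, rather than hard token ids, to the frozen tracker is what lets gradients reach the decoder.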
We expect the dialogue state transition in forward order to reflect the underlying intentions of system utterances. However, the forward state tracking model depends heavily on the user utterance, meaning that shifts in system utterance intentions may not sufficiently influence the model outputs. This prevents the modeled state transitions from providing valid supervision for semantic latent action learning. To address this issue, inspired by inverse models in reinforcement learning [pathak2017curiosity], we formulate inverse state tracking to model the dialogue state transition from turn t back to turn t-1. Since the dialogue state at turn t already encodes the information of the user utterance, we formulate inverse state tracking as predicting the previous state from the current state and the system utterance. In this way, the system utterance plays a more important role in determining the state transition. Specifically, we use the state annotations to train an inverse state tracking model with the following cross-entropy loss:
where the scoring function can be implemented with the same structure as in the forward model. The parameters of the inverse state tracking model also remain fixed once training is finished.
We use the inverse state tracking model to measure the similarity of the dialogue state transitions caused by a system utterance and its reconstructed counterpart. The formulation is similar to the forward order:
Thus, combining the dialogue state transitions modeled in both forward and inverse order, we get the full pointwise loss for learning semantic latent actions:
To learn semantic latent actions that can distinguish utterances with different intentions, we further develop a pairwise measure that estimates whether two utterances lead to similar dialogue state transitions.
With a slight abuse of notation, we use x and x' to denote two system utterances, and let the user utterance, dialogue context, and dialogue state associated with x serve as the remaining inputs to the forward and inverse state tracking models. We formulate a pairwise measurement of state transitions as
where KL is the Kullback-Leibler divergence. Both state tracking models take inputs related to x; intuitively, the measure captures how similar the tracking results are when x' replaces x as their input.
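If the tracker outputs are read as independent per-slot Bernoulli probabilities, the KL term can be sketched as below (this factorized form and the function name are our own assumptions, used only to make the measure concrete):

```python
import math

def kl_bernoulli(p, q, eps=1e-10):
    """KL divergence between two vectors of independent Bernoulli
    probabilities (one per slot-value pair), summed over slots."""
    total = 0.0
    for pi, qi in zip(p, q):
        pi = min(max(pi, eps), 1 - eps)
        qi = min(max(qi, eps), 1 - eps)
        total += (pi * math.log(pi / qi)
                  + (1 - pi) * math.log((1 - pi) / (1 - qi)))
    return total

# Identical predicted transitions -> near-zero divergence;
# opposite predictions -> large divergence.
same = kl_bernoulli([0.9, 0.1], [0.9, 0.1])
diff = kl_bernoulli([0.9, 0.1], [0.1, 0.9])
```

A small value indicates the two utterances drive the tracker to similar states, i.e., they plausibly share an intention.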
To encode the pairwise measure into semantic latent action learning, we first organize all N system utterances in the domains with state annotations into pairs, where N is the total number of such utterances. We then develop a pairwise loss that incorporates this measure on top of the VAE learning.
Here the sigmoid function is applied to the average of the forward and inverse pairwise measures, and distances are computed on encoder outputs. The pairwise loss trains the encoder by pushing the representations of two system utterances far apart when the two utterances lead to different state transitions, and close together otherwise.
The overall objective function of the semantic action learning stage is:
where the weights of the pointwise and pairwise losses are hyper-parameters. We use this objective to train the VAE with the discretization bottleneck and obtain utterance-action pairs (e.g., utterance (c) and its semantic latent action in Table 1) that encode the underlying intention of each system utterance in the domains with state annotations.
4.3 Stage-II: Action Alignment across Domains
In order to obtain utterance-action pairs in domains having no state annotations, we propose to progressively transfer the knowledge of semantic latent actions from those domains with rich state annotations. At this stage, we first learn semantic latent actions for the utterances that have co-existing intentions (i.e., shared actions) across domains.
We use x^s and x^t to denote system utterances in the source and target domains, respectively. The set of all utterances is the union of the two domains,
where N_s and N_t are the total numbers of utterances in each domain, respectively. We adopt the proposed pairwise measure to find the target-domain system utterances that have shared actions with the source domain. Based on the assumption that, although from different domains, utterances with the same underlying intention are expected to lead to similar state transitions, we formulate the pairwise measure of cross-domain utterance pairs as:
where the forward and inverse terms are computed using the state tracking models trained on the source domains. Since it only requires the trained dialogue state tracking models and the state annotations related to the source-domain utterance, this pairwise measure is asymmetrical. Thanks to this asymmetry, the cross-domain pairwise measure still works when we only have raw dialogue text in the target domain.
We then utilize the cross-domain pairwise measure for action alignment during latent action learning in the target domain, and formulate a loss incorporating action alignment:
where the distance is computed on the outputs of the same encoder from the stage-I VAE. We also use utterances in the target domain to formulate an auto-encoding loss:
The overall objective for stage-II is:
where the hyper-parameter is the same as in stage-I. With the VAE trained using this objective, we can obtain utterance-action pairs for system utterances in the domains that have no state annotations. However, for utterances with domain-specific intentions, the semantic latent actions are still unclear; we tackle this in stage-III.
4.4 Stage-III: Domain-specific Actions Learning
At this stage, we aim to learn semantic latent actions for utterances with domain-specific actions.
Similarity Prediction Network (SPN)
We train an utterance-level prediction model, SPN, to predict whether two utterances lead to similar state transitions, taking only the raw texts of the system utterances as input. Specifically, the SPN assigns a similarity score in [0, 1] to an utterance pair:
where the scoring function is implemented with the same structure as in the state tracking models. We use binary labels indicating whether two utterances have the same semantic latent action to train the SPN: the label is 1 if the two latent actions are identical, and 0 otherwise. To facilitate effective knowledge transfer, we obtain such labels from both source and target domains. We consider all pairs of source-domain utterances and obtain
We also consider pairs of target-domain utterances with shared actions: we first gather all target-domain utterances whose latent actions were aligned to the set of shared actions, and then obtain
Using all the collected pairwise training instances, we train the SPN via the loss
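The binary training labels for the SPN can be derived directly from the learned latent actions. A minimal sketch (the helper name and toy action identifiers are our own, purely illustrative):

```python
def pair_labels(actions):
    """Build binary similarity labels for all utterance pairs:
    label 1 if the two utterances share the same latent action,
    label 0 otherwise."""
    labels = {}
    n = len(actions)
    for i in range(n):
        for j in range(i + 1, n):
            labels[(i, j)] = 1 if actions[i] == actions[j] else 0
    return labels

# Toy latent actions for three utterances: two share an action
labels = pair_labels(["req_area", "req_area", "req_type"])
```

Enumerating all pairs this way is quadratic in the number of utterances; in practice one would subsample pairs, but the labeling rule is the same.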
We then use the trained SPN to replace the state tracking models in both the pointwise and the pairwise measures. Specifically, we formulate the following pointwise loss
which enforces the reconstructed utterances to bring similar dialogue state transitions as the original utterance. We further formulate the pairwise loss as
The overall objective function for stage-III is:
4.5 Conditioned Response Generation
After obtaining semantic latent actions, we train the two components of conditioned response generation, dialogue planning and surface realization. Specifically, we first train a surface realization model that learns to translate a semantic latent action into fluent text in context:
Then we optimize a dialogue planning model while keeping the parameters of the surface realization model fixed:
In this way, response generation is factorized into planning followed by realization, where dialogue planning and surface realization are optimized without impinging on each other.
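The two-step factorization can be sketched with stand-in planning and realization models; all names and toy values below are our own, purely illustrative:

```python
def generate_response(context, plan, realize, num_actions=5):
    """Two-step conditioned generation sketch:
    1) dialogue planning picks the most probable latent action
       given the context;
    2) surface realization maps (action, context) to a response.
    `plan` and `realize` stand in for the trained models."""
    probs = plan(context)                 # distribution over latent actions
    z = max(range(num_actions), key=lambda k: probs[k])
    return realize(z, context)           # text conditioned on action z

# Toy stand-in models (hypothetical values)
plan = lambda c: [0.1, 0.6, 0.1, 0.1, 0.1]
realize = lambda z, c: {1: "Which area would you like?"}.get(z, "...")
reply = generate_response("hotel search", plan, realize)
```

The point of the factorization is visible here: swapping in a better planner changes which action is chosen without retraining the realizer, and vice versa.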
To show the effectiveness of MALA, we consider two experiment settings: multi-domain joint training and cross-domain response generation (Sec. 5.1). We compare against the state-of-the-art on two multi-domain datasets in both settings (Sec. 5.2). We analyze the effectiveness of semantic latent actions and the multi-stage strategy of MALA under different supervision proportions (Sec. 5.3).
We use two multi-domain human-human conversational datasets: (1) the SMD dataset [eric2017key] contains 2425 dialogues in three domains: calendar, weather, and navigation; (2) the MultiWOZ dataset [budzianowski2018multiwoz] is the largest existing task-oriented corpus, spanning seven domains, with 8438 dialogues in total and 13.7 turns per dialogue on average. We use only five of the seven domains, i.e., restaurant, hotel, attraction, taxi, and train, since the other two contain far fewer dialogues in the training set and do not appear in the test set. This setting is also adopted in the study of dialogue state tracking transfer [wu2019transferable]. Both datasets contain dialogue state annotations.
We use Entity-F1 [eric2017key] to evaluate dialogue task completion; it computes the F1 score by comparing entities in delexicalized form. Compared to the inform and success rates originally used on MultiWOZ by Budzianowski et al. (2018), Entity-F1 considers informed and requested entities at the same time and balances recall and precision. We use BLEU [papineni2002bleu] to measure the language quality of generated responses. We use a three-layer transformer [vaswani2017attention] with a hidden size of 128 and 4 heads as the base model.
Note that w/o and w/ Action indicate whether a baseline uses conditioned generation.
Multi-domain Joint Training
In this setting, we train MALA and all baselines on the full training set, i.e., using complete dialogue data and dialogue state annotations. We use the original training, validation, and test splits of the SMD and MultiWOZ datasets. We compare with the following baselines that do not use conditioned generation: (1) KVRN [eric2017key]; (2) Mem2seq [madotto2018mem2seq]; (3) Sequicity [lei2018sequicity]; and with two baselines that adopt conditioned generation: (4) LIDM [wen2017latent]; (5) LaRL [zhao2019rethinking]. For a thorough comparison, we include the results of the proposed model after one, two, and all three stages, denoted as MALA-(S1/S2/S3), in both settings.
Cross-domain Response Generation
In this setting, we adopt a leave-one-out approach on each dataset. Specifically, we use one domain as the target domain and the others as source domains, giving three and five possible configurations for SMD and MultiWOZ, respectively. For each configuration, only 1% of the dialogues in the target domain are available for training, and these dialogues have no state annotations. We compare with Sequicity and LaRL using two training schemes for cross-domain response generation. (We also considered DAML [qian2019domain], but its empirical results are worse than those of target only and fine tuning.) (1) Target only: models are trained only on dialogues in the target domain. (2) Fine tuning: models are first trained on the source domains and then fine-tuned on dialogues in the target domain.
5.2 Overall Results
Multi-Domain Joint Training
Table 2 shows that our proposed model consistently outperforms the other models in the joint training setting. MALA improves dialogue task completion (measured by Entity-F1) while maintaining high language generation quality (measured by BLEU). For example, MALA-S3 (76.8) outperforms LaRL (71.3) by 7.71% under Entity-F1 on MultiWOZ, and has the highest BLEU score. Meanwhile, MALA benefits substantially from stage-I and stage-II in the joint learning setting. For example, MALA-S1 and MALA-S2 achieve 9.25% and 10.43% improvements over LIDM under Entity-F1 on SMD. This is largely because, with complete dialogue state annotations, MALA can learn semantic latent actions in each domain at stage-I, and the action alignment at stage-II reduces the action space for learning dialogue policy more effectively by finding shared actions across domains. We further find that LIDM and LaRL perform worse than Sequicity on SMD. The reason is that system utterances in SMD are shorter and more varied in expression, making it challenging to capture underlying intentions from surface realization alone. MALA overcomes this challenge by considering dialogue state transitions beyond surface realization in semantic latent action learning.
Cross-Domain Response Generation
The results on SMD and MultiWOZ are shown in Tables 3 and 4, respectively. MALA significantly outperforms the baselines on both datasets. For example, on MultiWOZ, MALA-S3 outperforms LaRL by 47.5% and 55.7% under Entity-F1 when using train and hotel as the target domain, respectively. We also find that each stage of MALA is essential in cross-domain generation scenarios. For example, on MultiWOZ with attraction as the target domain, stage-III and stage-II bring 14.7% and 15.8% improvements over the preceding stage, and MALA-S1 outperforms fine-tuned LaRL by 27.0% under Entity-F1. The contribution of each stage varies with the target domain; we discuss this in detail in the following section. Comparing the fine-tuning and target-only results of LaRL shows that latent actions based on lexical similarity do not generalize well in the cross-domain setting. For example, fine-tuned LaRL achieves less than a 3% improvement over the target-only result under Entity-F1 on MultiWOZ with attraction as the target domain.
We first study the effect of each stage of MALA on cross-domain dialogue generation. We compare MALA-(S1/S2/S3) with fine-tuned LaRL under different proportions of target-domain dialogues. The results are shown in Fig. 2(a) and 2(b). The performance gain of MALA is largely attributable to stage-III when using restaurant as the target domain, and to stage-II when using taxi. This is largely because there are many shared actions between the taxi and train domains, so many utterance-action pairs learned by action alignment at stage-II already capture the underlying intentions of system utterances. On the other hand, since restaurant does not share many actions with the other domains, MALA relies more on the similarity prediction network to provide supervision at stage-III.
Last, we study the effect of semantic latent actions in both the joint training and cross-domain generation settings. To investigate how the pointwise and pairwise measures contribute to capturing utterance intentions, we compare the results of MALA without the pointwise loss (MALA-PT) and without the pairwise loss (MALA-PR) under varying amounts of dialogue state annotations. The multi-domain joint training results under Entity-F1 on SMD are shown in Fig. 3(a). Both measures are important: for example, when using 55% of the state annotations, encoding the pointwise and pairwise measures brings 5.9% and 8.0% improvements, respectively. The cross-domain generation results in Fig. 3(b) show that these two measures are also essential for obtaining semantic latent actions in the target domain.
We propose multi-stage adaptive latent action learning (MALA) for better conditioned response generation. We develop a novel dialogue state transition measure for learning semantic latent actions, and demonstrate how to effectively generalize semantic latent actions to domains having no state annotations. The experimental results confirm that MALA achieves better task completion and language quality than the state-of-the-art under both in-domain and cross-domain settings. For future work, we will explore the potential of semantic latent action learning for applications with no state annotations at all.
We would like to thank Xiaojie Wang for his help. This work is supported by the Australian Research Council (ARC) Discovery Project DP180102050 and China Scholarship Council (CSC) Grant #201808240008.