Adversarial Domain Adaptation for Variational Neural Language Generation in Dialogue Systems

08/08/2018 ∙ by Van-Khanh Tran, et al. ∙ JAIST

Domain adaptation arises when we aim to learn, from a source domain, a model that can perform acceptably well on a different target domain. It is especially crucial for Natural Language Generation (NLG) in Spoken Dialogue Systems when there are sufficient annotated data in the source domain but only limited labeled data in the target domain. How to effectively utilize as much existing knowledge from the source domain as possible is a crucial issue in domain adaptation. In this paper, we propose an adversarial training procedure to train a variational encoder-decoder based language generator via multiple adaptation steps. In this procedure, a model is first trained on source domain data and then fine-tuned on a small set of target domain utterances under the guidance of two proposed critics. Experimental results show that the proposed method can effectively leverage the existing knowledge in the source domain to adapt to another related domain by using only a small amount of in-domain data.




1 Introduction

This work is licensed under a Creative Commons Attribution 4.0 International License. Traditionally, Spoken Dialogue Systems (SDS) are developed for specific domains, including finding a hotel or searching for a restaurant [Wen et al.2015a], buying a TV or laptop [Wen et al.2015b], flight reservations [Levin et al.2000], etc. Such systems typically require a well-defined ontology, which is essentially a structured representation of the data that the dialogue system can converse about. Statistical approaches to multi-domain SDS have shown promising results in how to reuse data efficiently in a domain-scalable framework [Young et al.2013]. Mrkšić et al. (2015) addressed the question of multi-domain SDS belief tracking by training a general model and adapting it to each domain.

Recently, Recurrent Neural Network (RNN) based methods have shown improving results in tackling the domain adaptation issue [Chen et al.2015, Shi et al.2015, Wen et al.2016a, Wen et al.2016b]. Such generators have also achieved promising results when provided with adequate annotated datasets [Wen et al.2015b, Wen et al.2015a, Tran et al.2017, Tran and Nguyen2017a, Tran and Nguyen2017b].

More recently, the development of the variational autoencoder (VAE) framework [Kingma and Welling2013, Rezende and Mohamed2015] has paved the way for learning large-scale, directed latent variable models. This has brought considerable benefits to natural language processing [Bowman et al.2015, Miao et al.2016, Purushotham et al.2017, Mnih and Gregor2014] and dialogue systems [Wen et al.2017, Serban et al.2017].

This paper presents an adversarial training procedure to train a variational neural language generator via multiple adaptation steps, which enables the generator to learn more efficiently when in-domain data is in short supply. In summary, we make the following contributions: (1) We propose a variational approach to the NLG problem which enables the generator to adapt faster to a new, unseen domain despite scarce target resources; (2) We propose two critics in an adversarial training procedure, which can guide the generator to produce outputs that resemble the sentences drawn from the target domain; (3) We propose a unifying variational domain adaptation architecture which performs acceptably well in a new, unseen domain by using a limited amount of in-domain data; (4) We investigate the effectiveness of the proposed method in different scenarios, including ablation, domain adaptation, training from scratch, and unsupervised training with varying amounts of data.

2 Related Work

Generally, domain adaptation involves two different types of datasets, one from a source domain and the other from a target domain. The source domain typically contains a sufficient amount of annotated data such that a model can be efficiently built, while there is often little or no labeled data in the target domain. Domain adaptation for NLG has been less studied despite its important role in developing multi-domain SDS. Walker et al. (2001) proposed a SPoT-based generator to address domain adaptation problems. Subsequent systems focused on tailoring to user preferences [Walker et al.2007] and controlling user perceptions of linguistic style [Mairesse and Walker2011]. Later work includes a phrase-based statistical generator using graphical models and active learning [Mairesse et al.2010], and a multi-domain procedure via data counterfeiting and discriminative training [Wen et al.2016a].

Neural variational frameworks for generative models of text have been studied extensively. Chung et al. (2015) proposed VRNN, a recurrent latent variable model for sequential data, which integrates latent random variables into the hidden state of an RNN. A hierarchical multiscale recurrent neural network was proposed to learn both hierarchical and temporal representations [Chung et al.2016]. Zhang et al. (2016) introduced a variational neural machine translation model that incorporates a continuous latent variable to model the underlying semantics of sentence pairs. Bowman et al. (2015) presented a variational autoencoder for an unsupervised generative language model.

Adversarial adaptation methods reduce the difference between the training and test domain distributions, and thus improve generalization performance despite the presence of domain shift or dataset bias; they have shown promising improvements in many machine learning applications. Tzeng et al. (2017) proposed an improved unsupervised domain adaptation method that learns a discriminative mapping of target images to the source feature space by fooling a domain discriminator that tries to differentiate the encoded target images from source examples. We borrow the idea of [Ganin et al.2016], where a domain-adversarial neural network is proposed to learn features that are discriminative for the main learning task on the source domain, yet indiscriminate with respect to the shift between domains.

3 Variational Domain-Adaptation Neural Language Generator

Drawing inspiration from the variational autoencoder [Kingma and Welling2013], and assuming that there exists a continuous latent variable z from an underlying semantic space of Dialogue Act (DA) and utterance pairs (d, y), we explicitly model this space together with the variable d to guide the generation process, i.e., p(y, z | d). With this assumption, the original conditional probability is reformulated as follows:

    p(y | d) = ∫_z p(y, z | d) dz = ∫_z p(y | z, d) p(z | d) dz    (1)

This latent variable enables us to model the underlying semantic space as a global signal for generation, in which the variational lower bound of the variational generator can be formulated as follows:

    L_VAE(θ, φ; y, d) = −KL(q_φ(z | d, y) ∥ p_θ(z | d)) + E_{q_φ(z | d, y)}[log p_θ(y | z, d)]    (2)

where p_θ(z | d) is the prior model, q_φ(z | d, y) is the posterior approximator, p_θ(y | z, d) is the decoder with guidance from the global signal z, and KL(Q ∥ P) is the Kullback-Leibler divergence between Q and P.
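For two diagonal Gaussians, the KL term in this lower bound has a closed form. A minimal pure-Python sketch (illustrative only; the function name and log-variance parameterization are ours, not the paper's):

```python
import math

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians,
    summed over dimensions; variances are passed as log-variances for stability."""
    kl = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        kl += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return kl

# KL between identical distributions is zero
print(kl_diag_gaussians([0.0, 1.0], [0.0, 0.0], [0.0, 1.0], [0.0, 0.0]))  # → 0.0
```

In the model, q would be the posterior approximator and p the prior, so minimizing this term pulls the posterior toward the DA-conditioned prior.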

Figure 1: The VDANLG architecture which consists of two main components: the VRALSTM to generate the sentence and two Critics with an adversarial training procedure to guide the model in domain adaptation.

3.1 Variational Neural Encoder

The variational neural encoder aims at encoding a given input sequence into continuous vectors. In this work, we use a 1-layer Bidirectional LSTM (BiLSTM) to encode the sequence embedding. The BiLSTM consists of forward and backward LSTMs, which read the sequence from left-to-right and right-to-left to produce a forward and a backward sequence of hidden states, respectively. We then obtain the sequence of encoded hidden states by concatenating the forward and backward states at each position. We utilize this encoder to represent both the sequence of slot-value pairs in a given Dialogue Act and the corresponding utterance (see the red parts in Figure 1). We finally apply mean-pooling over the BiLSTM hidden vectors to obtain a fixed-size representation. The encoder accordingly produces both the DA representation vector, which flows into the inferer and decoder, and the utterance representation, which streams to the posterior approximator.
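The per-position concatenation and mean-pooling steps can be sketched in isolation (pure Python with toy vectors; a real implementation would obtain the hidden states from an LSTM library):

```python
def concat_states(forward, backward):
    """Concatenate forward and backward hidden states per time step: h_t = [fw_t; bw_t]."""
    return [f + b for f, b in zip(forward, backward)]

def mean_pool(states):
    """Mean-pool a sequence of hidden vectors into one fixed-size representation."""
    dim = len(states[0])
    return [sum(s[i] for s in states) / len(states) for i in range(dim)]

fw = [[1.0, 2.0], [3.0, 4.0]]   # toy forward LSTM states
bw = [[0.0, 1.0], [1.0, 0.0]]   # toy backward LSTM states
h = concat_states(fw, bw)       # [[1, 2, 0, 1], [3, 4, 1, 0]]
print(mean_pool(h))             # → [2.0, 3.0, 0.5, 0.5]
```

The same two helpers would serve both the DA representation and the utterance representation, since the paper uses one encoder for both.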

3.2 Variational Neural Inferer

In this section, we describe our approach to model both the prior and the posterior by utilizing neural networks.

Neural Posterior Approximator

Modeling the true posterior p(z | d, y) is usually intractable, and traditional mean-field approaches fail to capture it due to their oversimplified assumptions. Following [Kingma and Welling2013], we employ a neural network to approximate the posterior distribution of z and simplify posterior inference. We assume the approximation takes the following form:

    q_φ(z | d, y) = N(z; μ, σ²I)    (3)

where the mean μ and standard deviation σ are the outputs of neural networks based on the representations h_d and h_y of d and y. A non-linear transformation g first projects both DA and utterance representations onto the latent space:

    h_z = g(W_z [h_d; h_y] + b_z)    (4)

where W_z and b_z are matrix and bias parameters respectively, the dimensionality of h_z is that of the latent space, and g(·) is an element-wise activation function. In this latent space, we obtain the diagonal Gaussian distribution parameters μ and σ through linear regression:

    μ = W_μ h_z + b_μ,    log σ² = W_σ h_z + b_σ    (5)

where μ and log σ² are both vectors of the latent dimensionality.

Neural Prior Model

We model the prior as follows:

    p_θ(z | d) = N(z; μ′(d), σ′(d)²I)    (6)

where μ′ and σ′ of the prior are neural models based on the DA representation only; they take the same form as those of the posterior in Eq. 4 and Eq. 5, except for the absence of h_y. To acquire a representation of the latent variable z, we utilize the same technique as proposed for the VAE [Kingma and Welling2013] and re-parameterize it as follows:

    z = μ + σ ⊙ ε,    ε ∼ N(0, I)    (7)

In addition, we set z to be the mean of the prior, i.e., z = μ′, during decoding due to the absence of the utterance y. Intuitively, by parameterizing the hidden distribution this way, we can back-propagate the gradient to the parameters of the encoder and train the whole network with stochastic gradient descent. Note that the parameters of the prior and the posterior are independent of each other.
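The re-parameterization and the decoding-time fallback to the prior mean can be sketched as follows (pure Python; the function signature is ours):

```python
import random

def reparameterize(mu, sigma, training=True):
    """z = mu + sigma * eps with eps ~ N(0, I) during training;
    at decoding time z falls back to the (prior) mean, since the utterance
    y is absent and no posterior sample can be drawn."""
    if not training:
        return list(mu)
    return [m + s * random.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

print(reparameterize([0.5, -0.2], [0.1, 0.3], training=False))  # → [0.5, -0.2]
```

Because the noise enters additively, gradients flow through mu and sigma to the encoder parameters, which is exactly what makes stochastic gradient training possible here.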

In order to integrate the latent variable z into the decoder, we use a non-linear transformation to project it onto the output space for generation:

    h_e = g(W_e z + b_e)    (8)

where W_e and b_e are weight and bias parameters. It is important to note that, due to the sample noise ε, the representation h_e is not fixed for the same input DA and model parameters. This helps the model learn to quickly adapt to a new domain (see Table 1-(a) and Table 3, sec. 3).

3.3 Variational Neural Decoder

Given a DA d and the latent variable z, the decoder calculates the probability of the generated utterance y as a joint probability of ordered conditionals:

    p(y | z, d) = ∏_t p(y_t | y_<t, z, d)    (9)

In this paper, we borrow the computational RNN cell from [Tran and Nguyen2017a], where RNN(·) = RALSTM(·), with a slight modification in order to integrate the representation of the latent variable, i.e., h_e, into the RALSTM cell, which is denoted by the bold dashed orange arrow in Figure 1-(iii). We modify the cell calculation so that the input, forget, and output gates also receive the latent projection h_e alongside the cell input and the previous hidden state:

    (i_t, f_t, o_t, ĉ_t) = (σ, σ, σ, tanh)(W [x_t; h_{t−1}; h_e])    (10)

where i_t, f_t, o_t are the input, forget, and output gates respectively, and W is a model parameter.

The resulting Variational RALSTM (VRALSTM) model is shown in Figure 1-(i), (ii), (iii), in which the latent variable can affect the hidden representation through the gates. This allows the model to indirectly take advantage of the underlying semantic information from the latent variable z. In addition, when the model learns to adapt to a new domain with unseen dialogue acts, the semantic representation can help to guide the generation process (see sec. 6.3 for details).
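The gate modification can be sketched as a standard LSTM-style step whose gates also see the latent projection (a simplified scalar stand-in for the RALSTM cell; the weight dictionary and additive combination are illustrative assumptions, not the paper's exact parameterization):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def vralstm_cell_step(x, h_prev, c_prev, z_e, w):
    """One recurrent step in which the input, forget, and output gates all
    receive the latent-variable projection z_e in addition to x and h_prev."""
    pre = x + h_prev + z_e          # latent signal enters every gate
    i = sigmoid(w["i"] * pre)       # input gate
    f = sigmoid(w["f"] * pre)       # forget gate
    o = sigmoid(w["o"] * pre)       # output gate
    c_hat = math.tanh(w["c"] * pre)
    c = f * c_prev + i * c_hat      # new cell state
    h = o * math.tanh(c)            # new hidden state
    return h, c

h, c = vralstm_cell_step(x=1.0, h_prev=0.0, c_prev=0.0, z_e=0.5,
                         w={"i": 1.0, "f": 1.0, "o": 1.0, "c": 1.0})
print(round(h, 4), round(c, 4))
```

Because z_e enters every gate, the underlying semantics carried by the latent variable can modulate what the cell reads, keeps, and emits at each step.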

3.4 Critics

In this section, we introduce a text-similarity critic and a domain critic to guarantee, as much as possible, that the generated sentences resemble the sentences drawn from the target domain.

Text similarity critic

To check the relevance between sentence pairs across the two domains, and to encourage the model to generate sentences in a style highly similar to those in the target domain, we propose a Text Similarity Critic (SC) that classifies a pair as 1 (similar) or 0 (dissimilar) in text style. The SC model consists of two parts: a BiLSTM shared with the variational neural encoder to represent one sentence, and a second BiLSTM to encode the other. The SC model takes as input the pairs ([target], source), ([target], generated), and ([generated], source). Note that we give priority to encoding the sentence in [.] using the shared BiLSTM, which guides the model to learn the sentence style from the target domain and also contributes target domain information to the global latent variables. We further utilize Siamese recurrent architectures [Neculoiu et al.2016] for learning sentence similarity, which allow us to learn useful representations with limited supervision.

Domain critic

In consideration of the shift between domains, we introduce a Domain Critic (DC) to classify an example as belonging to the source, target, or generated domain. Drawing inspiration from the work of [Ganin et al.2016], we model the DC with a gradient reversal layer and two standard feed-forward layers. It is important to note that our DC model shares parameters with the variational neural encoder and the variational neural inferer. The DC model takes as input a pair of a given DA and corresponding utterance, concatenates their representation with the latent variable in the output space, and passes the result through a feed-forward layer and a 3-label classifier. In addition, the gradient reversal layer, which multiplies the gradient by a specific negative value during back-propagation, ensures that the feature distributions over the two domains are made as similar and indistinguishable as possible for the domain critic, resulting in domain-invariant features.
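Conceptually, the gradient reversal layer is an identity map in the forward pass that scales incoming gradients by a negative factor in the backward pass. A framework-free sketch of the idea (real implementations hook into an autograd system, e.g. a custom backward function):

```python
class GradientReversal:
    """Identity in the forward pass; multiplies incoming gradients by -lambda_
    in the backward pass, so the shared encoder is pushed to produce features
    that *fool* the domain critic rather than help it."""

    def __init__(self, lambda_=1.0):
        self.lambda_ = lambda_

    def forward(self, x):
        return x  # input left unchanged during forward propagation

    def backward(self, grad_output):
        return [-self.lambda_ * g for g in grad_output]

grl = GradientReversal(lambda_=0.5)
print(grl.forward([1.0, 2.0]))    # → [1.0, 2.0]
print(grl.backward([1.0, -2.0]))  # → [-0.5, 1.0]
```

This is what lets a single backward pass train the critic to discriminate domains while simultaneously training the shared encoder toward domain-invariant features.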

4 Training Domain Adaptation Model

Given training instances, each represented by a pair of DA and sentence, from the rich source domain and the limited target domain, the task aims at finding a set of parameters that can perform acceptably well on the target domain.

4.1 Training Critics

We state the training objectives of SC and DC as follows. For SC, the goal is to classify a sentence pair as 1 (similar) or 0 (dissimilar) in textual style. This procedure can be formulated as a supervised classification training objective:

    L_SC(ψ) = −(1/n) Σ_{i=1}^{n} log C_SC(l_i | pair_i; ψ)    (11)

where n is the number of sentence pairs, ψ denotes the model parameters of SC, l_i ∈ {0, 1} is the similarity label, and generated sentences ŷ are drawn from the current generator given a target domain dialogue act d_t. The scalar probability C_SC(1 | y_t, ŷ) indicates how relevant a generated sentence ŷ is to a target sentence y_t.

The DC critic aims at classifying a DA-utterance pair as coming from the source, target, or generated domain. This can also be formulated as a supervised classification training objective as follows:

    L_DC(ω) = −(1/n) Σ_{i=1}^{n} log C_DC(l_i | d_i, y_i; ω)    (12)

where ω denotes the model parameters of DC, l_i ∈ {source, target, generated} is the domain label, and (d_s, y_s), (d_t, y_t) are the DA-utterance pairs from the source and target domains, respectively. Note also that the scalar probability C_DC(target | d, y) indicates how likely the DA-utterance pair (d, y) is to come from the target domain.

4.2 Training Variational Generator

We utilize the Monte Carlo method to approximate the expectation over the posterior in Eq. 2, i.e., E_{q_φ(z|d,y)}[log p_θ(y | z, d)] ≈ (1/M) Σ_{m=1}^{M} log p_θ(y | z^(m), d), where M is the number of samples. In this study, the joint training objective for a training instance (d, y) is formulated as follows:

    L(θ, φ; d, y) = −KL(q_φ(z | d, y) ∥ p_θ(z | d)) + (1/M) Σ_{m=1}^{M} Σ_t log p_θ(y_t | y_<t, z^(m), d)    (13)

where z^(m) = μ + σ ⊙ ε^(m) and ε^(m) ∼ N(0, I). The first term is the KL divergence between two Gaussian distributions, and the second term is the approximated expectation. We simply set M = 1, which degenerates the second term to the objective of a conventional generator. Since the objective function in Eq. 13 is differentiable, we can jointly optimize the parameter θ and variational parameter φ using standard gradient ascent techniques.
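With a single sample the expectation reduces to one reparameterized draw. A sketch of the Monte Carlo estimate (the log-likelihood function here is a toy placeholder for the decoder's output, and the function names are ours):

```python
import random

def elbo_estimate(kl_term, log_likelihood_fn, mu, sigma, num_samples=1):
    """L ≈ -KL + (1/M) * sum_m log p(y | z_m, d), with z_m = mu + sigma * eps_m."""
    total = 0.0
    for _ in range(num_samples):
        z = [m + s * random.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
        total += log_likelihood_fn(z)
    return -kl_term + total / num_samples

# Toy decoder log-likelihood that prefers z near zero (purely illustrative)
loglik = lambda z: -sum(v * v for v in z)
print(elbo_estimate(kl_term=0.1, log_likelihood_fn=loglik, mu=[0.0], sigma=[0.0]))  # → -0.1
```

Since the estimate is differentiable in mu and sigma through the reparameterized samples, the same gradient-ascent step updates both the generator and the variational parameters.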

4.3 Adversarial Training

Our domain adaptation architecture is shown in Figure 1, in which the generator and the two critics are trained jointly by pursuing competing goals as follows. Given a dialogue act in the target domain, the generator generates candidate sentences y. The generator prefers a "good" generated sentence y that the critics score as similar in style and likely to come from the target domain; the critics, in contrast, are trained to assign such generated sentences low scores. We propose a domain-adversarial training procedure to iteratively update the generator and critics, as described in Algorithm 1. While the parameters of the generator are optimized to minimize its loss on the training set, the parameters of the critics are optimized to minimize the error of text similarity and to maximize the loss of the domain classifier.

Require: generator , domain critic , text similarity critic , generated sentence ;
Input: DA-utterance pairs of source , target ;
1 Pretrain on using VRALSTM;
2 while  has not converged do
3       for i = 0, ..,  do
4             Sample from source domain;
5             ()-Compute using Eq. 12 for and ;
6             ()-Adam update of for using ;
7             ()-Compute using Eq. 13
8             ()-Adam update of for using
9             ()-Compute using Eq. 11 for ;
10             ()-Adam update of for using ;
11             , where ;
12             Choose top k best sentences of ;
13             for j = 1,..,k do
14                   (), () steps for with ;
15                   (), () steps for with and ;
16             end for
17       end for
18 end while
Algorithm 1 Adversarial Training Procedure

Generally, at each training iteration the current generator takes a target dialogue act as input to over-generate a set of candidate sentences (step 11). After re-ranking, we choose the top k best sentences in the set (step 12) and measure how "good" the generated sentences are by using the critics (steps 14-15). These "good" signals from the critics can guide the generator, step by step, to generate outputs which resemble the sentences drawn from the target domain. Note that the re-ranking step is important for separating the "correct" sentences from the current generated outputs by penalizing generated sentences which have redundant or missing slots.
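The slot-based re-ranking can be sketched as a simple penalty over missing and redundant slot mentions (a common NLG re-ranking heuristic; the scoring weight and function names are illustrative, not the paper's exact formula):

```python
def rerank(candidates, required_slots, penalty=1.0):
    """Sort generated sentences so that candidates with missing or redundant
    slot mentions are pushed down. Each candidate is (sentence, model_score)."""
    def score(item):
        sentence, model_score = item
        missing = sum(1 for s in required_slots if s not in sentence)
        counts = [sentence.count(s) for s in required_slots]
        redundant = sum(c - 1 for c in counts if c > 1)
        return model_score - penalty * (missing + redundant)
    return sorted(candidates, key=score, reverse=True)

cands = [
    ("the tecra has 4 gb memory", 0.9),               # missing the 'business' slot
    ("the tecra has 4 gb memory for business", 0.8),  # all required slots present
]
best = rerank(cands, required_slots=["4 gb", "business"])
print(best[0][0])  # → the tecra has 4 gb memory for business
```

Even though the first candidate has the higher model score, the missing-slot penalty moves the complete sentence to the top, which is the behavior the re-ranking step relies on.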

5 Experiments

We conducted experiments on the proposed models in different scenarios: Adaptation, Scratch, and All, using several model architectures, evaluation metrics, datasets [Wen et al.2016a], and configurations (see Appendix A).

The KL cost annealing strategy [Bowman et al.2015] encourages the model to encode meaningful representations into the latent vector z: we gradually anneal the weight of the KL term during training. This helps our model to achieve solutions with a non-zero KL term.
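A linear warm-up is the usual way to implement KL cost annealing; the sketch below assumes a 0-to-1 linear ramp and a warm-up length of 10,000 steps, neither of which is stated in this excerpt:

```python
def kl_weight(step, warmup_steps=10000):
    """Anneal the KL term's weight linearly from 0 to 1 over the first
    `warmup_steps` updates, then hold it at 1 (assumed schedule)."""
    return min(1.0, step / warmup_steps)

print(kl_weight(0))      # → 0.0
print(kl_weight(5000))   # → 0.5
print(kl_weight(20000))  # → 1.0
```

Starting the weight near zero lets the decoder learn to reconstruct before the KL term starts pulling the posterior toward the prior, which is what keeps the KL from collapsing to zero.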

The gradient reversal layer [Ganin et al.2016] leaves the input unchanged during forward propagation and reverses the gradient by multiplying it with a negative scalar −λ during back-propagation. The domain adaptation parameter λ gradually increases from 0 to 1 over the course of training, following a schedule that is a function of the training progress p and a constant γ for each training step. This strategy allows the Domain Critic to be less sensitive to noisy signals at the early training stages.
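The schedule from the cited work ramps λ smoothly from 0 to 1; a sketch assuming the form and constant used by Ganin et al. (2016), since this excerpt does not state the exact values:

```python
import math

def domain_adaptation_lambda(progress, gamma=10.0):
    """Ganin-style schedule: lambda = 2 / (1 + exp(-gamma * p)) - 1, rising
    from 0 to 1 as training progress p goes from 0 to 1. gamma = 10 follows
    Ganin et al. (2016); the paper's own constant is not given here."""
    return 2.0 / (1.0 + math.exp(-gamma * progress)) - 1.0

print(domain_adaptation_lambda(0.0))  # → 0.0
print(round(domain_adaptation_lambda(1.0), 4))
```

At the start of training λ is near zero, so the noisy early domain-critic gradients barely perturb the shared encoder; λ only approaches 1 once the critic's signal is reliable.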

6 Results and Analysis

6.1 Integrating Variational Inference

We compare the original RALSTM model with its modification by integrating variational inference (VRALSTM), as demonstrated in Table 2 and Table 1-(a). The VRALSTM not only preserves the power of the original RALSTM on the generation task, since its performance is very competitive with that of RALSTM, but also provides compelling evidence of adaptability to a new, unseen domain when the target domain data is scarce. Table 3, sec. 3 further shows the necessity of the variational integration, since the VRALSTM achieves a significant improvement over the RALSTM in the Scratch scenario, and of the adversarial domain adaptation algorithm: although both the RALSTM and VRALSTM models perform well when provided with sufficient in-domain training data (Table 2), their performance is severely impaired when training from Scratch with only limited data. These results indicate that the proposed variational method can learn the underlying semantics of DA-utterance pairs in the source domain via the representation of the latent variable z, so that when adapting to another domain, the models can leverage the existing knowledge to guide the generation process.

Source \ Target(Test) | R2H (Hotel) | H2R (Restaurant) | L2T (Tv) | T2L (Laptop)
Hotel | - - | 0.5931 12.50% | 0.4183 2.38% | 0.3426 13.02%
Restaurant | 0.6224 1.99% | - - | 0.4211 2.74% | 0.3540 13.13%
Tv | 0.6153 4.30% | 0.5835 14.49% | - - | 0.3630 7.44%
Laptop | 0.6042 5.22% | 0.5598 15.61% | 0.4268 1.05% | - -
(a) Results on Laptop when adapting models trained on [Restaurant+Hotel] data. (b) Results (BLEU, ERR) evaluated on (Test) domains by unsupervised adaptation of VDANLG from Source domains using only 10% of the Target domain Counterfeit X2Y. {X,Y} = R: Restaurant, H: Hotel, T: Tv, L: Laptop.
Table 1: Results when adapting models trained on (a) union, and (b) counterfeiting datasets.
Model \ Target | Hotel | Restaurant | Tv | Laptop
HLSTM [Wen et al.2015a] | 0.8488 2.79% | 0.7436 0.85% | 0.5240 2.65% | 0.5130 1.15%
SCLSTM [Wen et al.2015b] | 0.8469 3.12% | 0.7543 0.57% | 0.5235 2.41% | 0.5109 0.89%
Enc-Dec [Wen et al.2016b] | 0.8537 4.78% | 0.7358 2.98% | 0.5142 3.38% | 0.5101 4.24%
RALSTM [Tran and Nguyen2017a] | 0.8911 0.48% | 0.7739 0.19% | 0.5376 0.65% | 0.5222 0.49%
VRALSTM (Ours) | 0.8851 0.57% | 0.7709 0.36% | 0.5356 0.73% | 0.5210 0.59%
Table 2: Results (BLEU, slot error rate ERR) evaluated on Target domains by training models from scratch with All in-domain data.

Source \ Target | Hotel (BLEU ERR) | Restaurant (BLEU ERR) | Tv (BLEU ERR) | Laptop (BLEU ERR)
sec. 1 — no Critics:
Hotel | - - | 0.6814 11.62% | 0.4968 12.19% | 0.4915 3.26%
Restaurant | 0.7983 8.59% | - - | 0.4805 13.70% | 0.4829 9.58%
Tv | 0.7925 12.76% | 0.6840 8.16% | - - | 0.4997 4.79%
Laptop | 0.7870 15.17% | 0.6859 7.55% | 0.4953 18.60% | - -
[R+H] | - - | - - | 0.5019 7.43% | 0.4977 5.96%
[L+T] | 0.7935 11.71% | 0.6927 6.49% | - - | - -
sec. 2 — + DC + SC:
Hotel | - - | 0.7131 2.53% | 0.5164 3.25% | 0.5007 1.68%
Restaurant | 0.8217 3.95% | - - | 0.5043 2.99% | 0.4931 2.77%
Tv | 0.8251 4.89% | 0.6971 4.62% | - - | 0.5009 2.10%
Laptop | 0.8218 2.89% | 0.6926 2.87% | 0.5243 1.52% | - -
[R+H] | - - | - - | 0.5197 2.58% | 0.5009 1.61%
[L+T] | 0.8252 2.87% | 0.7066 3.73% | - - | - -
sec. 3 — scr10:
RALSTM | 0.6855 22.53% | 0.6003 17.65% | 0.4009 22.37% | 0.4475 24.47%
VRALSTM | 0.7378 15.43% | 0.6417 15.69% | 0.4392 17.45% | 0.4851 10.06%
sec. 4 — + DC only:
Hotel | - - | 0.6823 4.97% | 0.4322 27.65% | 0.4389 26.31%
Restaurant | 0.8031 6.71% | - - | 0.4169 34.74% | 0.4245 26.71%
Tv | 0.7494 14.62% | 0.6430 14.89% | - - | 0.5001 15.40%
Laptop | 0.7418 19.38% | 0.6763 9.15% | 0.5114 10.07% | - -
[R+H] | - - | - - | 0.4257 31.02% | 0.4331 31.26%
[L+T] | 0.7658 8.96% | 0.6831 11.45% | - - | - -
sec. 5 — + SC only:
Hotel | - - | 0.6976 5.00% | 0.4896 9.50% | 0.4919 9.20%
Restaurant | 0.7960 4.24% | - - | 0.4874 12.26% | 0.4958 5.61%
Tv | 0.7779 10.75% | 0.7134 5.59% | - - | 0.4913 13.07%
Laptop | 0.7882 8.08% | 0.6903 11.56% | 0.4963 7.71% | - -
[R+H] | - - | - - | 0.4950 8.96% | 0.5002 5.56%
[L+T] | 0.7588 9.53% | 0.6940 10.52% | - - | - -
sec. 3: Training RALSTM and VRALSTM models from scratch using 10% of the Target domain data.

Table 3: Ablation study results evaluated on Target domains by adaptation training of the proposed models from Source domains using only 10% of the Target domain data (sec. 1, 2, 4, 5). The results were averaged over 5 randomly initialized networks.

6.2 Ablation Studies

The ablation studies (Table 3, sec. 1, 2) demonstrate the contribution of the two critics, in which the models were assessed with both critics, only one, or none. Combining both critics clearly makes a substantial contribution, increasing the BLEU score and decreasing the slot error rate by a large margin on every dataset pair. A comparison between VRALSTM without critics (Laptop) and VDANLG (Laptop), both adapting from the source Laptop domain and evaluated on the target Hotel domain, shows that the VDANLG not only achieves a much higher BLEU score but also significantly reduces the ERR. The trend is consistent across all the other domain pairs. These results stipulate that the critics are necessary for effective learning to adapt to a new domain.

Table 3, sec. 4 further demonstrates that using DC only brings the benefit of effectively utilizing similar slot-value pairs seen in the training data for closer domain pairs, such as Hotel→Restaurant, Restaurant→Hotel, Laptop→Tv, and Tv→Laptop. However, it is inefficient for more distant domain pairs, since their performance is worse than without critics, or in some cases even worse than the VRALSTM in the scr10 scenario, such as Restaurant→Tv and the cases where Laptop is the target domain. On the other hand, using only SC (sec. 5) helps the models achieve better results, since it is aware of the sentence style when adapting to the target domain.

6.3 Distance of Dataset Pairs

To better understand the effectiveness of the methods, we analyze the learning behavior of the proposed model across different dataset pairs. The datasets' order of difficulty, from easiest to hardest, is: Hotel → Restaurant → Tv → Laptop. On the one hand, the greater the distance between datasets, the more difficult the domain adaptation task becomes. This clearly shows in Table 3, sec. 1, in the Hotel column, where the adaptation ability worsens (decreasing BLEU and increasing ERR) along the order of the Restaurant → Tv → Laptop source datasets. On the other hand, the closer the dataset pair, the faster the model can adapt: as expected, the model adapts better to the target Tv/Laptop domain from source Laptop/Tv than from source Restaurant or Hotel, and, vice versa, adapts more easily to the target Restaurant/Hotel domain from source Hotel/Restaurant than from Laptop or Tv. However, this is not absolute, since the proposed method can still perform acceptably well from the easier source domains (Hotel, Restaurant) to the more difficult target domains (Tv, Laptop) and vice versa (Table 3, sec. 1, 2).

Table 3, sec. 2 further shows that the proposed method is able to leverage out-of-domain knowledge, since the adaptation models trained on a union source dataset, such as [R+H] or [L+T], show better performance than those trained on an individual source domain. For example, the adaptation VDANLG model trained on the union of the Laptop and Tv source datasets ([L+T]) achieves better BLEU and ERR than the models trained on the individual Laptop or Tv source datasets; likewise, the VDANLG model trained on the union of Restaurant and Hotel ([R+H]) outperforms the models trained on the separate Restaurant or Hotel source datasets. The trend is mostly consistent across all other domain comparisons in different training scenarios. All of this demonstrates that the proposed model can learn global semantics that can be efficiently transferred into new domains.

6.4 Adaptation vs. All Training Scenario

It is interesting to compare the Adaptation scenario (Table 3, sec. 2) with the All training scenario (Table 2). The VDANLG model shows considerable ability to shift to another domain with limited in-domain labels, achieving results competitive with, and in some cases better than, the previous models trained on full labels of the target domain. For example, evaluated on the Tv domain, the VDANLG model trained on the source Laptop domain achieves better performance than HLSTM, SCLSTM, and Enc-Dec. The VDANLG models in many cases also achieve lower slot error rate (ERR) scores than the Enc-Dec model. These results indicate the stable strength of the VDANLG models in adapting to a new domain when the target domain data is scarce.

Model Generated Responses from Laptop Domain
DA 1 compare(name=‘tecra erebus 20’; memory=‘4 gb’; isforbusinesscomputing=‘true’; name=‘satellite heracles 45’; memory=‘2 gb’; isforbusinesscomputing=‘false’)
Reference 1 compared to tecra erebus 20 which has a 4 gb memory and is for business computing , satellite heracles 45 has a 2 gb memory and is not for business computing . which one do you prefer
VRALSTM which would be the tecra erebus 20 is a business computing laptop with 4 gb of memory and is the SLOT_NAME , and is not for business computing . [satellite heracles 45][2 gb]
Hotel the tecra erebus 20 is used for business computing . the satellite heracles 45 has 4 gb of memory and a SLOT_BATTERY battery life for business computing . which one do you want
Restaurant the tecra erebus 20 is for business computing . the satellite heracles 45 which has 4 gb of memory and is not for business computing . which one do you want [2 gb]
Tv the tecra erebus 20 has 4 gb of memory and is not for business computing . which one do you prefer [is for business computing][satellite heracles 45][2 gb]
[R+H] the tecra erebus 20 is not for business computing . which one do you want a business computing . which one do you prefer [4 gb][is for business computing][satellite heracles 45][2 gb]
Hotel the tecra erebus 20 has a 4 gb memory , that is for business computing . the satellite heracles 45 with 2 gb of memory and is not for business computing . which one do you want [OK]
Restaurant the tecra erebus 20 has a 4 gb memory , and is for business computing . the satellite heracles 45 is not for business computing . which one do you want to know more [2 gb]
Tv the tecra erebus 20 is a business computing . the satellite heracles 45 has a 4 gb memory and is not for business computing . which one do you prefer [2 gb]
[R+H] the tecra erebus 20 is for business computing , has a 2 gb of memory. the satellite heracles 45 has 4 gb of memory , is not for business computing. which one do you want
Table 4: Comparison of top Laptop responses generated for different scenarios by adaptation training VRALSTM (denoted by ) and VDANLG (denoted by ) models from Source domains, and by training VRALSTM from Scratch. Errors are marked in colors ([missing], misplaced, redundant, wrong, spelling mistake information). [OK] denotes successful generation. VDANLG = VRALSTM+SC+DC.

6.5 Unsupervised Domain Adaptation

We further examine the effectiveness of the proposed methods by training the VDANLG models on the Counterfeit target datasets [Wen et al.2016a]. Promising results are shown in Table 1-(b), despite the fact that the models were adaptation-trained on the Counterfeit datasets, i.e., only indirectly trained on the (Test) domains. The proposed models still show positive signs, remarkably reducing the slot error rate ERR in the cases where Hotel and Tv are the (Test) domains. Surprisingly, even though the source domains (Hotel/Restaurant) are far from the (Test) domain Tv, and the target domain Counterfeit L2T is also very different from the source domains, the model can still adapt acceptably well, reaching reasonable BLEU scores and very low ERR scores on the (Test) Tv domain. This phenomenon will be further investigated in the unsupervised scenario in future work.

6.6 Comparison on Generated Outputs

On the one hand, the VRALSTM models (trained from Scratch, or adapted from source domains) produce outputs with a diverse range of error types, including missing, misplaced, redundant, or wrong slots, and even misspelled slot information, leading to a very high slot error rate ERR. Specifically, the VRALSTM trained from Scratch tends to produce repeated slots as well as many missing slots in the generated outputs, since the training data may be inadequate for the model to handle unseen dialogue acts in general. The VRALSTM models without critics adapted from source domains (Table 4 and Appendix B, Table 5) tend to generate outputs with fewer error types than the model trained from Scratch, because the VRALSTM models may capture the overlapping slots of both source and target domains during adaptation training.

On the other hand, under the guidance of the critics (SC and DC) in an adversarial training procedure, the VDANLG model can effectively leverage the existing knowledge of the source domains to better adapt to the target domains. The VDANLG models can generate outputs in the style of the target domain with far fewer error types than the two models above. Moreover, the VDANLG models tend to produce satisfactory utterances with more correctly generated slots. For example, a sample output by the [R+H] model in Table 4, example 1 contains all the required slots with only the information of two slots, 2 gb and 4 gb, misplaced, while the output produced by the Hotel model is a fully successful generation. Other samples in Appendix B, Table 5 generated by the Hotel, Tv, [R+H] (DA 2) and Laptop (DA 3) models are all fulfilled responses. An analysis of the generated responses in Table 5, example 2 illustrates that the VDANLG models tend to generate concise responses, since the models show a tendency to compress potential slots into a short phrase, i.e., "SLOT_NAME SLOT_TYPE". For example, the VDANLG models tend to respond concisely with "the portege phosphorus 43 laptop ..." instead of "the portege phosphorus 43 is a laptop ...". All of the above demonstrates that the VDANLG models are able to produce better results with a much lower slot error rate ERR.

7 Conclusion and Future Work

We have presented the integration of a variational generator and two critics in an adversarial training algorithm to examine the model's ability in the domain adaptation task. Experiments show that the proposed models can perform acceptably well in a new, unseen domain by using a limited amount of in-domain data. The ablation studies also demonstrate that the variational generator contributes to effectively learning the underlying semantics of DA-utterance pairs, while the critics play an important role in guiding the model to adapt to a new domain. The proposed models further show positive signs in unsupervised domain adaptation, which would be a worthwhile study in the future.


Acknowledgments

This work was supported by the JST CREST Grant Number JPMJCR1513, the JSPS KAKENHI Grant Number 15K16048 and the SIS project.