Variational Cross-domain Natural Language Generation for Spoken Dialogue Systems

12/20/2018 ∙ by Bo-Hsiang Tseng, et al. ∙ University of Cambridge

Cross-domain natural language generation (NLG) is still a difficult task within spoken dialogue modelling. Given a semantic representation provided by the dialogue manager, the language generator should generate sentences that convey the desired information. Traditional template-based generators can produce sentences with all necessary information, but these sentences are not sufficiently diverse. With RNN-based models, the diversity of the generated sentences can be high; however, some information is lost in the process. In this work, we improve an RNN-based generator by considering latent information at the sentence level during generation using the conditional variational autoencoder architecture. We demonstrate that our model outperforms the original RNN-based generator, while yielding highly diverse sentences. In addition, our model performs better when the training data is limited.




1 Introduction

Conventional spoken dialogue systems (SDS) require a substantial amount of hand-crafted rules to achieve good interaction with users. The large amount of required engineering limits the scalability of these systems to settings with new or multiple domains. Recently, statistical approaches have been studied that allow natural, efficient and more diverse interaction with users without depending on pre-defined rules (Young et al., 2013; Gašić et al., 2014; Henderson et al., 2014).

Natural language generation (NLG) is an essential component of an SDS. Given a semantic representation (SR) consisting of a dialogue act and a set of slot-value pairs, the generator should produce natural language containing the desired information.

Traditionally, NLG was based on templates (Cheyer and Guzzoni, 2014), which produce grammatically correct sentences that contain all desired information. However, the lack of variation in these sentences made such systems seem tedious and monotonous. Trainable generators (Langkilde and Knight, 1998; Stent et al., 2004) can generate several sentences for the same SR, but the dependence on pre-defined operations limits their potential. Corpus-based approaches (Oh and Rudnicky, 2000; Mairesse and Walker, 2011) learn to generate natural language directly from data without pre-defined rules. However, they usually require alignment between the sentence and the SR. Recently, Wen et al. (2015b) proposed an RNN-based approach, which outperformed previous methods on several metrics. However, the generated sentences often did not include all desired attributes.

The variational autoencoder (VAE) (Kingma and Welling, 2013) enabled for the first time the generation of complicated, high-dimensional data such as images. The conditional variational autoencoder (CVAE) (Sohn et al., 2015), first proposed for image generation, has a similar structure to the VAE with an additional dependency on a condition. Recently, the CVAE has been applied to dialogue systems (Serban et al., 2017; Shen et al., 2017; Zhao et al., 2017) using the previous dialogue turns as the condition. However, their output was not required to contain specific information.

In this paper, we improve RNN-based generators by adapting the CVAE to the difficult task of cross-domain NLG. Due to the additional latent information encoded by the CVAE, our model outperforms the semantically conditioned LSTM (SCLSTM) of Wen et al. (2015b) at conveying all desired information. Furthermore, our model achieves better results when the training data is limited.

2 Model Description

2.1 Variational Autoencoder

The VAE is a generative latent variable model. It uses a neural network (NN) to generate $\hat{x}$ from a latent variable $z$, which is sampled from the prior $p_\theta(z)$. The VAE is trained such that $\hat{x}$ is a sample of the distribution $p_D(x)$ from which the training data was collected. Generative latent variable models have the form $p_\theta(x) = \int_z p_\theta(x|z)\, p_\theta(z)\, dz$. In a VAE an NN, called the decoder, models $p_\theta(x|z)$ and would ideally be trained to maximise the expectation of the above integral, $E[p_\theta(x)]$. Since this is intractable, the VAE uses another NN, called the encoder, to model $q_\phi(z|x)$, which should approximate the posterior $p_\theta(z|x)$. The NNs in the VAE are trained to maximise the variational lower bound (VLB) to $\log p_\theta(x)$, which is given by:

$$L_{VAE}(\theta, \phi; x) = -KL(q_\phi(z|x) \,\|\, p_\theta(z)) + E_{q_\phi(z|x)}[\log p_\theta(x|z)]$$

The first term is the KL-divergence between the approximated posterior and the prior, which encourages similarity between the two distributions. The second term is the expected log-likelihood of the data given samples from the approximated posterior. The CVAE has a similar structure, but the prior is modelled by another NN, called the prior network. The prior network is conditioned on $c$. The new objective function can now be written as:

$$L_{CVAE}(\theta, \phi; x, c) = -KL(q_\phi(z|x,c) \,\|\, p_\theta(z|c)) + E_{q_\phi(z|x,c)}[\log p_\theta(x|z,c)]$$

When generating data, the encoder is not used and $z$ is sampled from $p_\theta(z|c)$.
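As a minimal sketch of the KL term in the objective above: when both the approximate posterior and the conditional prior are Gaussians (as in Section 2.2), the divergence has a closed form. The diagonal covariance, the function name and the list-based representation are assumptions made for illustration, not details taken from the paper.

```python
import math

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) for two diagonal Gaussians, given per-dimension means and
    standard deviations (diagonal covariance is an assumption of this sketch)."""
    kl = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        kl += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2) - 0.5
    return kl

# Identical distributions have zero divergence.
kl_same = gaussian_kl([0.0, 1.0], [1.0, 2.0], [0.0, 1.0], [1.0, 2.0])
# Shifting the posterior mean by 1 in a unit-variance dimension costs 0.5 nats.
kl_shift = gaussian_kl([1.0], [1.0], [0.0], [1.0])
```

The divergence is zero only when posterior and prior coincide, which is why a decoder that ignores the latent variable can drive this term to zero.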

2.2 Semantically Conditioned VAE

Figure 1: Semantically Conditioned Variational Autoencoder with a semantic representation (SR) as the condition. $x$ is the system response with words $w_{1:N}$. $x_D$, $x_A$ and $x_S$ are labels for the domain, the dialogue act (DA) and the slots of $x$.

The structure of our model is depicted in Fig. 1. Conditioned on an SR, it generates the system's word-level response $x$. An SR consists of three components: the domain, a dialogue act and a set of slot-value pairs. Slots are attributes required to appear in $x$ (e.g. a hotel's area). A slot can have a value, in which case the two are called a slot-value pair (e.g. area=north). $x$ is delexicalised, which means that slot values are replaced by corresponding slot tokens. The condition $c$ of our model is the SR represented as two 1-hot vectors for the domain and the dialogue act as well as a binary vector for the slots.
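The encoding of an SR and the delexicalisation step described above can be sketched as follows. The domain and dialogue-act lists follow Table 1; the shortened slot vocabulary, the `SLOT_` token format and the naive string replacement are assumptions made for illustration only.

```python
DOMAINS = ["restaurant", "hotel", "tv", "laptop"]
DIALOGUE_ACTS = [
    "reqmore", "goodbye", "select", "confirm", "request", "inform",
    "inform_only", "inform_count", "inform_no_match", "compare",
    "recommend", "inform_all", "suggest", "inform_no_info",
]
# Hypothetical, shortened slot vocabulary; a real one would cover all slots in Table 1.
SLOTS = ["name", "type", "area", "near", "price", "pricerange"]

def one_hot(item, vocabulary):
    """1-hot vector marking the position of item in vocabulary."""
    return [1 if entry == item else 0 for entry in vocabulary]

def encode_sr(domain, act, slot_values):
    """Concatenate a 1-hot domain vector, a 1-hot act vector and a binary slot vector."""
    slot_vector = [1 if slot in slot_values else 0 for slot in SLOTS]
    return one_hot(domain, DOMAINS) + one_hot(act, DIALOGUE_ACTS) + slot_vector

def delexicalise(sentence, slot_values):
    """Replace each slot value by a slot token such as SLOT_AREA (assumed token format)."""
    for slot, value in slot_values.items():
        sentence = sentence.replace(value, "SLOT_" + slot.upper())
    return sentence

slot_values = {"name": "the grand hotel", "area": "north"}
condition = encode_sr("hotel", "inform", slot_values)
delexicalised = delexicalise("the grand hotel is in the north of town", slot_values)
```

Here `condition` has 4 + 14 + 6 entries, and the delexicalised sentence contains slot tokens in place of the hotel name and the area.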

During training, $x$ is first passed through a single-layer bi-directional LSTM, the output of which is concatenated with $c$ and passed to the recognition network. The recognition network parametrises a Gaussian distribution $\mathcal{N}(\mu_{post}, \sigma_{post})$, which is the posterior. The prior network only has $c$ as its input and parametrises a Gaussian distribution $\mathcal{N}(\mu_{prior}, \sigma_{prior})$, which is the prior. Both networks are fully-connected (FC) NNs with one and two layers respectively. During training, $z$ is sampled from the posterior. When the model is used for generation, $z$ is sampled from the prior. The decoder is an SCLSTM (Wen et al., 2015b) using $z$ as its initial hidden state and initial cell vector. The first input to the SCLSTM is a start-of-sentence (sos) token and the model generates words until it outputs an end-of-sentence (eos) token.
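The switch between the two sampling modes can be sketched as below. The sketch assumes diagonal Gaussians and the standard reparameterisation of Kingma and Welling (2013); the function names and the numeric values are made up for illustration, and in the real model the means and standard deviations would come from the recognition and prior networks.

```python
import random

def sample_gaussian(mu, sigma, rng):
    """Reparameterised sample: mu + sigma * eps with eps ~ N(0, 1) per dimension."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def sample_latent(posterior, prior, training, rng):
    """Sample from the posterior during training and from the prior at generation time."""
    mu, sigma = posterior if training else prior
    return sample_gaussian(mu, sigma, rng)

rng = random.Random(0)
posterior = ([0.5, -0.5], [0.1, 0.1])  # (mean, std), made-up recognition-network output
prior = ([0.0, 0.0], [1.0, 1.0])       # (mean, std), made-up prior-network output
z_train = sample_latent(posterior, prior, training=True, rng=rng)
z_generate = sample_latent(posterior, prior, training=False, rng=rng)
```

Drawing several samples from the prior for the same condition yields several different initial decoder states, which is how the model produces multiple candidate sentences per SR.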

2.3 Optimization

When the decoder in the CVAE is powerful on its own, it tends to ignore the latent variable $z$, since the encoder fails to encode enough information into $z$. Regularization methods can be introduced in order to push the encoder towards learning a good representation of the latent variable $z$. Since the KL-component of the VLB does not contribute towards learning a meaningful $z$, increasing its weight gradually from $0$ to $1$ during training helps to encode a better representation in $z$. This method is termed KL-annealing (Bowman et al., 2016). In addition, inspired by Zhao et al. (2017), we introduce a regularization method using another NN which is trained to use $z$ to recover the condition $c$. The NN is split into three separate FC NNs of one layer each, which independently recover the domain, dialogue-act and slots components of $c$. The objective of our model can be written as:

$$L_{SCVAE}(\theta, \phi; x, c) = L_{CVAE}(\theta, \phi; x, c) + E_{q_\phi(z|x,c)}\Big[\log p(x_D|z) + \log p(x_A|z) + \log \prod_{i=1}^{|S|} p(x_{S_i}|z)\Big]$$

where $x_D$ is the domain label, $x_A$ is the dialogue act label and $x_{S_i}$ are the slot labels with $|S|$ slots in the SR. In the proposed model, the CVAE learns to encode information about both the sentence and the SR into $z$. Using $z$ as its initial state, the decoder is better at generating sentences with desired attributes. In section 4.1 a visualization of the latent space demonstrates that a semantically meaningful representation for $z$ was learned.
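A minimal sketch of how KL-annealing and the auxiliary recovery losses might be combined into a single quantity to minimise. The linear shape of the schedule and the equal weighting of the three auxiliary terms are assumptions; the 5000-step warm-up follows the 5k mini-batch updates mentioned in Section 3. All inputs are plain numbers standing in for per-batch loss values.

```python
def kl_weight(step, warmup_steps=5000):
    """Linearly increase the KL weight from 0 to 1 over warmup_steps updates, then hold at 1."""
    return min(1.0, step / warmup_steps)

def scvae_loss(recon_nll, kl, domain_nll, act_nll, slot_nll, step):
    """Loss to minimise: reconstruction NLL, annealed KL term, and the three
    negative log-likelihoods for recovering domain, dialogue act and slots."""
    return recon_nll + kl_weight(step) * kl + domain_nll + act_nll + slot_nll

loss_start = scvae_loss(10.0, 4.0, 1.0, 1.0, 1.0, step=0)
loss_half = scvae_loss(10.0, 4.0, 1.0, 1.0, 1.0, step=2500)
loss_late = scvae_loss(10.0, 4.0, 1.0, 1.0, 1.0, step=10000)
```

Early in training the KL term contributes nothing, so the encoder is free to put information into the latent variable before the prior-matching pressure is applied.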

                       Restaurant        Hotel             Television        Laptop
# of examples          3114/1039/1039    3223/1075/1075    4221/1407/1407    7944/2649/2649
  (train/valid/test)
dialogue acts
  Restaurant, Hotel:   reqmore, goodbye, select, confirm, request, inform, inform_only, inform_count, inform_no_match
  Television, Laptop:  compare, recommend, inform_all, suggest, inform_no_info, plus the 9 acts listed for Restaurant and Hotel
shared slots
  Restaurant, Hotel:   name, type, area, near, price, phone, address, postcode, pricerange
  Television, Laptop:  name, type, price, family, pricerange
specific slots
  Television:          screensizerange, ecorating, hdmiport, hasusbport, audio, accessories, color, screensize, resolution, powerconsumption
  Laptop:              warranty, battery, design, batteryrating, weightrange, utility, platform, driverange, dimension, memory, processor
Table 1: The statistics of the cross-domain dataset

3 Dataset and Setup

The proposed model is used for an SDS that provides information about restaurants, hotels, televisions and laptops. It is trained on the dataset of Wen et al. (2016), which consists of sentences with corresponding semantic representations. Table 1 shows statistics about the corpus, which was split into training, validation and test sets in a 3:1:1 ratio. The dataset contains 14 different system dialogue acts. The television and laptop domains are much more complex than the other two domains. There are around 7k and 13k different SRs possible for the TV and the laptop domain respectively. For the restaurant and hotel domains only 248 and 164 unique SRs are possible. This imbalance makes the NLG task more difficult.

The generators were implemented using the PyTorch library (Paszke et al., 2017). The size of the decoder SCLSTM, and thus of the latent variable, was set to 128. KL-annealing was used, with the weight of the KL-loss reaching $1$ after 5k mini-batch updates. The slot error rate (ERR), used in Oh and Rudnicky (2000) and Wen et al. (2015a), is the metric that measures the model's ability to convey the desired information. ERR is defined as $(p + q)/N$, where $N$ is the number of slots in the SR, and $p$ and $q$ are the numbers of missing and redundant slots in the generated sentence. The BLEU-4 metric and perplexity (PPL) are also reported. The baseline SCLSTM is optimized; it has been shown to outperform template-based methods and trainable generators (Wen et al., 2015b). NLG often uses the over-generation and reranking paradigm (Oh and Rudnicky, 2000). The SCVAE can generate multiple sentences by sampling multiple $z$, while the SCLSTM has to sample different words from the output distribution. In our experiments ten sentences are generated per SR. Table 4 in the appendix shows one SR in each domain with five illustrative sentences generated by our model.
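The slot error rate, i.e. the number of missing plus redundant slots divided by the number of slots in the SR, can be sketched on delexicalised sentences as below, together with a simple rerank of over-generated candidates. The `SLOT_` token format, the treatment of repeated tokens as redundant, and ranking by ERR alone are assumptions of this sketch; the paper does not spell out these details.

```python
def slot_error_rate(required_slots, sentence):
    """(missing + redundant) / number of required slots, on a delexicalised sentence."""
    if not required_slots:
        raise ValueError("ERR is undefined for an SR without slots")
    generated = [tok for tok in sentence.split() if tok.startswith("SLOT_")]
    missing = sum(1 for slot in required_slots if slot not in generated)
    redundant = 0
    seen = set()
    for tok in generated:
        # A slot token is counted as redundant if it is not in the SR or is repeated.
        if tok not in required_slots or tok in seen:
            redundant += 1
        seen.add(tok)
    return (missing + redundant) / len(required_slots)

def rerank(required_slots, candidates):
    """Sort over-generated candidates by ascending slot error rate (stable for ties)."""
    return sorted(candidates, key=lambda s: slot_error_rate(required_slots, s))

required = ["SLOT_NAME", "SLOT_AREA"]
candidates = [
    "SLOT_NAME is a nice hotel",                     # misses SLOT_AREA
    "SLOT_NAME is in the SLOT_AREA near SLOT_NEAR",  # redundant SLOT_NEAR
    "SLOT_NAME is in the SLOT_AREA part of town",    # conveys exactly the SR
]
errors = [slot_error_rate(required, c) for c in candidates]
best = rerank(required, candidates)[0]
```

In this toy example the third candidate is ranked first because it is the only one with neither a missing nor a redundant slot.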

4 Experimental Results

4.1 Visualization of Latent Variable

Figure 2: 2D-projection of $z$ for each data point in the test set, with two different colouring-schemes.

2D-projections of $z$ for each data point in the test set are shown in Fig. 2, using PCA for dimensionality reduction. In Fig. 2a, data points of the restaurant, hotel, TV and laptop domains are marked as blue, green, red and yellow respectively. As can be seen, data points from the laptop domain are contained within four distinct clusters. In addition, there is a large overlap of the TV and laptop domains, which is not surprising as they share all dialogue acts (DAs). Similarly, there is overlap of the restaurant and hotel domains. In Fig. 2b, the eight most frequent DAs are colour-coded. The act recommend, depicted in green, has a similar distribution to the laptop domain in Fig. 2a, since recommend occurs mostly in the laptop domain. This suggests that our model learns to map similar SRs into close regions within the latent space. Therefore, $z$ contains meaningful information with regard to the domain, DAs and slots.

4.2 Empirical Comparison

Metrics  Method  Restaurant  Hotel  TV     Laptop  Overall
ERR(%)   SCLSTM  2.978       1.666  4.076  2.599   2.964
ERR(%)   SCVAE   2.823       1.528  2.819  1.841   2.148
BLEU     SCLSTM  0.529       0.642  0.475  0.439   0.476
BLEU     SCVAE   0.540       0.652  0.478  0.442   0.478
PPL      SCLSTM  2.654       3.229  3.365  3.941   3.556
PPL      SCVAE   2.649       3.159  3.337  3.919   3.528
Table 2: Comparison between SCVAE and SCLSTM. Both are trained on the full dataset and tested on individual domains.

4.2.1 Cross-domain Training

Table 2 shows the comparison between SCVAE and SCLSTM. Both are trained on the full cross-domain dataset, and tested on the four domains individually. The SCVAE outperforms the SCLSTM on all metrics. For the highly complex TV and laptop domains, the SCVAE leads to large improvements in ERR (from 4.076% to 2.819% for TV and from 2.599% to 1.841% for laptop). This shows that the additional sentence-level conditioning through $z$ helps to convey all desired attributes.

4.2.2 Limited Training Data

Fig. 3 shows BLEU and ERR results when the SCVAE and SCLSTM are trained on varying amounts of data. The SCVAE has a lower ERR than the SCLSTM across the varying amounts of training data. For very small amounts of data the SCVAE outperforms the SCLSTM by an even larger margin. In addition, our model consistently achieves better results on the BLEU metric.

Figure 3: Comparison between SCVAE and SCLSTM with limited training data.

4.2.3 K-Shot Learning

For the K-shot learning experiments, we trained the model using all training examples from three domains and only 300 examples from the target domain (600 examples were used when laptop was the target domain). The target domain is the domain we test on. As seen from Table 3, the SCVAE outperforms the SCLSTM on ERR in all domains except hotel. This might be because the hotel domain is the simplest and the model does not need to rely on the knowledge from other domains. The SCVAE strongly outperforms the SCLSTM for the complex TV and laptop domains, where the number of distinct SRs is large. This suggests that the SCVAE is better at transferring knowledge between domains.
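The construction of a K-shot training set can be sketched as follows: all examples from the three source domains plus K examples from the target domain. Random selection of the K examples, the seed, and the function name are assumptions; the toy corpus only mirrors the training-set sizes from Table 1.

```python
import random

def build_k_shot_training_set(examples_by_domain, target_domain, k, seed=0):
    """All examples from the source domains plus k randomly chosen target-domain examples."""
    rng = random.Random(seed)
    training_set = []
    for domain, examples in examples_by_domain.items():
        if domain == target_domain:
            training_set.extend(rng.sample(examples, k))
        else:
            training_set.extend(examples)
    return training_set

# Toy corpus with the training-set sizes from Table 1; each example is a (domain, index) pair.
sizes = {"restaurant": 3114, "hotel": 3223, "tv": 4221, "laptop": 7944}
corpus = {d: [(d, i) for i in range(n)] for d, n in sizes.items()}
tv_run = build_k_shot_training_set(corpus, "tv", k=300)
laptop_run = build_k_shot_training_set(corpus, "laptop", k=600)
```

With TV as the target, the training set holds 300 TV examples alongside the full restaurant, hotel and laptop training sets; evaluation would then use only the target domain's test set.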

Metrics  Method  Restaurant  Hotel  TV      Laptop
ERR(%)   SCLSTM  13.039      5.366  24.497  27.587
ERR(%)   SCVAE   10.329      6.182  20.590  20.864
BLEU     SCLSTM  0.462       0.578  0.382   0.379
BLEU     SCVAE   0.458       0.579  0.397   0.393
PPL      SCLSTM  3.649       4.861  5.171   6.469
PPL      SCVAE   3.575       4.800  5.092   6.364
Table 3: Comparison between SCVAE and SCLSTM in K-shot learning

5 Conclusion

In this paper, we propose a semantically conditioned variational autoencoder (SCVAE) for natural language generation. The SCVAE encodes information about both the semantic representation and the sentence into a latent variable $z$. Due to a newly proposed regularization method, the latent variable $z$ contains semantically meaningful information. Therefore, conditioning on $z$ leads to a strong improvement in generating sentences with all desired attributes. In an extensive comparison the SCVAE outperforms the SCLSTM on a range of metrics when training on different sizes of data and for K-shot learning. In particular, when testing the ability to convey all desired information within complex domains, the SCVAE shows markedly better results.


Bo-Hsiang Tseng is supported by Cambridge Trust and the Ministry of Education, Taiwan. This research was partly funded by the EPSRC grant EP/M018946/1 Open Domain Statistical Spoken Dialogue Systems. Florian Kreyssig is supported by the Studienstiftung des Deutschen Volkes. Paweł Budzianowski is supported by the EPSRC and Toshiba Research Europe Ltd.


  • Bowman et al. (2016) Samuel Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating Sentences from a Continuous Space.
  • Cheyer and Guzzoni (2014) Adam Cheyer and Didier Guzzoni. 2014. Method and apparatus for building an intelligent automated assistant. US Patent 8,677,377.
  • Gašić et al. (2014) M Gašić, Dongho Kim, Pirros Tsiakoulis, Catherine Breslin, Matthew Henderson, Martin Szummer, Blaise Thomson, and Steve Young. 2014. Incremental on-line adaptation of pomdp-based dialogue managers to extended domains. In Fifteenth Annual Conference of the International Speech Communication Association.
  • Henderson et al. (2014) Matthew Henderson, Blaise Thomson, and Steve Young. 2014. Robust dialog state tracking using delexicalised recurrent neural networks and unsupervised adaptation. In Spoken Language Technology Workshop (SLT), 2014 IEEE, pages 360–365. IEEE.
  • Kingma and Welling (2013) Diederik P. Kingma and Max Welling. 2013. Auto-encoding variational bayes. CoRR, abs/1312.6114.
  • Langkilde and Knight (1998) Irene Langkilde and Kevin Knight. 1998. Generation that exploits corpus-based statistical knowledge. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics-Volume 1, pages 704–710. Association for Computational Linguistics.
  • Mairesse and Walker (2011) François Mairesse and Marilyn A Walker. 2011. Controlling user perceptions of linguistic style: Trainable generation of personality traits. Computational Linguistics, 37(3):455–488.
  • Oh and Rudnicky (2000) Alice H Oh and Alexander I Rudnicky. 2000. Stochastic language generation for spoken dialogue systems. In Proceedings of the 2000 ANLP/NAACL Workshop on Conversational systems-Volume 3, pages 27–32. Association for Computational Linguistics.
  • Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in pytorch. In NIPS-W.
  • Serban et al. (2017) Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron C Courville, and Yoshua Bengio. 2017. A hierarchical latent variable encoder-decoder model for generating dialogues. In AAAI, pages 3295–3301.
  • Shen et al. (2017) Xiaoyu Shen, Hui Su, Yanran Li, Wenjie Li, Shuzi Niu, Yang Zhao, Akiko Aizawa, and Guoping Long. 2017. A conditional variational framework for dialog generation. In ACL.
  • Sohn et al. (2015) Kihyuk Sohn, Honglak Lee, and Xinchen Yan. 2015. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.
  • Stent et al. (2004) Amanda Stent, Rashmi Prasad, and Marilyn Walker. 2004. Trainable sentence planning for complex information presentation in spoken dialog systems. In Proceedings of the 42nd annual meeting on association for computational linguistics, page 79. Association for Computational Linguistics.
  • Wen et al. (2015a) Tsung-Hsien Wen, Milica Gašić, Dongho Kim, Nikola Mrkšić, Pei-Hao Su, David Vandyke, and Steve Young. 2015a. Stochastic Language Generation in Dialogue using Recurrent Neural Networks with Convolutional Sentence Reranking. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL). Association for Computational Linguistics.
  • Wen et al. (2016) Tsung-Hsien Wen, Milica Gašić, Nikola Mrkšić, Lina M. Rojas-Barahona, Pei-Hao Su, David Vandyke, and Steve Young. 2016. Multi-domain neural network language generation for spoken dialogue systems. In Proceedings of the 2016 Conference on North American Chapter of the Association for Computational Linguistics (NAACL). Association for Computational Linguistics.
  • Wen et al. (2015b) Tsung-Hsien Wen, Milica Gašić, Nikola Mrkšić, Pei-Hao Su, David Vandyke, and Steve Young. 2015b. Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.
  • Young et al. (2013) Steve Young, Milica Gašić, Blaise Thomson, and Jason D Williams. 2013. Pomdp-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101(5):1160–1179.
  • Zhao et al. (2017) Tiancheng Zhao, Ran Zhao, and Maxine Eskénazi. 2017. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 654–664.