Conditioned Query Generation for Task-Oriented Dialogue Systems

11/09/2019 ∙ by Stéphane d'Ascoli, et al. ∙ 7

Scarcity of training data for task-oriented dialogue systems is a well-known problem that is usually tackled with costly and time-consuming manual data annotation. An alternative solution is to rely on automatic text generation which, although less accurate than human supervision, has the advantage of being cheap and fast. In this paper we propose a novel controlled data generation method that could be used as a training augmentation framework for closed-domain dialogue. Our contribution is twofold. First, we show how to train and control the generation of intent-specific sentences using a conditional variational autoencoder. Then, we introduce a novel protocol called query transfer that makes it possible to leverage a broad, unlabelled dataset to extract relevant information. Comparison with two different baselines shows that our method, in the appropriate regime, consistently improves the diversity of the generated queries without compromising their quality.




1 Introduction

Closed-domain dialogue systems, single- or multi-turn, have become ubiquitous with the rise of conversational interfaces. These systems aim to extract relevant information from a user’s spoken query, produce the appropriate response or action and, when applicable, start a new dialogue turn. The typical spoken language understanding (SLU) framework relies on a speech-recognition engine that transforms the spoken utterance into text, followed by a natural language understanding engine that extracts meaning from the text utterance. Here we essentially consider single-turn closed-domain dialogue systems where the meaning is well summarized by an intent and its corresponding slots. As an example, the query “Play Skinny Love by Bon Iver” should be interpreted as a PlayTrack intent with slots TrackTitle “Skinny Love” and Artist “Bon Iver”.

Training data for conversational systems consists of annotated utterances corresponding to the various intents within the scope of the system. When developing a new interaction scheme with new intents, a (possibly large) representative set of manually annotated utterances needs to be produced, which is a costly and time-consuming process. It is therefore desirable to automate it as much as possible to reduce cost and development time. We aim at alleviating the training data scarcity problem through automatic generation of utterances conditioned on the desired intent.

In this work, we focus on the conditioned generation problem in itself in the context of conversational systems and detail the assessment of its performance in terms of quality and diversity of generated sentences. We choose to consider the low data regime as we feel it is prevalent for this task. The application of the proposed approach as an augmentation technique for SLU tasks will be the subject of a future paper and is beyond the scope of this work. We consider the scenario in which only a small set of annotated queries is available for all the in-domain intents, while leveraging a very large reservoir of unannotated queries that belong to a broad spectrum of intents ranging from close to far domain. This situation is indeed very typical of conversational platforms like DialogFlow, IBM Watson, or Snips, which offer a high degree of user customization.

1.1 Contribution and Outline.

In this paper we propose a method for conditional text generation with Conditional Variational Autoencoders (CVAE) [1] that leverages transfer from a large out-of-domain unlabelled dataset. The model hyper-parameters are tuned to favor transfer of valuable knowledge from the reservoir while maintaining an accurate conditioning. We use the trained CVAE decoder to generate new queries for each intent. We call this mechanism query transfer. We analyse the performance of this approach on the publicly-available Snips dataset [2] through both quality and diversity metrics. We also observe an improvement in the perplexity of a language model trained on data augmented with our generation scheme. This preliminary result is encouraging for future application to SLU data augmentation. We briefly show in the Appendix that the same approach can be applied to computer vision as well.

The paper is structured as follows: in Section 1.2, we briefly present the related literature, in Section 2 we introduce our approach in detail, and in Section 3 we describe the experimental settings and the evaluation metrics. In Section 4 we show our results on the quality of generation compared to two different baselines and on language modelling perplexity, before concluding in Section 5.

Figure 1: Architecture of the model. (Left panel) The variational autoencoder architecture with the various losses defined in Eq. 1. The continuous code is obtained through the so-called reparametrization trick, while the categorical variable is sampled using the Gumbel trick on a continuous vector. (Right panel) An illustration of the categorical latent vector and its role in filtering relevant sentences.

1.2 Related work

While there is a vast literature on text generation, conditional generation, data augmentation and transfer learning, there are only a few existing works that combine these elements. In [3] and [4] the authors use variational autoencoders to generate utterances through paraphrasing, with the objective of augmenting the SLU training set and improving slot-filling. There is no conditioning on the intent, and the data used to train the paraphrasing model is annotated and in-domain.

In [5] the authors use a CVAE to generate queries conditioned on the presence of certain slots and observe improvements in slot-filling performance when augmenting the training set with generated data. In [6] they instead propose an autoencoder that is capable of jointly generating a query together with its annotation (intent and slots) and show improvements in both intent classification and slot-filling through data augmentation. Neither of these models conditions the generation on the intent label or leverages unannotated data for training.

In a recent paper [7], the authors use semi-supervised self-learning to iteratively incorporate data coming from an unannotated set into the annotated training set. Their chosen metrics are both SLU performance and query diversity. This method represents a valid alternative to our generative data augmentation protocol and will be the object of competitive benchmarks in future work, where the impact of training data augmentation on SLU performance will be explored.

2 Approach

Conditional variational autoencoders. In order to generate queries conditioned on an underlying intent, we use a CVAE as depicted in Fig. 1 (left panel). While with VAEs the latent vector only incorporates continuous variables [8], features of a discrete nature can be considered as input in CVAEs, e.g. the digit class in MNIST or the user intent in conversational systems. Just like the encoding of continuous features, the encoding of discrete features is non-deterministic yet differentiable, thanks to the Gumbel-Max trick [9], and regularized to match a simple prior, generally taken to be the uniform categorical distribution.
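As an illustration of how a categorical variable can be sampled in a way that stays differentiable, the following is a minimal pure-Python sketch of Gumbel-based sampling. It is not the authors' implementation: the function names, the `temperature` parameter and the softmax relaxation are assumptions introduced here, and a real model would use a tensor library rather than Python lists.

```python
import math
import random


def gumbel_noise(rng):
    """Draw one Gumbel(0, 1) sample: g = -log(-log(u)), u ~ Uniform(0, 1)."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
    return -math.log(-math.log(u))


def gumbel_softmax_sample(logits, temperature=1.0, rng=random):
    """Relaxed one-hot sample from unnormalized class log-probabilities.

    `temperature` is a hypothetical knob: lower values give samples
    closer to a one-hot vector.
    """
    perturbed = [(l + gumbel_noise(rng)) / temperature for l in logits]
    top = max(perturbed)  # subtract the max for numerical stability
    exps = [math.exp(p - top) for p in perturbed]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_max_sample(logits, rng=random):
    """Hard sample: index of the largest Gumbel-perturbed logit."""
    perturbed = [l + gumbel_noise(rng) for l in logits]
    return perturbed.index(max(perturbed))
```

The relaxed sample is a probability vector that can be fed to a decoder during training, while the hard variant returns a class index.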

Each training sample is associated with a continuous feature vector and a categorical variable expressed as a one-hot vector. Unlike [1] and other implementations, we do not condition the generation of the latent code on the categorical variable. We instead use the encoding distribution to generate the latent variable, and we add both supervision and a KL regularization term that enforces the prior distribution on the classes. The associated loss function consists of three terms, namely the reconstruction term, the Kullback-Leibler term, and the categorical term:



In the equations above, the encoder and the decoder each come with their associated parameters, and the categorical space has a fixed dimension. A constant is used to set the relative weight of the KL regularization and to perform annealing during training [10, 11]. We introduce class-specific coefficients to account for possible class imbalance as well as to set the strength of the supervision we exercise on each category. For all the experiments presented here, the prior on the classes is the uniform categorical distribution. At inference time, a sentence is generated by concatenating a chosen one-hot categorical vector with a sampled continuous latent code, feeding the result to the decoder and greedily extracting the most probable sequence.
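One form of the three-term loss that is consistent with this description can be written as follows. All symbols are notation assumed here for illustration: $x$ is the input sentence, $z$ the continuous latent code, $c$ the categorical latent variable with one-hot label $y$, $q_\phi$ the encoder, $p_\theta$ the decoder, $K$ the number of classes, $\beta$ the KL weight and $\alpha_k$ the class-specific coefficients. Placing the class-prior KL inside the Kullback-Leibler term is likewise an assumption.

```latex
\begin{align}
\mathcal{L} &= \mathcal{L}_{\mathrm{rec}} + \mathcal{L}_{\mathrm{KL}} + \mathcal{L}_{\mathrm{cat}}, \\
\mathcal{L}_{\mathrm{rec}} &= -\,\mathbb{E}_{q_\phi(z, c \mid x)}\big[\log p_\theta(x \mid z, c)\big], \\
\mathcal{L}_{\mathrm{KL}} &= \beta \Big[ \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)
  + \mathrm{KL}\big(q_\phi(c \mid x) \,\|\, p(c)\big) \Big], \\
\mathcal{L}_{\mathrm{cat}} &= -\sum_{k=1}^{K} \alpha_k \, y_k \log q_\phi(c = k \mid x).
\end{align}
```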

Query transfer. A CVAE can be trained on a dataset of annotated queries, namely on (sentence, intent) pairs in which each sentence comes with the underlying intent of the query. With too few sentences as training examples, a CVAE would not yield generated sentences of high enough quality and diversity. In addition to the annotated training dataset, which is kept small in the data scarcity regime of interest in this paper, a second large “reservoir” dataset is considered. The latter is unannotated and contains sentences that potentially cover a larger spectrum, ranging from intents that are semantically close to the in-domain ones to completely out-of-domain examples.

The novelty in our approach is that one extra dimension is allocated for irrelevant sentences coming from the reservoir, namely an additional None intent; the categorical latent space of the CVAE already contains one dimension for each intent in the annotated dataset. All sentences from the reservoir are supervised by a cross-entropy loss towards this dimension, but we want the relevant ones to be allowed to transfer to one of the in-domain intents, as illustrated in Fig. 1 (right panel). To allow for this, we control the amount of transfer by weakening the supervision loss of the None class by a factor, referred to in the following as the transfer parameter. When this factor is zero, the sentences from the reservoir are not supervised at all. The validity of this approach and the effect of the transfer parameter are illustrated in the context of computer vision in the Appendix.
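The weakened supervision of the None class can be sketched as a class-weighted cross-entropy. This is a simplified illustration rather than the authors' code: `supervision_loss`, `class_weights` and the example weight of 0.2 are hypothetical, and the in-domain weights are set to 1.0 purely for the sake of the example.

```python
import math

NUM_INTENTS = 7           # in-domain intents of the annotated dataset
NONE_INDEX = NUM_INTENTS  # extra dimension reserved for reservoir sentences


def supervision_loss(class_probs, target_index, class_weights):
    """Cross-entropy on the predicted class distribution, scaled by the
    weight of the target class."""
    p = max(class_probs[target_index], 1e-12)  # avoid log(0)
    return -class_weights[target_index] * math.log(p)


def make_class_weights(transfer_weight, intent_weight=1.0):
    """One weight per intent plus a (weaker) weight for the None class.

    `transfer_weight` plays the role of the transfer parameter: at 0 the
    reservoir sentences are not supervised at all.
    """
    return [intent_weight] * NUM_INTENTS + [transfer_weight]


# Hypothetical example: a reservoir sentence that the encoder assigns to
# the None class with probability 0.5.
probs = [0.5 / NUM_INTENTS] * NUM_INTENTS + [0.5]
weak = supervision_loss(probs, NONE_INDEX, make_class_weights(0.2))
free = supervision_loss(probs, NONE_INDEX, make_class_weights(0.0))
```

With a small weight, the penalty for assigning a reservoir sentence to an in-domain intent is low, which is what lets relevant sentences transfer.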

Sentence selection. We introduce another mechanism to further improve the query transfer. Since the reservoir may contain a lot of irrelevant data that can potentially pollute the generation and conditioning, we may want to preprocess it and select queries that belong to a close domain. In the context of natural language processing, this may be achieved with sentence embeddings. We suggest using generalist sentence embeddings such as InferSent [12] as a first, rough sentence selection mechanism. We first compute an “intent embedding” for each intent of the annotated dataset, obtained by averaging the embeddings of all the sentences of the given intent. Then we only collect the sentences from the reservoir which are “close” enough to one of the in-domain intents, i.e. those whose similarity with at least one intent embedding exceeds a threshold which controls selectivity.
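The selection step can be sketched as follows, assuming cosine similarity between sentence embeddings. The `embed` callable stands in for a sentence encoder such as InferSent and is an assumption of this sketch, as are the function names and the data layout (a dict mapping each intent to its annotated sentences).

```python
import math


def mean_vector(vectors):
    """Component-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def select_from_reservoir(reservoir, annotated, embed, threshold):
    """Keep reservoir sentences close enough to at least one intent embedding."""
    # One "intent embedding" per intent: the mean of its sentence embeddings.
    intent_embeddings = {
        intent: mean_vector([embed(s) for s in sentences])
        for intent, sentences in annotated.items()
    }
    selected = []
    for sentence in reservoir:
        vector = embed(sentence)
        best = max(cosine(vector, e) for e in intent_embeddings.values())
        if best >= threshold:
            selected.append(sentence)
    return selected
```

A higher threshold discards more of the reservoir, which trades coverage for relevance.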

Figure 2: Generation metrics. (Left Panel) Evolution of the generation metrics as a function of the transfer parameter for 200 sentences in both the annotated dataset and the reservoir. (Middle Panel) Comparison of the introduced method (transfer with InferSent selection at ) with two baselines: one without any transfer () and one with InferSent pseudo-labelling (see text). (Right Panel) Effect of the size of the reservoir (for and ) on the generation: increasing the number of transferred sentences improves the generation up to a certain point at which the quality degrades rapidly.

3 Experiments

3.1 Experimental setup

Data processing. For our experiments we use the publicly-available Snips benchmark dataset [2], which contains user queries from 7 intents, such as manipulating playlists or booking restaurants, with 2000 queries per intent (of which we only keep small fractions to mimic scarcity) and a test set of 100 queries per intent. Each intent comes with specific slots. As a proxy for a reservoir dataset, we use a large in-house dataset which collects assistants created by Snips users and contains all sorts of queries from over 300 varied intents.

The word embeddings fed to the encoder are pre-trained GloVe embeddings [13]. We use a delexicalization procedure similar to that used in [4] for Seq2Seq models. First, slot values are replaced by a placeholder and stored in a dictionary (“Weather in Paris” → “Weather in [City]”). The model is then trained on these delexicalized sentences and new delexicalized sentences are generated. A last step may consist in relexicalizing the generated sentences: abstract slot names are replaced by stored slot values. The effort is thus put on generating new contexts, rather than just shuffling slot values.
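The delexicalization and relexicalization steps can be sketched as below. This is a minimal illustration under simplifying assumptions: slot annotations are given as a name-to-value dict, each value appears verbatim in the sentence, and relexicalization picks a stored value at random. The function names and the `slot_store` structure are hypothetical.

```python
import random
import re

PLACEHOLDER = re.compile(r"\[(\w+)\]")


def delexicalize(sentence, slots, slot_store):
    """Replace each slot value by a [SlotName] placeholder and store the value."""
    for name, value in slots.items():
        sentence = sentence.replace(value, f"[{name}]")
        slot_store.setdefault(name, []).append(value)
    return sentence


def relexicalize(template, slot_store, rng=random):
    """Fill every [SlotName] placeholder with one of the stored values.

    Placeholders with no stored value are left untouched.
    """
    def fill(match):
        values = slot_store.get(match.group(1))
        return rng.choice(values) if values else match.group(0)

    return PLACEHOLDER.sub(fill, template)


store = {}
template = delexicalize("Weather in Paris", {"City": "Paris"}, store)
restored = relexicalize("Forecast for [City] please", store)
```

Because values are stored per slot name, a generated template can be filled with any value seen for that slot, not only the one from its source sentence.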

Note that if the slots are too loosely defined, one would in principle have to pay attention to context to relexicalize [4]. Here we assume that the slots are sufficiently specific to ignore this issue. We tried various strategies for the initialization of slot-embeddings (e.g. the average of all slot values) and found that it had no impact in our experiments. We therefore initialize them with random embeddings.

Training details. Both the encoder and the decoder of our model use one-layer GRUs, with a hidden layer of size and both the continuous and categorical latent spaces are of size (for the categorical one: 7 intents + one None class). We adopt the KL-annealing trick from [10] to avoid posterior collapse: the weight of the KL loss term is annealed from to using the logistic function, at a time and a rate given by two hyper-parameters and . The hyper-parameters were chosen to ensure satisfactory intent conditioning: and .
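A logistic annealing schedule of this kind can be sketched as follows. The names `midpoint` and `rate` are hypothetical labels for the two hyper-parameters, the endpoints 0 and 1 are an assumption of this sketch, and the numeric values in the example are placeholders rather than the settings used in the experiments.

```python
import math


def kl_weight(step, midpoint, rate):
    """Logistic schedule for the KL weight.

    The weight is close to 0 early in training, equals 0.5 at `midpoint`,
    and approaches 1 afterwards; `rate` sets how sharp the transition is.
    """
    z = rate * (step - midpoint)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # numerically stable branch for large negative z
    return e / (1.0 + e)


# Placeholder values, for illustration only.
schedule = [kl_weight(step, midpoint=1000, rate=0.01) for step in (0, 1000, 2000)]
```

Keeping the weight near zero at first lets the decoder learn to use the latent code before the regularization pushes the posterior towards the prior.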

The Adam optimization method is used and we train for epochs at a learning rate of with a batch size of . Depending on the size of the training data, it takes a few dozen minutes per experiment on a laptop. No word or embedding dropout is applied. The InferSent threshold is set to . Note that we draw a fixed number of samples from both the annotated dataset and the reservoir; however, since we are in a data scarcity regime and only consider small annotated datasets, this draw entails high variability. Hence all results presented are averaged over five random seeds.

3.2 Generation metrics

Choosing relevant metrics for generation tasks is always a tricky yet interesting question. Generally speaking, one must optimize a trade-off between quality and diversity of the generated sentences. Indeed, for data augmentation purposes, we want the generated sentences to both be consistent with the original dataset and bring novelty, which is somewhat in contradiction. We use the following metrics to assess quality and diversity.

To account for quality we first consider the intent conditioning accuracy. The generated sentences need to be well-conditioned on the intent imposed in the one-hot categorical variable during generation. We train an intent classifier based on a logistic regression on the full Snips dataset (2000 queries per intent), reaching near-perfect accuracy on the test set. We use this “oracle” classifier as a proxy for evaluating the accuracy of the intent conditioning. We then assess the semantic quality of the generated sentences by considering what is referred to in the following as the BLEU-quality, namely the forward Perplexity [14], or the BLEU score [15] computed against the reference sentences of the given intent.

To account for diversity, we consider the so-called BLEU-diversity, defined as 1 − self-BLEU, where self-BLEU is simply the BLEU score of the generated sentences of a given intent against the other generated sentences of the same intent [16]. Finally, the second diversity metric is what we call the originality. Indeed, enforcing diversity does not ensure that we are not just reproducing the training set. If the latter has high diversity, we may obtain high diversity by plagiarizing it. Therefore the originality is defined as the fraction of generated delexicalized queries that are not present in the training set.

These four metrics take values in [0, 1]. The last three metrics (BLEU, BLEU-diversity, originality) are evaluated intent-wise, which may be problematic if the intent conditioning of the generated sentences is poor. For example, if we condition on “PlayMusic” and the generated sentence is “What is the weather?”, the diversity metrics of the “PlayMusic” intent would be over-estimated while the quality would be under-estimated. To reduce this effect as much as possible, the computation of these metrics is restricted to generated sentences for which the oracle classifier agrees with the conditioning intent.
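The originality metric and the restriction to well-conditioned sentences can be sketched as follows. The `oracle` argument is any callable that maps a sentence to a predicted intent; it stands in for the logistic-regression classifier, and the toy rule used in the example is purely hypothetical.

```python
def keep_well_conditioned(generated, intent, oracle):
    """Keep only generated sentences that the oracle assigns to `intent`."""
    return [s for s in generated if oracle(s) == intent]


def originality(generated, training):
    """Fraction of generated (delexicalized) queries absent from the training set."""
    if not generated:
        return 0.0
    seen = set(training)
    return sum(s not in seen for s in generated) / len(generated)


# Toy example with a hypothetical keyword-based oracle.
def toy_oracle(sentence):
    return "GetWeather" if "weather" in sentence else "PlayMusic"


generated = ["play [Track]", "weather in [City]", "put on [Track]"]
kept = keep_well_conditioned(generated, "PlayMusic", toy_oracle)
score = originality(kept, training=["play [Track]"])
```

Filtering first means that a badly conditioned sentence neither inflates the diversity of an intent nor drags down its quality.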

4 Results

The code to reproduce all of our experiments is publicly available on GitHub.

4.1 Quality of generation

For all of the experiments described in this paragraph, we set and . As stated in Section 2, the transfer parameter controls the amount of transfer between the annotated dataset and the reservoir. The left panel of Fig. 2 indeed shows that the transfer parameter is a useful cursor for the diversity-quality trade-off. Increasing it yields generated sentences of higher quality (both in terms of intent conditioning and BLEU-quality) but lower diversity (in terms of BLEU-diversity and originality). Again, the optimal value of the transfer parameter is task-dependent and needs to be tuned accordingly: some tasks would rather require high quality, others would require high diversity (see the Appendix for an illustration on images).

To test the efficiency of the query transfer, we compare it to two baselines. The first one is simply a CVAE trained only on the annotated dataset (in blue on the middle panel of Fig. 2). The second one, referred to as pseudo-labelling (in orange on the figure), leverages queries from the reservoir that are directly associated with in-domain intents using InferSent-based similarity scores (the CVAE is trained without a None class). If the similarity score defined in Section 2 exceeds a certain threshold for a given intent, the sentence from the reservoir is directly added to the corresponding intent in the annotated dataset, on which the CVAE is trained. The middle panel of Fig. 2 shows that the proposed query transfer method improves the diversity metrics (especially the originality) of the generated sentences, with hardly any deterioration in quality. In comparison, the pseudo-labelling approach significantly deteriorates the quality of generated sentences.

Finally, the right panel of Fig. 2 displays the evolution of the generation metrics with the size of the reservoir. We observe a remarkable improvement of the diversity metrics when the number of sentences injected from the reservoir increases, without any loss in quality up to a certain point at which the quality degrades strongly. This is due to the imbalance introduced in the conditioning mechanism of the CVAE. A satisfying trade-off seems to be found for .

4.2 Data augmentation for language models

Table 1: Relative loss of perplexity (%) with respect to an LM trained on the original dataset, when varying the size of the original dataset and the augmentation ratio. Perplexities, averaged over 3 experiments, are computed on the test set for LMs trained on the original dataset, on the dataset augmented with generated sentences, and on the dataset augmented with real sentences, respectively. Results can only be compared row-wise because of the vocabulary restriction (see text).

In this section, we show that the proposed approach can effectively be used as a data augmentation technique for language modeling tasks. Indeed, leveraging in-domain language models in cascaded approaches (trained for a specific use case rather than in a large vocabulary setting) makes it possible to both reduce their size and increase their in-domain accuracy [17]. We hence propose to compare the perplexity [18] of Language Models (LM) trained on three datasets: the initial dataset of delexicalized sentences, a second dataset containing the initial one augmented by sentences generated by the CVAE model trained with query transfer, and a third dataset containing the initial one augmented by “real” sentences from the original Snips benchmark dataset.

The SRILM toolkit [19] was used to train 4-gram LMs with Kneser-Ney smoothing [20]. Perplexity is only comparable if the vocabulary supported by the various models is the same. To fix this issue, the words contained in at least one of the three datasets are added as unigrams with a count in every LM. Finally, the CVAE might generate sentences already present in the initial dataset, but every sentence is kept only once. The perplexity is evaluated on a pool of 700 test sentences, for 4 different data regimes (i.e. sizes of the initial dataset). The experiment is repeated 3 times. In this experiment, we set , and , consistently with the previous section.

Table 1 shows the results when varying the size of the initial dataset and the number of sentences generated by the augmentation process (augmentation ratios of and ). We see that the perplexity on the test set is consistently lower when the LM is trained on the dataset augmented with generated sentences rather than on the initial dataset alone, though it does not reach the performance of augmentation with real data. The improvement is less significant as the dataset size increases, illustrating that most phrasings of the various intents are already covered in this data regime. These results are encouraging and show that this technique could be used as a data augmentation process for SLU tasks, especially in the low data regime.

5 Discussion and conclusion

We introduce a method to alleviate data scarcity in conditional generation tasks where one has access to a large unlabelled dataset containing some potentially useful information, using conditional variational autoencoders. We present this approach in the context of sentence generation, but the same can be applied e.g. to visual data as shown in the Appendix. We choose to focus on the low data regime, as it is the most relevant for user-customized closed-domain dialogue systems, where gathering manually annotated datasets is very cumbersome.

Transferring knowledge from the large reservoir dataset to the original dataset comes with the risk of introducing unwanted information which may corrupt the generative model. However, this risk may be controlled by adjusting two parameters. First, we consider a selectivity threshold to adjust how much irrelevant data is discarded from the reservoir during preprocessing. The preprocessing procedure consists in evaluating the similarity of an example from the reservoir with the examples in the annotated dataset (in this context the cosine similarity of sentence embeddings). Second, we introduce a transfer parameter adjusting the supervision of unlabelled examples from the reservoir, with low values facilitating transfer from the reservoir.

In this paper, we mainly focus on assessing the performance of the proposed generation technique by introducing both quality and diversity metrics, and we show how the introduced parameters may help choose the best trade-off. We also illustrate our approach on a small language modelling task. The full potential of this method for more complex SLU tasks still needs to be explored and will be the subject of future work.

6 Appendix

Figure 3: MNIST dataset. (a): CVAE trained on a small labelled dataset of digits between 0 and 4 (10 images per class), without using the reservoir. (b)–(d): leveraging an unlabelled reservoir dataset of digits between 0 and 9 (50 images per class), with a varying transfer parameter. Here the best quality/diversity trade-off is reached around .
Figure 4: Fashion MNIST dataset. (a): CVAE trained on a small labelled dataset containing only the first 5 classes (10 images per class), without using the reservoir. (b)–(d): leveraging an unlabelled reservoir dataset containing all 10 classes (50 images per class), with a varying transfer parameter. Here the best quality/diversity trade-off is reached at .

We present below results on the MNIST and Fashion MNIST [21] datasets as toy examples to give another illustration of the transfer process. Here, the small annotated dataset contains only examples from the first 5 classes of each dataset, with 10 examples per class. The larger reservoir dataset contains examples from each of the 10 classes (half of its content being irrelevant to the generative task), with 50 examples per class.

Figs. 3 & 4 show results obtained by training a very simple two-layer fully-connected conditional variational autoencoder for 200 epochs, for various values of the transfer parameter. The code used to produce these figures is included in the GitHub repository. We see that without the reservoir dataset (panel (a) on both figures), there is not enough training data to generate high-quality diverse images. Using the reservoir with too low a value of the transfer parameter (second column) yields unwanted image transfer and corruption of the generated images (4’s get mixed up with 9’s and 7’s in MNIST, shoes get mixed up with jackets in Fashion MNIST). Conversely, if the transfer parameter is too high (panels (d) on both figures), there is not enough transfer and the generated images no longer benefit from the reservoir. However, for a well-chosen value (panels (c)), there is significant improvement both in quality and diversity of the generated images. As we can see here, the optimal value of the transfer parameter is dataset-dependent: for MNIST, for Fashion MNIST.