Condition-Transforming Variational AutoEncoder for Conversation Response Generation

04/24/2019 ∙ by Yu-Ping Ruan, et al. ∙ USTC ∙ Anhui USTC iFLYTEK Co

This paper proposes a new model, called condition-transforming variational autoencoder (CTVAE), to improve the performance of conversation response generation using conditional variational autoencoders (CVAEs). In conventional CVAEs, the prior distribution of the latent variable $z$ follows a multivariate Gaussian distribution with mean and variance modulated by the input conditions. Previous work found that this distribution tends to become condition-independent in practical applications. In our proposed CTVAE model, the latent variable $z$ is sampled by performing a non-linear transformation on the combination of the input conditions and the samples from a condition-independent prior distribution $\mathcal{N}(0, I)$. In our objective evaluations, the CTVAE model outperforms the CVAE model on fluency metrics and surpasses a sequence-to-sequence (Seq2Seq) model on diversity metrics. In subjective preference tests, our proposed CTVAE model performs significantly better than CVAE and Seq2Seq models on generating fluent, informative and topic-relevant responses.




1 Introduction

There has been a growing interest in neural-network-based end-to-end models for text generation tasks, including machine translation [1], text summarization [2], and conversation response generation [3, 4, 5]. Among these models, the encoder-decoder framework has been widely adopted; it principally learns the mapping from an input sequence to its target sequence. Although this framework has achieved great success in machine translation, previous studies on generating responses for chit-chat conversations [5, 6] have found that ordinary encoder-decoder models tend to generate dull, repeated and generic responses, such as “i don’t know” and “that’s ok”, which lack diversity. One possible reason is the deterministic calculation of ordinary encoder-decoder models, which prevents them from learning the one-to-many mapping relationship, especially on semantic connections, between an input sequence and its multiple potential target sequences. In chit-chat conversation, modeling and generating diverse responses is important because an input post or context may correspond to multiple responses with different meanings and language styles.

Many attempts have been made to alleviate these deficiencies of encoder-decoder models, such as utilizing extra features or knowledge as conditions to generate more specific responses [7, 8] and improving the model structure, the training algorithms and the decoding strategies [9, 10, 11]. Additionally, conditional variational autoencoders (CVAEs), which were originally proposed for image generation [12, 13], have recently been applied to dialog response generation [14, 15]. Variational generative models, including variational autoencoders (VAEs) and CVAEs, are suitable for learning the one-to-many mapping relationship due to their variational sampling mechanism for deriving latent representations.

This paper studies variational generative models for text generation in single-turn chit-chat conversations. The CVAE models used in previous work [14, 16, 12, 13, 15] all assumed that the prior distribution of the latent variable $z$ followed a multivariate Gaussian distribution $p(z|c)$ whose mean and variance were estimated by a prior network using the condition $c$ as input. However, previous studies on image generation [12, 17] found that the samples of $z$ from $p(z|c)$ tended to be independent of $c$ given the estimated models, which implied that the effect of the condition was constrained at the generation stage. In the conversation response generation task, the condition $c$ is in the form of natural language. The semantic space of $c$ in the training set is always sparse, which further increases the difficulty of estimating the prior network $p(z|c)$. To address this issue of CVAEs, we propose condition-transforming variational autoencoders (CTVAEs) in this paper. In contrast to CVAEs, which use prior networks to describe $p(z|c)$, a condition-independent prior distribution $\mathcal{N}(0, I)$ is adopted in CTVAEs. Then, another transformation network is built to derive the samples of $z$ for decoding by transforming the combination of the condition $c$ and samples from $\mathcal{N}(0, I)$.

Specifically, the contributions of this paper are two-fold. First, the subjective preference tests in this paper demonstrate that there is no significant performance gap between the ordinary CVAE model and a simplified CVAE model whose prior distribution is fixed as a condition-independent distribution, i.e., $\mathcal{N}(0, I)$, which implies that the effects of the condition-dependent prior distribution in CVAE were limited. Second, a new model, called CTVAE, is proposed to enhance the effect of the conditions in CVAEs. This model samples the condition-dependent latent variable $z$ by performing a non-linear transformation on the combination of the input condition and the samples from a condition-independent Gaussian distribution. In our experiments on generating short text conversations, the CTVAE model outperforms CVAE on objective fluency metrics and surpasses a sequence-to-sequence (Seq2Seq) model on objective diversity metrics. In subjective preference tests, our proposed CTVAE model performs significantly better than CVAE and Seq2Seq models on generating fluent, informative and topic-relevant responses.

2 Methodology

2.1 From CVAE to CTVAE

Figure 1: Graphical models of (a) CVAE and (b) CTVAE. In each subgraph, the left part shows the recognition process of the latent variable $z$ during the training stage, and the right part shows the process of generating the response $x$ during the testing stage. The dashed lines and the single solid lines represent the recognition network and the decoder network, respectively. The double solid line in (a) and the thick solid lines in (b) denote the prior network and the transformation network, respectively.

Figure 2: The model architecture of the CTVAE implemented in this paper. Input vectors are combined by concatenation. All the encoders and decoders are 1-layer LSTM-RNNs; both the recognition network and the transformation network are MLPs.

Figure 1 shows the directed graphical models of CVAE and CTVAE. In the single-turn short text conversation task, the condition $c$ is the input post and $x$ is the output response. As Figure 1(a) shows, a CVAE is composed of a prior network $p_\theta(z|c)$, a recognition network $q_\phi(z|x,c)$, and a decoder network $p_\theta(x|z,c)$. Both $p_\theta(z|c)$ and $q_\phi(z|x,c)$ are multivariate Gaussian distributions. The generative process of the response $x$ at the testing stage is as follows: sample a point $z$ from the prior distribution $p_\theta(z|c)$, then feed it into the decoder network $p_\theta(x|z,c)$. CVAEs can be efficiently trained with the stochastic gradient variational Bayes (SGVB) [18] framework by maximizing the lower bound of the conditional log-likelihood as follows,

$$\log p(x|c) \ge \mathbb{E}_{q_\phi(z|x,c)}\left[\log p_\theta(x|z,c)\right] - \mathrm{KL}\left(q_\phi(z|x,c) \,\|\, p_\theta(z|c)\right). \tag{1}$$
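As shown in Figure 1(b), a CTVAE has no prior network but adopts $p(\epsilon) = \mathcal{N}(0, I)$ as a transitional prior distribution to generate $z$. Similarly, a CTVAE includes a recognition network $q_\phi(\epsilon|x,c)$ and a decoder network $p_\theta(x|z,c)$. Additionally, CTVAEs use a non-linear transformation network $z = f_\psi(\epsilon, c)$ to derive the latent variable $z$ from the combination of $c$ and the samples of the transitional latent variable $\epsilon$. Following the training strategy for CVAEs, the model parameters of CTVAEs can be estimated by maximizing the lower bound of the conditional log-likelihood as follows,

$$\log p(x|c) \ge \mathbb{E}_{q_\phi(\epsilon|x,c)}\left[\log p_\theta\left(x \,|\, f_\psi(\epsilon, c), c\right)\right] - \mathrm{KL}\left(q_\phi(\epsilon|x,c) \,\|\, p(\epsilon)\right). \tag{2}$$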
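When the recognition distribution is a diagonal Gaussian and the transitional prior is $\mathcal{N}(0, I)$, the KL term of this lower bound has a well-known closed form. The following is a minimal sketch in plain Python, assuming the recognition network outputs a mean vector and a log-variance vector; the function names are hypothetical and the paper does not specify this parameterization.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def ctvae_lower_bound(recon_log_likelihood, mu, log_var):
    """Single-sample estimate of the bound: reconstruction term minus KL term.

    recon_log_likelihood is assumed to be log p(x | z, c) computed by a
    decoder for one sampled z; it is passed in as a number here.
    """
    return recon_log_likelihood - kl_to_standard_normal(mu, log_var)


if __name__ == "__main__":
    print(ctvae_lower_bound(-10.0, [1.0, 0.0], [0.0, 0.0]))
```

The KL term vanishes exactly when the recognition distribution equals the standard normal, so the bound then reduces to the reconstruction log-likelihood.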


2.2 Model Implementation

The model architecture of the CTVAE implemented in this paper is shown in Figure 2. Specifically, all the encoders and decoders are 1-layer recurrent neural networks with long short-term memory units (LSTM-RNNs). For an input post $c$ given as a sequence of words, we can derive the corresponding output hidden states by sending its word embedding sequence into the Condition Encoder. Then, the mean pooling of the hidden states is used to represent the condition post. Similarly, we can derive a vector representation for the response $x$ by inputting $x$ into the Output Encoder. The Recognition Network is a multi-layer perceptron (MLP), which has a hidden layer with softplus activation and a linear output layer in our implementation. The recognition network predicts the mean and variance parameters of a Gaussian, which gives the recognition distribution of the transitional latent variable $\epsilon$. The samples of the transitional latent variable $\epsilon$ generated by the recognition distribution are further used to derive the samples of the latent variable $z$ for reconstructing $x$ during training. To guarantee the feasibility of error backpropagation for model training, reparametrization [18] is performed to generate the samples of $\epsilon$. To derive the samples of the latent variable $z$, the sampled $\epsilon$ is concatenated with the condition representation and passed through a transformation network, which is an MLP with two hidden layers with tanh activation in our implementation. The output of the transformation network is used as the samples of the latent variable $z$. Starting from its initial hidden state, the decoder, a 1-layer LSTM-RNN, takes at each time step an input composed of the word embedding from the previous time step and the encoding vector, which is the concatenation of the condition representation and the $z$ samples. According to Eq. (2), the summation of the log-likelihood of reconstructing $x$ from the decoder and the negative KL divergence between the recognition distribution and the prior distribution of the transitional latent variable $\epsilon$ is used as the objective function for training.
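The sampling path described above (reparameterized sampling of the transitional variable, concatenation with the condition vector, then a two-hidden-layer tanh MLP) can be sketched in plain Python. This is only an illustration under stated assumptions: the layer sizes, the random initialization, the log-variance parameterization, and the linear output layer of the transformation network are all hypothetical choices, not details confirmed by the paper.

```python
import math
import random


def linear(weights, bias, vec):
    """Affine map: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def init_layer(rng, n_out, n_in, scale=0.1):
    """Gaussian-initialized weights and zero bias (assumed initialization)."""
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def reparameterize(rng, mu, log_var):
    """eps = mu + sigma * noise, with noise drawn from N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def transform(layers, eps, cond):
    """Map the concatenation [eps; cond] to z with tanh hidden layers."""
    h = eps + cond  # list concatenation of the two vectors
    *hidden, out = layers
    for weights, bias in hidden:
        h = [math.tanh(a) for a in linear(weights, bias, h)]
    weights, bias = out
    return linear(weights, bias, h)  # linear output layer (assumption)


rng = random.Random(0)
cond = [0.2, -0.1, 0.4]               # stand-in for the pooled post encoding
mu, log_var = [0.0, 0.0], [0.0, 0.0]  # N(0, I), as used at the testing stage
layers = [init_layer(rng, 4, 5), init_layer(rng, 4, 4), init_layer(rng, 2, 4)]
eps = reparameterize(rng, mu, log_var)
z = transform(layers, eps, cond)
```

During training, `mu` and `log_var` would come from the recognition network instead of being fixed, so gradients can flow through the sampled value into that network.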

In the CVAE built for comparison, all the encoders and the decoder have structures identical to those in CTVAE. Both the recognition network and the prior network have the same structure as the recognition network in CTVAE, except that the recognition network accepts the concatenation of the response representation and the condition representation as input.

2.3 Reranking Multiple Responses

To evaluate the performance of producing diverse responses using different models, multiple responses for each post are generated at the testing stage. Specifically, for the CVAE/CTVAE models, we first generate multiple samples of $z$. Then, for each sample, a beam search is adopted to return the best result. The multiple responses for each post are reranked using a topic coherence discrimination (TCD) model, which is trained based on the ESIM model [19]. Specifically, we replace all BiLSTMs in the ESIM with 1-layer LSTMs and define the objective of the TCD model as judging whether a response is a valid response to a given post. To train the TCD model, all post-response pairs in the training set are used as positive samples, and negative samples are constructed by randomly shuffling the mapping between posts and responses. Finally, ranking scores are adopted to rerank all responses generated for one post. The scores are calculated as a weighted combination of two terms. The first term is the log-likelihood of generating the response using the decoder network given the condition input to the decoder, i.e., the encoding vector in CVAE/CTVAE models. The second term is the logarithm of the output probability of the TCD model. A weight balances these two terms.
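The reranking procedure can be illustrated with a small sketch. The exact scoring formula is not recoverable from the text, so the code below assumes the form "decoder log-likelihood plus weight times log TCD probability"; the function names and the tuple layout are hypothetical. The second function illustrates building TCD training examples by shuffling the post-response mapping.

```python
import math
import random


def rerank(candidates, weight, top_k=5):
    """Rank candidates by decoder log-likelihood + weight * log TCD probability.

    candidates: list of (response, decoder_log_likelihood, tcd_probability),
    where tcd_probability must be strictly positive.
    """
    scored = [(ll + weight * math.log(p), resp) for resp, ll, p in candidates]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [resp for _, resp in scored[:top_k]]


def build_tcd_examples(pairs, rng):
    """Positive examples are the true pairs; negatives pair each post with a
    response drawn from a random shuffle of all responses.

    A shuffle can occasionally map a post back to its own response; this
    sketch does not filter such cases.
    """
    posts = [post for post, _ in pairs]
    shuffled = [response for _, response in pairs]
    rng.shuffle(shuffled)
    positives = [(post, response, 1) for post, response in pairs]
    negatives = [(post, response, 0) for post, response in zip(posts, shuffled)]
    return positives + negatives


candidates = [("a", -5.0, 0.9), ("b", -4.0, 0.1), ("c", -6.0, 0.8)]
top_two = rerank(candidates, weight=1.0, top_k=2)
examples = build_tcd_examples(
    [("p1", "r1"), ("p2", "r2"), ("p3", "r3"), ("p4", "r4")], random.Random(0)
)
```

With a weight of zero the ranking falls back to the decoder likelihood alone, which makes the contribution of the TCD term easy to inspect.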

3 Experiments

3.1 Dataset

The short text conversation (STC) dataset from NTCIR-12 [20] was used in our experiments. This dataset was crawled from Chinese Sina Weibo. It was originally prepared for retrieval models and had no standard division for generative models, so we filtered the post-response pairs in the raw STC dataset according to word frequencies to build our dataset. The dataset contains post-response pairs in which one post corresponds to multiple responses on average. Therefore, it contains the one-to-many mapping relationship and is appropriate for studying diverse text generation methods. We randomly split the data to build the training, development and test sets. There were no overlapping posts among these three sets.
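One way to obtain splits with no overlapping posts is to split at the post level rather than the pair level. The sketch below is an assumption about how such a split could be done, not the authors' procedure; the function name, the seed, and the set sizes are hypothetical.

```python
import random
from collections import defaultdict


def split_by_post(pairs, dev_posts, test_posts, seed=0):
    """Split (post, response) pairs so that no post appears in two sets."""
    by_post = defaultdict(list)
    for post, response in pairs:
        by_post[post].append(response)

    posts = sorted(by_post)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(posts)

    def expand(selected):
        return [(p, r) for p in selected for r in by_post[p]]

    test = posts[:test_posts]
    dev = posts[test_posts:test_posts + dev_posts]
    train = posts[test_posts + dev_posts:]
    return expand(train), expand(dev), expand(test)


pairs = [("p1", "r1"), ("p1", "r2"), ("p2", "r3"),
         ("p3", "r4"), ("p4", "r5"), ("p4", "r6")]
train, dev, test = split_by_post(pairs, dev_posts=1, test_posts=1)
```

Because all responses of a post travel together, the one-to-many structure of each post is preserved inside whichever set it lands in.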

3.2 Models in Our Experiments

In our experiments, we compared CTVAE with the following three baseline models: Seq2Seq, CVAE-simple, and CVAE. We did not include the models from the NTCIR-12 contest because they were all retrieval models.

Seq2Seq. As in previous studies [14, 15], we used the encoder-decoder neural network with attention as the baseline model, which was similar to that used for machine translation [1, 21]. Both the encoder and the decoder were 1-layer LSTM-RNNs, and the attention weights were obtained from the inner product of the hidden states.

CVAE-simple & CVAE. The CVAE model has been described in Section 2.2. As described in Section 1, the prior distribution in CVAEs was previously found to degrade to a condition-independent distribution. To verify this, we manually removed the prior network in the CVAE and fixed the prior distribution to $\mathcal{N}(0, I)$. This modified CVAE model was denoted as CVAE-simple.
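In terms of the training objective, the difference between the two baselines lies in the target of the KL term: a learned Gaussian prior in CVAE versus a fixed standard normal in CVAE-simple. A minimal sketch of the closed-form KL divergence between two diagonal Gaussians follows; the log-variance parameterization and the function name are assumptions made for illustration.

```python
import math


def kl_diag_gaussians(mu_q, log_var_q, mu_p, log_var_p):
    """Closed-form KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) )."""
    return 0.5 * sum(
        lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0
        for mq, lq, mp, lp in zip(mu_q, log_var_q, mu_p, log_var_p)
    )


# CVAE: the prior parameters would come from a prior network.
kl_cvae = kl_diag_gaussians([1.0], [0.0], [0.5], [0.2])
# CVAE-simple: the prior is fixed to N(0, I), i.e. zero mean, zero log-variance.
kl_simple = kl_diag_gaussians([1.0], [0.0], [0.0], [0.0])
```

Setting the prior mean and log-variance to zero recovers the standard-normal case, so the same function covers both baselines.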

3.3 Parameter Setting

We trained the models in our experiments with the following hyperparameters. All word embeddings, hidden layers of the recognition network and prior network, hidden layers of the transformation network, and hidden state vectors of the encoders and decoders had the same number of dimensions. The latent variables in CTVAE and in CVAE also shared a common dimensionality. Each encoder and decoder had word embeddings of its own over a fixed-size vocabulary. All word embeddings and model parameters were initialized randomly with Gaussian-distributed samples. Adam [22] was adopted for optimization with a preset initial learning rate and batch size. When training the CVAEs and CTVAEs, the KL annealing strategy [23] was adopted to address the issue of latent variable vanishing. The model parameters were pre-trained without optimizing the KL divergence term. Additionally, we adopted a training strategy that optimized the KL divergence (KLD) loss term every 3 steps but optimized the reconstruction negative log-likelihood (NLL) loss term at every step. As described in Section 2.3, we generated multiple responses for each post. Specifically, for the CVAE and CTVAE models, the number of samples was set to 50 and the beam search size was 20. For Seq2Seq, a beam search with beam size 50 was used to return multiple responses. The weight for reranking was set heuristically. The top-5 responses after reranking were used for evaluation in our experiments.
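The training schedule described above combines three ideas: a pre-training phase without the KL term, KL annealing, and applying the KLD term only every third step. The sketch below shows one way these could fit together. The linear annealing shape, the step counts, and the function names are assumptions made for illustration and are not taken from the paper.

```python
def kl_weight(step, pretrain_steps, anneal_steps):
    """KL weight: 0 during pre-training, then a linear ramp up to 1."""
    if step < pretrain_steps:
        return 0.0
    return min(1.0, (step - pretrain_steps) / anneal_steps)


def step_loss(step, nll, kld, pretrain_steps, anneal_steps, kld_every=3):
    """Loss for one training step.

    The NLL term is optimized at every step; the weighted KLD term is added
    only on every `kld_every`-th step after pre-training.
    """
    use_kld = step >= pretrain_steps and step % kld_every == 0
    weight = kl_weight(step, pretrain_steps, anneal_steps) if use_kld else 0.0
    return nll + weight * kld


if __name__ == "__main__":
    for step in (0, 60, 61, 1000):
        print(step, step_loss(step, nll=2.0, kld=4.0,
                              pretrain_steps=10, anneal_steps=100))
```

Skipping the KLD term on most steps gives the reconstruction term more updates, which is one plausible motivation for the strategy when the latent variable tends to be ignored.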

3.4 Objective Evaluation

                 Seq2Seq   CVAE-simple   CVAE    CTVAE
PPL on LM           7.61         31.82   36.96   21.75
Matching (%)       92.58          8.12   10.51   19.10
Table 1: The objective fluency performance of different models.

                 Seq2Seq   CVAE-simple   CVAE    CTVAE
Distinct-1 (%)      1.61         10.26   11.52    8.69
Distinct-2 (%)      5.26         41.23   42.6    33.44
Unique (%)         22.86         97.66   97.78   97.62
Table 2: The objective diversity performance of different models.

3.4.1 Fluency

We trained an RNN language model (LM) [24] using the same STC dataset to evaluate the fluency of the generated responses by calculating their perplexities, denoted as PPL on LM here. Furthermore, the percentage of generated responses that exactly matched any response in the training set was counted. This matching percentage was used as a metric to evaluate the model’s ability to generate fluent sentences with reasonable syntactic and semantic representations. For each model, 50 responses were generated for each of the unique posts in the test set, and the responses were reranked using the method described in Section 2.3. The average LM perplexity and matching percentage of all top-5 responses were calculated for each model, and the results are presented in Table 1. The Seq2Seq model achieved the lowest LM perplexity and the highest matching percentage because it tended to generate dull, generic and repeated responses. The CVAE models performed worst on these two fluency metrics. CTVAE performed much better than the CVAE models on both LM perplexity and matching percentage.
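Both fluency metrics are simple to compute once per-token log-probabilities from a language model are available. The sketch below assumes natural-log token probabilities and exact string matching; the function names are hypothetical, and how the paper averages perplexity over responses is not specified.

```python
import math


def perplexity(token_log_probs):
    """Perplexity of one response from its per-token natural-log probabilities."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def matching_percentage(generated, training_responses):
    """Percentage of generated responses that exactly match a training response."""
    seen = set(training_responses)
    return 100.0 * sum(response in seen for response in generated) / len(generated)


ppl = perplexity([math.log(0.5)] * 4)
match = matching_percentage(["a", "b", "c", "d"], ["a", "x"])
```

A high matching percentage together with a low perplexity is what one would expect from a model that mostly reproduces frequent training responses, which is consistent with the interpretation given for Seq2Seq.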

Metric            Pair   Seq2Seq     CVAE-simple   CVAE        CTVAE       N/P
Fluency           P1     32.8(4.0)   -             40.4(6.0)   -           26.8(5.3)
                  P2     -           25.6(1.6)     32.4(1.7)   -           42.0(3.0)
                  P3     -           -             23.6(3.4)   41.2(2.9)   35.2(6.0)
Topic relevance   P1     17.2(1.7)   -             63.2(3.2)   -           19.6(2.9)
                  P2     -           28.0(1.9)     34.4(1.1)   -           37.6(2.5)
                  P3     -           -             29.6(2.4)   40.8(1.9)   29.6(4.1)
Informativeness   P1     6.4(0.9)    -             79.6(3.1)   -           14.0(2.7)
                  P2     -           28.4(2.3)     28.0(2.1)   -           43.6(4.2)
                  P3     -           -             32.4(2.2)   47.2(1.1)   20.4(2.6)
Table 3: Average preference scores (%), with standard deviations in parentheses, on fluency, topic relevance, and informativeness for three model pairs (P1-P3), where N/P stands for “no preference” and $p$ denotes the $p$-value of a $t$-test between two models.

3.4.2 Diversity

The percentages of distinct unigrams and bigrams [6] in the generated top-5 responses were used to evaluate the diversity of the generated responses. These two percentages, denoted as distinct-1 and distinct-2, respectively, represented the diversity at the n-gram level. We also counted the percentage of unique response sentences, which evaluated the diversity of responses at the sentence level. The results for the four models are presented in Table 2. The Seq2Seq model had the worst diversity at both the n-gram level and the sentence level. CVAE performed slightly better than CVAE-simple, and both CVAE models achieved better diversity at the n-gram level than the CTVAE model, especially for distinct-2. According to the results of Section 3.4.1, the CVAE models performed worst on fluency, which may lead to higher diversity at the surface-text level. At the sentence level, the percentage of unique responses achieved by CTVAE was close to that of the CVAE models.
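The diversity metrics can be computed directly from tokenized responses. In the sketch below, distinct-n is taken as the number of distinct n-grams divided by the total number of n-grams over all responses; this normalization and the function names are assumptions, since the paper only states that percentages of distinct unigrams and bigrams were used.

```python
def distinct_n(responses, n):
    """Percentage of distinct n-grams among all n-grams in the responses."""
    ngrams = [
        tuple(tokens[i:i + n])
        for tokens in responses
        for i in range(len(tokens) - n + 1)
    ]
    return 100.0 * len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def unique_percentage(responses):
    """Percentage of unique response sentences."""
    return 100.0 * len({tuple(tokens) for tokens in responses}) / len(responses)


responses = [
    ["i", "do", "not", "know"],
    ["i", "do", "not", "care"],
    ["i", "do", "not", "know"],
]
d1 = distinct_n(responses, 1)
d2 = distinct_n(responses, 2)
uniq = unique_percentage(responses)
```

In this toy input the repeated sentence lowers all three numbers, which mirrors how generic, repeated responses depress the Seq2Seq scores in Table 2.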

3.5 Subjective Evaluation

It is difficult to evaluate the final performance of the generated conversation responses using objective metrics, such as BLEU. It has been argued that such objective metrics for machine translation are very weakly correlated with human judgment in dialog generation [25]. To evaluate the responses generated by our models in a more comprehensive and convincing manner, several groups of subjective ABX preference tests were conducted. We randomly chose 50 posts from the test set and generated the top-5 responses from each model for each post. The responses generated by two models were compared in each test. Five native Chinese speakers with rich Sina Weibo experience were recruited for the evaluation. For each test post, a pair of top-5 response sets generated by two models was presented in random order. The evaluators were asked to judge which set of top-5 responses in each pair was preferred, or whether there was no preference, on three subjective metrics: fluency, topic relevance, and informativeness. Fluency was used to evaluate the quality of grammar and the semantic logic of responses. Topic relevance measured whether a response matched the topic of the given post. Informativeness measured how informative and interesting a response was. In addition to calculating the average preference scores, the $p$-value of a $t$-test was adopted to measure the significance of the difference between two models. Several significance levels were examined; results that reached none of them were treated as showing no significant difference between two models. The subjective evaluation results are presented in Table 3.
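The average preference scores and their standard deviations can be derived from the raw judgments as sketched below. This assumes that a percentage is computed per evaluator and that the mean and sample standard deviation are then taken across evaluators; the paper does not state how the standard deviation was computed, and the function name and label encoding ("A", "B", "N" for no preference) are hypothetical.

```python
from statistics import mean, stdev


def preference_scores(judgments):
    """Mean and sample std (in %) of each label across evaluators.

    judgments: one list of labels per evaluator, each label being
    "A" (first model preferred), "B" (second model preferred) or
    "N" (no preference). At least two evaluators are required.
    """
    labels = ("A", "B", "N")
    per_evaluator = {
        label: [100.0 * votes.count(label) / len(votes) for votes in judgments]
        for label in labels
    }
    return {label: (mean(v), stdev(v)) for label, v in per_evaluator.items()}


judgments = [
    ["A", "A", "B", "N"],
    ["A", "B", "B", "N"],
]
scores = preference_scores(judgments)
```

Because every judgment falls into exactly one of the three labels, the three mean percentages always sum to 100, as the rows of Table 3 do.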

According to the results of model pair P2 in Table 3, there is no significant difference on any of the three metrics between CVAE and CVAE-simple, whose prior latent distribution was condition-independent. This implies that the effects of the condition-dependent prior distribution in the CVAE model were limited. From model pair P1, it can be found that CVAE outperformed Seq2Seq significantly on all metrics except fluency. On the other hand, the results of model pair P3 show that CTVAE outperformed CVAE significantly on all three metrics, which confirms the effectiveness of our proposed CTVAE model. These results indicate that our CTVAE model can derive $z$ samples with better condition-dependency than the CVAE model.

Figure 3: An example of the top-5 responses generated by Seq2Seq, CVAE, and CTVAE for the post “It will be sunny after the rain in Beijing today. Night owls, please get up early today. (今天北京雨后天晴,熬夜人啊,早起吧。)”.

Case study. Figure 3 shows one typical example of the top-5 responses generated by the Seq2Seq, CVAE, and CTVAE models. We can see that the Seq2Seq model tended to generate dull and generic responses. Furthermore, the responses of CTVAE tended to be more topic-relevant and informative than those of CVAE.

4 Conclusion

We have proposed a model named condition-transforming variational autoencoder (CTVAE) for diverse text generation. In this model, the samples of the latent variable $z$ are derived by performing a non-linear transformation on the combination of the input condition and the samples from a prior Gaussian distribution $\mathcal{N}(0, I)$. In our experiments on single-turn short text conversation, the CTVAE outperformed the CVAE model on objective fluency metrics, surpassed the Seq2Seq model on objective diversity metrics, and was significantly preferred over the CVAE model in subjective evaluations, which indicates that the CTVAE can derive $z$ samples with better condition-dependency than CVAE models. Applying the proposed CTVAE model to multi-turn conversation response generation and pursuing controllable sampling of the latent variable will be our future work.