1 Introduction
There has been growing interest in neural-network-based end-to-end models for text generation tasks, including machine translation
[1, 2] and conversation response generation [3, 4, 5]. Among these, the encoder-decoder framework has been widely adopted; it principally learns the mapping from an input sequence to its target sequence. Although this framework has achieved great success in machine translation, previous studies on generating responses for chit-chat conversations [5, 6] have found that ordinary encoder-decoder models tend to generate dull, repeated and generic responses, such as “i don’t know” and “that’s ok”, which lack diversity. One possible reason is that the deterministic computation of ordinary encoder-decoder models constrains them from learning the one-to-many mapping relationship, especially the semantic connections, between an input sequence and multiple potential target sequences. In chit-chat conversation, modeling and generating the diversity of responses is important because an input post or context may correspond to multiple responses with different meanings and language styles.

Many attempts have been made to alleviate these deficiencies of encoder-decoder models, such as utilizing extra features or knowledge as conditions to generate more specific responses [7, 8], or improving the model structure, training algorithms and decoding strategies [9, 10, 11]. Additionally, conditional variational autoencoders (CVAEs), which were originally proposed for image generation [12, 13], have recently been applied to dialog response generation [14, 15]. Variational generative models, including variational autoencoders (VAEs) and CVAEs, are suitable for learning the one-to-many mapping relationship due to the variational sampling mechanism they use to derive latent representations.
This paper studies variational generative models for text generation in single-turn chit-chat conversations. The CVAE models used in previous work [14, 16, 12, 13, 15] all assumed that the prior distribution of the latent variable z followed a multivariate Gaussian distribution whose mean and variance were estimated by a prior network p(z|c) using the condition c as input. However, previous studies on image generation [12, 17] found that the samples of z drawn from p(z|c) tended to be independent of c given the estimated models, which implied that the effect of the condition was constrained at the generation stage. In the conversation response generation task, the condition c is in the form of natural language. The semantic space of c in the training set is always sparse, which further increases the difficulty of estimating the prior network p(z|c). To address this issue of CVAEs, we propose condition-transforming variational autoencoders (CTVAEs) in this paper. In contrast to CVAEs, which use prior networks to describe p(z|c), a condition-independent prior distribution N(0, I) is adopted in CTVAEs. Then, another transformation network is built to derive the samples of z for decoding by transforming the combination of the condition c and samples from N(0, I).

Specifically, the contributions of this paper are twofold. First, the subjective preference tests in this paper demonstrate that there is no significant performance gap between the ordinary CVAE model and a simplified CVAE model whose prior distribution is fixed as a condition-independent distribution, i.e., N(0, I), which implies that the effects of the condition-dependent prior distribution in CVAEs are limited. Second, a new model, called CTVAE, is proposed to enhance the effect of the conditions in CVAEs. This model samples the condition-dependent latent variable z by performing a nonlinear transformation on the combination of the input condition and samples from a condition-independent Gaussian distribution. In our experiments on generating short text conversations, the CTVAE model outperforms CVAE on objective fluency metrics and surpasses a sequence-to-sequence (Seq2Seq) model on objective diversity metrics. In subjective preference tests, our proposed CTVAE model performs significantly better than the CVAE and Seq2Seq models at generating fluent, informative and topic-relevant responses.
2 Methodology
2.1 From CVAE to CTVAE
Figure 1 shows the directed graphical models of CVAE and CTVAE. In the single-turn short text conversation task, the condition c is the input post and x is the output response. As Figure 1(a) shows, a CVAE is composed of a prior network p(z|c), a recognition network q(z|x, c), and a decoder network p(x|z, c). Both p(z|c) and q(z|x, c) are multivariate Gaussian distributions. The generative process for a response at the testing stage is as follows: sample a point z from the prior distribution p(z|c), then feed it into the decoder network. CVAEs can be efficiently trained within the stochastic gradient variational Bayes (SGVB) [18] framework by maximizing the following lower bound on the conditional log-likelihood,
\mathcal{L}(x, c) = \mathbb{E}_{q(z|x,c)}[\log p(x|z,c)] - \mathrm{KL}(q(z|x,c) \,\|\, p(z|c)).   (1)
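For reference, the KL term in Eq. (1) has a closed form when both distributions are diagonal Gaussians. A minimal NumPy sketch (the function name and array shapes are illustrative, not from the paper):

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians,
    summed over the latent dimensions."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(
        logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

# KL between a distribution and itself is zero.
mu = np.array([0.3, -0.7])
logvar = np.array([0.1, -0.2])
print(gaussian_kl(mu, logvar, mu, logvar))  # → 0.0
```

When the prior is fixed to N(0, I), setting `mu_p` and `logvar_p` to zeros recovers the familiar VAE regularizer.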
As shown in Figure 1(b), a CTVAE has no prior network but instead adopts N(0, I) as a transitional prior distribution to generate a transitional latent variable z'. Similar to a CVAE, a CTVAE includes a recognition network q(z'|x, c) and a decoder network p(x|z, c). Additionally, CTVAEs use a nonlinear transformation network f to sample the latent variable z from the combination of the condition c and the samples of the transitional latent variable z', i.e., z = f(c, z'). Following the training strategy for CVAEs, the model parameters of CTVAEs can be estimated by maximizing the following lower bound on the conditional log-likelihood,
\mathcal{L}(x, c) = \mathbb{E}_{q(z'|x,c)}[\log p(x \mid z = f(c, z'), c)] - \mathrm{KL}(q(z'|x,c) \,\|\, \mathcal{N}(0, I)).   (2)
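The CTVAE sampling step described above can be sketched as follows. This is a minimal NumPy illustration; the layer sizes and random weights stand in for a trained transformation network:

```python
import numpy as np

rng = np.random.default_rng(0)

def transform_sample(c, dim_z, hidden=16, rng=rng):
    """CTVAE-style sampling: draw the transitional latent variable z'
    from N(0, I), then pass the concatenation [c; z'] through a
    2-hidden-layer tanh MLP to obtain a sample of z.
    Weights here are random placeholders, not trained parameters."""
    z_prime = rng.standard_normal(dim_z)      # z' ~ N(0, I)
    h = np.concatenate([c, z_prime])          # combine condition and z'
    w1 = rng.standard_normal((hidden, h.size))
    w2 = rng.standard_normal((hidden, hidden))
    w3 = rng.standard_normal((dim_z, hidden))
    h = np.tanh(w1 @ h)
    h = np.tanh(w2 @ h)
    return w3 @ h                             # sample of z for the decoder

c = rng.standard_normal(8)                    # condition representation
z = transform_sample(c, dim_z=4)
print(z.shape)  # → (4,)
```

Because z is a deterministic function of c and the random draw z', the condition directly shapes every latent sample, which is the intended contrast with a CVAE prior network.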
2.2 Model Implementation
The model architecture of the CTVAE implemented in this paper is shown in Figure 2. Specifically, all the encoders and decoders are 1-layer recurrent neural networks with long short-term memory (LSTM) units. For an input post c, we derive the corresponding output hidden states by feeding its word embedding sequence into the Condition Encoder. Then, the mean pooling of these hidden states is used to represent the condition post, denoted as e_c. Similarly, we derive the vector representation e_x for the response x by feeding x into the Output Encoder. The Recognition Network is a multilayer perceptron (MLP), which in our implementation has a hidden layer with softplus activation and a linear output layer. The recognition network predicts the mean and variance of q(z'|x, c) from e_x and e_c. The samples of the transitional latent variable z' generated by q(z'|x, c) are further used to derive the samples of the latent variable z for reconstructing x during training. To guarantee the feasibility of error backpropagation during model training, the reparametrization trick [18] is used to generate the samples of z'. To derive the samples of the latent variable z, the sampled z' is concatenated with the condition representation e_c and passed through a transformation network, which in our implementation is an MLP with two tanh-activated hidden layers. The output of the transformation network is used as the sample of the latent variable z. The initial hidden state of the Decoder is e_c. At each time step of the 1-layer LSTM-RNN decoder, the input is composed of the word embedding from the previous time step and an encoding vector, which is the concatenation of e_c and the z sample. According to Eq. (2), the sum of the log-likelihood of reconstructing x from the decoder and the negative KL divergence between q(z'|x, c) and the prior distribution of the transitional latent variable is used as the objective function for training.

In the CVAE built for comparison, all encoders and the decoder have structures identical to those in the CTVAE. Both its recognition network and prior network have the same structure as the recognition network in the CTVAE, except that the recognition network accepts the concatenation of e_x and e_c as input.
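The reparametrization step used above can be sketched as follows (a minimal NumPy sketch, not the paper's implementation):

```python
import numpy as np

def reparameterize(mu, logvar, rng):
    """z' = mu + sigma * eps with eps ~ N(0, I): the stochasticity is
    moved into eps, so sampling stays differentiable in mu and logvar."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

rng = np.random.default_rng(0)
# With mu = 0 and logvar = 0 the samples follow a standard normal.
samples = np.array([reparameterize(np.zeros(2), np.zeros(2), rng)
                    for _ in range(2000)])
print(samples.mean(), samples.std())  # both close to 0 and 1
```

In training, `mu` and `logvar` would come from the recognition network, and gradients flow through them because `eps` carries all the randomness.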
2.3 Reranking Multiple Responses
To evaluate the performance of different models at producing diverse responses, multiple responses for each post are generated at the testing stage. Specifically, for the CVAE/CTVAE models, we first generate multiple samples of the latent variable z. Then, for each sample, a beam search is adopted to return the best result. The multiple responses for each post are reranked using a topic coherence discrimination (TCD) model, which is trained based on the ESIM model [19]. Specifically, we replace all BiLSTMs in the ESIM with 1-layer LSTMs and define the objective of the TCD model as judging whether a response is a valid response to a given post. To train the TCD model, all post-response pairs in the training set are used as positive samples, and negative samples are constructed by randomly shuffling the mapping between posts and responses. Finally, ranking scores are adopted to rerank all responses generated for one post. The score of a response is the sum of two terms: the log-likelihood of generating the response using the decoder network given its condition input (i.e., the concatenation of e_c and the z sample in the CVAE/CTVAE models), and the weighted log-likelihood of the output probability of the TCD model; the weight λ balances these two terms.

3 Experiments
3.1 Dataset
The short text conversation (STC) dataset from NTCIR-12 [20] was used in our experiments. This dataset was crawled from Chinese Sina Weibo (https://weibo.com/) and contains post-response pairs in which one post corresponds to multiple responses on average. (The raw STC dataset was originally prepared for retrieval models and had no standard division for generative models; we filtered its post-response pairs according to word frequencies to build our dataset.) It therefore contains the one-to-many mapping relationship and is appropriate for studying diverse text generation methods. We randomly split the data into training, development and test sets, with no overlapping posts among the three sets.
3.2 Models in Our Experiments
In our experiments, we compared CTVAE with the following three baseline models: Seq2Seq, CVAE-simple, and CVAE. We did not include the models from the NTCIR-12 contest because they were all retrieval models.
Seq2Seq As in previous studies [14, 15], we used an encoder-decoder neural network with attention as the baseline model, similar to those used for machine translation [1, 21]. Both the encoder and decoder were 1-layer LSTM-RNNs, and the attention weights were obtained from the inner product of the hidden states.
CVAE-simple & CVAE The CVAE model has been described in Section 2.2. As described in Section 1, the prior distribution p(z|c) in CVAEs was previously found to degrade to a condition-independent distribution. To verify this, we manually removed the prior network from the CVAE and fixed the prior distribution to N(0, I). This modified CVAE model is denoted CVAE-simple.
3.3 Parameter Setting
We trained the models in our experiments with the following hyperparameters. The word embeddings, the hidden layers of the recognition network and prior network, the hidden layers of the transformation network, and the hidden state vectors of the encoders and decoders all shared the same dimensionality, and the latent variables z' in CTVAE and z in CVAE had the same number of dimensions. Each encoder and decoder had word embeddings of its own. All word embeddings and model parameters were initialized randomly with Gaussian-distributed samples, and Adam [22] was adopted for optimization. When training the CVAEs and CTVAEs, the KL annealing strategy [23] was adopted to address the issue of latent variable vanishing: the model parameters were pretrained without optimizing the KL divergence term. Additionally, we adopted a training strategy that optimized the KL divergence (KLD) loss term every 3 steps but optimized the reconstruction negative log-likelihood (NLL) loss term every step. As described in Section 2.3, we generated multiple responses for each post. Specifically, for the CVAE and CTVAE models, the number of latent samples was set to 50 and the beam search size to 20; for Seq2Seq, a beam search with beam size 50 was used to return multiple responses. The weight λ for reranking was set heuristically. The top-5 responses after reranking were used for evaluation in our experiments.

3.4 Objective Evaluation
Table 1: Objective evaluation results on fluency metrics.

             Seq2Seq   CVAE-simple   CVAE    CTVAE
PPL on LM    7.61      31.82         36.96   21.75
Matching(%)  92.58     8.12          10.51   19.10
Table 2: Objective evaluation results on diversity metrics.

               Seq2Seq   CVAE-simple   CVAE    CTVAE
Distinct-1(%)  1.61      10.26         11.52   8.69
Distinct-2(%)  5.26      41.23         42.60   33.44
Unique(%)      22.86     97.66         97.78   97.62
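The distinct-n percentages in Table 2 are the ratios of distinct n-grams to total n-grams over a set of generated responses [6]. A minimal sketch (the toy responses and whitespace tokenization are illustrative):

```python
def distinct_n(responses, n):
    """Ratio of distinct n-grams to total n-grams over a list of
    tokenized responses (the distinct-n metric of Li et al. [6])."""
    ngrams = [tuple(toks[i:i + n])
              for toks in responses
              for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

resps = [["i", "don't", "know"],
         ["i", "don't", "know"],   # a duplicated generic response
         ["sounds", "great"]]
print(distinct_n(resps, 1))  # → 0.625 (5 distinct unigrams / 8 total)
```

Repeated generic responses drive the ratio down, which is why the Seq2Seq baseline scores so low on these metrics.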
3.4.1 Fluency
We trained an RNN language model (LM) [24] on the same STC dataset to evaluate the fluency of the generated responses by calculating their perplexities, denoted here as PPL on LM. Furthermore, the percentage of generated responses that exactly matched any response in the training set was counted. This matching percentage was used as a metric of a model's ability to generate fluent sentences with reasonable syntactic and semantic structure. For each model, 50 responses were generated for the unique posts in the test set, and the responses were reranked using the method described in Section 2.3. The average LM perplexity and matching percentage of all top-5 responses were calculated for each model, and the results are presented in Table 1. It can be found that the Seq2Seq model achieved the lowest perplexity on the LM and the highest matching percentage because it tended to generate dull, generic and repeated responses. The CVAE models performed worst on these two fluency metrics, while the CTVAE performed much better than the CVAE models on both LM perplexity and matching percentage.
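The LM perplexity used above can be computed from per-token log-probabilities produced by any language model; a minimal sketch (the tokenization and the LM itself are outside this snippet):

```python
import math

def perplexity(token_logprobs):
    """Perplexity of a sequence given per-token natural-log probabilities
    from a language model: exp of the average negative log-probability."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# A model that assigns uniform probability over a 10-word vocabulary
# yields a perplexity of 10.
lp = [math.log(0.1)] * 5
print(round(perplexity(lp), 6))  # → 10.0
```

Lower perplexity means the responses look more like the LM's training text, which is why dull, frequent phrases score deceptively well.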
Table 3: Subjective ABX preference test results (%). N/P denotes no preference.

Metric           Pair  Seq2Seq     CVAE-simple  CVAE        CTVAE       N/P
Fluency          P1    32.8(4.0)   –            40.4(6.0)   –           26.8(5.3)
                 P2    –           25.6(1.6)    32.4(1.7)   –           42.0(3.0)
                 P3    –           –            23.6(3.4)   41.2(2.9)   35.2(6.0)
Topic relevance  P1    17.2(1.7)   –            63.2(3.2)   –           19.6(2.9)
                 P2    –           28.0(1.9)    34.4(1.1)   –           37.6(2.5)
                 P3    –           –            29.6(2.4)   40.8(1.9)   29.6(4.1)
Informativeness  P1    6.4(0.9)    –            79.6(3.1)   –           14.0(2.7)
                 P2    –           28.4(2.3)    28.0(2.1)   –           43.6(4.2)
                 P3    –           –            32.4(2.2)   47.2(1.1)   20.4(2.6)
3.4.2 Diversity
The percentages of distinct unigrams and bigrams [6] in the generated top-5 responses were used to evaluate the diversity of the generated responses. These two percentages, denoted as distinct-1 and distinct-2 respectively, represent diversity at the n-gram level. We also counted the percentage of unique response sentences, which evaluates diversity at the sentence level. The results for the four models are presented in Table 2. It can be found that the Seq2Seq model had the worst diversity at both the n-gram level and the sentence level. CVAE performed slightly better than CVAE-simple, and both CVAE models achieved better n-gram-level diversity than the CTVAE model, especially on distinct-2. According to the results in Section 3.4.1, the CVAE models performed worst on fluency, which may lead to higher diversity at the surface-text level. For diversity at the sentence level, the percentage of unique responses achieved by CTVAE was close to that of the CVAE models.

3.5 Subjective Evaluation
It is difficult to evaluate the final quality of generated conversation responses using objective metrics such as BLEU; it has been argued that such machine translation metrics correlate very weakly with human judgment in dialog generation [25]. To evaluate the responses generated by our models in a more comprehensive and convincing manner, several groups of subjective ABX preference tests were conducted. We randomly chose 50 posts from the test set and generated the top-5 responses from each model for each post. The responses generated by two models were compared in each test. Five native Chinese speakers with rich Sina Weibo experience were recruited for the evaluation. For each test post, the pairs of top-5 responses generated by two models were presented in random order. The evaluators were asked to judge which top-5 responses in each pair were preferred, or whether there was no preference, on three subjective metrics: fluency, topic relevance, and informativeness. Fluency evaluates the quality of the grammar and semantic logic of the responses; topic relevance measures whether a response matches the topic of the given post; informativeness measures how informative and interesting a response is. In addition to calculating the average preference scores, the p-value of a significance test was adopted to measure the significance of the difference between two models at several significance levels; a p-value above these levels indicated no significant difference between the two models. The subjective evaluation results are presented in Table 3.
According to the results for model pair P2 in Table 3, there is no significant difference on any of the three metrics between CVAE and CVAE-simple, whose prior latent distribution is condition-independent; this implies that the effects of the condition-dependent prior distribution in the CVAE model were limited. From model pair P1, it can be found that CVAE significantly outperformed Seq2Seq on all metrics except fluency. The results for model pair P3 show that CTVAE significantly outperformed CVAE on all three metrics, which confirms the effectiveness of our proposed CTVAE model. These results indicate that our CTVAE model can derive samples with better condition-dependency than the CVAE model.
Case study Figure 3 shows one typical example of the top-5 responses generated by the Seq2Seq, CVAE, and CTVAE models. We can see that the Seq2Seq model tended to generate dull and generic responses, while the responses of CTVAE tended to be more topic-relevant and informative than those of CVAE.
4 Conclusion
We have proposed a model named the condition-transforming variational autoencoder (CTVAE) for diverse text generation. In this model, the samples of the latent variable z are derived by performing a nonlinear transformation on the combination of the input condition and samples from a prior Gaussian distribution N(0, I). In our experiments on single-turn short text conversation, the CTVAE outperformed the Seq2Seq and CVAE models in both objective and subjective evaluations, which indicates that the CTVAE can derive samples with better condition-dependency than CVAE models. Applying the proposed CTVAE model to multi-turn conversation response generation and pursuing controllable sampling of the latent variable will be our future work.
References
[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
[2] Alexander M. Rush, Sumit Chopra, and Jason Weston, “A neural attention model for abstractive sentence summarization,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 379–389.
[3] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le, “Sequence to sequence learning with neural networks,” in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
[4] Lifeng Shang, Zhengdong Lu, and Hang Li, “Neural responding machine for short-text conversation,” in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2015, pp. 1577–1586.
[5] Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C. Courville, and Joelle Pineau, “Building end-to-end dialogue systems using generative hierarchical neural network models,” in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2016, pp. 3776–3784.
[6] Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan, “A diversity-promoting objective function for neural conversation models,” arXiv preprint arXiv:1510.03055, 2015.
[7] Chen Xing, Wei Wu, Yu Wu, Jie Liu, Yalou Huang, Ming Zhou, and Wei-Ying Ma, “Topic aware neural response generation,” in AAAI, 2017, pp. 3351–3357.
[8] Marjan Ghazvininejad, Chris Brockett, Ming-Wei Chang, Bill Dolan, Jianfeng Gao, Wen-tau Yih, and Michel Galley, “A knowledge-grounded neural conversation model,” arXiv preprint arXiv:1702.01932, 2017.
[9] Yu Wu, Wei Wu, Dejian Yang, Can Xu, Zhoujun Li, and Ming Zhou, “Neural response generation with dynamic vocabularies,” arXiv preprint arXiv:1711.11191, 2017.
[10] Jiwei Li, Will Monroe, Alan Ritter, Dan Jurafsky, Michel Galley, and Jianfeng Gao, “Deep reinforcement learning for dialogue generation,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 1192–1202.
[11] Ganbin Zhou, Ping Luo, Rongyu Cao, Fen Lin, Bo Chen, and Qing He, “Mechanism-aware neural machine for dialogue response generation,” in AAAI, 2017, pp. 3400–3407.
[12] Kihyuk Sohn, Honglak Lee, and Xinchen Yan, “Learning structured output representation using deep conditional generative models,” in Advances in Neural Information Processing Systems, 2015, pp. 3483–3491.
[13] Xinchen Yan, Jimei Yang, Kihyuk Sohn, and Honglak Lee, “Attribute2image: Conditional image generation from visual attributes,” in European Conference on Computer Vision. Springer, 2016, pp. 776–791.
[14] Tiancheng Zhao, Ran Zhao, and Maxine Eskenazi, “Learning discourse-level diversity for neural dialog models using conditional variational autoencoders,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 654–664.
[15] Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron C. Courville, and Yoshua Bengio, “A hierarchical latent variable encoder-decoder model for generating dialogues,” in AAAI, 2017, pp. 3295–3301.
[16] Xiaopeng Yang, Xiaowen Lin, Shunda Suo, and Ming Li, “Generating thematic chinese poetry with conditional variational autoencoder,” arXiv preprint arXiv:1711.07632, 2017.
[17] Diederik P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling, “Semi-supervised learning with deep generative models,” in Advances in Neural Information Processing Systems, 2014, pp. 3581–3589.
[18] Diederik P. Kingma and Max Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
[19] Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen, “Enhanced LSTM for natural language inference,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 1657–1668.
[20] Lifeng Shang, Tetsuya Sakai, Zhengdong Lu, Hang Li, Ryuichiro Higashinaka, and Yusuke Miyao, “Overview of the NTCIR-12 short text conversation task,” in NTCIR, 2016.
[21] Thang Luong, Hieu Pham, and Christopher D. Manning, “Effective approaches to attention-based neural machine translation,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 1412–1421.
[22] Diederik Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[23] Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy Bengio, “Generating sentences from a continuous space,” in Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, 2016, pp. 10–21.
[24] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur, “Recurrent neural network based language model,” in Eleventh Annual Conference of the International Speech Communication Association, 2010.
[25] Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau, “How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 2122–2132.