Sentiment is a fundamental part of human communication, and reflecting sentiment in human-computer interfaces is key to making them engaging and interesting to use. This is certainly true for dialog systems, and there is a large body of literature that attempts to equip dialogue systems with the ability to understand and express sentiment [Polzin and Waibel2000, Skowron et al.2011, Partala and Surakka2004, Prendinger and Ishizuka2005, Hasegawa et al.2013, Zhou et al.2017, Shen et al.2017]. However, these methods are based on either templates or rules, which require extensive hand engineering.
End-to-end neural dialogue generation [Shang, Lu, and Li2015, Vinyals and Le2015, Serban et al.2016] is now a popular research topic because of the ease and flexibility of creating systems within this paradigm. While early works [Shang, Lu, and Li2015, Vinyals and Le2015] simply employ sequence-to-sequence (SEQ2SEQ) models similar to those used in machine translation [Cho et al.2014, Sutskever, Vinyals, and Le2014], a number of papers have aimed to further improve the quality and diversity of dialogue responses in ways specific to dialog systems [Li et al.2015, Li et al.2016, Gu et al.2016, Xing et al.2016, Li et al.2017, Serban et al.2017, Zhao, Zhao, and Eskenazi2017].
There have also been a few attempts [Skowron et al.2011, Hasegawa et al.2013, Shen et al.2017, Zhou et al.2017] to incorporate sentiment information into data-driven end-to-end dialog systems, but each has its own shortcomings. For example, Hasegawa et al. (2013) propose a method to train an individual system for each kind of emotion, which causes the system to suffer from data sparsity and high computational cost. In addition, Shen et al. (2017) incorporate latent variables expressing emotion into the dialogue system but do not provide an explicit way to control the sentiment of the responses.
Recently, Zhou and Wang (2018) collected a large corpus of tweets from Twitter with emojis in the response, under the assumption that these emojis reflect the sentiment of the response. They then trained a conditional variational autoencoder (CVAE)-based neural dialogue system that is capable of explicitly controlling the sentiment of the generated response. In this work, we investigate the application of another powerful model, generative adversarial networks (GANs), to this problem.
In this paper, we propose a conditional generative adversarial network (CGAN)-based framework for sentiment-controlled dialogue generation. In this framework, the desiderata of fluency and controllability are explicitly enforced by creating a model with two subcomponents: a generator and a discriminator. The generator is in charge of generating sentimental responses given a dialogue history and a sentiment label, while the adversarial discriminator enforces sentimental response quality by trying to determine whether a triple (dialogue history, sentiment label, dialogue response) comes from the real data distribution. By training the generator to fool the discriminator, our system can simultaneously improve the quality of dialogue responses and generate responses with different sentiments depending on the sentiment label.
Our task is to train a dialogue system which is able to generate high-quality and sentiment-controlled responses. Given the dialogue history $x = (x_1, \dots, x_n)$, where $x_i$ is the $i$-th token and $n$ is the length of the sequence, and a sentiment label $s$, the task is to generate a response $y = (y_1, \dots, y_m)$, where $m$ is the length of the generated response. We would like this response to be consistent with the sentiment label $s$ and semantically appropriate for the dialogue history $x$.
Encoder-Decoder for Dialog Generation
The first technology contributing to the proposed method is the encoder-decoder model, in which an encoder reads in the previous dialog history/context and encodes it into a continuous vector representation, which the decoder then uses to output the next dialogue utterance.
Specifically, given the dialogue history $x = (x_1, \dots, x_n)$, the hidden state of the encoder at time $t$, $h_t$, is computed according to:
$$h_t = f_{\text{enc}}(h_{t-1}, x_t) \tag{1}$$
where $f_{\text{enc}}$ is the encoder RNN. Finally, we obtain the vector representation of $x$, i.e., the context vector $c$.
The decoder is another RNN which is capable of generating a response $y = (y_1, \dots, y_m)$ given the context vector $c$. The hidden state of the decoder at time step $t$, $h'_t$, is calculated by:
$$h'_t = f_{\text{dec}}(h'_{t-1}, y_{t-1}, c) \tag{2}$$
where $f_{\text{dec}}$ is the decoder RNN.
The probability over the whole vocabulary at the $t$-th time step is then calculated by a softmax function conditioned on the hidden state $h'_t$.
By multiplying the probabilities of the gold word tokens at each time step, we can calculate the probability of the response sequence $y$ given the dialogue history sequence $x$, i.e., $p(y \mid x) = \prod_{t=1}^{m} p(y_t \mid y_{<t}, x)$.
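As a minimal sketch of this computation, the snippet below scores a response by summing the log-probabilities of its gold tokens, which is the product above taken in log space. The decoder scores, the vocabulary size, and the function names are hypothetical and chosen only for illustration; they are not taken from the model described here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of unnormalized scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_log_prob(step_logits, gold_ids):
    """Sum the log-probabilities of the gold token at each decoding step."""
    log_p = 0.0
    for logits, gold in zip(step_logits, gold_ids):
        log_p += math.log(softmax(logits)[gold])
    return log_p

# Hypothetical decoder scores over a 3-word vocabulary for a 2-token response.
step_logits = [[2.0, 0.5, 0.1], [0.2, 0.2, 3.0]]
gold_ids = [0, 2]
log_p = sequence_log_prob(step_logits, gold_ids)
prob = math.exp(log_p)  # probability of the whole response
```

Working in log space avoids numerical underflow for long responses, which is why training code usually minimizes the negative log-likelihood rather than maximizing the raw product.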
Generative Adversarial Nets
The second important technology contributing to the proposed method is Generative Adversarial Networks (GANs) [Goodfellow et al.2014].
In the original GAN framework, there are two models: a generative model $G$, which is in charge of generating outputs (one example being the SEQ2SEQ model in the previous section), and a discriminative model $D$ that attempts to discriminate whether its input samples are real or generated outputs. By training $G$ to create outputs that are able to fool $D$ into thinking that they are real, it is possible to generate samples that seem highly realistic, improving the quality of generation of images [Salimans et al.2016] and text [Li et al.2017, Yu et al.2017].
However, one major problem in the GAN framework is that there is no mechanism to control attributes of generated items. Therefore, Mirza and Osindero (2014) propose conditional generative adversarial nets (CGANs), in which both the generator and the discriminator are conditioned on some extra information so that the generator can control the types of items being generated according to this extra information.
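For reference, the conditional two-player objective can be written as below. This is a sketch of the standard CGAN minimax formulation rather than a formula taken from this paper; the symbols $u$ (a real data sample), $k$ (the conditioning information), and $z$ (the generator's noise input) are chosen here for illustration only.

```latex
\min_{G} \max_{D} \; V(D, G) =
  \mathbb{E}_{u \sim p_{\text{data}}(u)} \big[ \log D(u \mid k) \big]
  + \mathbb{E}_{z \sim p_{z}(z)} \big[ \log \big( 1 - D(G(z \mid k) \mid k) \big) \big]
```

Dropping the condition $k$ from both players recovers the unconditional GAN objective.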
Back to the dialogue scenario, Li et al. (2017) propose an adversarial dialogue generation model, in which the generative model $G$ is a standard SEQ2SEQ model [Sutskever, Vinyals, and Le2014] that generates a response $y$ given the dialogue history $x$ according to Eq. 1 and Eq. 2. The discriminative model $D$ is a binary classifier which takes the dialogue history $x$ and a dialogue response $y$ as input and outputs a label indicating whether the dialogue response was generated by a machine or by a human.
In more detail, the objective of $G$ is to maximize the expected reward, i.e., the discriminator's estimate $Q_+(\{x, y\})$ that the response is human-generated, of generated responses $y$:
$$J(\theta) = \mathbb{E}_{y \sim p_\theta(y \mid x)}\big[Q_+(\{x, y\})\big]$$
Another popular deep generative model is the variational autoencoder (VAE). VAEs have been successfully applied to many text generation tasks [Bowman et al.2015, Serban et al.2017, Zhou and Neubig2017, Hu et al.2017]. Specifically, Serban et al. (2017) augment the dialogue generation model by introducing a latent variable at the decoder. Shen et al. (2017) and Zhou and Wang (2018) present CVAE-based frameworks for dialogue generation in which the response is generated from a stochastic latent variable $z$ and the context vector $c$. Mathematically, a CVAE-based dialogue generation system maximizes a variational lower bound on the conditional likelihood of the response $y$ given the latent variable $z$ and the context vector $c$.
In this section, we build upon the SEQ2SEQ-based dialogue generation model with the CVAE and CGAN techniques introduced in the previous section.
In order to explicitly control the sentiment of the generated response, we slightly change the structure of the standard SEQ2SEQ model [Sutskever, Vinyals, and Le2014]. Specifically, after obtaining the context vector $c$ of the dialogue history $x$ from the encoder by Eq. 1, the concatenation of a sentiment vector $v_s$ and $c$ is fed into the decoder to generate a response. We call this concatenated vector the “sentiment context”, $\tilde{c} = [v_s; c]$.
To generate the sentiment vector $v_s$, similarly to word embedding, we first map the sentiment label $s$ to a vector $e_s$, and this vector is then fed into a fully-connected neural network to output the sentiment vector.
The computation graph of the Sentiment-Context SEQ2SEQ model is shown in Figure 1.
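The construction of the sentiment context can be sketched in a few lines. This is an illustrative toy under stated assumptions: the embedding table, the layer weights, the vector sizes, and the function names are all hypothetical, and a single linear layer without a nonlinearity stands in for the fully-connected network.

```python
def linear(x, weights, bias):
    """Fully-connected layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def sentiment_context(context, label, embeddings, weights, bias):
    """Embed the label, project it, and concatenate it with the context vector."""
    sentiment_vec = linear(embeddings[label], weights, bias)
    return sentiment_vec + context  # list concatenation

# Hypothetical 2-d label embeddings and a 2 -> 3 projection.
embeddings = {"positive": [1.0, 0.0], "negative": [0.0, 1.0]}
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.5]
context = [0.3, -0.2]  # stand-in for the encoder's context vector

ctx_pos = sentiment_context(context, "positive", embeddings, weights, bias)
ctx_neg = sentiment_context(context, "negative", embeddings, weights, bias)
```

The same dialogue history yields two different decoder inputs depending only on the label, which is what lets the decoder produce different responses for different requested sentiments.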
Conditional Variational Autoencoders (CVAEs) SEQ2SEQ
We follow the model structure described in Sohn, Lee, and Yan (2015) and Zhou and Wang (2018) to build the CVAE-SEQ2SEQ model.
Mathematically, the objective of CVAE-SEQ2SEQ is to maximize a lower bound on the probability of the response $y$ given the sentiment context vector $\tilde{c}$, i.e.,
$$p(y \mid \tilde{c}) = \int p(y \mid z, \tilde{c})\, p(z \mid \tilde{c})\, dz$$
where $z$ is the latent variable and $\tilde{c}$ is the sentiment context vector mentioned before. Based on the assumption that the latent variable follows a multivariate Gaussian distribution with a diagonal covariance matrix, the lower bound of $\log p(y \mid \tilde{c})$ is:
$$\mathcal{L}(y, \tilde{c}) = -\mathrm{KL}\big(q(z \mid y, \tilde{c}) \,\|\, p(z \mid \tilde{c})\big) + \mathbb{E}_{q(z \mid y, \tilde{c})}\big[\log p(y \mid z, \tilde{c})\big] \tag{5}$$
where $p(y \mid z, \tilde{c})$ is modeled by another decoder which is different from the decoder in the SEQ2SEQ model, $q(z \mid y, \tilde{c})$ is described by a recognition network, and $p(z \mid \tilde{c})$ is modeled by a prior network, both of which are MLP-based neural networks.
In more detail, for the encoder RNN, CVAE-SEQ2SEQ uses the same setting as the SEQ2SEQ model to encode the dialogue history into a context vector. The decoder RNN, however, is different because it now takes the concatenation of the sentiment context vector and the sampled stochastic latent variable as input to generate a response. At training time, the latent variable sample is drawn from the approximate posterior (recognition) network and used for optimizing the variational lower bound given by Eq. 5. At test time, the latent variable sample is drawn from the prior network, which has no knowledge of the ground-truth response, and is used for decoding. Furthermore, the bag-of-words loss [Zhao, Zhao, and Eskenazi2017] is added to the above objective function. Therefore, the final objective function for the CVAE-SEQ2SEQ is:
$$\mathcal{L}'(y, \tilde{c}) = \mathcal{L}(y, \tilde{c}) + \mathbb{E}_{q(z \mid y, \tilde{c})}\big[\log p(y_{\text{bow}} \mid z, \tilde{c})\big]$$
where $y_{\text{bow}}$ denotes the bag-of-words representation of the response.
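Under the diagonal-Gaussian assumption, the KL term between the recognition and prior distributions has a closed form, which the sketch below evaluates. The function names and the toy numbers are hypothetical; the way the three terms are combined into a loss follows the usual CVAE convention with a bag-of-words term and is an assumption, not a transcription of the original implementation.

```python
import math

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) for two diagonal Gaussians given means and log-variances."""
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0
    return 0.5 * total

def cvae_loss(recon_log_prob, kl, bow_log_prob=0.0):
    """Negative lower bound: minimizing it maximizes the bound."""
    return -(recon_log_prob + bow_log_prob) + kl

# Hypothetical 2-d recognition (q) and prior (p) outputs.
kl = gaussian_kl([1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0])
loss = cvae_loss(recon_log_prob=-3.2, kl=kl, bow_log_prob=-1.1)
```

The KL term is zero when the recognition and prior networks agree, so it only penalizes information that the latent variable takes from the gold response beyond what the prior already predicts.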
Conditional Generative Adversarial Net SEQ2SEQ
Adversarial training methods have been successfully applied to neural dialogue generation [Li et al.2017] to improve the quality of generated responses. However, in their model, properties of the response such as its sentiment cannot be controlled explicitly. Therefore, we propose a conditional generative adversarial network-based dialogue system named CGAN-SEQ2SEQ, which is able to improve the quality of the response and control its sentiment at the same time. The model is shown in Figure 2.
Our proposed CGAN-SEQ2SEQ consists of two components, i.e., a conditional generator $G$ and a conditional discriminator $D$.
Generator $G$. We adopt the original sentiment-context SEQ2SEQ model as $G$, which generates a response $y$ given the dialogue history $x$ and a sentiment label $s$. The goal of $G$ is to produce high-quality, sentiment-controlled responses that are as similar as possible to those written by humans, so as to fool the discriminator $D$.
Discriminator $D$. The discriminator in our framework identifies whether the input response was generated by a human or by a machine, given the dialogue history $x$ and the sentiment label $s$. Specifically, the discriminator consists of two encoders. The first encoder is similar to that in the SEQ2SEQ model: it encodes the input dialogue history into a representation vector, which is concatenated with the sentiment vector to compose the sentiment context vector. The initial state of the second encoder is set to the sentiment context vector, allowing this encoder to condition on $x$ and $s$; it then encodes the response sequence into a representation vector. Finally, the concatenation of this vector and the sentiment context vector is fed into a fully-connected neural network-based binary classifier to compute the final result. The reason why we use the sentiment context vector again is to let $D$ pay more attention to the sentiment information. The computation graph of $D$ is illustrated in Figure 3.
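The final classification step of the discriminator can be sketched as follows. This is a deliberately reduced illustration: the two encoders are replaced by precomputed vectors, and the fully-connected classifier is assumed to be a single linear layer followed by a sigmoid. The vector sizes, weights, and function name are hypothetical.

```python
import math

def discriminator_score(response_vec, sentiment_context_vec, weights, bias):
    """Probability that the response is human-written, from concatenated features."""
    features = response_vec + sentiment_context_vec  # list concatenation
    logit = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical encoder outputs and classifier parameters.
response_vec = [0.4, -0.1]
sentiment_context_vec = [1.0, 0.0, 0.3]
weights = [0.5, -0.2, 0.8, 0.1, -0.4]
score = discriminator_score(response_vec, sentiment_context_vec, weights, bias=0.0)
```

Because the sentiment context vector appears both as the second encoder's initial state and directly in the classifier input, a response whose sentiment contradicts the label can lower the score even if it is otherwise fluent.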
A Game with Two Players. Following the training process of Li et al. (2017), we first pre-train a generator without the discriminator, then freeze the parameters of the pre-trained generator to pre-train the discriminator. During pre-training of the discriminator, responses generated by the pre-trained generator and by humans are regarded as negative and positive samples, respectively. Finally, the generator and the discriminator play a two-player game. In this game, the generator first generates a response $\hat{y}$ given the dialogue history $x$ and the sentiment label $s$; the discriminator then provides a reward $Q_+(\{x, s, \hat{y}\})$ back to the generator and uses the triples $(x, s, y)$ and $(x, s, \hat{y})$, where $y$ is the human response, to train itself. The generator is optimized according to the reward obtained from the discriminator.
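How the discriminator's training examples are assembled can be sketched as below. The function name, the tuple layout, and the 1/0 label encoding are assumptions made for illustration; only the pairing of a human triple with a generated triple for the same history and label comes from the description above.

```python
def discriminator_examples(history, sentiment, human_response, generated_response):
    """Build one positive and one negative training example for the discriminator."""
    positive = ((history, sentiment, human_response), 1)      # written by a human
    negative = ((history, sentiment, generated_response), 0)  # sampled from the generator
    return [positive, negative]

# Hypothetical tokenized inputs.
batch = discriminator_examples(
    history=["lets", "bowl"],
    sentiment="positive",
    human_response=["count", "me", "in"],
    generated_response=["i", "am", "so", "happy"],
)
```

Both examples share the same dialogue history and sentiment label, so the discriminator can only separate them by looking at the response itself in that context.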
Policy Gradient Training. Similarly to Li et al. (2017), the generator in our proposed framework is a probabilistic transformation from the dialogue history to the dialogue response, both in discrete space. Therefore, we also employ the REINFORCE algorithm [Williams1992] to optimize it.
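The objective of the generator is to maximize the expected reward of generated responses $\hat{y}$:
$$J(\theta) = \mathbb{E}_{\hat{y} \sim p_\theta(\hat{y} \mid x, s)}\big[Q_+(\{x, s, \hat{y}\})\big] \tag{7}$$
Note that $Q_+(\{x, s, \hat{y}\})$ can be regarded as the probability of the response being generated by a human given the dialogue history $x$ and the sentiment label $s$. The gradient with respect to the generator parameters $\theta$ in Eq. 7 can be approximately computed by the likelihood ratio trick [Williams1992]:
$$\nabla_\theta J(\theta) \approx Q_+(\{x, s, \hat{y}\})\, \nabla_\theta \log p_\theta(\hat{y} \mid x, s)$$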
whereNg, Harada, and Russell1999].
Intuitively, the more likely the generated response is to fool $D$, the larger the reward $G$ receives, and thus the parameters are updated with a larger step.
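The effect of the reward on the update size can be seen in a toy REINFORCE step. The sketch below is an assumption-laden simplification: the policy is a single softmax over three actions rather than a sequence model, the update is applied directly to the logits, and the `baseline` argument reflects common REINFORCE practice rather than a detail confirmed by the text. All names and values are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of unnormalized scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, action, reward, baseline, lr):
    """One gradient-ascent step on (reward - baseline) * log p(action)."""
    probs = softmax(logits)
    advantage = reward - baseline
    # d log p(action) / d logit_i = 1[i == action] - p_i
    return [
        logit + lr * advantage * ((1.0 if i == action else 0.0) - p)
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]

start = [0.0, 0.0, 0.0]
# A sampled action that the discriminator scores highly becomes more likely.
rewarded = reinforce_step(start, action=1, reward=0.9, baseline=0.5, lr=1.0)
# The same action with a low score becomes less likely.
penalized = reinforce_step(start, action=1, reward=0.1, baseline=0.5, lr=1.0)
```

The step size scales with the gap between the reward and the baseline, so samples that clearly fool the discriminator move the policy more than borderline ones.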
One advantage of CGAN-SEQ2SEQ over CVAE-SEQ2SEQ is that it does not change the structure of the SEQ2SEQ model. During response generation, the discriminator can be removed and the SEQ2SEQ model remains the same.
From Figure 2, it is easy to see that the generator could also be the CVAE-SEQ2SEQ model. Therefore, we propose a CGAN-CVAE SEQ2SEQ model, in which the generator is the CVAE-SEQ2SEQ model and the discriminator stays the same as that in the CGAN-SEQ2SEQ model. Intuitively, from the reinforcement learning perspective, the discriminator is regarded as the reward provider, and for a high-quality generated response it assigns a higher reward back to the generator.
To evaluate our proposed framework, we use the large corpus of tweets with emojis collected and used in Zhou and Wang (2018). To simplify the task, we classify all emojis into two clusters, i.e., positive and negative. As a result, there are approximately 374, 21 and 21 tweets in train, dev and test sets respectively. The ratio of positive to negative samples is around 3:1.
Perplexity: Perplexity is a common metric used in many natural language tasks and connected to the likelihood of the gold response given a dialogue generation model. Although the diversity of responses generated by the dialogue system is very important, the system should nonetheless assign a relatively high likelihood to the ground truth response.
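The link between perplexity and the likelihood of the gold response can be made concrete with a short sketch. It uses the standard definition of perplexity as the exponential of the average negative log-likelihood per token; the function name and the example probabilities are hypothetical.

```python
import math

def perplexity(token_log_probs):
    """exp of the average negative log-probability assigned to the gold tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Hypothetical model that gives every gold token probability 0.25.
uniform_ppl = perplexity([math.log(0.25)] * 4)
# A model that is more confident about the gold tokens gets a lower score.
confident_ppl = perplexity([math.log(0.5)] * 4)
```

A model that spreads its probability evenly over four choices at every step has perplexity 4, so lower values mean the gold response is assigned higher likelihood.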
Sentiment Accuracy: Because the goal of our task is to control the sentiment of the response given a sentiment label and a dialogue history, whether the generated response correctly reflects the sentiment is very important. Therefore, we build a sentiment classifier on the training set and evaluate the generated responses by how often the classifier-predicted label matches the specified sentiment.
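This metric reduces to a simple agreement rate between the classifier's predictions and the requested labels. In the sketch below the function name and the example labels are hypothetical, and the predictions are given directly rather than produced by a trained classifier.

```python
def sentiment_accuracy(predicted, requested):
    """Percentage of generated responses whose predicted label matches the request."""
    matches = sum(p == r for p, r in zip(predicted, requested))
    return 100.0 * matches / len(requested)

# Hypothetical classifier outputs for four generated responses.
requested = ["positive", "negative", "positive", "negative"]
predicted = ["positive", "negative", "negative", "negative"]
acc = sentiment_accuracy(predicted, requested)
```

Since the score depends on an automatic classifier, it inherits that classifier's errors, which is one reason a separate human judgment of sentiment is also useful.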
Human Evaluation: Because automatic metrics are sub-optimal for evaluating performance of dialogue generation systems [Liu et al.2016], we ask three judges to evaluate 30 random items, each of which consists of a dialogue history, a gold response, and a generated response. Judges are expected to evaluate in two settings.
In one setting, the goal is to evaluate the quality of dialogue responses from different models. We use a 1-5 scale, where one end of the scale means that the response and the dialogue history are highly relevant semantically and syntactically, and the other end means they are irrelevant.
In the other setting, judges are asked to label the sentiment of the given responses as positive or negative. In this case only the generated responses are provided to the judges.
Note that these two experiments are conducted separately and the items are different in order to avoid bias.
Sentiment Classifier: Our sentiment classifier is a 1-layer bidirectional GRU encoder with 128 hidden units in each direction. Its output is fed into an MLP classifier to predict the final sentiment class.
We employ a standard SEQ2SEQ [Sutskever, Vinyals, and Le2014] model with attention [Luong, Pham, and Manning2015] to build the sentiment-context SEQ2SEQ model. The encoder is a 1-layer bidirectional GRU with hidden size 128 in each direction, and the dimension of the sentiment vector is 12. The decoder is a 1-layer GRU. The Adam optimizer with a learning rate of 1e-3 and gradients clipped to 5 is employed to train this model.
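Gradient clipping with a threshold of 5 can be sketched as below. Whether the clipping is applied to the global gradient norm or to each value individually is not stated, so clipping by global norm is an assumption here, and the function name and the flat list of gradients are hypothetical simplifications.

```python
import math

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale the gradient vector so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

small = clip_by_global_norm([3.0, 4.0])  # norm 5: left unchanged
large = clip_by_global_norm([6.0, 8.0])  # norm 10: rescaled to norm 5
```

Rescaling the whole vector keeps the update direction intact and only limits its length, which helps when a few samples produce very large gradients.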
Following the experiment settings in Zhou and Wang (2018), we incorporate a response encoder, a recognition network and a prior network into the above SEQ2SEQ model to build a CVAE-SEQ2SEQ model. The response encoder is another 1-layer bidirectional GRU of size 128 in each direction. The mean and log variance of the latent variable are obtained from the recognition and the prior network, both of which are two fully-connected networks; latent variables are then sampled via the reparameterization trick [Kingma and Welling2013]. During generation without gold responses, the latent variable sampled from the prior network is fed directly into the decoder.
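The reparameterization trick mentioned above can be sketched as follows: the sample is written as a deterministic function of the predicted mean and log variance plus independent standard normal noise. The function name, the latent size, and the example values are hypothetical, and plain Python lists stand in for tensors.

```python
import math
import random

def sample_latent(mean, logvar, rng):
    """Draw z = mean + sigma * eps with eps ~ N(0, 1) and sigma = exp(0.5 * logvar)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, logvar)]

# Hypothetical outputs of the prior network for a 2-d latent variable.
prior_mean = [0.5, -1.0]
prior_logvar = [0.0, -2.0]
z = sample_latent(prior_mean, prior_logvar, random.Random(0))
```

Because the randomness is isolated in the noise term, gradients can flow through the mean and log variance, which is what allows the recognition and prior networks to be trained with ordinary backpropagation.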
Capacity for Sentiment Control: The sentiment control capacity of each model is evaluated by the sentiment accuracy metric. As shown in Table 1, the CGAN-CVAE SEQ2SEQ model outperforms all the other models in sentiment accuracy, indicating that, by combining CGANs and CVAEs, the generator can control the sentiment of the response more effectively than the respective baselines. Although the sentiment accuracy of the CGAN-SEQ2SEQ model is better than that of the SEQ2SEQ model, it cannot control the sentiment of the response as well as the CVAE-SEQ2SEQ model. We suspect that this is because during REINFORCE training the generator can only access the generated sentences, which act as noise that degrades the generator if they are of low quality. We have found that the responses from the pre-trained generator are indeed generic and do not control the sentiment information well. In contrast, the CVAE-SEQ2SEQ model can utilize the gold response at every training step.
Response Quality: We employ perplexity (PPL), shown in Table 1, as a proxy to evaluate response quality. Compared with the other models, the CGAN-CVAE SEQ2SEQ model achieves the lowest PPL, which means that it assigns the highest likelihood to the gold response. As with sentiment accuracy, the PPL of the CGAN-SEQ2SEQ model is higher than that of the CVAE-SEQ2SEQ model, and we attribute this to the same reason mentioned above.
The human evaluation results are shown in Table 2. With respect to both content quality and sentiment accuracy, the CGAN-CVAE model performs better than the other models. This demonstrates that our proposed CGAN-CVAE model can not only generate high-quality dialogue responses but also effectively control the sentiment of dialogue responses, which is consistent with the automatic evaluation results. The overall performance of the CVAE model is also better than that of the CGAN model.
To show the differences between these models more concretely, we give some examples in Table 3. We can clearly see that the responses generated by the CGAN-CVAE SEQ2SEQ and CVAE-SEQ2SEQ models are more distinctive given different sentiment labels and contain topics related to the dialogue context. For CGAN-SEQ2SEQ, the sentiment of the response is relatively consistent with the sentiment label, but compared with CVAE-SEQ2SEQ, the diversity of responses is relatively low. As for the SEQ2SEQ model, it seems that it only remembers some sentimental words, and the responses are quite dull and generic.
|Perplexity||Sentiment Acc (%)|
Table 3: Example responses from each model given positive and negative sentiment labels.

Context: goldlink is dope live one of my favorite shows i ’ve been to

|Model|Positive|Negative|
|---|---|---|
|SEQ2SEQ|i ’m so happy for you|i ’m so sad|
|CGAN|i ’m gonna be there|i ’m not sure i ’m gonna be able to find it|
|CVAE|i like the song|i feel like i was gonna cry|
|CVAE-CGAN|omg i love it|that ’s the worst|

Context: and i never got lol

|Model|Positive|Negative|
|---|---|---|
|SEQ2SEQ|i ’m so excited|i ’m so sorry|
|CGAN|he ’s so cute|i ’m not sure i ’m not going to be a fan of|
|CVAE|i love to hear that ! i ’m so happy to hear this|well , didn ’t realize you had to get the wrong name|
|CVAE-CGAN|lmao i ’ m looking for it|i ’m sorry for you|

Context: always got ya my dude no matter what ! lets bowl

|Model|Positive|Negative|
|---|---|---|
|SEQ2SEQ|i’m so mad|i ’m not sure if it ’s a good idea|
|CGAN|i ’m glad you ’re enjoying it|i ’m not sure if you ’re joking|
|CVAE|we are doing a great job !|wow . i hate you|
|CVAE-CGAN|yes i love you guys , but it is a good time|i mean i hate my bestfriend|
Sentiment is crucial to human-human communication, and thus for machines to communicate smoothly with humans, it is necessary for machines to generate utterances with sentiment.
Skowron et al. (2011) propose that affective profiles in a dialogue system are strongly correlated with the emotional changes experienced by participants. Partala and Surakka (2004) show that positive affective intervention could be especially useful to enhance users’ problem solving performance. Prendinger and Ishizuka (2005) indicate that a computer agent with empathic feedback could support people preparing for an interview. Polzin and Waibel (2000) describe how a dialogue system adjusts its interaction according to the emotion of users.
While these works are valuable proofs-of-concept, they generally either focus on small-scale corpora or use rule-based templates to generate responses, which makes them difficult to extend to large-scale open-domain dialogue generation.
Some recent works have tried to incorporate sentiment in large-scale conversation generation. Hasegawa et al. (2013) study how utterances affect the emotion of other speakers, and try to predict the emotion of the user to generate a response which reflects the specific emotion using the statistical machine translation framework [Ritter, Cherry, and Dolan2011]. Within the framework of neural dialogue systems, pioneering work by Shen et al. (2017) incorporates sentiment into the variational hierarchical encoder-decoder model [Serban et al.2017]. However, their model aims to improve the quality of the response rather than to control its sentiment. In parallel, Zhou et al. (2017) model sentiment in a dialogue system through three mechanisms: emotion category embedding, internal emotion memory, and external memory. The internal memory models the change of the internal emotion state of the decoder, and therefore encodes how much an emotion has already been expressed. The external memory decides whether to choose an emotional or generic (non-emotional) word at a given step during decoding.
In this paper, we propose an intuitive training objective for neural dialogue generation, which is to control the sentiment of the generated response explicitly. This objective is implemented through a conditional adversarial training paradigm, in which the generator is trained to generate sentiment-controlled responses via sentiment labels, assisted by a discriminator. Furthermore, the generator in our system can be either the standard SEQ2SEQ model or the CVAE-based SEQ2SEQ model. Our system adopts a policy gradient algorithm to deal with the optimization challenge posed by discrete generator outputs. Experiments demonstrate the effectiveness of such an adversarial training objective in controlling the sentiment of the response and improving the content quality.
Future directions include validating our approach on more fine-grained sentiment-based data and combining it more effectively with other advanced techniques in reinforcement learning and adversarial learning, such as reward shaping.
- [Bowman et al.2015] Bowman, S. R.; Vilnis, L.; Vinyals, O.; Dai, A. M.; Jozefowicz, R.; and Bengio, S. 2015. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349.
- [Cho et al.2014] Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
- [Goodfellow et al.2014] Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In NIPS, 2672–2680.
- [Gu et al.2016] Gu, J.; Lu, Z.; Li, H.; and Li, V. O. 2016. Incorporating copying mechanism in sequence-to-sequence learning. arXiv preprint arXiv:1603.06393.
- [Hasegawa et al.2013] Hasegawa, T.; Kaji, N.; Yoshinaga, N.; and Toyoda, M. 2013. Predicting and eliciting addressee’s emotion in online dialogue. In Proceedings of the 51st ACL, 964–972.
- [Hu et al.2017] Hu, Z.; Yang, Z.; Liang, X.; Salakhutdinov, R.; and Xing, E. P. 2017. Toward controlled generation of text. In International Conference on Machine Learning, 1587–1596.
- [Kingma and Welling2013] Kingma, D. P., and Welling, M. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
- [Li et al.2015] Li, J.; Galley, M.; Brockett, C.; Gao, J.; and Dolan, B. 2015. A diversity-promoting objective function for neural conversation models. arXiv preprint arXiv:1510.03055.
- [Li et al.2016] Li, J.; Monroe, W.; Ritter, A.; Galley, M.; Gao, J.; and Jurafsky, D. 2016. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541.
- [Li et al.2017] Li, J.; Monroe, W.; Shi, T.; Ritter, A.; and Jurafsky, D. 2017. Adversarial learning for neural dialogue generation. arXiv preprint arXiv:1701.06547.
- [Liu et al.2016] Liu, C.-W.; Lowe, R.; Serban, I. V.; Noseworthy, M.; Charlin, L.; and Pineau, J. 2016. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv preprint arXiv:1603.08023.
- [Luong, Pham, and Manning2015] Luong, M.-T.; Pham, H.; and Manning, C. D. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
- [Mirza and Osindero2014] Mirza, M., and Osindero, S. 2014. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
- [Ng, Harada, and Russell1999] Ng, A. Y.; Harada, D.; and Russell, S. 1999. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, 278–287.
- [Partala and Surakka2004] Partala, T., and Surakka, V. 2004. The effects of affective interventions in human–computer interaction. IWC.
- [Polzin and Waibel2000] Polzin, T. S., and Waibel, A. 2000. Emotion-sensitive human-computer interfaces. In ITRW.
- [Prendinger and Ishizuka2005] Prendinger, H., and Ishizuka, M. 2005. The empathic companion: A character-based interface that addresses users’ affective states. Applied Artificial Intelligence 19(3-4):267–285.
- [Ritter, Cherry, and Dolan2011] Ritter, A.; Cherry, C.; and Dolan, W. B. 2011. Data-driven response generation in social media. In Proceedings of the conference on EMNLP, 583–593. ACL.
- [Salimans et al.2016] Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; and Chen, X. 2016. Improved techniques for training gans. In NIPS, 2234–2242.
- [Serban et al.2016] Serban, I. V.; Sordoni, A.; Bengio, Y.; Courville, A. C.; and Pineau, J. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI, 3776–3784.
- [Serban et al.2017] Serban, I. V.; Sordoni, A.; Lowe, R.; Charlin, L.; Pineau, J.; Courville, A. C.; and Bengio, Y. 2017. A hierarchical latent variable encoder-decoder model for generating dialogues. In AAAI, 3295–3301.
- [Shang, Lu, and Li2015] Shang, L.; Lu, Z.; and Li, H. 2015. Neural responding machine for short-text conversation. arXiv preprint arXiv:1503.02364.
- [Shen et al.2017] Shen, X.; Su, H.; Li, Y.; Li, W.; Niu, S.; Zhao, Y.; Aizawa, A.; and Long, G. 2017. A conditional variational framework for dialog generation. arXiv preprint arXiv:1705.00316.
- [Skowron et al.2011] Skowron, M.; Rank, S.; Theunis, M.; and Sienkiewicz, J. 2011. The good, the bad and the neutral: affective profile in dialog system-user communication. In ACII, 337–346. Springer.
- [Sohn, Lee, and Yan2015] Sohn, K.; Lee, H.; and Yan, X. 2015. Learning structured output representation using deep conditional generative models. In NIPS, 3483–3491.
- [Sutskever, Vinyals, and Le2014] Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In NIPS, 3104–3112.
- [Vinyals and Le2015] Vinyals, O., and Le, Q. 2015. A neural conversational model. arXiv preprint arXiv:1506.05869.
- [Williams1992] Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8(3-4):229–256.
- [Xing et al.2016] Xing, C.; Wu, W.; Wu, Y.; Liu, J.; Huang, Y.; Zhou, M.; and Ma, W.-Y. 2016. Topic augmented neural response generation with a joint attention mechanism. arXiv preprint arXiv:1606.08340.
- [Yu et al.2017] Yu, L.; Zhang, W.; Wang, J.; and Yu, Y. 2017. Seqgan: Sequence generative adversarial nets with policy gradient. In AAAI, 2852–2858.
- [Zhao, Zhao, and Eskenazi2017] Zhao, T.; Zhao, R.; and Eskenazi, M. 2017. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. arXiv preprint arXiv:1703.10960.
- [Zhou and Neubig2017] Zhou, C., and Neubig, G. 2017. Multi-space variational encoder-decoders for semi-supervised labeled sequence transduction. arXiv preprint arXiv:1704.01691.
- [Zhou and Wang2018] Zhou, X., and Wang, W. Y. 2018. Mojitalk: Generating emotional responses at scale. In Proceedings of the 56th ACL. Melbourne, Victoria, Australia: ACL.
- [Zhou et al.2017] Zhou, H.; Huang, M.; Zhang, T.; Zhu, X.; and Liu, B. 2017. Emotional chatting machine: Emotional conversation generation with internal and external memory. arXiv preprint arXiv:1704.01074.