In recent years, with the development of artificial intelligence and robotics, affective computing has become increasingly critical in the research on human-computer interaction. Artificial intelligence with both emotion and intelligence has higher practical value and significance [1, 2]. To achieve accurate artificial intelligence, it is necessary to facilitate natural human-computer interactions that integrate intelligence and emotion.
In addition to visual, speech and other forms of expression, text is a basic and essential mode of emotion expression and is widely used in daily life. Affective computing for text includes text emotion recognition and emotional text generation. While text emotion recognition has been studied extensively, generating emotional text remains very challenging: it is difficult to express emotion naturally and coherently because grammaticality and expressiveness must be balanced. Existing emotional text generation mostly relies on rule-based methods in task-oriented applications, which limits the domain adaptability and scalability of the models. In recent years, most research efforts have focused on improving the quality of conversational content (e.g., fluency, diversity) [4, 5] while ignoring the generation of fine-grained emotional factors in text. Earlier work first introduced emotion into a neural network language model and showed that the resulting emotional sentences outperform sentences generated by traditional models that do not consider emotion. Other researchers have used reinforcement learning methods to generate emotional text [8, 9]. A further line of work used reinforcement learning to minimize penalty terms, which strengthened the constraints on the text emotions and enhanced the emotional factors in the text.
There are two shortcomings in past work. First, a whole sentence is generated from left to right, which is not entirely in line with how humans habitually express themselves in natural language. This approach also limits the variety and fluency of the generated text [11, 12]. Second, existing models are not able to fully consider the emotional elements contained in a conversation. The emotion of replies is uncontrollable, and in some cases, emotion is undetectable.
Given the above deficiencies, this paper proposes an emotional editing constraint conversation content generation model based on reinforcement learning. This paper makes the following contributions:
The proposed emotional editor can select the template sentence based on the topic and emotion and further optimize the generation of replies. The generated replies have more accurate emotion.
The proposed model comprehensively constrains the generation of replies in three aspects (coherence, topic, and emotion) by introducing reinforcement learning.
The proposed model introduces a multi-task learning method to improve model performance and to learn the coherence, topic, and emotion of a reply jointly, so that these indicators can coordinate with and constrain each other.
The experiments show that the proposed model is better than previous models that consider only one-sided factors, including the fluency of replies, the relevance of emotion and the relevance of the topic.
II Related Work
Recently, the sequence-to-sequence model for sequence prediction problems, which can be applied to large-scale datasets, has been widely used in machine translation and conversation generation. Later, a large number of variant models were proposed that focus on improving the quality of text in terms of grammar and sentence patterns, for example by increasing the diversity of generated text or by introducing additional prior knowledge to generate more meaningful text [4, 20].
The work in [14, 15] verifies that machines that can generate meaningful and emotional replies enhance users' satisfaction and lead to smarter interaction. However, in the above work, emotional factors are rarely considered in the text generation process. One approach introduced an emotion category vector and two storage mechanisms to generate replies with the corresponding emotions, which improved the quality of the replies compared with earlier models. Another approach introduced topic information and emotional information: emotional keywords and topic keywords were predicted to guide the generation of replies so that the replies have higher topic relevance and emotional relevance.
Unsupervised text generation is also an important research field of natural language processing. Previous researchers combined the traditional sequence-to-sequence model with reinforcement learning and proposed a model that uses information flow, semantic coherence, and ease of answering as the rewards, which improved the quality of text. In addition, the generative adversarial network (GAN) is a novel unsupervised generation model that is similar in nature to reinforcement learning and has many applications in text generation. Reinforcement learning has been introduced to address the weakness that the GAN is not differentiable on discrete sequence data. Monte Carlo search has also been used to calculate the penalty term in the generation process, with the expectation of the overall penalty term minimized as the objective function, which strengthened the emotional constraint in the generation process.
The above methods use reinforcement learning strategies and traditional neural networks to generate emotional conversation, but two shortcomings remain. First, because natural language is a high-level semantic encoding, it is difficult to find perfect objective indicators to measure it. Second, these works are unable to effectively mine the emotional elements in conversation. The emotional strength of generated replies is uncontrollable and inconspicuous. It is difficult to give full play to the role of emotion in conversation, and the resulting replies appear blunt and rigid. The lexical, syntactic, grammatical and other information related to emotional factors is not considered.
III Emotional Conversation Generation Model
This section describes the proposed emotional conversation generation model in detail. We use x to denote a post input by the external environment, to which the agent responds with a reply. The components of the model (states, actions, reward, etc.) are summarized in the following subsections. Due to the length limit, we provide the model details in the supplementary file.
III-A Action in Model
An action is the process of generating a reply to an input post. The action space is infinite since arbitrary-length sequences can be generated.
III-A1 The Overview of Action
Given post x, the encoder is utilized to obtain the encoded vector. After that, the process of generation consists of the following four steps:
Step I: The structure predictor is first used to predict whether an emotional keyword or topic keyword needs to be included in the reply and to predict the positional relationship between them.
Step II: Based on the result of Step I, a keyword predictor is used to generate corresponding keywords, and these keywords are used as prior knowledge to guide the generation of replies.
Step III: After the keywords are generated, the asynchronous decoder is used to generate the reply. The model considers two cases. When only one keyword exists, an asynchronous decoder similar to that of previous work is used to generate the reply. When the reply requires two keywords, the reply is divided into three clauses with the keywords as the boundaries. The decoder generates these three clauses in turn, and the three clauses are then combined into a complete reply according to the positional relationship.
Step IV: A template sentence is selected from the training set based on the emotional keyword and the topic keyword, and then, the template sentence is used to generate a corresponding emotion editing vector. The template and the emotion editing vector are used to edit and optimize the reply generated in Step III, thereby further improving the emotional accuracy and content quality of the reply.
III-A2 Keyword Predictor
The main role of the keyword predictor is to predict which keywords should appear in the reply. The emotion dictionary and the topic dictionary are the same as those used in previous work. We first use a pre-trained LDA model to analyse the input post and predict the topic category of the reply; the emotion category is specified manually as one of the seven emotion categories. To integrate this prior knowledge into the process, we combine the sum of the encoder hidden states with the category embedding and predict, from this combined representation, the emotion keyword and the topic keyword that are expected to appear in the reply.
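As an illustration of the prediction step described above, the following is a minimal sketch, not the authors' implementation. It assumes that "combining" means concatenating the summed hidden states with the category embedding, and that a single linear projection followed by a softmax scores the dictionary words; the names `predict_keyword`, `weights` and `dictionary` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_keyword(hidden_states, category_embedding, weights, dictionary):
    """Pick the most probable dictionary word for one category.

    hidden_states: encoder hidden-state vectors, one per post token
    category_embedding: embedding of the emotion or topic category
    weights: hypothetical projection matrix, one row per dictionary word
    dictionary: candidate keywords (emotion or topic dictionary)
    """
    dim = len(hidden_states[0])
    # Sum the hidden states over all time steps.
    summed = [sum(h[i] for h in hidden_states) for i in range(dim)]
    # Assumption: "combine" is realised as concatenation.
    features = summed + list(category_embedding)
    scores = [sum(w * f for w, f in zip(row, features)) for row in weights]
    probs = softmax(scores)
    best = max(range(len(dictionary)), key=lambda i: probs[i])
    return dictionary[best], probs
```

The same function would be called twice, once with the emotion category embedding and the emotion dictionary and once with the topic category embedding and the topic dictionary.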
III-A3 Asynchronous Decoder
The next step is to generate the reply based on this prior knowledge. When only one keyword is included, we use this keyword as the starting point and then generate the other parts of the reply backward and forward from it. When two keywords are included, several arrangements are possible; to simplify the discussion, we describe one of them in detail, namely the case in which the emotional keyword precedes the topic keyword. The other cases are handled analogously.
Formally, given the input post x, the reply is a word sequence that begins with a start word, ends with a terminator, and contains the emotional keyword followed by the topic keyword. The two keywords split the reply into three clauses: the first clause is the portion between the start word and the emotional keyword, the second clause is the portion between the emotional keyword and the topic keyword, and the third clause is the portion between the topic keyword and the terminator.
As shown in Fig. 1, the entire reply is divided into three clauses, which are generated one after another. The first clause to be generated is decoded between a given starting word and a given ending word; the second is decoded conditioned on the first; and the third is decoded conditioned on both earlier clauses. Then, the clauses and keywords are combined in the previously determined order to get a complete reply. Throughout this process, the decoder conditions on the set of keywords and maintains a separate intermediate state for the decoding of each of the three clauses.
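The final combination step can be sketched as follows. This is only an illustration of how three decoded clauses and two keywords are spliced according to the predicted positional relationship; the function name `assemble_reply` and the `emotion_first` flag are hypothetical, and the clauses are assumed to be given already in left-to-right reading order.

```python
def assemble_reply(first, middle, last, emotion_kw, topic_kw, emotion_first=True):
    """Splice three clauses and two keywords into one token list.

    first, middle, last: clauses as token lists in reading order
    emotion_first: predicted positional relationship of the keywords
    """
    if emotion_first:
        left_kw, right_kw = emotion_kw, topic_kw
    else:
        left_kw, right_kw = topic_kw, emotion_kw
    return first + [left_kw] + middle + [right_kw] + last
```

For example, `assemble_reply(["such", "a"], ["little"], ["!"], "cute", "cake")` yields the tokens of "such a cute little cake !".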
III-A4 Emotional Editor
(1) Picking a template: As shown in Fig. 2, the words with single underlines denote the keywords and the words with wavy underlines denote the emotional editing part. We select the template from the training set based on the keywords and the positional relationship of the keywords. The priority when selecting a template is as follows (in decreasing order): a sentence containing the same keywords in the same positional relationship, a sentence containing the same keywords in a different positional relationship, a sentence containing only the same topic keyword, and a sentence containing only the same emotional keyword. Sentences with the same priority are distinguished by lexical-level similarity, measured by the Jaccard distance between the candidate template sentence and the primary reply. According to these rules, the sentence with the highest priority and the highest similarity to the candidate reply is selected as the template sentence.
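The selection rule above can be sketched in a few lines. This is an illustrative reading rather than the authors' code: sentences are assumed to be token lists, similarity is computed as Jaccard similarity on the word sets (one minus the Jaccard distance), and the helper names are hypothetical.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity of two token lists (1 - Jaccard distance)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def keyword_order(tokens, emotion_kw, topic_kw):
    """Return the positional relationship, or None if a keyword is missing."""
    if emotion_kw not in tokens or topic_kw not in tokens:
        return None
    if tokens.index(emotion_kw) < tokens.index(topic_kw):
        return "emotion_first"
    return "topic_first"

def priority(tokens, emotion_kw, topic_kw, wanted_order):
    """Lower value means higher priority; None means not usable."""
    has_emotion, has_topic = emotion_kw in tokens, topic_kw in tokens
    if has_emotion and has_topic:
        same = keyword_order(tokens, emotion_kw, topic_kw) == wanted_order
        return 0 if same else 1
    if has_topic:
        return 2
    if has_emotion:
        return 3
    return None

def pick_template(candidates, reply, emotion_kw, topic_kw):
    """Choose the candidate with the best priority, then highest similarity."""
    wanted_order = keyword_order(reply, emotion_kw, topic_kw)
    best, best_key = None, None
    for cand in candidates:
        level = priority(cand, emotion_kw, topic_kw, wanted_order)
        if level is None:
            continue
        key = (level, -jaccard_similarity(cand, reply))
        if best_key is None or key < best_key:
            best, best_key = cand, key
    return best
```

Sorting by the pair (priority level, negative similarity) makes the similarity act purely as a tie-breaker within one priority level, which matches the description in the text.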
(2) Calculating the emotion editing vector: Previous work represents multi-word insertions and deletions as the sum of the inserted word vectors. In contrast, to enhance the effect of the edit vector on emotion, we introduce an emotion coefficient for each word in a sentence: the smaller the distance between a word and the emotional keyword, the greater the emotional coefficient of the word. We multiply the word vector of each word to be modified by its emotion coefficient and then sum the results, thereby calculating the emotion editing vector. Formally, the words to be modified form two sets, the words added and the words deleted. For each set, the word vector of every word is weighted by the emotional coefficient of that word, which depends on the distance between the word and the emotional keyword, and the weighted vectors are summed; the two sums are then combined by a join operation into the vector that represents the difference between the two sentences.
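A small sketch may make the weighting concrete. The exact form of the emotion coefficient is not recoverable from the text, so the code below assumes a hypothetical coefficient 1 / (1 + distance), measures distance in token positions, treats the join operation as concatenation, and takes "added" words as those in the reply but not in the template (and "deleted" words as the reverse). All of these are assumptions made for illustration.

```python
def emotion_coefficient(distance):
    """Hypothetical coefficient: largest next to the emotional keyword."""
    return 1.0 / (1.0 + distance)

def weighted_sum(words, sentence, emotion_kw, embeddings, dim):
    """Sum the word vectors of `words`, each scaled by its emotion coefficient."""
    total = [0.0] * dim
    # If the keyword is absent, treat every word as maximally distant.
    kw_pos = sentence.index(emotion_kw) if emotion_kw in sentence else None
    for word in words:
        if kw_pos is None:
            distance = len(sentence)
        else:
            distance = abs(sentence.index(word) - kw_pos)
        coeff = emotion_coefficient(distance)
        for i, value in enumerate(embeddings[word]):
            total[i] += coeff * value
    return total

def emotion_edit_vector(template, reply, emotion_kw, embeddings, dim):
    """Concatenate the weighted sums over added and deleted words."""
    inserted = [w for w in dict.fromkeys(reply) if w not in template]
    deleted = [w for w in dict.fromkeys(template) if w not in reply]
    return (weighted_sum(inserted, reply, emotion_kw, embeddings, dim)
            + weighted_sum(deleted, template, emotion_kw, embeddings, dim))
```

With word vectors of dimension `dim`, the resulting edit vector has dimension `2 * dim`: the first half describes insertions and the second half deletions.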
Following the work of Guu et al., we add noise to perturb the edit vector. The edit vector is decomposed into its norm and its direction, and the perturbed direction is drawn from a distribution over points on the unit sphere that is parameterized by a mean vector and a concentration parameter. The norm is truncated, and the resulting edit vector is obtained from the perturbed direction and the truncated norm.
(3) Edit optimization: We employ an encoder-decoder architecture to implement the emotional editor, where the prototype is the input sequence and the revised sentence is the output sequence. We extend it to condition on the edit vector by concatenating the edit vector to the input of the decoder at each time step. After this emotional editing optimization, the final reply is obtained.
III-B Rewards Calculation
For the definitions of state and policy, we follow previous work. The state is represented by the post x input by the external environment. Note that we use a stochastic representation of the policy (a probability distribution over actions given states). In this subsection, we discuss the major factors that contribute to the success of a reply and describe how approximations to these factors can be operationalized in computable reward functions.
Coherence: The coherence reward combines two terms: the forward probability of generating the reply given the post, and the backward probability of generating the post given the reply. The backward model is trained in the same way as a standard sequence-to-sequence model, with sources and targets swapped. To control the influence of reply length, both terms are length-normalized, using the length of the reply and the length of the post, respectively. The coherence of a reply is calculated from these two normalized terms.
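The coherence reward described above can be sketched as follows. The sketch assumes the form used in earlier reinforcement-learning dialogue work, namely the sum of the length-normalized forward and backward log-probabilities; the use of logarithms and the function names are assumptions, and the per-token probabilities would in practice come from the forward and backward sequence-to-sequence models.

```python
import math

def sequence_logprob(token_probs):
    """Log-probability of a sequence from its per-token probabilities."""
    return sum(math.log(p) for p in token_probs)

def coherence_reward(forward_token_probs, backward_token_probs):
    """Length-normalized forward plus backward log-probability.

    forward_token_probs: probability of each reply token given the post
    backward_token_probs: probability of each post token given the reply
    """
    reply_len = len(forward_token_probs)
    post_len = len(backward_token_probs)
    forward = sequence_logprob(forward_token_probs) / reply_len
    backward = sequence_logprob(backward_token_probs) / post_len
    return forward + backward
```

Dividing by the sequence lengths keeps long replies from being penalized merely for containing more tokens.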
Topic relevance: We use the pre-trained LDA model mentioned earlier to make topic category predictions for the reply. The topic relevance of a reply is calculated from the topic category of the post, the probability distribution over topics that the LDA model predicts for the reply, and the total number of topic categories.
Emotion relevance: Analogously, the emotional relevance of a reply is calculated from the specified sentiment category, the probability distribution predicted by the emotion classifier for the reply, and the total number of sentiment categories.
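Because the exact formulas are not recoverable here, the following sketch shows only one plausible reading of both relevance rewards: the negative cross-entropy between a one-hot target category and the predicted distribution over all categories. The function name `category_relevance` and the smoothing constant `eps` are hypothetical.

```python
import math

def category_relevance(target, predicted, eps=1e-12):
    """Negative cross-entropy between a one-hot target and a prediction.

    target: index of the desired category (topic of the post, or the
            specified emotion)
    predicted: predicted probability for each category
    """
    n = len(predicted)
    one_hot = [1.0 if i == target else 0.0 for i in range(n)]
    return sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, predicted))
```

Under this reading the reward is at most 0 and approaches 0 as the predicted distribution concentrates on the target category; the same function serves for topic relevance (with the LDA prediction) and emotion relevance (with the classifier prediction).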
To strengthen the constraints on the reply generation process, a reward is calculated for each clause as the weighted sum of the indicators proposed above. Each clause has a different focus, so its weights in the reward calculation are different. The weight parameters were chosen through repeated experiments as those with which the model fits the corpus best. In this way, one reward is obtained for each of the three clauses, and a further reward is obtained for the reply after it has been spliced and edited. The process of calculating the reward is shown in Fig. 3. The final reward for generating a reply is computed from these rewards.
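The weighted combination can be illustrated with a short sketch. The weights used in the paper are not recoverable here, so the values in the usage example are placeholders, and summing the segment rewards into the final reward is an assumption; the function names are hypothetical.

```python
def segment_reward(coherence, topic, emotion, weights):
    """Weighted sum of the three indicators for one clause or full reply."""
    w_c, w_t, w_e = weights
    return w_c * coherence + w_t * topic + w_e * emotion

def final_reward(segments):
    """Assumed aggregation: sum of the rewards of all segments.

    segments: list of ((coherence, topic, emotion), weights) pairs, one
              per clause plus one for the spliced and edited reply.
    """
    return sum(segment_reward(*scores, weights) for scores, weights in segments)
```

For instance, a clause that mainly carries the emotional keyword could be given a larger emotion weight such as `(0.25, 0.25, 0.5)`, while a clause around the topic keyword could use `(0.25, 0.5, 0.25)`.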
The model uses multiple indicators to assess the reply comprehensively; therefore, to promote learning across indicators, the model introduces a multi-task learning strategy based on parameter sharing. In the process of generating the reply, and especially in the process of generating each clause, the encoder is shared. By using the same encoder, the indicators can be combined with each other, which is more conducive to measuring the quality of the reply from an overall perspective.
The model is initialized with parameters obtained by maximum likelihood estimation (MLE), which already allow it to generate some plausible replies. We then use policy gradient methods to find parameters that lead to a larger expected reward. The objective to maximize is the expected future reward, in which each action contributes the reward that results from it, and the gradient updates are computed with the likelihood ratio trick.
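To make the likelihood ratio trick concrete, here is a toy sketch on a three-action softmax policy rather than on a reply generator. Each update moves the parameters along the reward times the gradient of the log-probability of the sampled action; the function names, learning rate and step count are illustrative choices, not values from the paper.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax."""
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(logits, action):
    """Gradient of log pi(action) with respect to the logits."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def sample(probs, rng):
    """Draw an action index from a categorical distribution."""
    threshold, cumulative = rng.random(), 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if threshold < cumulative:
            return i
    return len(probs) - 1

def train(rewards, steps=2000, lr=0.1, seed=0):
    """Run REINFORCE-style updates and return the final policy."""
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        action = sample(softmax(logits), rng)
        grad = grad_log_prob(logits, action)
        # Likelihood-ratio update: reward times gradient of log-probability.
        logits = [l + lr * rewards[action] * g for l, g in zip(logits, grad)]
    return softmax(logits)
```

Calling `train([0.0, 0.0, 1.0])` shifts almost all probability mass to the third action, the only one that earns a reward. In the conversation model, an action would be a whole generated reply and the reward would be the combined coherence, topic and emotion reward.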
IV Experiments and Results
IV-A Dataset Description
The experiments used the emotional conversation dataset NLPCC2017 to train and test the proposed model. There are 1,118,341 post-reply pairs after the entire dataset has been filtered to remove meaningless sentences. Approximately 43.5% of the conversation replies contain two keywords. Here, we focus on data experiments with two keywords. A total of 80,000 pairs of sentences were randomly selected from the training set to train the LDA model, and 100,000 pairs of sentences were randomly selected to train the emotional classifier. Due to the length limit, we provide the implementation details in the supplementary file.
IV-B Baseline Models
In the experiments, our model is compared with the following baselines:
Seq2Seq: An encoder-decoder model for text generation that can generate fluent text. We compare against it on the quality of the generated text.
ECM: It introduces emotion embedding vectors and two storage mechanisms to generate emotional replies. We compare against it on the emotional intensity and emotional accuracy of the replies.
SentiGAN: It uses a GAN and a reinforcement learning strategy to generate emotional text. We compare against it on the emotional intensity and emotional accuracy of the replies.
E-SCBA: The model introduces both emotion and topic knowledge into generation to comprehensively optimize the quality of replies. We compare against it on the content quality and sentiment of the text.
W/O Edit: To verify the effect of the proposed emotion editor, the emotion editor is removed and compared to the complete model.
IV-C Manual Evaluation
We asked four annotators to evaluate the results of our model and baselines. In total, we used 700 conversations, 100 for each emotion category, which were sampled randomly from the test set. The annotators were asked to score a reply based on the following metrics:
Consistency measures the fluency and grammaticality of the reply on a three-point scale: 0, 1, 2. Logic measures the degree to which the post and the reply logically match, on the same three-point scale. Note that overly short or overly frequent replies, such as "Me too", would be annotated as either 0 or 1 (if the annotator thought the reply related to the post). Emotion measures whether the reply includes the right emotion. A score of 0 means the emotion in the reply is wrong or there is no emotion, a score of 1 means the reply has the correct emotion but the intensity is weak, and a score of 2 means the reply has the correct emotion and the intensity is strong. Because SentiGAN is generally used for emotional text generation, it is not included in the Logic comparison.
Table I compares our model with the baselines (significance was assessed with 2-tailed t-tests for Consistency, Logic, and Emotion). As we can see, the average performance of our model on the three indicators is better than that of the other models. The experimental results are further analysed below.
In terms of consistency and emotional relevance, our model is much better than the others. The model without the emotion editor, however, is not outstanding on these two indicators. This shows that the proposed emotion editor improves the fluency and emotional relevance of the reply. In terms of logic, our model does not achieve the best results for the surprised and angry emotions. This is mainly because the data for these two emotion categories are relatively scarce, so the selection of a template sentence may ignore the constraints on the topic, and the optimized reply then deviates from the topic.
The score distribution of each model in terms of logic and emotion is shown in Table II. For instance, 2-1 means that the logic score is 2 and the emotion score is 1. As observed, the baseline models have a small proportion of 2-2, which indicates that they cannot balance emotion and topic. However, the model proposed in this paper performs well in this respect, with the proportion of 2-2 reaching 41.7% and the proportion of replies with an emotion score of 2 reaching 67.3%, which shows that the proposed model makes up for the weak emotion of previous models.
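The joint distribution reported in such a table can be computed from the raw annotations as sketched below. The function name is hypothetical, and the input is assumed to be two equally long, non-empty lists holding the logic and emotion scores (0, 1 or 2) of the same replies.

```python
from collections import Counter

def joint_score_distribution(logic_scores, emotion_scores):
    """Proportion of replies for each 'logic-emotion' score pair."""
    pairs = Counter(zip(logic_scores, emotion_scores))
    total = sum(pairs.values())
    return {f"{l}-{e}": count / total for (l, e), count in pairs.items()}
```

The entry under the key `"2-2"` is then the share of replies that annotators rated as both fully logical and strongly emotional.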
IV-D Automatic Evaluation
We adopted perplexity to evaluate the model at the content level (to determine whether the content is relevant and grammatical). To evaluate the model at the emotion level, we adopted emotion accuracy, defined as the agreement between the expected emotion category (given as input to the model) and the emotion category that the emotion classifier predicts for the generated reply.
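Both automatic metrics are simple to compute. The sketch below uses the standard definition of perplexity (the exponential of the average negative log-probability per token) and plain agreement for emotion accuracy; the function names are hypothetical, and the per-token probabilities and predicted labels would come from the language model and the emotion classifier, respectively.

```python
import math

def perplexity(token_probs):
    """Exponential of the mean negative log-probability per token."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

def emotion_accuracy(expected, predicted):
    """Fraction of replies whose predicted emotion matches the expected one."""
    matches = sum(1 for e, p in zip(expected, predicted) if e == p)
    return matches / len(expected)
```

Lower perplexity indicates that the model assigns higher probability to the reference text, while higher emotion accuracy indicates that the generated replies more often carry the requested emotion.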
The results of the experiment are shown in Table III. To reduce the effect of chance, we ran 3 tests for each model. The results show that our model achieves the best results in terms of perplexity and emotional accuracy. The model without the emotion editor does not perform well on either measure, whereas after the emotion editor is added, the model's performance improves greatly. This shows that the proposed emotion editor can integrate the prior knowledge of the keywords into the reply naturally. Smoothing, optimizing, and other editing operations are performed on the replies according to the template, which not only makes the reply more fluent but also makes the emotion of the reply more prominent.
In Fig. 4, we visualize the diversity distribution of words at different positions (1-10) of the reply. Our model is intended to address the problem of generic replies, which can be defined as a high frequency of certain replies to posts as well as a large number of identical words produced at the same position. The results shown in the figure have been normalized.
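One simple way to quantify positional diversity is sketched below. The exact statistic behind the figure is not specified in the text, so this sketch assumes the ratio of distinct words to total words observed at each position, which lies between 0 and 1; the function name is hypothetical.

```python
def positional_diversity(replies, max_position=10):
    """Distinct-word ratio at each position across a set of replies.

    replies: list of token lists
    Returns one value per position; 0.0 if no reply reaches a position.
    """
    diversity = []
    for pos in range(max_position):
        words = [r[pos] for r in replies if len(r) > pos]
        diversity.append(len(set(words)) / len(words) if words else 0.0)
    return diversity
```

A model that often opens its replies with the same word would show a low value at the first position, which is the pattern associated with generic replies.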
The worst model of all is the general Seq2Seq, whose diversity at different positions is always low. In addition to insufficient information from posts, the fixed sequential structure limits the potential of the model, resulting in generic replies. In contrast, our model obtains sufficient information in the process of decoding by introducing prior knowledge of keywords. The editor is then used to optimize replies to further improve the text quality and emotional relevance without falling back on a single safe reply. Besides, the colour of our model fades more slowly, showing that our model improves not only the quality of content but also the capacity of memory.
IV-E Case Study
We provide some examples in Fig. 5. The words with single underlines denote the keywords, and the words with wavy underlines denote the emotional editing optimization part. As we can see, the general Seq2Seq prefers to generate short and meaningless replies. The replies are more like a summary of the posts than a conversation.
The ECM model is improved compared to the Seq2Seq model, and it can generate fluent and emotional replies. However, the examples demonstrate that most of the replies generated by the ECM have weak or even no emotion, such as "Where is it here? Ask for explanation". This is because ECM only introduces the emotion embedding vector to guide the model to generate an emotional reply and lacks detailed emotional guidance during generation, resulting in replies with fuzzy emotion. In contrast, the replies generated by our model, after the optimization of emotional editing, not only have rich and varied sentence patterns but also show greatly enhanced emotional relevance and intensity. For example, the happy reply "I have never seen such a cute cake!" and the angry reply "Such bad weather!" contain more realistic and more detailed emotions.
IV-F Conclusions and Future Work
This paper proposes an emotional conversation generation model based on reinforcement learning. Our model generates replies in three iterations and proposes a mechanism of emotional editing that refers to existing sentences to further strengthen the content quality and emotional relevance of the replies. Subjective and objective experiments show that the proposed model can generate logical and emotional replies while ensuring the fluency of replies, and that the emotion is more prominent and delicate. In the future, we will enhance the flexibility of the model by introducing other knowledge (such as tone) and customize a personalized framework to meet the specific needs of actual applications.
This work was supported by the State Key Program of the National Natural Science Foundation of China (61432004, 71571058, 61472117, and 61461045). This work was partially supported by a project funded by the China Postdoctoral Science Foundation (2017T100447). This research was also partially supported by a Qinghai Province Science and Technology Fund for fundamental and applied research (No. 2016-ZJ-743).
- C. S. Wong and K. S. Law, The effects of leader and follower emotional intelligence on performance and attitude: An exploratory study, The Leadership Quarterly, Elsevier, 2002.
- Mai Ngoc Khuong and Vu Ngoc Bich Tram, The Effects of Emotional Marketing on Consumer Product Perception, Brand Awareness and Purchase Decision - A Study in Ho Chi Minh City, Vietnam, Journal of Economics, Business and Management, 2015.
- Liangchen Luo, Jingjing Xu, Junyang Lin, Qi Zeng and Xu Sun, An Auto-Encoder Matching Model for Learning Utterance-Level Semantic Dependency in Dialogue Generation, arXiv:1606.01541, 2016.
- Liu, Y. Huang, M. Zhou and W.-Y. Ma, Topic Aware Neural Response Generation, Association for the Advancement of Artificial Intelligence, 2017.
- Gaurav Pandey, Danish Contractor, Vineet Kumar and Sachindra Joshi, Exemplar Encoder-Decoder for Neural Conversation Generation, Association for the Advancement of Artificial Intelligence, 2018.
- Kelvin Guu, Tatsunori B. Hashimoto, Yonatan Oren and Percy Liang, Generating Sentences by Editing Prototypes, Association for Computational Linguistics, 2018.
- H. Zhou, M. Huang, T. Zhang, X. Zhu and B. Liu, Emotional chatting machine: Emotional conversation generation with internal and external memory, Association for the Advancement of Artificial Intelligence, 2017.
- Mahipal Jadeja, Neelanshi Varia and Agam Shah, Deep Reinforcement Learning for Conversational AI, International Conference on Theory of Information Retrieval, 2017.
- Thomas M. Moerland, Joost Broekens and Catholijn M. Jonker, Emotion in Reinforcement Learning Agents and Robots: A Survey, arXiv:1705.05172, 2017.
- Ke Wang and Xiaojun Wan, SentiGAN: Generating Sentimental Texts via Mixture Adversarial Networks, International Joint Conference on Artificial Intelligence, 2018.
- L. Mou, Y. Song, R. Yan, G. Li, L. Zhang and Z. Jin, Sequence to backward and forward sequences: A content-introducing approach to generative short-text conversation, International Conference on Computational Linguistics, 2016.
- Michel Galley, Chris Brockett, Alessandro Sordoni, Yangfeng Ji, Michael Auli, Chris Quirk et al., deltaBLEU: A discriminative metric for generation tasks with intrinsically diverse targets, International Joint Conference on Natural Language Processing, 2015.
- Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao and Dan Jurafsky, Deep Reinforcement Learning for Dialogue Generation, arXiv:1606.01541, 2016.
- T. Partala and V. Surakka, The effects of affective interventions in human-computer interaction, Interacting with Computers, 2004.
- Helmut Prendinger and Mitsuru Ishizuka, The empathic companion: A character-based interface that addresses users' affective states, Applied Artificial Intelligence, 2005.
- I. Sutskever, O. Vinyals and Q. V. Le, Sequence to sequence learning with neural networks, Neural Information Processing Systems, 2014.
- K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk and Y. Bengio, Learning phrase representations using RNN encoder-decoder for statistical machine translation, Empirical Methods in Natural Language Processing, 2014.
- C. Liu, R. Lowe, I. Serban, M. Noseworthy, L. Charlin and J. Pineau, How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation, Empirical Methods in Natural Language Processing, 2016.
- I. V. Serban, A. Sordoni, Y. Bengio, A. C. Courville and J. Pineau, Building end-to-end dialogue systems using generative hierarchical neural network models, Association for the Advancement of Artificial Intelligence, 2016.
- Shuman Liu, Hongshen Chen, Zhaochun Ren, Yang Feng, Qun Liu and Dawei Yin, Knowledge Diffusion for Neural Dialogue Generation, Association for the Advancement of Artificial Intelligence, 2018.
- Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov and Eric P. Xing, Toward controlled generation of text, International Conference on Machine Learning, 2017.
- T. Cagan, S. L. Frank and R. Tsarfaty, Data-driven broad-coverage grammars for opinionated natural language generation, Association for Computational Linguistics, 2017.
- Goodfellow, Generative Adversarial Nets, Neural Information Processing Systems, 2014.
- Jingyuan Li and Xiao Sun, A Syntactically Constrained Bidirectional-Asynchronous Approach for Emotional Conversation Generation, arXiv:1806.07000, 2018.
- Lantao Yu, Weinan Zhang, Jun Wang and Yong Yu, SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient, Association for the Advancement of Artificial Intelligence, 2017.
- L. Xu, H. Lin, Y. Pan, H. Ren and J. Chen, Constructing the affective lexicon ontology, Journal of the China Society for Scientific and Technical Information, 2008.
- Thomas Hofmann, Probabilistic Latent Semantic Indexing, Special Interest Group on Information Retrieval, 1999.
- Yu Zhang and Qiang Yang, A Survey on Multi-Task Learning, arXiv:1707.08114, 2018.
- Ronald J. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning, Machine Learning, 1992.
- L. Shang, Z. Lu and H. Li, Neural responding machine for short-text conversation, International Joint Conference on Natural Language Processing, 2015.