Natural Language Generation Using Reinforcement Learning with External Rewards

11/26/2019 ∙ by Vidhushini Srinivasan, et al. ∙ UNC Charlotte

We propose an approach towards natural language generation using a bidirectional encoder-decoder which incorporates external rewards through reinforcement learning (RL). We use attention mechanism and maximum mutual information as an initial objective function using RL. Using a two-part training scheme, we train an external reward analyzer to predict the external rewards and then use the predicted rewards to maximize the expected rewards (both internal and external). We evaluate the system on two standard dialogue corpora - Cornell Movie Dialog Corpus and Yelp Restaurant Review Corpus. We report standard evaluation metrics including BLEU, ROUGE-L, and perplexity as well as human evaluation to validate our approach.




I Introduction

We aim to develop models that are capable of generating language across multiple genres of text – say, conversational text and restaurant reviews. After all, humans are adept at both. Extant natural language generation (NLG) models work on either conversational text (e.g. movie dialogues) or longer text (e.g. stories, reviews) but not both [5], [4]. In addition, while the state-of-the-art in this field has advanced quite rapidly, current models are prone to generate language that is short, dull, off-context or vague. More importantly, the generated language may not adequately reflect the affective content of the input. Indeed, humans are adept at this task as well. To address these research challenges, we propose an RNN-LSTM architecture that uses an encoder-decoder network. We also use reinforcement learning that incorporates internal and external rewards. Specifically, we use emotional appropriateness as an internal reward for the NLG system – so that the emotional tone of the generated language is consistent with the emotional tone of prior context fed as input to the model. We also effectively incorporate usefulness scores as external rewards in our model. Our main contribution is the use of distantly labeled data in an architecture that generates coherent, affective content and we test the architecture across two different genres of text.

II Problem Statement and Intuition

Our goal is to take advantage of reinforcement learning and external rewards during the process of language generation. Complementary to this goal, we also aim to generate language that has the same emotional tone as the preceding input. Emotions are recognized as functional in decision-making by influencing motivation and action selection [12]. However, external feedback and rewards are hard to come by for language generation; they would need to be provided through crowdsourced judgments on the generated responses during the generation process, which makes the process time-consuming and impractical. To overcome this issue, we turn to distant labeling [2] and use labels provided in the training set as a proxy for human judgments on the generated responses. Specifically, we incorporate usefulness scores in a restaurant review corpus as a proxy for external feedback.

Fig. 1: Overall architecture of the system, showing internal and external rewards using reinforcement learning

III Model Architecture

Figure 1 shows the overall system architecture. We use a bidirectional RNN seq2seq encoder-decoder network with maximum mutual information as the objective function. We tune the system using reinforcement learning with internal reward functions that promote ease of answering and semantic coherence [4] along with emotional intelligence [7], and use external rewards from human feedback to generate emotionally rich responses. Table I shows examples of system output on two different corpora, the Cornell Movie Dialog Corpus and the Yelp review dataset.

III-A Affective Word Embeddings

We use the Affective Norms for English Words (ANEW) lexicon [8], which provides valence, arousal, and dominance scores for words, to augment existing word embeddings [16]. We append the Valence (V), Arousal (A), and Dominance (D) scores from the ANEW lexicon to each word, resulting in 1027 dimensions per word. Where a match cannot be found in the lexicon, we append a neutral vector [5, 1, 5], similar to [7]. This word2vec-VAD embedding is fed as input to the bidirectional RNN encoder-decoder seq2seq model and is also used in the RL system to model the Emotional Intelligence heuristic.
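The augmentation step above can be sketched as follows. This is a minimal illustration with toy data: `base_embeddings` stands in for the real 1024-dimensional word2vec vectors and `anew_vad` for the ANEW lexicon, neither of which is reproduced here.

```python
# Sketch of the word2vec-VAD augmentation: append [V, A, D] scores to each
# word vector; words missing from the lexicon get the neutral vector [5, 1, 5].

NEUTRAL_VAD = [5.0, 1.0, 5.0]  # fallback when a word has no ANEW entry

def augment_with_vad(base_embeddings, anew_vad):
    """Append VAD scores to each embedding (1024 + 3 = 1027 dims in the paper)."""
    augmented = {}
    for word, vec in base_embeddings.items():
        vad = anew_vad.get(word, NEUTRAL_VAD)
        augmented[word] = list(vec) + list(vad)
    return augmented

# Toy example with 4-d "embeddings" and made-up VAD values:
base = {"happy": [0.1, 0.2, 0.3, 0.4], "xyzzy": [0.0, 0.0, 0.0, 0.0]}
vad = {"happy": [8.2, 6.5, 6.6]}
aug = augment_with_vad(base, vad)  # "xyzzy" falls back to the neutral vector
```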

III-B Adaptive RL System

Approach | Objective Function | Internal Rewards | External Rewards
Li et al. | Policy Gradient Method | Ease of Answering; Information Flow; Semantic Coherence | —
Our Proposed Approach | Maximum Mutual Information (MMI) | Ease of Answering; Semantic Coherence; Emotional Intelligence | Human Feedback
TABLE II: State-of-the-art method vs. our proposed approach. We use a different objective function, and internal as well as external rewards in our model.

We use a bidirectional RNN encoder-decoder seq2seq model [19] with Bahdanau-style attention mechanism [20]. Next, we use a greedy decoder to generate the best response at every stage of decoding during the decoder training and inference phases. We fine-tune the basic seq2seq generative model with the internal and external rewards in our RL tuner system to generate more interesting, diverse, and emotionally appropriate responses. The internal rewards take into account coherence, flexibility in answering, and emotional intelligence measures, whereas the external reward incorporates human feedback to make the responses resemble human-produced ones. Our approach is closely related to Li et al. [4], but with key differences in the objective function and the use of external rewards. These are highlighted in Table II.

The standard objective function for seq2seq models is the log-likelihood of target T given source S:

\hat{T} = \arg\max_T \log p(T \mid S)

This formulation leads to generic responses, since it only selects the target given the source. We optimize this standard objective function by replacing it with Maximum Mutual Information (MMI) [14]. In MMI, parameters are chosen to maximize the (pairwise) mutual information between the source S and the target T:

\hat{T} = \arg\max_T \log \frac{p(S, T)}{p(S)\,p(T)}


Doing so avoids favoring responses that unconditionally enjoy high probability, and instead biases towards responses that are specific to the given input. The MMI objective can be written as follows:

\hat{T} = \arg\max_T \left\{ (1 - \lambda) \log p(T \mid S) + \lambda \log p(S \mid T) \right\}

Here, \lambda is the hyperparameter that controls the extent to which we penalize generic responses in order to obtain more diverse responses. Adjusting the value of \lambda results in a reasonable number of diverse responses; however, these can still be dull and lack emotion and proper grammatical structure. To address these issues, we model the reinforcement learning system with appropriate heuristics.
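The MMI trade-off can be illustrated with a small reranking sketch. The candidate strings and their forward/backward log-probabilities below are hypothetical numbers chosen only to show the effect of the objective.

```python
# Sketch of MMI-style reranking: combine the forward likelihood p(T|S) with
# the backward likelihood p(S|T), weighted by the hyperparameter lambda.

def mmi_score(log_p_t_given_s, log_p_s_given_t, lam):
    """(1 - lam) * log p(T|S) + lam * log p(S|T)."""
    return (1.0 - lam) * log_p_t_given_s + lam * log_p_s_given_t

# Hypothetical candidates as (response, log p(T|S), log p(S|T)):
candidates = [
    ("i don't know", -1.0, -9.0),          # generic: high p(T|S), poor p(S|T)
    ("the pasta was great", -3.0, -2.0),   # specific to the given prompt
]
lam = 0.5
best = max(candidates, key=lambda c: mmi_score(c[1], c[2], lam))
# With lam = 0.5, the specific response wins despite its lower forward score.
```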

The generated sentences from the seq2seq model can be viewed as actions taken by the policy defined by the encoder-decoder language model. The parameters of this encoder-decoder network are fine-tuned using reinforcement learning with the policy gradient method [4]. The components are:

Action (a) – the dialogue utterance to generate, i.e., the sequence of tokens produced by the RNN-LSTM. The action space is infinite, as sequences of varying length can be generated.

State (S) – the current dialogue (the prompt for which a response must be generated), transformed into a vector representation.

Policy – the policy takes the form p(a | S), where a is the response to be generated for the given dialogue S. We use a stochastic policy, i.e., a probability distribution over actions given states, where both states and actions are dialogues. By doing so, we avoid the difficulty of optimizing a deterministic policy, which would lead to a discontinuous objective that cannot be used with gradient-based methods.

Rewards (r) – We implement three internal rewards and one external reward to overcome the issues in generating language with the seq2seq architecture [19]. The three internal rewards are Ease of Answering (r_EA), Semantic Coherence (r_SC), and Emotional Intelligence (r_EI) [4], and the one external reward, Human Feedback (r_HF), comes from human feedback [6].

  1. Ease of Answering (EA) (r_EA) – measured as the negative log-likelihood of generating a dull response for a dialogue. Following [4], [3], and [1], we compose a list of 10 dull responses that frequently occur in the seq2seq model and penalize the model when it generates them. (The 10 responses are: “I don’t know.”, “I don’t know what I mean.”, “I don’t know what you’re talking about.”, “You don’t know.”, “You know what I mean.”, “You know what I’m saying.”, “You don’t know anything.”, “I am not sure.”, “I know what you mean.”, “I do not know anything.”) Let the set \mathbb{S} represent this list of dull responses. The reward function can then be defined as follows:

     r_{EA} = -\frac{1}{N_{\mathbb{S}}} \sum_{s \in \mathbb{S}} \frac{1}{N_s} \log p_{seq2seq}(s \mid a)

     Here, p_{seq2seq} represents the likelihood output of the seq2seq model. The RL system penalizes the utterances in the above list and is hence less likely to generate dull responses. Each term is scaled by the length N_s of the dull response s.

  2. Semantic Coherence (SC) (r_SC) [4] – used to avoid situations in which generated responses are highly rewarded but are neither grammatical nor coherent. We consider the mutual information between the action and the given input to ensure that the responses are coherent and appropriate. This also involves reverse-training the model, where we compute the probability of the input prompt given the current generated response.

  3. Emotional Intelligence (EI) (r_EI) [7] – incorporated by minimizing the affective dissonance between prompts and responses, i.e., maintaining affective consistency between the input and the generated response. The heuristic is based on the observation that open-domain textual conversations between humans follow an affective pattern. We therefore assume that the affective tone does not fluctuate often in general, and we focus on minimizing the dissonance in affective tone between the input prompt and the generated response:

     r_{EI} = -\left\| \mu_p - \mu_a^{(t)} \right\|

     Here, W2AV denotes the word-affect vector of a given sequence, \mu_p denotes the average affect vector of the input prompt, and \mu_a^{(t)} denotes the average affect vector of the generated response up to the current step t, both computed with W2AV.

  4. Human Feedback (HF) (r_HF) – To incorporate external rewards in our model, the external reward analyzer is trained with human feedback, which we simulate using the usefulness scores of the reviews in the Yelp dataset. We categorize each review in the Yelp dataset into two main classes, Useful and Not Useful, based on the frequency distribution of the scores (as shown in Figure 2): reviews with normalized scores below a chosen threshold are considered not useful, while the rest are considered useful. We exclude all reviews that do not have usefulness ratings, since it is not clear which category they would fall under. Next, we train an SVM classifier to differentiate between the two classes, Useful and Not Useful, as described above. During the training phase, we determine whether each generated response is useful (by classifying the generated output in real time using the SVM classifier) and assign the reward accordingly. This synthetic feedback [6] from the external reward analyzer is provided throughout the training phase, and a greedy decoder is then used to generate the best response.
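The distant-labeling step behind the external reward can be sketched as follows. The min-max normalization and the 0.5 cutoff are illustrative assumptions; the paper derives its cutoff from the frequency distribution of the scores (Figure 2), and the real system uses a trained SVM rather than this stub.

```python
# Sketch: normalize raw Yelp usefulness scores and split reviews into
# Useful / Not Useful, then map the label to a binary external reward.

def label_usefulness(raw_scores, threshold=0.5):
    """Min-max normalize raw usefulness scores, then threshold the result."""
    lo, hi = min(raw_scores), max(raw_scores)
    span = (hi - lo) or 1.0  # avoid division by zero for constant scores
    labels = []
    for s in raw_scores:
        norm = (s - lo) / span
        labels.append("Useful" if norm >= threshold else "Not Useful")
    return labels

def external_reward(label):
    """Binary reward from the (simulated) external feedback analyzer."""
    return 1.0 if label == "Useful" else 0.0

labels = label_usefulness([0, 1, 2, 10])  # only the last review clears 0.5
```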

IV Experiments and Results

We test the efficacy of the proposed method at generating text on two corpora that pertain to different genres of text: the Cornell Movie-Dialog corpus [17] and the Yelp Restaurant Review dataset. The Cornell Movie-Dialog corpus [17] contains a large, metadata-rich collection of fictional conversations extracted from raw movie scripts: 220,579 conversational exchanges between 10,292 pairs of movie characters, involving 9,035 characters and 304,713 utterances in total. The Yelp review dataset contains 5.9M reviews; along with the reviews, it contains nine additional features, including the usefulness score, which we use as the external reward. We perform standard pre-processing steps on both the Cornell and Yelp datasets, including lowercasing all conversations, expanding contractions, compressing duplicate end punctuation to a single symbol, and removing HTML entities.
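The pre-processing steps listed above can be sketched with a small helper. The contraction table here is a tiny illustrative subset, not the full mapping used in the experiments.

```python
import html
import re

def preprocess(text):
    """Apply the paper's pre-processing: strip HTML entities, lowercase,
    expand contractions, and compress repeated end punctuation."""
    text = html.unescape(text)   # remove HTML entities such as &amp;
    text = text.lower()          # lowercase all conversations
    contractions = {"don't": "do not", "i'm": "i am", "it's": "it is"}
    for short, full in contractions.items():  # illustrative subset only
        text = text.replace(short, full)
    # Compress runs of duplicate end punctuation ("!!!" -> "!").
    text = re.sub(r"([!?.])\1+", r"\1", text)
    return text.strip()
```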

Table III shows the descriptive statistics for both corpora. We take the most common 12,000 words from the training and validation sets as our vocabulary, as in [7], and replace any other tokens in these sets with an unknown symbol, UNK. We partition the training and validation sets such that none of the responses in the training set contain UNK; this effectively prevents the model from generating the unknown token during inference. The maximum word count per sequence is set to 20.
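The vocabulary construction and UNK replacement can be sketched as below; the function names are illustrative, and the toy vocabulary size stands in for the paper's 12,000.

```python
from collections import Counter

def build_vocab(token_lists, size=12000):
    """Keep the `size` most common tokens across the given tokenized texts."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return {tok for tok, _ in counts.most_common(size)}

def replace_unknowns(tokens, vocab, unk="UNK", max_len=20):
    """Map out-of-vocabulary tokens to UNK and truncate to max_len words."""
    return [tok if tok in vocab else unk for tok in tokens[:max_len]]

# Toy corpus with a vocabulary capped at 3 tokens:
vocab = build_vocab([["the", "food", "was", "good"], ["the", "service"]], size=3)
```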

Corpus | Train | Validation | Test
Cornell corpus | 160,000 | 14,000 | 6,000
Yelp corpus | 4,017,986 | 1,187,406 | 791,604
TABLE III: Descriptive statistics for the two corpora used in our experiments
Fig. 2: Yelp-Useful score distribution.

IV-A Proposed Model & Baseline

Our baseline is a basic seq2seq model with the MMI objective function [14] that does not use reinforcement learning. Our proposed model uses seq2seq to choose the initial policy and fine-tunes that model to generate more diverse responses based on the internal and external rewards described in Section III-B.

For the Cornell corpus, the final reward function uses only the internal reward components during reinforcement learning:

r = \gamma_1 r_{EA} + \gamma_2 r_{SC} + \gamma_3 r_{EI}

For the Yelp corpus, the final reward function is the weighted sum of both internal and external rewards:

r = \gamma_1 r_{EA} + \gamma_2 r_{SC} + \gamma_3 r_{EI} + \gamma_4 r_{HF}
For both models, the values of the hyperparameters are given in Table IV. There are certain differences in the hyperparameters (e.g., batch size and number of epochs) because the Yelp corpus is larger than the Cornell corpus. The learning rate and decay rate are greater for Yelp because the model takes longer to train and converge than on the smaller Cornell corpus. The reward weights are adjusted and fine-tuned based on the outcome of each model.
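The final reward combination can be sketched as a weighted sum. The internal-reward weights mirror the Cornell settings reported in Table IV; the external-reward weight `w_hf` and the sample reward values are illustrative assumptions.

```python
# Sketch of the final reward: weighted sum of the internal rewards
# (ease of answering, semantic coherence, emotional intelligence),
# plus the external human-feedback reward for the Yelp model.

def combined_reward(r_ea, r_sc, r_ei, r_hf=None,
                    weights=(0.25, 0.35, 0.40), w_hf=0.25):
    """Weighted sum of internal rewards; add the external term if present."""
    r = weights[0] * r_ea + weights[1] * r_sc + weights[2] * r_ei
    if r_hf is not None:          # Yelp model adds the external reward
        r += w_hf * r_hf
    return r

# Cornell-style (internal only) vs. Yelp-style (internal + external):
r_cornell = combined_reward(0.5, 0.5, 0.5)
r_yelp = combined_reward(0.5, 0.5, 0.5, r_hf=1.0)
```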

Hyperparameter | Cornell model | Yelp model
Batch size | 128 | 512
Gradient clip | 1.0 | 1.0
Learning rate | 0.01 | 0.15
Decay rate | 0.0095 | 0.01
Epochs | 50 | 75
LSTM layers | 2 | 2
Encoder RNN size | 1027 | 1027
Decoder RNN size | 1027 | 1027
\gamma_1 (EA weight) | 0.25 | 0.25
\gamma_2 (SC weight) | 0.35 | 0.25
\gamma_3 (EI weight) | 0.40 | 0.25
TABLE IV: Hyperparameter settings for the models with reinforcement learning used in our approach.
Corpus | Model | BLEU | ROUGE-L | Perplexity
Cornell Corpus | without RL | 0.15 | 0.39 | 98.96
 | with RL | 0.38** | 0.55*** | 76.65*
Yelp Corpus | without RL | 0.014 | 0.24 | 99.04
 | with RL | 0.21** | 0.32** | 85.34**
TABLE V: Model evaluation on automated metrics. *, **, and *** denote increasing levels of statistical significance over the corresponding baseline.

IV-B Performance on Automated Metrics

We evaluate the model using automated metrics: BLEU score [18], ROUGE-L, and perplexity. In Table V, we report the scores on these metrics. For the Cornell corpus, the scores with RL are statistically significantly better than the baseline (without RL) on all three metrics, at the significance levels indicated in Table V. For the Yelp corpus, the model with external rewards likewise performs significantly better on all three metrics when compared to the baseline (without RL).

IV-C Human Evaluation of Performance

To evaluate the performance of our model with human ratings, we performed two rounds of crowd-sourced annotation.

First, we created a simple survey containing 20 prompt/response pairs from both the Cornell and Yelp models. We recorded a total of 52 responses from undergraduate and graduate Computer Science students. Each response generated by the system was evaluated on three measures: Syntactic Coherence (how grammatical and coherent the response is with respect to the given prompt), Natural Flow (how natural the response is to the given prompt), and Emotional Appropriateness (how emotionally appropriate the text is to the given prompt) [7].

Next, we conducted experiments on Amazon Mechanical Turk with 100 prompt/response pairs from both the Cornell and Yelp models. Each response was rated by at least 5 workers on the measures of Syntactic Coherence, Natural Flow, and Emotional Appropriateness. Table VI shows the ratings obtained from human evaluation on these three metrics. Our model achieves better ratings on all three metrics for the Yelp reviews, where we generate longer sentences, and it also outperforms the current state-of-the-art model [7], as demonstrated in Table VII.

Corpus | Evaluation | Syntactic Coherence | Natural Flow | Emotional Appropriateness
Cornell Corpus | Survey | 1.45 | 1.42 | 1.44
 | MTurk | 1.49 | 1.41 | 1.53
Yelp Corpus | Survey | 1.46 | 1.52 | 1.73
 | MTurk | 1.51 | 1.50 | 1.66
TABLE VI: Human evaluation of our models' performance on measures of syntactic coherence, naturalness of flow, and emotional appropriateness of the generated response. Scores are averages on a 3-point Likert scale (0 lowest, 2 highest), with higher scores indicating better performance on a given metric.
Model | Syntactic Coherence | Natural Flow | Emotional Appropriateness
Cornell Model (ours) | 1.49 | 1.41 | 1.53
Asghar et al. [7] | 1.45 | 1.31 | 1.33
TABLE VII: Comparison of our model against the best performing model from the state-of-the-art baseline, Asghar et al. [7], on measures of syntactic coherence, naturalness of flow, and emotional appropriateness of the generated response. Scores are averages on a 3-point Likert scale (0 lowest, 2 highest), with higher scores indicating better performance on a given metric.

From our results, we can observe that incorporating external rewards results in higher (on average) scores across our metrics of Syntactic Coherence, Natural Flow and Emotional Appropriateness than when external rewards are not incorporated in the model.

V Related Work

Language Generation using Reinforcement Learning: Our work closely follows that of Li et al. [4], who proposed an advancement in seq2seq models using reinforcement learning to obtain diverse, coherent responses that could sustain conversations. They designed appropriate reward functions to overcome some of the challenges in traditional seq2seq models. Li et al. [11] also proposed using adversarial training for open-domain dialogue generation, casting the task as a reinforcement learning problem in which a generative model producing response sequences is jointly trained with a discriminator that distinguishes human-generated dialogues from machine-generated ones. Similar to our proposed method, [14] used Maximum Mutual Information (MMI) as the objective function. More recently, Sankar and Ravi [23] proposed using discrete attributes such as sentiment, dialogue acts, and emotion to generate responses through reinforcement learning, leading to improvements over traditional seq2seq models. However, none of these prior works incorporated external rewards during the reinforcement learning phase.

Incorporating External Rewards: Christiano et al. [6] used external rewards to fine-tune their reinforcement learning model; however, their system was trained for Atari games and simulated robot locomotion, not language generation. Perhaps the closest work to ours is that of Niu and Bansal [1], who generated polite language by assigning rewards proportional to the politeness-classifier score of the sampled response. Their work, however, does not consider emotional appropriateness.

Generation of Emotional Language: Emotions are recognized as functional in decision-making by influencing motivation and action selection. Therefore, computational emotion models are usually grounded in the agent's decision-making architecture, of which reinforcement learning is an important subclass. Moerland et al. [12] provide the first survey of computational models of emotion in reinforcement learning agents, focusing on agent/robot emotions. Badoy and Teknomo [13] proposed using four basic emotions (joy, sadness, fear, and anger) to influence a Q-learning agent; their simulations show that the proposed affective agent requires fewer steps to find the optimal path.

With respect to language generation, Asghar et al. [7] incorporated affective content into neural models by using the ANEW lexicon [8], appending affect scores to word embeddings and using affective objective functions to achieve affective response generation. Zhou et al. [9] proposed the Emotional Chatting Machine, which generates responses that are appropriate not only in content (relevant and grammatical) but also emotionally consistent with the input prompt. However, these prior approaches do not incorporate external feedback as a reward towards generating emotionally rich, coherent, and useful language.

VI Discussion and Future Work

The novelty of our approach lies in the addition of Emotional Intelligence as an internal reward function and the combination of internal and external rewards to create an emotionally appropriate model. The use of external rewards to generate more sensible, human-like responses is novel in the natural language generation task, with the exception of recent work by Niu and Bansal [1].

Our approach is also able to generate language across two different genres of text: one model serves a conversational agent trained on movie dialogues, and the other generates Yelp restaurant reviews.

We determine through our experiments that our seq2seq model with the MMI objective function and appropriately designed reward functions can generate diverse, coherent, and emotionally appropriate responses. Moreover, automated metrics such as BLEU, perplexity, and ROUGE-L are inadequate measures of how well the model performs. Through human evaluation, we determine that the model performs well and outperforms the state-of-the-art baseline on measures of syntactic coherence, naturalness of flow, and emotional appropriateness.

In the future, we plan to experiment with different heuristics, such as maximizing affective dissonance and affective content, as the emotional intelligence heuristic in the reward system. We have used the usefulness score in the Yelp restaurant review dataset as external feedback; we also plan to incorporate direct human feedback into the training phase. All the code used in these experiments and a repository of additional examples is available at


  • [1] Niu, T. and Bansal, M., 2018. Polite dialogue generation without parallel data. Transactions of the Association for Computational Linguistics, 6, pp.373-389.
  • [2] Mintz, M., Bills, S., Snow, R. and Jurafsky, D., 2009, August. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2 (pp. 1003-1011). Association for Computational Linguistics.

  • [3] Lowe, R., Pow, N., Serban, I. and Pineau, J., 2015, September. The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue (pp. 285-294).
  • [4] Li, J., Monroe, W., Ritter, A., Jurafsky, D., Galley, M. and Gao, J., 2016, November. Deep Reinforcement Learning for Dialogue Generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (pp. 1192-1202).
  • [5] Guo, J., Lu, S., Cai, H., Zhang, W., Yu, Y. and Wang, J., 2018, April. Long text generation via adversarial training with leaked information. In Thirty-Second AAAI Conference on Artificial Intelligence.

  • [6] Christiano, P.F., Leike, J., Brown, T., Martic, M., Legg, S. and Amodei, D., 2017. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems (pp. 4299-4307).
  • [7] Asghar, N., Poupart, P., Hoey, J., Jiang, X. and Mou, L., 2018, March. Affective neural response generation. In European Conference on Information Retrieval (pp. 154-166). Springer, Cham.
  • [8] Warriner, A.B., Kuperman, V. and Brysbaert, M., 2013. Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior research methods, 45(4), pp.1191-1207.
  • [9] Zhou, H., Huang, M., Zhang, T., Zhu, X. and Liu, B., 2018, April. Emotional chatting machine: Emotional conversation generation with internal and external memory. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • [10] Jaques, N., Gu, S., Turner, R.E. and Eck, D., 2017. Tuning recurrent neural networks with reinforcement learning.

  • [11] Li, J., Monroe, W., Shi, T., Jean, S., Ritter, A. and Jurafsky, D., 2017, September. Adversarial Learning for Neural Dialogue Generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (pp. 2157-2169).
  • [12] Moerland, T.M., Broekens, J. and Jonker, C.M., 2018. Emotion in reinforcement learning agents and robots: a survey. Machine Learning, 107(2), pp.443-480.

  • [13] Badoy Jr, W. and Teknomo, K., 2016. Q-learning with basic emotions. arXiv preprint arXiv:1609.01468.
  • [14] Li, J., Galley, M., Brockett, C., Gao, J. and Dolan, B., 2016. A Diversity-Promoting Objective Function for Neural Conversation Models. In Proceedings of NAACL-HLT (pp. 110-119).
  • [15] Sequeira, P., Melo, F.S. and Paiva, A., 2014. Learning by appraising: an emotion-based approach to intrinsic reward design. Adaptive Behavior, 22(5), pp.330-349.
  • [16] Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S. and Dean, J., 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119).

  • [17] Danescu-Niculescu-Mizil, C. and Lee, L., 2011, June. Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs. In Proceedings of the 2nd workshop on cognitive modeling and computational linguistics (pp. 76-87). Association for Computational Linguistics.
  • [18] Papineni, K., Roukos, S., Ward, T. and Zhu, W.J., 2002, July. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics (pp. 311-318). Association for Computational Linguistics.
  • [19] Vinyals, O. and Le, Q., 2015. A neural conversational model. arXiv preprint arXiv:1506.05869.
  • [20] Bahdanau, D., Cho, K. and Bengio, Y., 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  • [21] Chen, H., Liu, X., Yin, D. and Tang, J., 2017. A survey on dialogue systems: Recent advances and new frontiers. Acm Sigkdd Explorations Newsletter, 19(2), pp.25-35.
  • [22] Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural computation, 9(8), pp.1735-1780.

  • [23] Sankar, C. and Ravi, S., 2019. Deep Reinforcement Learning For Modeling Chit-Chat Dialog With Discrete Attributes, arXiv preprint arXiv:1907.02848.