Scalable Sentiment for Sequence-to-sequence Chatbot Response with Performance Analysis

by Chih-Wei Lee, et al.
Academia Sinica

Conventional seq2seq chatbot models only try to find the sentences with the highest probabilities conditioned on the input sequences, without considering the sentiment of the output sentences. Some research works that try to modify the sentiment of the output sequences have been reported. In this paper, we propose five models to scale or adjust the sentiment of the chatbot response: a persona-based model, reinforcement learning, a plug and play model, a sentiment transformation network and cycleGAN, all based on the conventional seq2seq model. We also develop two evaluation metrics to estimate whether the responses are reasonable given the input. These metrics, together with two other popularly used metrics, were used to analyze the performance of the five proposed models on different aspects, and reinforcement learning and cycleGAN were shown to be very attractive. The evaluation metrics were also found to be well correlated with human evaluation.



1 Introduction

Unlike goal-oriented dialogue systems [1, 2], a chatbot aims to chat with human users on any subject domain of daily life [3, 4]. The conventional chatbot is based on a seq2seq model [5] to generate meaningful responses given the user input. It is in general emotionless, and this is a major limitation of chatbots today, because emotion plays a critical role in human social interactions, especially in chatting [6]. We therefore wish to train the chatbot to generate responses with scalable sentiment by setting a sentiment mode for chatting. For example, for the input "How was your day today?", the chatbot may respond "It is wonderful today" or "It is terrible today" depending on the sentiment set, in addition to simply generating a reasonable response. The mode can be set by the developer or the user, or determined dynamically from the context of the dialogue. The techniques mentioned here may be extended to conversational style adjustment, so the machine may imitate the conversational style of someone the user is familiar with, to make the chatbot more friendly or more personal [7, 8].

Substantial effort has been focused on the conversational fluency and content quality of generated responses, for example by enriching content diversity [9, 10, 11], considering additional information [12], and addressing unknown words [13, 14]. Some works have tried to generate responses with controllable factors. The sentiment of a given sentence was successfully modified using non-parallel data [15]. A chatbot that can change the style of its responses by optimizing a given function related to sentiment was also developed [16]. However, not much work has been reported on scaling the sentiment of a chatbot, and how to properly evaluate a chatbot with adjustable sentiment remains a difficult problem [17, 18].

In this paper, we propose five approaches to scale the sentiment of chatbot responses, together with a set of evaluation metrics, and use these metrics to analyze the proposed approaches. The five proposed approaches are: a persona-based model, reinforcement learning, a plug and play model, a sentiment transformation network and cycleGAN, all based on the seq2seq model. The set of four metrics evaluates different aspects of the chatbot responses: two measure whether the responses are appropriate for the input; one measures whether the sentiment of the responses is properly modified; one measures whether the responses are grammatically good without considering the input. We then analyze the proposed approaches with these metrics, and find reinforcement learning and cycleGAN to be very attractive.

2 Proposed Approaches

Section 2.1 briefly reviews the conventional seq2seq chatbot, which is the basic model used by all five proposed approaches presented in Section 2.2. Below we assume we wish to make the chatbot response positive conditioned on the input, although it is easy to generalize the approaches to scalable sentiment.

2.1 Seq2seq Model (baseline)

Here we use an attention-based seq2seq model [19] as in Figure 1 to train a simple chatbot using a corpus of dialogue pairs. In all discussions here, $x$ is the input sentence to the seq2seq chatbot, $\hat{y}$ is the output of the seq2seq model, and $y$ is the reference response in the training corpus. In the training phase, we input the sentence $x$ (a sequence of one-hot vectors) to the encoder, and the seq2seq model learns to maximize the probability of generating the sentence $y$ given $x$.

Figure 1: Seq2seq model.

2.2 The Five Proposed Approaches

2.2.1 Persona-Based Model

Figure 2: Persona-based Seq2seq model

The persona-based model was originally proposed to generate sentences mimicking the responses of specific speakers [12]. It is very close to the seq2seq model, except that extra information is added to the input of the decoder at each time step. In the original work [12], this extra information is a trained speaker embedding. Here we replace the speaker embedding with a sentiment score (a scalar between 0 and 1) from a sentiment classifier, as in Figure 2. This sentiment classifier [20] is trained with a corpus of sentences with labeled sentiments to determine whether a sentence is positive or not. The input of the classifier is a sentence, and the output is a score between 0 and 1 indicating how positive the input is. The input of the decoder at every time step is then the concatenation of the word embedding and the sentiment score. During training, the sentiment score of the reference sentence is used, and the decoder learns to generate the reference sentence. During testing, given the same input, we are able to scale the sentiment of the output by entering the desired sentiment score.
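As a concrete illustration, the decoder-side conditioning above amounts to appending the same scalar score to the word embedding at every time step. Below is a minimal NumPy sketch; the function name and dimensions are hypothetical, not taken from the paper.

```python
import numpy as np

def decoder_inputs(word_embeddings, sentiment_score):
    """Append a scalar sentiment score to the word embedding
    at every decoder time step (persona-based conditioning).

    word_embeddings: (T, d) array, one row per time step.
    sentiment_score: scalar in [0, 1] from the sentiment classifier.
    """
    T = word_embeddings.shape[0]
    score_column = np.full((T, 1), float(sentiment_score))
    return np.concatenate([word_embeddings, score_column], axis=1)

emb = np.zeros((5, 8))               # 5 time steps, 8-dim embeddings
x = decoder_inputs(emb, 0.9)         # request a strongly positive response
print(x.shape)                       # (5, 9); last column is all 0.9
```

At test time only this appended scalar changes, which is what makes the output sentiment adjustable for a fixed input.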

2.2.2 Reinforcement Learning

Here exactly the same seq2seq chatbot as in Figure 1 is used, except that we design a set of reward functions to scale the response sentiment with reinforcement learning. The three components of the reward function are developed as follows.

(1) Semantic Coherence 1: The response $\hat{y}$ should be semantically coherent with the input $x$, in addition to being a good sentence. So we pre-trained a different seq2seq model on a large dialogue corpus to estimate this semantic coherence with a probability $P(\hat{y} \mid x)$. The first reward is therefore:

$$R_1 = \frac{1}{|\hat{y}|} \log P(\hat{y} \mid x), \qquad (1)$$

where $x$ and $\hat{y}$ denote the input and response of the baseline seq2seq chatbot (not the pre-trained seq2seq model), and $|\hat{y}|$ is the length of $\hat{y}$ for normalization.

(2) Semantic Coherence 2: The semantic coherence mentioned above can be estimated in a completely different way. We use the same dialogue corpus to train an RNN discriminator, in which two RNN encoders represent the input $x$ and output $\hat{y}$ as two embeddings, which are concatenated and followed by a fully connected layer to produce a score between 0 and 1 indicating whether $x$ and $\hat{y}$ form a good dialogue pair. This score is therefore the second reward,

$$R_2 = D(x, \hat{y}). \qquad (2)$$
(3) Sentiment Score: The third reward is based on the sentiment classifier mentioned above in Section 2.2.1,

$$R_3 = SC(\hat{y}), \qquad (3)$$

where $\hat{y}$ is the seq2seq chatbot response and $SC$ denotes the sentiment classifier score.

The total reward is then the linear interpolation of the three rewards mentioned above,

$$R = \lambda_1 R_1 + \lambda_2 R_2 + (1 - \lambda_1 - \lambda_2) R_3, \qquad (4)$$

where $\lambda_1$ and $\lambda_2$ are hyper-parameters ranging from 0 to 1. We employ the reinforcement learning algorithm with policy gradient [21].
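The interpolation of the three rewards is simple to state in code. The sketch below also shows the REINFORCE-style surrogate loss that weights a sampled response's log-probability by its total reward; the function names and numeric values are illustrative assumptions, not the paper's implementation.

```python
def total_reward(r1, r2, r3, lam1, lam2):
    """Linear interpolation of the three rewards,
    with weights lam1, lam2 and (1 - lam1 - lam2)."""
    assert 0.0 <= lam1 <= 1.0 and 0.0 <= lam2 <= 1.0 and lam1 + lam2 <= 1.0
    return lam1 * r1 + lam2 * r2 + (1.0 - lam1 - lam2) * r3

def policy_gradient_loss(log_prob_response, reward, baseline=0.0):
    """REINFORCE surrogate: minimizing this increases the probability
    of sampled responses whose reward exceeds the baseline."""
    return -(reward - baseline) * log_prob_response

r = total_reward(r1=0.2, r2=0.8, r3=0.6, lam1=0.25, lam2=0.25)   # -> 0.55
loss = policy_gradient_loss(log_prob_response=-12.0, reward=r, baseline=0.5)
```

In practice the baseline is an average reward over sampled responses, which reduces the variance of the gradient estimate.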

2.2.3 Plug and Play Model

Figure 3: Plug and play model. VRAE denotes variational recurrent auto-encoder.

We borrow the concept of plug and play, previously used to generate images [22], to generate dialogue responses here, as shown in Figure 3. We additionally pre-train a variational recurrent auto-encoder (VRAE) [23] using the same dialogue corpus. The encoder of the VRAE (on the left) transforms a sentence into a fixed-length latent vector $z$, while the decoder of the VRAE (in the middle right) generates a sentence from a latent vector. The encoder and decoder of the VRAE are also jointly learned from the dialogue corpus for the chatbot.

The following steps happen on-line when the user enters a sentence. Given an input $x$, the seq2seq baseline first generates a response $\hat{y}$, which is then encoded into a latent code $z$ by the VRAE encoder. The latent code $z$ is then modified into $z'$ according to the following equation:

$$z' = \arg\max_{z'} \left[ \lambda_1 \, SC(\mathrm{Dec}(z')) - \lambda_2 \, \|z' - z\|^2 \right], \qquad (5)$$

where $SC$ denotes the sentiment classifier, $\mathrm{Dec}$ the VRAE decoder, and $\lambda_1$ and $\lambda_2$ are the weights of the loss function term and the regularization term. The first term on the right-hand side of Eq.(5) means we are looking for a code $z'$ such that when it is decoded into a sentence by the VRAE decoder, the sentiment score of that sentence is maximized. The second term of Eq.(5) prevents the code $z'$ from drifting too far from $z$. To solve Eq.(5), we calculate the gradient of the sentiment score with respect to the latent code and apply gradient ascent to the latent code iteratively¹, until the sentiment score reaches a pre-defined value. Because Eq.(5) has to be solved on-line after the user enters an input sentence, this approach is more time consuming.

¹ Since the argmax layer between the decoder and the sentiment classifier is non-differentiable, we use soft argmax [24] to approximate it, so the gradient can be back-propagated through the whole network, from the sentiment classifier to the decoder.
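The iterative gradient-ascent procedure can be sketched numerically. In this toy version the sentiment classifier composed with the decoder is replaced by a differentiable stand-in scorer sigmoid(w·z'), so the gradient is available in closed form; the scorer, dimensions, step size and stopping threshold are all illustrative assumptions, not the paper's network.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def adjust_latent(z, w, lam1=1.0, lam2=0.05, lr=0.1, target=0.9, steps=500):
    """Gradient ascent on  lam1 * s(z') - lam2 * ||z' - z||^2,
    where s(z') = sigmoid(w . z') stands in for classifier(decoder(z'))."""
    z_prime = z.copy()
    for _ in range(steps):
        s = sigmoid(w @ z_prime)
        if s >= target:                              # positive enough: stop
            break
        # analytic gradient: sentiment term pushes along w,
        # regularizer pulls z' back toward the original code z
        grad = lam1 * s * (1.0 - s) * w - 2.0 * lam2 * (z_prime - z)
        z_prime += lr * grad
    return z_prime

rng = np.random.default_rng(0)
z = rng.standard_normal(16)       # latent code of the original response
w = rng.standard_normal(16)       # toy scorer weights
z_new = adjust_latent(z, w)
```

The loop mirrors the on-line cost of the method: the optimization must be re-run for every user input.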

2.2.4 Sentiment Transformation Network

This model is very similar to the plug and play model described in Section 2.2.3 and Figure 3, except that here a sentiment transformation network with parameter set $\theta$ is learned, which maps the latent code $z$ to a vector $z' = T_\theta(z)$, and the objective function is maximized with respect to $\theta$ instead of $z'$. So Eq.(5) is replaced by:

$$\theta^{*} = \arg\max_{\theta} \left[ \lambda_1 \, SC(\mathrm{Dec}(T_\theta(z))) - \lambda_2 \, \|T_\theta(z) - z\|^2 \right], \qquad (6)$$

where $\lambda_1$ and $\lambda_2$ are the weights of the loss function term and the regularization term. During training, we fix the weights of the pre-trained VRAE and sentiment classifier, but randomly initialize and then update the sentiment transformation network. During testing, the code $z$ is adjusted by the sentiment transformation network learned with Eq.(6), and the decoder generates the response from the adjusted code.

2.2.5 CycleGAN (Cycle Generative Adversarial Network)

Figure 4: CycleGAN model for sentiment transformation. $T_{p \to n}$ and $T_{n \to p}$ are two translators, respectively from positive to negative and from negative to positive, and $D_p$ and $D_n$ are two discriminators, respectively for positive and negative sentiment.

Here we adopt the cycleGAN, which has been shown to be very successful in image style transformation even without paired data [25]. We show how to use cycleGAN to transform the sentiment of sentences from negative to positive, as in Figure 4. The model is trained with two sets of sentences from a corpus with labeled sentiments: a positive sentiment set $S_p$ and a negative sentiment set $S_n$. The sentences in the two sets are unpaired: for a given sentence in $S_n$, it is not known which sentence in $S_p$ corresponds to it. We train two seq2seq translators, $T_{n \to p}$ transforming a negative sentence into a positive one and $T_{p \to n}$ transforming a positive sentence into a negative one. We also train two discriminators, $D_p$ and $D_n$. They take a sequence of word embeddings as input and learn to distinguish whether this sequence is the word embeddings of a real sentence or was generated by a translator. Because the translator outputs are continuous word embeddings, the gradient can be back-propagated from the discriminator to the translator. It is worth mentioning that $T_{n \to p}$ and $T_{p \to n}$ transform sequences of word embeddings to sequences of word embeddings. We pre-train the word embedding model with Word2Vec [26], and it is fixed while training the cycleGAN. To transform the output sequence of word embeddings into a sentence, we simply select the word whose embedding has the highest cosine similarity to each word embedding in the sequence.
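The embedding-to-word decoding step described above is just a cosine nearest-neighbour lookup over the vocabulary. A small NumPy sketch, with a hypothetical four-word vocabulary and 2-D embeddings chosen purely for illustration:

```python
import numpy as np

def embeddings_to_words(seq_embeddings, vocab_embeddings, vocab):
    """Map each output embedding to the vocabulary word with the
    highest cosine similarity (the decoding step described above)."""
    # Normalize rows so plain dot products equal cosine similarities.
    v = vocab_embeddings / np.linalg.norm(vocab_embeddings, axis=1, keepdims=True)
    s = seq_embeddings / np.linalg.norm(seq_embeddings, axis=1, keepdims=True)
    best = (s @ v.T).argmax(axis=1)
    return [vocab[i] for i in best]

vocab = ["bad", "good", "sorry", "thank"]
vocab_emb = np.array([[1.0, 0.0], [-1.0, 0.1], [0.0, 1.0], [0.1, -1.0]])
out = embeddings_to_words(np.array([[-0.9, 0.2], [0.05, -0.8]]), vocab_emb, vocab)
print(out)  # ['good', 'thank']
```

Cosine similarity rather than Euclidean distance is used so that the lookup is insensitive to the magnitude of the generated embeddings.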

The concept of W-GAN [27] is used to train $D_p$ and $D_n$. The loss function of the discriminator $D_p$ is:

$$L_{D_p} = \mathbb{E}_{x_n \sim S_n}\left[ D_p(T_{n \to p}(x_n)) \right] - \mathbb{E}_{x_p \sim S_p}\left[ D_p(x_p) \right], \qquad (7)$$

where $x_n$ is a negative sentence sampled from $S_n$, and $T_{n \to p}(x_n)$ is the output of the translator taking $x_n$ as the input. $D_p$ learns to minimize Eq.(7), that is, to give as low a score as possible to the translated output (the first term on the right) and as high a score as possible to real positive sentences (the second term). The loss function of the discriminator $D_n$ is parallel to Eq.(7),

$$L_{D_n} = \mathbb{E}_{x_p \sim S_p}\left[ D_n(T_{p \to n}(x_p)) \right] - \mathbb{E}_{x_n \sim S_n}\left[ D_n(x_n) \right]. \qquad (8)$$
As in the Improved W-GAN, a gradient penalty is applied here. The loss functions for training the translators $T_{p \to n}$ and $T_{n \to p}$ are:

$$L_{T_{p \to n}} = \mathbb{E}_{x_p \sim S_p}\|T_{n \to p}(T_{p \to n}(x_p)) - x_p\|^2 + \mathbb{E}_{x_n \sim S_n}\|T_{p \to n}(T_{n \to p}(x_n)) - x_n\|^2 - \mathbb{E}_{x_p \sim S_p}\left[ D_n(T_{p \to n}(x_p)) \right], \qquad (9)$$

$$L_{T_{n \to p}} = \mathbb{E}_{x_p \sim S_p}\|T_{n \to p}(T_{p \to n}(x_p)) - x_p\|^2 + \mathbb{E}_{x_n \sim S_n}\|T_{p \to n}(T_{n \to p}(x_n)) - x_n\|^2 - \mathbb{E}_{x_n \sim S_n}\left[ D_p(T_{n \to p}(x_n)) \right]. \qquad (10)$$

The first terms on the right-hand side of Eqs.(9) and (10) are the same: given a positive sentence $x_p$, after being transformed into a negative sentence by $T_{p \to n}$ and then transformed back to positive by $T_{n \to p}$, the result should be very close to the original sentence $x_p$. Similarly for the second terms. The last terms of Eqs.(9) and (10) are different: $T_{p \to n}$ learns to generate output considered by $D_n$ to be a real negative sentence, while $T_{n \to p}$ learns to generate output considered by $D_p$ to be a real positive sentence. In this way the translators $T_{p \to n}$, $T_{n \to p}$ learn to transform sentences from one sentiment (positive or negative) to the other. Notice that the discriminators $D_p$, $D_n$ are jointly trained with the translators $T_{p \to n}$, $T_{n \to p}$. During testing, for any chatbot output $\hat{y}$, we simply use $T_{n \to p}$ to transform it into a positive sentence.
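Given discriminator scores and reconstructed embeddings, the W-GAN critic term and the cycle-consistency term reduce to simple means. The sketch below computes both from plain arrays; it omits the gradient penalty, and the function names and numbers are made up purely for illustration.

```python
import numpy as np

def critic_loss(scores_on_translated, scores_on_real):
    """W-GAN critic loss: score translated (fake) sentences low,
    real ones high, so the difference of means is minimized."""
    return np.mean(scores_on_translated) - np.mean(scores_on_real)

def cycle_loss(originals, reconstructions):
    """Cycle-consistency term: a sentence translated to the other
    sentiment and back should match the original embeddings."""
    return np.mean((originals - reconstructions) ** 2)

d_p = critic_loss(np.array([0.1, 0.2]), np.array([0.9, 1.1]))   # -> -0.85
x_p = np.ones((3, 4))                      # toy "positive" embedding matrix
cyc = cycle_loss(x_p, x_p * 0.9)           # -> 0.01
```

The translator's adversarial term is the negated mean critic score on its own outputs, so translator and critic pull the same quantity in opposite directions.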

3 Evaluation Metrics

Evaluation is always difficult in language generation, especially for chatbots. Here we propose two metrics, Semantic Coherence 1 and 2 (COH1, COH2), specifically for chatbots, which score whether the output sentence is a proper response to the input sentence. They are in fact the Semantic Coherence 1 and 2 rewards designed for reinforcement learning in Section 2.2.2. However, the seq2seq model and the RNN discriminator used to obtain these two scores were re-trained here, and are therefore slightly different models, although trained on the same corpus.

The third metric is the Sentiment Classifier score (SCL), used to measure how positive the output sentence is. This is in fact the sentiment classifier score used in the persona-based model in Section 2.2.1, but the classifier used here was re-trained and is therefore slightly different, although trained on the same corpus. The fourth metric is the Language Model score (LM), which checks whether the output sentence is a good sentence in terms of a language model [28]. The language model used here was trained on the One Billion Word language modeling benchmark [29] using a two-layer GRU [30] model,

$$LM(\hat{y}) = \frac{1}{|\hat{y}|} \log P_{LM}(\hat{y}), \qquad (11)$$

which is the language model log-probability of a sentence $\hat{y}$ normalized by the sentence length $|\hat{y}|$. Note that the third and fourth metrics, SCL and LM, consider the output sentence $\hat{y}$ only, not the input $x$. The first and second metrics, COH1 and COH2, however, consider the output $\hat{y}$ given the input $x$.
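The length-normalized score is just the average per-token log-probability, which keeps long and short responses comparable. A minimal sketch with hypothetical token probabilities:

```python
import math

def lm_score(token_logprobs):
    """Length-normalized LM score: sentence log-probability
    divided by the number of tokens."""
    return sum(token_logprobs) / len(token_logprobs)

# hypothetical per-token probabilities of a 4-token sentence
probs = [0.5, 0.25, 0.5, 0.125]
score = lm_score([math.log(p) for p in probs])
# equals log(product of the probabilities) / 4
```

Without the division by length, the raw log-probability would systematically penalize longer (but possibly perfectly fluent) responses.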

4 Experiments and Results

4.1 Experimental Setup

We trained all our models, including the seq2seq baseline and the five proposed models, using TensorFlow on the Twitter chatting corpus available in Marsan-Ma's github repository [31]. It contains about 3.7M dialogue pairs. The whole corpus was split into a training set and a validation set; the latter contained 28k dialogue pairs. The sentiment classifier used in this work was trained on the Twitter sentiment analysis corpus [32], which consists of 15M sentences with labeled sentiment. This corpus was also split into a training set and a validation set. The trained sentiment classifier reached an accuracy of on the validation set. We trained the six models, the seq2seq baseline and the five proposed models, using the training set, and evaluated them on the validation set. The four evaluation metrics reported are averages over the validation data.

4.2 Experimental Results

The results are listed in Table 1. Notice that the seq2seq baseline in the first row is the model used inside the five proposed models, so its output sentiment was not modified.

Table 1: Evaluation results for the different models proposed, where COH1, COH2, SCL and LM stand for the four evaluation metrics: Semantic Coherence 1, Semantic Coherence 2, Sentiment Classifier Score and Language Model score respectively. The first row is for the seq2seq baseline.

4.3 Discussion on the Results

First consider the seq2seq baseline model. Its sentiment classifier score (SCL) is close to 0.5, which means the baseline model was more or less unbiased toward positive or negative sentiment, so it is a reasonable baseline. Below we divide the discussion of the proposed models into two parts, according to the different architectures of these models.

4.3.1 Persona-Based Model and Reinforcement Learning

These two models directly modified how the seq2seq model generates its output; that is, the parameters of the seq2seq model themselves were changed.

For the persona-based model, the SCL score is extremely high, but its COH1 score is extremely low. This is probably because we fed the model the sentiment score from a pre-trained sentiment classifier, and as a result the model overfitted to this sentiment signal. It therefore tended to output sentences with the correct sentiment that were not necessarily coherent with the input. We also noticed that its output very often contained two phrases in one sentence, hence its language model score is lower than those of the other models.

The reinforcement learning model performed better than all other models on three of the four metrics, COH1, COH2 and LM, though not on the SCL score. This is because the rewards $R_1$ and $R_2$ in Eqs.(1) and (2) are parallel to COH1 and COH2, and $R_1$ in Eq.(1) also considers word ordering, which gives a high LM score. Its SCL score was also high (though not as high as that of the overfitted persona-based model) because the reward $R_3$ is also parallel to SCL, which made the output positive. Due to its sampling mechanism, the reinforcement learning model was able to generate diverse responses, which the other models could not achieve.

From the data we also observed that both the persona-based and reinforcement learning models were able to make complicated changes to the output sentences, which were rarely seen in the other models.

4.3.2 Plug and Play, Sentiment Transformation Network and CycleGAN

Instead of modifying the parameters of the seq2seq model, these three models modify the responses after they are generated by the seq2seq model.

Plug and Play and the Sentiment Transformation Network both try to modify the latent code of the sentence, and both use the gradient of the sentiment classifier. The sentiment classifier primarily considers the sentiment without really encoding the semantics of the sentence, so when its output is maximized, information from the original input may be lost. This is probably why the COH1 and COH2 scores of these two models are both lower than most of the others.

For cycleGAN, since the two translators directly output word embeddings carrying both sentiment and semantics, the translators were capable of finding mappings between words such as "bad" to "good", "sorry" to "thank", and "can't" to "can". However, it could only change or delete some specific words and failed to make complex modifications to whole sentences. Since it changes only a few words of the original response, its COH1 and COH2 scores were not too far from those of the seq2seq baseline.

Some examples are shown on the following link:

4.4 Human Evaluation

Table 2: Human evaluation scores on the three questions regarding Coherence, Sentiment and Grammar. The average scores were normalized to from to .

We performed a subjective human evaluation with 30 subjects, all of whom were graduate students. They were asked to answer three questions about the output sentences: (1) Coherence: Is the output sentence a good response to the input? (2) Sentiment: Is the output sentence positive? (3) Grammar: Is the output sentence grammatically correct? They gave scores on a fixed scale, calibrated by a few reference examples with given scores. The average results (normalized) are listed in Table 2.

Since the subjective human evaluation questions parallel the objective machine evaluation metrics, we calculated the Pearson correlation coefficients between the Coherence, Sentiment and Grammar scores in Table 2 and the COH1, SCL and LM scores in Table 1, respectively. This showed that the machine evaluation metrics used here were well correlated with human evaluation.
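The correlation computed here is the standard Pearson coefficient between the per-model human scores and the corresponding machine metric. A self-contained sketch with hypothetical score lists (the paper's actual numbers are not reproduced here):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# hypothetical per-model scores: human Coherence vs. machine COH1
human = [0.7, 0.2, 0.8, 0.4, 0.5, 0.6]
machine = [0.65, 0.25, 0.75, 0.45, 0.5, 0.55]
r = pearson(human, machine)   # close to 1 when the rankings agree
```

With only six models per comparison, a high coefficient is suggestive rather than conclusive, which is why all three question/metric pairs are reported.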

5 Conclusion

In this paper, we scale or adjust the sentiment of the chatbot response given the input. We propose five different models for this task, all based on the conventional seq2seq model. We also propose two metrics to evaluate whether the response is good for the given input. After careful evaluation and analysis of the five proposed models on different aspects, we found reinforcement learning and cycleGAN to be the most attractive. The reinforcement learning model was able to properly learn the different design goals and offered output sentences with good diversity. The cycleGAN model primarily performed word mapping on the original response, so the output sentence quality was largely preserved. The plug and play model and the sentiment transformation network were not as successful at the moment, probably because it is not easy to modify the latent code of a sentence while preserving its semantics and sentence quality.