
Investigation of Sentiment Controllable Chatbot

07/11/2020
by   Hung-Yi Lee, et al.

Conventional seq2seq chatbot models attempt only to find sentences with the highest probabilities conditioned on the input sequences, without considering the sentiment of the output sentences. In this paper, we investigate four models to scale or adjust the sentiment of the chatbot response: a persona-based model, reinforcement learning, a plug and play model, and CycleGAN, all based on the seq2seq model. We also develop machine-evaluated metrics to estimate whether the responses are reasonable given the input. These metrics, together with human evaluation, are used to analyze the performance of the four models in terms of different aspects; reinforcement learning and CycleGAN are shown to be very attractive.


I Introduction

In contrast to goal-oriented dialogue systems [lee2009example, wen2016network], a chatbot chats with human users on any subject domain of daily life [serban2016building, shang2015neural]. The conventional chatbot is based on the seq2seq model [vinyals2015neural], generating meaningful responses given user input. It is usually emotionless, which is a major limitation of modern chatbots, as emotion plays a critical role in human social interaction, especially in chatting [keltner1998emotion]. Hence we seek to train the chatbot to generate responses with scalable sentiment by setting the chat mode. For example, for the input “How was your day today?”, the chatbot may respond, “It is wonderful today” or “It is terrible today” depending on the sentiment set, in addition to simply generating a reasonable response. This mode can be set by the developer or the user, or determined dynamically based on the dialogue context. The techniques mentioned here may be extended to conversational style adjustment, so that the machine may imitate the conversational style of someone the user is familiar with, to make the chatbot more friendly or more personal [polzin2000emotion, hasegawa2013predicting].

Substantial effort has been focused on the conversational fluency and content quality of generated responses, for example, by enriching the content diversity [vijayakumar2016diverse, li2015diversity, li2016deep], considering additional information [li2016persona], and addressing unknown words [gu2016incorporating, eric2017copy]. Responses have also been generated with controllable factors. The sentiment of a given sentence can be modified using non-parallel data [shen2017style]. A chatbot can change the style of responses by optimizing a given sentiment-related function [mueller2017sequence]. However, little work has been reported on scaling the sentiment of a chatbot; it remains difficult to evaluate a chatbot with adjustable sentiment properly [shawar2007different, hung2009towards].

In this paper, we investigate four approaches to scaling the sentiment of chatbot responses, and analyze them using a set of machine-evaluated metrics together with human evaluation. This journal paper is an extension of a conference paper [SentimentLee2018], with additional results on two more corpora.

II Related Work

The approaches presented in Sections III-B, III-C, III-D, and III-E are related to Sections II-A1, II-A2, II-B1, and II-B2, respectively.

II-A Controllable Sentence Generation

Sentence generators based on deep learning show promising text generation capabilities, but the generated text cannot easily be controlled. Hence, a line of research aims at controlling the generated sentences, for example, their writing style or topic.

II-A1 Controlled by Input Factors

Sentence generation models can take some factors as input to influence the generation of their outputs. Conditioned recurrent neural networks (CRNN) are used to control the linguistic style of generated text [ficler-goldberg-2017-controlling]. Affect-LM customizes the degree of emotional content in generated sentences through an additional design parameter [AffectLM]. The conditional transformer language model (CTRL) [CTRL] is trained to condition on control codes that govern style, content, and task-specific behavior.

This category of approach has been used in dialogue generation. The persona model [li2016persona] encodes personas in distributed embeddings that capture individual characteristics such as background information and speaking style, and the embeddings influence the output of the decoder. In Section III-B, instead of encoding personas, the persona-based model is used to control sentiment. A conversational model has been proposed to generate informative responses with a controlled sentence function, for example, interrogative, imperative, or declarative [sentence_function]. This paper is closely related to the Emotional Chatting Machine (ECM) [zhou2017emotional]. ECM is a neural conversational model that can generate responses corresponding to given emotional categories. The basic idea of ECM is similar to the persona-based model, but with more sophisticated network architectures, including internal and external memories. However, training it requires dialogues with emotional responses, which are not always available.

II-A2 Controlled by External Function

In this approach, the sentence generation model is explicitly taught to generate sentences with certain aspects [ECIR_affect, Uber_PaP, RLarXiv19, SentiGAN]. A hand-crafted or machine-learned function guides the sentence generator to output sentences that the function judges to have the desired aspect (such as sentiment or topic). This approach has been used to make a dialogue generation model generate emotional responses, but hand-crafted functions were used in that previous work [ECIR_affect]. It has been shown that attribute models learned from data can successfully guide the sentence generator [Uber_PaP]. Because we focus on dialogue generation here, in Section III-C, besides considering the attribute (sentiment) of the responses, we use coherence models to further guide the model to generate responses that are coherent with the input sentences.

II-B Text Style Transfer

The text style transfer model transfers the input sentence from one style into another. The text style transfer approaches that do not utilize parallel data are used in Sections III-D and III-E. Below are two main categories of approaches to achieve text style transfer without parallel data. All the related work mentioned below only focuses on text style transfer, not dialogue generation as in Section III.

II-B1 Manipulating Latent Space

The latent representations of auto-encoders can be manipulated to induce a change in the output space to achieve text style transfer [pmlr-v80-zhao18b, mueller2017sequence]. The approach used in Section III-D belongs to this category. One way to manipulate the latent space to achieve text style transfer is feature disentanglement. By separating the content from the style in the latent space, we can modify the style without changing the content [shen2017style, Fu2017StyleTI, disentangleNAACP19, disentangleAAAI2020].

II-B2 Direct Modification

Instead of manipulating the latent space, this category of approach directly finds a model that can transform the text from one style to another. The approach used in Section III-E belongs to this category. A simple approach is to delete phrases associated with the sentence’s style and retrieve new phrases to replace them [ruleNAACL18]. Ideas similar to CycleGAN [zhu2017unpaired] or StarGAN [StarGAN], which have been widely used in image style transfer, have also been used. This category of approaches uses a discriminator to control the style of the generated content [NIPS2018_7959, cycleGANACL19], and a reconstruction loss to maintain the content [rewriting_ICLR19].

III Sentiment Controllable Chatbot

In Section III-A we briefly review the conventional seq2seq chatbot. The four approaches used here are presented in Sections III-B to III-E. All use the seq2seq chatbot as the basic model. The persona-based approach (Section III-B) and reinforcement learning (Section III-C) modify the training algorithm of the conventional seq2seq chatbot. Plug and play (Section III-D) and CycleGAN (Section III-E) modify the response of an off-the-shelf seq2seq chatbot. Below we assume that the chatbot response is to be positive conditioned on the input, although it is simple to generalize the approaches to scalable sentiment.

III-A Seq2seq Model

Here we use the attention-based seq2seq model [luong2015effective] shown in Figure 1 to train a simple chatbot using a corpus of dialogue pairs. In all discussions here, $x$ is the input sentence to the seq2seq chatbot, and $y$ is the output of the seq2seq model. $\hat{y}$ is the reference response in the training corpus. In the training phase, we input the sentence $x$ (a sequence of one-hot vectors) to the encoder, and the seq2seq model learns to maximize the probability of generating the sentence $\hat{y}$ given $x$.

Fig. 1: Seq2seq model

III-B Persona-Based Model

Fig. 2: Persona-based Seq2seq model.

The persona-based model was originally proposed to generate sentences that mimic the responses of specific speakers [li2016persona]. It is very similar to the seq2seq model, except that extra information is added to the input of the decoder at each time step. In the original work [li2016persona], this extra information is the trained speaker embedding. Here we replace the speaker embedding with a sentiment score (a scalar between 0 and 1) from a sentiment classifier, as shown in Figure 2. This sentiment classifier [liu2012sentiment] is trained on a corpus of sentences with labeled sentiments to determine whether a sentence is positive or not. The input of the classifier is a sentence $y$, and the output is a score between 0 and 1 indicating how positive the input is. The input of the decoder at every time step is then the concatenation of the word embedding and a sentiment score. During training, the sentiment score of the reference sentence is used, and the decoder learns to generate the reference sentence. For testing, given the same input, we scale the sentiment of the output by entering the desired sentiment score.
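The concatenation described above can be illustrated with a minimal NumPy sketch. It only shows how a scalar sentiment score might be appended to each decoder input vector; the function name `decoder_inputs`, the embedding size, and the example score 0.85 are hypothetical and not taken from the paper.

```python
import numpy as np

def decoder_inputs(word_embeddings: np.ndarray, sentiment_score: float) -> np.ndarray:
    """Append the same scalar sentiment score to every time step's word embedding.

    word_embeddings: array of shape (T, d), one row per decoder time step.
    Returns an array of shape (T, d + 1).
    """
    if not 0.0 <= sentiment_score <= 1.0:
        raise ValueError("sentiment_score is assumed to lie in [0, 1]")
    steps = word_embeddings.shape[0]
    score_column = np.full((steps, 1), sentiment_score, dtype=word_embeddings.dtype)
    return np.concatenate([word_embeddings, score_column], axis=1)

# Toy example: 3 time steps, 4-dimensional embeddings (sizes are illustrative).
emb = np.zeros((3, 4), dtype=np.float32)
train_in = decoder_inputs(emb, 0.85)  # training: classifier score of the reference
test_in = decoder_inputs(emb, 1.0)    # testing: sentiment score chosen by the user
```

At test time only the score changes, which is what allows the same input to yield responses with different sentiment.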

III-C Reinforcement Learning

Here we use exactly the seq2seq chatbot shown in Figure 1; the only modification is a set of reward functions designed to scale the response sentiment using reinforcement learning. The components of the reward functions are developed as follows.

  1. Semantic Coherence 1: In addition to being a good sentence, the response should be semantically relevant to the input $x$. Hence we pre-train a different seq2seq model on a large dialogue corpus to estimate this semantic coherence with a probability $P_{seq2seq}(y \mid x)$. The first reward is therefore

    $$R_1 = \frac{1}{N} \log P_{seq2seq}(y \mid x) \qquad (1)$$

    where $x$ and $y$ denote the input and response of the baseline seq2seq chatbot (not the pre-trained seq2seq model), and $N$ is the length of $y$ for normalization.

  2. Semantic Coherence 2: The semantic coherence mentioned above can be estimated in a completely different way. We use the same dialogue corpus to train an RNN discriminator, in which two RNN encoders are used to represent the input $x$ and its corresponding response $y$ as two embeddings; these two embeddings are concatenated and followed by a fully connected layer to produce a score $D(x, y)$ between 0 and 1 which indicates whether $x$ and $y$ form a good dialogue pair. This score is therefore the second reward:

    $$R_2 = D(x, y) \qquad (2)$$

  3. Sentiment Score: The third reward is based on the sentiment classifier mentioned in Section III-B:

    $$R_3 = S(y) \qquad (3)$$

    where $S(\cdot)$ is the sentiment classifier and $y$ is the seq2seq chatbot response.

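The total reward is then the linear interpolation of the three rewards mentioned above:

$$R = \lambda_1 R_1 + \lambda_2 R_2 + \lambda_3 R_3 \qquad (4)$$

where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are hyper-parameters ranging from 0 to 1 with $\lambda_1 + \lambda_2 + \lambda_3 = 1$. We employ the reinforcement learning algorithm with policy gradient [sutton2000policy].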
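As a rough illustration of how the three rewards might be combined, the sketch below computes a length-normalized log-probability for the first coherence reward and then interpolates the three rewards with weights that sum to 1. The function names and the numeric inputs are hypothetical; the per-token log-probabilities, the discriminator score, and the sentiment score would come from the trained models, which are not shown.

```python
import math

def coherence_reward_1(token_log_probs):
    """Length-normalized log-probability of the response under a pre-trained seq2seq model."""
    return sum(token_log_probs) / len(token_log_probs)

def total_reward(r1, r2, r3, weights):
    """Linear interpolation of the three rewards; weights must be non-negative and sum to 1."""
    l1, l2, l3 = weights
    if min(weights) < 0 or not math.isclose(l1 + l2 + l3, 1.0):
        raise ValueError("weights are assumed to be non-negative and to sum to 1")
    return l1 * r1 + l2 * r2 + l3 * r3

# Illustrative numbers only.
r1 = coherence_reward_1([-2.0, -1.0, -3.0])  # mean token log-probability
r = total_reward(r1, r2=0.6, r3=0.9, weights=(0.3, 0.3, 0.4))
```

Because the first reward is a log-probability while the other two lie between 0 and 1, the weights also set the relative scale of the terms, which is one reason they are treated as hyper-parameters.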

III-D Plug and Play Model

Fig. 3: Plug and play model. VRAE denotes variational recurrent auto-encoder.

As shown in Figure 3, to generate dialogue responses here, we borrow the concept of plug and play as used in generating images [nguyen2016plug]. In addition to the seq2seq chatbot, we pre-train a variational recurrent auto-encoder (VRAE) [fabius2014variational] using the same dialogue corpus. The VRAE encoder on the left transforms a sentence into a fixed-length latent vector $z$, while the VRAE decoder on the middle right generates a sentence based on a vector $z$. The VRAE encoder and decoder are jointly learned from the dialogue corpus for the chatbot.

The following steps take place on-line, when the user enters a sentence. Given an input $x$, the seq2seq baseline first generates a response $y$, which is then encoded into a latent code $z$ by the VRAE encoder. Then the latent code $z$ is modified into $z^*$, based on the following equation:

$$z^* = \arg\max_{z'} \; \alpha \, S(\mathrm{Dec}(z')) - \beta \, \lVert z' - z \rVert \qquad (5)$$

where $S$ denotes the sentiment classifier, $\mathrm{Dec}$ denotes the VRAE decoder, and $\alpha$ and $\beta$ are the weights of the loss function term and the regularization term. The first term on the right-hand side of Eq. (5) indicates that we seek a code $z'$ such that when decoded into a sentence using the VRAE decoder, the resulting sentiment score is maximized. The second term of Eq. (5) prevents the code $z'$ from drifting too far from $z$. To solve Eq. (5), we calculate the gradient of the sentiment score with respect to the latent code and apply gradient ascent to the latent code iteratively, until the sentiment score output reaches a pre-defined value. Because Eq. (5) is solved on-line after the user enters an input sentence, this approach is more time consuming. Since the argmax layer between the decoder and sentiment classifier in Eq. (5) is non-differentiable, we use soft argmax [kusner2016gans] to approximate argmax so that the gradient back-propagates throughout the whole network, from the sentiment classifier to the decoder.
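The iterative gradient ascent on the latent code can be sketched on a toy problem. In the sketch below, a logistic function of the latent code stands in for the whole decoder-plus-classifier pipeline, and the regularizer is taken to be a squared Euclidean distance so that its gradient is simple; both are assumptions made for illustration, as are the names `shift_latent`, `alpha`, `beta`, and all numeric values.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def shift_latent(z0, w, alpha=1.0, beta=0.01, lr=0.5, target=0.9, max_steps=1000):
    """Gradient ascent on alpha * s(z) - beta * ||z - z0||^2, with s(z) = sigmoid(w . z).

    s(z) is a toy stand-in for "decode z, then score the sentence with the
    sentiment classifier". Iteration stops once the score reaches `target`.
    """
    z = z0.copy()
    for _ in range(max_steps):
        s = sigmoid(w @ z)
        if s >= target:
            break
        # d/dz of the sentiment term minus d/dz of the regularization term
        grad = alpha * s * (1.0 - s) * w - 2.0 * beta * (z - z0)
        z = z + lr * grad
    return z, float(sigmoid(w @ z))

z0 = np.array([-1.0, 0.5, 0.0])  # latent code of the original response (toy values)
w = np.array([1.0, -1.0, 0.5])   # toy sentiment direction in latent space
z_star, score = shift_latent(z0, w)
```

If the regularization weight is too large relative to the sentiment weight, the ascent settles before the target score is reached, which reflects the trade-off between changing the sentiment and staying close to the original code.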

III-E CycleGAN

Fig. 4: CycleGAN model for sentiment transformation. $F$ and $G$ are two translators respectively from positive to negative and negative to positive, and $D_X$ and $D_Y$ are two discriminators respectively for positive and negative sentiment.

Here we adopt the very powerful cycle generative adversarial network (CycleGAN), which proved successful in image style transformation even without paired data [zhu2017unpaired]. As illustrated in Figure 4, we show a way to use CycleGAN to transform the sentiment of sentences from negative to positive. The model is trained on two sets of sentences in a corpus with labeled sentiments: a positive sentiment set $X$ and a negative sentiment set $Y$. The sentences in the two sets are unpaired; that is, for a given sentence in $X$, it is not known which is the corresponding sentence in $Y$. We train two seq2seq translators: $G$ to transform a negative sentence to positive and $F$ for positive to negative. We also train discriminators $D_X$ and $D_Y$. They take a sequence of word embeddings as input and learn to distinguish whether the sequence comes from the word embeddings of a real sentence or was generated by $G$ or $F$. With the continuous word embeddings as the translator output, the gradient can be back-propagated from the discriminator to the translator. Note that $G$ and $F$ transform sequences of word embeddings to sequences of word embeddings. We pre-train the word embedding model with Word2Vec [mikolov2013efficient]; it is fixed during CycleGAN training. To transform the output sequence of word embeddings into a sentence, we simply select, for each word embedding in the sequence, the word whose embedding has the highest cosine similarity to it.
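The last step, mapping a sequence of output embeddings back to words by cosine similarity, can be sketched as follows. The vocabulary, the two-dimensional embeddings, and the function name `embeddings_to_words` are toy assumptions for illustration; the sketch also assumes that no embedding is the zero vector.

```python
import numpy as np

def embeddings_to_words(seq, vocab_emb, vocab):
    """Map each row of `seq` (T, d) to the vocabulary word with the highest cosine similarity.

    vocab_emb: (V, d) fixed word embeddings; vocab: list of V words.
    """
    seq_n = seq / np.linalg.norm(seq, axis=1, keepdims=True)
    voc_n = vocab_emb / np.linalg.norm(vocab_emb, axis=1, keepdims=True)
    sims = seq_n @ voc_n.T  # (T, V) matrix of cosine similarities
    return [vocab[i] for i in sims.argmax(axis=1)]

# Toy vocabulary with 2-dimensional embeddings.
vocab = ["good", "bad", "day"]
vocab_emb = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]])
generated = np.array([[0.9, 0.1], [0.1, 2.0]])  # translator output (toy values)
words = embeddings_to_words(generated, vocab_emb, vocab)
```

Cosine similarity ignores vector length, so the second generated vector still maps to "day" even though its norm is far from that of the vocabulary embedding.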

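The concept of W-GAN [arjovsky2017wasserstein] is used to train $D_X$ and $D_Y$, the discriminators for positive and negative sentiment. The loss function of the discriminator $D_X$ is

$$L(D_X) = \mathbb{E}_{y \sim Y}\left[D_X(G(y))\right] - \mathbb{E}_{x \sim X}\left[D_X(x)\right] \qquad (6)$$

where $y$ is a negative sentence sampled from the negative set $Y$, and $G(y)$ is the output of the negative-to-positive translator $G$ taking $y$ as the input. $D_X$ learns to minimize Eq. (6), that is, to give low scores to the translated output (the first term on the right) and high scores to real positive sentences (the second term). The loss function of the discriminator $D_Y$ is parallel to Eq. (6):

$$L(D_Y) = \mathbb{E}_{x \sim X}\left[D_Y(F(x))\right] - \mathbb{E}_{y \sim Y}\left[D_Y(y)\right] \qquad (7)$$

As in improved W-GAN, gradient penalty is applied here. The loss functions for training translators $F$ (positive to negative) and $G$ (negative to positive) are

$$L(F) = \mathbb{E}_{x \sim X}\left[\lVert G(F(x)) - x \rVert\right] + \mathbb{E}_{y \sim Y}\left[\lVert F(G(y)) - y \rVert\right] - \mathbb{E}_{x \sim X}\left[D_Y(F(x))\right] \qquad (8)$$
$$L(G) = \mathbb{E}_{x \sim X}\left[\lVert G(F(x)) - x \rVert\right] + \mathbb{E}_{y \sim Y}\left[\lVert F(G(y)) - y \rVert\right] - \mathbb{E}_{y \sim Y}\left[D_X(G(y))\right] \qquad (9)$$

The first terms on the right-hand side of Eqs. (8) and (9) are the same. Given a positive sentence $x$, after being transformed into a negative sentence by $F$ and then transformed back to positive by $G$, it should be very close to the original sentence $x$; likewise for the second terms. The last terms of Eqs. (8) and (9) are different: $F$ learns to generate output that is considered by $D_Y$ to be a real negative sentence, whereas $G$ learns to generate output that is considered by $D_X$ to be a real positive sentence. In this way translators $F$ and $G$ learn to transform the sentences from one sentiment (positive or negative) to the other. Notice that the discriminators $D_X$ and $D_Y$ are jointly trained with the translators $F$ and $G$. During testing, for any chatbot output $y$, we simply use $G$ to transform it into a positive sentence $G(y)$.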
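The structure of these losses can be illustrated numerically. The sketch below assumes the reading given in the text: the discriminator loss is the mean score on translated outputs minus the mean score on real sentences, and each translator loss is the sum of two cycle-reconstruction errors minus the discriminator score on its own translated output. The gradient penalty is omitted, and the function names and all numbers are hypothetical stand-ins for network outputs.

```python
import numpy as np

def discriminator_loss(scores_fake, scores_real):
    """W-GAN style critic loss: low scores for translated outputs, high scores for real ones."""
    return float(np.mean(scores_fake) - np.mean(scores_real))

def translator_loss(cycle_err_x, cycle_err_y, scores_fake):
    """Two cycle-consistency terms plus an adversarial term that rewards fooling the critic."""
    return float(np.mean(cycle_err_x) + np.mean(cycle_err_y) - np.mean(scores_fake))

# Toy values standing in for discriminator scores and reconstruction errors.
d_loss = discriminator_loss(scores_fake=[0.2, 0.4], scores_real=[0.9, 0.7])
g_loss = translator_loss(cycle_err_x=[0.1, 0.3], cycle_err_y=[0.2, 0.2], scores_fake=[0.2, 0.4])
```

The same discriminator scores on translated output appear with opposite signs in the two losses, which is what makes the training adversarial.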

IV Experimental Setup

We trained and tested all our models, including the seq2seq and the four proposed models, on the following three corpora. The first two are in Chinese, whereas the third is in English. Using the training set, we trained five models, including the seq2seq baseline and the four proposed models; we evaluated these models using the testing set. All the evaluation metrics obtained are the average over the testing data. More dataset details are provided in Appendix A.

  1. We used the Chinese Emotional Conversation Generation (CECG) task [CECG], originally offered by the NII Testbeds and Community for Information Access Research (NTCIR) Project for the Short Text Conversation Task (STC) competition. CECG contains around 1.7M dialogue pairs; each sentence is labeled by one of the following six kinds of sentiments: like, sad, disgusted, angry, happy, and other. We reclassified five of these six sentiment categories into positive and negative categories: like and happy as positive sentiments, and sad, disgusted, and angry as negative sentiments. Both corpora were split into training and testing sets (the latter included 1k dialogue pairs). This corpus was used in both the sentiment classifier and the other models.

  2. We collected data from the PTT Boy-Girl board containing the titles and all the article replies from page 1 to page 4000, and we used the articles as context and the replies as responses. Replies include “like” and “boo”, which roughly correspond to positive and negative sentiment. However, as this dataset contains no true dialogue data, and the sentiment is not always precise, we use this dataset only for demonstration and not the main experiment. As with the previous dataset, this corpus was also used in both the sentiment classifier and other models.

  3. For English, the Twitter chatting corpus is available on Marsan-Ma’s GitHub repository [Marsan-Ma] using TensorFlow. This corpus, which contains 3.7M dialogue pairs, is split into training and testing sets, the latter of which includes 28k dialogue pairs. The sentiment classifier used in this work was trained from the Twitter Sentiment Analysis Corpus [pak2010twitter], which consists of 15M data with labeled sentiment (0 or 1). This corpus was also split into a training and testing set. The trained sentiment classifier achieved an accuracy on the validation set.
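The regrouping of the six CECG sentiment labels into two classes, described in the first item above, can be written as a small lookup. The 1/0 encoding and the choice to return `None` for the "other" category are assumptions made for this sketch; the text does not say how "other" sentences are handled.

```python
from typing import Optional

POSITIVE = {"like", "happy"}
NEGATIVE = {"sad", "disgusted", "angry"}

def to_binary_sentiment(label: str) -> Optional[int]:
    """Return 1 for positive labels, 0 for negative labels, and None for 'other'."""
    if label in POSITIVE:
        return 1
    if label in NEGATIVE:
        return 0
    if label == "other":
        return None  # assumed: not assigned to either class
    raise ValueError(f"unknown CECG label: {label}")

binary = [to_binary_sentiment(l) for l in ["like", "angry", "other"]]
```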

V Evaluation

V-A Evaluation Metrics

Evaluation is always difficult in language generation; this is even more so for chatbots. Here we propose two metrics: semantic coherence 1 and 2 (COH1, COH2) for chatbots, which are scores reflecting the degree to which the output sentence $y$ is a proper response to the input sentence $x$. These are in fact the semantic coherence 1 and 2 mentioned in Section III-C (Reinforcement Learning) designed for the reward function. However, the seq2seq model and the RNN discriminator used to obtain these two scores are re-trained here and are thus different models.

The third metric is the sentiment classifier score (SCL) used to measure how positive the output sentence is. This is the sentiment classifier score used in the persona-based model mentioned in Section III-B. Likewise, the sentiment classifier used here is re-trained and is thus different.

The fourth metric is the language model score (LM) which measures whether the output sentence is a good sentence in terms of a language model [mikolov2010recurrent]. The language model used here is composed of a two-layer GRU [cho2014learning] model:

$$\mathrm{LM}(y) = \frac{1}{N} \log P_{LM}(y) \qquad (10)$$

which is the language model log-probability for a sentence $y$, normalized with the sentence length $N$. Eq. (10) is also known as the negative log perplexity (PPL).
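The relation between this length-normalized score and perplexity can be checked with a short sketch. It assumes the score is the mean per-token natural-log probability, so that perplexity is the exponential of its negation; the token probabilities are made up for illustration.

```python
import math

def lm_score(token_log_probs):
    """Length-normalized log-probability of a sentence (mean per-token log-probability)."""
    return sum(token_log_probs) / len(token_log_probs)

def perplexity(token_log_probs):
    """Perplexity is the exponential of the negated length-normalized log-probability."""
    return math.exp(-lm_score(token_log_probs))

# A 3-token sentence with illustrative per-token probabilities 0.5, 0.25, 0.5.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.5)]
score = lm_score(log_probs)
ppl = perplexity(log_probs)
```

A higher (less negative) score corresponds to a lower perplexity, which matches the convention that larger metric values are better.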

Note that SCL and LM, the third and fourth metrics, consider only the output sentence $y$, not the input $x$. COH1 and COH2, the first and second metrics, however, consider the output $y$ given the input $x$. In all the following tables, larger values of the evaluation metrics represent better performance.

V-B Metric Models

The coherence score (1 and 2) in Chinese uses the PTT Gossiping board dialogue corpus collected by Justin Yang [Justin-Yang], which contains about 400,000 dialogue pairs. Here we split the data into training and testing sets; the testing set contains 1,000 pairs and the training set contains the rest.

For the LM score, in Chinese, we crawled the replies of articles from the top 50 most popular PTT boards, for a total of 25 million tokens of data. In English, the corpus of the One Billion Word Benchmark [chelba2013one] was used.

The coherence score for English and sentiment score for both Chinese and English share the same data with the experiments mentioned in Section IV. Although the metrics are trained with the same corpus, as mentioned above, the models of the metrics are different because of re-training, which guarantees a certain extent of fairness.

VI Individual Models

Due to space limitations, we only show the results of the individual models on the Chinese CECG dataset. The details of the hyper-parameter setup in the following experiments are shown in Appendix B.

VI-A Sentiment Classifier

To find a proper sentiment classifier, we evaluated six configurations, combining two ways of segmenting Chinese text (word-based and character-based) with three neural network architectures: a CNN with max pooling (CNN for short), a GRU with the last hidden state output (GRU-last), and a GRU averaging all hidden state outputs for the whole sequence (GRU-avg). The total number of characters used in character-based segmentation was 7,297, and the number of words in word-based segmentation was 50,000. We trained these six models with a batch size of 32 and 50,000 epochs. We evaluated the performance using the accuracy and the area under the receiver operating characteristic curve (AUC). For the accuracy score, model output scores greater than 0.5 were taken as predictions of positive sentiment; otherwise, they were taken as negative. The predictions were then compared to the ground-truth labels to calculate the accuracy. Among the best results of all architectures, shown in Table I, GRU-last with word segmentation yielded the best performance on both accuracy and AUC. GRU-last is therefore used as the sentiment classifier in the following experiments.
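The two evaluation measures can be sketched in plain Python. Accuracy uses the 0.5 threshold stated above; AUC is computed through its pairwise-ranking interpretation (the fraction of positive/negative pairs in which the positive example receives the higher score, with ties counted as one half). The scores and labels are toy values, not results from the paper.

```python
def accuracy(scores, labels, threshold=0.5):
    """Scores above the threshold are predicted positive (1), otherwise negative (0)."""
    preds = [1 if s > threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.6, 0.4, 0.2]  # toy classifier outputs
labels = [1, 0, 1, 0]          # toy ground-truth labels
acc = accuracy(scores, labels)
area = auc(scores, labels)
```

Unlike accuracy, AUC does not depend on the choice of threshold, which is why the two measures can rank classifiers differently.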

Segmentation   Word-based                   Character-based
Structure      CNN   GRU-last   GRU-avg     CNN   GRU-last   GRU-avg
Acc
AUC
TABLE I: Evaluation of sentiment classifiers under different NN architectures on Chinese CECG dataset.

VI-B Persona-Based Model

With a fixed sentiment score input of 1.0, we first compared the performance of the persona-based model with 64-, 128-, and 256-neuron seq2seq models at inference. As shown in Table II, the 256-neuron model scored highest in terms of COH1, COH2, and LM, while the 128-neuron model yielded the best SCL score. The number of neurons did not have a marked influence on the results. We chose the 256-neuron model for the following experiments.

Neurons   COH1     COH2    SCL     LM
64        -9.760   0.631   0.920   -2.621
128       -9.505   0.625   0.950   -2.310
256       -9.338   0.639   0.925   -2.178
TABLE II: Evaluation of persona-based models of different neuron sizes on Chinese CECG dataset.

Further, we applied other sentiment scores as input to generate sentences with different degrees of sentiment. Table III shows the results of 0.0, 0.5, and 1.0 as inputs respectively on the 256-neuron model during inference. As expected, the lower we set the input, the lower the resultant sentiment scores: this suggests the model is already transforming sentences to different sentiments.

Sentiment score   COH1     COH2    SCL     LM
0.0               -7.381   0.656   0.040   -1.862
0.5               -7.777   0.659   0.443   -2.682
1.0               -9.338   0.639   0.925   -2.178
TABLE III: Evaluation of persona-based model under different sentiment score inputs.

Table V shows examples of how different sentiment score inputs affect the responses generated for the same input sentences.

Example 1
Input: [月亮] 今晚出去走走,享受廣州長假最後的安寧。
       ([Moon] Go out for a walk tonight and enjoy the last night of your long vacation in Guangzhou)
Sentiment score 0.0: 你又要去了? (You’re going again?)
Sentiment score 0.5: 你在哪裏 (Where are you?)
Sentiment score 1.0: 好開心喔 (Sounds like fun)

Example 2
Input: 終於下班了。今晚非常了得 [耶]
       (Finally got off work. Very much tonight [ya])
Sentiment score 0.0: 我也是不想上班了? (I don’t want to go to work either)
Sentiment score 0.5: 我也是剛下班的?? (I just got off work too)
Sentiment score 1.0: 恭喜恭喜 (Congratulations)

TABLE V: Sentences generated by persona-based model under different sentiment scores on Chinese CECG dataset.

VI-C Reinforcement Learning

In this experiment, different coefficient reward combinations were adopted to test the performance of the model. Below we report the results after 2500 training epochs.

In the first three sets of experiments, we fixed $\lambda_2$ (the weight of reward $R_2$) at 0.0 and adjusted the proportion of $\lambda_1$ and $\lambda_3$ (the weights of rewards $R_1$ and $R_3$) to determine how the four metrics were affected. The results in Table VI show that the COH1 score increases as $\lambda_1$ rises from 0.0 to 0.8, and the SCL score falls as $\lambda_3$ decreases from 1.0 to 0.2, which is not surprising as rewards $R_1$, $R_2$, and $R_3$ in Eqs. (1), (2), and (3) parallel the COH1, COH2, and SCL scores respectively. Interestingly, COH2 degrades compared to the pretrained MLE baseline, showing that the RL model is highly goal-oriented: to improve COH1 and SCL, it sacrifices COH2 performance. The result also implies that COH1 and COH2 are nearly mutually independent, as the increase of $\lambda_1$ seems to have little effect on COH2. The LM score, on the other hand, generally improves compared with the typical seq2seq model: this is perhaps attributable to $R_1$, which also takes into account word ordering. In the last set of experiments, we increase $\lambda_2$ to 0.3 to remedy the COH2 score. The result shows that COH1 and COH2 are close to the baseline, while the SCL and LM scores improve.

Model ($\lambda_1$, $\lambda_2$, $\lambda_3$)   COH1     COH2    SCL     LM
Seq2seq (baseline)      -8.6     0.664   0.33    -1.574
RL (0.0, 0.0, 1.0)      -9.518   0.589   0.992   -1.471
RL (0.5, 0.0, 0.5)      -8.813   0.587   0.778   -1.160
RL (0.8, 0.0, 0.2)      -8.419   0.588   0.658   -0.967
RL (0.3, 0.3, 0.4)      -8.840   0.641   0.779   -0.940
TABLE VI: Evaluation of reinforcement learning models with different reward combinations at 2500 iterations on Chinese CECG dataset.

VI-D Plug and Play

A VAE is used for sentence generation in the plug and play model, and KL cost annealing and vocabulary truncation are applied to improve VAE performance. In this experiment, we first compared the performance of models with and without KL cost annealing. Both models truncate the vocabulary randomly with a probability of 0.3. In Table VII, with KL cost annealing, the KL loss of the model increases at first and decreases afterward. This observation shows that the model at first reduces the negative log likelihood rather than the KL loss. This is beneficial to the training process, since the model gives high priority to the VAE reconstruction.

KL cost annealing   Loss         10000    20000    30000   40000   (epochs)
Without             Total loss   28.394   11.163   5.290   3.057
Without             KL loss      0.111    0.043    0.021   0.014
With                Total loss   23.059   9.186    4.352   2.627
With                KL loss      0.132    0.194    0.020   0.037
TABLE VII: Loss of plug and play models with or without KL cost annealing under different epochs on Chinese CECG dataset.
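The two training tricks can be sketched as follows. The sketch assumes a linear annealing schedule for the KL weight and reads "vocabulary truncation" as randomly replacing input tokens with an unknown-word token; the schedule shape, the warm-up length, the `<unk>` token, and all function names are assumptions, since the text does not specify them.

```python
import random

def kl_weight(step, warmup_steps=10000):
    """Linear KL cost annealing: the weight grows from 0 to 1 over warmup_steps, then stays at 1."""
    return min(1.0, step / warmup_steps)

def truncate_vocabulary(tokens, p=0.3, unk="<unk>", rng=None):
    """Replace each token with the unknown-word token with probability p."""
    rng = rng or random.Random()
    return [unk if rng.random() < p else t for t in tokens]

def vae_loss(reconstruction_nll, kl_divergence, step):
    """Total loss: reconstruction term plus the annealed KL term."""
    return reconstruction_nll + kl_weight(step) * kl_divergence

noisy = truncate_vocabulary(["it", "is", "wonderful", "today"], p=0.3, rng=random.Random(0))
loss_mid_warmup = vae_loss(reconstruction_nll=2.0, kl_divergence=0.5, step=5000)
```

Early in training the small KL weight lets the model concentrate on reconstruction, while the random replacement of input words discourages the decoder from relying only on its own language model.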

Table VIII compares three different truncation probabilities: 0.0, 0.3, and 0.7 (0.0 corresponds to no vocabulary truncation). The best is 0.3, and the worst is 0.7. This shows that a small proportion of truncation helps the VAE to depend less on its own language model, whereas a high proportion hinders training.

Vocabulary truncation   Loss         10000    20000    30000   40000   (epochs)
0.0                     Total loss   24.622   11.003   4.474   2.675
0.0                     KL loss      0.176    1.568    0.056   0.092
0.3                     Total loss   23.059   9.186    4.352   2.627
0.3                     KL loss      0.132    0.194    0.020   0.037
0.7                     Total loss   25.361   10.140   4.962   2.830
0.7                     KL loss      0.164    0.084    0.036   0.014
TABLE VIII: Plug and play model loss with different vocabulary truncation proportions under different epochs on Chinese CECG dataset.

In Table IX, we see that the 0.3 output successfully reconstructs the original input. The 0.0 output also seems reasonable, although it fails to reconstruct the original sentence: it replaces “禮貌” (“courtesy”) with “動作” (“action”), probably because it depends more on the decoder language model. With 0.7 truncation, the model outputs an incorrect sentence, replacing “提高” (“improve”) with “以” (“by”); this could be because the many unknown words hinder the model from learning correct sentence grammar. The model with KL cost annealing and with a vocabulary truncation of 0.3 outperforms all others, so we will use this setting in the following experiments.

Original input:              禮貌能提高一個的素質 (Courtesy can improve quality)
Vocabulary truncation 0.0:   動作能提高一個的素質 (Action can improve quality)
Vocabulary truncation 0.3:   禮貌能提高一個的素質 (Courtesy can improve quality)
Vocabulary truncation 0.7:   禮貌能以一個的素質 (Courtesy can be by quality)
TABLE IX: Sentence reconstruction of plug and play models with different vocabulary truncation proportions on Chinese CECG dataset.

Table X shows the results of applying sentiment gradient ascent and descent. Both positive and negative sentiment transfers generate outputs with the corresponding sentiment. However, most sentiment transfers work by inserting or replacing words with a strong sentiment bias. For example, in Table X, “哈哈” (“haha”) is added after a positive transfer (sentiment gradient ascent), and “淘汰” (“eliminate”) is added after a negative transfer (sentiment gradient descent); although each of these words is a strong indicator of sentiment style, it leads to incorrect grammar.

Example 1
Original input:    果然昇仙了大霧… (Sure enough, it was a big fog…)
Original output:   慢走。。 (Walk yourself out..)
Gradient ascent:   期待豁。 (Looking forward to it.)
Gradient descent:  安息。。 (Rest in peace..)

Example 2
Original input:    終於搞完了,接待真不是人幹的活[怒罵] (Finally finished – reception sucks! [roaring])
Original output:   你們這接待搞得人都徹底消失了。 (Your “reception” has driven everyone away.)
Gradient ascent:   哈哈規定地方搞得人都徹底消失了。 (Ha ha, stipulating the location has driven everyone away.)
Gradient descent:  你們這搞找得人都成淘汰了 (You’ve made everyone to eliminated.)

TABLE X: Sentences generated by plug and play models on Chinese CECG dataset.

VI-E CycleGAN

When training the CycleGAN model, heterogeneous input styles are preferred; thus the data used here is re-labeled and selected by the sentiment classifier, leaving only those sentences with a strong style bias. In the CycleGAN training phase, training alternates between the generator and the discriminator. While one of the agents (generator or discriminator) is being trained, the other’s parameters are fixed and it is used only for inference. If one is much better than the other, a training imbalance occurs, and the process is no longer “adversarial”. Since the discriminator is easier to train in most cases, it is generally stronger than the generator, which leads to a poorly performing generator. We evaluate two kinds of generators, G and F, obtained after the generator and the discriminator are trained alternately with different numbers of training epochs. Furthermore, we also tried adding an identity loss (ID loss for short), which is meant to help preserve the content after the transfer, to the total loss when training the generator. The results are shown in Tables XI and XII, for generator-to-discriminator training epoch ratios of 1:1, 3:1, and 1:3, after 10000 and 100000 iterations respectively.
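The alternation between generator and discriminator updates at a given ratio can be sketched as a simple schedule. The generator function below only decides which agent is updated at each step; the actual update code is not shown, and the name `alternating_schedule` is hypothetical.

```python
from itertools import islice

def alternating_schedule(generator_epochs, discriminator_epochs):
    """Yield 'G' or 'D' indefinitely: generator_epochs generator updates
    (discriminator frozen), then discriminator_epochs discriminator updates
    (generator frozen), and so on.
    """
    while True:
        for _ in range(generator_epochs):
            yield "G"
        for _ in range(discriminator_epochs):
            yield "D"

# Ratio 1:3, i.e. one generator epoch followed by three discriminator epochs.
first_eight = list(islice(alternating_schedule(1, 3), 8))
```

With a 1:3 ratio the discriminator receives three quarters of all updates, which gives one intuition for why it can overpower the generator in long training runs.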

The 1:1 result is the most stable. Although the 1:3 model achieves good performance at 10000 iterations, both its generator and discriminator collapse at 100000 iterations. This may be because the generator can no longer deceive the stronger discriminator after a number of training iterations, so continued training leads to collapse. Although the 3:1 model is relatively stable, its generator is not as good as that of the 1:1 model. We also tested whether adding the ID loss brings an improvement: the results show no significant difference from using the generator loss alone as the total loss.

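| Generator type | Generator epochs | Discriminator epochs | COH1 | COH2 | SCL | LM |
|---|---|---|---|---|---|---|
| G: neg to pos | 1 | 1 | -9.071 | 0.665 | 0.683 | -4.171 |
| G: neg to pos | 3 | 1 | -9.145 | 0.668 | 0.602 | -4.326 |
| G: neg to pos | 1 | 3 | -9.154 | 0.675 | 0.679 | -4.237 |
| G: neg to pos | 1 (ID loss) | 1 | -9.064 | 0.667 | 0.659 | -4.193 |
| F: pos to neg | 1 | 1 | -9.215 | 0.666 | 0.279 | -4.427 |
| F: pos to neg | 3 | 1 | -9.229 | 0.667 | 0.254 | -4.369 |
| F: pos to neg | 1 | 3 | -9.244 | 0.664 | 0.203 | -4.285 |
| F: pos to neg | 1 (ID loss) | 1 | -9.22 | 0.664 | 0.25 | -4.426 |

TABLE XI: CycleGAN model under different training epoch combinations at a total of 10000 epochs on the Chinese CECG dataset.
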
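| Generator type | Generator epochs | Discriminator epochs | COH1 | COH2 | SCL | LM |
|---|---|---|---|---|---|---|
| G: neg to pos | 1 | 1 | -9.206 | 0.667 | 0.682 | -4.13 |
| G: neg to pos | 3 | 1 | -9.2 | 0.669 | 0.654 | -4.227 |
| G: neg to pos | 1 | 3 | -17.341 | 0.536 | 0.013 | -8.731 |
| G: neg to pos | 1 (ID loss) | 1 | -9.197 | 0.667 | 0.664 | -4.121 |
| F: pos to neg | 1 | 1 | -9.192 | 0.658 | 0.201 | -4.662 |
| F: pos to neg | 3 | 1 | -9.203 | 0.666 | 0.222 | -4.418 |
| F: pos to neg | 1 | 3 | -17.341 | 0.536 | 0.013 | -8.731 |
| F: pos to neg | 1 (ID loss) | 1 | -9.167 | 0.665 | 0.213 | -4.422 |

TABLE XII: CycleGAN model under different training epoch combinations at a total of 100000 epochs on the Chinese CECG dataset.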

The 1:1 model with 100000 training iterations is the best model; it is compared with the other trained models later. Table XIII contains examples of sentences transformed by generators G and F. In the left example of Table XIII, the un-transformed output is already a positive response, so generator G has little effect on the result; generator F, however, produces an identifiable result with a clearly negative sentence. In the right example of Table XIII, the original output is a negative response; generator G replaces “討厭” (“hate”) with “喜歡” (“like”), making a more positive sentence. Generator F makes little modification to the sentence.

| | Left example | Right example |
|---|---|---|
| Original input | 再見廈門。再見朋友們。。 | 老天爺為什麼要發明洗頭這項運動[淚] |
| (translation) | Goodbye Xiamen. Goodbye friends. | Why did God invent shampooing? [sob] |
| Original output | 廈門人民隨時歡迎你歸來![酷] | 我最討厭的就是洗頭 |
| (translation) | The people of Xiamen welcome you back any time! [cool] | What I hate most is shampooing |
| Generator G | 廈門人民隨時歡迎你歸來![酷酷] | 我最喜歡的就是洗頭 |
| (translation) | The people of Xiamen welcome you back any time! [cool cool] | What I like most is shampooing |
| Generator F | 廈門人民隨時怕你歸來![抓狂] | 我真討厭的就是洗頭 |
| (translation) | The people of Xiamen will always fear your return! [crazy] | What I really hate is shampooing |

TABLE XIII: Sentences generated by the CycleGAN model on Chinese CECG dataset.

VII Comparison of All Models

VII-A Chinese CECG Dataset

Table XIV evaluates the four models using four metrics. The results better than the baselines are in blue, and the ranking of each method is also in the table. The persona-based model (using a sentiment score of 1.0 as input during inference) yields the highest SCL. The RL model yields the best performance on COH1 and LM. The plug and play and CycleGAN models transform the original output sentences directly, resulting in SCL scores that are similar to each other but worse than those of the persona-based and RL models. CycleGAN has a better LM score than plug and play, which shows that CycleGAN does better with respect to grammar.

ModelMetrics Semantic Coh. Sent. Lang.
COH1 COH2 SCL LM
Seq2seq (baseline)
Persona-based
Reinforcement L.
Plug and Play
CycleGAN
TABLE XIV: Evaluation on Chinese CECG dataset. COH1, COH2, SCL, and LM stand for semantic coherence 1, semantic coherence 2, sentiment classifier score, and language model score, respectively. The results better than the baselines are in blue, and the ranking of each method is also in the tables.

VII-B Chinese PTT Dataset

In this section, we switch to the PTT dataset to evaluate the same four models. To train a sentiment classifier, we first tried CNN, GRU-last, and GRU-avg with different word segmentation methods. However, because the noisy PTT dataset resulted in poor sentiment classifier performance, we improved the classifier in three steps. First, we compared CNN, GRU-last, and GRU-avg, and chose GRU-last because it performed best (see Table XV). Second, we used GRU-last to select the more credible data (samples with higher absolute scores), shrinking the data from 1,402,303 to 1,149,000 samples. Finally, we trained a new sentiment classifier, GRU-last-2, on the remaining data. This final sentiment classifier was then used in the four models. GRU-last-2 has the same architecture as GRU-last but is a separately trained model.
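
The data selection in the second step can be sketched as a simple confidence filter. The sketch below assumes the classifier outputs a score in [0, 1] and treats the distance from 0.5 as its confidence; the threshold `min_margin`, the function name `filter_confident`, and the sample sentences and scores are hypothetical, since the exact score definition and cut-off are not given here.

```python
def filter_confident(samples, scores, min_margin=0.3):
    """Keep samples whose classifier score is far from the 0.5 decision boundary.

    Each kept sample is relabeled with the classifier's own decision
    (1 = positive, 0 = negative).
    """
    kept = []
    for text, score in zip(samples, scores):
        if abs(score - 0.5) >= min_margin:
            label = 1 if score > 0.5 else 0
            kept.append((text, label))
    return kept

samples = ["great day", "so-so", "terrible service", "not sure"]
scores = [0.95, 0.55, 0.08, 0.40]  # hypothetical classifier outputs
print(filter_confident(samples, scores))
# [('great day', 1), ('terrible service', 0)]
```

A second classifier trained on the kept samples then sees cleaner labels, at the cost of discarding the ambiguous portion of the corpus.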

Segm. Word-based Character-based
Struct. CNN GRU-last GRU-avg GRU-last-2 CNN GRU-last GRU-avg
Acc
AUC
TABLE XV: Sentiment classifiers of different NN architectures on PTT dataset.

The evaluation results on the PTT dataset are shown in Table XVI. SCL scores are high for all four models. This shows that all four approaches can successfully generate positive responses, although the persona-based and RL models are better than plug and play and CycleGAN. Although the SCL score of the persona-based model is the highest, it obtained the worst COH1 and COH2, which shows that its generated responses are not coherent with the inputs. As on the CECG dataset, the RL model yields the best results in terms of COH1 and LM.

Modelmetrics Semantic coh. Sent. Lang.
COH1 COH2 SCL LM
Seq2seq(baseline)
Persona-based
Reinforcement L.
Plug and play
CycleGAN
TABLE XVI: Model evaluation on PTT dataset.

VII-C English Corpus

The results are listed in Table XVII. The persona-based model has the highest SCL score; however, its COH1, COH2, and LM are the lowest. The RL model performed better than all other models in three of the four metrics (COH1, COH2, and LM), but not in SCL. The SCL scores of the plug and play model and CycleGAN are worse than those of the persona-based and RL models.

Modelmetrics Semantic coh. Sent. Lang.
COH1 COH2 SCL LM
Seq2seq(baseline)
Persona-based
Reinforcement L.
Plug and play
CycleGAN
TABLE XVII: Evaluation of models on English dataset.

VII-D Examples

Table XVIII lists outputs of the four models on the Chinese CECG dataset. The persona-based and RL models usually output reasonable and positive sentences. We also observed that both the persona-based and RL models made complicated changes to the output sentences; this was rarely seen in the other models. The plug and play model tends to generate collapsed sentences. The CycleGAN model typically adjusts sentences by replacing negative words with positive ones; this is not always a significant transformation.

| | Left example | Right example |
|---|---|---|
| Input | 想到一個還不錯的點子,對自己的智商有了信心。 | 要吃水煮魚了哇賽好開心 |
| (translation) | I came up with a pretty good idea and now feel confident in my IQ. | I’m going to eat boiled fish, wow I’m so excited |
| Original output | 也許是錯覺[淚] | 哪家好吃 |
| (translation) | Could be an illusion [sob] | Which restaurant do you recommend? |
| Persona-based | 哈哈,你是不是覺得你的智商是很不錯的 | 哈哈,我也是 |
| (translation) | Haha, you think you have a high IQ? | Haha, me too |
| Reinforcement L. | 贊同。 | 哈哈,我也想去吃呢。 |
| (translation) | I agree. | Haha, I want to get some too. |
| Plug and Play | 波波那。)瞬間 | 好好吃? |
| (translation) | Bobona. ) suddenly | Is it good? |
| CycleGAN | 也許是錯覺[哈哈] | 哪家好吃 |
| (translation) | Could be an illusion [Haha] | Which restaurant do you recommend? |

TABLE XVIII: Sentences generated by four models on Chinese CECG dataset.

VIII Human Evaluation

We performed a subjective human evaluation against sentences generated by five models each on three corpora (CECG, PTT, and English), for a total of 15 results with 30 subjects, all of whom were graduate students. They were asked to answer three questions about the output sentences: (1) Coherence: Is the output sentence a good response to the input? (2) Sentiment: Is the output sentence positive? (3) Grammar: Is the output sentence grammatically correct? They were asked to give scores ranging from to , based on a few reference examples with given scores , , to scale the scores. The average results (normalized to the range from to ) of three models are listed in Tables XIX, XX, and XXI. The results better than the baselines are in blue, and the ranking of each method is also in the tables.

For coherence, all the approaches are worse than the baselines, except RL models on Chinese PTT dataset. Except for the second rank on the English corpus, the RL models worked the best among the four approaches in terms of coherence. For sentiment, all the models can successfully modify the sentiment of responses, except plug and play models, which performed the worst on every corpus. For grammar, the RL models yielded the best performance on each corpus.

Coherence Sentiment Grammar
Seq2seq(baseline)
Persona-based
Reinforcement L.
Plug and play
CycleGAN
TABLE XIX: Human evaluation scores on Chinese CECG dataset for coherence, sentiment, and grammar. Average scores normalized to . The results better than the baselines are in blue, and the ranking of each method is also in the table.
Coherence Sentiment Grammar
Seq2seq(baseline)
Persona-based
Reinforcement L.
Plug and play
CycleGAN
TABLE XX: Human evaluation scores on Chinese PTT dataset for coherence, sentiment, and grammar. Average scores normalized to .
Coherence Sentiment Grammar
Seq2seq(baseline)
Persona-based
Reinforcement L.
Plug and play
CycleGAN
TABLE XXI: Human evaluation scores on English dataset for coherence, sentiment, and grammar. Average scores normalized to .

IX Discussion

The persona-based model and the RL model can both generate responses that differ substantially from those of the original seq2seq model. The persona-based model achieved the highest sentiment score on all the datasets in terms of both machine and human evaluation (the only exception is the human evaluation on Chinese CECG). However, its coherence and grammar were both worse than those of the RL model in terms of both machine and human evaluation. This shows that although the persona-based model can successfully generate very positive responses, the coherence and the grammar of the responses are poor. It tries to output sentences that carry the correct sentiment but are not necessarily relevant to the input. The RL model is generally the best model on the three corpora in all aspects. This is because the rewards in Eqs. (1) and (2) align with coherence, and the reward in Eq. (1) also takes into account word ordering, which leads to correct grammar. Its sentiment score was also high (although not as high as that of the persona-based model) because its reward also aligns with sentiment, which yields positive output. One may argue that the RL model overfits the machine-evaluation metrics since it learns to optimize those metrics. This issue is addressed in Section VIII through human evaluation: humans also judge that the RL model produces reasonable responses with correct grammar.

The plug and play model and CycleGAN only transform the output responses of an off-the-shelf seq2seq model. The plug and play model attempts to modify the latent code of the sentences. Because the sentiment classifier primarily considers sentiment without really encoding sentence grammar, maximizing the classifier's output sometimes transforms the original output of the seq2seq model into collapsed sentences. For CycleGAN, since the two translators directly output word embeddings carrying both sentiment and semantics, the models found mappings between words such as “bad” to “good”, “sorry” to “thank”, and “can’t” to “can”. However, this entails only changes or deletions of specific words and not complex modifications of whole sentences. Since CycleGAN changes only a few words of the original responses, its responses are not far from the original ones. This explains why the grammar of plug and play is worse than that of CycleGAN in terms of both machine and human evaluation in all cases. When it comes to sentiment and coherence, humans always judge CycleGAN to be better than plug and play, whereas their performances on COH1, COH2, and SCL are comparable. Because sentences with poor grammar are difficult for humans to read, humans usually judge the sentiments of the collapsed sentences to be incorrect and the sentences to be less coherent with the inputs.

X Conclusion

In this paper, we attempted to adjust the sentiment of the chatbot response given the input. We investigated four different models for the task, all of which are based on the conventional seq2seq model. The performances of the four models in terms of machine-evaluated metrics and human evaluation are reported. The persona-based and RL models, which alter the original seq2seq model parameters, yield good results. The persona-based model is good at producing sentences with high sentiment scores, which might suit cases in which a chatbot only needs to reply with simple sentences that carry strong sentiment. The RL model generates high-quality sentences, which are likely to prolong conversations. On the other hand, if there is already a running, functional chatbot, and the only thing to do is to transfer its sentiment (or style), then the CycleGAN model might be a better choice than plug and play. As the CycleGAN model primarily performs word mappings on the original response, the output sentence quality is more or less preserved. The plug and play model currently yields poor performance, probably because it is difficult to modify the latent code of a sentence while preserving its semantics and sentence quality.

References

Appendix A Dataset Statistics

Below we present the details of the datasets used in different tasks. For the sentiment classifiers (Section A), there are three corpora: Chinese, PTT, and English. As described in Section VII-B, the corpus used in the PTT task was refined by a special filtering process which reduced its size from 1,402K to 1,149K. In the experiments (Section B), each of the three corpora is shared by four tasks: the persona-based model, the reinforcement learning model, the plug and play model, and the CycleGAN model. For the metrics (Sections C and D), there are four metrics for the Chinese, PTT, and English tasks. The Chinese and PTT tasks share the same metrics, trained on the corpora described in Section C. The WMT11 corpus used for the LM in Section D was pre-processed as described in [chelba2013one]. For word-based segmentation, low-frequency words were eliminated to reduce the word dimensions, so the vocabulary size shown in the tables is smaller than the actual size.
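
The vocabulary reduction mentioned above can be sketched as keeping only the most frequent tokens and mapping everything else to an unknown-word symbol. This is a minimal illustration, not the preprocessing actually used: the `<unk>` token, the function names `build_vocab` and `encode`, and the toy corpus are assumptions.

```python
from collections import Counter

def build_vocab(tokenized_sentences, max_size, unk="<unk>"):
    """Map the (max_size - 1) most frequent tokens to ids; id 0 is reserved for unk."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    vocab = [unk] + [tok for tok, _ in counts.most_common(max_size - 1)]
    return {tok: idx for idx, tok in enumerate(vocab)}

def encode(sentence, vocab, unk="<unk>"):
    """Replace tokens outside the capped vocabulary with the unk id."""
    return [vocab.get(tok, vocab[unk]) for tok in sentence]

# Toy corpus: "a" occurs 3 times, "b" twice, "c" once.
corpus = [["a", "b", "a"], ["a", "c", "b"]]
vocab = build_vocab(corpus, max_size=3)
print(vocab)                       # {'<unk>': 0, 'a': 1, 'b': 2}
print(encode(["a", "c"], vocab))   # [1, 0]
```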

A-A Sentiment Classifiers

| Task | Corpus | Training set | Testing set | Total | Vocab size | Word segment type |
|---|---|---|---|---|---|---|
| Chinese | CECG [CECG] | 16,999K | 1K | 1.7M | 50K | word |
| PTT | PTT Boy-Girl | 1,148K | 1K | 1,149K | 50K | word |
| English | Twitter [pak2010twitter] | 14,999K | 1K | 15M | 50K | word |

A-B Experiments

| Task | Corpus | Training set | Testing set | Total | Vocab size | Word segment type |
|---|---|---|---|---|---|---|
| Chinese | CECG [CECG] | 16,999K | 1K | 1.7M | 50K | word |
| PTT | PTT Boy-Girl | 1,148K | 1K | 1,149K | 50K | word |
| English | Twitter [Marsan-Ma] | 3,672K | 28K | 3.7M | 50K | word |

A-C Metrics – Chinese & PTT

| Task | Corpus | Training set | Testing set | Total | Vocab size | Word segment type |
|---|---|---|---|---|---|---|
| COH1 | PTT [Justin-Yang] | 415K | 1K | 416K | 6,185 | char |
| COH2 | PTT [Justin-Yang] | 415K | 1K | 416K | 6,185 | char |
| SCL | CECG [CECG] | 16,999K | 1K | 1.7M | 50K | word |
| LM | PTT Replies | 24,999K | 1K | 25M | 50K | word |

A-D Metrics – English

| Task | Corpus | Training set | Testing set | Total | Vocab size | Word segment type |
|---|---|---|---|---|---|---|
| COH1 | Twitter [pak2010twitter] | 415K | 1K | 416K | 6,185 | char |
| COH2 | Twitter [pak2010twitter] | 415K | 1K | 416K | 6,185 | char |
| SCL | Twitter [pak2010twitter] | 14,999K | 1K | 15M | 50K | word |
| LM | WMT11 [chelba2013one] | 799,984K (words) | 160K (words) | 0.8B (words) | 790K | word |

Appendix B Hyper-Parameter Selection

Hyper-parameters were first chosen from the Chinese experiments and used for the subsequent PTT and English experiments. However, the epochs chosen for each task differed to ensure the best performance.

B-A Sentiment Classifier

  • unit size: 256

  • layer size: 1

  • batch size: 32

  • max sequence length: 40

  • learning rate: 0.001 (no decay)

  • epochs: 50,000 (Chinese), 50,000 (PTT), 50,000 (English)

  • word embedding dimension: 300

B-B Persona-Based Model

  • unit size of seq2seq model (both encoder and decoder): 256

  • layer size of seq2seq model (both encoder and decoder): 1

  • batch size: 64

  • max sequence length: 15

  • learning rate: 0.001 (no decay)

  • epochs: 100,000 (Chinese), 100,000 (PTT), 100,000 (English)

  • word embedding dimension: 300

B-C Reinforcement Learning Model

  • unit size of seq2seq model (both encoder and decoder): 300

  • layer size of seq2seq model (both encoder and decoder): 4

  • batch size: 64

  • max sequence length: 50

  • learning rate: initialized at 0.5, decayed every 500 iterations by a factor of 0.99

  • epochs for seq2seq model: 100,000 (Chinese), 100,000 (PTT), 100,000 (English)

  • epochs for RL model: 2,000 (Chinese), 1,000 (PTT), 2,000 (English)

  • word embedding dimension: 300

  • coefficients of the three reward terms: 0.3, 0.3, 0.4
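
The learning-rate setting listed above can be read as a step-wise exponential decay. The sketch below assumes a staircase schedule, in which the rate is multiplied by 0.99 once every 500 iterations and held constant in between; the function name `decayed_learning_rate` is hypothetical, and the original training code may have applied the decay differently.

```python
def decayed_learning_rate(iteration, initial=0.5, decay_every=500, decay_factor=0.99):
    """Staircase exponential decay: multiply by decay_factor every decay_every iterations."""
    return initial * decay_factor ** (iteration // decay_every)

for it in (0, 499, 500, 1000, 100_000):
    print(it, decayed_learning_rate(it))
```

Under this reading, the rate after 100,000 iterations is 0.5 × 0.99^200, which is roughly 0.067.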

B-D Plug and Play Model

  • unit size for VAE RNN: 500*2 (bidirectional)

  • unit size of seq2seq model (both encoder and decoder): 500

  • layer size of seq2seq model (both encoder and decoder): 1

  • batch size: 48

  • max sequence length: 15

  • learning rate: 0.001 (no decay)

  • epochs: 40,000 (Chinese), 50,000 (PTT), 40,000 (English)

  • word embedding dimension: 300

  • sentiment gradient weight: 400

  • L2 gradient weight: 25

B-E CycleGAN Model

  • unit size of seq2seq model (both encoder and decoder): 256

  • layer size of seq2seq model (both encoder and decoder): 1

  • batch size: 32

  • max sequence length: 15

  • learning rate: 0.0001 (no decay)

  • epochs: 100,000 (Chinese), 100,000 (PTT), 80,000 (English)

  • word embedding dimension: 300

  • ratio between training iterations of discriminator and of generator: 1:1

B-F Metric – COH1

  • unit size of seq2seq model (both encoder and decoder): 300

  • layer size of seq2seq model (both encoder and decoder): 4

  • batch size: 32

  • max sequence length: 50

  • learning rate: initialized at 0.5, decayed every 500 iterations by a factor of 0.99

  • epochs: 120,000 (Chinese), 150,000 (PTT), 100,000 (English)

  • word embedding dimension: 300

B-G Metric – COH2

  • unit size of seq2seq model (both encoder and decoder): [200,100,100,200]

  • layer size of seq2seq model (both encoder and decoder): 4

  • batch size: 64

  • max sequence length: 30

  • learning rate: initialized at 0.0005, decayed every 5000 iterations by a factor of 0.98

  • epochs: 100,000 (Chinese), 120,000 (PTT), 100,000 (English)

  • word embedding dimension: 300

B-H Metric – SCL

  • unit size of seq2seq model (both encoder and decoder): 300

  • layer size of seq2seq model (both encoder and decoder): 3

  • batch size: 64

  • max sequence length:

  • learning rate: initialized at 0.0005, decayed every 5000 iterations by a factor of 0.98

  • epochs: 50,000 (Chinese), 50,000 (PTT), 50,000 (English)

  • word embedding dimension: 300

B-I Metric – LM

  • unit size: 256

  • layer size: 1

  • batch size: 32

  • max sequence length: 40

  • learning rate: 0.001 (no decay)

  • epochs: 50,000 (Chinese), 75,000 (PTT), 50,000 (English)

  • word embedding dimension: 300