Generating Emotionally Aligned Responses in Dialogues using Affect Control Theory

03/07/2020 ∙ by Nabiha Asghar, et al. ∙ University of Waterloo

State-of-the-art neural dialogue systems excel at syntactic and semantic modelling of language, but often have a hard time establishing emotional alignment with the human interactant during a conversation. In this work, we bring Affect Control Theory (ACT), a socio-mathematical model of emotions for human-human interactions, to the neural dialogue generation setting. ACT makes predictions about how humans respond to emotional stimuli in social situations. Due to this property, ACT and its derivative probabilistic models have been successfully deployed in several applications of Human-Computer Interaction, including empathetic tutoring systems, assistive healthcare devices and two-person social dilemma games. We investigate how ACT can be used to develop affect-aware conversational agents, which produce emotionally aligned responses to prompts and take into consideration the affective identities of the interactants.





In the rapidly evolving field of text-based human-computer interaction (HCI), there is an increasing focus on developing dialogue systems (also known as conversational agents, virtual assistants, or chatbots) that are emotion/affect aware. Affectively cognizant conversational agents have been shown to provide companionship to humans [Prendinger and Ishizuka2005, Catania et al.2019], help improve emotional wellbeing [Ghandeharioun et al.2018], give medical assistance in a more humane way [Malhotra et al.2015], help students learn efficiently [Kort, Reilly, and Picard2001], and assist mental healthcare provision to alleviate bullying [Gordon et al.2019], suicide and depression [Jaques et al.2017, Taylor et al.2017]. The importance of the agent’s affect awareness is obvious in open-domain dialogue (e.g. for entertainment or companionship). In addition, task-oriented settings like customer service can also benefit from virtual agents that are responsive towards implicit or explicit emotion cues from the user, such as expressing dissatisfaction about a product or negotiating price.

Recent breakthroughs in natural language processing (NLP) have significantly advanced the state of the art in emotion-aware text-based dialogue generation. Several neural-network-based affective dialogue models have been explored in the literature [Shen et al.2017, Zhang and Wang2018]. However, most of these systems suffer from one or more of the following limitations. First, they model emotion as a set of discrete categories. This is a restrictive assumption, because humans often experience emotions as a continuum (i.e., a mixture of several feelings of varying intensity) rather than a single emotion of fixed intensity. Second, these studies do not take into account the affective identity of the user during the interaction. For instance, a conversation between two friends would typically be very different from one between two enemies, but this is not accounted for by most modern systems.

To address these limitations, we propose to augment neural dialogue models with Affect Control Theory [Heise2007, ACT], a socio-mathematical model of affect. ACT models the affective/emotional aspects of social interactions between two humans. Given the affective identity of each of the two interactants, ACT prescribes affective actions for them that are mutually aligned towards minimizing conflict. Since ACT is primarily a theory of interactions, it lends itself naturally to the dialogue setting. We augment text-based dialogue agents with the ability to reason about affect using ACT. In doing so, we enable them to perceive human emotions (conveyed through the text) and produce emotionally appropriate textual responses based on an affective context of identities.

ACT and neural dialogue models have fundamentally different representation spaces, thus integrating them is not straightforward. ACT operates in a 3-dimensional continuous affective space, where the basis vectors are Evaluation (E), Potency (P) and Activity (A). On the other hand, actions in a dialogue system are typically sentences that convey some affect as well as one or more propositions. For instance, the sentences “Could you please make me some tea” and “Go make me some tea” convey the same propositional action (asking for tea) but their affect is vastly different. The former can be seen as a request or appeal, whereas the latter is more of a command.

To explore ACT’s viability for text-based dialogue generation, and to deal with the representation-space discrepancy, we propose a neural encoder-decoder dialogue pipeline shown in Figure 1. For a given input sentence/prompt, a sentence-to-EPA (S2EPA) function maps the input to an EPA vector, such that the vector appropriately conveys the affect of the input sentence. ACT is queried with this vector, and produces the response EPA vector. Then an EPA-to-sentence (EPA2S) function maps this response EPA, as well as the input prompt, to an output response, generated word by word, that is semantically relevant to the prompt and conveys the affect of the response EPA. To the best of our knowledge, we are the first to bring ACT to the domain of dialogue generation.

Figure 1: Pipeline to integrate Affect Control Theory (ACT) into a dialogue system. The two components S2EPA and EPA2S are depicted as blackboxes, and are described later in the paper.

Related Work

Most of the early affective dialogue systems were retrieval-based or slot-based, and used hand-crafted speech and text-based features [Callejas, Griol, and López-Cózar2011, Hasegawa et al.2013, Pittermann, Pittermann, and Minker2010]. More recently, with the advent of sophisticated and highly flexible neural network models [Serban et al.2017, Shang, Lu, and Li2015, Sutskever, Vinyals, and Le2014, Vinyals and Le2015], the focus has shifted to building data-driven end-to-end dialogue models. Retrieval-based systems are still popular because they are more controllable, require less training data and are more efficient [Gordon et al.2019, Huang et al.2017, Zhou et al.2018]. However, generative models dominate this space because they generalize well [Rashkin et al.2018, Vadehra2018]. This work falls in the latter category.

A large part of the affective dialogue literature treats emotion as a set of discrete categories, where each category corresponds to a type of biological response. For instance, some studies focus on producing sentiment-appropriate responses, where sentiment refers to positive, negative or neutral emotion [Kong et al.2019, Shi and Yu2018]. Other works use a larger set of discrete emotions [Dryjański et al.2018, Ghosh et al.2017, Zhang and Wang2018, Zhou et al.2017], based on the different psychological theories of emotion [Ekman1992, Plutchik1980]. A recent research trend, propelled by social media growth, is to categorize emotions using the emoji spectrum (an emoji is a symbol of emotional expression, such as a smiling/frowning face, a flower, etc.). This enables model training using massive weakly labelled datasets, e.g., from Twitter [Park2018, Xie et al.2016, Zhou and Wang2018]. For instance, Fung et al. (2018) and Park (2018) train emotion embeddings on tweets with hashtags and emojis as labels. These embeddings can be used downstream in other NLP tasks, such as dialogue systems.

In recent years, several Seq2Seq-based affective conversational models have been proposed. Emotional Chatting Machine [Zhou et al.2017, ECM] takes as input a prompt and the desired emotion category of the response, and produces a response. ECM operates on 8 discrete emotion categories, has an internal memory that encodes how much an emotion has already been expressed, and an external memory that decides whether to choose an emotional or generic (non-emotional) word at a given step during decoding. Dryjański et al. (2018) inject predefined sentiment into a neutral utterance by inferring the phrases and their insertion points. Lubis et al. (2018) jointly train a Seq2Seq model and an emotion encoder. The emotion encoder maintains the emotional context during a conversation, and is trained using the SEMAINE dataset (2000 samples) [McKeown et al.2012] where utterances are labeled on the valence and arousal axes. Asghar et al. (2018) use a continuous, three-dimensional representation of emotions, which is used to augment pretrained word embeddings, training objectives, and beam search inference. Vadehra (2018) trains Seq2Seq with an adversarial objective to remove affect from the learned representation of the input utterance, and generates the response based on this representation and the target affect label (one of seven discrete emotion categories). Rashkin et al. (2018) have released EmpatheticDialogues, a dataset of 25000 conversations grounded in emotional situations to facilitate training and evaluation of dialogue systems. They show that finetuning existing dialogue models on this dataset boosts their affective quality significantly.

Conditional Variational Autoencoders [Sohn, Lee, and Yan2015, CVAEs] have become another popular choice for neural dialogue models. CVAEs have recently been used for affect-controlled dialogue generation, where the model is conditioned on positive-negative-neutral sentiment tags [Shen et al.2017] or more fine-grained emotion categories [Zhang and Wang2018]. Kong et al. (2019) use an adversarial approach for sentiment control which can be applied to CVAEs too.

In this work, we follow [Asghar et al.2018] and use a continuous, three-dimensional representation of emotions. The three dimensions are Evaluation, Potency and Activity, and have been validated by several pioneering research studies in psychology [Osgood, May, and Miron1975, Russell and Mehrabian1977, Heise1979, Russell2003]. Intuitively, a continuous and multi-dimensional representation of emotions makes sense; as humans we experience emotions as a mixture of several feelings of varying intensity, rather than a single emotion of fixed intensity. Moreover, continuous emotion vectors fit well with dialogue models that are trained end-to-end. Using this 3D representation of emotions, we propose 1) a Seq2Seq-based model inspired by [Asghar et al.2018], and 2) a CVAE-based model; it is inspired by [Shen et al.2017] and [Zhang and Wang2018], but leverages Affect Control Theory as an external model of affect for conditional response generation.

Affect Control Theory

Affect Control Theory (ACT) arises from work on the psychology and sociology of human social interaction [Heise2007]. ACT proposes that social perceptions, behaviours, and emotions are guided by a psychological need to minimize the differences between culturally shared fundamental sentiments about social situations and the transient impressions resulting from the interactions between elements within those situations. Fundamental sentiments, $f$, are representations of social objects, such as interactants’ identities and behaviours, as vectors in a 3D affective space, hypothesised to be a universal organising principle of human socio-emotional experience [Osgood, May, and Miron1975]. The basis vectors of affective space are called Evaluation/valence, Potency/control, and Activity/arousal (EPA). EPA profiles of concepts can be measured with the semantic differential, a survey technique where respondents rate affective meanings of concepts on numerical scales with opposing adjectives at each end (e.g., good, nice vs. bad, awful for E, weak, little vs. strong, big for P, and calm, passive vs. exciting, active for A). Affect control theorists have compiled lexicons of a few thousand words along with average EPA ratings obtained from survey participants who are knowledgeable about their culture [Heise2010]. For example, most English speakers agree that professors are about as nice as students (E), more powerful (P) and less active (A), and the EPA values for professor and student in the lexicon reflect this ordering. Unless otherwise noted, all EPA labels and values in the paper are taken from the Indiana 2002-2004 ACT lexicon [Heise2010]; the range of these values is fixed by historical convention. In Japan, professor has the same P but students are seen as less powerful (values taken from the Japan 1989-2002 dataset [Smith et al.2006]).

Social events cause transient impressions, $\tau$ (also three-dimensional in EPA space), of identities and behaviours that may deviate from their corresponding fundamental sentiments, $f$. ACT models this formation of impressions from events presented as actor-behaviour-object triples. Consider, for example, a professor (actor) who yells (behaviour) at a student (object). Most would agree that this professor appears considerably less nice (E), a bit less potent (P), and certainly more aroused (A) than the cultural average of a professor. Such transient shifts in affective meaning caused by specific events are described with models of the form $\tau' = M\,\mathcal{G}(f', \tau)$, where $M$ is a matrix of statistically estimated prediction coefficients from empirical impression-formation studies and $\mathcal{G}$ is a vector of polynomial features in $f'$ and $\tau$. In ACT, the weighted sum of squared Euclidean distances between fundamental sentiments and transient impressions is called deflection, $D$, and is hypothesised to correspond to an aversive state of mind that humans seek to avoid. This affect control principle allows ACT to compute prescriptive actions for humans: those that minimize the deflection. Emotions in ACT are computed as a function of the difference between fundamentals and transients [Heise2007], and are thought to be communicative signals of vector deflection that help maintain alignment between cooperative agents.
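To make the notion of deflection concrete, the following minimal sketch computes a weighted sum of squared differences between a fundamental sentiment and a transient impression. The function name, the uniform default weights, and the example vectors are assumptions made for illustration; they are not taken from ACT's empirically estimated equations or from any ACT lexicon.

```python
from typing import Optional, Sequence


def deflection(
    fundamental: Sequence[float],
    transient: Sequence[float],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted sum of squared differences between two affective vectors."""
    if len(fundamental) != len(transient):
        raise ValueError("fundamental and transient must have the same length")
    if weights is None:
        # Assumption: uniform weights when none are supplied.
        weights = [1.0] * len(fundamental)
    if len(weights) != len(fundamental):
        raise ValueError("weights must match the vector length")
    return sum(w * (f - t) ** 2 for w, f, t in zip(weights, fundamental, transient))


# Hypothetical EPA vectors, chosen only to exercise the function.
fundamental_epa = [1.0, 1.0, 0.0]
transient_epa = [0.0, 1.0, 0.0]
print(deflection(fundamental_epa, transient_epa))
```

Under the affect control principle described above, an action that yields a smaller value of this quantity would be preferred over one that yields a larger value.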

For two given identities of the actors (two EPA vectors) and an initial EPA action by one actor, ACT predicts the optimal response for the second actor through prediction equations. For example, let the two identities be friend and enemy. Note that these two identities do not normally “get along”, and therefore we expect that each has an incorrect view of the other’s identity (they are actually both enemies). ACT’s reidentification can provide these two with more appropriate identities after a few interactions. Let the initial action by friend be greet. Then, at time step 1, we query ACT with the event (friend, greet, enemy). The output is the optimal action enemy should take. In this case, it is belittle. At the next time step, we can query ACT with the event (enemy, belittle, friend) to predict the optimal action to be taken by friend (in this case welcome), or we can use a suggestion for reidentification of the friend as a klutz. In this way, ACT can be queried sequentially to carry out long interactions.

ACT’s predictions can be explored through computer simulations, via freely available software called INTERACT.

Proposed Model

An overview of the proposed ACT conversational model is shown in Figure 1. ACT is instantiated with two affective identities, one each for the human participant and the artificial agent. Given an input prompt (a sentence), a sentence-to-EPA (S2EPA) function maps the prompt to an EPA vector. This EPA vector acts as the affective action for one of the interactants in ACT, and ACT produces the EPA vector of the affective action taken by the other participant. This target EPA vector, together with the prompt, is used to generate a response sentence using an EPA-to-sentence (EPA2S) function. This response can be treated as the next prompt in the conversation, and the process continues.
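The loop described above can be sketched as follows. This is only an illustration of the control flow: `s2epa`, `act_query` and `epa2s` are hypothetical callables standing in for the three components, and the toy stubs at the bottom are placeholders, not the models used in this work. The swap of actor and object at each turn mirrors the sequential querying of ACT described earlier.

```python
from typing import Callable, List, Tuple

EPA = Tuple[float, float, float]


def converse(
    first_prompt: str,
    identities: Tuple[str, str],
    s2epa: Callable[[str], EPA],
    act_query: Callable[[str, EPA, str], EPA],
    epa2s: Callable[[str, EPA], str],
    turns: int,
) -> List[Tuple[str, str]]:
    """Run the prompt -> EPA -> ACT -> response loop for a fixed number of turns."""
    actor, obj = identities
    prompt = first_prompt
    transcript = []
    for _ in range(turns):
        action_epa = s2epa(prompt)                    # affect of the prompt
        target_epa = act_query(actor, action_epa, obj)  # affect of the reply
        response = epa2s(prompt, target_epa)          # reply text
        transcript.append((prompt, response))
        prompt = response                             # reply becomes next prompt
        actor, obj = obj, actor                       # the other party acts next
    return transcript


# Toy stand-ins (assumptions) so that the sketch runs end to end.
def toy_s2epa(sentence: str) -> EPA:
    return (0.0, 0.0, 0.0)


def toy_act_query(actor: str, action: EPA, obj: str) -> EPA:
    return action


def toy_epa2s(prompt: str, target: EPA) -> str:
    return "reply to: " + prompt


print(converse("hi", ("friend", "friend"), toy_s2epa, toy_act_query, toy_epa2s, 2))
```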

We propose the following strategies to build the S2EPA and EPA2S functions.

  1. S2EPA: To map sentences to the 3-dimensional EPA space, we modify the output of a pretrained and publicly available BiLSTM network called DeepMoji [Felbo et al.2017], which produces a probability distribution over a set of 64 emojis given an input sentence. We achieve this by manually labeling the 64 emojis with EPA vectors, and taking a weighted average (using the softmax probabilities) of these vectors. Here we assume that translating a sentence into a semantic behaviour label (e.g. translating “Go make me some tea” into the label “command” and thereby to a negative and powerful EPA) yields the same EPA as translating the sentence into an emoji (e.g. “Go make me some tea” into an angry emoji, and thereby to the same negative and powerful EPA). This is shown in Figure 2.

  2. EPA2S: To generate a sentence given an input prompt and a target EPA vector, we explore two models: traditional Seq2Seq [Sutskever, Vinyals, and Le2014] with attention, and a conditional variational autoencoder (CVAE) [Sohn, Lee, and Yan2015]. In Seq2Seq, the target EPA and input prompt are passed through the encoder together to produce a fixed-length context vector. This context is passed through the decoder to generate a response. On the other hand, the CVAE model encodes the input into a Gaussian latent space. A sample from this latent space is then propagated through a decoder to generate an appropriate response.

We now describe each of S2EPA and EPA2S in detail.

Figure 2: S2EPA: A pretrained BiLSTM network with attention [Felbo et al.2017], tweaked to produce EPA vectors instead of emojis.

Sentence to EPA (S2EPA)

The goal of the S2EPA function is to generate an EPA representation of a given sentence. If we had access to a large number of sentences labeled with EPAs, we could simply train a recurrent neural network to approximate the sentence-to-EPA mapping. However, building such a dataset is time-consuming and expensive. The other option is to use the word-level EPA values, but then semantic understanding of the sentence is required. That is, a sentence needs to be parsed into actor-behaviour-object triples [Alhothali and Hoey2017]. To get around this issue, we use a pretrained, publicly available sentence-to-emoji model and tweak its output to suit our needs.

Concretely, we use DeepMoji, a pretrained and publicly available BiLSTM network with attention [Felbo et al.2017]. This model is trained on a dataset of 1.2 billion tweets labeled with emojis. Given an input sentence, the model produces a probability distribution over 64 emojis. We use this model to our advantage as follows. We ask two human annotators to label these 64 emojis with EPA vectors. We average these annotations to assign a single EPA vector to each emoji. Then, given an input query, we take the weighted average of the 64 EPA vectors, where the weights are produced by the softmax layer. This gives us the desired sentence-to-EPA mapping. The architecture of S2EPA is shown in Figure 2.
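The weighted-average step can be sketched as below. The sketch assumes the emoji probabilities have already been obtained from a sentence-to-emoji model; the function name and the two-emoji example (rather than 64) are illustrative assumptions, and the EPA values are made up rather than taken from the annotators' labels.

```python
from typing import Sequence, Tuple

EPA = Tuple[float, float, float]


def epa_from_emoji_probs(probs: Sequence[float], emoji_epas: Sequence[EPA]) -> EPA:
    """Probability-weighted average of per-emoji EPA vectors."""
    if len(probs) != len(emoji_epas):
        raise ValueError("need exactly one EPA vector per emoji probability")
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    # Dividing by the total keeps this a true weighted average even if the
    # probabilities are not perfectly normalised.
    e, p, a = (
        sum(w * epa[dim] for w, epa in zip(probs, emoji_epas)) / total
        for dim in range(3)
    )
    return (e, p, a)


# Hypothetical example with two emojis instead of 64.
toy_probs = [0.5, 0.5]
toy_emoji_epas = [(2.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
print(epa_from_emoji_probs(toy_probs, toy_emoji_epas))
```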

EPA to Sentence (EPA2S)

The goal of the EPA2S function is to generate a response sentence, given the input prompt and a target EPA vector, such that the response conveys the same affect as the target EPA. To build EPA2S, we explore two methods, Seq2Seq and CVAE.


One straightforward model for EPA2S is Seq2Seq with attention, where the input sentence is concatenated with the target EPA and passed into the encoder. This produces a fixed-length context vector. Given this context vector, the decoder sequentially produces the response while attending to the encoder’s hidden states.


CVAE is another viable model for EPA2S. Let $(x, y, e)$ denote a training sample, where $x$ and $y$ are sequences of tokens denoting the prompt and the response respectively, and $e$ is an EPA vector denoting the desired affect of the response. The CVAE consists of a context encoder, an utterance encoder and a decoder. The context encoder uses an RNN to map $(x, e)$ to a fixed-length vector, and then passes this vector to an MLP, which outputs the parameters of the probability distribution $p_\theta(z \mid x, e)$; this distribution is called the prior. Similarly, the utterance encoder uses an RNN to map $y$ to a fixed-length vector, and an MLP then outputs the parameters of the probability distribution $q_\phi(z \mid x, y, e)$. This is the posterior. A latent vector $z$ is then sampled from $q_\phi(z \mid x, y, e)$. The decoder RNN parameterizes the distribution $p_\psi(y \mid z, x, e)$; it takes $(z, x, e)$ as input and produces a distribution over the response sequences. The CVAE objective is to maximize the reconstruction probability of $y$, and minimize the KL divergence between the prior and the posterior. This is given by

$$\mathcal{L}(\theta, \phi, \psi; x, y, e) = \mathbb{E}_{q_\phi(z \mid x, y, e)}\left[\log p_\psi(y \mid z, x, e)\right] - \mathrm{KL}\left(q_\phi(z \mid x, y, e) \,\|\, p_\theta(z \mid x, e)\right),$$

where $\theta$, $\phi$ and $\psi$ denote the parameters of the context encoder, the utterance encoder, and the decoder respectively. This training process is depicted in Figure 3.

For inference, the goal is to generate a response given an input sentence $x$ and a target EPA $e$. $(x, e)$ are passed through the context encoder, and a latent variable $z$ is sampled from $p_\theta(z \mid x, e)$. Then $(z, x, e)$ are passed to the decoder to generate a response. This process is depicted in Figure 4.

Figure 3: CVAE training architecture.
Figure 4: CVAE at inference time: this is the EPA2S function.


Training, Data and Setup

The Seq2Seq model with attention contains a single-layer BiGRU network as the encoder, and a single-layer GRU network as the decoder, each layer containing 300 cells. For the CVAE model, each encoder contains 1) a single-layer BiGRU, each direction containing 300 GRU cells, and 2) a two-layer MLP. The CVAE decoder is a single-layer GRU network of 300 cells. The latent variable $z$ and the EPA vector $e$ are 300-dimensional and 3-dimensional respectively.

For both models, the vocabulary size is fixed to 24000 and the embedding layer is initialized with 300-dimensional GloVe embeddings [Pennington, Socher, and Manning2014]. The models are implemented in PyTorch 0.4 and optimized with Adam [Kingma and Ba2015], with a specified initial learning rate and otherwise default parameters. For training, we use the Cornell Movie Corpus [Danescu-Niculescu-Mizil and Lee2011], which contains 220k prompt-response pairs from movie conversations. We split the data into 200k, 10k and 10k samples for training, validation and testing. For a given pair of ACT identities $i_1$ and $i_2$, we construct the training data as follows. For each conversation in the Cornell corpus, we assume that the two identities say the utterances alternately. Then, for each training sample $(x, y)$ in the corpus, we query ACT with the event ($i_1$, S2EPA($x$), $i_2$) or ($i_2$, S2EPA($x$), $i_1$) depending on who uttered $x$. This gives us the optimal response EPA, $e$. We add this target EPA vector to the training sample to obtain the triple $(x, y, e)$.
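The data construction described above can be sketched as follows. The sketch assumes that consecutive utterances in a conversation form prompt-response pairs and that the first identity speaks first; `s2epa` and `act_query` are hypothetical callables standing in for S2EPA and the ACT query, and the toy stubs exist only so the example runs.

```python
from typing import Callable, List, Sequence, Tuple

EPA = Tuple[float, float, float]


def build_training_triples(
    conversation: Sequence[str],
    identities: Tuple[str, str],
    s2epa: Callable[[str], EPA],
    act_query: Callable[[str, EPA, str], EPA],
) -> List[Tuple[str, str, EPA]]:
    """Attach an ACT target EPA to each (prompt, response) pair of a conversation."""
    triples = []
    for turn, (prompt, response) in enumerate(zip(conversation, conversation[1:])):
        # Assumption: the two identities alternate, the first one speaking first.
        speaker = identities[turn % 2]
        listener = identities[(turn + 1) % 2]
        target_epa = act_query(speaker, s2epa(prompt), listener)
        triples.append((prompt, response, target_epa))
    return triples


# Toy stand-ins (assumptions), not the real S2EPA model or ACT equations.
def toy_s2epa(sentence: str) -> EPA:
    return (float(len(sentence)), 0.0, 0.0)


def toy_act_query(actor: str, action: EPA, obj: str) -> EPA:
    # Negate the action when the actor is "enemy", purely for illustration.
    sign = -1.0 if actor == "enemy" else 1.0
    return (sign * action[0], sign * action[1], sign * action[2])


for triple in build_training_triples(["a", "bb", "ccc"], ("friend", "enemy"),
                                     toy_s2epa, toy_act_query):
    print(triple)
```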

For the CVAE model, we follow [Kingma and Welling2013]; we compute the reconstruction loss with a single sample from the posterior, and compute the KL divergence in closed form. Furthermore, to prevent the degenerate case where the KL divergence is equal to zero, we use KL annealing, following [Bowman et al.2016]. Degeneracy occurs when the network sets the posterior to be equal to the prior, implying that the network ignores the latent variable. This is sometimes referred to as the vanishing latent variable problem. KL annealing circumvents this issue by adding a weight to the KL term during training. In the beginning, this weight is zero, so the network encodes useful information in the latent variable without worrying about staying close to the prior. As training progresses, the weight is slowly increased until it reaches one.
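As an illustration of these two ingredients, the sketch below gives the closed-form KL divergence between two Gaussians and a simple annealing weight. It assumes diagonal covariances parameterised by log-variances and a linear warm-up schedule; the text above only states that the weight starts at zero and slowly increases to one, so the linear shape and the `warmup_steps` parameter are assumptions.

```python
import math
from typing import Sequence


def gaussian_kl(
    mu_q: Sequence[float],
    logvar_q: Sequence[float],
    mu_p: Sequence[float],
    logvar_p: Sequence[float],
) -> float:
    """KL(q || p) for diagonal Gaussians given means and log-variances."""
    return 0.5 * sum(
        lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0
        for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p)
    )


def kl_weight(step: int, warmup_steps: int) -> float:
    """Linear KL annealing weight: 0 at step 0, 1 from warmup_steps onwards."""
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, step / warmup_steps)


def annealed_loss(reconstruction_nll: float, kl: float, step: int, warmup_steps: int) -> float:
    """Negative reconstruction log-likelihood plus the annealed KL term."""
    return reconstruction_nll + kl_weight(step, warmup_steps) * kl


# Hypothetical numbers, only to show how the weight changes the loss.
kl = gaussian_kl([1.0], [0.0], [0.0], [0.0])
for step in (0, 50, 100):
    print(step, annealed_loss(2.0, kl, step, warmup_steps=100))
```

With a weight of zero the loss reduces to the reconstruction term alone, which is what allows the latent variable to carry information early in training.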


Existing automated dialogue evaluation metrics are not suitable for assessing the quality of an open-domain and affective conversational model [Asghar et al.2018, Liu et al.2016]. It is also unclear how to evaluate affective aspects by automated metrics. Therefore, we recruit human judges to evaluate the proposed models, following previous studies [Mou et al.2016, Shang, Lu, and Li2015].

We carry out the following three experiments.

Experiment # 1

First, we assess the quality of EPA vectors produced by the S2EPA model. Some example sentences from the Cornell test set are shown in Table 1, along with their EPA predictions produced by S2EPA. We also include the closest word labels for each EPA from the ACT lexicon of behaviours.

| Sentence | EPA | Closest ACT Labels |
|---|---|---|
| i think i am in love | [1.60, 0.95, 0.55] | caution, collaborate with |
| i hate you | [-1.63, 0.85, 0.49] | malign, injure |
| i have no fear of failure | [0.64, 1.27, 0.80] | train, confront |
| what the hell are you doing? | [-1.64, 0.41, 1.39] | badger, club |
| he’s determined, unstoppable | [0.66, 1.87, 1.45] | apprehend, challenge |
| what do i do for fun? | [-0.35, -0.21, -0.04] | poke, gawk at |
| will you have dinner with me? | [0.91, 0.45, 0.79] | concur with, jest with |
| please don’t talk with food in your mouth | [-0.82, 0.10, -0.64] | defer to, monitor |
| i insist on being told exactly what you have in mind | [0.06, 0.03, 0.13] | joggle, beckon to |
| you go ahead and relax, i’ll cook | [0.95, 0.32, 0.47] | pay for, concur with |
| i’ve been thinking about you | [1.59, 1.12, 0.66] | caution, collaborate with |
| you are despicable | [-1.74, 0.86, 0.94] | kick, club |
| i quit. | [-0.1, 0.89, 0.17] | search, smirk at |
| how about a drink? | [0.60, 0.42, 1.06] | query, jest with |
| there is nothing for me here anymore | [-0.56, 0.30, 0.14] | flee, sound out |

Table 1: Examples of EPA vectors (and their closest word labels in ACT) produced for input sentences by S2EPA.
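One way to look up the closest labels for a predicted EPA vector is a nearest-neighbour search over the lexicon. The sketch below assumes Euclidean distance, which the text does not specify, and uses a small hypothetical lexicon built from behaviour labels and EPA values that appear in Table 3 of this paper rather than the full ACT lexicon of behaviours.

```python
from typing import Dict, List, Tuple

EPA = Tuple[float, float, float]

# Small illustrative subset; the real lexicon has many more entries.
MINI_LEXICON: Dict[str, EPA] = {
    "calm": (1.71, 1.39, -0.90),
    "criticize": (-0.50, 0.72, 0.81),
    "hide from": (-0.83, -0.93, 0.44),
    "agree with": (0.98, 0.38, 0.02),
    "laugh at": (-1.39, -0.47, 2.15),
    "ignore": (-1.53, -0.20, -0.19),
}


def squared_distance(a: EPA, b: EPA) -> float:
    """Squared Euclidean distance between two EPA vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def closest_labels(epa: EPA, lexicon: Dict[str, EPA], k: int = 2) -> List[str]:
    """Return the k lexicon labels whose EPA vectors are nearest to `epa`."""
    ranked = sorted(lexicon, key=lambda label: squared_distance(epa, lexicon[label]))
    return ranked[:k]


print(closest_labels((1.0, 0.4, 0.0), MINI_LEXICON, k=1))
```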

We note that the model’s EPA predictions are generally appropriate, and in many cases they are in alignment with the ACT behaviour labels. For instance, ‘i think i am in love’ is fairly positive due to the presence of the word love; it is moderately potent and slightly active because of the phrase i think. The closest labels in the ACT lexicon are caution and collaborate with. Among these, caution seems to describe the input well. A similar phenomenon is seen for the input ‘i hate you’, whose EPA prediction closely matches the ACT labels malign, injure. An interesting case is ‘i have no fear of failure’: it has two negative and strong words fear and failure. Yet, the model correctly predicts that the overall sentiment of the response is positive and powerful, and is described well by the label confront.

We also see some negative examples. The E value of ‘i quit’ is -0.1, but it should be much more negative. The closest ACT labels search and smirk at don’t make sense either. Similarly, the input ‘i’ve been thinking about you’ is composed of fairly neutral individual words; however the model correctly predicts that overall the sentence is positive, moderately potent and slightly active. On the other hand, its ACT labels caution and collaborate with don’t seem altogether appropriate.

To quantify the success rate of alignment between sentences and S2EPA’s predictions, we take a test set of 100 sentences (from the Cornell corpus, as before) and ask 3 human judges to classify the EPA predictions as ‘good’ (1) or ‘bad’ (0). Averaging these ratings yields a score of 0.739, indicating that 73.9% of the EPA predictions (on the test set) are in alignment with the input sentence affect. Thus, overall the S2EPA function works reasonably well.

Experiment # 2

Next, we analyse the affective quality of responses produced by the two variants of the EPA2S model: Seq2Seq and CVAE.

First, we present the outputs of the traditional Seq2Seq model (without the EPA vectors as input). Table 2 shows its predictions on a small set of inputs chosen from the Cornell test set. We observe that the responses are grammatically well-formed but generally short and vague; this is a known shortcoming of naive Seq2Seq models.

| Prompt | Traditional Seq2Seq (baseline) |
|---|---|
| i hate you | that is okay |
| i think i’m in love | who is it |
| i have no fear of failure | i have it in me |
| how about a drink? | sure why not |
| i’ve been thinking about you | that makes it okay |
| there is nothing for me here anymore | so what do you want |
| you are despicable | so are you |
| what the hell are you doing? | i will get it for you |
| i quit | i don’t think so |
| please don’t talk with food in your mouth | sure if you want to |
| i insist on being told exactly what you have in mind | i have it all |

Table 2: The outputs of traditional Seq2Seq with attention, without EPA labels.

Next, we add an EPA vector to each training sample; in this experiment the target EPA is chosen manually from the ACT lexicon. Table 3 shows the input prompts, target EPA vectors and the corresponding text generated by EPA2S-Seq2Seq and EPA2S-CVAE.

| Line # | Prompt | Target EPA (Manually Chosen) | EPA2S-Seq2Seq | EPA2S-CVAE |
|---|---|---|---|---|
| 1 | i hate you | [1.71,1.39,-0.90] (calm) | you know me | what do you want |
| 2 | i hate you | [-0.50,0.72,0.81] (criticize) | okay | man can you scream |
| 3 | i hate you | [-0.83,-0.93,0.44] (hide from) | you write a proper part for me | i feel so tired |
| 4 | i think i’m in love | [0.98,0.38,0.02] (agree with) | who is it | i don’t really know you |
| 5 | i think i’m in love | [-1.39,-0.47,2.15] (laugh at) | wait up please | yeah but don’t make any noise |
| 6 | i think i’m in love | [-1.53,-0.20,-0.19] (ignore) | i don’t think so | we should find a leader to fight |
| 7 | i have no fear of failure | [2.14,1.21,-0.17] (appreciate) | yes | i believe it when you say |
| 8 | i have no fear of failure | [-1.61,0.66,1.25] (antagonize) | i don’t know | i need to leave early tomorrow |
| 9 | i have no fear of failure | [1.90,0.82,-0.11] (smile at) | what do you say | i know you, unk |
| 10 | how about a drink? | [0.98,0.38,0.02] (agree with) | sure that’s nice | let me see what i can do about you |
| 11 | how about a drink? | [-1.05,-0.69,0.33] (avoid) | i’m sorry i can’t | there is something on the clouds |
| 12 | how about a drink? | [1.18,1.47,0.20] (charm) | how long have you been awake | i’m going with you baby |
| 13 | there is nothing for me here anymore | [2.12,1.12,-0.81] (comfort) | yeah you know me | it is better this way |
| 14 | there is nothing for me here anymore | [1.64,1.17,0.47] (encourage) | no it is | it’s not too late to try |
| 15 | there is nothing for me here anymore | [1.27,1.14,1.44] (entertain) | not now | you need to calm down |
| 16 | you are despicable | [-1.53,-0.20,-0.19] (ignore) | that is okay | man can you scream |
| 17 | you are despicable | [-0.83,-0.93,0.44] (hide from) | i will not go to him | you can show the way |
| 18 | you are despicable | [1.71,1.39,-0.90] (calm) | i am your friend | these are great times we live in |
| 19 | what the hell are you doing? | [2.12,1.12,-0.81] (comfort) | i can go right now | i am singing for her |
| 20 | what the hell are you doing? | [-1.05,-0.69,0.33] (avoid) | i can ask you | what do you want |
| 21 | what the hell are you doing? | [1.27,1.14,1.44] (entertain) | what do you think? | i won’t mind a shower |
| 22 | i quit. | [2.12,1.12,-0.81] (comfort) | i don’t think so | you can do better than me |
| 23 | i quit. | [1.64,1.17,0.47] (encourage) | you know me | it’s not too late to try |
| 24 | i quit. | [-0.50,0.72,0.81] (criticize) | not now | who can help you |
| 25 | please don’t talk with food in your mouth | [-1.39,-0.47,2.15] (laugh at) | i can do it | what do you want |
| 26 | please don’t talk with food in your mouth | [-1.53,-0.20,-0.19] (ignore) | okay | so this is how it is |
| 27 | please don’t talk with food in your mouth | [1.18,1.47,0.20] (charm) | okay i will | that’s okay for you |
| 28 | i insist on being told exactly what you have in mind | [2.12,1.12,-0.81] (comfort) | i can help you with that | i won’t mind some baby |
| 29 | i insist on being told exactly what you have in mind | [1.64,1.17,0.47] (encourage) | i am your friend | you can take it off |
| 30 | i insist on being told exactly what you have in mind | [1.27,1.14,1.44] (entertain) | excited for you | please take it back |

Table 3: Example outputs generated by EPA2S for a given input sentence and EPA vector.

Similar to the Seq2Seq baseline, we see short and non-committal responses by EPA2S-Seq2Seq. As far as their quality and relevance are concerned, we see some positive examples (Lines 1, 4, 6, 7, 10, 11, 14, 17, 18, 19, 23, 26, 28, 30) where the output sentences are well-aligned with both the input prompt and the target EPA; the rest of the examples show output that is syntactically coherent but does not align well with the prompt, the target EPA, or both. For instance, in Line 2, ‘okay’ is a valid response to ‘i hate you’, but it does not correspond to criticizing. Similarly, in Line 5, the response ‘wait up please’ is not relevant to the input ‘i think i’m in love’ or the target affect of laugh at. Overall, the results are fairly evenly divided between positive and negative examples.

We see similar results for EPA2S-CVAE. There are some positive examples (Lines 2, 3, 7, 12, 13, 14, 16, 18–23, 26). On the other hand we see several outputs that are contextually relevant but affectively misaligned (Lines 1, 4, 5, 15, 24, 30). The responses are generally longer and less vague than baseline Seq2Seq and EPA2S-Seq2Seq.

To quantify the performance of the two EPA2S variants, we set up the following experiment. Given a test set of 100 sentences and their target EPA vectors, we ask 3 human judges to specify whether each predicted response aligns with the input sentence, the target EPA vector, both, or neither. The results are presented in Table 4. Overall, the results are roughly evenly distributed across the four classes. Strictly speaking, the success rate (alignment with both the input and the target EPA vector) is 23.1% for EPA2S-Seq2Seq and 27.6% for EPA2S-CVAE.

Metric EPA2S-Seq2Seq EPA2S-CVAE
% Alignment with both the input and the target EPA vector 23.1 27.6
% Alignment with only 25.5 22.0
% Alignment with only 22.6 20.7
% Alignment with neither the input nor the target EPA vector 28.8 29.7
Table 4: Evaluating the two EPA2S variants.
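The percentages in Table 4 come from tallying the judges' labels over the four classes. A minimal sketch of such a tally follows; the label names are hypothetical, the data is a toy example rather than the study's annotations, and pooling all judges' labels into one list is an assumption, since the text does not say how the three judges were aggregated.

```python
from collections import Counter

# Hypothetical names for the four outcomes a judge can choose from.
CLASSES = ("both", "input_only", "target_only", "neither")


def alignment_percentages(judgments):
    """Return the percentage of judgments that fall into each class.

    `judgments` is a flat list of labels, one per (judge, response) pair.
    Pooling all judges into a single list is an assumption of this sketch.
    """
    if not judgments:
        raise ValueError("no judgments given")
    counts = Counter(judgments)
    unknown = set(counts) - set(CLASSES)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    total = len(judgments)
    return {label: 100.0 * counts[label] / total for label in CLASSES}


# Toy data (not the paper's): 12 pooled judgments.
toy = ["both"] * 3 + ["input_only"] * 3 + ["target_only"] * 2 + ["neither"] * 4
result = alignment_percentages(toy)
print({label: round(value, 1) for label, value in result.items()})
```

Only the "both" class counts as a success in the strict sense used in the text.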
Model Syntactic Coherence Naturalness Emotional Appropriateness
Traditional Seq2Seq (baseline) 1.48 0.69 0.41
ACT with S2EPA & EPA2S-Seq2Seq (friend-friend) 1.59 0.73 0.39
ACT with S2EPA & EPA2S-CVAE (friend-friend) 1.57 0.68 0.47
ACT with S2EPA & EPA2S-Seq2Seq (enemy-enemy) 1.54 0.82 0.49
ACT with S2EPA & EPA2S-CVAE (enemy-enemy) 1.55 0.73 0.59
Table 5: Comparing the different ACT conversation models. Arrows provide statistical significance of results. An up arrow indicates that the model’s score is significantly better than the baseline at the 0.05 significance level; a down arrow indicates that it is not significantly better than the baseline.
Line Prompt Target EPA (ACT) and Closest ACT Labels Defl. EPA2S-Seq2Seq EPA2S-CVAE
1 i hate you [2.52, 2.52, -0.41] (care for, caress) 17.09 that’s not the point you must be tired now
2 i think i’m in love [3.13, 1.70, 1.39] (thank, kiss) 1.84 i’m glad you like it i wouldn’t do you if i were you
3 i have no fear of failure [3.72, 1.90, 1.3] (thank, propose marriage to) 4.36 well that’s me i will ride with you love
4 how about a drink? [3.37, 1.68, 0.92] (reward, thank) 4.06 sure that’s nice i have money
5 i’ve been thinking about you [3.12, 1.96, 1.31] (thank, kiss) 1.87 okay i like you
6 there is nothing for me here anymore [3.55, 1.99, 0.45] (embrace, propose marriage to) 9.05 i don’t think so it is better this way
Table 6: The full ACT conversational model with ACT identities friend-friend.
Line Prompt Target EPA (ACT) and Closest ACT Labels Defl. EPA2S-Seq2Seq EPA2S-CVAE
1 i hate you [-0.27, 0.35, 0.77] (bellow at) 2.21 i am not your friend man can you scream
2 you are despicable [-0.18, 0.55, 0.58] (disagree with) 4.32 i don’t care for you you can calm down
3 what the hell are you doing [-0.29, 0.35, 0.74] (bellow at) 3.56 i can ask you it i need to leave
4 i quit. [-0.17, 0.39, 0.75] (giggle at) 5.29 well that’s me it is too late
5 please don’t talk with food in your mouth [-0.09, 0.48, 0.64] (disagree with) 6.30 not now go away dog
6 i insist on being told exactly what you have in mind [-0.17, 0.33, 1.12] (be sarcastic toward) 4.01 yeah you know me i am singing for you
Table 7: The full ACT conversational model with ACT identities enemy-enemy.

Experiment # 3

We now test the full model (the dialogue pipeline shown in Figure 1), where the two functions S2EPA and EPA2S are integrated with ACT. That is to say, the target EPA vectors are produced by ACT. We use two ACT settings for identities: friend-friend and enemy-enemy.

First, we quantitatively compare the quality of four variants of the ACT model with the baseline Seq2Seq model in Table 5. We ask three human judges to rate the responses of each model on 100 test prompts, given the affective identities of the two participants. The three evaluation axes are syntactic coherence (Does the response make grammatical sense?), naturalness (Could the response have been plausibly produced by a human?), and emotional appropriateness (Is the response emotionally suitable for the prompt?) [Asghar et al.2018]. For each of these axes, the judges are asked to assign each response an integer score of 0 (bad), 1 (satisfactory), or 2 (good). The scores are then averaged for each axis. We also compute the statistical significance of the results using the one-tailed Wilcoxon Signed Rank Test [Wilcoxon1945] with the significance level set to 0.05. This is indicated through arrows in Table 5: a down-arrow indicates that the model was not significantly better than the baseline, and an up-arrow indicates that the model performed significantly better than the baseline. All four models perform on par with the baseline as far as syntactic coherence is concerned. EPA2S-Seq2Seq’s naturalness in the enemy-enemy setting is significantly better than that of the others, whereas EPA2S-CVAE’s emotional appropriateness is highest in the enemy-enemy setting (as indicated by arrows).
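The significance test described above can be sketched as follows. This is an illustrative pure-Python version of a one-sided Wilcoxon signed-rank test, not the authors' evaluation code. It assumes one paired score per test prompt for the model and the baseline, drops zero differences, and uses the normal approximation with average ranks for ties. The ratings are toy values.

```python
import math
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wilcoxon_greater(model, baseline):
    """One-sided signed-rank test of H1: model scores exceed baseline scores.

    Returns (W+, p). Zero differences are dropped; p uses the normal
    approximation with a tie-aware variance and no continuity correction.
    """
    diffs = [m - b for m, b in zip(model, baseline) if m != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0  # no evidence either way
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mu = n * (n + 1) / 4
    var = sum(r * r for r in ranks) / 4  # variance of W+ under H0
    z = (w_plus - mu) / math.sqrt(var)
    p = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return w_plus, p


# Toy ratings on the 0/1/2 scale for ten prompts (not the paper's data).
baseline_scores = [0, 1, 0, 1, 2, 0, 1, 0, 1, 1]
model_scores = [1, 2, 1, 1, 2, 1, 2, 1, 2, 1]

w_plus, p_value = wilcoxon_greater(model_scores, baseline_scores)
print("mean baseline:", mean(baseline_scores), "mean model:", mean(model_scores))
print("W+ =", w_plus, "significant at 0.05:", p_value < 0.05)
```

With scores restricted to 0, 1 and 2, many differences tie, which is why the sketch uses average ranks and a tie-aware variance. For small samples an exact test would be the safer choice than the normal approximation.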

Next, we examine the results qualitatively. We first analyse the setting where the ACT identity of both interactants is friend. The results are shown in Table 6. We see that ACT produces target actions that are very friendly and nice (e.g. care for, thank, kiss, embrace). This is consistent with the responder’s identity of friend. Both Seq2Seq and CVAE produce responses that are generally well-formed and relevant to the input prompt, but they sometimes seem to ignore the target EPA vector. Though the affective interpretation of the responses is very subjective, we observe that Seq2Seq produces emotionally aligned responses in Lines 2 and 4, while CVAE produces affectively appropriate results in Lines 1, 5 and 6. We also include the ACT deflection values in the table for the sake of completeness.

In the second setting, we set both the ACT identities to enemy. The results are presented in Table 7. We observe that the actions predicted by ACT are not friendly anymore (giggle at, disagree with, bellow at, be sarcastic toward); these behaviours are consistent with the responder’s identity of enemy. Here the responses often align well with the target EPA vector. The positive examples for Seq2Seq are Lines 1, 2, 5 and 6; those for CVAE are Lines 1, 2, 3, 5 and 6.
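The closest ACT labels listed in Tables 6 and 7 can be pictured as a nearest-neighbour lookup in EPA space. The sketch below assumes Euclidean distance, which the text does not state explicitly. Its small lexicon reuses the eight behaviour EPA values that appear in Table 3; a real system would search the full ACT lexicon, so the labels returned here differ from those in the tables.

```python
import math

# Behaviour EPA values copied from Table 3. This is only a small stand-in
# for a full ACT behaviour lexicon.
BEHAVIOUR_EPA = {
    "avoid": (-1.05, -0.69, 0.33),
    "entertain": (1.27, 1.14, 1.44),
    "comfort": (2.12, 1.12, -0.81),
    "encourage": (1.64, 1.17, 0.47),
    "criticize": (-0.50, 0.72, 0.81),
    "laugh at": (-1.39, -0.47, 2.15),
    "ignore": (-1.53, -0.20, -0.19),
    "charm": (1.18, 1.47, 0.20),
}


def closest_labels(target, lexicon=BEHAVIOUR_EPA, k=2):
    """Return the k labels whose EPA vectors are nearest to `target`.

    Euclidean distance is an assumption of this sketch.
    """
    ranked = sorted(lexicon, key=lambda label: math.dist(target, lexicon[label]))
    return ranked[:k]


# Target EPA vectors taken from Line 2 of Table 6 and Line 1 of Table 7.
friend_target = (3.13, 1.70, 1.39)
enemy_target = (-0.27, 0.35, 0.77)

print(closest_labels(friend_target))
print(closest_labels(enemy_target, k=1))
```

Even with this tiny lexicon, the friend-friend target lands among positive behaviours and the enemy-enemy target among negative ones.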

Overall, ACT-guided response generation generally performs better than the baseline model as far as contextual relevance and emotional appropriateness are concerned. There are some negative examples too; these can be attributed to the underwhelming performance of the EPA2S models for certain ACT identity settings.


Based on the three experiments presented in the previous section, our main takeaway is that S2EPA often performs reasonably well, whereas EPA2S may be more susceptible to the choice of ACT identities. In this work we chose two simple settings, friend-friend and enemy-enemy, as canonical examples. However, in the real world, identities are much more nuanced; this is also true for our training set of movie transcripts. In fact, a large portion of the training set may not align well with either of the two settings. As an example, for the friend-friend setting, we should only train using movie conversations that happen between two friendly identities (e.g. mother and child, two colleagues, two friendly spouses). Thus, ACT identities need to be chosen more carefully and, once they are fixed, only the matching examples from the data should be used for training.

We highlight some other shortcomings of our models that contribute to partially negative results.

  • The sentence-level EPA vectors predicted by the S2EPA model may not be as precise as ACT expects, and even small discrepancies on the EPA scale can be detrimental to CVAE or Seq2Seq learning. One way to alleviate this problem is to carry out a large-scale user study and construct a lexicon of sentence-EPA pairs, much like the word-level ACT lexicon. However, such a lexicon would have to be gathered for each identity pair independently.

  • For the EPA2S models, a dataset of 220k triples may be small enough to cause over-fitting. Ideally, for each prompt, the data should contain examples with different target EPA vectors; this would allow the model to understand how the response affect should vary for a fixed prompt. In turn, this would enable the model to control and capture global affective features more effectively. Constructing such a dataset may be time-consuming and expensive. Another possible approach may be to disentangle the multidimensional representations of affect and content, following [Hu et al.2017, John et al.2019]. In this case, generating the sentences may be less noisy and the dataset for training the model would not require as many examples.

  • The process of converting EPA values to appropriate conversational responses is a hard problem in general, even for humans. For example, given ‘i failed my exam’ and a target EPA vector (without a word label), it is not obvious how to come up with an appropriately worded, grammatically correct response that precisely conveys the right amount of evaluation, potency and activity. Furthermore, each EPA vector may correspond to many valid sentences, and each sentence may have many valid EPA ratings, due to the subjectivity of the task.
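The second point above, that each prompt should ideally appear with several different target EPA vectors, can be checked on a training set with a simple count. The sketch below assumes the triples are stored as (prompt, EPA vector, response). The triple layout, the function name and the toy rows are illustrative and not taken from the paper's data.

```python
from collections import defaultdict


def epa_diversity(triples):
    """Map each prompt to the number of distinct EPA vectors paired with it."""
    seen = defaultdict(set)
    for prompt, epa, _response in triples:
        seen[prompt].add(tuple(epa))
    return {prompt: len(vectors) for prompt, vectors in seen.items()}


# Toy triples (hypothetical): one prompt has two target vectors, one has one.
toy_triples = [
    ("i quit.", (1.64, 1.17, 0.47), "it's not too late to try"),
    ("i quit.", (-0.50, 0.72, 0.81), "who can help you"),
    ("i hate you", (-0.50, 0.72, 0.81), "okay"),
]

diversity = epa_diversity(toy_triples)
single_vector_prompts = [p for p, n in diversity.items() if n == 1]
print(diversity)
print("prompts with only one target EPA vector:", single_vector_prompts)
```

A high share of prompts with a single target vector would suggest the model has little signal for learning how the response should change when only the target affect changes.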


We propose a neural conversational system that uses Affect Control Theory (ACT) to guide the generation of affective dialogue responses. In particular, we develop models that convert dialogue actions (i.e. sentences) to ACT actions (EPA vectors) and vice versa. We also discuss their relative strengths and weaknesses. The experiments generally show positive results, and highlight some key limitations of the proposed models. We provide ideas about how the limitations can be addressed in the future.


We thank Olga Vechtomova and Ilya Shapiro for their feedback and insightful discussions.

Part of this work is presented in Chapter 4 of the first author’s Ph.D. thesis [Asghar2019].


  • [Alhothali and Hoey2017] Alhothali, A., and Hoey, J. 2017. Semi-supervised affective meaning lexicon expansion using semantic and distributed word representations. arXiv preprint arXiv:1703.09825.
  • [Asghar et al.2018] Asghar, N.; Poupart, P.; Hoey, J.; Jiang, X.; and Mou, L. 2018. Affective neural response generation. In European Conference on Information Retrieval, 154–166. Springer.
  • [Asghar2019] Asghar, N. 2019. Emotion-aware and human-like autonomous agents. Ph.D. Thesis, Cheriton School of Computer Science, University of Waterloo.
  • [Bowman et al.2016] Bowman, S. R.; Vilnis, L.; Vinyals, O.; Dai, A.; Jozefowicz, R.; and Bengio, S. 2016. Generating sentences from a continuous space. In CoNLL, 10–21.
  • [Callejas, Griol, and López-Cózar2011] Callejas, Z.; Griol, D.; and López-Cózar, R. 2011. Predicting user mental states in spoken dialogue systems. EURASIP J. Advances in Signal Processing 2011(1):6.
  • [Catania et al.2019] Catania, F.; Di Nardo, N.; Garzotto, F.; and Occhiuto, D. 2019. Emoty: An emotionally sensitive conversational agent for people with neurodevelopmental disorders. In Proceedings of the 52nd Hawaii International Conference on System Sciences.
  • [Danescu-Niculescu-Mizil and Lee2011] Danescu-Niculescu-Mizil, C., and Lee, L. 2011. Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs. In Workshop on Cognitive Modeling and Computational Linguistics. Association for Computational Linguistics.
  • [Dryjański et al.2018] Dryjański, T.; Bujnowski, P.; Choi, H.; Podlaska, K.; Michalski, K.; Beksa, K.; and Kubik, P. 2018. Affective natural language generation by phrase insertion. In IEEE International Conference on Big Data, 4876–4882.
  • [Ekman1992] Ekman, P. 1992. An argument for basic emotions. Cognition & emotion 6(3-4):169–200.
  • [Felbo et al.2017] Felbo, B.; Mislove, A.; Søgaard, A.; Rahwan, I.; and Lehmann, S. 2017. Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In EMNLP, 1615–1625.
  • [Fung et al.2018] Fung, P.; Bertero, D.; Xu, P.; Park, J. H.; Wu, C.-S.; and Madotto, A. 2018. Empathetic dialog systems. In LREC.
  • [Ghandeharioun et al.2018] Ghandeharioun, A.; McDuff, D.; Czerwinski, M.; and Rowan, K. 2018. Emma: An emotionally intelligent personal assistant for improving wellbeing. arXiv preprint arXiv:1812.11423.
  • [Ghosh et al.2017] Ghosh, S.; Chollet, M.; Laksana, E.; Morency, L.-P.; and Scherer, S. 2017. Affect-LM: A neural language model for customizable affective text generation. In ACL.
  • [Gordon et al.2019] Gordon, C.; Leuski, A.; Benn, G.; Klassen, E.; Fast, E.; Liewer, M.; Hartholt, A.; and Traum, D. 2019. Primer: An emotionally aware virtual agent. In Proceedings of the IUI Workshop on User-Aware Conversational Agents.
  • [Hasegawa et al.2013] Hasegawa, T.; Kaji, N.; Yoshinaga, N.; and Toyoda, M. 2013. Predicting and eliciting addressee’s emotion in online dialogue. In ACL (Volume 1: Long Papers), 964–972.
  • [Heise1979] Heise, D. R. 1979. Understanding Events: Affect and the Construction of Social Action. New York: Cambridge University Press.
  • [Heise2007] Heise, D. R. 2007. Expressive Order: Confirming Sentiments in Social Actions. Springer.
  • [Heise2010] Heise, D. R. 2010. Surveying Cultures: Discovering Shared Conceptions and Sentiments. Wiley.
  • [Hu et al.2017] Hu, Z.; Yang, Z.; Liang, X.; Salakhutdinov, R.; and Xing, E. P. 2017. Toward controlled generation of text. In ICML, 1587–1596.
  • [Huang et al.2017] Huang, C.-Y.; Labetoulle, T.; Huang, T.-H.; Chen, Y.-P.; Chen, H.-C.; Srivastava, V.; and Ku, L.-W. 2017. Moodswipe: A soft keyboard that suggests messages based on user-specified emotions. In EMNLP: System Demonstrations, 73–78.
  • [Jaques et al.2017] Jaques, N.; Taylor, S.; Sano, A.; and Picard, R. 2017. Multimodal autoencoder: A deep learning approach to filling in missing sensor data and enabling better mood prediction. In 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII), 202–208.
  • [John et al.2019] John, V.; Mou, L.; Bahuleyan, H.; and Vechtomova, O. 2019. Disentangled representation learning for non-parallel text style transfer. In ACL, 424–434.
  • [Kingma and Ba2015] Kingma, D., and Ba, J. 2015. Adam: A method for stochastic optimization. In ICLR.
  • [Kingma and Welling2013] Kingma, D. P., and Welling, M. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
  • [Kong et al.2019] Kong, X.; Li, B.; Neubig, G.; Hovy, E.; and Yang, Y. 2019. An adversarial approach to high-quality, sentiment-controlled neural dialogue generation. arXiv preprint arXiv:1901.07129.
  • [Kort, Reilly, and Picard2001] Kort, B.; Reilly, R.; and Picard, R. W. 2001. An affective model of interplay between emotions and learning: Reengineering educational pedagogy-building a learning companion. In IEEE International Conference on Advanced Learning Technologies, 43–46.
  • [Liu et al.2016] Liu, C.-W.; Lowe, R.; Serban, I.; Noseworthy, M.; Charlin, L.; and Pineau, J. 2016. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In EMNLP, 2122–2132.
  • [Lubis et al.2018] Lubis, N.; Sakti, S.; Yoshino, K.; and Nakamura, S. 2018. Eliciting positive emotion through affect-sensitive dialogue response generation: A neural network approach. In AAAI.
  • [Malhotra et al.2015] Malhotra, A.; Yu, L.; Schröder, T.; and Hoey, J. 2015. An exploratory study into the use of an emotionally aware cognitive assistant. In AAAI Workshop: Artificial Intelligence Applied to Assistive Technologies and Smart Environments.
  • [McKeown et al.2012] McKeown, G.; Valstar, M.; Cowie, R.; Pantic, M.; and Schroder, M. 2012. The semaine database: Annotated multimodal records of emotionally colored conversations between a person and a limited agent. IEEE Transactions on Affective Computing 3(1):5–17.
  • [Mou et al.2016] Mou, L.; Song, Y.; Yan, R.; Li, G.; Zhang, L.; and Jin, Z. 2016. Sequence to backward and forward sequences: A content-introducing approach to generative short-text conversation. In COLING, 3349–3358.
  • [Osgood, May, and Miron1975] Osgood, C. E.; May, W. H.; and Miron, M. S. 1975. Cross-Cultural Universals of Affective Meaning. University of Illinois Press.
  • [Park2018] Park, J. H. 2018. Finding good representations of emotions for text classification. arXiv preprint arXiv:1808.07235.
  • [Pennington, Socher, and Manning2014] Pennington, J.; Socher, R.; and Manning, C. D. 2014. GloVe: Global vectors for word representation. In EMNLP, 1532–1543.
  • [Pittermann, Pittermann, and Minker2010] Pittermann, J.; Pittermann, A.; and Minker, W. 2010. Emotion recognition and adaptation in spoken dialogue systems. Int. J. Speech Technology 13(1):49–60.
  • [Plutchik1980] Plutchik, R. 1980. A general psychoevolutionary theory of emotion. In Theories of emotion. Elsevier. 3–33.
  • [Prendinger and Ishizuka2005] Prendinger, H., and Ishizuka, M. 2005. The empathic companion: A character-based interface that addresses users’ affective states. Applied Artificial Intelligence 19(3-4):267–285.
  • [Rashkin et al.2018] Rashkin, H.; Smith, E. M.; Li, M.; and Boureau, Y.-L. 2018. I know the feeling: Learning to converse with empathy. arXiv preprint arXiv:1811.00207.
  • [Russell and Mehrabian1977] Russell, J. A., and Mehrabian, A. 1977. Evidence for a three-factor theory of emotions. Journal of research in Personality 11(3):273–294.
  • [Russell2003] Russell, J. A. 2003. Core affect and the psychological construction of emotion. Psychological Review 110(1):145–172.
  • [Serban et al.2017] Serban, I. V.; Sordoni, A.; Lowe, R.; Charlin, L.; Pineau, J.; Courville, A. C.; and Bengio, Y. 2017. A hierarchical latent variable encoder-decoder model for generating dialogues. In AAAI, 3295–3301.
  • [Shang, Lu, and Li2015] Shang, L.; Lu, Z.; and Li, H. 2015. Neural responding machine for short-text conversation. In ACL-IJCNLP, 1577–1586.
  • [Shen et al.2017] Shen, X.; Su, H.; Li, Y.; Li, W.; Niu, S.; Zhao, Y.; Aizawa, A.; and Long, G. 2017. A conditional variational framework for dialog generation. In ACL (Volume 2: Short Papers), 504–509.
  • [Shi and Yu2018] Shi, W., and Yu, Z. 2018. Sentiment adaptive end-to-end dialog systems. In ACL (Volume 1: Long Papers), 1509–1519.
  • [Smith et al.2006] Smith, H.; Matsuno, T.; Ike, S.; and Umino, M. 2006. Mean affective ratings of 1,894 concepts by Japanese undergraduates, 1989-2002 [computer file]. Distributed at Affect Control Theory Website, Program Interact <socpsy/ACT/interact/JavaInteract.html>.
  • [Sohn, Lee, and Yan2015] Sohn, K.; Lee, H.; and Yan, X. 2015. Learning structured output representation using deep conditional generative models. In NIPS, 3483–3491.
  • [Sutskever, Vinyals, and Le2014] Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In NIPS, 3104–3112.
  • [Taylor et al.2017] Taylor, S. A.; Jaques, N.; Nosakhare, E.; Sano, A.; and Picard, R. 2017. Personalized multitask learning for predicting tomorrow’s mood, stress, and health. IEEE Transactions on Affective Computing.
  • [Vadehra2018] Vadehra, A. 2018. Creating an emotion responsive dialogue system. Master’s thesis, University of Waterloo.
  • [Vinyals and Le2015] Vinyals, O., and Le, Q. 2015. A neural conversational model. arXiv preprint arXiv:1506.05869.
  • [Wilcoxon1945] Wilcoxon, F. 1945. Individual comparisons by ranking methods. Biometrics Bulletin 1(6):80–83.
  • [Xie et al.2016] Xie, R.; Liu, Z.; Yan, R.; and Sun, M. 2016. Neural emoji recommendation in dialogue systems. arXiv preprint arXiv:1612.04609.
  • [Zhang and Wang2018] Zhang, R., and Wang, Z. 2018. Learning to converse emotionally like humans: A conditional variational approach. In CCF International Conference on Natural Language Processing and Chinese Computing, 98–109.
  • [Zhou and Wang2018] Zhou, X., and Wang, W. Y. 2018. Mojitalk: Generating emotional responses at scale. In ACL (Volume 1: Long Papers), 1128–1137.
  • [Zhou et al.2017] Zhou, H.; Huang, M.; Zhang, T.; Zhu, X.; and Liu, B. 2017. Emotional chatting machine: Emotional conversation generation with internal and external memory. arXiv preprint arXiv:1704.01074.
  • [Zhou et al.2018] Zhou, L.; Gao, J.; Li, D.; and Shum, H.-Y. 2018. The design and implementation of xiaoice, an empathetic social chatbot. arXiv preprint arXiv:1812.08989.