A Persona-based Multi-turn Conversation Model in an Adversarial Learning Framework

by Oluwatobi O. Olabiyi, et al.
Capital One

In this paper, we extend the persona-based sequence-to-sequence (Seq2Seq) neural network conversation model to multi-turn dialogue by modifying the state-of-the-art hredGAN architecture. To achieve this, we introduce an additional input modality into the encoder and decoder of hredGAN to capture other attributes such as speaker identity, location, sub-topics, and other external attributes that might be available from the corpus of human-to-human interactions. The resulting persona hredGAN (phredGAN) shows better performance than both the existing persona-based Seq2Seq and hredGAN models when those external attributes are available in a multi-turn dialogue corpus. This superiority is demonstrated on TV drama series with character consistency (such as Big Bang Theory and Friends) and customer service interaction datasets such as Ubuntu dialogue corpus in terms of perplexity, BLEU, ROUGE, and Distinct n-gram scores.



I Introduction

Recent advances in machine learning, especially with deep neural networks, have led to tremendous progress in natural language processing and dialogue modeling research [1, 2, 3]. Nevertheless, the development of a good conversation model capable of fluent interactions between a human and a machine is still in its infancy. Most existing work relies on a limited dialogue history to produce responses, with the assumption that the model parameters will capture all the modalities within a dataset. However, this is not true, as dialogue corpora tend to be strongly multi-modal and practical neural network models find it difficult to disambiguate characteristics such as speaker personality, location, and sub-topic in the data.

Most work in this domain has primarily focused on optimizing dialogue consistency. For example, Serban et al. [3, 4, 5] and Xing et al. [6] introduce a Hierarchical Recurrent Encoder-Decoder (HRED) network architecture that combines a series of recurrent neural networks to capture long-term context state within a dialogue. However, the HRED system suffers from a lack of diversity and does not provide any guarantees on the generator output since the output conditional probability is not calibrated. Olabiyi et al. [7] tackle this problem by training a modified HRED generator alongside an adversarial discriminator in order to provide a stronger guarantee on the generator's output. While the hredGAN system improves upon response quality, it does not capture speaker and other attribute modalities within a dataset and fails to generate persona-specific responses in datasets with multiple modalities.

At the same time, there has been some recent work on introducing persona into dialogue models. For example, Li et al. [8] integrate learnable speaker attributes into a single-turn (Seq2Seq) generative dialogue model. In this work, Li et al. consider two persona models: one with a Speaker-only representation and another with Speaker and Addressee representations (Speaker-Addressee model), both of which capture aspects of speaker identity and interactions. Nguyen et al. [9] continue the same line of thought by considering a Seq2Seq dialogue model with Responder-only representations. In both cases, the attribute representation is learned during the system training. Zhang et al. [10] propose a slightly different approach. Here, the attributes are a set of sentences describing the profile of the speaker. In this case, the attribute representation is not learned. The system, however, learns how to attend to different parts of the attributes during training. Still, the above persona-based models have a limited dialogue history (single-turn), suffer from exposure bias, which worsens the trade-off between personalization and conversation quality, and cannot generate multiple responses given a dialogue context. This is evident in their relatively short and generic responses, even though these generally capture the persona of the speaker.

Big Bang Theory and Friends TV Series

| Source | Speaker | D(G(.)) | Utterance |
| --- | --- | --- | --- |
| Context_0 | Penny | NA | You have to, we all have to be there at the same time to yell surprise!. |
| Context_1 | Howard | NA | Okay, you have to understand something, we’re in a hospital right now. |
| Response_0 | Penny | 0.1705 | Okay, I’m sorry. I’m not really even like that. |
| Response_1 | Penny | 0.1426 | Okay, I’m sorry, I’m sorry. |
| Response_2 | Penny | 0.0688 | Okay. |
| Context_0 | Leonard | NA | Holy crap, Sheldon, did you just figure out a method for synthesizing a new stable super-heavy element? |
| Context_1 | Sheldon | NA | Did I? Well, that can’t be right. No one’s ever done that before. Except me, because I just did it. Sheldon and his brain, yeah |
| Response_0 | Leonard | 0.3976 | Sheldon , what are you doing ? |
| Response_1 | Leonard | 0.3503 | Sheldon ? |
| Context_0 | Leonard | NA | She’s gotten really hooked on Age of Conan, she’s playing non-stop. |
| Response_0 | Raj | 0.4890 | Okay , so we know , what about you and Penny doing here ? |
| Response_1 | Raj | 0.3586 | Okay , so we would have to say that about people ? |
| Response_2 | Raj | 0.1113 | Okay , let ’ s go . |

Ubuntu Dialogue Corpus

| Source | Speaker | D(G(.)) | Utterance |
| --- | --- | --- | --- |
| Context_0 | asker | NA | all i need some quick amarok help. i cant hear my music |
| Context_1 | helper | NA | is amarok muted? |
| Context_2 | asker | NA | no |
| Response_0 | helper | 0.3079 | use the UNK drivers , and then run the ” UNK ” command to get the UNK |
| Response_1 | helper | 0.1283 | what is the error message ? |
| Response_2 | helper | 0.0725 | what version of ubuntu ? |
| Context_0 | asker | NA | anyone had problems with the kernel update from today? giving me a kernel panic |
| Context_1 | helper | NA | you can select previous kernels at the bootloader (grub) menu on boot. |
| Response_0 | asker | 0.3665 | it says that it is not installed . . . |
| Response_1 | asker | 0.3195 | i’m not sure what i can find . . . |
| Response_2 | asker | 0.0186 | it’s a UNK , I’m not sure of the way . |
| Context_0 | asker | NA | how do I install Ubuntu? |
| Response_0 | helper | 0.5797 | use the alternate cd , it should be in the repos , it’s a good place to get the source of the kernel |
| Response_1 | helper | 0.1984 | use the UNK package , it should work . . . |
| Response_2 | helper | 0.0131 | use the UNK |

TABLE I: Sample of phredGAN outputs with discriminator score.

To overcome these limitations, we propose phredGAN, a multi-modal dialogue system which additionally conditions the adversarial framework proposed by Olabiyi et al. [7] on speaker and/or utterance attributes in order to maintain the response quality of hredGAN and still capture speaker and other modalities within a conversation. The attributes can be seen as another input modality alongside the utterance. The attribute representation is an embedding that is learned together with the rest of the model parameters, similar to [8]. The introduction of attributes allows the model to generate responses conditioned on particular attribute(s) across conversation turns. Since the attributes are discrete, it also allows for exploring what-if scenarios of model responses. We train and sample the proposed phredGAN similarly to the procedure for hredGAN [7]. To demonstrate model capability, we train on customer service related data such as the Ubuntu Dialogue Corpus (UDC), which is strongly bimodal between question poster and answerer, and on character-consistent TV scripts from two popular series, The Big Bang Theory and Friends, with quantitative and qualitative analysis. We demonstrate system superiority over hredGAN and the state-of-the-art persona conversational model both qualitatively, in terms of dialogue response quality, and quantitatively, with perplexity, BLEU, ROUGE, and distinct n-gram scores.

The rest of the paper is organized as follows: In section II, we describe the model architecture. Section III describes model training and evaluation while section IV contains experiments and result discussion. Finally, section V concludes and explores some future work directions.

II Model Architecture

In this section, we briefly introduce the state-of-the-art hredGAN model and subsequently show how we derive the persona version by combining it with the distributed representation of the dialogue speaker and utterance attributes.

II-A hredGAN: Adversarial Learning Framework

The hredGAN proposed by Olabiyi et al. [7] contains three major components.

Encoder: The encoder consists of three RNNs: the aRNN that encodes an utterance for attention memory, the eRNN that encodes an utterance for dialogue context, and the cRNN that encodes a multi-turn dialogue context from the eRNN outputs. The final states of aRNN and cRNN are concatenated to form the final encoder state.

Generator: The generator contains a single decoder RNN, dRNN, initialized with the final encoder state. The dRNN inputs consist of the cRNN output, the embedding of the ground truth or previously generated tokens, as well as noise samples and the attention over the aRNN outputs.

Discriminator: The discriminator also contains a single RNN, $D_{RNN}$, initialized by the final encoder state. In the case of [7], it is a bidirectional RNN that discriminates at the word level to capture both the syntactic and semantic difference between the ground truth and the generator output.

Problem Formulation: hredGAN [7] formulates multi-turn dialogue response generation as follows: given a dialogue history consisting of a sequence of utterances, $\mathbf{X}_i = (X_1, X_2, \cdots, X_i)$, where each utterance $X_i = (X_i^1, X_i^2, \cdots, X_i^{M_i})$ contains a variable-length sequence of $M_i$ word tokens such that $X_i^j \in V$ for vocabulary $V$, the dialogue model produces an output $Y_i = (Y_i^1, Y_i^2, \cdots, Y_i^{T_i})$, where $T_i$ is the number of generated tokens. The framework uses a conditional GAN structure to learn a mapping from an observed dialogue history to a sequence of output tokens. The generator, $G$, is trained to produce sequences that cannot be distinguished from the ground truth by an adversarially trained discriminator, $D$, akin to a two-player min-max optimization problem. The generator is also trained to minimize the cross-entropy loss $\mathcal{L}_{MLE}(G)$ between the ground truth $X_{i+1}$ and the generator output $Y_i$. The following objective summarizes both goals:

$$G^*, D^* = \arg\min_G \max_D \big(\lambda_G \mathcal{L}_{cGAN}(G, D) + \lambda_M \mathcal{L}_{MLE}(G)\big) \tag{1}$$

where $\lambda_G$ and $\lambda_M$ are hyperparameters and $\mathcal{L}_{cGAN}(G, D)$ and $\mathcal{L}_{MLE}(G)$ are defined in Eqs. (5) and (7) of [7] respectively.

Note that the generator and discriminator share the same encoder and embedding representation of the word tokens.

Fig. 1: The PHRED generator with local attention - Encoder: Encoder RNN, eRNN, Attention RNN, aRNN, and the Context RNN, cRNN. Generator: Decoder RNN, dRNN. The same encoder is shared between the generator and the discriminator (Fig. 1 of [7]). The attributes C allow the generator to condition its response on the utterance attributes such as speaker identity, subtopics, and so on.

II-B phredGAN: Persona Adversarial Learning Framework

The proposed architecture of phredGAN is very similar to that of hredGAN summarized above. The only difference is that the dialogue history is now $\mathbf{X}_i = ((X_1, C_1), (X_2, C_2), \cdots, (X_i, C_i))$, where $C_i$ is an additional input that represents the speaker and/or utterance attributes. Note that $C_i$ can either be a sequence of tokens or a single token such that $C_i^j \in V_c$ for vocabulary $V_c$. The embedding for attribute tokens is also learned, similar to that of word tokens. The modified system is as follows:

Encoder: In addition to the three RNNs in the encoder of hredGAN, if the attribute is a sequence of tokens, then another RNN is used to summarize the token embeddings; otherwise a single attribute embedding is concatenated with the output of eRNN as shown in Fig. 1.

Generator: In addition to the dRNN in the generator of hredGAN, if the attribute is a sequence of tokens, then another RNN is used to summarize the attribute token embeddings; otherwise a single attribute embedding is concatenated with the other inputs of dRNN as shown in Fig. 1.

Discriminator: In addition to the $D_{RNN}$ in the discriminator of hredGAN, if the attribute is a sequence of tokens, then the same attribute RNN as in the generator is used to summarize the attribute token embeddings; otherwise the single attribute embedding is concatenated with the other inputs of $D_{RNN}$ in Fig. 1 of [7].

Noise Injection: Although [7] demonstrated that injecting noise at the word level seems to perform better than at the utterance level for hredGAN, we found that this is dataset-dependent for phredGAN. The models with utterance-level and word-level noise injection are tagged phredGAN_u and phredGAN_w, respectively.

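The losses $\mathcal{L}_{cGAN}(G, D)$ and $\mathcal{L}_{MLE}(G)$ in Eq. (1) are then respectively updated as:

$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{\mathbf{X}_i, C_{i+1}, X_{i+1}}\big[\log D(\mathbf{X}_i, C_{i+1}, X_{i+1})\big] + \mathbb{E}_{\mathbf{X}_i, C_{i+1}, Z_i}\big[\log\big(1 - D(\mathbf{X}_i, C_{i+1}, G(\mathbf{X}_i, C_{i+1}, Z_i))\big)\big] \tag{2}$$

$$\mathcal{L}_{MLE}(G) = \mathbb{E}_{X_{i+1}}\big[-\log P_G(X_{i+1} \mid \mathbf{X}_i, C_{i+1}, Z_i)\big] \tag{3}$$

where $\mathbf{X}_i$ is the dialogue history, $X_{i+1}$ is the ground truth response, $C_{i+1}$ is the attribute of the responding speaker, and $Z_i$ denotes the injected noise samples.
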
The addition of speaker or utterance attributes allows the dialogue model to exhibit personality traits giving consistent responses across style, gender, location, and so on.
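The single-token case described above amounts to looking up a learned attribute embedding and concatenating it with an utterance encoding. The following is a minimal sketch in plain Python, assuming toy dimensions and random (untrained) vectors; the names `AttributeEmbedding` and `condition_on_attribute` are hypothetical and do not come from the paper's implementation.

```python
import random


class AttributeEmbedding:
    """Toy lookup table mapping a discrete attribute (e.g. a speaker ID) to a vector."""

    def __init__(self, attributes, dim, seed=0):
        rng = random.Random(seed)
        # In the actual model these vectors are trainable; here they are random stand-ins.
        self.table = {
            a: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for a in attributes
        }

    def lookup(self, attribute):
        return self.table[attribute]


def condition_on_attribute(utterance_encoding, attribute_vector):
    """Concatenate an utterance encoding with a single attribute embedding."""
    return list(utterance_encoding) + list(attribute_vector)


emb = AttributeEmbedding(["Penny", "Sheldon", "Leonard"], dim=4)
encoding = [0.5, -0.2, 0.1]  # stand-in for an encoder RNN output
conditioned = condition_on_attribute(encoding, emb.lookup("Penny"))
```

Because the attribute is discrete, swapping `"Penny"` for `"Sheldon"` changes only the appended part of the vector, which is what makes what-if comparisons across speakers straightforward.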

Require: A generator $G$ with parameters $\theta_G$.
Require: A discriminator $D$ with parameters $\theta_D$.
Require: Training hyperparameters: accuracy thresholds $acc_{D_{th}}$, $acc_{G_{th}}$ and loss weights $\lambda_G$, $\lambda_M$.
   for number of training iterations do
       Initialize cRNN to zero_state.
       Sample a mini-batch of conversations with $N$ utterances. Each utterance mini-batch $i$ contains $M_i$ word tokens.
       for $i = 1$ to $N - 1$ do
           Update the context state.
           Compute the generator output similar to Eq. (11) in [7].
           Sample a corresponding mini-batch of utterances $Y_i$.
       end for
       Compute the adversarial discriminator accuracy $D_{acc}$ over the $N - 1$ utterances $\{Y_i\}$ and $\{X_{i+1}\}$.
       if $D_{acc} < acc_{D_{th}}$ then
           Update $\theta_D$ with the gradient of the discriminator loss.
       end if
       if $D_{acc} < acc_{G_{th}}$ then
           Update $\theta_G$ with the generator's MLE loss only.
       else
           Update $\theta_G$ with attribute, adversarial and MLE losses.
       end if
   end for
Algorithm 1 Adversarial Learning of phredGAN
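The accuracy-gated updates in Algorithm 1 can be sketched as a small decision function. This is an illustration only: the function name, the threshold arguments and the example values are hypothetical, and the direction of the comparisons is an assumption (a discriminator that is already very accurate is left alone, and the generator receives the adversarial signal only once the discriminator is accurate enough to be informative).

```python
def select_updates(d_acc, d_threshold, g_threshold):
    """Decide which parameters to update, and with which losses, for one step.

    d_acc       -- discriminator accuracy measured on the current mini-batch
    d_threshold -- assumed accuracy threshold gating the discriminator update
    g_threshold -- assumed accuracy threshold gating the generator's losses
    """
    updates = {}
    if d_acc < d_threshold:
        # The discriminator is still improvable, so it gets a gradient step.
        updates["discriminator"] = ["adversarial"]
    if d_acc < g_threshold:
        # A weak discriminator gives an unreliable signal: fall back to MLE only.
        updates["generator"] = ["mle"]
    else:
        updates["generator"] = ["attribute", "adversarial", "mle"]
    return updates


# Illustrative threshold values, not the ones used in the paper.
early = select_updates(d_acc=0.60, d_threshold=0.90, g_threshold=0.70)
later = select_updates(d_acc=0.80, d_threshold=0.90, g_threshold=0.70)
```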

III Model Training and Inference

III-A Model Training

We train both the generator and the discriminator (with a shared encoder) of phredGAN using the training procedure in Algorithm 1, which follows [7]. Since the encoder, word embeddings and attribute embeddings are shared, we are able to train the system end-to-end with back-propagation.

Encoder: The encoder RNNs, eRNN and aRNN, are bidirectional while cRNN is unidirectional. All RNN units are 3-layer GRU cells with a hidden state size of 512. We use a word vocabulary $V$ with a word embedding size of 512. The number of attributes, $|V_c|$, is dataset-dependent, but we use an attribute embedding size of 512. In this study, we only use one attribute per utterance, so there is no need for the additional attribute RNNs to combine the attribute embeddings.

Generator: The generator RNN, dRNN, is also a 3-layer GRU cell with a hidden state size of 512. The aRNN outputs are connected to the dRNN input using an additive attention mechanism [11].

Discriminator: The discriminator RNN, $D_{RNN}$, is a bidirectional RNN, each 3-layer GRU cell having a hidden state size of 512. The outputs of both the forward and the backward cells for each word are concatenated and passed to a fully-connected layer with binary output. The output is the probability that the word is from the ground truth given the past and future words of the sequence as well as the responding speaker’s embedding.

Others: All parameters are initialized with Xavier uniform random initialization [12]. Due to the large word vocabulary size, we use sampled softmax loss [13] for the MLE loss to expedite the training process. However, we use full softmax for model evaluation. The parameter update is conditioned on the discriminator accuracy performance as in [7]. The model is trained end-to-end using the stochastic gradient descent algorithm. Finally, the model is implemented, trained, and evaluated using the TensorFlow deep learning framework.
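For reference, Xavier (Glorot) uniform initialization draws each weight of a layer from a zero-mean uniform distribution whose range depends on the layer size. The symbols $n_{in}$ and $n_{out}$ (the number of input and output units of the layer) are notation introduced here, not taken from the paper:

$$W \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{in} + n_{out}}},\ \sqrt{\frac{6}{n_{in} + n_{out}}}\right)$$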

III-B Model Inference

We use an inference strategy similar to the approach in Olabiyi et al. [7]. The only differences between training and inference are: (i) the generator is run in autoregressive mode with greedy decoding by passing the previously generated word token to the input of the dRNN at the next step; (ii) a modified noise sample is passed into the generator input.

For the modified noise sample, we perform a linear search for the noise variance at a fixed sample size, based on the average discriminator loss [7], using trained models run in autoregressive mode to reflect performance in actual deployment. The optimum value is then used for all inferences and evaluations. During inference, we condition the dialogue response generation on the encoder outputs, noise samples, word embedding, and the attribute embedding of the intended responder. With multiple noise samples, we rank the generator outputs by the discriminator, which is also conditioned on the encoder outputs and the intended responder’s embedding. The final response is the response ranked highest by the discriminator.
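The sample-and-rank step described above can be sketched as follows. This is a minimal illustration under stated assumptions: `rank_responses`, `toy_generate` and `toy_score` are hypothetical stand-ins for the trained generator and discriminator, and the noise is reduced to a single scalar per sample, whereas the actual model injects noise vectors into the decoder.

```python
import random


def rank_responses(generate, score, context, attribute, num_samples, noise_std, seed=0):
    """Draw several noise samples, generate one response per sample, and sort
    the candidates by discriminator score (highest first)."""
    rng = random.Random(seed)
    candidates = []
    for _ in range(num_samples):
        noise = rng.gauss(0.0, noise_std)
        response = generate(context, attribute, noise)
        candidates.append((score(context, attribute, response), response))
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return candidates


OPTIONS = ["what version of ubuntu ?", "what is the error message ?", "use the UNK"]


def toy_generate(context, attribute, noise):
    # Stand-in generator: the noise value merely selects one canned response.
    return OPTIONS[int(abs(noise) * 10) % len(OPTIONS)]


def toy_score(context, attribute, response):
    # Stand-in discriminator: longer responses receive higher scores.
    return len(response.split()) / 10.0


ranked = rank_responses(
    toy_generate, toy_score,
    context=["is amarok muted?", "no"], attribute="helper",
    num_samples=5, noise_std=1.0,
)
best_score, best_response = ranked[0]
```

Both callables receive the responder attribute, mirroring the fact that generation and ranking are conditioned on the intended responder.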

IV Experiments and Results

In this section, we explore phredGAN’s results on two conversational datasets and compare its performance to the persona system in Li et al. [8] and hredGAN [7] in terms of quantitative and qualitative measures.

IV-A Datasets

TV Series Transcripts dataset [3]. We train our model on transcripts from the two popular TV drama series, Big Bang Theory and Friends. Following a preprocessing setup similar to [8], we collect utterances from the top 12 speakers from both series to construct a corpus of 5,008 lines of multi-turn dialogue. We split the corpus into training, development, and test sets with 94%, 3%, and 3% proportions, respectively, and pair each set with a corresponding attribute file that maps speaker IDs to utterances in the combined dataset.
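The 94%/3%/3% split can be sketched as below. This is a hedged illustration: the paper does not say whether the split is contiguous, shuffled, or taken at the line or dialogue level, so the contiguous line-level split and the `split_corpus` helper are assumptions.

```python
def split_corpus(items, train_frac=0.94, dev_frac=0.03):
    """Split a list into contiguous train/dev/test parts; the remainder goes to test."""
    n = len(items)
    n_train = int(n * train_frac)
    n_dev = int(n * dev_frac)
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]
    return train, dev, test


corpus = [f"line_{i}" for i in range(5008)]  # placeholder for the 5,008 transcript lines
train, dev, test = split_corpus(corpus)
```

The same indices would be applied to the attribute file so that each utterance stays aligned with its speaker ID.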

Due to the small size of the combined transcripts dataset, we first train our model on the larger Movie Triplets Corpus (MTC) by Banchs et al. [14], which consists of 240,000 dialogue triples. We pre-train our model on this dataset to initialize the model parameters and avoid overfitting on the relatively small persona TV series dataset. After pre-training on MTC, we reinitialize the attribute embeddings in the generator from a uniform distribution following Xavier initialization [12] for training on the combined persona TV series dataset.

Ubuntu Dialogue Corpus (UDC) dataset [4]. We train our model on 1.85 million conversations of multi-turn dialogue from the Ubuntu community hub, with an average of 5 utterances per conversation. We assign two types of speaker IDs to utterances in this dataset: questioner and helper. We follow the same training, development, and test split as the UDC dataset in [7], with 90%, 5%, and 5% proportions, respectively.

While the overwhelming majority of utterances in UDC follow two speaker types, the dataset does include utterances that are not classified under either a questioner or helper speaker type. To remain consistent, we assume that there are only two speaker types within this dataset and that the first utterance of every dialogue is from a questioner. This simplifying assumption does introduce a degree of noise into the model’s ability to construct attribute embeddings. However, our experimental results demonstrate that our model is still able to differentiate between the larger two speaker types in the dataset.

TV Series

| Metric | SM [8] | SAM [8] | phredGAN |
| --- | --- | --- | --- |
| Perplexity | 25.0 | 25.4 | 25.9 |
| BLEU-4 | 1.88% | 1.90% | 3.00% |
| ROUGE-2 | - | - | 0.4044 |
| Distinct-1 | - | - | 0.1765 |
| Distinct-2 | - | - | 0.2164 |

SM stands for Speaker Model and SAM stands for Speaker-Addressee Model.

TABLE II: phredGAN vs. Seq2Seq Persona Models [8] Performance.

IV-B Evaluation Metrics

We use evaluation metrics similar to those in [7], including perplexity, BLEU [15], ROUGE [16], and distinct n-gram [17] scores.
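Two of these metrics are simple enough to sketch directly. The code below assumes the common definitions: distinct-n as the number of unique n-grams divided by the total number of n-grams over all generated responses, and perplexity as the exponential of the mean negative log-probability per token. The helper names and the toy inputs are illustrative, and the paper's exact tokenization and averaging may differ.

```python
import math


def distinct_n(responses, n):
    """Ratio of unique n-grams to total n-grams over a list of tokenized responses."""
    total = 0
    unique = set()
    for tokens in responses:
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


def perplexity(token_log_probs):
    """exp of the mean negative natural-log probability assigned to each token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


responses = [
    "okay , i ' m sorry".split(),
    "okay , let ' s go".split(),
]
d1 = distinct_n(responses, 1)  # 9 unique unigrams out of 12
d2 = distinct_n(responses, 2)  # 9 unique bigrams out of 10
ppl = perplexity([math.log(0.25)] * 4)  # uniform 1-in-4 guess per token gives about 4
```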

IV-C Baseline

We compare our system to [8], which uses a Seq2Seq framework in conjunction with learnable persona embeddings. Their work explores two persona models to incorporate vector representations of speaker interaction and speaker attributes into the decoder of their Seq2Seq model, i.e., the Speaker and Speaker-Addressee models. While we compare with both models quantitatively, we mostly compare with the Speaker-Addressee model qualitatively. Our quantitative comparison uses perplexity and BLEU-4 scores, as those are the ones reported in [8]. In addition, we also measure our model performance in terms of ROUGE and distinct n-gram scores for the purpose of completeness. For fair comparison, we use the same TV drama series dataset used in their study.

We also compare our system to hredGAN from [7] in terms of perplexity, ROUGE, and distinct n-gram scores. In [7], the authors recommend the version with word-level noise injection (hredGAN_w), so we use this version in our comparison. Also for fair comparison, we use the same UDC dataset as reported in [7]. The only change we made is to add the speaker attribute to the utterances of the dataset as described in the Datasets subsection.

IV-D Hyperparameter search

Prior to evaluation, we determine the noise injection method and the optimum noise variance that are suitable for the two datasets. We consider the two variations of phredGAN, that is, phredGAN_u for utterance-level and phredGAN_w for word-level noise injection. We notice a partial mode collapse with phredGAN_w on the combined TV transcripts, likely due to the high variation of word-level perturbation on a very limited dataset. However, this issue was rectified by phredGAN_u. Therefore, we use phredGAN_u for the combined TV series dataset and phredGAN_w for the UDC dataset. We perform a linear search for optimal noise variance values between 1 and 30 at an increment of 1, at a fixed sample size. We obtained an optimal variance of 2 with phredGAN_u for the combined TV series dataset and an optimal variance of 5 with phredGAN_w for the much larger UDC dataset.
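The linear search over noise variance values can be sketched as a simple loop. This is a hedged illustration: `linear_search_noise` is a hypothetical helper, `toy_loss` stands in for evaluating a trained model's average discriminator loss at a given variance, and picking the value with the lowest loss is an assumption about the selection criterion.

```python
def linear_search_noise(avg_disc_loss, low=1, high=30, step=1):
    """Try each candidate variance in [low, high] and keep the one with the lowest loss."""
    best_value, best_loss = None, float("inf")
    for value in range(low, high + 1, step):
        loss = avg_disc_loss(value)
        if loss < best_loss:
            best_value, best_loss = value, loss
    return best_value, best_loss


def toy_loss(variance):
    # Stand-in for running the trained model in autoregressive mode and
    # averaging the discriminator loss; shaped to have its minimum at 5.
    return (variance - 5) ** 2 / 10.0 + 0.3


best_variance, best_loss = linear_search_noise(toy_loss)
```

An exhaustive scan is affordable here because only 30 candidate values are evaluated per dataset.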

| Metric | hredGAN [7] | phredGAN |
| --- | --- | --- |
| Perplexity | 48.18 | 27.3 |
| ROUGE-2 (F1) | 0.1252 | 0.1692 |
| DISTINCT-1 | 14.05% | 20.12% |
| DISTINCT-2 | 31.24% | 24.53% |

TABLE III: phredGAN vs. hredGAN [7] Performance

IV-E Results

After training phredGAN models on the TV series and UDC datasets, we ran inference on some example dialogue contexts. The responses and their discriminator scores from phredGAN are listed in Table I. The table shows that phredGAN (i) can handle multi-turn dialogue context with utterances and corresponding persona attributes; (ii) generates responses conditioned on a persona attribute; (iii) generates multiple responses per dialogue context and scores their human likelihood with the discriminator. We observe that the discriminator score is generally reasonable, with longer, more informative, and more persona-related responses receiving higher scores. It is worth noting that this behavior, although similar to the behavior of a human judge, is learned without supervision. Furthermore, we observe that phredGAN responses retain contextual consistency, sometimes referencing background information inherent in the conversation between two speakers. For example, in the second sample of the TV series in Table I, the phredGAN generator conditioned as Leonard refers to Sheldon by name. Also, in the third sample, phredGAN conditioned as Raj refers to Penny when responding to Leonard, who happens to be Penny’s boyfriend. We observe similar persona-based response generation for the UDC dataset, with distinct communication styles between the asker and the helper.

We now present performance comparisons of phredGAN against the baselines, hredGAN and Li et al.’s persona Seq2Seq models.

IV-E1 Quantitative Analysis

We first report the performance of phredGAN on TV series transcripts in Table II. Our system actually performs slightly worse than both the Speaker Model and Speaker-Addressee systems in [8] in terms of the perplexity measure. This is because the entropy of multi-turn dialogue is higher than that of single-turn dialogue. Similar observations have been made by Serban et al. [3] about Seq2Seq and HRED dialogue models. However, phredGAN gives a significantly larger BLEU-4 score than both the Speaker and Speaker-Addressee models. We attribute this improvement to (i) the multi-turn dialogue context, and (ii) training in an adversarial framework, which forces the model to produce longer, more informative, and diverse responses that have high topic relevance even with a limited dataset. Also, unlike Speaker-Addressee models, which suffer from lower response quality due to persona conditioning, conditioning the generator and discriminator of phredGAN on speaker embeddings does not compromise the system’s ability to produce diverse responses. This problem might have been alleviated by the adversarial training too.

We also compare phredGAN with the recommended variant of hredGAN that includes word-level noise injection at the decoder on the UDC dataset. The results are summarized in Table III. We note an improvement in a variety of evaluation metrics including perplexity, ROUGE, and distinct n-grams, with the exception of distinct 2-grams. This is expected, as phredGAN should be generally less diverse than hredGAN since the number of distinct data distribution modes is larger for the phredGAN dataset due to the persona attributes. However, this leads to better response quality with persona, something not achievable with hredGAN. Also, the better ROUGE (F1) score indicates that phredGAN is able to strike a better balance between diversity and precision while still capturing the characteristics of the speaker attribute modality in the UDC dataset.

IV-E2 Qualitative Analysis

A qualitative assessment of these results is given in Table IV, with responses from several characters in the TV series dataset and the two characters in UDC.

We see that for the TV drama series, phredGAN responses are comparatively more informative than those of the Speaker-Addressee model of [8]. For example, with the Speaker-Addressee model, nearly all the characters in the TV series respond with “Of course I love you.” to the dialogue context “Do you love me?”, despite the fact that some of the responders sometimes have an unfriendly relationship with the addressee. Many of the novel situations explored by phredGAN are unachievable with the Speaker-Addressee model due to a lack of informative responses. For example, by conditioning as Sheldon from The Big Bang Theory and asking “Do you like me?”, our model responds with annoyance if conditioned as Penny (“No, you don’t understand. You’re an idiot”), brevity with Leonard (“Yes?”), and sarcasm with Raj (“Well , you know , we could be a little more than my friend’s friends.”) The wide range of responses indicates our model’s ability to construct distinct attribute embeddings for each character even from a limited dataset. The other interesting responses in Table IV indicate our model’s ability to infer not only the context of the conversation but also important character information about the Addressee.

We also see similar results with our model’s output on UDC. We demonstrate that by conditioning as either a helper or questioner from the UDC dataset, our model is able to respond differently to input utterances as well as stay close to the context of the conversation.

Big Bang Theory Series

| Model | Speaker | Response |
| --- | --- | --- |
| Context | Sheldon | Do you like me? |
| phredGAN | Leonard | Yes ? |
| phredGAN | Penny | No , you don ’ t understand . You ’ re an idiot . |
| phredGAN | Howard | Well , I don ’ t know what to say . This is the part I ’ ve ever seen . |
| phredGAN | Raj | Well , you know , we could be a little more than my friends ’ s friends . |
| Context | Emily | Do you like me? |
| Speaker-Addressee | Sheldon | Of course, I love you, Emily |
| Speaker-Addressee | Penny | Of course, I love you, Emily. |
| Speaker-Addressee | Rachel | Of course I love you. |
| Context_0 | Howard | We can’t go home. |
| Context_1 | Leonard | Alright, I’ll be right back. |
| phredGAN | Howard | Okay , well , before you and Penny get out and get married with Penny just get married. |
| Context_0 | Amy | Spent a lot of my childhood throwing coins into wishing wells hoping for friends |
| Context_1 | Penny | Whoa, whoa, give me a second to catch up here. Hold on |
| Context_2 | Amy | That’d be my boyfriend. Happier playing his dopey Star Trek game with his friends than hanging out with me. |
| phredGAN | Penny | You know , when you ’ re being so nice about Sheldon and I want to take him to like his own girlfriend |
| Context | Sheldon | I recently read that during World War Two, Joseph Stalin had a research program to create supersoldiers by having women impregnated by gorillas. |
| phredGAN | Howard | You ’ re right . I don ’ t know what you ’ re talking about , Sheldon . |

Friends TV Series

| Model | Speaker | Response |
| --- | --- | --- |
| Context | Rachel | Do you like me? |
| phredGAN | Joey | Yeah . |
| phredGAN | Ross | Well , no , no , no . I mean , I just , I don ’ t know . . . |
| phredGAN | Monica | Well , I do . And I ’ m not gonna tell you another guy . |
| phredGAN | Chandler | No , I do not . You are a mean little man . . |
| Context | Rachel | Do you like me? |
| Speaker-Addressee | Chandler | Of course I love you. |

Ubuntu Dialogue Corpus

| Model | Speaker | Response |
| --- | --- | --- |
| Context | Asker | i need to install ubuntu on a pendrive so i can install software on the pendrive after booting from the usb, is it possible? |
| phredGAN | Helper | you can just boot into the live cd and install grub. |
| Context | Asker | anyone had problems with the kernel update from today? giving me a kernel panic there a simple way to make a rollback? anyone if I can go back to my previous kernel? |
| phredGAN | Helper | what version of ubuntu are you using ? |
| Context | Helper | did you see my message about looking at other log files? |
| phredGAN | Asker | yeah , it ’ s not that I ’ m trying to get the UNK to work |
| Context | Asker | you can select previous kernels at the bootloader (grub) menu on boot |
| phredGAN | Helper | i ’ m not to be able to get the grub to the grub menu |

TABLE IV: phredGAN sample responses on TV Series and UDC.

V Conclusion and Future Work

In this article, we improve upon state-of-the-art persona-based response generation models by adding attribute embeddings to hredGAN in order to capture a representation of speaker identity and style in a multi-turn dialogue. Our model outperforms the existing Speaker Model and Speaker-Addressee models from [8] with respect to BLEU score and even improves upon hredGAN with respect to perplexity, ROUGE, and distinct 1-gram scores. We also see a qualitative improvement in response quality that more clearly separates speaker identity, even from limited datasets.

In the future, we hope to carry out a human evaluation to confirm qualitative improvement of our model’s outputs. We also hope to extend this work to Multi-class GAN by optimizing over two separate adversarial losses, one for speaker attribute classification and the other for dialogue generation real/fake discrimination. This will allow us to further improve on persona distinctions without the loss of response quality.