Follow Alice into the Rabbit Hole: Giving Dialogue Agents Understanding of Human Level Attributes

10/18/2019, by Aaron W. Li, et al.

For conversational AI and virtual assistants to communicate with humans in a realistic way, they must exhibit human characteristics such as expression of emotion and personality. Current attempts toward constructing human-like dialogue agents have presented significant difficulties. We propose Human Level Attributes (HLAs) based on tropes as the basis of a method for learning dialogue agents that can imitate the personalities of fictional characters. Tropes are characteristics of fictional personalities that are observed recurrently and determined by viewers' impressions. By combining detailed HLA data with dialogue data for specific characters, we present a dataset that models character profiles and gives dialogue agents the ability to learn characters' language styles through their HLAs. We then introduce a three-component system, ALOHA (which stands for Artificial Learning On Human Attributes), that combines character space mapping, character community detection, and language style retrieval to build a character (or personality) specific language model. Our preliminary experiments demonstrate that ALOHA, combined with our proposed dataset, can outperform baseline models at identifying correct dialogue responses of any chosen target character, and is stable regardless of the character's identity, genre of the show, and context of the dialogue.


1 Introduction

Attempts toward constructing human-like dialogue agents have met significant difficulties, such as maintaining conversation consistency [30]. This is largely due to dialogue agents' inability to engage the user emotionally because of an inconsistent personality [19]. Many agents use personality models that attempt to map personality attributes into lower dimensional spaces (e.g. the Big Five [11]). However, these represent human personality at a very high level and lack depth. They prevent the linking of specific, detailed personality traits to characters, and the construction of large datasets where dialogue is traceable back to these traits.

For this reason, we propose Human Level Attributes (HLAs), which we define as characteristics of fictional characters representative of their profile and identity. We base HLAs on tropes collected from TV Tropes [24], which are determined by viewers’ impressions of the characters. See Figure 1 for an example. Based on the hypothesis that profile and identity contribute effectively to language style [18], we propose that modeling conversation with HLAs is a means for constructing a dialogue agent with stable human-like characteristics. By collecting dialogues from a variety of characters along with this HLA information, we present a novel labelling of this dialogue data where it is traceable back to both its context and associated human-like qualities.

Figure 1: Example of a character and its associated HLAs (tropes) on the left and dialogue lines on the right.

We also propose a system called ALOHA (Artificial Learning On Human Attributes) as a novel method of incorporating HLAs into dialogue agents. ALOHA maps characters to a latent space based on their HLAs, determines which are most similar in profile and identity, and recovers language styles of specific characters. We test the performance of ALOHA in character language style recovery against four baselines, demonstrating outperformance and system stability. We also run a human evaluation supporting our results.

Our major contributions are: (1) We propose HLAs as personality aspects of fictional characters from the audience’s perspective based on tropes; (2) We provide a large dialogue dataset traceable back to both its context and associated human-like attributes; (3) We propose a system called ALOHA that is able to recommend responses linked to specific characters. We demonstrate that ALOHA, combined with the proposed dataset, outperforms baselines. ALOHA also shows stable performance regardless of the character’s identity, genre of the show, and context of the dialogue. We plan to release all of ALOHA’s data and code.

2 Related Work

Task completion chatbots (TCCs), or task-oriented chatbots, are dialogue agents that fulfill specific purposes, such as helping customers book airline tickets or answering government inquiries. Examples include the AIML-based chatbot [21] and the DIVA Framework [28]. While TCCs are low cost, easily configurable, and readily available, they only work well for particular domains and tasks.

Open-domain chatbots are more generic dialogue systems. An example is the Poly-encoder of Humeau et al. [10]. It outperforms the Bi-encoder [15, 4] and matches the performance of the Cross-encoder [27, 26] while maintaining reasonable computation time. It performs strongly on downstream language understanding tasks involving pairwise comparisons, and demonstrates state-of-the-art results on the ConvAI2 challenge [3]. Feed Yourself [8] is an open-domain dialogue agent with a self-feeding model: when the conversation goes well, the dialogue becomes part of the training data, and when it does not, the agent asks for feedback. Lastly, Kvmemnn [5] is a key-value memory network with a knowledge base that uses a key-value retrieval mechanism to train over multiple domains simultaneously. We use all three of these models as baselines for comparison. While they can handle a greater variety of tasks than TCCs, they do not respond with text that aligns with particular human-like characteristics.

Li et al. (2016) define persona (a composite of elements of identity) as a possible solution at the word level, using backpropagation to align responses via word embeddings. Bartl and Spanakis (2017) use sentence embeddings and a retrieval model to achieve higher accuracy on dialogue context. Liu et al. (2019) apply the emotion states of sentences as encodings to select appropriate responses. Pichl et al. (2018) use knowledge aggregation and a hierarchy of sub-dialogues to achieve high user engagement. However, these agents represent personality at a high level and lack detailed human qualities.

LIGHT [25] models adventure game characters' dialogues, actions, and emotions. It focuses on agent identities (e.g. thief, king, servant), which include limited information on realistic human behaviours. Pasunuru and Bansal (2018) model online soccer games as dynamic visual context. Wang et al. (2016) model user dialogue for completing tasks involving certain configurations of blocks. Antol et al. (2015) model open-ended questions, but are limited to visual contexts. Bordes et al. (2016) track user dialogues but are goal-oriented. Ilinykh et al. (2019) track players' dialogues and movements in a visual environment, grounded on navigation tasks. All of these perform well in their respective fictional environments, but are not strong representations of human dialogue in reality.

3 Methodology

3.1 Human Level Attributes (HLA)

We collect HLA data from TV Tropes [24], a knowledge-based website dedicated to pop culture, containing information on a plethora of characters from a variety of sources. Similar to Wikipedia, its content is provided and edited collaboratively by a massive user-base. These attributes are determined by human viewers and their impressions of the characters, and are correlated with human-like characteristics. We believe that TV Tropes is better suited to our purpose of fictional character modeling than the data sources used in works such as Shuster et al. (2019) because TV Tropes' content providers are rewarded for correctly providing content through community acknowledgement.

TV Tropes defines tropes as attributes of storytelling that the audience recognizes and understands. We use tropes as HLAs to calculate correlations with specific target characters. We collect data on numerous characters from a variety of TV shows, movies, and anime. We filter out characters with fewer than five HLAs, as characters with fewer are not complex enough to be correctly modeled, e.g. due to lack of data. This eliminates 5.86% of characters, leaving 45,821 characters and 12,815 unique HLAs, for a total of 945,519 character-HLA pairs. Each collected character has 20.64 HLAs on average. See Figure 1 for an example character and their HLAs.
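To make this filtering step concrete, here is a minimal Python sketch; the `character_hlas` dictionary and the example tropes are illustrative stand-ins for the scraped TV Tropes data, not the actual collection pipeline:

```python
# Minimal sketch of the HLA filtering step. The data below is an
# illustrative stand-in for the scraped TV Tropes character-trope pairs.
MIN_HLAS = 5

character_hlas = {
    "Sheldon Cooper": {"Insufferable Genius", "No Social Skills", "Catchphrase",
                       "Brutal Honesty", "Ambiguous Disorder"},
    "Extra With A Line": {"Butt-Monkey", "The Generic Guy"},  # too few HLAs; dropped
}

# Keep only characters with at least MIN_HLAS attributes.
filtered = {c: h for c, h in character_hlas.items() if len(h) >= MIN_HLAS}

num_pairs = sum(len(h) for h in filtered.values())
unique_hlas = set().union(*filtered.values()) if filtered else set()
print(f"{len(filtered)} characters, {len(unique_hlas)} unique HLAs, {num_pairs} pairs")
```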

3.2 Overall Task

Our task is the following, where the subscript $t$ denotes “target”:

Given a target character $c_t$ with HLA set $H_{c_t}$, recover the language style of $c_t$ without any dialogue of $c_t$ provided.

For example, if Sheldon Cooper from The Big Bang Theory is $c_t$, then $H_{c_t}$ is the set of HLAs on the left side of Figure 1.

We define the language style of a character as its diction, tone, and speech patterns. It is a character-specific language model refined from a general language model. We must learn to recover the language style of $c_t$ without its dialogue, as our objective is to imitate human-like qualities, and hence the model must understand the language styles of characters based on their traits. If we feed $c_t$'s dialogue during training, the model will likely not learn to imitate language styles based on HLAs, but rather based on the correlation between text in the training and testing dialogues [12].

We define character space as the character representations within the HLA latent space (see Figure 2), and the set $C$ as the set of all characters. We define Observation (OBS) as the input that is fed into any dialogue model. This can be a single line or multiple lines of dialogue, along with other information. The goal of the dialogue model is to find the best response to this OBS.

Figure 2: t-SNE visualization of the character space generated by our Character Space Module (CSM) based on HLAs.

3.3 ALOHA

We propose a three-component system called ALOHA to solve the task (see Figure 3). The first component, the Character Space Module (CSM), generates the character space and calculates confidence levels using singular value decomposition [20] between the characters $c_i$ (for $i = 1$ to $n$, where $c_i \neq c_t$) and $c_t$ in the HLA-oriented neighborhood.

The second component, the Character Community Module (CCM), ranks the similarity between our target character $c_t$ and any other character by the relative distance between them in the character space.

The third component, the Language Style Recovery Module (LSRM), recovers the language style of $c_t$ without its dialogue by training the BERT bi-ranker model [2] to rank responses from similar characters. Our results demonstrate higher accuracy at retrieving the ground truth response from $c_t$. Our system is also able to pick responses that are correct both in context and in character space.

Hence, the overall process for ALOHA works as follows. First, given a set of characters, determine the character space using the CSM. Next, given a specific target character, determine the positive community and negative set of associated characters using the CCM. Lastly, using the positive community and negative set determined above along with a dialogue dataset, recover the language style of the target.

Figure 3: Overall system architecture.

3.4 Character Space Module (CSM)

CSM learns how to rank characters. We measure the interdependencies between the HLA variables [9] and rank the similarity between the TV show characters. We use implicit feedback rather than neighborhood models (e.g. cosine similarity) because it computes latent factors that transform both characters and HLAs into the same latent space, making them directly comparable.

We define a matrix $P$ containing binary values, with $p_{ch} = 1$ if character $c$ has HLA $h$ in our dataset, and $p_{ch} = 0$ otherwise. We define a constant $\alpha$ that measures our confidence in observing various character-HLA pairs as positive: $\alpha$ controls how much the model penalizes the error when the ground truth is $p_{ch} = 1$. If $p_{ch} = 1$ and the model guesses incorrectly, we penalize by $\alpha$ times the loss; but if $p_{ch} = 0$ and the model guesses a value greater than zero, we do not apply this extra penalty, as $\alpha$ has no impact. This is because $p_{ch} = 0$ can either represent a true negative or be due to a lack of data, and is hence less reliable for penalization. See Equation 1. We find that an appropriately large $\alpha$ provides decent results.

We further define two dense vectors $x_c$ and $y_h$. We call $x_c$ the “latent factors for character $c$”, and $y_h$ the “latent factors for HLA $h$”. The dot product of these two vectors produces a value ($x_c^{\top} y_h$) that approximates $p_{ch}$ (see Figure 4). This is analogous to factoring the matrix $P$ into two separate matrices, where one contains the latent factors for characters, and the other contains the latent factors for HLAs. We find that 36-dimensional $x_c$ and $y_h$ produce decent results. To bring $x_c^{\top} y_h$ as close as possible to $p_{ch}$, we minimize the following loss function using the Conjugate Gradient Method [23]:

$$\min_{x_*, y_*} \sum_{c,h} a_{ch} \left( p_{ch} - x_c^{\top} y_h \right)^2 + \lambda \left( \sum_c \lVert x_c \rVert^2 + \sum_h \lVert y_h \rVert^2 \right) \qquad (1)$$

where $a_{ch} = 1 + \alpha\, p_{ch}$ is the confidence weight, following Hu et al. [9].

The first term penalizes differences between the model's prediction ($x_c^{\top} y_h$) and the actual value ($p_{ch}$), weighted by the confidence $a_{ch}$. The second term is an L2 regularizer to reduce overfitting. We find that a suitable choice of $\lambda$ provides decent results after 500 iterations (see Section 5.1).
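As a minimal NumPy sketch of this objective: it assumes the confidence form $a_{ch} = 1 + \alpha\, p_{ch}$ from Hu et al. [9] and uses plain gradient descent instead of the Conjugate Gradient Method; the toy matrix sizes and the $\alpha$, $\lambda$, and learning-rate values are illustrative assumptions, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; the paper uses 36 latent factors over 45,821 characters
# and 12,815 HLAs. alpha, lam, and lr below are illustrative guesses.
n_chars, n_hlas, k = 100, 300, 36
P = (rng.random((n_chars, n_hlas)) < 0.05).astype(float)  # binary character-HLA matrix

alpha, lam, lr = 40.0, 0.1, 5e-4
X = 0.1 * rng.standard_normal((n_chars, k))  # latent factors x_c for characters
Y = 0.1 * rng.standard_normal((n_hlas, k))   # latent factors y_h for HLAs

A = 1.0 + alpha * P  # confidence: observed pairs are penalized alpha times more

for step in range(500):
    E = P - X @ Y.T                       # per-pair prediction error
    # Gradients of  sum_{c,h} a_ch (p_ch - x_c.y_h)^2 + lam (||X||^2 + ||Y||^2)
    gX = -2 * (A * E) @ Y + 2 * lam * X
    gY = -2 * (A * E).T @ X + 2 * lam * Y
    X -= lr * gX
    Y -= lr * gY

loss = np.sum(A * (P - X @ Y.T) ** 2) + lam * (np.sum(X**2) + np.sum(Y**2))
print(f"final weighted loss: {loss:.2f}")
```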

Figure 4: Illustration of our Collaborative Filtering procedure. Green check-marks indicate a character having an HLA, and ‘X’ indicates otherwise. We randomly mask 30% of this data for validation, as marked by the ‘?’.

3.5 Character Community Module (CCM)

CCM aims to divide the characters (other than $c_t$) into a positive community and a negative set. We define this positive community as characters that are densely connected internally to $c_t$ within the character space, and the negative set as the remaining characters. We can then sample dialogue from characters in the negative set to act as distractors (essentially negative samples) during LSRM training.

As community finding is an ill-defined problem [7], we choose to treat the connections in CCM as a simple undirected, unweighted graph. We use the values of $x_c$ and $y_h$ learned in the CSM for the various characters $c$ and HLAs $h$, which together approximate the matrix $P$. Similar to Hu et al. [9], we can calculate the correlation between two rows of this approximated matrix (and hence between two characters).

We then employ a two-level connection representation by ranking all characters against each other in terms of their correlation with $c_t$. For the first level, the set $F$ is the top 10% (4,582) most highly correlated characters with $c_t$ out of the 45,820 total other characters that we have HLA data for. For the second level, for each character $c_j$ in $F$, we determine the 30 most heavily correlated characters with $c_j$ as the set $S_j$. The positive set $C^{+}$ consists of the characters present in at least 10 of the sets $S_j$; we call this value 10 the minimum frequency. All other characters in our dialogue dataset make up the negative set $C^{-}$. These act as our positive community and negative set, respectively. See Algorithm 1 in Appendix A for details, and Figure 5 for an example.
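The two-level procedure can be sketched as follows. The function and variable names are ours, and computing character similarity as the Pearson correlation between rows of the reconstructed matrix $\hat{P} = XY^{\top}$ is an assumption about the exact correlation used:

```python
import numpy as np

def positive_negative_sets(X, Y, target_idx, top_frac=0.10, second_k=30, min_freq=10):
    """Two-level connection representation from the CCM (sketch).

    X: (n_chars, k) character latent factors and Y: (n_hlas, k) HLA latent
    factors, both learned by the CSM. Character similarity is taken as the
    Pearson correlation between rows of the reconstructed matrix X @ Y.T.
    """
    P_hat = X @ Y.T
    corr = np.corrcoef(P_hat)  # (n_chars, n_chars) row-vs-row correlations
    others = np.array([i for i in range(len(X)) if i != target_idx])

    # Level 1: the top 10% of characters most correlated with the target.
    n_first = max(1, int(top_frac * len(others)))
    level1 = others[np.argsort(-corr[target_idx, others])[:n_first]]

    # Level 2: for each level-1 character, its 30 most correlated characters.
    counts = {}
    for j in level1:
        cand = np.array([i for i in range(len(X)) if i not in (j, target_idx)])
        for i in cand[np.argsort(-corr[j, cand])[:second_k]]:
            counts[i] = counts.get(i, 0) + 1

    # Positive community: characters appearing in at least min_freq level-2 sets.
    positive = {i for i, c in counts.items() if c >= min_freq}
    negative = set(range(len(X))) - positive - {target_idx}
    return positive, negative
```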

Figure 5: Illustration of the two-level connection representation procedure, using a minimum frequency of two. The transparent red circle indicates the first-level set ($F$), while the blue circles indicate the sets $S_j$. The lines indicate connections between the characters within the community structure of $c_t$.

3.6 Language Style Recovery Module (LSRM)

LSRM creates a dialogue agent that aligns with observed characteristics of human characters. It uses the positive character community and negative set determined by the CCM, along with a dialogue dataset, to recover the language style of $c_t$ without its dialogue. We use the BERT bi-ranker model from the Facebook ParlAI framework [16], where the model retrieves the best response out of 20 candidate responses. Prior work [3, 25, 30] uses 20 candidate responses, and for comparison purposes, we do the same.

BERT

[2] is first trained on massive amounts of unlabeled text data. It jointly conditions on text to both the left and the right, providing a deep bidirectional representation of sentence inference. BERT has been shown to perform well on a wide range of tasks by simply fine-tuning one additional layer. We are interested in its ability to predict the next sentence, called Next Sentence Prediction. We further fine-tune BERT for our target-character language style retrieval task to produce our LSRM model, optimizing both the encoding layers and the additional layer. We use BERT to create vector representations for the OBS and for each candidate response: by passing the first output of BERT's 12 layers through an additional linear layer, we obtain 768-dimensional sentence-level embeddings. The model scores candidate responses using the dot product between these embeddings and is trained with a ranking loss.
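A sketch of this scoring step using the HuggingFace transformers API (rather than ParlAI's exact implementation); the OBS and candidate strings are invented, and training with a ranking loss over these scores is omitted:

```python
import torch
from transformers import BertModel, BertTokenizer

# Sketch of bi-ranker scoring; the example OBS and candidates are invented.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
proj = torch.nn.Linear(768, 768)  # the additional linear layer on top of BERT

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    cls = bert(**batch).last_hidden_state[:, 0]  # first ([CLS]) output of the last layer
    return proj(cls)                             # 768-d sentence-level embeddings

obs = ["your hla: Insufferable Genius\nLeonard: That's my spot."]  # hypothetical OBS
candidates = ["I was not aware of that.", "That is MY spot.", "Let's order pizza."]

with torch.no_grad():
    scores = embed(obs) @ embed(candidates).T  # dot product scores each candidate
print(scores.argmax(dim=1).item())             # index of the top-ranked response
```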

Candidate response selection

is similar to the procedure from previous work on grounded dialogue agents [30, 25]. Along with the ground truth response, we randomly sample 19 distractor responses from other characters using a uniform distribution over characters; we call this process uniform character sampling. Based on our observations, this random sampling yields multiple context-correct candidates. Hence, the BERT bi-ranker is trained to choose context-correct responses, and the model learns a domain-general language model trained on every character. This results in a Uniform Model that can select context-correct responses, but not responses corresponding to a target character with specific HLAs.

We then fine-tune the above model to produce our LSRM model with a modification: we randomly sample the 19 distractor responses from only the negative character set instead, choosing responses that have grammatical structures and semantics similar to the ground truth response; we call this process negative character sampling. This guides the model away from the language styles of these negative characters to improve performance at retrieving responses for target characters with specific HLAs. Our results demonstrate higher accuracy at retrieving the correct (ground truth) response from character $c_t$.
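The two sampling schemes can be sketched as below. The `lines_by_character` structure is hypothetical, and the sketch omits the paper's additional preference for distractors grammatically and semantically similar to the ground truth:

```python
import random

# Sketch of the two distractor-sampling schemes. `lines_by_character` maps a
# character name to a list of its dialogue lines (a hypothetical structure).
def sample_distractors(lines_by_character, target, negative_set=None, n=19):
    if negative_set is None:
        # Uniform character sampling: any character other than the target.
        pool = [c for c in lines_by_character if c != target]
    else:
        # Negative character sampling: only characters outside the
        # target's positive community.
        pool = [c for c in negative_set if c in lines_by_character]
    chars = random.choices(pool, k=n)  # one character per distractor, with replacement
    return [random.choice(lines_by_character[c]) for c in chars]

data = {"Sheldon Cooper": ["Bazinga!"], "Monica Geller": ["I know!"],
        "Jean-Luc Picard": ["Make it so."]}
print(sample_distractors(data, "Sheldon Cooper", n=2))
```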

4 Experiment

4.1 Dialogue Dataset

To train the Uniform Model and LSRM, we collect dialogues from 327 major characters (a subset of the 45,821 characters we have HLA data for) in 38 TV shows from various existing sources of clean data on the internet, resulting in a total of 1,042,647 dialogue lines. We use a setup similar to the Persona-Chat dataset [30] and the Cornell Movie-Dialogs Corpus [1], as our collected dialogues are also paired in terms of valid conversations. (Our dataset has much more dialogue per character than Persona-Chat and the Cornell Movie-Dialogs Corpus, as we need sufficient data to learn each character's dialogue style.) See Figure 1 for an example of these dialogue lines.

4.2 HLA Observation Guidance (HLA-OG)

We define HLA Observation Guidance (HLA-OG) as explicitly passing a small subset of the most important HLAs of a given character as part of the OBS, rather than just an initial line of dialogue. This is adapted from the process used by Zhang et al. [30] and Wolf et al. [27], which we call Persona Profiling. Specifically, we pass four HLAs that are randomly drawn from the top 40 most important HLAs of the character (see Appendix B for details on how we choose these HLAs). We use HLA-OG during training of the LSRM and testing of all models, because the baselines (see Section 5.3) already follow a similar process (Persona Profiling) for training. For the Uniform Model, we train using Next Sentence Prediction (see Section 3.6). For testing, HLA-OG is necessary as it provides information about which HLAs the models should attempt to imitate in their response selection; just passing an initial line of dialogue would replicate a typical dialogue response task without HLAs. See Table 1. Further, we also test our LSRM by explicitly passing four HLAs of ‘none’ along with the initial line of dialogue as the OBS (No HLA-OG in Table 1).
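A sketch of how an HLA-OG observation might be assembled; the `your hla:` prefix (mirroring Persona-Chat's `your persona:` lines) and the exact formatting are assumptions:

```python
import random

def build_obs(initial_line, hlas_by_importance, top_k=40, n_guidance=4, hla_og=True):
    """Assemble an OBS for HLA-OG (a sketch; the exact formatting is an assumption).

    hlas_by_importance: the character's HLAs, most important first (see Appendix B).
    """
    if hla_og:
        pool = hlas_by_importance[:top_k]
        chosen = random.sample(pool, min(n_guidance, len(pool)))  # 4 of the top 40
    else:
        chosen = ["none"] * n_guidance  # the No HLA-OG test condition
    lines = [f"your hla: {h}" for h in chosen]  # prefix mirrors "your persona:"
    return "\n".join(lines + [initial_line])

print(build_obs("Leonard: We have to go.",
                ["Insufferable Genius", "No Social Skills", "Catchphrase",
                 "Brutal Honesty", "Ambiguous Disorder"]))
```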

Table 1: Example for Persona Profiling, HLA-OG, No HLA-OG, and Next Sentence Prediction. All lines under OBS are fed together as input to the dialogue retrieval model.

4.3 Training Details

BERT bi-ranker

is trained by us on the Persona-Chat dataset for the ConvAI2 challenge. Similar to Zhang et al. [30], we cap the length of the OBS at 360 tokens and the length of each candidate response at 72 tokens (tokens here refer to the WordPiece tokens used by BERT). We use a batch size of 64 and a learning rate of 5e-5, and perform warm-up updates for 100 iterations. We use an SGD optimizer with Nesterov's accelerated gradient [22]; the learning rate scheduler has a decay of 0.4 and reduces on plateau. With this setup, we recover up to 78% Hits@1 accuracy on Persona-Chat (see Section 5.4).

Uniform Model

is produced by fine-tuning the BERT bi-ranker on the dialogue data discussed in Section 4.1 using uniform character sampling. We use the same hyperparameters as the BERT bi-ranker, along with half-precision (float16) operations to increase the batch size, as recommended by Humeau et al. [10].

LSRM

is produced by fine-tuning the Uniform Model discussed above using negative character sampling, with the same hyperparameters and half-precision operations as above.

5 Evaluation

5.1 CSM Evaluation

We begin by evaluating the ability of the CSM component of our system to correctly generate the character space. To do so, during training, 30% of the character-HLA pairs (which are either 0 or 1) are masked and used as a validation set (see Figure 4). For each character $c$, the model generates a ranked list of the 12,815 unique HLAs, similar to Hu et al. [9]. We look at the recall of our CSM model, which measures the percentage of total ground truth HLAs (over all characters $c$) present within the model's top $N$ ranked HLAs for each $c$. That is:

$$\text{recall@}N = \frac{\sum_{c} \lvert G_c \cap T_c^N \rvert}{\sum_{c} \lvert G_c \rvert} \qquad (2)$$

where $G_c$ is the set of ground truth HLAs for $c$, and $T_c^N$ is the set of the top $N$ HLAs ranked by the model for $c$. With our chosen value of $N$, our model achieves 25.08% recall.
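A small sketch of this recall computation, with a toy check:

```python
def hla_recall_at_n(ground_truth, ranked, n):
    """Recall from Equation 2 (sketch): fraction of all ground-truth HLAs
    that appear in the model's top-n ranking for each character.

    ground_truth: dict character -> set of held-out HLAs
    ranked:       dict character -> list of all HLAs, best first
    """
    hit = sum(len(ground_truth[c] & set(ranked[c][:n])) for c in ground_truth)
    total = sum(len(ground_truth[c]) for c in ground_truth)
    return hit / total

# Toy check with two characters and n = 2:
gt = {"a": {"h1", "h2"}, "b": {"h3"}}
rk = {"a": ["h1", "h9", "h2"], "b": ["h3", "h4", "h5"]}
print(hla_recall_at_n(gt, rk, 2))  # (1 + 1) / 3 = 0.667
```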

To inspect the CSM performance, we use t-Distributed Stochastic Neighbor Embedding (t-SNE) [14] to reduce each high-dimensional data point to two dimensions via Kullback-Leibler divergence [13]. This allows us to map our character space into two dimensions, where similar characters from our embedding space have a higher probability of being mapped close together. We sampled characters from four different groups or regions. As seen in Figure 2, our learned character space effectively groups these characters, as similar characters are adjacent to one another in four regions.
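Such a projection can be reproduced with scikit-learn; the random matrix below merely stands in for the learned 36-dimensional character factors:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 36))  # stand-in for the learned character factors

# t-SNE minimizes the KL divergence between pairwise-similarity distributions
# in the 36-d space and the 2-d map, keeping similar characters close together.
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(coords.shape)  # (500, 2): one 2-d point per character
```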

5.2 Automatic Evaluation Setup

Five-Fold Cross Validation

is used for training and testing of the Uniform Model and LSRM. The folds are divided randomly by the TV shows in our dialogue data: we use the dialogue data for 80% of these shows as the four folds for training, and the dialogue data for the remaining 20% as the fifth fold for validation/testing. The dialogue data used is discussed in Section 4.1. This ensures that, no matter how our data is distributed, each part of it is tested, making our evaluation more robust to different characters. See Appendix C for five-fold cross validation details and statistics.
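This show-level split corresponds to grouped cross-validation; a toy sketch with scikit-learn's GroupKFold (three splits here instead of the paper's five, since the toy data has only three shows):

```python
from sklearn.model_selection import GroupKFold

# Folds are split by TV show, so no show appears in both training and testing.
# The lists below are toy stand-ins for the real dialogue data.
dialogues = ["line 1", "line 2", "line 3", "line 4", "line 5"]
shows = ["Friends", "Friends", "CSI", "Star Trek", "CSI"]

gkf = GroupKFold(n_splits=3)  # the paper uses 5 folds over 38 shows
for train_idx, test_idx in gkf.split(dialogues, groups=shows):
    train_shows = {shows[i] for i in train_idx}
    test_shows = {shows[i] for i in test_idx}
    assert not (train_shows & test_shows)  # no show leaks across the split
```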

Five Evaluation Characters

are chosen, one from each of the five testing sets above. Each is a well-known character from a separate TV show, and acts as a target character for evaluation of every model. We choose Sheldon Cooper from The Big Bang Theory, Jean-Luc Picard from Star Trek, Monica Geller from Friends, Gil Grissom from CSI, and Marge Simpson from The Simpsons. We choose characters of significantly different identities and profiles (intelligent scientist, ship captain, outgoing friend, police leader, and responsible mother, respectively) from shows of a variety of genres to ensure that we can successfully recover the language styles of various types of characters. We choose well-known characters because humans require knowledge on the characters they are evaluating (see Section 5.5).

For each of these five evaluation characters, all of the character's dialogue lines act as the ground truth responses. The initial dialogue lines are the corresponding dialogue lines to which these ground truth responses are responding. For each initial dialogue line, we randomly sample 19 other candidate responses from the associated testing set using uniform character sampling. Note that this is for evaluation, and hence we use the same uniform character sampling method for all models, including ALOHA; negative character sampling is used only during ALOHA's training.

5.3 Baselines

We compare against four dialogue system baselines: Kvmemnn, Feed Yourself, Poly-encoder, and a BERT bi-ranker baseline trained on the Persona-Chat dataset using the same training hyperparameters (including learning rate scheduler and length capping settings) described in Section 4.3. (See Section 2 for more details about the first three models.) For the first three, we use the provided models pretrained on Persona-Chat. We evaluate all four on the five evaluation characters discussed in Section 5.2.

5.4 Key Evaluation Metrics

Hits@n/N

is the accuracy of the correct ground truth response being within the top $n$ ranked candidate responses out of $N$ total candidates. We measure Hits@1/20, Hits@5/20, and Hits@10/20.

Mean Rank

is the average rank that a model assigns the ground truth response among the 20 total candidates.

Mean Reciprocal Rank (MRR)

[29] is the mean of the multiplicative inverses of the ranks of the correct answers over a sample of queries $Q$:

$$\text{MRR} = \frac{1}{\lvert Q \rvert} \sum_{i=1}^{\lvert Q \rvert} \frac{1}{\text{rank}_i} \qquad (3)$$

where $\text{rank}_i$ refers to the rank position of the correct response for the $i$-th query, and $\lvert Q \rvert$ is the total number of queries in $Q$.

F1-score

equals $2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$. For dialogue, precision is the fraction of words in the chosen response that are contained in the ground truth response, and recall is the fraction of words in the ground truth response that are contained in the chosen response.

BLEU

[17] generally indicates how close two pieces of text are in content and structure, with higher values indicating greater similarity. We report our final BLEU scores as the average of the 1- to 4-gram scores.
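A sketch computing Hits@n, mean rank, and MRR from the 1-based rank that a model assigns each ground-truth response (toy ranks shown):

```python
import numpy as np

def ranking_metrics(gt_ranks, n=1):
    """Hits@n, mean rank, and MRR from the 1-based rank of each ground truth."""
    ranks = np.asarray(gt_ranks, dtype=float)
    return {
        f"hits@{n}": float(np.mean(ranks <= n)),  # ground truth within the top n
        "mean_rank": float(ranks.mean()),         # average position of ground truth
        "mrr": float(np.mean(1.0 / ranks)),       # Equation 3
    }

# Toy example: five queries, each with the ground truth's rank among 20 candidates.
print(ranking_metrics([1, 3, 2, 1, 10], n=1))
```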

5.5 Human Evaluation Setup

We conduct a human evaluation with 12 participants, 8 male and 4 female, who are affiliated project researchers aged 20-39 at the University of [ANON]. We choose the same five evaluation characters as in Section 5.2. To control bias, each participant evaluates one or two characters. For each character, we randomly select 10 testing samples (each includes an initial line of dialogue along with 20 candidate responses, one of which is the ground truth) from the same testing data for the automatic evaluation discussed in Section 5.2.

These ten samples make up a single questionnaire presented in full to each participant evaluating the corresponding character, and the participant is asked to select the single top response they think the character would most likely respond with for each of the ten initial dialogue lines. See Figure 6 for an example. We mask any character names within the candidate responses to prevent human participants from using names to identify which show the response is from.

Figure 6: Example of what a human participant sees on each page of the questionnaire, along with their chosen response and the ground truth. As seen, there are multiple context correct (but not necessarily HLA correct) candidate responses.

Each participant is prescreened to ensure they have sufficient knowledge of the character they evaluate. We ask three prescreening questions where the participant has to identify an image, a relationship, and an occupation of the character (see Appendix D for prescreening questions and examples). All 12 of our participants passed the prescreening.

6 Results and Analysis

6.1 Evaluation Results

Table 2: Average automatic and human evaluation results.
Table 3: Average Hits@1/20 scores by evaluation character.

Table 2 shows the average results of our automatic and human evaluations (see Appendix E for a relative frequency histogram of the percentage of correctly chosen responses by human participants). Table 3 shows average Hits@1/20 scores by evaluation character. See Appendix F for detailed evaluation results. ALOHA is the model with HLA-OG during training and testing, and ALOHA (No HLA-OG) is the model with HLA-OG during training but tested with the four HLAs in the OBS marked as ‘none’ (see Section 4.2). See Appendix G for demo interactions between a human, the BERT bi-ranker baseline, and ALOHA for all five evaluation characters.

6.2 Evaluation Challenges

The evaluation of our task (retrieving the language style of a specific character) is challenging and hence the five-fold cross validation is necessary for the following reasons:

  • The ability to choose a context correct response without attributes of specific characters may be hard to separate from our target metric, which is the ability to retrieve the correct response of a target character by its HLAs. However, from manual observation, we noticed that in the 20 chosen candidate responses, there are typically numerous context correct responses, but only one ground truth for the target character (for an example, see Figure 6). Hence, a model that only chooses dialogue based on context is distinguishable from one that learns HLAs.

  • Retrieving responses for the target character depends on the other candidate responses. For example, dialogue retrieval performance for Grissom from CSI, which is a crime/police context, is higher than other evaluation characters (see Table 3), potentially due to other candidate responses not falling within the same crime/police context.

6.3 Performance: ALOHA vs. Humans

As observed from Table 2, ALOHA performs relatively close to humans. Human Hits@1/20 scores have a mean of 40.67% and a median over characters of 40%. The limited human evaluation sample size limits what can be inferred, but it indicates that the problem is solved to the extent that ALOHA is able to perform relatively close to humans on average. Notice that even humans do not perform extremely well, demonstrating that this task of character-based dialogue retrieval is more difficult than typical dialogue retrieval tasks [25, 3].

Looking more closely at each character from Table 3, we can see that human evaluation scores are higher for Sheldon and Grissom. This may be due to these characters having more distinct personalities, making them more memorable.

We also look at Pearson correlation values of the Hits@1/20 scores across the five evaluation characters. For human versus Uniform Model, this is -0.4694, demonstrating that the Uniform Model, without knowledge of HLAs, fails to imitate human impressions. For human versus ALOHA, this is 0.4250, demonstrating that our system is able to retrieve character responses somewhat similarly to human impressions. Lastly, for human versus the difference in scores between ALOHA and Uniform Model, this is 0.7815. The difference between ALOHA and the Uniform Model, which is based on the additional knowledge of the HLAs, is hence shown to improve upon the Uniform Model similarly to human impressions. This demonstrates that HLAs are indeed an accurate method of modeling human impressions of character attributes, and also demonstrates that our system, ALOHA, is able to effectively use these HLAs to improve upon dialogue retrieval performance.

6.4 Performance: ALOHA vs. Baselines

ALOHA, combined with the HLAs and dialogue dataset, achieves a significant improvement on the target character language style retrieval task compared to the baseline open-domain chatbot models. As observed from Table 2, ALOHA achieves a significant boost in Hits@n/N accuracy and other metrics for retrieving the correct response of five diverse characters with different identities (see Section 5.2).

6.5 Performance: ALOHA vs. Uniform Model

We observe a noticeable improvement in performance between ALOHA and the Uniform Model in recovering the language styles of specific characters, consistent across all five folds (see Tables 2 and 3). This indicates that a lack of knowledge of HLAs limits a model's ability to successfully recover the language style of specific characters. To the best of our knowledge, this is the first use of HLA-based character dialogue clustering to improve personality learning for chatbots.

ALOHA demonstrates an accuracy boost for all five evaluation characters, showing that the system is robust and stable and has the ability to recover the dialogue styles of fictional characters regardless of the character’s profile and identity, genre of the show, and context of the dialogue.

6.6 Performance: HLA-OG

As observed from Table 2, ALOHA performs slightly better overall compared to ALOHA (No HLA-OG). Table 3 shows that this slight performance increase is consistent across four of the five evaluation characters. In the case of Sheldon, the HLA-OG model performs a bit worse. This is possibly due to the large number of Sheldon’s HLAs (217) compared to the other four evaluation characters (average of 93.75), along with the limited amount of HLAs we are using for guidance due to the models’ limited memory. In general, HLA Observation Guidance during testing appears to improve upon the performance of ALOHA, but this improvement is minimal.

7 Conclusion and Future Work

We proposed Human Level Attributes (HLAs) as a novel approach to model human-like attributes of characters, and collected a large volume of dialogue data for various characters with complete and robust profiles. We also proposed and evaluated a system, ALOHA, that uses HLAs to recommend tailored responses traceable to specific characters, and demonstrated its outperformance of the baselines and ability to effectively recover language styles of various characters, showing promise for learning character or personality styles. ALOHA was also shown to be stable regardless of the character’s identity, genre of show, and context of dialogue.

Potential directions for future work include training ALOHA with a multi-turn response approach [30] that tracks dialogue over multiple responses, as we could not acquire multi-turn dialogue data for TV shows. Another possibility is modeling the dialogue counterpart (e.g. the dialogue of other characters speaking to the target character). Further, performing semantic text exchange on the chosen response with a model such as SMERTI [6] may improve ALOHA's ability to converse with humans, since a response may be context and HLA correct but semantically incorrect (e.g. the response may say the weather is sunny when it is actually rainy). HLA-aligned generative models are another area of exploration: generative models typically produce less fluent text, but further work in this area may lead to better results. Lastly, a more diverse and larger participant pool is required due to the limited size of our human evaluation.

References

  • [1] C. Danescu-Niculescu-Mizil and L. Lee (2011) Chameleons in imagined conversations: a new approach to understanding coordination of linguistic style in dialogs. In Proceedings of the 2nd workshop on cognitive modeling and computational linguistics, pp. 76–87. Cited by: §4.1.
  • [2] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §3.3, §3.6.
  • [3] E. Dinan, V. Logacheva, V. Malykh, A. Miller, K. Shuster, J. Urbanek, D. Kiela, A. Szlam, I. Serban, R. Lowe, et al. (2019) The second conversational intelligence challenge (convai2). arXiv preprint arXiv:1902.00098. Cited by: §2, §3.6, §6.3.
  • [4] E. Dinan, S. Roller, K. Shuster, A. Fan, M. Auli, and J. Weston (2018) Wizard of Wikipedia: knowledge-powered conversational agents. arXiv preprint arXiv:1811.01241. Cited by: §2.
  • [5] M. Eric and C. D. Manning (2017) Key-value retrieval networks for task-oriented dialogue. arXiv preprint arXiv:1705.05414. Cited by: §2.
  • [6] S. Y. Feng, A. W. Li, and J. Hoey (2019) Keep Calm and Switch On! Preserving Sentiment and Fluency in Semantic Text Exchange. arXiv preprint arXiv:1909.00088. Cited by: §7.
  • [7] S. Fortunato and D. Hric (2016) Community detection in networks: a user guide. Physics reports 659, pp. 1–44. Cited by: §3.5.
  • [8] B. Hancock, A. Bordes, P. Mazare, and J. Weston (2019) Learning from dialogue after deployment: feed yourself, chatbot!. arXiv preprint arXiv:1901.05415. Cited by: §2.
  • [9] Y. Hu, Y. Koren, and C. Volinsky (2008) Collaborative filtering for implicit feedback datasets. In 2008 Eighth IEEE International Conference on Data Mining, pp. 263–272. Cited by: §3.4.
  • [10] S. Humeau, K. Shuster, M. Lachaux, and J. Weston (2019) Real-time inference in multi-sentence tasks with deep pretrained transformers. arXiv preprint arXiv:1905.01969. Cited by: §4.3.
  • [11] O. P. John, S. Srivastava, et al. (1999) The big five trait taxonomy: history, measurement, and theoretical perspectives. Handbook of personality: Theory and research 2 (1999), pp. 102–138. Cited by: §1.
  • [12] M. Joshi, O. Levy, D. S. Weld, and L. Zettlemoyer (2019) BERT for coreference resolution: baselines and analysis. arXiv preprint arXiv:1908.09091. Cited by: §3.2.
  • [13] S. Kullback and R. A. Leibler (1951) On information and sufficiency. The annals of mathematical statistics 22 (1), pp. 79–86. Cited by: §5.1.
  • [14] L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605. Cited by: §5.1.
  • [15] P. Mazaré, S. Humeau, M. Raison, and A. Bordes (2018) Training millions of personalized dialogue agents. arXiv preprint arXiv:1809.01984. Cited by: §2.
  • [16] A. H. Miller, W. Feng, A. Fisch, J. Lu, D. Batra, A. Bordes, D. Parikh, and J. Weston (2017) ParlAI: a dialog research software platform. arXiv preprint arXiv:1705.06476. Cited by: §3.6.
  • [17] K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, ACL ’02, Stroudsburg, PA, USA, pp. 311–318. Cited by: §5.4.
  • [18] J. W. Pennebaker and L. A. King (1999) Linguistic styles: language use as an individual difference.. Journal of personality and social psychology 77 (6), pp. 1296. Cited by: §1.
  • [19] H. Rashkin, E. M. Smith, M. Li, and Y. Boureau (2019) Towards empathetic open-domain conversation models: a new benchmark and dataset. In Proceedings of the 57th Conference of the Association for Computational Linguistics, pp. 5370–5381. Cited by: §1.
  • [20] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl (2000) Application of dimensionality reduction in recommender system-a case study. Technical report Minnesota Univ Minneapolis Dept of Computer Science. Cited by: §3.3.
  • [21] M. S. Satu, M. H. Parvez, et al. (2015) Review of integrated applications with aiml based chatbot. In 2015 International Conference on Computer and Information Engineering (ICCIE), pp. 87–90. Cited by: §2.
  • [22] I. Sutskever, J. Martens, G. Dahl, and G. Hinton (2013) On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pp. 1139–1147. Cited by: §4.3.
  • [23] G. Takács, I. Pilászy, and D. Tikk (2011) Applications of the conjugate gradient method for implicit feedback collaborative filtering. In Proceedings of the fifth ACM conference on Recommender systems, pp. 297–300. Cited by: §3.4.
  • [24] TV Tropes (2004) Website: tvtropes.org. Cited by: §1, §3.1.
  • [25] J. Urbanek, A. Fan, S. Karamcheti, S. Jain, S. Humeau, E. Dinan, T. Rocktäschel, D. Kiela, A. Szlam, and J. Weston (2019) Learning to speak and act in a fantasy text adventure game. arXiv preprint arXiv:1903.03094. Cited by: §2, §3.6, §3.6, §6.3.
  • [26] J. Vig and K. Ramea (2019) Comparison of transfer-learning approaches for response selection in multi-turn conversations. Cited by: §2.
  • [27] T. Wolf, V. Sanh, J. Chaumond, and C. Delangue (2019) TransferTransfo: a transfer learning approach for neural network based conversational agents. arXiv preprint arXiv:1901.08149. Cited by: §2.
  • [28] M. Xuetao, F. Bouchet, and J. Sansonnet (2009) Impact of agent’s answers variability on its believability and human-likeness and consequent chatbot improvements. In Proc. of AISB, pp. 31–36. Cited by: §2.
  • [29] D. Zhang and W. S. Lee (2003) A web-based question answering system. Cited by: §5.4.
  • [30] S. Zhang, E. Dinan, J. Urbanek, A. Szlam, D. Kiela, and J. Weston (2018) Personalizing dialogue agents: i have a dog, do you have pets too?. arXiv preprint arXiv:1801.07243. Cited by: §1, §3.6, §3.6, §4.1, §7.