One Chatbot Per Person: Creating Personalized Chatbots based on Implicit User Profiles

08/20/2021 ∙ by Zhengyi Ma, et al. ∙ Université de Montréal

Personalized chatbots focus on endowing chatbots with a consistent personality to behave like real users, give more informative responses, and further act as personal assistants. Existing personalized approaches tried to incorporate several text descriptions as explicit user profiles. However, the acquisition of such explicit profiles is expensive and time-consuming, thus being impractical for large-scale real-world applications. Moreover, the restricted predefined profile neglects the language behavior of a real user and cannot be automatically updated together with the change of user interests. In this paper, we propose to learn implicit user profiles automatically from large-scale user dialogue history for building personalized chatbots. Specifically, leveraging the benefits of Transformer on language understanding, we train a personalized language model to construct a general user profile from the user's historical responses. To highlight the relevant historical responses to the input post, we further establish a key-value memory network of historical post-response pairs, and build a dynamic post-aware user profile. The dynamic profile mainly describes what and how the user has responded to similar posts in history. To explicitly utilize users' frequently used words, we design a personalized decoder to fuse two decoding strategies, including generating a word from the generic vocabulary and copying one word from the user's personalized vocabulary. Experiments on two real-world datasets show the significant improvement of our model compared with existing methods. Our code is available at




1. Introduction

Figure 1. An example of the personalized chatbot serving as an intelligent agent for user Tom when Tom is busy.

Faced with extensive information available on the Internet, it is very appealing to have an intelligent assistant that can provide the most relevant information (Ma et al., 2020; Zhou et al., 2020a, b), collaborate with us on day-to-day problems (Luo et al., 2019; Zhou et al., 2018a), or even act as our agent for some specific tasks (Wooldridge and Jennings, 1995). Towards this ultimate goal, in the dialogue system area, building digital agents has attracted more and more attention and has found some preliminary applications in our daily life (Luo et al., 2019; Qiu et al., 2017; Zhou et al., 2018a). In the future, some chit-chat between humans will inevitably be completed by digital agents. In this paper, we carry out a preliminary study toward this goal and focus on the problem of developing personalized chatbots. A personalized chatbot aims at leveraging personalized information (e.g., a predefined persona (Zhang et al., 2018; Liu et al., 2020)) to provide personalized responses when communicating with others. Such personalized information can help chatbots generate more consistent and informative replies. More importantly, if the personalized information can be well captured, the chatbots can behave like real users (e.g., serving as an agent of the user and giving similar responses to others when the user is busy, as shown in Figure 1), thus having the potential to be an intelligent agent with that specific personality (Wooldridge and Jennings, 1995).

Many models have been proposed for improving the chatbot’s capability to generate personalized responses. Early studies tried to integrate user ID embeddings into a sequence-to-sequence (Seq2Seq) model for identifying the user and generating user-related responses (Chan et al., 2019; Li et al., 2016b; Bak and Oh, 2019). Recently, some studies proposed to assign predefined personas to chatbots so as to generate more personalized responses (Zhang et al., 2018; Qian et al., 2018; Song et al., 2020b). They assumed that personality can be described in several sentences or attributes, and that such descriptions help chatbots generate more persona-related responses. Different from existing studies, in this work, we propose letting the chatbot learn the implicit user profile automatically from the user dialogue history and generate personalized responses based on the learned user profile. In this way, the chatbot can be personalized with the historical data of a user (e.g., user A), behave like this user, act as this user’s agent, and chat with any other users (e.g., user B, C, etc.).

Our idea is motivated by two observations. (1) Unlike explicit persona descriptions, user dialogue history is easy to gather on the user’s client devices. Obtaining explicit descriptions for massive numbers of users is impractical in real applications (Chan et al., 2019): First, users may be reluctant to set up their profile before using the chatbot (Gerlach and Kuo, 1991). Second, manually collecting user profiles is costly and time-consuming. Furthermore, even if the user profile is collected, it cannot be updated as user interests change, and thus may become ineffective over time. Finally, a fixed set of properties is not suitable for describing all users. (2) The user dialogue history contains massive personalized information, which is suitable for learning user profiles automatically. As shown in Figure 1, the dialogue history of a user includes their historical responses and the corresponding posts issued by other users. Intuitively, users’ historical responses can often reflect their language style, background knowledge, frequently used words, and even their interests. For example, many electronic device names may appear in the historical responses of an electronics hobbyist (e.g., MacBook in Figure 1). Besides, the interaction content and style between a specific user and others can be captured from the historical post-response pairs. When faced with a new input post, the chatbot can look through the historical data, check how the user has responded to a similar post before, and apply similar interactions to generate a suitable response. In addition, user profiles learned from historical data can be gradually updated as more data are collected. In summary, the user dialogue history is easy to obtain and appropriate for building user profiles.

To achieve our idea, we propose a model DHAP for personalized chatbots, which focuses on learning implicit user profile from user Dialogue History Automatically and generating Personalized responses. In our model, a general user profile representation is first constructed from the user’s historical responses to capture the general information, including user interest, background knowledge, and speaking style. This is implemented by a personalized language model based on Transformer. Then, we design a personalized post encoder to construct the personalized post representation. The general user profile is utilized in the post encoder to better capture the semantic information of the input post. Next, we build a key-value memory network to store the user’s historical post-response pairs. Based on this history memory, the dynamic post-aware user profile is built by highlighting the historical responses relevant to the current input post. Finally, we design a personalized decoder to fuse the learned user profile into the response generation process. The personalized decoder can switch between generating a word from a generic vocabulary and copying a word from the user’s personalized vocabulary. Experimental results on two large-scale datasets show that our proposed DHAP significantly outperforms existing response generation models in various evaluation metrics.

Our contributions are three-fold: (1) We learn the implicit user profile from user’s dialogue history automatically for generating personalized responses. By this means, our method can be applied without additional annotations on user profiles, and create personalized chatbots as user’s digital agents. (2) We build two kinds of user profiles from the dialogue history, including the general user profile reflecting the user’s general information and the dynamic post-aware user profile to apply similar interactions in the historical data for the current input. (3) We design a personalized decoder to coordinate two personalized decoding strategies, including generating a word from the generic vocabulary and copying a word from the personalized vocabulary to leverage the user’s word preference.

Figure 2. The overall structure of the proposed model DHAP, which consists of (1) a history encoder, (2) a personalized post encoder, (3) a user history memory, and (4) a personalized decoder.

2. Related Work

Open-domain Chatbots and Response Generation

Open-domain chatbots have attracted more and more attention due to their broad use in real-world applications, such as Microsoft XiaoIce (Zhou et al., 2018a). Typical methods can be categorized into two groups: retrieval-based and generation-based. Retrieval-based methods aim to select a suitable response from a large repository (Wu et al., 2017; Zhu et al., 2020b, 2021), while generation-based methods aim at generating a response from scratch (Shang et al., 2015; Sordoni et al., 2015; Serban et al., 2016; Zhu et al., 2020a). In this study, we focus on the response generation problem.

Some early studies treat the response generation task as a statistical machine translation problem because of its end-to-end and data-driven features (Ritter et al., 2011; Vinyals and Le, 2015). More recently, with the progress of deep learning, Seq2Seq methods have been applied to response generation and achieve great performance (Serban et al., 2016; Vinyals and Le, 2015; Serban et al., 2017; Li et al., 2016a). Many Seq2Seq-based extensions have been applied to tackle the “safe response” problem (Li et al., 2016a); to incorporate external knowledge (Zhou et al., 2018c); to generate responses with emotions or personas (Zhou et al., 2018b; Li et al., 2016b; Qian et al., 2018); and to model the hierarchical structure of the dialogue context (Serban et al., 2016, 2017).

Personalized Chatbots

Endowing chatbots with a coherent personality is a challenging but necessary step to achieve the ultimate goal of building an intelligent assistant. With personality, a chatbot can generate more informative and user-specific responses (Li et al., 2016b; Zhang et al., 2018), and has the potential to perform similar behaviors as real humans.

Traditional personalized chatbots focus on modeling the user’s psychological behavior such as the “Big Five” of speakers (Mairesse and Walker, 2007). In recent years, deep-learning based methods were proposed to learn the persona information directly from large-scale dialogue datasets via end-to-end neural networks (Li et al., 2016b; Chan et al., 2019; Zhang et al., 2018; Qian et al., 2018). Some researchers first tried to input the user ID embeddings into the decoder of a Seq2Seq model to generate more personalized responses (Li et al., 2016b; Al-Rfou et al., 2016; Chan et al., 2019; Bak and Oh, 2019). Although users can be identified by their IDs, the personalization performance is limited because no user-related information is used in the model. Therefore, another group of researchers proposed assigning explicit profiles (personas) to chatbots to generate personalized responses. For example, Zhang et al. (2018) published the PERSONA-CHAT dataset, in which each user is assigned several persona description sentences and the conversations between users are collected. With explicit profile descriptions, chatbots can learn to generate more consistent and informative dialogues. On this dataset, many methods have achieved encouraging performance, such as variational autoencoders (Song et al., 2019), pre-trained language models (Wolf et al., 2019; Song et al., 2020a), and multi-task modeling (Song et al., 2020b; Welleck et al., 2019; Song et al., 2020a). These explicit persona-based methods enjoy the high quality of predefined personas, but the acquisition of these persona data is expensive and even impossible when applied to real-world systems (Chan et al., 2019).

In this work, we propose using the implicit user profile to drive the response generation. Such a user profile is learned implicitly by neural models from the user dialogue history rather than being predefined explicitly. Because the user dialogue history contains abundant user-specific information and is easily accessible in real-world applications, our method is more applicable in practice.

3. Methodology

In this section, we first provide an overview of our proposed model DHAP. The details of each component are provided later, and the model training is introduced finally.

3.1. The Overview of DHAP

Suppose that for a user $u$, we have their dialogue history $H$ including a series of responses issued by $u$ and the corresponding posts: $H = ((p_1, r_1), \ldots, (p_n, r_n))$, where $n$ is the number of historical post-response pairs. Note that the posts here can be issued by different users, but the responses are all issued by the same user $u$. We call them historical posts and historical responses respectively in the following sections. Under the single-turn setting, given an input post $X = (x_1, \ldots, x_{L_X})$ and the user dialogue history $H$, with sequence-to-sequence modeling, our task is to generate a personalized response $Y = (y_1, \ldots, y_{L_Y})$ as:

$$p(Y \mid X, H) = \prod_{t=1}^{L_Y} p(y_t \mid y_{<t}, X, H), \tag{1}$$

where $y_t$ denotes the word generated at the $t$-th step, and $y_{<t}$ denotes the previously generated words $(y_1, \ldots, y_{t-1})$. It is worth noting that the dialogue history here includes dialogues between the user and several other users, thus the history is not a multi-turn dialogue between two fixed interlocutors.

To compute the probability $p(y_t \mid y_{<t}, X, H)$, we design a model called DHAP, which stands for learning user profiles from Dialogue History Automatically for Personalized chatbots. The structure of DHAP is shown in Figure 2. We briefly introduce the key components of DHAP as follows. The number of each module corresponds to the mark in the figure. In general, DHAP considers personalized information on both the encoder side and the decoder side.

3.1.1. Encoder

DHAP has two different encoders, which encode the input post and user’s historical responses, respectively:

Part (1): History encoder and general user profile. Since abundant personalized information (e.g., background knowledge and speaking style) is often hidden in the user’s historical responses, DHAP first establishes a Transformer-based personalized language model to encode the historical responses $R = (r_1, \ldots, r_n)$, then summarizes the general user profile $\mathbf{e}^G$ based on the historical responses. As $\mathbf{e}^G$ does not depend on the input $X$, we call it the general user profile. “General” does not mean that it is shared among all users or cannot be updated over time. On the contrary, every user has their own profile, which can be updated once they issue a new response.

Part (2): Personalized post encoder. DHAP also has an encoder for the input post, which is implemented by a bidirectional GRU (BiGRU). To make the post encoder aware of the user’s personalized information, we use the general user profile $\mathbf{e}^G$ for initialization. Consequently, the post is represented by the hidden state sequence $(\mathbf{h}^X_1, \ldots, \mathbf{h}^X_{L_X})$. These representations will be dynamically aggregated as a personalized post representation $\mathbf{X}^t$ by an attention mechanism at the decoding step $t$.

3.1.2. Decoder

In the decoder side, DHAP incorporates the personalized information in two perspectives:

Part (3): User history memory and dynamic post-aware user profile. In the general user profile, all historical responses are considered in a general view. However, for a specific input post $X$, the historical responses may play different roles. Intuitively, the historical responses relevant to $X$ are more valuable than the irrelevant ones. To highlight the historical responses relevant to the input post, DHAP builds a key-value memory $\mathcal{M}$ of historical post-response pairs. Given $\mathbf{X}^t$ as a personalized representation of $X$, DHAP builds a dynamic post-aware user profile $\mathbf{e}^D_t$ by selecting and aggregating the historical responses from the user history memory. As $\mathbf{e}^D_t$ changes with different input posts and summarizes the related user history information, we call it the dynamic post-aware user profile.

Part (4): Personalized decoder. Finally, the personalized post representation $\mathbf{X}^t$, the general user profile $\mathbf{e}^G$, and the dynamic post-aware user profile $\mathbf{e}^D_t$ are fused together to decode the response sequentially. Inspired by CopyNet (Gu et al., 2016), DHAP can switch between generating a word from a generic vocabulary (general decoding mode $m_g$) and copying a word from the user’s personalized vocabulary (copy decoding mode $m_c$) as:

$$p(y_t \mid y_{<t}, X, H) = p(m_g)\, p(y_t \mid m_g) + p(m_c)\, p(y_t \mid m_c), \tag{2}$$

where $p(m_g)$ and $p(m_c)$ are computed by our designed decoding switcher. $p(y_t \mid m_g)$ and $p(y_t \mid m_c)$ are calculated based on the same inputs (e.g., $\mathbf{X}^t$, $\mathbf{e}^G$, and $\mathbf{e}^D_t$) with different functions.

In the remaining part of this section, we will introduce these four components of DHAP in detail.

3.2. History Encoder and General User Profile

Based on our observation, there is a large amount of personalized information in the user’s historical responses. For example, a fan of cricket may talk a lot about cricket topics with others. Furthermore, different users can hold different speaking styles, such as enjoying speaking slang. Therefore, our first idea is to devise a model for learning such personalized language information in a general view.

Inspired by the strong ability of Transformer (Vaswani et al., 2017) to aggregate context and model sequences, we use a Transformer encoder to learn contextual representations of historical responses. In particular, we first add special tokens and concatenate all historical responses as $R^w = [\text{[CLS]}; r_1; \text{[SEP]}; \ldots; r_n; \text{[SEP]}]$, where each response $r_i = (w^i_1, \ldots, w^i_{L_i})$ contains $L_i$ words, and $[;]$ is the concatenation operation. A “[SEP]” token is added at the tail of each response for segmentation, while a “[CLS]” token is added at the sequence head for summary. Then we map all words and special tokens into embeddings, and learn their contextual representations:

$$[\mathbf{e}^G; \mathbf{E}^R] = \mathrm{Trm}_{\mathrm{enc}}\big([\mathbf{e}_0; \mathbf{e}_1; \ldots; \mathbf{e}_K]\big), \tag{3}$$

where $\mathbf{e}_k$ is the sum of the word embedding, segment embedding, and position embedding of the $k$-th token of $R^w$ (with $k = 0$ being the “[CLS]” token). These three embeddings are used in a way similar to BERT (Devlin et al., 2019). The representation of the “[CLS]” token ($\mathbf{e}^G$) summarizes the information in the entire set of historical responses, thus we call it the general user profile. $\mathbf{E}^R$ contains the contextual representations of words in historical responses. For $r_i$, we denote its contextual representations as $(\mathbf{h}^{r_i}_1, \ldots, \mathbf{h}^{r_i}_{L_i})$. $\mathrm{Trm}_{\mathrm{enc}}(\cdot)$ is a multi-layer bidirectional Transformer encoder identical to the original implementation described in (Vaswani et al., 2017).

With the history encoder, we obtain: (1) a general user profile $\mathbf{e}^G$, which summarizes all historical responses and contains personalized information of the user; and (2) the contextual representations of words in the historical responses $\mathbf{E}^R$. These representations will be used to build the dynamic post-aware user profile in Section 3.4.
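The input layout of the history encoder can be illustrated with a small sketch. It only shows how historical responses might be flattened into one sequence with a summary token at the head and a separator after each response; the token strings `[CLS]` and `[SEP]`, the alternating segment ids, and the function name `build_history_sequence` are assumptions made for illustration, not details confirmed by the text.

```python
# Sketch: flatten a user's historical responses into one BERT-style sequence.
# Token strings and the alternating segment scheme are assumptions.
CLS, SEP = "[CLS]", "[SEP]"

def build_history_sequence(responses):
    """responses: list of token lists, oldest first."""
    tokens, segments, spans = [CLS], [0], []
    for i, resp in enumerate(responses):
        start = len(tokens)
        tokens.extend(resp)
        spans.append((start, start + len(resp)))  # word positions of response i
        tokens.append(SEP)                        # separator closes the response
        segments.extend([i % 2] * (len(resp) + 1))
    positions = list(range(len(tokens)))
    return tokens, segments, positions, spans

tokens, segments, positions, spans = build_history_sequence(
    [["i", "love", "my", "macbook"], ["cricket", "tonight"]]
)
```

The `spans` list records which positions belong to each response, which is what a later pooling step over per-response word representations would need.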

3.3. Personalized Post Encoder

In real-world applications, the posts are often very short and even ambiguous. Thus, building an accurate encoding of the input post is difficult, which further leads to poor quality of the generated responses (Li et al., 2016b; Shang et al., 2015). Fortunately, with personalized background knowledge, the chatbot receives more input information and is more likely to capture the semantic information of the post. Let us use an example to explain this: given a post “The new MAC is so beautiful”, different users may have different understandings. For a programmer, “MAC” may refer to Apple’s laptop; but a fashion enthusiast may associate “MAC” with the lipstick. Therefore, the user’s history can help disambiguate the word “MAC” and provide more background knowledge for the input post, so that the post representation can be significantly enhanced. We call this encoder “personalized post encoder” - the encoded representations of the same post can be different for various users, because these users may have different profiles and different understandings of the same post.

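Specifically, DHAP employs a BiGRU to encode the input post. Here we choose RNN-based architectures because they are better at capturing local correlations and encoding positional information than Transformers for short texts (Xia et al., 2019; Neishi and Yoshinaga, 2019). To provide the encoder with personalized information, DHAP uses the general user profile $\mathbf{e}^G$ to initialize the hidden states of the BiGRU. Given the post $X = (x_1, \ldots, x_{L_X})$, its representations are built as:

$$(\mathbf{h}^X_1, \ldots, \mathbf{h}^X_{L_X}) = \mathrm{BiGRU}\big((\mathbf{e}^x_1, \ldots, \mathbf{e}^x_{L_X});\ \mathbf{e}^G\big),$$

where $\mathbf{e}^x_j$ is the embedding of $x_j$ obtained by the embedding table, and $\mathbf{e}^G$ serves as the initial hidden state.

These hidden states draw personalized information from the general user profile, and in the decoding phase, they are aggregated by the decoding hidden state $\mathbf{s}_t$ through an attention mechanism:

$$\mathbf{X}^t = \sum_{j=1}^{L_X} \alpha_{t,j}\, \mathbf{h}^X_j, \tag{6}$$

$$\alpha_{t,j} = \frac{\exp\big(a(\mathbf{s}_t, \mathbf{h}^X_j)\big)}{\sum_{k=1}^{L_X} \exp\big(a(\mathbf{s}_t, \mathbf{h}^X_k)\big)}, \tag{7}$$

where $a(\cdot, \cdot)$ denotes the attention scoring function. The detailed calculation and updating scheme of the decoding state $\mathbf{s}_t$ at time step $t$ will be described in Section 3.5.4. Based on the attention mechanism, the personalized post representation $\mathbf{X}^t$ can dynamically focus on some important words of the post according to the current decoding state. In the next section, DHAP will use $\mathbf{X}^t$ to build the dynamic post-aware user profile.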
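The attention step described above can be sketched in a few lines of plain Python. The dot-product scoring function, the toy vectors, and the names `attend`, `decoder_state`, and `post_states` are assumptions for illustration; the scoring function actually used by the model is not specified in this section.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(query, states):
    """Return (weights, weighted sum of states) for one decoding step."""
    weights = softmax([dot(query, h) for h in states])  # assumed dot-product score
    dim = len(states[0])
    context = [sum(w * h[d] for w, h in zip(weights, states)) for d in range(dim)]
    return weights, context

decoder_state = [1.0, 0.0]                           # toy decoding hidden state
post_states = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]   # toy encoder states of the post
weights, post_repr = attend(decoder_state, post_states)
```

In this toy example the first word receives the largest weight because its state aligns with the decoder state, so the aggregated post representation leans toward it.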

3.4. User History Memory and Dynamic Post-aware User Profile

With the general user profile, DHAP can capture personalized information of a user in a general view. However, when faced with a new input post, the historical data may play different roles. For example, when a crazy fan of cricket meets a post about cricket, they will be talkative and post a lot of things. But when they face other daily topics, they may behave much more gently. Hence, it is valuable to dynamically select the information that is most relevant to the input, and the chatbot can behave differently as the input varies. Following this idea, we propose to dynamically aggregate historical responses that are highly relevant to the current input post, and leverage them as a reference to drive the response generation.

To achieve this, DHAP uses a key-value memory network (Miller et al., 2016) to store the user’s historical post-response pairs. Then the personalized post representation is used as the query to select highly related keys (historical posts) from the user history memory, and their corresponding values (historical responses) are aggregated as the dynamic post-aware user profile.

3.4.1. User History Memory

We firstly transform the historical post-response pairs into key-value pairs and build the memory.

As we discussed earlier, the historical posts are usually issued by different users, so their language styles and topics may vary. Under this circumstance, it is more reasonable to treat them independently, so DHAP applies a BiGRU to represent each historical post separately. In our implementation, this BiGRU shares parameters with the personalized post encoder (introduced in Section 3.3). Consider the $i$-th historical post $p_i$: its representation is computed by sum pooling over the word dimension as $\mathbf{p}_i = \sum_{j} \mathbf{h}^{p_i}_j$, where $\mathbf{h}^{p_i}_j$ is the hidden state of the BiGRU for the $j$-th word in $p_i$. For all historical posts $(p_1, \ldots, p_n)$, their representations are denoted as $(\mathbf{p}_1, \ldots, \mathbf{p}_n)$. Similarly, the representation of each historical response is computed by the same pooling strategy as $\mathbf{r}_i = \sum_{j} \mathbf{h}^{r_i}_j$, where $\mathbf{h}^{r_i}_j$ is the contextual representation of the $j$-th word in $r_i$. Different from historical posts, the contextual word representations of historical responses are obtained by the history encoder in Equation (3). Thus, all historical responses are represented as $(\mathbf{r}_1, \ldots, \mathbf{r}_n)$.

Finally, we build the user history memory by using the post representations as keys and the corresponding response representations as values, i.e., $\mathcal{M} = \{(\mathbf{p}_i, \mathbf{r}_i)\}_{i=1}^{n}$.
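As a rough illustration of the memory construction, the sketch below sum-pools word-level vectors and pairs each pooled historical post (key) with its pooled response (value). The word vectors are toy numbers standing in for encoder outputs, and `sum_pool` and `build_memory` are hypothetical helper names.

```python
def sum_pool(word_states):
    """Sum word-level vectors into one vector."""
    dim = len(word_states[0])
    return [sum(vec[d] for vec in word_states) for d in range(dim)]

def build_memory(post_word_states, response_word_states):
    """Pair pooled historical posts (keys) with pooled responses (values)."""
    assert len(post_word_states) == len(response_word_states)
    keys = [sum_pool(p) for p in post_word_states]
    values = [sum_pool(r) for r in response_word_states]
    return list(zip(keys, values))

# Two historical post-response pairs with 2-dimensional toy word vectors.
memory = build_memory(
    post_word_states=[[[1.0, 0.0], [1.0, 1.0]], [[0.0, 2.0]]],
    response_word_states=[[[0.5, 0.5]], [[1.0, 0.0], [0.0, 3.0]]],
)
```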

3.4.2. Dynamic Post-aware User Profile

After building the user history memory, DHAP can select and aggregate the most relevant historical responses and build the dynamic post-aware user profile based on the input post. Specifically, the personalized representation of the input post $\mathbf{X}^t$ obtained in Equation (6) is used as the query and attends to the memory keys to find the most relevant historical posts. The relevance is measured by the attention weights. Then the corresponding historical responses are summed up based on the normalized weights to construct the dynamic profile $\mathbf{e}^D_t$:

$$\mathbf{e}^D_t = \sum_{i=1}^{n} \beta_{t,i}\, \mathbf{r}_i,$$

where $\mathbf{r}_i$ is the representation of the $i$-th historical response and $\beta_{t,i}$ is its attention weight, computed from the personalized post representation $\mathbf{X}^t$ and the key $\mathbf{p}_i$ in a way similar to Equation (7). Note that the dynamic post-aware user profile is computed at each decoding time step $t$. The most relevant information hidden in historical responses can thus be selected to help response generation.
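The memory read can be sketched as a soft lookup: the query is scored against every key, the scores are normalized, and the values are averaged with those weights. The dot-product score and the names `read_memory` and `memory` are illustrative assumptions, not the model's exact formulation.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def read_memory(query, memory):
    """memory: list of (key, value) vector pairs; returns (weights, profile)."""
    scores = [sum(q * k for q, k in zip(query, key)) for key, _ in memory]
    weights = softmax(scores)  # relevance of each historical post to the query
    dim = len(memory[0][1])
    profile = [
        sum(w * value[d] for w, (_, value) in zip(weights, memory))
        for d in range(dim)
    ]
    return weights, profile

memory = [([1.0, 0.0], [10.0, 0.0]), ([0.0, 1.0], [0.0, 10.0])]
weights, profile = read_memory([3.0, 0.0], memory)
```

Because the query matches the first key, the resulting profile is dominated by the first stored response; a different query at another decoding step would shift the weights.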

3.5. Personalized Decoder

Finally, the generation probability of responses can be calculated from the personalized post representation $\mathbf{X}^t$, the general user profile $\mathbf{e}^G$, and the dynamic post-aware user profile $\mathbf{e}^D_t$. Inspired by CopyNet (Gu et al., 2016), in addition to leveraging the personalized information captured by the implicit user profile, we construct a personalized vocabulary so that the model is allowed to directly select personalized words that the user frequently used in history.

Specifically, the probability of the word $y_t$ generated in the personalized decoder is computed as Equation (2), where $p(m_g)$ is the probability of the general decoding mode and $p(m_c)$ is the probability of the copy decoding mode. They are computed by our proposed decoding switcher. $p(y_t \mid m_g)$ and $p(y_t \mid m_c)$ are the probabilities of generating $y_t$ under the two modes, respectively.

It is worth noting that: (1) The switching probabilities $p(m_g)$ and $p(m_c)$ are both in $[0, 1]$. Thus, it is a “soft” decoding switcher. (2) The generic vocabulary also contains the words in the personalized vocabulary. The generation probability of $y_t$ is obtained by the sum of its probabilities under the two decoding modes. Therefore, DHAP is merely biased toward the personalized words rather than restricted to them.
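The two notes above can be made concrete with a small numeric sketch of the soft mixture: each word's final probability is the mode-weighted sum of its probability under general decoding and under copy decoding. The toy distributions, the mode probabilities 0.6 and 0.4, and the function name `mix_distributions` are assumptions chosen for illustration.

```python
def mix_distributions(p_general_mode, p_copy_mode, generic_probs, copy_probs):
    """p(w) = p(general) * p(w | general) + p(copy) * p(w | copy)."""
    vocab = set(generic_probs) | set(copy_probs)
    return {
        w: p_general_mode * generic_probs.get(w, 0.0)
        + p_copy_mode * copy_probs.get(w, 0.0)
        for w in vocab
    }

generic = {"the": 0.5, "laptop": 0.3, "macbook": 0.2}  # generic vocabulary
copied = {"macbook": 1.0}                              # personalized vocabulary
final = mix_distributions(0.6, 0.4, generic, copied)
```

Here "macbook" is a personalized word that also exists in the generic vocabulary, so it collects probability mass from both modes, while generic-only words remain reachable.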

3.5.1. Decoding switcher

The decoding switcher determines the probability of the two decoding modes, i.e., predicting a word from the generic vocabulary to maintain sentence fluency, or copying a word directly from the personalized vocabulary to make the response more informative and personalized. Specifically, DHAP computes the switching probability based on the matching degree between the decoder state and the concatenation of the personalized post representation, general user profile, and dynamic post-aware user profile, and further calculates the two decoding mode probabilities:

$$[p(m_g),\ p(m_c)] = \mathrm{softmax}(\mathbf{d}_t),$$

where $\mathbf{d}_t \in \mathbb{R}^2$ is the matching degree vector used to estimate the two mode probabilities, computed from the decoding hidden state $\mathbf{s}_t$ at step $t$ and the concatenation $[\mathbf{X}^t; \mathbf{e}^G; \mathbf{e}^D_t]$. The softmax function guarantees that $p(m_g) + p(m_c) = 1$.
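A two-way soft switch of this kind can be sketched as follows. The sketch concatenates the decoder state with the three personalized vectors, maps them to two logits, and applies a softmax so the mode probabilities sum to one. The linear matcher (`weights`, `bias`) is a hypothetical stand-in for the matching function, whose exact form is not given here.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mode_probabilities(decoder_state, post_repr, general_profile,
                       dynamic_profile, weights, bias):
    """Return (p_general, p_copy) from a hypothetical linear matcher."""
    # List concatenation of the four input vectors.
    features = decoder_state + post_repr + general_profile + dynamic_profile
    logits = [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]
    p_general, p_copy = softmax(logits)
    return p_general, p_copy

p_general, p_copy = mode_probabilities(
    decoder_state=[1.0], post_repr=[0.5], general_profile=[-0.5],
    dynamic_profile=[2.0],
    weights=[[0.1] * 4, [0.2] * 4],  # toy parameters, one row per mode
    bias=[0.0, 0.0],
)
```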


3.5.2. Personalized general decoding

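In the general decoding mode, the decoder predicts a word from the generic vocabulary:

$$p(y_t \mid m_g) = \mathrm{softmax}\big(f_g(\mathbf{s}_t, \mathbf{X}^t, \mathbf{e}^G, \mathbf{e}^D_t)\big),$$

where $f_g$ maps the decoding hidden state $\mathbf{s}_t$, the personalized post representation $\mathbf{X}^t$, the general user profile $\mathbf{e}^G$, and the dynamic post-aware user profile $\mathbf{e}^D_t$ to scores over the generic vocabulary.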
3.5.3. Personalized copy decoding

The personalized vocabulary of user $u$ is composed of the words that appear in their historical responses. DHAP can directly select a word from this vocabulary to generate a more personalized response. Inspired by the copy mechanism (Gu et al., 2016), the probability of selecting a word $w$ is computed as:

$$p(y_t = w \mid m_c) = \sum_{k:\, w_k = w} \gamma_{t,k},$$

where $w_k$ is the $k$-th word in the historical responses and $\gamma_{t,k}$ is the attention weight calculated by the personalized post representation attentively reading the representations of historical responses with the same attention process as in Equation (6).
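In copy mechanisms of this style, a word that occurs several times in the source collects the attention mass of all its occurrences. The sketch below shows that accumulation over the words of the historical responses; the attention weights are toy numbers assumed to be already normalized, and `copy_distribution` is a hypothetical helper name.

```python
from collections import defaultdict

def copy_distribution(history_words, attention_weights):
    """Sum attention mass over every position where a word occurs."""
    probs = defaultdict(float)
    for word, weight in zip(history_words, attention_weights):
        probs[word] += weight
    return dict(probs)

history_words = ["macbook", "is", "great", "macbook"]
attention = [0.4, 0.1, 0.2, 0.3]  # one weight per historical word position
p_copy = copy_distribution(history_words, attention)
```

Only words that appear in the user's history receive copy probability, which is how the personalized vocabulary biases generation toward the user's own wording.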

3.5.4. Decoder state updating

DHAP applies a GRU as the decoder. The hidden state $\mathbf{s}_t$ at decoding step $t$ is updated by the GRU from the previous state $\mathbf{s}_{t-1}$, taking the embedding vector of the last generated word $y_{t-1}$ as input. The decoding states are initialized by the last hidden state of the personalized post encoder.
3.6. Training and Optimization

Our goal is to maximize the generation probability of the target response given the input post and the user’s dialogue history. A length penalty is applied as in (Li et al., 2016b) to alleviate the generation of meaningless responses. As a result, the loss function of DHAP is defined as:

$$\mathcal{L} = -\sum_{t=1}^{L_Y} \log p(y_t \mid y_{<t}, X, H) - \eta L_Y,$$

where $\eta$ is a hyper-parameter to control the associated length penalty weight. $p(y_t \mid y_{<t}, X, H)$ is the generation probability of word $y_t$ based on the given input post and the user’s history, which is computed in Equation (2). All parameters are optimized by the loss function and the whole model is trained in an end-to-end manner.
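A minimal numeric sketch of such an objective is given below: the negative log-likelihood of the target words combined with a length term weighted by a hyper-parameter. The sign convention (subtracting the weighted length so that longer targets are favored) and the names `length_penalized_loss` and `eta` are assumptions for illustration and should be checked against the original formula.

```python
import math

def length_penalized_loss(step_probs, eta):
    """NLL of the target words minus eta * target length (assumed sign)."""
    nll = -sum(math.log(p) for p in step_probs)
    return nll - eta * len(step_probs)

# Probabilities assigned to the two gold words of a toy target response.
plain = length_penalized_loss([0.5, 0.25], eta=0.0)
penalized = length_penalized_loss([0.5, 0.25], eta=0.1)
```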

4. Experiments

4.1. Datasets

Although there are many public datasets for response generation, as far as we know, none of them contain user identification. To collect each user’s dialogue history and evaluate the effectiveness of our model, we use two datasets extracted from two open online chatting forums, i.e., Weibo and Reddit. The two datasets contain massive dialogue utterances (i.e., post-response pairs) and user identification information, thus we can sample the data by users. To guarantee enough personalized information, we retain users with more than ten utterances to maintain an effective dialogue history. Each utterance is used as the target response for generation, while its former responses and the corresponding posts are treated as the dialogue history. We divide the utterances of each user into training, validation, and test sets at a ratio of 8:1:1 in time order. Thus, for any given user, the records in the validation and test sets are later in time than the records in the training set.
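A per-user chronological split of this kind can be sketched as follows. The record layout `(timestamp, post, response)`, the rounding rule (truncation, with the remainder going to the test set), and the function name `split_user_utterances` are assumptions for illustration.

```python
def split_user_utterances(utterances, ratios=(0.8, 0.1, 0.1)):
    """utterances: list of (timestamp, post, response) for one user."""
    ordered = sorted(utterances, key=lambda u: u[0])  # oldest first
    n = len(ordered)
    n_train = int(n * ratios[0])
    n_valid = int(n * ratios[1])
    train = ordered[:n_train]
    valid = ordered[n_train:n_train + n_valid]
    test = ordered[n_train + n_valid:]  # remainder goes to the test set
    return train, valid, test

records = [(t, f"post{t}", f"resp{t}") for t in (5, 1, 9, 3, 7, 2, 8, 4, 10, 6)]
train, valid, test = split_user_utterances(records)
```

Sorting before slicing is what guarantees that every validation and test record of a user is later than that user's training records.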

The Weibo dataset is a subset of PChatbotW (Qian et al., 2021), which is collected from Weibo for the one-year period beginning from Sept. 10, 2018. On Weibo, a user can post short messages visible to the public, which will be referred to as posts. Other users can make comments on a published post, which will be referred to as responses. For data cleaning, we remove hashtags, URLs, emoticons, and duplicate text as in (Qian et al., 2021). We also remove the utterances whose length is less than five words or more than 100 words. We use a sample scale comparable to that of (Chan et al., 2019) to conduct our experiments. The dataset comprises 300,000 users and 8,618,374 utterances, in total 31M words.

The Reddit dataset is extracted from comment chains scraped from Reddit from Dec. 1, 2015 to Oct. 30, 2018 (Zhang et al., 2020). Since Reddit discussions can be naturally expanded as tree-structured reply chains, we pair each parent node with all of its child nodes and construct multiple post-response pairs. We treat the parent node and the child node as the post and the response, respectively. As a result, a parent node can be a submission or a comment, while a child node only refers to a comment. For each submission, we use its title as the post text. We clean the raw data by removing instances containing word repetitions, offensive words, or multi-language sentences. The dataset contains 315,340 users and 24,162,464 utterances, in total 55M words.
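The parent-child pairing over a reply tree can be sketched with a simple traversal. The nested-dictionary schema (`text`, `children`) and the function name `tree_to_pairs` are hypothetical; they do not reflect the actual format of the scraped Reddit data.

```python
def tree_to_pairs(node):
    """node: {"text": str, "children": [node, ...]} (hypothetical schema)."""
    pairs = []
    stack = [node]
    while stack:
        parent = stack.pop()
        for child in parent.get("children", []):
            pairs.append((parent["text"], child["text"]))  # (post, response)
            stack.append(child)  # a comment can itself be a post for its replies
    return pairs

thread = {"text": "Submission title", "children": [
    {"text": "c1", "children": [{"text": "c1a", "children": []}]},
    {"text": "c2", "children": []},
]}
pairs = tree_to_pairs(thread)
```

The root contributes its title as the post text, and every comment appears once as a response and possibly again as the post for its own replies.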

4.2. Baselines

We evaluate the performance of our approach by comparing it with four groups of highly related and strong baseline methods:

(1) Non-personalized response generation models. Seq2SeqWA (Bahdanau et al., 2015) is a standard GRU-based Seq2Seq model with attention mechanism. MMI (Li et al., 2016a) is a Seq2SeqWA using Maximum Mutual Information as loss function to improve diversity.

(2) Personalized models using user ID embeddings. Speaker (Li et al., 2016b) is also based on Seq2SeqWA but using user ID embeddings as additional input to the decoder. PersonaWAE (Chan et al., 2019) is built on an augmented Wasserstein autoencoder. It utilizes user ID embeddings for building a personalization Gaussian mixture distribution, and fuses personalization in the decoder.

(3) Personalized models using explicit user profiles. Since no explicit user profiles are given in our datasets, we use the historical responses of users as their persona texts. GPMN (Zhang et al., 2018) enhances the Seq2SeqWA with a memory module, which encodes each piece of persona description as an individual memory representation. It uses the input message as the query to aggregate and incorporate the memory representations for response generation. PerCVAE (Song et al., 2019) uses the user profile descriptions as conditions and applies a conditional variational autoencoder to generate diverse responses.

(4) Personalized models using implicit user profiles. Since no existing methods consider mining user profiles from dialogue history implicitly, we adapt several state-of-the-art multi-turn response generation models to personalized response generation. We replace the dialogue context in the original models by the user’s historical post-response pairs. VHRED-P (Serban et al., 2017) extends the hierarchical recurrent encoder-decoder with a latent variable to model the complex dependencies among multiple utterances in the context. ReCoSa-P (Zhang et al., 2019) applies a self-attention mechanism to measure the relevance between the response and each utterance in the context.

Dataset Model Word Overlap Diversity Embedding Similarity Personalization
BLEU-1 BLEU-2 ROUGE-L Dist-1 Dist-2 Average Extrema Greedy P-F1(%) P-Cover
Weibo (1) Seq2SeqWA
(1) MMI
(2) Speaker
(2) PersonaWAE
(3) GPMN
(3) PerCVAE
(4) ReCoSa-P
(4) DHAP (ours)
Reddit (1) Seq2SeqWA
(1) MMI
(2) Speaker
(2) PersonaWAE
(3) GPMN
(3) PerCVAE
(4) ReCoSa-P
(4) DHAP (ours)
Table 1. Automatic evaluation results of all models. All models are categorized into four groups: (1) non-personalized; (2) using user ID; (3) using explicit user profile; and (4) using dialogue history. “” denotes the result is significantly worse than our method DHAP in t-test with level. The best results are in bold and the second best results are underlined.
Model Readability Informativeness Personalization
(1) Seq2SeqWA
(1) MMI
(2) Speaker
(2) PersonaWAE
(3) GPMN
(3) PerCVAE
(4) ReCoSa-P
(4) DHAP (ours)
Ground-truth 2.69 2.35 0.84
Table 2. Human evaluation results on Weibo dataset. “” denotes the result is significantly worse than our method in t-test with level. The best results are in bold and the second best results are underlined. The Fleiss Kappa is 0.42.

4.3. Evaluation Metrics

Automatic Evaluation: We use several automatic metrics that evaluate the generated responses from different perspectives. (1) We use BLEU-1, BLEU-2 (Papineni et al., 2002), and ROUGE-L (Lin and Och, 2004) to measure word overlap between the generated response and the ground truth. Higher values indicate higher word-level similarity between the generated response and the ground-truth response. (2) Following Li et al. (2016a), we use Dist-1 and Dist-2 to evaluate the diversity of the generated responses. Responses with more distinct unigrams/bigrams have higher Dist-1/Dist-2. (3) As suggested by Chan et al. (2019), we use three embedding-based metrics to measure the semantic relevance between the generated response and the ground-truth response. Concretely, we use bag-of-words embeddings to represent both the generated and the ground-truth response, and calculate their average similarity (Ave.), greedy similarity (Gre.), and extrema similarity (Ext.). The pre-trained word embeddings for the Weibo and Reddit corpora are provided by Li et al. (2018) and Pennington et al. (2014), respectively. (4) Furthermore, since the goal of our model is to leverage user history for personalization, we evaluate personalization by measuring how much information from the dialogue history is reflected in the generated response. Following Lian et al. (2019) and Lv et al. (2020), we use Persona F1 (P-F1) to measure the unigram F1 between the generated response and the user’s historical responses. Thus, the more historical words the generated response contains, the higher the P-F1. Since the shared words can differ in importance, following Song et al. (2019), we further use Persona Coverage (P-Cover) to measure the IDF-weighted word overlap between the generated response and the dialogue history. Specifically, for $n$ historical responses $r_1, \dots, r_n$ and the generated response $\hat{Y}$, P-Cover is defined as:

$$\text{P-Cover} = \max_{j \in [1, n]} \frac{\sum_{w_k \in W_j} \mathrm{idf}_k}{|W_j|},$$

where $W_j$ is the set of shared words between $r_j$ and $\hat{Y}$, and $\mathrm{idf}_k$ is the IDF weight of word $w_k$.
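The diversity and personalization metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation script used in the paper: the function names, the multiset-based unigram F1, the smoothed IDF formula, and the toy sentences are all assumptions. P-Cover is taken here to be the maximum, over historical responses, of the mean IDF of the words shared with the generated response.

```python
import math
from collections import Counter


def distinct_n(responses, n):
    """Ratio of unique n-grams to all n-grams over tokenized responses."""
    ngrams = [tuple(tokens[i:i + n])
              for tokens in responses
              for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def persona_f1(generated, history):
    """Unigram F1 between a generated response and all history tokens.

    Assumption: multiset overlap without stop-word filtering.
    """
    gen_counts = Counter(generated)
    hist_counts = Counter(tok for resp in history for tok in resp)
    overlap = sum((gen_counts & hist_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen_counts.values())
    recall = overlap / sum(hist_counts.values())
    return 2 * precision * recall / (precision + recall)


def idf_table(corpus):
    """Smoothed IDF per token (the smoothing scheme is an assumption)."""
    n_docs = len(corpus)
    doc_freq = Counter(tok for doc in corpus for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}


def persona_cover(generated, history, idf):
    """Max over historical responses of the mean IDF of shared words."""
    gen_vocab = set(generated)
    best = 0.0
    for resp in history:
        shared = gen_vocab & set(resp)
        if shared:
            score = sum(idf.get(w, 0.0) for w in shared) / len(shared)
            best = max(best, score)
    return best


# Toy example (made-up sentences, for illustration only).
history = [["i", "love", "hiking"], ["coffee", "is", "great"]]
generated = ["i", "love", "coffee"]
idf = idf_table(history)
print(distinct_n([generated], 1))
print(persona_f1(generated, history))
print(persona_cover(generated, history, idf))
```

In practice the IDF table would be estimated on the full training corpus rather than on a single user's history.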

Model Word Overlap Diversity Embedding Similarity Personalization
BLEU-1 BLEU-2 ROUGE-L Dist-1 Dist-2 Average Extrema Greedy P-F1(%) P-Cover
DHAP 7.013 0.144
w/o G
w/o D
w/o PC 8.830 13.981 14.457 6.884
w/o GEN 9.331 0.165
w/o COP
Table 3. Performance of ablation models on Weibo dataset. “” denotes the result is significantly worse than our method in t-test with level. The best results are denoted in bold font.

Human Evaluation: The automatic evaluation metrics measure the quality of the generated response with respect to the ground truth. However, due to the diversity of human dialogues, a response that differs from the ground truth may also be acceptable. Thus, we randomly sample 100 test samples to conduct a human evaluation. We present the generated responses, the corresponding post, and the user’s historical post-response pairs to three well-educated annotators. The annotators evaluate the quality of the generated responses in a double-blind fashion. Following Chan et al. (2019), the evaluation criteria include: (1) Readability, which measures the grammatical correctness and smoothness of generated responses; (2) Informativeness, which measures whether the responses are informative or trivial; and (3) Personalization, which measures whether the response reflects personalized information (sharing some information with the history of the user). For the first two criteria, we use a score of 1/2/3 for bad/normal/good quality. For personalization, we use a score of 0/1 to judge whether a response reflects personalized information or not. The Fleiss Kappa is 0.42, which indicates moderate agreement among the annotators.
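For reference, Fleiss' kappa for a fixed number of annotators per item can be computed as below. This is a minimal sketch of the standard formula; the function name, the input layout (one list of labels per item), and the toy ratings are illustrative assumptions and are not the annotation data of this study.

```python
def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for items rated by the same number of annotators.

    ratings: list of per-item label lists, e.g. [[1, 2, 2], [3, 3, 3]].
    categories: all labels an annotator may assign.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][j]: how many annotators gave category j to item i
    counts = [[item.count(c) for c in categories] for item in ratings]
    # Observed agreement per item, then averaged over items
    p_items = [(sum(c * c for c in row) - n_raters)
               / (n_raters * (n_raters - 1)) for row in counts]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall category proportions
    p_cats = [sum(row[j] for row in counts) / (n_items * n_raters)
              for j in range(len(categories))]
    p_exp = sum(p * p for p in p_cats)
    if p_exp == 1.0:
        # Every rating falls in one category; returning 1.0 is a convention
        return 1.0
    return (p_bar - p_exp) / (1.0 - p_exp)


# Toy ratings from three annotators on a 1/2/3 scale (made up).
toy = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
print(fleiss_kappa(toy, categories=[1, 2, 3]))
```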

4.4. Implementation Details

To determine the parameters of the model, we conducted multiple sets of experiments. The final parameters are selected as follows. For all datasets, we use 512 as the hidden size of the GRU and 0.001 as the learning rate. The hidden size and the number of heads of the Transformer are 256 and 8, respectively. The number of Transformer layers . The history length is set to 25. The vocabulary size is limited to 40,000. The word embedding dimension is 300/100 for the Weibo/Reddit datasets, respectively. We use the Adam optimizer with a batch size of 256. We train all models for 10 epochs and select the best model based on the validation results on BLEU-1.
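The settings listed above can be collected in a single configuration object. The sketch below is only one way to organize them: the class name and field names are hypothetical, and the number of Transformer layers is left unset because its value is not given in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DHAPConfig:
    """Hyperparameters from Section 4.4 (field names are hypothetical)."""
    gru_hidden_size: int = 512
    learning_rate: float = 1e-3
    transformer_hidden_size: int = 256
    transformer_heads: int = 8
    transformer_layers: Optional[int] = None  # not given in the text
    history_length: int = 25
    vocab_size: int = 40_000
    embedding_dim: int = 300
    batch_size: int = 256
    epochs: int = 10
    optimizer: str = "adam"
    selection_metric: str = "bleu-1"  # validation metric for model selection


# Only the word embedding dimension differs between the two datasets.
WEIBO = DHAPConfig(embedding_dim=300)
REDDIT = DHAPConfig(embedding_dim=100)

print(WEIBO)
print(REDDIT)
```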

4.5. Experimental Results

4.5.1. Automatic Evaluation

All evaluation results under automatic metrics are reported in Table 1. We can observe that:

(1) Among all models, DHAP achieves the best results on all evaluation metrics. DHAP improves performance by a large margin over the two strongest baselines, VHRED-P and ReCoSa-P, which can also learn an implicit user profile. Concretely, DHAP significantly outperforms ReCoSa-P in BLEU-1 on both the Weibo and Reddit datasets. The improvement is smaller on the Reddit dataset because it is larger in scale and contains more varied conversations, which introduces more noise. Besides, on the embedding similarity metrics, DHAP also outperforms the best baselines. These results demonstrate that DHAP can generate responses that are semantically closer to the ground truth by leveraging the user’s history. Furthermore, DHAP shows large improvements in Dist-1/2, indicating that it can generate more informative and diverse responses based on the personalized information. All these results indicate that learning implicit user profiles from the user’s dialogue history can improve the quality of generated responses.

(2) All personalized methods outperform non-personalized methods, indicating that personalization is helpful for generating more informative and relevant responses. Seq2SeqWA generally has the lowest performance, reflecting that the semantic information in the post is insufficient for generating an informative response. MMI improves diversity significantly, but loses some ability to model semantic relevance, as it changes the training objective. Speaker and PersonaWAE use user ID embeddings to identify different users for personalization, and outperform the non-personalized methods. The explicit persona-based models GPMN and PerCVAE show performance comparable to the baselines based on user ID embeddings. A potential reason is that they are designed to leverage explicit user profiles, which are usually of high quality. In our case, they are only provided with the user’s historical responses, which are much noisier. Therefore, existing personalized methods designed for explicit user profiles are not appropriate for handling the implicit user profile contained in the user history.

(3) Among all personalized methods, the ones using implicit user profiles perform better. VHRED-P and ReCoSa-P show better performance on most metrics, confirming that the dialogue history can be used to mine an implicit user profile for a specific user. However, these two methods were originally proposed for multi-turn dialogue generation. Their performance on personalized tasks is limited because the dialogue history covers far more aspects than the context in a multi-turn dialogue. In contrast, DHAP models the personalized information on both the encoder and the decoder side, and considers the implicit user profile in both a general and a dynamic form. Hence, DHAP achieves significant improvements over existing personalized baselines.

4.5.2. Human Evaluation

We also conduct a human evaluation for all models on the Weibo dataset. The results are shown in Table 2. Generally, DHAP achieves significant improvements on all three criteria, which is consistent with the results on automatic metrics. In particular, we find that DHAP is much better than ReCoSa-P in terms of personalization. This is because DHAP learns the implicit user profile more comprehensively and enhances the influence of personalized words directly in the decoder. DHAP also performs better than other baselines in terms of readability, which shows that DHAP is better at language understanding with the help of user history. Besides, the informativeness of responses generated by DHAP is also improved. This demonstrates that leveraging personalized information is effective for generating more meaningful responses.

In summary, the automatic and human evaluation results support the view that dialogue history is suitable for building the user profile implicitly, and that leveraging implicit user profiles is effective for generating meaningful and personalized responses, which is a step toward a personalized chatbot.

Figure 3. Effectiveness of DHAP on users with different lengths of history for Weibo dataset.

4.6. Further Analysis

We further analyze the influence of different modules (Section 4.6.1) and the performance over different history lengths (Section 4.6.2). Both of these experiments are performed on Weibo dataset.

4.6.1. Ablation Study

DHAP learns several personalized user profiles based on the dialogue history and designs a decoding switcher and two decoding strategies in the personalized decoder. We remove these components one at a time to analyze their contributions. The experimental results on the Weibo dataset are shown in Table 3.

The ablation on personalized representations. Three settings are considered: (1) without G: the general user profile is not used; (2) without D: the dynamic post-aware user profile is not used; and (3) without PC: the post encoder is non-personalized, namely initializing the post encoder with random states rather than the general user profile.

The results show that all of the personalized representations are useful. Specifically, removing the general user profile causes the largest decline on all metrics, which confirms its contribution to summarizing personalized information from a general view. The performance degradation caused by removing the dynamic post-aware user profile shows that selecting historical responses relevant to the input post further enhances user modeling. The influence of removing personalization in the post encoder is relatively small. This suggests that using the user profile to enhance the understanding of the current post is effective but limited, since such information is only provided at the first step and fades as the hidden state is updated.

The ablation of components in the personalized decoder. We test the following variants of our model: (1) without GEN: the general decoding is disabled; (2) without COP: the copying mode is disabled; (3) with FIX: the probabilities of the two modes are fixed. Specifically, the general decoding probability is set to 0.8 and the copy decoding probability to 0.2. These probabilities are set according to the best results of DHAP in our preliminary experiments.

It can be seen that all three variants underperform the whole framework. Without general decoding, the performance of DHAP drops sharply on all metrics except the personalization metrics. Specifically, it drops 46.55% in terms of BLEU-1. This indicates that copying words only from the personalized vocabulary cannot generate a suitable response, because the history contains much noise irrelevant to the current post and some general words may not be contained in the personalized vocabulary. The personalization metrics improve because all of the generated words are copied from the history, at a significant cost to relevance and diversity. Thus, general decoding, which considers the post, the decoding states, and the personalized information, is necessary. However, using only general decoding also hurts performance, which indicates that the words reflecting the user’s personalized information are also very valuable. To combine the two decoding strategies, DHAP calculates the probabilities of the two decoding strategies dynamically. This works well in DHAP, whereas using fixed probabilities yields lower performance.
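The interaction between the two decoding modes can be illustrated with a simple mixture of word distributions. This is a sketch under stated assumptions rather than the DHAP decoder itself: in DHAP the switch probability is computed dynamically at each decoding step, whereas here it is passed in as a number. The function name, the dictionary representation, and the toy distributions are hypothetical.

```python
def fuse_distributions(p_general, p_copy, switch_general):
    """Mix a generic-vocabulary distribution with a copy distribution.

    p_general: word -> probability under general decoding.
    p_copy: word -> probability of copying from the personalized vocabulary.
    switch_general: probability of choosing the general decoding mode.
    """
    if not 0.0 <= switch_general <= 1.0:
        raise ValueError("switch_general must lie in [0, 1]")
    switch_copy = 1.0 - switch_general
    vocab = set(p_general) | set(p_copy)
    return {w: switch_general * p_general.get(w, 0.0)
               + switch_copy * p_copy.get(w, 0.0)
            for w in vocab}


# Toy distributions for a single decoding step (made-up numbers).
p_general = {"good": 0.6, "morning": 0.3, "hiking": 0.1}
p_copy = {"hiking": 0.7, "coffee": 0.3}

fixed = fuse_distributions(p_general, p_copy, 0.8)      # "with FIX" variant
copy_only = fuse_distributions(p_general, p_copy, 0.0)  # "without GEN"
gen_only = fuse_distributions(p_general, p_copy, 1.0)   # "without COP"
print(max(fixed, key=fixed.get))
print(max(copy_only, key=copy_only.get))
```

With copying only, words outside the personalized vocabulary receive zero probability, which matches the explanation above of why relevance suffers when general decoding is removed.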

4.6.2. Performance across Various Lengths of Dialogue History

As we leverage user’s dialogue history for personalization, the length of history may affect the model’s performance. To investigate the influence of history length, we test the performance of DHAP by using different numbers of historical post-response pairs. The results of BLEU-1 on Weibo set are illustrated in Figure 3. We find:

(1) In general, DHAP performs better when a user has a longer dialogue history. This is consistent with our expectation, as a longer dialogue history can provide more personalized information. DHAP achieves the best performance with a history length of around 25. However, when more than 30 historical pairs are used, the performance of DHAP becomes unstable. A potential reason is that more historical data may bring more noise and increase the difficulty of building an effective user profile. (2) When the history contains fewer than 5 pairs, ReCoSa-P performs better than DHAP without the general user profile. This is because the persona information is extremely limited. Under this circumstance, the more complex architecture of ReCoSa shows its superiority. Nevertheless, the full DHAP model still performs best, showing its robustness across various history lengths.
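A curve like the one in Figure 3 can be obtained by grouping test samples by the number of historical pairs available and averaging a per-sample score within each group. The following is a minimal sketch; the function name, the bucket width of 5, and the sample values are assumptions for illustration and are not results from the paper.

```python
from collections import defaultdict


def mean_score_by_history_length(samples, bucket_size=5):
    """Average a per-sample score within buckets of history length.

    samples: iterable of (history_length, score) pairs.
    Returns a dict mapping each bucket's lower bound to the mean score.
    """
    buckets = defaultdict(list)
    for length, score in samples:
        start = (length // bucket_size) * bucket_size
        buckets[start].append(score)
    return {start: sum(scores) / len(scores)
            for start, scores in sorted(buckets.items())}


# Made-up (history_length, score) pairs, for illustration only.
toy_samples = [(3, 0.1), (4, 0.3), (7, 0.5), (26, 0.4)]
curve = mean_score_by_history_length(toy_samples)
print(curve)
```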

5. Conclusion

In this work, we approached response generation for personalized chatbots in an alternative way. Different from existing personalized methods, we propose the personalized model DHAP, which learns the implicit user profile automatically from large-scale user dialogue history. We design a personalized language model to capture the user’s general interest from their historical responses and summarize the general user profile. To further highlight the historical responses that are relevant and valuable to the current input post, we build a history memory and construct the dynamic post-aware user profile. We build a personalized decoder to coordinate two personalized decoding strategies. Experimental results confirm the effectiveness of our model in generating informative and personalized responses.

Zhicheng Dou is the corresponding author. This work was supported by National Natural Science Foundation of China No. 61872370 and No. 61832017, Beijing Outstanding Young Scientist Program No. BJJWZYJH012019100020098, and Shandong Provincial Natural Science Foundation under Grant ZR2019ZD06.


  • Al-Rfou et al. (2016) Rami Al-Rfou, Marc Pickett, Javier Snaider, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2016. Conversational Contextual Cues: The Case of Personalization and History for Response Ranking. CoRR abs/1606.00372 (2016). arXiv:1606.00372
  • Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In ICLR 2015.
  • Bak and Oh (2019) JinYeong Bak and Alice Oh. 2019. Variational Hierarchical User-based Conversation Model. In EMNLP-IJCNLP. 1941–1950.
  • Chan et al. (2019) Zhangming Chan, Juntao Li, Xiaopeng Yang, et al. 2019. Modeling Personalization in Continuous Space for Response Generation via Augmented Wasserstein Autoencoders. In EMNLP-IJCNLP. 1931–1940.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the NAACL-HLT 2019. 4171–4186.
  • Gerlach and Kuo (1991) James H. Gerlach and Feng-Yang Kuo. 1991. Understanding Human-Computer Interaction for Information Systems Design. MIS Q. 15, 4 (1991), 527–549.
  • Gu et al. (2016) Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O. K. Li. 2016. Incorporating Copying Mechanism in Sequence-to-Sequence Learning. In ACL 2016. ACL.
  • Li et al. (2016a) Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016a. A Diversity-Promoting Objective Function for Neural Conversation Models. In Proceedings of the NAACL HLT 2016. 110–119.
  • Li et al. (2016b) Jiwei Li, Michel Galley, Chris Brockett, Georgios P. Spithourakis, Jianfeng Gao, and William B. Dolan. 2016b. A Persona-Based Neural Conversation Model. In Proceedings of the ACL 2016. ACL.
  • Li et al. (2018) Shen Li, Zhe Zhao, Renfen Hu, Wensi Li, et al. 2018. Analogical Reasoning on Chinese Morphological and Semantic Relations. In ACL. 138–143.
  • Lian et al. (2019) Rongzhong Lian, Min Xie, Fan Wang, Jinhua Peng, and Hua Wu. 2019. Learning to Select Knowledge for Response Generation in Dialog Systems. In IJCAI 2019.
  • Lin and Och (2004) Chin-Yew Lin and Franz Josef Och. 2004. Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics. In Proceedings of the ACL 2004. ACL, 605–612.
  • Liu et al. (2020) Qian Liu, Yihong Chen, Bei Chen, Jian-Guang Lou, Zixuan Chen, Bin Zhou, and Dongmei Zhang. 2020. You Impress Me: Dialogue Generation via Mutual Persona Perception. In Proceedings of the ACL 2020. ACL, 1417–1427.
  • Luo et al. (2019) Liangchen Luo, Wenhao Huang, Qi Zeng, Zaiqing Nie, and Xu Sun. 2019. Learning Personalized End-to-End Goal-Oriented Dialog. In AAAI 2019. 6794–6801.
  • Lv et al. (2020) Pengcheng Lv, Shi Feng, et al. 2020. PersonaGAN: Personalized Response Generation via Generative Adversarial Networks. In DASFAA 2020. Springer, 570–586.
  • Ma et al. (2020) Zhengyi Ma, Zhicheng Dou, Guanyue Bian, and Ji-Rong Wen. 2020. PSTIE: Time Information Enhanced Personalized Search. In CIKM ’20: The 29th ACM International Conference on Information and Knowledge Management, Virtual Event, Ireland, October 19-23, 2020, Mathieu d’Aquin, Stefan Dietze, Claudia Hauff, Edward Curry, and Philippe Cudré-Mauroux (Eds.). ACM, 1075–1084.
  • Mairesse and Walker (2007) François Mairesse and Marilyn A. Walker. 2007. PERSONAGE: Personality Generation for Dialogue. In Proceedings of the ACL 2007. ACL.
  • Miller et al. (2016) Alexander H. Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. 2016. Key-Value Memory Networks for Directly Reading Documents. In Proceedings of the EMNLP 2016. ACL, 1400–1409.
  • Neishi and Yoshinaga (2019) Masato Neishi and Naoki Yoshinaga. 2019. On the Relation between Position Information and Sentence Length in Neural Machine Translation. In CoNLL.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In ACL. 311–318.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global Vectors for Word Representation. In EMNLP 2014. ACL, 1532–1543.
  • Qian et al. (2021) Hongjin Qian, Xiaohe Li, Hanxun Zhong, Yu Guo, Yueyuan Ma, Yutao Zhu, Zhanliang Liu, Zhicheng Dou, and Ji-Rong Wen. 2021. Pchatbot: A Large-Scale Dataset for Personalized Chatbot. In Proceedings of the SIGIR 2021. ACM.
  • Qian et al. (2018) Qiao Qian, Minlie Huang, Haizhou Zhao, Jingfang Xu, and Xiaoyan Zhu. 2018. Assigning Personality/Profile to a Chatting Machine for Coherent Conversation Generation. In Proceedings of the IJCAI 2018. 4279–4285.
  • Qiu et al. (2017) Minghui Qiu, Feng-Lin Li, Siyu Wang, et al. 2017. AliMe Chat: A Sequence to Sequence and Rerank based Chatbot Engine. In ACL 2017. ACL, 498–503.
  • Ritter et al. (2011) Alan Ritter, Colin Cherry, and William B. Dolan. 2011. Data-Driven Response Generation in Social Media. In Proceedings of the EMNLP 2011. ACL, 583–593.
  • Serban et al. (2016) Iulian Vlad Serban, Alessandro Sordoni, et al. 2016. Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models. In AAAI 2016.
  • Serban et al. (2017) Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, et al. 2017. A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues. In AAAI 2017.
  • Shang et al. (2015) Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neural Responding Machine for Short-Text Conversation. In Proceedings of the ACL 2015. ACL, 1577–1586.
  • Song et al. (2020a) Haoyu Song, Yan Wang, Weinan Zhang, Xiaojiang Liu, and Ting Liu. 2020a. Generate, Delete and Rewrite: A Three-Stage Framework for Improving Persona Consistency of Dialogue Generation. In Proceedings of the ACL 2020. 5821–5831.
  • Song et al. (2019) Haoyu Song, Weinan Zhang, et al. 2019. Exploiting Persona Information for Diverse Generation of Conversational Responses. In IJCAI 2019. 5190–5196.
  • Song et al. (2020b) Haoyu Song, Wei-Nan Zhang, et al. 2020b. Generating Persona Consistent Dialogues by Exploiting Natural Language Inference. In AAAI 2020. 8878–8885.
  • Sordoni et al. (2015) Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, et al. 2015. A Neural Network Approach to Context-Sensitive Generation of Conversational Responses. In Proceedings of the NAACL HLT 2015. ACL, 196–205.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, et al. 2017. Attention is All you Need. In NIPS. 5998–6008.
  • Vinyals and Le (2015) Oriol Vinyals and Quoc V. Le. 2015. A Neural Conversational Model. CoRR abs/1506.05869 (2015). arXiv:1506.05869
  • Welleck et al. (2019) Sean Welleck, Jason Weston, Arthur Szlam, and Kyunghyun Cho. 2019. Dialogue Natural Language Inference. In Proceedings of the ACL 2019. ACL, 3731–3741.
  • Wolf et al. (2019) Thomas Wolf, Victor Sanh, Julien Chaumond, and Clement Delangue. 2019. TransferTransfo: A Transfer Learning Approach for Neural Network Based Conversational Agents. CoRR abs/1901.08149 (2019).
  • Wooldridge and Jennings (1995) Michael J. Wooldridge and Nicholas R. Jennings. 1995. Intelligent agents: theory and practice. Knowl. Eng. Rev. 10, 2 (1995), 115–152.
  • Wu et al. (2017) Yu Wu, Wei Wu, Chen Xing, Ming Zhou, and Zhoujun Li. 2017. Sequential Matching Network: A New Architecture for Multi-turn Response Selection in Retrieval-Based Chatbots. In Proceedings of the ACL 2017. ACL, 496–505.
  • Xia et al. (2019) Rui Xia, Mengran Zhang, and Zixiang Ding. 2019. RTHN: A RNN-Transformer Hierarchical Network for Emotion Cause Extraction. In IJCAI 2019. 5285–5291.
  • Zhang et al. (2019) Hainan Zhang, Yanyan Lan, Liang Pang, Jiafeng Guo, and Xueqi Cheng. 2019. ReCoSa: Detecting the Relevant Contexts with Self-Attention for Multi-turn Dialogue Generation. In Proceedings of the ACL 2019. ACL, 3721–3730.
  • Zhang et al. (2018) Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing Dialogue Agents: I have a dog, do you have pets too?. In Proceedings of the ACL 2018. ACL, 2204–2213.
  • Zhang et al. (2020) Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, et al. 2020. DIALOGPT : Large-Scale Generative Pre-training for Conversational Response Generation. In Proceedings of the ACL 2020. ACL, 270–278.
  • Zhou et al. (2018b) Hao Zhou, Minlie Huang, Tianyang Zhang, Xiaoyan Zhu, and Bing Liu. 2018b. Emotional Chatting Machine: Emotional Conversation Generation with Internal and External Memory. In Proceedings of the AAAI 2018. AAAI Press, 730–739.
  • Zhou et al. (2018c) Hao Zhou, Tom Young, Minlie Huang, Haizhou Zhao, Jingfang Xu, and Xiaoyan Zhu. 2018c. Commonsense Knowledge Aware Conversation Generation with Graph Attention. In Proceedings of the IJCAI 2018.
  • Zhou et al. (2018a) Li Zhou, Jianfeng Gao, Di Li, and Heung-Yeung Shum. 2018a. The Design and Implementation of XiaoIce, an Empathetic Social Chatbot. CoRR (2018).
  • Zhou et al. (2020a) Yujia Zhou, Zhicheng Dou, and Ji-Rong Wen. 2020a. Encoding History with Context-aware Representation Learning for Personalized Search. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, SIGIR 2020, Virtual Event, China, July 25-30, 2020, Jimmy Huang, Yi Chang, Xueqi Cheng, Jaap Kamps, Vanessa Murdock, Ji-Rong Wen, and Yiqun Liu (Eds.). ACM, 1111–1120.
  • Zhou et al. (2020b) Yujia Zhou, Zhicheng Dou, and Ji-Rong Wen. 2020b. Enhancing Re-finding Behavior with External Memories for Personalized Search. In WSDM ’20: The Thirteenth ACM International Conference on Web Search and Data Mining, Houston, TX, USA, February 3-7, 2020, James Caverlee, Xia (Ben) Hu, Mounia Lalmas, and Wei Wang (Eds.). ACM, 789–797.
  • Zhu et al. (2020a) Yutao Zhu, Zhicheng Dou, Jian-Yun Nie, and Ji-Rong Wen. 2020a. ReBoost: a retrieval-boosted sequence-to-sequence model for neural response generation. Inf. Retr. J. 23, 1 (2020), 27–48.
  • Zhu et al. (2021) Yutao Zhu, Jian-Yun Nie, Kun Zhou, Pan Du, and Zhicheng Dou. 2021. Content Selection Network for Document-Grounded Retrieval-Based Chatbots. In Proceedings of ECIR 2021 (Lecture Notes in Computer Science), Vol. 12656. Springer, 755–769.
  • Zhu et al. (2020b) Yutao Zhu, Ruihua Song, Zhicheng Dou, Jian-Yun Nie, and Jin Zhou. 2020b. ScriptWriter: Narrative-Guided Script Generation. In Proceedings of ACL 2020. Association for Computational Linguistics, 8647–8657.