CIKM 2021: Learning Implicit User Profile for Personalized Retrieval-based Chatbot
In this paper, we explore the problem of developing personalized chatbots. A personalized chatbot is designed as a digital chatting assistant for a user. The key characteristic of a personalized chatbot is that it should have a consistent personality with the corresponding user. It can talk the same way as the user when it is delegated to respond to others' messages. We present a retrieval-based personalized chatbot model, namely IMPChat, to learn an implicit user profile from the user's dialogue history. We argue that the implicit user profile is superior to the explicit user profile regarding accessibility and flexibility. IMPChat aims to learn an implicit user profile through modeling user's personalized language style and personalized preferences separately. To learn a user's personalized language style, we elaborately build language models from shallow to deep using the user's historical responses; To model a user's personalized preferences, we explore the conditional relations underneath each post-response pair of the user. The personalized preferences are dynamic and context-aware: we assign higher weights to those historical pairs that are topically related to the current query when aggregating the personalized preferences. We match each response candidate with the personalized language style and personalized preference, respectively, and fuse the two matching signals to determine the final ranking score. Comprehensive experiments on two large datasets show that our method outperforms all baseline models.
Building an open-domain dialogue system is an intriguing but challenging task that draws more and more attention in both academic and industrial communities. In general, there are two pathways to construct an open-domain chatbot: retrieval-based and generation-based. The former usually retrieves relevant response candidates via a retrieval engine and then selects a proper response from these candidates (Wang et al., 2013; Wu et al., 2017a; Zhou et al., 2018; Tao et al., 2019; Lowe et al., 2015). The latter directly leverages models, such as encoder-decoder, to generate response texts (Wang et al., 2013; Vinyals and Le, 2015; Ritter et al., 2011). In this paper, we concentrate on retrieval-based chatbots, which are widely applied in many industrial products such as Microsoft XiaoIce (Zhou et al., 2020a) and the E-commerce assistant AliMe (Li et al., 2017).
Although plentiful chatbot applications are available, there still exist some challenges that cannot be overlooked (Chaves and Gerosa, 2020; Gao et al., 2019). Inconsistent personality is one of the most mentioned challenges of current open-domain chatbots (Shum et al., 2018; De Angeli et al., 2001). In the dialogue chatbot domain, personality refers to a set of traits that define a chatbot’s interaction style, character, and behaviors (De Angeli et al., 2001; Liu et al., 2020). Inconsistent personality brings more unpredictability and untrustworthiness to the chatbot, which can disorient end users and lead to a strong sense of discomfort (De Angeli et al., 2001; Portela and Granell-Canut, 2017). Therefore, endowing chatbots with a consistent personality becomes a crucial research task that guarantees a chatbot’s performance in agreement with users’ expectations. To this end, many methods have been proposed to assign a personality to dialogue chatbots (Al-Rfou et al., 2016; Bak and Oh, 2019; Li et al., 2016b; Zhang et al., 2018a; Qian et al., 2018). In this paper, we address the problem of developing “personalized chatbots”. A personalized chatbot is created for a specific user and it functions like the digital chatting assistant of the user. Ideally, it can talk exactly the same way as the user, and can respond to others’ messages when it is delegated by the user. We argue that for such a personalized chatbot, it is extremely important to have a consistent personality with the corresponding user.
There have been a few attempts at developing personalized chatbots. Early studies encoded user information by using user ID embeddings, which are learned during training (Al-Rfou et al., 2016; Chan et al., 2019; Li et al., 2016b), and generated personalized responses with the help of the ID embeddings. Recently, some works focused on building personalized chatbots using manually created user profiles to maintain the consistency of personality for chatbots (Liu et al., 2020; Mazaré et al., 2018; Song et al., 2020a, b; Zhang et al., 2018b; Gu et al., 2019; Hua et al., 2020). Such predefined user profiles are explicit user profiles, which usually contain several persona descriptions or key-value-based personal information.
Table 1: An example of a user’s dialogue history (left: posts issued by others; right: the user’s responses).
|(1) What’s your favorite programming language?||(1) Mine is Java.|
|(2) Rafa wins his 1000 match!||(2) Bravo Rafa!|
|(3) PC or MAC for college students?||(3) If ur in steam, go windows.|
|(4) I failed an exam again and feel like a loser.||(4) Good for you.|
|(5) Australian Open Final Nadal vs Federer!||(5) Vamos, Nadal.|
In this paper, we propose using the user’s dialogue history to automatically build implicit user profiles to endow consistent personality to the chatbot. The reasons mainly lie in two aspects. First, regarding accessibility and flexibility, the implicit user profile is superior to the explicit user profile. Obtaining a large number of explicit user profiles requires tremendous manual labour (Chan et al., 2019). In a practical scenario, users also might be reluctant to write such detailed persona descriptions (Gerlach and Kuo, 1991). Besides, compared with explicit user profiles, implicit user profiles are easier to update with the accumulation of dialogue history as they are automatically learned. Second, tremendous personalized style information is usually hidden underneath the user’s dialogue history. Here, the personalized style refers to the unique personal characteristics (e.g., speaking style, background knowledge, and preferences) that help distinguish a user from others. Such personalized style information is beneficial to enhance the chatbot’s ability to output coherent responses. Table 1 shows an example of a user’s dialogue history from which we can find the user’s personalized speaking style and preferences. Taking the fifth post “Australian Open Final Nadal vs Federer!” as an example, the user replies “Vamos, Nadal” (“Vamos” is a Spanish word which means “let’s go”). Thus, we know that the user likes the tennis player Nadal instead of Federer. And the user likes using “Vamos” to cheer the sports player on.
Inspired by the beneficial properties of implicit user profiles, we design a model IMPChat to learn the IMplicit user profile from a user’s dialogue history for Personalized retrieval-based Chatbots. Through the model, we propose modeling the implicit user profile from two aspects: personalized language style and personalized preference. The former aims to summarize personalized language characteristics (e.g., using “Vamos” to cheer a sports player on) from the dialogue history without considering the current message. The latter considers the conditional relation between the current message and the user’s personal preferences (e.g., liking the tennis player Nadal instead of Federer) by building a post-aware user profile. Then, we perform matching between the response candidates and the learned implicit user profile to select a personalized response.
Specifically, we first design a Personalized Style Matching module that uses multi-layer attentive modules to capture multi-grained personalized language characteristics. At each layer, we simultaneously perform self-attention within historical responses and cross-attention with the corresponding historical posts. In this way, we not only obtain deep personalized style features but also understand the context-aware relations between posts and responses. We perform matching between the response candidates and the personalized style. Second, we build a Post-Aware Personalized Preference Matching module. Given the current query, this module first selects topically related dialogue history. Then, it models the personalized preferences from the selected dialogue history to enhance the personalized response selection. The dialogue history selection is performed by measuring the topical relatedness between the current query and each historical post. We think that the word ambiguity of the current query might bias the topical relatedness. To deal with this bias, we create a tailored multi-hop method that stretches the current query’s semantic richness by fusing context information into the current query. We then perform matching between the response candidates and the post-aware personalized preference. Finally, in the fusion module, we combine the two matching features to determine the most proper response. To validate the effectiveness of our model, we conduct comprehensive experiments on two publicly available datasets for personalized response selection. Experimental results show that our model outperforms all baseline models.
Our contributions are threefold: (1) To the best of our knowledge, this is the first study that seeks to build an implicit user profile using the user’s dialogue history to enhance personality consistency for the retrieval-based chatbot. (2) We propose IMPChat that constructs an implicit user profile considering the user’s personalized language style and personalized preference. (3) Extensive experiments show that our model outperforms the state-of-the-art models.
To build an open-domain dialogue chatbot, there exist two major research directions: generation-based and retrieval-based. The first usually learns to generate responses with an encoder-decoder structure (Oh and Rudnicky, 2000; Vinyals and Le, 2015; Shang et al., 2015; Li et al., 2016a; Serban et al., 2016; Zhu et al., 2020a). The second learns a matching model to select proper responses from candidates given input query and context (Lowe et al., 2015; Zhou et al., 2016; Wu et al., 2017b; Tao et al., 2019; Zhu et al., 2020b, 2021b). In this paper, we focus on the latter.
The general objective of retrieval-based methods is to select a proper response from a candidate list based on a given query (here, “query” has a different meaning from that commonly used in IR). The candidates are usually pre-retrieved by an index system. Along this direction, early studies mainly focus on single-turn response selection (Hu et al., 2014; Wang et al., 2013, 2015; Dai et al., 2018). Recently, research interest has moved to multi-turn response selection, which uses dialogue session information as context to enhance single-turn matching. For example, the Dual-LSTM model encodes the context and responses into hidden states, and the last hidden states are used to compute the relevance score (Lowe et al., 2015). The Multi-view model (Zhou et al., 2016) performs multi-grained context-response matching. These methods concatenate the context into a long sequence, which may undermine the relationships among utterances. To tackle this problem, the representation-matching-aggregation framework is applied by many recent works, which conducts interaction between the response and each utterance in the context and then aggregates these matching features using a CNN or RNN (Wu et al., 2017b). Following this framework, the deep utterance aggregation network (DUA) (Zhang et al., 2018d) designs a turns-aware aggregation mechanism that assigns different weights to each dialogue turn. The deep attentive matching network (DAM) jointly uses self-attention and cross-attention to obtain multi-grained representations (Zhou et al., 2018). The multi-hop selection network (MSN) (Yuan et al., 2019) designs a multi-hop method to make the context selection more robust. Besides, the interaction-over-interaction network (IoI) argues that a single round of interaction captures limited matching features, and thus stacks multiple interaction blocks (Tao et al., 2019).
Endowing personality to open-domain chatbots is intriguing but challenging, and it is an inevitable barrier to achieving a truly applicable intelligent assistant. With a consistent personality, a chatbot’s ability to use language precisely would be greatly improved (Morrissey and Kirakowski, 2013; Ma et al., 2021). Early work tries to model personality using heuristic methods such as the “Big Five” traits of speakers (Mairesse and Walker, 2007). With the rise of deep learning, many end-to-end neural models have been designed to tackle the challenge. Basically, these methods can be categorized into two groups. The first group learns a user embedding for each user. The user embedding is then used to conduct matching or generation (Al-Rfou et al., 2016; Bak and Oh, 2019; Li et al., 2016b). Although the user embeddings are updated during training, they cannot directly interact with dialogue texts, which contain abundant personalized information.
The second group learns personality using explicit user profile (persona) which contains either several persona sentences or key-value agent profile (Zhang et al., 2018a; Qian et al., 2018). These explicit user profiles are either human-annotated or extracted from the dialogue history using rule-based methods. PERSONA-CHAT is such a dataset that is widely used (Zhang et al., 2018a). The major problem is how to use persona sentences (documents) and the context properly. Along with the dataset, a key-value profile memory network is used, which takes the dialogue history as input and performs attention over persona sentences (Miller et al., 2016; Zhang et al., 2018a). DGMN lets the context and document attend to each other to obtain fused representations (Zhao et al., 2019). CSN performs document selection over the persona sentences (Zhu et al., 2021a). RSM-DCK conducts selection over both dialogue context and user profiles and lets the response candidates interact with the selected results (Hua et al., 2020).
Constructing a dataset like PERSONA-CHAT is exhausting. Some other works tackle this issue by generating user profiles through rule-based methods (Zhong et al., 2020; Mazaré et al., 2018). For example, Mazaré et al. (2018) construct much larger datasets with explicit user profiles using the Reddit dataset (https://www.reddit.com/r/datasets/comments/3bxlg7/) in a heuristic way, and empirically prove that models trained on such automatically generated user profiles with enough data can achieve similar or even better performance. Besides, the topic of personalization has also been explored in many other scenarios such as personalized search and user modelling (Zhou et al., 2021, 2020c, 2020b).
Unlike the studies above, in this paper, we seek to build an implicit user profile using the user’s dialogue history. Such an implicit user profile contains rich personalized information and can be learned automatically from the historical dialogues.
Assigning personality to chatbots is crucial to satisfying users’ expectations. As introduced in Section 1, most current works seek to model personality with explicit user profiles. In this paper, we instead study the problem of automatically modeling implicit profiles of a user based on the user’s dialogue history, and leverage the implicit profile to select the personalized response. The chatbot acts like a virtual agent of the user and is able to reply to messages when the user is busy and delegates the chatbot to generate responses.
The general goal of a retrieval-based chatbot is to return the most proper response within a database for an input message. Assume g(r | m, c) is a scoring model evaluating the quality of a candidate response r for a given message m under the context c; the chatbot will output the response with the highest value of g(r | m, c) from a repository of responses R, i.e., we have:

r* = argmax_{r ∈ R} g(r | m, c).
For a single-turn chatbot, c = ∅ and the response is selected merely based on the input message m. For a multi-turn chatbot, c is the conversation context comprised of previous turns of dialogue. In this work, we investigate the problem of retrieving a personalized response that matches the speaking style and knowledge background of a user u. Suppose P_u is the corresponding profile of u; then we have g(r | m, c, P_u). P_u is usually a compact representation of a user’s interests and knowledge. Hence, the personalized chatbot aims to return the response with the highest value of g(r | m, c, P_u).
There are multiple ways to model P_u. Previous works either learn user ID embeddings (Al-Rfou et al., 2016; Bak and Oh, 2019; Li et al., 2016b) or build explicit user profiles (Zhang et al., 2018a; Zhao et al., 2019; Zhu et al., 2021a; Hua et al., 2020). For example, Zhang et al. (2018a) used explicit persona descriptions and Qian et al. (2018) used key-value profiles to model P_u. In this paper, we instead model P_u using the user’s dialogue history. Suppose H_u = {(p_i, r_i)}_{i=1}^{n} is the user’s dialogue history, where p_i represents a post issued by others and r_i is the response made by the current user; we propose selecting responses by:

r* = argmax_{r ∈ R} g(r | m, f(H_u)),

where f(·) defines a model that learns an implicit user profile from the dialogue history. Note that we further denote the user’s historical posts and responses in H_u as {p_1, ..., p_n} and {r_1, ..., r_n}, respectively.
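As a minimal illustration of this selection rule, the sketch below scores every candidate and returns the argmax. The scorer here is a hypothetical lexical-overlap heuristic standing in for the learned matching model, not the paper's actual network.

```python
import numpy as np

def select_response(message, history, candidates, score_fn):
    """Return the candidate with the highest personalized score.
    `score_fn(message, history, response)` stands in for the learned model."""
    scores = [score_fn(message, history, r) for r in candidates]
    return candidates[int(np.argmax(scores))]

# Toy scorer (an assumption for illustration): prefer candidates that
# share words with the user's historical responses.
def toy_score(message, history, response):
    hist_words = set(w for _, r in history for w in r.split())
    return len(hist_words & set(response.split()))

history = [("Australian Open Final Nadal vs Federer!", "Vamos , Nadal")]
cands = ["Go Federer !", "Vamos , Nadal !"]
print(select_response("Nadal wins!", history, cands, toy_score))
# → Vamos , Nadal !
```

Even this toy scorer picks the style-consistent candidate, which is exactly the behavior the learned scoring model is trained to produce.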
Intuitively, there are two major factors that shape a user’s manner of expression. First, a user has a personalized language style regardless of the conversation context. Such language style is usually determined by the user’s knowledge background and preferred expressions. Second, the way a user makes responses is also conditioned on the intrinsic relations between a post and the user’s personalized preferences. Given the same dialogue context (e.g., a post), different users may give different responses, and we try to learn the personalized preferences from history.
With this in mind, we assume the personalized user profile is comprised of two parts: (1) the personalized language style S and (2) the post-aware personalized preferences P, and we have:

g(r | m, f(H_u)) = F(M_S(r, S), M_P(r, m, P)),

where M_S(·) and M_P(·) return the matching features of the candidate response regarding the two parts of the personalized user profile, and F(·) fuses the two features into the final matching score.
Figure 1 shows the structure of our proposed IMPChat. Specifically, we first build a Personalized Style Matching module (which will be introduced in Section 3.4), which aims to capture the personalized language style using the user’s historical responses and outputs v_S, a feature that measures the style consistency of the response candidates. Then, we design a Post-Aware Personalized Preference Matching module (in Section 3.5), which learns post-aware user preferences and outputs v_P, a feature that measures the matching degree between the post-aware user preferences and the response candidates. The two modules perform matching separately, and in the fusion module (in Section 3.6), we combine the two matching features to compute the final matching score. In the remaining parts of this section, we will introduce the details of each component.
We first introduce the structure of the Attentive Module, which is a basic component in our method. Following previous work (Zhou et al., 2018; Yuan et al., 2019), we use the Attentive Module to endow contextual semantics into word embeddings. The Attentive Module is proposed in (Zhou et al., 2018). It is a variant of the Transformer model (Vaswani et al., 2017). Instead of the multi-head attention used in the Transformer, the Attentive Module uses single-head attention. Specifically, the Attentive Module takes Q ∈ R^{L_Q×d}, K ∈ R^{L_K×d}, and V ∈ R^{L_K×d} as input, where L_* and d represent the sequence length and embedding dimension, respectively.
The Attentive Module defines an attention function that maps the query and key-value pairs to a weighted output. The weights are computed by letting each word in the query sentence attend to words in the key sentence via scaled dot-product attention (Vaswani et al., 2017), which can be formulated as:

Att(Q, K, V) = softmax(Q K^T / √d) V.
A residual connection with layer normalization (Ba et al., 2016) is then applied to obtain a better fused representation and to prevent vanishing or exploding gradients:

H = LayerNorm(Att(Q, K, V) + Q).

A feed-forward network (FFN) with ReLU (Goodfellow et al., 2016) activation is further applied to the normalized result as:

FFN(H) = max(0, H W_1 + b_1) W_2 + b_2.
FFN(H) is a 2D tensor with the same shape as the query; W_1 ∈ R^{d×d_f}, W_2 ∈ R^{d_f×d}, b_1, and b_2 are parameters. The final output is obtained via a residual connection with normalization between H and FFN(H). We denote the whole Attentive Module as AttModule(Q, K, V).
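The whole module (attention, residual connection with layer normalization, and the ReLU feed-forward network) can be sketched in NumPy as follows; the weight values and dimensions are placeholders, not trained parameters.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each row to zero mean and unit variance.
    mu = x.mean(-1, keepdims=True)
    sd = x.std(-1, keepdims=True)
    return (x - mu) / (sd + eps)

def attentive_module(Q, K, V, W1, b1, W2, b2):
    """Single-head Attentive Module: scaled dot-product attention,
    residual + layer norm, then a ReLU FFN with a second residual."""
    d = Q.shape[-1]
    att = np.exp(Q @ K.T / np.sqrt(d))
    att /= att.sum(-1, keepdims=True)       # softmax over key positions
    fused = layer_norm(att @ V + Q)          # residual + layer norm
    ffn = np.maximum(0.0, fused @ W1 + b1) @ W2 + b2
    return layer_norm(ffn + fused)           # second residual + norm

rng = np.random.default_rng(0)
L, d, dff = 5, 8, 16
Q = rng.normal(size=(L, d)); K = rng.normal(size=(L, d))
W1 = rng.normal(size=(d, dff)); b1 = np.zeros(dff)
W2 = rng.normal(size=(dff, d)); b2 = np.zeros(d)
out = attentive_module(Q, K, K, W1, b1, W2, b2)
print(out.shape)  # (5, 8): same shape as the query, as the text states
```

Note that the output keeps the query's shape, which is what lets the module be stacked layer over layer in the representation step below.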
A personalized chatbot should coherently output responses that portray consistent personalized styles (e.g., speaking style and vocabulary). We think that these personalized styles are very helpful when we expect to retrieve personalized responses. Taking the pet phrase (a type of speaking style) as an example, regarding how to express congratulation, some users might use “neat” or “congrats”, while the user in Table 1 would like to use “bravo”. Users tend to use their preferred expressions more frequently, and these personalized preferred expressions are underneath the user’s historical responses. Under such an observation, we design a Personalized Style Matching module that aims to model a user’s preferred speaking style from the user’s historical responses.
Formally, given a user u with dialogue history H_u and the response candidate r^c, the Personalized Style Matching module aims to get a style matching feature vector v_S which measures the style consistency between the response candidate r^c and the historical responses {r_1, ..., r_n}. The Personalized Style Matching module achieves this goal via three layers: (1) Representation: it extracts multi-grained semantic representations for the historical responses and the response candidate; (2) Matching: it performs matching at each semantic level; and (3) Aggregation: it dynamically fuses matching signals between the response candidate and all historical responses to obtain v_S. We will introduce each layer in detail as follows.
The representation layer aims to obtain multi-grained contextual representations and cross-attention representations for the response candidate and each historical response using attentive modules.
Taking the i-th response r_i as an example, the contextual representations model the contextual semantic patterns of r_i from shallow to deep. Specifically, we first initialize the word representations E_{r_i}^0 by looking up a word embedding table (e.g., Word2Vec). Then, we obtain deep contextual response representations by feeding the word embeddings into L stacked attentive modules:

E_{r_i}^l = AttModule(E_{r_i}^{l-1}, E_{r_i}^{l-1}, E_{r_i}^{l-1}),  l = 1, ..., L,

where E_{r_i}^l is the contextual representation output by the l-th attentive module. Through the L attentive modules, we obtain the representations {E_{r_i}^0, ..., E_{r_i}^L}, which depict the co-occurrence patterns of words at different granularities.
Furthermore, we think that the personalized style of a response is also conditioned on the post. For example, “good for you” is semantically similar to “bravo”. However, in Table 1, “good for you” is a response to the post “I failed an exam again and feel like a loser”; the user instead uses the phrase in an ironic way. In view of this, we let the response r_i’s contextual representation E_{r_i}^l attend to the corresponding post p_i’s contextual representation E_{p_i}^l to obtain the cross-attention representations:

Ê_{r_i}^l = AttModule(E_{r_i}^l, E_{p_i}^l, E_{p_i}^l),

where E_{p_i}^l is obtained in the same way as E_{r_i}^l. For the response candidate r^c, we obtain E_{r^c}^l and Ê_{r^c}^l in the same way.
The matching layer aims to obtain a personalized style matching matrix which measures the style matching degree between the response candidate and each historical response at multiple granularities. Specifically, given a candidate response r^c and the i-th historical response r_i, we compute M_self^{i,l} and M_cross^{i,l}, which are the matching matrices for the contextual representations and the cross-attention representations:

M_self^{i,l} = E_{r_i}^l (E_{r^c}^l)^T / √d,  M_cross^{i,l} = Ê_{r_i}^l (Ê_{r^c}^l)^T / √d,

where d is the dimension of the embeddings and l refers to the representation output by the l-th attentive module. Hence, for the n historical responses {r_1, ..., r_n}, we have two groups of multi-grained matching matrices {M_self^{i,l}} and {M_cross^{i,l}}.
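Each pairwise matching matrix above is a scaled dot product between token-level representations, which can be sketched as:

```python
import numpy as np

def matching_matrix(R, C):
    """Scaled dot-product matching between a historical response
    representation R (L_r x d) and a candidate representation C (L_c x d).
    Entry [i, j] scores the similarity of word i in the history response
    to word j in the candidate."""
    d = R.shape[-1]
    return R @ C.T / np.sqrt(d)

rng = np.random.default_rng(1)
R = rng.normal(size=(6, 8))   # one historical response, 6 tokens
C = rng.normal(size=(4, 8))   # the response candidate, 4 tokens
M = matching_matrix(R, C)
print(M.shape)  # (6, 4): one similarity score per word pair
```

The same function applies at every layer l and to both the contextual and cross-attention representations, yielding the two groups of matrices described above.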
To transform these matching matrices into a shared feature space, we first concatenate them into a stacked matching matrix:

M_i = Concat({M_self^{i,l}}_{l=0}^{L} ∪ {M_cross^{i,l}}_{l=0}^{L}),

where Concat refers to concatenation along a new dimension, M_i ∈ R^{2(L+1)×L_max×L_max}, and L_max is the maximum sequence length.
Next, in the aggregation layer, following (Zhou et al., 2018; Yuan et al., 2019; Zhu et al., 2021a), we extract matching features from the matching matrix via a CNN. The extracted feature is then linearly mapped into a lower dimension:

t_i = MLP(CNN(M_i)),

where t_i ∈ R^{d_s} and MLP(·) represents a multi-layer perceptron.
Note that the personalized style matching features T = [t_1, ..., t_n] contain the matching signals between the response candidate r^c and each historical response r_i. Although post-response pairs are sorted by time in the dialogue history, temporal patterns vanish for the historical responses alone. Besides, each historical response may impact the personalized style differently. Therefore, we apply self-attention to dynamically sum up the personalized style matching features T. Finally, we obtain the style matching feature v_S:

v_S = Σ_{i=1}^{n} w_i ⊙ t_i,

where w_i represents the attention weights and ⊙ is the element-wise multiplication.
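The attention-weighted pooling over the per-response matching features can be sketched as follows; a single learned scoring vector producing one scalar weight per historical response is a simplification of the self-attention described above, assumed here for brevity.

```python
import numpy as np

def aggregate_style_features(T, w_att):
    """Attention pooling over per-response matching features.

    T: (n, d_s) matching feature of each historical response.
    w_att: (d_s,) scoring vector, a stand-in for the learned attention.
    Returns the weighted sum of the rows of T.
    """
    logits = T @ w_att
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                 # softmax over the n responses
    return weights @ T                        # (d_s,)

rng = np.random.default_rng(2)
T = rng.normal(size=(10, 32))                # 10 historical responses
v_s = aggregate_style_features(T, rng.normal(size=32))
print(v_s.shape)  # (32,)
```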
In addition to personalized language styles, a user’s personalized preferences also have a great impact on the personalized response selection. For example, people who prefer MAC to PC would give a positive comment about a Mac-related post. We think that such personalized preferences tend to be captured from the user’s dialogue history as the user’s preferences are relatively consistent. Hence, how to properly utilize the dialogue history to enhance the personalized preference consistency is important.
The Post-Aware Personalized Preference Matching module aims to obtain a matching vector v_P which measures whether a response candidate can consistently reflect the user’s personalized preferences, given the current query m and a user u with dialogue history H_u. To properly utilize the dialogue history H_u, we first transform it into a user profile U from which we can effectively model the personalized preferences.
Ideally, the user profile only contains post-response pairs that are topically related to the current post. However, the dialogue history reflects multifaceted user interests, some of which might be unrelated to the current post. Unrelated context might bring a negative impact on response selection (Yuan et al., 2019; Hua et al., 2020). Therefore, we need to filter out the unrelated dialogue history. Intuitively, we can compute a relevance vector β that measures the topical relatedness between the current post m and the historical posts {p_1, ..., p_n}. Then, we can obtain the user profile U by reweighting the dialogue history H_u:

U = {β_i · h_i}_{i=1}^{n},   (10)

where h_i is the representation of the i-th post-response pair.
We compute the relevance vector considering two assumptions: (1) topic relatedness can be context-level and word-level; (2) word usage in a post is ambiguous by nature, which biases the topic relatedness. Take the post “Do you like MAC?” as an example. At the context level, it relates to the topic “personal preference”. At the word level, it relates to the topic “MAC”. Meanwhile, the word “MAC” is ambiguous. For the user in Table 1, it most likely refers to Apple’s “Macintosh”, as we can find a “MAC”-related post “PC or MAC for college students” in the dialogue history. But for a beauty blogger, it might refer to the cosmetics brand “MAC”.
In view of the first assumption, we decompose the relevance vector β into the word-level relevance vector β^w and the context-level relevance vector β^c.
At the word level, for n historical posts with a maximum length of L_max, we obtain the word-level matching matrix S_i^w by:

S_i^w = E_m W_s (E_{p_i})^T + b_s,

where W_s and b_s are parameters. E_m (E_{p_i}) is the contextual representation of the current post (the i-th historical post), which is obtained via a single attentive module, e.g., E_m = AttModule(E_m^0, E_m^0, E_m^0), where E_m^0 is initialized by looking up a word embedding table. We then conduct max-pooling over the word-level matching matrices to obtain the most important matching features, which are then linearly mapped into the word-level relevance vector β^w using the softmax function:

β^w = softmax(MLP([maxpool(S_1^w); ...; maxpool(S_n^w)])),

where [·; ·] is the concatenation operation.
At the context level, we obtain the context-level relevance vector β^c by:

β^c = softmax([c_m^T c_{p_1}; ...; c_m^T c_{p_n}]),

where c_m and c_{p_i} are sentence representations obtained by mean-pooling over the word dimension. We combine the word-level and context-level relevance vectors by:

β = γ · β^w + (1 − γ) · β^c,   (14)

where γ is a trainable parameter and is initialized by 0.5.
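The gated mixture of the two relevance vectors is a one-liner; 0.5 mirrors the gate's initial value, and the inputs are toy relevance scores for three historical posts.

```python
import numpy as np

def combine_relevance(beta_word, beta_ctx, gamma=0.5):
    """Mix word-level and context-level relevance with a scalar gate.
    The gate is trainable in the model; 0.5 is only its initialization."""
    return gamma * beta_word + (1.0 - gamma) * beta_ctx

bw = np.array([0.7, 0.2, 0.1])   # word-level relevance
bc = np.array([0.1, 0.6, 0.3])   # context-level relevance
print(combine_relevance(bw, bc))  # → approximately [0.4 0.4 0.2]
```

During training the gate drifts away from 0.5, letting the model decide whether word overlap or sentence-level topicality matters more for a given corpus.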
For now, the relevance vector β is obtained by using the current query as the key to attend to the historical posts. To tackle the bias discussed in the second assumption, we then design a multi-hop method that alleviates the word ambiguity by stretching the semantic richness of the current query.
Figure 2 demonstrates the multi-hop method. Specifically, we store the historical posts in a buffer. At each hop, the most related historical post p_k will be popped out as a new key, where k = argmax_i β_i and β is the relevance scores of the posts in the buffer. For the post “Do you like MAC?” in Table 1, at hop-1, the key is the post itself. After attending to the historical posts, we expect the historical post “PC or MAC for college students” to be popped out as a new key. In this way, the word “MAC” is linked to “PC” and the ambiguity of the word “MAC” can be alleviated.
We denote all popped keys as K, which is a subset of all historical posts {p_1, ..., p_n}. At hop-1, K = {m}, and the relevance score is computed by Eq. (14) with the current query m. Afterwards, we update the current query by fusing the representations of the popped keys into it:

Ẽ_m = AttModule(E_m, E_K, E_K),

where E_K is the concatenated contextual representation of the popped keys. We then obtain a new relevance score via Eq. (14) using the updated representation Ẽ_m. We denote the relevance score of hop-t as β^(t). After T hops, we have {β^(1), ..., β^(T)}. We then linearly map these scores into the final reweighting scores:

β* = softmax(W_h [β^(1); ...; β^(T)] + b_h),

where W_h and b_h are parameters. Thus, we can rewrite Eq. (10) to:

U = {β*_i · h_i}_{i=1}^{n}.
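A rough sketch of the multi-hop key expansion follows. Cosine similarity stands in for the learned relevance scoring, and simple averaging stands in for the learned combination of per-hop scores; both substitutions are assumptions for illustration only.

```python
import numpy as np

def multi_hop_scores(query_vec, post_vecs, hops=2):
    """At each hop: score the buffered posts against the current query,
    pop the best-matching post, and fold it into the query representation
    so that later hops see an enriched query."""
    def scores(q, P):
        qn = q / np.linalg.norm(q)
        Pn = P / np.linalg.norm(P, axis=1, keepdims=True)
        return Pn @ qn                         # cosine similarity per post

    buffer = list(range(len(post_vecs)))
    q = query_vec.copy()
    per_hop = []
    for _ in range(hops):
        s = scores(q, post_vecs)
        per_hop.append(s)
        best = max(buffer, key=lambda i: s[i])  # pop the most related post
        buffer.remove(best)
        q = (q + post_vecs[best]) / 2.0         # fuse the popped key in
    return np.mean(per_hop, axis=0)             # average in place of a learned map

rng = np.random.default_rng(3)
posts = rng.normal(size=(5, 16))                # 5 historical posts
final = multi_hop_scores(rng.normal(size=16), posts)
print(final.shape)  # (5,): one reweighting score per historical post
```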
To thoroughly measure the relevance between the response candidate r^c and the post-aware user profile U, we construct three matching matrices:

M_1^i = E_{p_i} (E_{r^c})^T / √d,  M_2^i = E_{r_i} (E_{r^c})^T / √d,  M_3^i = Ê_{r_i} (Ê_{r^c})^T / √d,

where d is the embedding size. E_{p_i}, E_{r_i}, and E_{r^c} are the contextual embeddings obtained via a single-layer attentive module, e.g., E_{r^c} = AttModule(E_{r^c}^0, E_{r^c}^0, E_{r^c}^0). Meanwhile, Ê_{r_i} and Ê_{r^c} are obtained via cross-attention:

Ê_{r_i} = AttModule(E_{r_i}, E_{p_i}, E_{p_i}),  Ê_{r^c} = AttModule(E_{r^c}, E_m, E_m).

Thereafter, the three matching matrices are concatenated together:

M^i = Concat(M_1^i, M_2^i, M_3^i).
Same as in the Personalized Style Matching module, we use a 2D CNN with max-pooling to extract high-level matching features. As the dialogue history is sorted by time, we utilize a single-layer GRU to capture the temporal signal of the post-response pairs in the dialogue history. We use the GRU’s final state as the post-aware personalized preference matching feature v_P.
In Section 3.4 and Section 3.5, we obtain two matching features: (1) the personalized style matching feature v_S, which measures the personalized style consistency of the response candidates; and (2) the post-aware personalized preference matching feature v_P, which measures the relevance between a response candidate and the user’s personalized preferences. We concatenate the two matching features to get the final matching vector. We then use an MLP with a sigmoid activation function to compute the final matching score:

g(r | m, f(H_u)) = σ(MLP([v_S; v_P])).
We use the cross-entropy loss to train the model:

L = − Σ_{(r, y)} [ y · log g + (1 − y) · log(1 − g) ],

where y ∈ {0, 1} indicates whether the candidate response r is the ground truth and g is its predicted matching score.
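Putting the fusion head and the training objective together, a minimal sketch follows; a single linear layer stands in for the MLP, and all weights are random placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def match_score(v_s, v_p, W, b):
    """Concatenate the two matching features and score with a
    sigmoid head (one linear layer standing in for the MLP)."""
    v = np.concatenate([v_s, v_p])
    return sigmoid(v @ W + b)

def bce_loss(scores, labels, eps=1e-9):
    """Binary cross-entropy over (candidate score, label) pairs."""
    s = np.clip(scores, eps, 1 - eps)
    return -np.mean(labels * np.log(s) + (1 - labels) * np.log(1 - s))

rng = np.random.default_rng(4)
v_s, v_p = rng.normal(size=16), rng.normal(size=16)
g = match_score(v_s, v_p, rng.normal(size=32), 0.0)
loss = bce_loss(np.array([g, 0.1]), np.array([1.0, 0.0]))
print(0.0 < g < 1.0, loss > 0.0)
```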
There are many datasets for evaluating retrieval-based dialogue models (Lowe et al., 2015; Wu et al., 2017a; Zhang et al., 2018b, c). However, none of them contains user identifications. In this paper, we expect to learn implicit user profiles from the user’s dialogue history; thus, we need datasets that contain users’ identifications. We use two public datasets crawled from two social networking sites: Weibo and Reddit. The Weibo dataset is a subset of the PChatbotW dataset released by (Qian et al., 2021). The PChatbotW dataset contains one-year Weibo logs from Sept. 10, 2018. Weibo is a popular social network in China. A Weibo user can post publicly visible short messages, and other users can reply to the post. In the dataset, all posts and responses have user IDs. The Reddit dataset is extracted from the online forum Reddit, covering Dec. 1, 2015 to Oct. 30, 2018 (Zhang et al., 2020). In this dataset, the discussions can be expanded as tree-structured reply chains, where each parent node can be considered as a post to its child nodes. Thus, we generate post-response pairs by traversing the tree structure.
After aggregating users’ dialogue histories, we filter out users with fewer than fifteen historical dialogues to guarantee enough personalized information. Besides, we limit the number of words to 50 for each utterance in the user’s dialogue history. For each user, we sort the dialogue history by time and use the latest post as the current query. Following previous works on constructing retrieval-based dialogue datasets (Lowe et al., 2015; Wu et al., 2017a; Zhang et al., 2018c, b), we create a list of ten response candidates for the current query.
The response candidates can be divided into three groups: (1) we use the user's own response under the current query as the ground truth (personalized response); (2) we select other users' responses under the current query (non-personalized responses) as part of the candidates; (3) following (Wu et al., 2017a), we retrieve response candidates via a retrieval engine (relevant responses), filtering out responses issued by the current user. Notably, both personalized and non-personalized responses can be considered proper responses to the current query.
Compared to previous candidate sampling strategies, such as random sampling (Lowe et al., 2015; Zhang et al., 2018c, b) or retrieval only (Wu et al., 2017a), our response sampling method is more advantageous, especially for personalized chatbots, for two reasons. First, in a practical scenario, retrieval-based chatbots usually retrieve response candidates relevant to the query via a retrieval engine and then select a proper response from these candidates. Thus, instead of merely separating irrelevant candidates from proper responses, learning to select a proper response from a list of relevant responses is more useful and challenging. Furthermore, we expect the dialogue chatbot to have a consistent personality, so it should also be able to recognize the personalized response within a list of proper responses, including both personalized and non-personalized ones. The statistical information of the two datasets is shown in Table 2.
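The sampling procedure above can be sketched as follows; `retrieve` and the data layout are hypothetical stand-ins for the retrieval engine and corpus used in the paper:

```python
def build_candidates(post, user_id, responses_by_post, retrieve, k=10):
    """Assemble k response candidates for one query post:
    ground truth + other users' responses + retrieved responses."""
    # (1) the current user's own response is the personalized ground truth
    ground_truth = responses_by_post[post][user_id]
    # (2) other users' responses under the same post (non-personalized)
    non_pers = [r for uid, r in responses_by_post[post].items() if uid != user_id]
    # (3) retrieved relevant responses, excluding any issued by the current user
    retrieved = [r for r, uid in retrieve(post) if uid != user_id]
    candidates = [ground_truth] + non_pers
    for r in retrieved:
        if len(candidates) >= k:
            break
        if r not in candidates:
            candidates.append(r)
    return candidates[:k]
```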
To train baseline models that require an external explicit user profile, such as DIM (Gu et al., 2019) and RSM-DCK (Hua et al., 2020), we follow (Zhong et al., 2020; Mazaré et al., 2018) and collect explicit persona sentences using heuristic methods. Note that these explicit persona sentences are not used in our model.
| Statistic | Weibo Corpus | Reddit Corpus |
|---|---|---|
| Number of users | 420,000 | 280,642 |
| Avg. history length | 32.3 | 85.4 |
| Avg. length of post | 24.9 | 10.5 |
| Avg. length of response | 10.1 | 12.4 |
| Avg. number of non-pers. candidates | 3.9 | 3.2 |
| Avg. number of relevant candidates | 5.1 | 5.8 |
| Number of response candidates | 10 | 10 |
| Number of training samples | 3,000,000 | 2,000,000 |
| Number of validation samples | 600,000 | 403,210 |
| Number of testing samples | 600,000 | 403,210 |
As discussed in Section 4.1.2, the negative response candidates are sampled from the retrieval engine and from other users' responses under the same post (non-personalized responses). Thus, we consider three types of metrics to evaluate the models' performance. First, we use recall at position k in n candidates and MRR (Mean Reciprocal Rank) to measure the model's ability to select a personalized response from all candidates. Second, we use nDCG (normalized Discounted Cumulative Gain) to measure the model's ability to select a proper response from all candidates. Last, we introduce a new metric to measure the model's ability to select a personalized response from the proper candidates. We describe nDCG and the new metric in the following:
nDCG: this metric assigns graded scores to both personalized and non-personalized responses. It matters in situations where no personalized response appears in the candidate list; in such a scenario, a non-personalized response is better than an irrelevant one. We use nDCG@5 and set the relevance scores to 2 and 1 for personalized and non-personalized responses, respectively.
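As an illustration, nDCG@k over these graded labels can be computed as follows; we assume the exponential-gain variant here, since the paper does not specify which formulation it uses:

```python
import math

def ndcg_at_k(ranked_rels, k=5):
    """nDCG@k over graded relevance labels in ranked order
    (2 = personalized, 1 = non-personalized proper, 0 = other)."""
    def dcg(rels):
        return sum((2 ** rel - 1) / math.log2(i + 2)
                   for i, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(ranked_rels, reverse=True))  # best possible ordering
    return dcg(ranked_rels) / ideal if ideal > 0 else 0.0
```

A perfect ranking (personalized first, then non-personalized proper responses) scores 1.0; any other ordering scores less.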
Recall among proper candidates: for a personalized chatbot, personalized responses are more useful than non-personalized ones. Thus, we introduce recall at position k within the proper candidates to measure the model's ability to distinguish the personalized response from the other proper candidates. Taking k=1 as an example, when the personalized response ranks 1st among all proper responses, the sample scores 1, otherwise 0.
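One way to compute this metric at k=1, assuming a ranked list of candidate indices and graded labels as above (2 = personalized, 1 = non-personalized proper, 0 = other), is:

```python
def pr_at_1(ranking, labels):
    """1 if the personalized response (label 2) is ranked above every
    non-personalized proper response (label 1), ignoring all other
    candidates (label 0); otherwise 0."""
    proper = [labels[i] for i in ranking if labels[i] > 0]
    return 1 if proper and proper[0] == 2 else 0
```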
Table 3 reports results on the Weibo Corpus and the Reddit Corpus. Markers in the table denote results significantly worse than our method in t-tests at the two reported significance levels. Some models are implemented with the provided source code, and the others are implemented by ourselves. The best results are in bold.
In our task, a user's dialogue history contains many single-turn dialogues. Given a query, we can either treat the task as single-turn matching or use the dialogue history as context. Thus, we consider three types of retrieval-based models as baselines:
(1) Single-turn matching models:
ARC-I (Hu et al., 2014): The model uses a Siamese architecture in which multi-layer 1D CNNs capture multi-grained semantic features of each sentence before matching.
ARC-II (Hu et al., 2014): The model builds directly on the interaction space between two sentences, using 1D CNNs to capture low-level features and 2D CNNs to capture deep matching features.
KNRM (Xiong et al., 2017): The model uses a kernel-based neural ranking model to capture word-level soft matches for single-turn dialogue.
Conv-KNRM (Dai et al., 2018): The model uses a CNN kernel-based ranking model to capture n-gram soft matches for single-turn dialogue.
(2) Multi-turn dialogue models:
SMN (Wu et al., 2017a): The model matches a response with each utterance in the context at multiple levels, uses CNNs to extract matching information, and obtains the final matching score with an RNN.
DAM (Zhou et al., 2018): The model constructs multi-level text segment representations with stacked self-attention, then extracts matched segment pairs with attention across the context and response.
IOI (Tao et al., 2019): The model uses a chain of interactive blocks to repeatedly perform semantic interaction between the response and each utterance in the context, obtaining deep interactive matching information.
MSN (Yuan et al., 2019): The model performs context selection to filter out irrelevant context, then lets the response candidates interact with each remaining utterance to obtain multiple matching features.
(3) Persona-based dialogue models:
DIM (Gu et al., 2019): The model uses BiLSTMs to encode response candidates, the user profile, and the context; it lets the user profile and the context each interact with the response candidates via cross-matching, and aggregates the matching features with another BiLSTM.
DGMN (Zhao et al., 2019): The model fuses information from a document and a context into representations of each other, and performs hierarchical interactions between a response and both the document and the context, with the importance of each part determined dynamically.
RSM-DCK (Hua et al., 2020): The model pre-selects the document and the context, performs context-response and document-response matching, and aggregates the matching features with a BiLSTM.
CSN (Zhu et al., 2021a): The model designs a content selection network to explicitly select relevant content and filter out irrelevant parts; it performs context-response and document-response matching and aggregates the matching features using a CNN with an LSTM.
In our model, we use Word2Vec (Mikolov et al., 2013) to initialize the word embeddings, which have a size of 200. The Word2Vec vectors are trained on the dataset, and all baseline models use the same word embeddings in our experiments. We limit the maximum sequence length to 50. For the Personalized Style Matching module, we set the number of attentive modules to 3. For the Post-Aware Personalized Preference Matching module, we set the number of hops to 2 and the hidden size of the GRU to 300. We use a three-layer CNN with max-pooling after each convolutional layer. We optimize the model using Adam with a learning rate of 5e-4, which is decayed during training. We tune IMPChat and all baseline models on the validation set and evaluate on the test set. For both datasets, we set the batch size to 128 and train the model for 10 epochs on two Tesla V100 16G GPUs. Our model is implemented in PyTorch (Paszke et al., 2019), and the code will be released upon acceptance of the paper.
Table 3 shows the results. Compared to previous state-of-the-art models, our model IMPChat achieves the best performance across most metrics on the two datasets. As mentioned in Section 4.1.3, we use three types of evaluation metrics that reflect the model's abilities from different aspects. First, regarding recall and MRR, IMPChat yields statistically significant improvements over all baseline models on both datasets (t-test). This demonstrates IMPChat's strong capability to distinguish the marginal differences between personalized responses and all other responses. Second, regarding nDCG, IMPChat outperforms all baselines on the Reddit Corpus, while it is worse than several multi-turn models on the Weibo dataset. We also find that most other personalized models perform worse than multi-turn models in terms of nDCG. The potential reasons might be: (1) personalized information increases the marginal distance between personalized responses and other responses, and therefore the marginal distance between non-personalized (but relevant) responses and other negative responses decreases; (2) more personalized information also brings more noise that undermines matching (also discussed in Section 4.5.2). Last, regarding recall among proper candidates, we find that most personalized models perform better than multi-turn models, reflecting that personalized information enhances the models' capability to distinguish personalized responses from non-personalized ones. IMPChat outperforms all baseline models on the Weibo Corpus and all baselines except DIM on the Reddit Corpus, showing the effectiveness of building implicit user profiles from dialogue history.
We respectively remove (1) the multi-hop method, (2) the Personalized Style Matching module, and (3) the Post-Aware Personalized Preference Matching module of IMPChat to investigate their effectiveness. Table 4 shows the results. Removing any part of IMPChat leads to a performance drop. Notably, performance degrades by a large margin when the Post-Aware Personalized Preference Matching module is removed: without it, IMPChat only accesses historical responses, so the conditional matching signals among post-response pairs diminish. Besides, we find that without either the multi-hop method or the Personalized Style Matching module, IMPChat still outperforms all baseline models in terms of recall and MRR, which verifies that each of the two parts brings orthogonal information that the remaining parts fail to capture.
We aim to learn an implicit user profile from the user's dialogue history, so choosing the length of the dialogue history is a crucial problem. We investigate it from two aspects: how the history length affects the model's performance, and how difficult it is to train the model or make inferences with more dialogue history. Figure 3 shows the influence of the history length, from which we find: (1) as aforementioned, recall and MRR measure the model's capability to retrieve a personalized response, while nDCG also considers non-personalized responses. With more dialogue history, the model's capability to find the personalized response steadily increases, since a longer history brings more personalized information. However, the model's capability to retrieve non-personalized responses decreases. The potential reason for this decline is that a long history brings more noise into the implicit profile, which blurs the decision boundary between non-personalized responses and other negative ones. (2) Figure 3(b) shows that the required computing resources increase with longer dialogue history (for this experiment, we set the batch size to 32 and use a single Tesla V100 16G GPU). In a practical scenario, there is thus a trade-off in choosing the history length. First, training and inference speeds matter in practical use, especially on devices with limited computing resources such as mobile phones. Second, personalized responses cannot always be retrieved before response selection; when no candidate matches the user's personality, a relevant response is better than a wrong one. We also conduct experiments on RSM-DCK, which has a similar number of parameters to IMPChat, and the results verify our findings.
The left figure shows the performance variance on different history lengths. The right figure shows the required GPU memory and speed for training and inference.
In this paper, we propose building implicit user profiles to endow a retrieval-based chatbot with personality. To achieve this goal, we build a model, IMPChat, that learns the implicit user profile from two aspects. First, it models the user's personalized language style using the user's historical responses. It then models the user's personalized preferences from a post-aware user profile, which contains post-response pairs that are topically related to the current post. Extensive experimental results on two large datasets show that our method outperforms all previous state-of-the-art models, verifying our model's effectiveness for the personalized retrieval-based chatbot.
Zhicheng Dou is the corresponding author. This work was supported by National Natural Science Foundation of China No. 61872370 and No. 61832017, Beijing Outstanding Young Scientist Program No. BJJWZYJH012019100020098, Shandong Provincial Natural Science Foundation under Grant ZR2019ZD06, and the Intelligent Social Governance Platform, Major Innovation & Planning Interdisciplinary Platform for the "Double-First Class" Initiative, Renmin University of China. We also wish to acknowledge the support and contribution of the Public Policy and Decision-making Research Lab of Renmin University of China.
Modeling personalization in continuous space for response generation via augmented Wasserstein autoencoders. In EMNLP-IJCNLP, pp. 1931–1940.
In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 1845–1854.
Convolutional neural network architectures for matching natural language sentences. In Advances in Neural Information Processing Systems, pp. 2042–2050.
Efficient estimation of word representations in vector space.