Lifelong sequential modeling for user response prediction.
User response prediction, which models the user preference w.r.t. the presented items, plays a key role in online services. After two decades of rapid development, the user behavior sequences accumulated on mature Internet service platforms since each user's first registration have become extremely long. Each user not only has intrinsic tastes but also keeps changing her personal interests over her lifetime. Hence, it is challenging to handle such lifelong sequential modeling for each individual user. Existing methodologies for sequential modeling are only capable of dealing with relatively recent user behaviors, which leaves huge room for modeling long-term, and especially lifelong, sequential patterns to facilitate user modeling. Moreover, one user's behavior may be accounted for by various previous behaviors within her whole online activity history, i.e., long-term dependency with multi-scale sequential patterns. To tackle these challenges, in this paper we propose a Hierarchical Periodic Memory Network for lifelong sequential modeling with personalized memorization of sequential patterns for each user. The model also adopts a hierarchical and periodical updating mechanism to capture multi-scale sequential patterns of user interests while supporting the evolving user behavior logs. The experimental results over three large-scale real-world datasets demonstrate the advantages of our proposed model, with significant improvement in user response prediction performance against state-of-the-art methods.
Nowadays, accurate prediction of user responses, e.g., clicks or conversions, has become a core part of personalized online systems, such as search engines (Dupret and Piwowarski, 2008), recommender systems (Qu et al., 2016) and computational advertising (He et al., 2014). The goal of user response prediction is to estimate the probability that a user would respond to a specific item or a piece of content provided by the online service. The estimated probability may guide the subsequent decision making of the service provider, e.g., ranking the candidate items according to the predicted click-through rate (Qu et al., 2016) or performing ad bidding according to the estimated conversion rate (Zhang et al., 2014).
One key aspect of user response prediction is user modeling, which profiles each user by learning from her historical behavior data or other side information. Generally speaking, user behavior data have three characteristics. First, user behaviors not only reflect the intrinsic and multi-facet user interests (Jiang et al., 2014; Koren, 2008), but also reveal the temporal dynamics of user tastes (Koren, 2009). Second, as is shown in Figure 1, the length of behavior sequences varies across users because of different activeness or registration time. Third, there exist long-term dependencies in one's behavior history, where some behaviors that happened early may account for the final decision making of the user, as illustrated in the right plot of Figure 1. Moreover, the temporal dependency also shows multi-scale sequential patterns, i.e., various temporal behavior dependencies, across different users.
After two decades of rapid development of Internet service platforms, abundant user behavior sequences have accumulated in online platforms. Many works have been proposed for user modeling (Rendle et al., 2010; Zhou et al., 2018), especially with sequential modeling (Hidasi and Karatzoglou, 2018; Zhou et al., 2019). Some of the existing methods for user modeling aggregate the historical user behaviors for the subsequent preference prediction (Koren et al., 2009; Koren, 2008). However, they ignore the temporal dynamics of user behaviors (Koren, 2009). Sequential modeling for user response prediction conducts dynamic user profiling with sequential pattern mining, and some other works (Hidasi and Karatzoglou, 2018; Zhou et al., 2019) deal with temporal dynamics in this way. Nevertheless, these sequential models focus only on short-term sequences, e.g., the several latest behaviors of the user (Zhou et al., 2019) or the behavior sequence within a recent period of time (Hidasi and Karatzoglou, 2018), and abandon earlier user behaviors.
Consider the situation of recommending items manually. A human may first take one's intrinsic tastes into consideration (Zhang et al., 2018) and then consider her multi-facet interests (Koren, 2008; Jiang et al., 2014), e.g., various preferences over different item categories. Moreover, it is natural to combine one's long-term (Ying et al., 2018) and recent experience (Hidasi et al., 2016) so as to recommend items comprehensively.
To tackle these challenges and to overcome the shortcomings of the related works, we formulate the lifelong sequential modeling framework and propose a novel Hierarchical Periodic Memory Network (HPMN) that maintains user-specific behavior memories. Specifically, we build a personalized memorization for each user, which remembers both intrinsic user tastes and multi-facet user interests in a learned yet compressed memory. The model maintains hierarchical memories to retain long-term knowledge of user behaviors. The HPMN model also updates the memorization from newly arriving user behaviors with different periods at different layers, so as to capture multi-scale sequential patterns during the user's lifetime. Extensive experiments over three large-scale real-world datasets show significant improvements of our proposed model against several strong baselines, including the state of the art.
This paper has three main contributions listed as follows.
To the best of our knowledge, it is the first work to propose the lifelong sequential modeling framework, which conducts a unified, comprehensive and personalized user profiling, for user response prediction with extremely long user behavior sequence data.
In lifelong sequential modeling framework, we propose a memory network with incremental updating mechanism to learn from the retained knowledge of user lifelong data and the evolving user behavior sequences.
We further design a hierarchical architecture with multiple update periods to effectively mine and utilize the multi-scale sequential patterns in users’ lifelong behavior sequences.
The rest of our paper is organized as below. Section 2 presents a comprehensive survey of user response prediction works. Section 3 introduces the motivation and model design of our methodology in detail. The experimental setups with the corresponding results are illustrated in Section 4. We finally conclude this paper and discuss the future work in Section 5.
User response prediction is to model the interest of the user on the content from the provider and estimate the probability of the corresponding user event (Ren et al., 2018b), e.g., clicks and conversions. It has become a crucial part of the online services, such as search engines (Dupret and Piwowarski, 2008), recommender systems (Qu et al., 2016; Guo et al., 2017) and online advertising (Graepel et al., 2010; Zhou et al., 2018; He et al., 2014). Typically, user response prediction is formulated as a binary classification problem with user response likelihood as the training objective (Richardson et al., 2007; Graepel et al., 2010; Agarwal et al., 2010; Oentaryo et al., 2014).
From the view of methodology, linear models such as logistic regression (Lee et al., 2012; Gai et al., 2017) and non-linear models such as tree-based models (He et al., 2014) and factorization machines (Menon et al., 2011; Oentaryo et al., 2014) have been well studied. Recently, neural network models (Qu et al., 2016; Zhou et al., 2018) have attracted huge attention.
User modeling, i.e. to capture the latent interests of the user and derive the adaptive representation for each user, is the key component in user response prediction (Zhou et al., 2018; Zheng et al., 2017). The researchers have proposed many methodologies ranging from latent factor methods (Koren et al., 2009; Rendle, 2010) to deep representation learning methods (Qu et al., 2016; Zhou et al., 2018). These models aggregate all historical behaviors as a whole while ignoring the temporal and drifting user interests.
Nowadays, sequential user modeling has drawn great attention since the sequences of user behaviors carry rich information about user interests, especially their drifting trends. Sequential modeling has been a research hotspot in online systems (Zhou et al., 2019; Ren et al., 2018a; Villatel et al., 2018). From the perspective of modeling, there are three categories of sequential user modeling. The first is from the view of temporal matrix factorization (Koren, 2009). The second school ([…] 2010; He and McAuley, 2016; He et al., 2016) implicitly models the user state dynamics and derives the outcome behaviors. The third school is based on deep neural networks for their stronger capacity of feature extraction, such as recurrent neural networks (RNN) (Hidasi et al., 2016; Hidasi and Karatzoglou, 2018; Wu et al., 2017; Jing and Smola, 2017; Liu et al., 2016; Beutel et al., 2018; Villatel et al., 2018) and convolutional neural networks (CNN) that regard the behavior history as an image (Tang and Wang, 2018; Kang and McAuley, 2018).
However, these methods mainly focus on short-term user modeling, which is constrained to the most recent behaviors. Zhang et al. (2018) additionally utilized a static user representation for intrinsic user interests along with a short-term intent representation. Ying et al. (2018) proposed a hierarchical attentional method over a list of user behavior features for modeling long-term interests. But they can only capture simple sequential patterns and do not consider long-term and multi-scale behavior dependencies. Moreover, few of the existing works consider modeling the lifelong user behavior history, and thus they cannot properly establish a comprehensive user profile.
Memory networks have been proposed in natural language processing (NLP) tasks for explicitly remembering the extracted knowledge by maintaining it in an external memory component. Several works (Ebesu et al., 2018a; Chen et al., 2018; Huang et al., 2018; Wang et al., 2018) utilize memory networks for recommendation tasks. However, these methods directly reuse the memory network structure from NLP tasks, which does not consider practical issues in user response prediction. Specifically, they fail to consider multi-scale knowledge memorization or long-term dependencies. There is one work on a recurrent model with multi-scale pattern mining (Chung et al., 2016) in the NLP field. The essential difference is that their model was designed for natural language sentence modeling with fixed length, while our model supports lifelong sequential modeling through the maintained user memory and additionally considers long-term dependencies within extremely long user behavior sequences.
In this part, with discussions about the notations and preliminaries of user response prediction, we make a definition of lifelong sequential modeling and discuss some characteristics of it. Then we present the overall architecture of lifelong sequential modeling including the data flow with Hierarchical Periodic Memory Network (HPMN). The notations have been summarized in Table 1.
The data in the online system are formulated as a set of triples, each of which includes the user, the item and the corresponding label of the user behavior indicator.
Without loss of generality, we take click as the user behavior, and the goal is to estimate the click-through rate (CTR) of the user on the item at the given time. (In this paper, we focus on CTR estimation, while the estimation of other responses can be done in the same way.) The model approaches CTR prediction through a learned function with its own parameters. There are three parts of raw features: the feature vector of the target item, including the item ID and some side information; the context feature of the prediction request, such as the web page URL; and the user-side feature, which contains some side information and the sequence of items the user has interacted with (i.e., clicked). Note that the historical sequence length varies among different users.
The goal of sequential user modeling is to learn a function, with its own parameters, that produces a comprehensive representation of each user from her recent behaviors. Note that this user modeling can drift, since the user continues interacting with the online system and generating new behaviors. Many sequential user modeling works set a fixed value as the maximal length of the user behavior sequence, e.g., in (Hidasi and Karatzoglou, 2018) for session-based recommendation and in (Zhou et al., 2019), to capture recent user interests.
Thus the final task of user response prediction is to estimate the probability of the user action, i.e., a click, on the given item.
|The target user and the target item.|
|The true label and the predicted probability of user response.|
|Feature of user , item and the context information.|
|The side information of the user.|
|Feature of the -th interacted item in user’s behavior history.|
|The inferred sequential representation of the user.|
|The total behavior sequence length and the layer number of HPMN.|
|The index of sequential behavior and network layer ().|
|The maintained memory content in the -th layer at the -th time step.|
|The reading weight and the period for the -th memory slot maintained by the -th layer of HPMN.|
Recall that most existing works on sequential user modeling focus on recent behaviors, and only sometimes on the whole user behavior sequence. To the best of our knowledge, few of them consider lifelong sequential modeling. We define it as below.
Lifelong Sequential Modeling (LSM) in user response prediction is a process of continuous (online) user modeling with sequential pattern mining upon the lifelong user behavior history.
There are three characteristics of LSM.
LSM supports lifelong memorization of user behavior patterns. It is impossible for the model to maintain the whole behavior history of each user for real-time online inference. Thus it requires highly efficient knowledge preserving of user behavior patterns.
LSM should conduct a comprehensive user modeling of both intrinsic user interests and temporal dynamic user tastes, for future behavior prediction.
LSM also needs continuous adaptation to the up-to-date user behaviors.
Following the above principles, we propose an LSM framework for the whole evolving user behavior history, as is illustrated in Figure 2. Within the framework, we construct a personalized memory with several slots for each user. This memory is maintained through an incremental updating mechanism (Steps A and B in the figure) along with the evolving user behavior history.
As for online inference, when a user sends a visit request, the online service transmits the request, including the information of the target user and the target item. Each user request triggers a query procedure, in which we use the vector of the target item as the query to obtain the user representation associated with this specific item from the memory pool. The HPMN model takes the query vector to read the lifelong maintained personalized memory of that user and produces the corresponding user representation, without inference over the whole historical behavior sequence. After that, the user representation, the item vector and the context features are considered together for the subsequent user response prediction, which will be described in Section 3.4.
The details of HPMN will be presented in Section 3.3.
In this section, we first present the motivations of HPMN model and subsequently discuss the specific architectures.
Generally speaking, we propose HPMN model based on three considerations of the motivation.
As is stated above, the main goal of LSM is to capture sequential user patterns hidden in user behavior sequences. Many works (He et al., 2016; Hidasi et al., 2016; Zhou et al., 2019) have been proposed for sequential pattern mining to improve the subsequent prediction. Thus HPMN model firstly introduces sequential modeling through a recurrent component.
There also exist long-term dependencies among lifelong user behaviors, i.e., a later user decision may be related to her previous actions. We will show some examples in the experiments of the paper. However, traditional sequential modeling methods either rely on the recent user behaviors or update user states too frequently, which may result in memorization saturation and knowledge forgetting (Sodhani et al., 2018). Hence we incorporate a periodic memory updating mechanism to avoid unexpected knowledge drifting.
The behavior dependencies may span various time distances, e.g., a user may show preferences for a specific item at different times along the whole history. This requires multi-scale sequential pattern mining. The HPMN model deals with this by maintaining hierarchical memory slots with different update periods.
Moreover, since the personalized memory stores a comprehensive understanding of each user with multi-facet user preferences, the HPMN model incorporates a regularization of memory covariance to preserve diverse knowledge of user interests. Besides, for each query, the model reads the user memory in an attentional way, which tries to match the target item against the multi-facet user modeling knowledge.
Next we will describe the model details from four aspects. The memory architecture will be introduced in Section 3.3.1 followed by the description of periodic yet incremental updating mechanism in Section 3.3.2. We introduce the usage of the user memory in Section 3.3.3 and the covariance regularization in Section 3.3.4.
As is illustrated in Figure 2, each user has a user-specific memory pool containing several memory slots, each of which is a real-valued vector representation for user modeling. The idea of the external memory has been used in the NLP field (Miller et al., 2016; Kumar et al., 2016) for better memorization of the context information embedded in the previously consumed paragraph. We utilize this external memory pool for capturing the intrinsic user interests with temporal sequential patterns, yet it is also evolving and supports incremental memory updates along with the growing behavior sequences.
Generally speaking, the HPMN model is a layer-wise memory network with multiple layers, as is shown in Figure 3. Each layer maintains its own memory slot. The output of each layer at each time step (i.e., at each sequential user behavior) is transmitted not only to the next time step, but also to the next layer at the specific time step.
Considering the rapidly growing user-item interactions, it is impossible for the model to scan through the complete historical behavior sequence at each prediction time. That is the reason why almost all the existing methods only consider recent short-term user behaviors. Thus it is necessary to maintain only the latest memories and implement an incremental update mechanism in real time. After each user behavior on an item, the memory slot at each layer is updated by Eq. (4) according to the current time step and the update period of that layer. In Eq. (4), the memory writing in each layer is based on the Gated Recurrent Unit (GRU) (Cho et al., 2014) cell.
The parameters of the cell differ across layers. Note that this is a soft-writing operation on the memory slot, since the last operation of the cell function applies an "erase" vector, the same as in other memory network literature such as the Neural Turing Machine (NTM) (Graves et al., 2014). Note also that the first layer of memory is updated with the raw feature vector of the user-interacted item and the memory contents from the last time step.
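The displayed equations for this step did not survive in the text, so the following is only a hedged reconstruction under assumed notation, not necessarily the exact form used by the model. Assume $m_i^{j}$ denotes the memory of layer $j$ after the $i$-th behavior, $t^{j}$ the update period of layer $j$, $m_i^{0}$ the raw feature vector of the $i$-th interacted item, and $g^{j}$ a standard GRU cell with layer-specific parameters. All of these symbols are introduced here for illustration.

```latex
\begin{aligned}
m_i^{j} &=
\begin{cases}
g^{j}\!\left(m_i^{j-1},\, m_{i-1}^{j}\right) & \text{if } i \bmod t^{j} = 0,\\
m_{i-1}^{j} & \text{otherwise,}
\end{cases}\\[4pt]
z &= \sigma\!\left(W_z^{j}\, m_i^{j-1} + U_z^{j}\, m_{i-1}^{j} + b_z^{j}\right),\\
r &= \sigma\!\left(W_r^{j}\, m_i^{j-1} + U_r^{j}\, m_{i-1}^{j} + b_r^{j}\right),\\
\tilde{m} &= \tanh\!\left(W_h^{j}\, m_i^{j-1} + U_h^{j}\,(r \odot m_{i-1}^{j}) + b_h^{j}\right),\\
g^{j}\!\left(m_i^{j-1},\, m_{i-1}^{j}\right) &= (1 - z) \odot m_{i-1}^{j} + z \odot \tilde{m}.
\end{aligned}
```

In this form the gate $z$ decides how much of the old slot content is erased and overwritten, which matches the soft-writing description above.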
Moreover, the memory update is periodic: each memory slot is updated according to the time step and the period of its layer. Here we set the period of each layer as a hyperparameter, which is reported in Table 3 in the experimental setup. By applying this periodic updating mechanism, the upper layers are updated less frequently, which achieves two goals. (i) It avoids gradient vanishing or explosion, and is thus able to model long sequences better. (ii) It remembers long-term dependencies better than the memory maintained by the lower layers. The different update behaviors of each layer may capture multi-scale sequential patterns, which is illustrated in Section 4.3.
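The periodic, layer-wise update described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the released implementation: the class `GRUCell`, the function `update_memory`, the shared dimension for items and memory slots, and the example periods `[1, 2, 4]` are all assumptions made for the sketch.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class GRUCell:
    """Standard GRU cell; in this sketch each layer owns one cell (hypothetical helper)."""

    def __init__(self, dim, rng, scale=0.1):
        # Each gate acts on the concatenation [input, previous memory].
        self.W_z = rng.normal(0.0, scale, (dim, 2 * dim))
        self.W_r = rng.normal(0.0, scale, (dim, 2 * dim))
        self.W_h = rng.normal(0.0, scale, (dim, 2 * dim))
        self.b_z = np.zeros(dim)
        self.b_r = np.zeros(dim)
        self.b_h = np.zeros(dim)

    def __call__(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(self.W_z @ xh + self.b_z)  # update gate (soft "erase")
        r = sigmoid(self.W_r @ xh + self.b_r)  # reset gate
        candidate = np.tanh(self.W_h @ np.concatenate([x, r * h]) + self.b_h)
        return (1.0 - z) * h + z * candidate


def update_memory(memory, item_vec, step, cells, periods):
    """Apply one incremental update for the behavior at 1-based `step`.

    memory  : list of per-layer slot vectors (lowest layer first)
    periods : update period of each layer (assumed values, e.g. [1, 2, 4])
    Returns a new list; slots whose period does not divide `step` are kept.
    """
    new_memory = [m.copy() for m in memory]
    layer_input = item_vec  # the first layer reads the raw item vector
    for j, (cell, period) in enumerate(zip(cells, periods)):
        if step % period == 0:
            new_memory[j] = cell(layer_input, memory[j])
        layer_input = new_memory[j]  # upper layers read the layer below
    return new_memory
```

With periods `[1, 2, 4]`, the bottom slot is rewritten at every behavior, the middle slot at every second behavior and the top slot at every fourth, so only the stored slots and the newest item are needed for each update.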
The similar idea of clockwork update has been implemented in RNN model (Koutnik et al., 2014). However, they simply split the parameters in the recurrent cell and update the hidden states separately. We make two improvements that (i) we connect the network layers through state transferring so as to make layer-wise information transmitting; (ii) we incorporate the external memory component to preserve both intrinsic and multi-scale sequential patterns for lifelong sequential modeling.
Till now, the model has conducted the long-term memorization of the intrinsic and multi-scale temporal dynamics, which may connect the intrinsic properties and the multi-scale patterns of behavior dependency, to the current user response prediction. Besides, we conduct the attentional memory usage similar to the common memory networks (Weston, 2015; Sukhbaatar et al., 2015; Graves et al., 2014).
We calculate the comprehensive user representation as a weighted combination of the memory slots maintained at the last time step of the long-term sequence, i.e., after the final behavior log of the user. The weight of each memory slot represents its contribution to the final representation, and it is calculated from an energy model that measures the relevance between the query vector and the long-term memory. The way we calculate the attention through the energy function is similar to that in the NLP field (Bahdanau et al., 2015).
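The attentional memory read can be sketched as follows. The exact form of the energy model is not recoverable from the text, so this sketch assumes a one-hidden-layer tanh scorer over the concatenated slot and query, in the style of additive attention; `read_memory`, `W` and `u` are hypothetical names.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()


def read_memory(memory, query, W, u):
    """Read a user representation from the memory with attention.

    memory : (D, m) array, one row per memory slot
    query  : (q,) vector of the target item
    W, u   : parameters of the assumed energy model, shapes (k, m + q) and (k,)
    Returns the weighted sum of slots and the attention weights.
    """
    num_slots = memory.shape[0]
    pairs = np.concatenate([memory, np.tile(query, (num_slots, 1))], axis=1)
    energies = np.tanh(pairs @ W.T) @ u  # one relevance score per slot
    weights = softmax(energies)          # weights are positive and sum to 1
    return weights @ memory, weights
```

Because only the fixed number of slots is scored, the cost of a query does not grow with the length of the behavior history.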
As is described in the previous sections, the maintained user memory captures long-term sequential patterns with multi-facet user interests. Recall that our model uses several memory slots of a fixed dimension to memorize user behavior patterns. We expect that different memories store knowledge of user interests from different perspectives. However, unlike models such as NTM (Graves et al., 2014), HPMN does not utilize an attention mechanism to reduce redundancy when updating memory slots. In order to facilitate memorization utility, we apply a covariance regularization on the memories, following (Cogswell et al., 2016).
Specifically, we first define the covariance matrix of the memory contents. It is computed from the matrix of memories and the mean matrix taken with regard to each row of the memory matrix, normalized by the dimension of each memory slot. Note that the mean matrix has the same shape as the memory matrix. After that, we define the loss that regularizes the covariance in terms of the Frobenius norm of the covariance matrix. We need to minimize the covariance between different memory slots, which corresponds to penalizing the norm of the covariance matrix.
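A minimal sketch of such a covariance penalty is given below. It assumes the form used by Cogswell et al. (2016), i.e., half the squared Frobenius norm of the covariance matrix with its diagonal removed; whether HPMN uses exactly this form is an assumption, and `covariance_penalty` is a hypothetical name.

```python
import numpy as np


def covariance_penalty(memory):
    """Penalize covariance between different memory slots.

    memory : (D, m) array, one row per slot, m = slot dimension
    Assumed form: 0.5 * (||C||_F^2 - ||diag(C)||_2^2), which equals half
    the sum of squared off-diagonal entries of C.
    """
    m = memory.shape[1]
    centered = memory - memory.mean(axis=1, keepdims=True)  # subtract row means
    cov = centered @ centered.T / m                         # (D, D) covariance
    off_diag = cov - np.diag(np.diag(cov))
    return 0.5 * np.sum(off_diag ** 2)
```

Two identical slots give a positive penalty, while slots whose centered contents are orthogonal give zero, which is the behavior the regularizer is meant to encourage.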
For each prediction request, we obtain the comprehensive user representation by querying the personalized memory of the target user, following Eqs. (2) and (6). The final estimate of the user response probability is calculated as shown in Figure 4 by a prediction function implemented as a multi-layer deep network with three layers, whose widths are 200, 80 and 1, respectively. The first and second layers use ReLU as the activation function, while the third layer uses the sigmoid function.
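The prediction network can be sketched as a plain forward pass. The layer widths (200, 80, 1) and the activations follow the text; the He-style initialization, the concatenation order of the inputs and the names `init_mlp` and `predict` are assumptions for illustration.

```python
import numpy as np


def init_mlp(input_dim, rng, widths=(200, 80, 1)):
    """Create (weight, bias) pairs for a three-layer network (assumed init)."""
    params = []
    prev = input_dim
    for width in widths:
        weight = rng.normal(0.0, np.sqrt(2.0 / prev), (prev, width))
        params.append((weight, np.zeros(width)))
        prev = width
    return params


def predict(user_repr, item_vec, context_vec, params):
    """Estimate the response probability from the concatenated features."""
    h = np.concatenate([user_repr, item_vec, context_vec])
    for weight, bias in params[:-1]:
        h = np.maximum(h @ weight + bias, 0.0)   # ReLU on the hidden layers
    weight, bias = params[-1]
    logit = (h @ weight + bias)[0]
    return float(1.0 / (1.0 + np.exp(-logit)))   # sigmoid on the output
```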
As for the loss function, we adopt end-to-end training and combine (i) the widely used cross-entropy loss (Zhou et al., 2018, 2019; Ren et al., 2018b) over the whole dataset with (ii) the covariance regularization and (iii) the parameter regularization. We utilize gradient descent for optimization. In the final loss function, each of the two regularization losses carries its own weight, the parameter regularization covers the set of model parameters of HPMN, and the cross-entropy loss is taken over the training dataset.
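As a hedged reconstruction of this objective, assume $N$ training samples with labels $y_s$ and predictions $\hat{y}_s$, a covariance loss $\mathcal{L}_c$, model parameters $\Theta$, and regularization weights $\lambda$ and $\mu$. These symbols, the averaging over $N$ and the exact scaling of the parameter term are assumptions, since the original display was lost.

```latex
\mathcal{L}
= -\frac{1}{N} \sum_{s=1}^{N}
  \Big[\, y_s \log \hat{y}_s + (1 - y_s) \log\!\big(1 - \hat{y}_s\big) \Big]
  \;+\; \lambda\, \mathcal{L}_c
  \;+\; \mu\, \lVert \Theta \rVert_2^2
```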
Discussions. We propose lifelong sequential user modeling with a personalized memory for each user. The memory is updated periodically to capture long-term yet multi-scale sequential patterns of user behavior. For user response prediction, the maintained user memory is queried with the target item to forecast the user preference for that item.
Note that LSM has some essential differences from the lifelong machine learning (LML) proposed by (Chen and Liu, 2016). First, the retained knowledge in LSM is user-specific, while that in LML is model-specific. Second, LSM is conducted for user modeling, while LML aims at continuous multi-task learning (Chen et al., 2016). Finally, the user behavior patterns drift in LSM, while the data samples and tasks change in LML.
The retained yet compressed memory guarantees that the time complexity of our model is acceptable for industrial production. The personalized memory is created at the first registration of the user and maintained by the HPMN model as lifelong modeling. For each prediction, the model only needs to query the maintained memory, rather than inferring over the whole behavior sequence as done by other related works (Hidasi et al., 2016; Zhou et al., 2019). Meanwhile, our model has the advantage of sequential behavior modeling over aggregation-based models, such as traditional latent factor models (Koren, 2009, 2008). For memory updating, the time complexity depends on the calculation time of the recurrent component. All the matrix operations can be executed in parallel on GPUs.
The model parameters of HPMN can be updated in a normal way as common methods (Qu et al., 2016; Zhou et al., 2018) where the model is retrained periodically depending on the specific situations. The number of memory slots is the hyperparameter and the specific slot number depends on the practical situation. Along with the lifelong sequential user modeling, the memory of each user is expanded accordingly. We conduct an experiment about the relations between the number of memory slots and the task performance and discuss in Section 4.3. We may follow (Sodhani et al., 2018) and expand the memory when the performance drops in some margin. However, we only need to add one layer with a larger updating period on the top, without retraining all the parameters of HPMN as that in (Sodhani et al., 2018).
In this section, we present the details of the experiment setups and the corresponding results. We also discuss an extended investigation to illustrate the effectiveness of our model. Moreover, we have published our code (reproducible code link: https://github.com/alimamarankgroup/HPMN).
We start with three research questions (RQs) to lead the experiments and discussions.
Does the incorporation of the lifelong behavior sequence contribute to the final user response prediction?
Under the comparable experimental settings, does HPMN achieve the best performance?
What patterns does HPMN capture from user behavior sequences? Does it have the ability to capture long-term, short-term and multi-scale sequential patterns?
In this part, we present the experiment setups including dataset description, preprocessing method, evaluation metrics, experiment flow and the discussion of the compared settings.
We evaluate all the compared models over three real-world datasets. The statistics of the three datasets are shown in Table 2.
Amazon (McAuley et al., 2015) is a collection of user browsing logs over e-commerce products with reviews and product metadata from Amazon Inc. We use the subset of Electronic products, which contains user behavior logs from May 1999 to July 2014. Moreover, we regard all the user reviews as user click behaviors. This processing method has been widely used in the related works (Zhou et al., 2019, 2018).
Taobao (Zhu et al., 2018) is a dataset of user behaviors from the commercial platform of Taobao. The dataset contains several types of user behaviors, including click, purchase, add-to-cart and item favoring. It consists of user behavior sequences from nearly one million users from November 25 to December 3, 2017.
XLong is sampled from the click logs of more than twenty thousand users on the Alibaba e-commerce platform from April to September 2018. It contains relatively longer historical behavior sequences than the other two datasets. Note that there is no public dataset containing such a long behavior history of each user for sequential user modeling. We have published this dataset for further research (dataset download link: https://tianchi.aliyun.com/dataset/dataDetail?dataId=22482).
Dataset Properties. These datasets are selected as typical examples in real-world applications. Amazon dataset covers a very long time range of user behaviors during about fifteen years while some of the users were inactive and generated relatively sparse behaviors during this long time range. For XLong dataset, each user has a behavior sequence of one thousand clicks that happened in half a year. And modeling such long sequence is a major challenge for lifelong sequential modeling. As for Taobao dataset, although it only covers nine days’ logs, the users in it have generated quite a few behaviors which reflects that the users are quite active.
Dataset Preprocessing. To simulate the environment of lifelong sequential modeling, for each dataset we sort the behaviors of each user by timestamp to form the lifelong behavior sequence of that user. Given the behaviors of a user, we use this behavior sequence to predict the user response probability at the target item of the subsequent behavior. Note that 50% of the target items at the prediction time in each dataset have been replaced with another item from the non-clicked item set of each user, to build the negative samples.
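The preprocessing above can be sketched for a single user as follows. This is an illustration under assumptions: the function name `build_sample`, the `(timestamp, item_id)` input layout, the use of the last behavior as the prediction target, and the per-sample coin flip (which yields 50% negatives only in expectation) are not taken from the released code.

```python
import random


def build_sample(behaviors, all_items, rng):
    """Build one (history, target, label) sample for a user.

    behaviors : list of (timestamp, item_id) pairs, in any order
    all_items : iterable of all item ids in the dataset
    rng       : random.Random instance, for reproducibility
    """
    ordered = [item for _, item in sorted(behaviors, key=lambda b: b[0])]
    history, target, label = ordered[:-1], ordered[-1], 1
    if rng.random() < 0.5:
        # Replace the target with an item this user never clicked.
        candidates = sorted(set(all_items) - set(ordered))
        if candidates:
            target, label = rng.choice(candidates), 0
    return history, target, label
```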
Training & Test Splitting. We split the training and test dataset according to the timestamp of the prediction behavior. We set a cut time within the time range covered by the full dataset. If the prediction behavior of a sequence took place before the cut time, the sequence is put into the training set. Otherwise it would be in the test set. In this way, training set is about 70% of the whole dataset and test set is about 30%.
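A time-based split of this kind can be sketched as below. The text only states that a cut time is set and that the result is roughly 70/30; deriving the cut time from a quantile of the prediction times, the `"time"` key and the name `split_by_cut_time` are assumptions for illustration.

```python
def split_by_cut_time(samples, train_fraction=0.7):
    """Split samples by the timestamp of their prediction behavior.

    samples : non-empty list of dicts, each with a "time" key
    Samples before the cut time go to training, the rest to test.
    """
    times = sorted(s["time"] for s in samples)
    cut_time = times[int(len(times) * train_fraction)]
    train = [s for s in samples if s["time"] < cut_time]
    test = [s for s in samples if s["time"] >= cut_time]
    return train, test, cut_time
```

Splitting by time rather than at random keeps every test prediction later than the training predictions, which mimics how the model would be used online.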
We use two measurements for the user response prediction task. The first metric is the area under the ROC curve (AUC), which assesses the pairwise ranking performance of the classification results between the clicked and non-clicked samples. The other metric is Log-loss, calculated over all the samples in the test set. Log-loss measures the overall likelihood of the whole test data and has been widely used for classification tasks (Ren et al., 2016; Qu et al., 2016).
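Both metrics can be sketched in pure Python. The Log-loss below is the standard averaged negative log-likelihood, and the AUC is the standard pairwise definition (ties count one half); the clipping constant `eps` and the function names are choices made for this sketch. The quadratic pairwise loop is meant for illustration, not for large test sets.

```python
import math


def log_loss(labels, probs, eps=1e-12):
    """Average negative log-likelihood of binary labels under `probs`."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)


def auc(labels, probs):
    """Fraction of (clicked, non-clicked) pairs that are ranked correctly."""
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    wins = sum(
        1.0 if pp > pn else 0.5 if pp == pn else 0.0
        for pp in pos
        for pn in neg
    )
    return wins / (len(pos) * len(neg))


# Three of the four (clicked, non-clicked) pairs are ordered correctly.
print(auc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.6]))  # 0.75
```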
Recall that each sample of user behaviors contains at most a fixed number of interacted items. Because some of our baseline models were proposed to model recent short behavior sequences, we first split off the recent user behaviors as the short-term sequential data for baseline model evaluation, as is shown in Table 2. Moreover, for fair comparison, we also conduct the experiments over the whole lifelong sequences for all the baselines.
Note that, all the compared models are fed with the same features including contextual features and side information for fair comparison.
Finally, we conduct a significance test to verify the statistical significance of the performance improvement of our model against the baseline models. Specifically, we deploy a Mann-Whitney U test (Mason and Graham, 2002) under the AUC metric, and a t-test (Bhattacharya and Habtzghi, 2002) under the Log-loss metric.
To show the effectiveness of our method, we compare it with three groups of eight baselines. The first group consists of aggregation-based models, which aggregate the user behaviors for user modeling and response prediction without considering the sequential patterns.
DNN is a multi-layer feed-forward deep neural network which has been widely used as the base model in recent works (Zhou et al., 2018; Zhang et al., 2016; Qu et al., 2016). We follow (Zhou et al., 2018) and use a sum pooling operation to integrate all the sequential behavior features, concatenated with the other features, as the user representation.
SVD++ (Koren, 2008) is an MF-based model that combines the user clicked items and latent factors for response prediction.
The second group contains short-term sequential modeling methods, including RNN-based models, CNN-based models and a memory network model. These methods either use the behavior data within a session or truncate the recent behavior sequence to a fixed length.
GRU4Rec (Hidasi et al., 2016) is based on RNN and is the first work to use recurrent cells to model sequential user behaviors. It was originally proposed for session-based recommendation.
Caser (Tang and Wang, 2018) is a CNN-based model that uses horizontal and vertical convolutional filters to capture behavior patterns at different scales.
DIEN (Zhou et al., 2019) is a two-layer RNN structure with an attention mechanism. It uses the calculated attention values to control the second RNN layer to model drifting user interests.
The third group consists of long-term sequential modeling methods. Note, however, that our HPMN model is the first work on lifelong sequential modeling for user response prediction.
LSTM (Hochreiter and Schmidhuber, 1997) is the first model for long-term sequential modeling, though its memory capacity is limited.
SHAN (Ying et al., 2018) is a hierarchical attention network. It uses two attention layers to handle the user’s long- and short-term sequences, respectively. However, this model does not capture sequential patterns.
HPMN is our proposed model described in Section 3.
We first evaluate the models in the second group over the short-length data, as they were proposed for short-term sequential modeling. Then we test all the models over the whole-length data in comparison with our proposed model.
Some state-based user models (Rendle et al., 2010; He and McAuley, 2016) have already been compared in (Tang and Wang, 2018), so we only compare with the state-of-the-art Caser (Tang and Wang, 2018). We omit comparison with the other memory-based models (Ebesu et al., 2018b; Huang et al., 2018) since they do not aim at sequential user modeling.
For online inference, all of the baselines except the memory models, i.e., RUM and HPMN, need to load the whole user behavior sequence to conduct user modeling for response prediction, while the memory-based models only need to read the user’s personalized memory contents for the subsequent prediction. Thus the memory-based models are more space-efficient for online sequential modeling.
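A back-of-the-envelope comparison illustrates the space argument. The memory configuration (6 layers, slot size 32) follows the XLong row of Table 3; the sequence length of 1000 and the item embedding size of 32 are hypothetical values chosen only for illustration, and the function names are not from the paper.

```python
def memory_footprint(num_layers: int, slot_size: int) -> int:
    """Number of floats a memory-based model reads per user at inference time."""
    return num_layers * slot_size


def sequence_footprint(seq_len: int, embed_dim: int) -> int:
    """Number of floats a sequence model loads per user at inference time."""
    return seq_len * embed_dim


if __name__ == "__main__":
    mem = memory_footprint(num_layers=6, slot_size=32)      # Table 3, XLong
    seq = sequence_footprint(seq_len=1000, embed_dim=32)    # hypothetical values
    print(mem, seq, seq / mem)
```

Under these assumed numbers the memory read is fixed in size, whereas the sequence load grows linearly with the length of the behavior log.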
The difference between our model and the other memory network model, i.e., RUM, is two-fold. (i) RUM implements the memory architecture following (Miller et al., 2016) from NLP tasks, which may not be appropriate for user response prediction, since user-generated data are quite different from language sentences. The experiment results in the section below also reflect this. (ii) Our model utilizes periodically updated memories through a hierarchical network to capture multi-scale sequential patterns, while RUM does not consider them.
There are two sets of hyperparameters. The first set is the training hyperparameters, including the learning rate and the regularization weight, each of which is tuned over a set of candidate values. The batch size is fixed at 128 for all the models. The hyperparameters of each model are tuned and the best performance is reported below. The second set is the structure hyperparameters of the HPMN model, including the size of each memory slot and the update period of each layer, which are shown in Table 3. The reported update periods are listed from the first (lowest) layer to the last (highest).
| Dataset | Mem. Size | Update Periods |
|---|---|---|
| Amazon | 32 | 3 layers: 1, 2, 4 |
| Taobao | 32 | 4 layers: 1, 2, 4, 12 |
| XLong | 32 | 6 layers: 1, 2, 4, 8, 16, 32 |
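To make the update periods concrete, the sketch below lists which layers would refresh their memory at each step of a behavior sequence. It assumes that a layer with period p is updated at every step that is a multiple of p (steps counted from 1); this reading of the schedule is an assumption for illustration, and the exact update rule is the one defined in Section 3. The function names are hypothetical.

```python
from typing import List


def layers_updated_at(step: int, periods: List[int]) -> List[int]:
    """0-based indices of layers whose memory is refreshed at the 1-based `step`."""
    return [layer for layer, period in enumerate(periods) if step % period == 0]


def update_counts(seq_len: int, periods: List[int]) -> List[int]:
    """How many times each layer is updated over a sequence of `seq_len` behaviors."""
    counts = [0] * len(periods)
    for step in range(1, seq_len + 1):
        for layer in layers_updated_at(step, periods):
            counts[layer] += 1
    return counts


if __name__ == "__main__":
    amazon_periods = [1, 2, 4]  # Table 3, Amazon
    for step in range(1, 9):
        print(step, layers_updated_at(step, amazon_periods))
    print(update_counts(8, amazon_periods))
```

Under this assumption the lowest layer is touched at every behavior, while higher layers change less and less often, which is what lets them retain longer-range patterns.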
In this section, we present the experiment results in Table 4 and conduct an analysis from several perspectives. Recall that the compared models are divided into three groups as mentioned in Sec. 4.1.4.
Comparison between HPMN and baselines. From Table 4, we can tell that HPMN improves the performance significantly against all the baselines and achieves state-of-the-art performance (RQ2).
The aggregation-based models in Group 1, i.e., DNN and SVD++, do not perform as well as the sequential modeling methods, which indicates that there exist sequential patterns in user behavior data and that simply aggregating user behavior features may result in poor performance.
Compared with the other sequential modeling methods of Group 2, HPMN outperforms all of them regardless of the length of user behavior sequences. Since GRU4Rec was proposed for short-term session-based recommendation, it has the same issue as LSTM, which may lose some knowledge of the long-term behavior dependency. Though the attention mechanism of DIEN improves the performance over GRU4Rec by a large margin, it still ignores the multi-scale user behavior patterns, which will be illustrated by an example in the next section. Moreover, the DIEN model has to conduct online inference over the whole sequence for prediction, which lacks practical efficiency for extremely long, especially lifelong, user behavior sequences. From the results of Caser, which uses CNN to extract sequential patterns, we may tell that the convolution operation may not be appropriate for sequential user modeling. As for the RUM model, though it utilizes an external memory for user modeling, it fails to capture sequential patterns, which results in quite poor performance. Moreover, RUM was originally optimized for other metrics (Chen et al., 2018), e.g., precision and recall, thus it may not perform well for user response prediction.
By comparing HPMN with the models in Group 3, i.e., LSTM and SHAN, we find that although both baselines are proposed to deal with long-term user modeling, HPMN has better performance on the very long sequences. The reason would be that LSTM has limited memory capacity to retain the knowledge, and SHAN has not considered any sequential patterns in the user behaviors.
Analysis about Lifelong Sequential Modeling. Recall that we evaluate all the short-term sequential modeling methods on both the short sequence data and the lifelong sequence data, as shown in Table 4, where we have marked the performance gain or drop of the latter case relative to the former.
From the table, we find that almost all the models gain an improvement when modeling the lifelong user behavior sequences on the Amazon and Taobao datasets. However, on the XLong dataset, the performance of GRU4Rec, Caser and DIEN drops, while the memory-based model, i.e., RUM, achieves better performance than it does on short sequences. Note that our HPMN model performs best. All these phenomena reflect that incorporating lifelong sequences contributes to better user modeling and response prediction (RQ1). Nevertheless, lifelong modeling also requires a well-designed memory model, and our HPMN model achieves satisfying performance on this problem.
Model Convergence. We plot the learning curves of the HPMN model over the three datasets in Figure 5. As shown in the figure, HPMN converges quickly: the Log-loss values on all three datasets drop to stable convergence after about one iteration over the whole training set.
In this section, we further investigate the patterns that HPMN captures when dealing with lifelong sequence (RQ3) and the model capacity of memorization.
Sequential Patterns with Multi-scale Dependency. In Figure 6, we plot three real examples of user behavior sequences sampled from the XLong dataset. These three sequences reflect the long-term, short-term and multi-scale sequential patterns captured by HPMN, respectively.
In the first example, the target item is “lotion”, clicked by the user at the final prediction time. In her behavior history, there are several clicks on lotions at the 31st, 33rd and 37th positions of her behavior sequence, which are far from the tail of her latest behaviors. When the HPMN model takes the target item as the query to construct the user representation, the attention heatmap calculated by HPMN as in Eq. (7) shows that the fifth layer of HPMN, whose update period is relatively large, has the maximum attention. This shows that HPMN captures long-term sequential patterns in the memory maintained by the high layers.
In the second example, User 2 finally clicked a desk, and some similar items (table, cabinet) were also clicked in the very recent history. However, these kinds of furniture were not clicked in the earlier part of the sequence. The first memory of HPMN has the maximum attention value, which shows that the lower layer is better at modeling short-term patterns because it updates the memory more frequently to capture the user’s short-term interests.
As for User 3, the click behavior on the target item has both long-term and short-term dependencies: similar items are clicked both in the recent history and in the earlier part of her behavior sequence. After inference through the HPMN model, the second and fifth layers have higher attention values, as they capture short-term and long-term dependencies, respectively. This demonstrates that HPMN has the ability to capture multi-scale sequential patterns.
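The attention read over the per-layer memory slots can be sketched as below. This is a simplified illustration: it scores each slot by a dot product with the target-item query and normalizes with a softmax, which is an assumption standing in for the scoring function of Eq. (7); the vectors and function names are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_read(
    query: List[float], memory: List[List[float]]
) -> Tuple[List[float], List[float]]:
    """Return per-layer attention weights and the weighted sum of memory slots."""
    # Assumption: dot-product scoring between the query and each memory slot.
    scores = [sum(q * m for q, m in zip(query, slot)) for slot in memory]
    weights = softmax(scores)
    read = [
        sum(w * slot[d] for w, slot in zip(weights, memory))
        for d in range(len(query))
    ]
    return weights, read


if __name__ == "__main__":
    target_item = [1.0, 0.0]
    user_memory = [[0.0, 1.0], [2.0, 0.0], [0.0, 0.0]]  # one slot per layer
    layer_weights, user_repr = attentive_read(target_item, user_memory)
    print(layer_weights, user_repr)
```

In this toy case the second slot aligns with the query and receives the largest weight, analogous to a particular layer dominating the attention heatmap in Figure 6.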
Memory Capacity. In Figure 7, we plot the AUC performance of HPMN with different numbers of memory slots on the XLong dataset. Note that the number of memory slots is equal to the number of HPMN layers. On the one hand, when the number of memory slots for each user is less than 5, the prediction performance of the model rises sharply as the memory increases. This indicates that long behavior sequences require a large memory of sequential patterns, and that increasing the memory as the user behavior sequence grows helps HPMN to better capture lifelong sequential patterns. On the other hand, when the number of memory slots is larger than 5, the AUC score drops slightly as the number of slots increases. This demonstrates that the model capacity is constrained by the specific length of the user behavior sequence. These observations provide some guidance on memory expansion and on how to enlarge the HPMN model for lifelong sequential modeling with evolving user behavior sequences, as discussed in Section 3.4.
In this paper, we present lifelong sequential modeling for user response prediction. To achieve this goal, we construct a framework with a memory network model that maintains a personalized hierarchical memory for each user. The model updates the corresponding user memory through a periodic updating mechanism to retain the knowledge of multi-scale sequential patterns. The user's lifelong memory is then read with attention for the subsequent user response prediction. Extensive experiments have demonstrated the advantage of lifelong sequential modeling, and our model has achieved a significant improvement against strong baselines, including the state of the art.
In the future, we will adopt our lifelong sequential modeling to improve multi-task user modeling such as prediction of both user clicks and conversions (Ma et al., 2018). We also plan to investigate learning for dynamic update period of each layer, to capture more flexible user behavior patterns.
Acknowledgments. The work is sponsored by Alibaba Innovation Research. The corresponding author Weinan Zhang thanks the support of National Natural Science Foundation of China (61702327, 61772333, 61632017) and Shanghai Sailing Program (17YF1428200).