Code for the IJCAI'19 paper "Deep Session Interest Network for Click-Through Rate Prediction"
Click-Through Rate (CTR) prediction plays an important role in many industrial applications, such as online advertising and recommender systems. How to capture users' dynamic and evolving interests from their behavior sequences remains an ongoing research topic in CTR prediction. However, most existing studies overlook the intrinsic structure of the sequences: the sequences are composed of sessions, where sessions are user behaviors separated by their occurring time. We observe that user behaviors are highly homogeneous within each session and heterogeneous across sessions. Based on this observation, we propose a novel CTR model named Deep Session Interest Network (DSIN) that leverages users' multiple historical sessions in their behavior sequences. We first use a self-attention mechanism with bias encoding to extract users' interests in each session. Then we apply Bi-LSTM to model how users' interests evolve and interact among sessions. Finally, we employ the local activation unit to adaptively learn the influences of various session interests on the target item. Experiments are conducted on both advertising and production recommender datasets, and DSIN outperforms other state-of-the-art models on both datasets.
Recommender systems (RS) are becoming increasingly indispensable in assisting users to find their preferred items in web-scale applications such as Amazon and Taobao. Typically, an industrial recommender system consists of two stages: candidate generation and candidate ranking [Covington et al., 2016]. The candidate generation stage adopts some naive but time-efficient recommendation algorithms (e.g. item-based collaborative filtering [Sarwar et al., 2001]) to provide a relatively small set of items from the huge whole item set for ranking. In the candidate ranking stage, complex but powerful models (e.g. neural network methods) are applied to rank the candidates so as to select the top-k items for recommendation. In this paper, we mainly focus on the candidate ranking stage and treat it as a Click-Through Rate (CTR) prediction task. That is, we assume a relatively small item set has been provided for ranking and we rank items according to their predicted CTR scores.
Some recent effective CTR models [Covington et al., 2016; Zhou et al., 2018c; Zhou et al., 2018b; Zhou et al., 2018a] show promising results by utilizing users’ sequential behaviors, which reflect users’ dynamic and evolving interests. However, these models overlook the intrinsic structure of the sequences: the sequences are composed of sessions. A session is a list of interactions (user behaviors) that occur within a given time frame. We observe that user behaviors are highly homogeneous within each session and heterogeneous across sessions. As shown in Figure 1, we sample a user from a real-world industrial application and split her behavior sequence into 3 sessions. A new session starts whenever there is a time gap of more than 30 minutes between adjacent behaviors [Grbovic and Cheng, 2018]. The user mainly browses trousers in session 1, finger rings in session 2, and coats in session 3. The phenomenon illustrated in Figure 1 is general. It reflects the fact that a user usually has a single clear intention within one session, while the user’s interest can change sharply when a new session starts.
Motivated by the above observations, we propose Deep Session Interest Network (DSIN; code available at https://github.com/shenweichen/DSIN) to model users’ sequential behaviors in the CTR prediction task by leveraging their multiple historical sessions. There are three key components in DSIN. First, we naturally divide users’ sequential behaviors into sessions and then use a self-attention network with bias encoding to model each session. Self-attention can capture the inner interaction/correlation of session behaviors and then extract users’ interests in each session. These various session interests may be correlated with each other and even follow a sequential pattern [Quadrana et al., 2017]. So in the second part, we apply bi-directional LSTM (Bi-LSTM) to capture the interaction and evolution of users’ varying historical session interests. Finally, because various session interests have different influences on the target item, we design the local activation unit to aggregate them w.r.t. the target item to form the final representation of the behavior sequence.
The main contributions of this paper are summarized as follows:
We highlight that user behaviors are highly homogeneous within each session and heterogeneous across sessions, and propose a new model named DSIN, which can effectively model the user’s multiple sessions for CTR prediction.
We design a self-attention network with bias encoding to get an accurate interest representation of each session. Then we employ Bi-LSTM to capture the sequential relationship among historical sessions. Finally, we employ the local activation unit for aggregation, considering the influences of different session interests on the target item.
Two groups of comparative experiments are conducted on both advertising and production recommender datasets. The experiment results demonstrate the superiority of our proposed DSIN compared with other state-of-the-art models in the CTR prediction task.
The organization of the remaining parts of this paper is as follows. Section 2 introduces some related work. Section 3 gives the detailed description of our DSIN model. Section 4 presents our experiment results and analyses on both advertising and recommender datasets.
In this section, we mainly introduce existing studies of the CTR prediction and session-based recommendation.
Recent CTR models mainly pay attention to the interaction between features. Wide&Deep [Cheng et al., 2016] combines the linear representation of features. DeepFM [Guo et al., 2017] learns the second-order crossover of features and DCN [Wang et al., 2017] applies a multi-layer residual structure to learn higher-order representations of features. AFM [Xiao et al., 2017] argues that not all feature interactions are equally predictive and uses an attention mechanism to automatically learn the weights of cross-features. To sum up, the higher-order representation and interaction of features significantly improve the expressive ability of features and the generalization ability of models.
Users’ sequential behaviors imply users’ dynamic and evolving interests and have been widely proven effective in the CTR prediction task. YoutubeNet [Covington et al., 2016] transforms embeddings of users’ watching lists into a vector of fixed length by average pooling. Deep Interest Network (DIN) [Zhou et al., 2018c] uses an attention mechanism to learn the representation of users’ historical behaviors w.r.t. the target item. ATRANK [Zhou et al., 2018a] proposes an attention-based framework modeling the influence between users’ heterogeneous behaviors. Deep Interest Evolution Network (DIEN) [Zhou et al., 2018b] uses auxiliary loss to adjust the expression of current behavior to the next behavior and then models the specific interest evolving process for different target items with AUGRU. Modeling users’ sequential behaviors enriches the representation of the user and improves the prediction accuracy significantly.
The concept of session is commonly mentioned in sequential recommendation but rare in the CTR prediction task. Session-based recommendation benefits from the dynamic evolution of users’ interests in sessions. General Factorization Framework (GFF) [Hidasi and Tikk, 2016] uses sum pooling of items to represent a session. Each item has two kinds of representations: one represents the item itself and the other represents the context of the session. Recently, RNN-based approaches [Hidasi et al., 2015; Hidasi et al., 2016; Li et al., 2018] have been applied to session-based recommendation to capture the order relationship within a session. Based on that, [Li et al., 2017] proposes a novel attentive neural network framework (NARM) to model the user’s sequential behavior and capture the user’s main purpose in the current session. Hierarchical RNN [Quadrana et al., 2017] is proposed to relay and evolve latent hidden states of the RNNs across users’ historical sessions. Besides RNNs, [Liu et al., 2018; Kang and McAuley, 2018] apply only self-attention based models to effectively capture long-term and short-term interests of a session. [Tang and Wang, 2018] uses a convolutional neural network and [Chen et al., 2018] adopts a user memory network to enhance the expressiveness of the sequential model.
In this section, we introduce the Deep Session Interest Network (DSIN) in detail. We first introduce the basic deep CTR model named BaseModel, then the technical designs of DSIN that model the extraction and interaction of users’ session interests.
In this section, we mainly introduce feature representation, embedding, MLP and loss function in BaseModel.
Informative features count a great deal in the CTR prediction task. Overall, we use three groups of features in BaseModel: User Profile, Item Profile and User Behavior. Each group consists of some sparse features: User Profile contains gender, city, etc.; Item Profile contains seller id, brand id, etc.; User Behavior contains the item ids of items that the user recently clicked on. Note that the side information of the item can be concatenated to represent itself.
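Embedding is a common technique that transforms large-scale sparse features into low-dimensional dense vectors. Mathematically, each sparse feature can be represented by an embedding matrix $\mathbf{E} \in \mathbb{R}^{M \times d_{model}}$, where $M$ is the size of the sparse feature and $d_{model}$ is the embedding size. With embedding, User Profile can be represented by $\mathbf{X}^U \in \mathbb{R}^{N_u \times d_{model}}$, where $N_u$ is the number of sparse features of User Profile. Item Profile can be represented by $\mathbf{X}^I \in \mathbb{R}^{N_i \times d_{model}}$, where $N_i$ is the number of sparse features of Item Profile. User Behavior can be represented by $\mathbf{S} = [\mathbf{b}_1; \ldots; \mathbf{b}_i; \ldots; \mathbf{b}_N] \in \mathbb{R}^{N \times d_{model}}$, where $N$ is the number of users’ historical behaviors and $\mathbf{b}_i$ is the embedding of the $i$-th behavior.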
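The negative log-likelihood function is widely used in CTR models, and is usually defined as:
$$L = -\frac{1}{N} \sum_{(x, y) \in \mathbb{D}} \bigl( y \log p(x) + (1 - y) \log (1 - p(x)) \bigr)$$
where $\mathbb{D}$ is the training dataset of size $N$, $x = [\mathbf{X}^U, \mathbf{X}^I, \mathbf{S}]$ (the embeddings of User Profile, Item Profile and User Behavior) is the input of the network, $y \in \{0, 1\}$ represents whether the user clicked the item and $p(\cdot)$ is the final output of the network, which represents the predicted probability that the user clicks the item.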
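As a minimal illustration of this loss, the sketch below computes the mean negative log-likelihood over a batch of click labels and predicted click probabilities. The function name `negative_log_likelihood` and the clipping constant `eps` are illustrative assumptions and are not taken from the DSIN code.

```python
import math


def negative_log_likelihood(labels, probs, eps=1e-12):
    """Mean negative log-likelihood of binary click labels.

    labels: iterable of 0/1 click indicators (y).
    probs:  iterable of predicted click probabilities (p(x)).
    eps:    hypothetical clipping constant that avoids log(0).
    """
    labels = list(labels)
    probs = list(probs)
    if not labels or len(labels) != len(probs):
        raise ValueError("labels and probs must be non-empty and equally long")
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)


# A model that predicts 0.5 everywhere has loss log(2).
example_loss = negative_log_likelihood([1, 0], [0.5, 0.5])
```

A more confident correct prediction lowers the loss, which is the behavior the training objective rewards.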
In recommender systems, users’ behavior sequences consist of multiple historical sessions. Users show varying interests in different sessions. Also, users’ session interests are sequentially related to each other. DSIN is proposed to extract users’ session interest in each session and capture the sequential relationship of session interests.
As shown in Figure 2, DSIN consists of two parts before the MLP. One is the embedding vectors transformed from User Profile and Item Profile. The other models User Behavior and has four layers from the bottom up: (1) the session division layer partitions users’ behavior sequence into sessions; (2) the session interest extractor layer extracts users’ session interests; (3) the session interest interacting layer captures the sequential relationship among session interests; (4) the session interest activating layer applies the local activation unit to users’ session interests w.r.t. the target item. Finally, the outputs of the session interest activating layer and the embedding vectors of User Profile and Item Profile are fed into the MLP for the final prediction. We introduce these four layers in detail below.
To extract users’ session interests more precisely, we divide users’ behavior sequences $\mathbf{S}$ into sessions $\mathbf{Q}$, where the $k$-th session $\mathbf{Q}_k = [\mathbf{b}_1; \ldots; \mathbf{b}_i; \ldots; \mathbf{b}_T] \in \mathbb{R}^{T \times d_{model}}$, $T$ is the number of behaviors we keep in the session and $\mathbf{b}_i$ is users’ $i$-th behavior in the session. Following [Grbovic and Cheng, 2018], a session boundary is placed between adjacent behaviors whose time interval is more than 30 minutes.
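The session division rule can be sketched as follows. This is a minimal illustration, assuming behaviors arrive as `(timestamp_seconds, item_id)` pairs sorted by time; the function name `split_into_sessions`, the `max_len` parameter and the choice to keep the most recent behaviors of an over-long session are assumptions made for the example.

```python
def split_into_sessions(behaviors, gap_seconds=30 * 60, max_len=None):
    """Split a time-ordered behavior sequence into sessions.

    behaviors:   list of (timestamp_seconds, item_id), sorted by timestamp.
    gap_seconds: a new session starts when the gap between adjacent
                 behaviors is strictly greater than this value.
    max_len:     optional cap on behaviors kept per session (assumption:
                 the most recent ones are kept).
    """
    sessions = []
    current = []
    last_ts = None
    for ts, item in behaviors:
        if last_ts is not None and ts - last_ts > gap_seconds:
            sessions.append(current)
            current = []
        current.append(item)
        last_ts = ts
    if current:
        sessions.append(current)
    if max_len is not None:
        sessions = [s[-max_len:] for s in sessions]
    return sessions


# Hypothetical click log: two clicks, a 31-minute pause, two more clicks.
log = [(0, "trousers_1"), (60, "trousers_2"),
       (60 + 31 * 60, "ring_1"), (60 + 31 * 60 + 20, "ring_2")]
sessions = split_into_sessions(log)
```

With the hypothetical log above, the 31-minute pause yields two sessions, one per browsing intent.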
Behaviors in the same session are strongly related to each other. Besides, users’ casual behaviors in the session make the session interest deviate from its original expression. To capture the inner relationship between behaviors in the same session and decrease the effect of those unrelated behaviors, we employ the multi-head self-attention mechanism [Vaswani et al., 2017] in each session. We also make some improvements to the self-attention mechanism to better achieve our goal.
To make use of the order relations of the sequence, the self-attention mechanism applies positional encoding to the input embeddings. Furthermore, the order relations of sessions and the bias existing in different representation subspaces also need to be captured. Thus, we propose bias encoding $\mathbf{BE} \in \mathbb{R}^{K \times T \times d_{model}}$ on the basis of positional encoding, where $K$ is the number of sessions and each element in $\mathbf{BE}$ is defined as follows:
$$\mathbf{BE}_{(k,t,c)} = \mathbf{w}^K_k + \mathbf{w}^T_t + \mathbf{w}^C_c$$
where $\mathbf{w}^K \in \mathbb{R}^{K}$ is the bias vector of the session, $k$ is the index of sessions, $\mathbf{w}^T \in \mathbb{R}^{T}$ is the bias vector of the position in the session, $t$ is the index of the behavior in sessions, $\mathbf{w}^C \in \mathbb{R}^{d_{model}}$ is the bias vector of the unit position in the behavior embedding and $c$ is the index of the unit in the behavior embedding. After bias encoding is added, users’ behavior sessions $\mathbf{Q}$ are updated as follows:
$$\mathbf{Q} = \mathbf{Q} + \mathbf{BE}$$
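Bias encoding sums three bias vectors, one indexed by session, one by position within the session and one by unit of the embedding. A minimal NumPy sketch follows, assuming sessions are stored as a tensor of shape (sessions, behaviors per session, embedding size); the sizes and the names `bias_encoding`, `w_session`, `w_position` and `w_unit` are hypothetical, and in a trained model the three vectors would be learned parameters rather than random draws.

```python
import numpy as np


def bias_encoding(w_session, w_position, w_unit):
    """Build the bias tensor BE[k, t, c] = w_session[k] + w_position[t] + w_unit[c].

    Broadcasting expands the three 1-D vectors to shape (K, T, d_model).
    """
    return (w_session[:, None, None]
            + w_position[None, :, None]
            + w_unit[None, None, :])


# Hypothetical sizes: 3 sessions, 4 behaviors per session, embedding size 8.
K, T, d_model = 3, 4, 8
rng = np.random.default_rng(0)
w_session = rng.normal(size=K)
w_position = rng.normal(size=T)
w_unit = rng.normal(size=d_model)

BE = bias_encoding(w_session, w_position, w_unit)
Q = rng.normal(size=(K, T, d_model))   # stand-in for embedded sessions
Q_biased = Q + BE                      # sessions after adding bias encoding
```

The encoding needs only K + T + d_model parameters instead of a full K × T × d_model table.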
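In recommender systems, users’ click behaviors are influenced by various factors (e.g. colors, styles and price) [Zhou et al., 2018a]. Multi-head self-attention can capture relationships in different representation subspaces. Mathematically, let $\mathbf{Q}_k = [\mathbf{Q}_{k1}; \ldots; \mathbf{Q}_{kh}; \ldots; \mathbf{Q}_{kH}]$, where $\mathbf{Q}_{kh} \in \mathbb{R}^{T \times d_h}$ is the $h$-th head of $\mathbf{Q}_k$, $H$ is the number of heads and $d_h = \frac{1}{H} d_{model}$. The output of $\mathrm{head}_h$ is calculated as follows:
$$\mathrm{head}_h = \mathrm{Attention}(\mathbf{Q}_{kh}\mathbf{W}^Q, \mathbf{Q}_{kh}\mathbf{W}^K, \mathbf{Q}_{kh}\mathbf{W}^V) = \mathrm{softmax}\left(\frac{\mathbf{Q}_{kh}\mathbf{W}^Q {\mathbf{W}^K}^{\top} \mathbf{Q}_{kh}^{\top}}{\sqrt{d_{model}}}\right) \mathbf{Q}_{kh}\mathbf{W}^V$$
where $\mathbf{W}^Q$, $\mathbf{W}^K$, $\mathbf{W}^V$ are linear matrices. Then the vectors of different heads are concatenated and fed into a feed-forward network:
$$\mathbf{I}^Q_k = \mathrm{FFN}(\mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_H)\mathbf{W}^O)$$
where $\mathrm{FFN}(\cdot)$ is the feed-forward network and $\mathbf{W}^O$ is the linear matrix. We also apply residual connections and layer normalization successively. Users’ $k$-th session interest $\mathbf{I}_k$ is calculated as follows:
$$\mathbf{I}_k = \mathrm{Avg}(\mathbf{I}^Q_k)$$
where $\mathrm{Avg}(\cdot)$ is the average pooling. Note that weights are shared in the self-attention mechanism of different sessions.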
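The session interest extraction can be sketched in NumPy as multi-head self-attention over one session followed by average pooling. This is a simplified illustration under stated assumptions: each head has its own projection matrices, scores are scaled by the square root of the embedding size, and the feed-forward network, residual connections and layer normalization are omitted. The function and variable names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def session_interest(Q_k, W_q, W_k, W_v, W_o):
    """Pool one session into a single interest vector.

    Q_k:            (T, d_model) embedded behaviors of one session.
    W_q, W_k, W_v:  (H, d_h, d_h) per-head projections (assumption).
    W_o:            (d_model, d_model) output projection.
    Returns the pooled interest (d_model,) and attention weights (H, T, T).
    """
    T, d_model = Q_k.shape
    H, d_h, _ = W_q.shape
    if H * d_h != d_model:
        raise ValueError("d_model must equal H * d_h")
    # Split the embedding dimension into H heads: (H, T, d_h).
    heads = Q_k.reshape(T, H, d_h).transpose(1, 0, 2)
    q, k, v = heads @ W_q, heads @ W_k, heads @ W_v
    # Scaled dot-product scores between behaviors of the same session.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_model)
    weights = softmax(scores, axis=-1)
    out = weights @ v                                   # (H, T, d_h)
    concat = out.transpose(1, 0, 2).reshape(T, d_model)
    projected = concat @ W_o                            # FFN etc. omitted
    return projected.mean(axis=0), weights              # average pooling


# Hypothetical sizes: 4 behaviors, embedding size 8, 2 heads.
T, d_model, H = 4, 8, 2
d_h = d_model // H
rng = np.random.default_rng(0)
Q_k = rng.normal(size=(T, d_model))
W_q, W_k, W_v = (rng.normal(size=(H, d_h, d_h)) for _ in range(3))
W_o = rng.normal(size=(d_model, d_model))
I_k, attn = session_interest(Q_k, W_q, W_k, W_v, W_o)
```

Because weights are shared across sessions, the same matrices would be reused for every session of a user.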
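Users’ session interests hold sequential relations with contextual ones. Modeling the dynamic changes enriches the representation of the session interests. Bi-LSTM [Graves and Schmidhuber, 2005] is excellent at capturing sequential relations and is naturally applied to model the interaction of session interests in DSIN. The LSTM [Hochreiter and Schmidhuber, 1997] memory cell is implemented as follows:
$$\mathbf{i}_t = \sigma(\mathbf{W}_{xi}\mathbf{I}_t + \mathbf{W}_{hi}\mathbf{h}_{t-1} + \mathbf{W}_{ci}\mathbf{c}_{t-1} + \mathbf{b}_i)$$
$$\mathbf{f}_t = \sigma(\mathbf{W}_{xf}\mathbf{I}_t + \mathbf{W}_{hf}\mathbf{h}_{t-1} + \mathbf{W}_{cf}\mathbf{c}_{t-1} + \mathbf{b}_f)$$
$$\mathbf{c}_t = \mathbf{f}_t \mathbf{c}_{t-1} + \mathbf{i}_t \tanh(\mathbf{W}_{xc}\mathbf{I}_t + \mathbf{W}_{hc}\mathbf{h}_{t-1} + \mathbf{b}_c)$$
$$\mathbf{o}_t = \sigma(\mathbf{W}_{xo}\mathbf{I}_t + \mathbf{W}_{ho}\mathbf{h}_{t-1} + \mathbf{W}_{co}\mathbf{c}_t + \mathbf{b}_o)$$
$$\mathbf{h}_t = \mathbf{o}_t \tanh(\mathbf{c}_t)$$
where $\sigma(\cdot)$ is the logistic function, and $\mathbf{i}$, $\mathbf{f}$, $\mathbf{o}$ and $\mathbf{c}$ are the input gate, forget gate, output gate and cell vectors, which have the same size as the session interest $\mathbf{I}_t$. Shapes of weight matrices are indicated by the subscripts. Bi-directional means that there exist forward and backward RNNs, and the hidden states $\mathbf{H}$ are calculated as follows:
$$\mathbf{H}_t = \overrightarrow{\mathbf{h}_{ft}} \oplus \overleftarrow{\mathbf{h}_{bt}}$$
where $\overrightarrow{\mathbf{h}_{ft}}$ is the hidden state of the forward LSTM and $\overleftarrow{\mathbf{h}_{bt}}$ is the hidden state of the backward LSTM.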
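To make the bi-directional pass concrete, the NumPy sketch below runs one LSTM over the session interests from first to last and another from last to first, then merges the two hidden states per session. It is an illustration under stated assumptions, not the authors' implementation: it uses a plain LSTM cell without cell-to-gate (peephole) connections, merges the two directions by concatenation, and all names and sizes are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_pass(inputs, W, U, b):
    """Run a plain LSTM (no peepholes, an assumption) over a sequence.

    inputs: (K, d_in) one vector per session interest.
    W: (d_in, 4 * d_hid), U: (d_hid, 4 * d_hid), b: (4 * d_hid,)
    Returns hidden states of shape (K, d_hid).
    """
    d_hid = U.shape[0]
    h = np.zeros(d_hid)
    c = np.zeros(d_hid)
    states = []
    for x in inputs:
        z = x @ W + h @ U + b
        i, f, o, g = np.split(z, 4)          # gate pre-activations
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)           # update cell state
        h = o * np.tanh(c)                   # new hidden state
        states.append(h)
    return np.stack(states)


def bi_lstm(inputs, fwd_params, bwd_params):
    """Merge forward and backward hidden states by concatenation (assumption)."""
    h_fwd = lstm_pass(inputs, *fwd_params)
    h_bwd = lstm_pass(inputs[::-1], *bwd_params)[::-1]  # realign to input order
    return np.concatenate([h_fwd, h_bwd], axis=-1)


# Hypothetical sizes: 3 session interests of size 4, hidden size 5.
K, d_in, d_hid = 3, 4, 5
rng = np.random.default_rng(0)
interests = rng.normal(size=(K, d_in))


def make_params():
    return (rng.normal(size=(d_in, 4 * d_hid)),
            rng.normal(size=(d_hid, 4 * d_hid)),
            np.zeros(4 * d_hid))


fwd_params, bwd_params = make_params(), make_params()
H_states = bi_lstm(interests, fwd_params, bwd_params)
```

Each row of `H_states` mixes information from earlier sessions (forward half) and later sessions (backward half).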
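Users’ session interests that are more related to the target item have greater impact on whether the user will click the target item. The weights of users’ session interests need to be reallocated w.r.t. the target item. The attention mechanism [Bahdanau et al., 2014] conducts soft alignment between source and target and has been proven effective as a weight allocation mechanism. The adaptive representation of session interests w.r.t. the target item is calculated as follows:
$$a^I_k = \frac{\exp(\mathbf{I}_k \mathbf{W}^I \mathbf{X}^I)}{\sum_{j=1}^{K} \exp(\mathbf{I}_j \mathbf{W}^I \mathbf{X}^I)}, \qquad \mathbf{U}^I = \sum_{k=1}^{K} a^I_k \mathbf{I}_k$$
where $\mathbf{I}_k$ is the $k$-th session interest, $\mathbf{X}^I$ is the embedding of the target item and $\mathbf{W}^I$ has the corresponding shape. Similarly, the adaptive representation of session interests mixed with contextual information w.r.t. the target item is calculated as follows:
$$a^H_k = \frac{\exp(\mathbf{H}_k \mathbf{W}^H \mathbf{X}^I)}{\sum_{j=1}^{K} \exp(\mathbf{H}_j \mathbf{W}^H \mathbf{X}^I)}, \qquad \mathbf{U}^H = \sum_{k=1}^{K} a^H_k \mathbf{H}_k$$
where $\mathbf{H}_k$ is the $k$-th Bi-LSTM hidden state and $\mathbf{W}^H$ has the corresponding shape. The embedding vectors of User Profile and Item Profile, $\mathbf{U}^I$ and $\mathbf{U}^H$ are concatenated, flattened and then fed into the MLP layer.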
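The local activation unit can be sketched as a softmax over bilinear scores between each session interest and the target item, followed by a weighted sum. The NumPy example below is a minimal illustration, assuming the target item embedding has been flattened into a single vector; the names `activate`, `W` and `target` and all sizes are hypothetical.

```python
import numpy as np


def activate(interests, W, target):
    """Weight session interests by their relevance to the target item.

    interests: (K, d) session interests (or Bi-LSTM hidden states).
    W:         (d, d_item) bilinear weight matrix.
    target:    (d_item,) flattened target item embedding (assumption).
    Returns the weighted sum (d,) and the attention weights (K,).
    """
    scores = interests @ W @ target      # one relevance score per session
    scores = scores - scores.max()       # stabilize the exponentials
    weights = np.exp(scores)
    weights = weights / weights.sum()    # softmax over sessions
    return weights @ interests, weights


# Hypothetical sizes: 3 sessions, interest size 4, item embedding size 6.
K, d, d_item = 3, 4, 6
rng = np.random.default_rng(0)
interests = rng.normal(size=(K, d))
W = rng.normal(size=(d, d_item))
target = rng.normal(size=d_item)
U, a = activate(interests, W, target)
```

The same function applies unchanged to the Bi-LSTM hidden states, with a separate weight matrix of matching shape.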
In this section, we first introduce experiment datasets, competitors and evaluation metric. Then we compare our proposed DSIN with competitors and analyse the results. We further discuss the effectiveness of critical technical designs in DSIN empirically.
Advertising Dataset (https://tianchi.aliyun.com/dataset/dataDetail?dataId=56) is a public dataset released by Alimama, an online advertising platform in China. It contains 26 million records from ad display/click logs of 1 million users and 800 thousand ads in 8 days. Logs from the first seven days are used for training and logs from the last day are used for testing. Users’ recent 200 behaviors are also recorded in logs.
To verify the effectiveness of DSIN in real-world industrial applications, we conduct experiments on the recommender dataset of Alibaba. This dataset contains 6 billion display/click logs of 100 million users and 70 million items in 8 days. Logs from the first seven days are used for training and logs from the last day are used for testing. Users’ recent 200 behaviors are also recorded in logs.
YoutubeNet. YoutubeNet [Covington et al., 2016] is a technically designed model which uses users’ watching video sequence for video recommendation in YouTube. It treats users’ historical behaviors equally and utilizes the average pooling operation. We also experiment with YoutubeNet without User Behavior to verify the effectiveness of historical behaviors.
Wide&Deep. Wide&Deep [Cheng et al., 2016] is a CTR model with both memorization and generalization. It contains two parts: a wide model for memorization and a deep model for generalization.
DIN. Deep Interest Network [Zhou et al., 2018c] fully exploits the relationship between users’ historical behaviors and the target item. It uses an attention mechanism to learn the representation of users’ historical behaviors w.r.t. the target item.
DIN-RNN. DIN-RNN has a similar structure to DIN, except that we use the hidden states of Bi-LSTM, which models users’ historical behaviors and learns the contextual relationship.
DIEN. DIEN [Zhou et al., 2018b] extracts latent temporal interests from user behaviors and models the interest evolving process. Auxiliary loss makes hidden states more expressive to represent latent interests and AUGRU models the specific interest evolving processes for different target items.
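AUC (Area Under ROC Curve) reflects the ranking ability of the model. It is defined as follows:
$$\mathrm{AUC} = \frac{1}{|D^{+}|\,|D^{-}|} \sum_{x^{+} \in D^{+}} \sum_{x^{-} \in D^{-}} I\bigl(f(x^{+}) > f(x^{-})\bigr)$$
where $D^{+}$ is the collection of all positive examples, $D^{-}$ is the collection of all negative examples, $f(\cdot)$ is the result of the model’s prediction of the sample $x$ and $I(\cdot)$ is the indicator function.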
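AUC can be read as the fraction of (positive, negative) example pairs in which the positive example receives the higher score. The sketch below computes it by direct pairwise counting, which is quadratic in the number of examples and meant only as an illustration. The function name `pairwise_auc` is hypothetical, and tied pairs are counted as not ranked correctly (a strict comparison), which is an assumption; some libraries count a tie as half a pair.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly.

    labels: iterable of 0/1 click indicators.
    scores: iterable of model predictions, same length as labels.
    Ties are not counted as correct (strict comparison, an assumption).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    correct = sum(1 for p in pos for n in neg if p > n)
    return correct / (len(pos) * len(neg))


# Three of the four positive/negative pairs are ordered correctly.
example_auc = pairwise_auc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.6])
```

For large evaluation sets, a sort-based computation gives the same value in O(n log n) time.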
YoutubeNet-No-User-Behavior: YoutubeNet without User Behavior.
DSIN-PE: DSIN with positional encoding.
DSIN-BE-NO-SIIL: DSIN with bias encoding and without session interest interacting layer and the corresponding activation unit.
DSIN-BE: DSIN with bias encoding.
Results on the advertising dataset and recommender dataset are shown in Table 1. YoutubeNet performs better than YoutubeNet-No-User-Behavior owing to User Behavior, while Wide&Deep gets better results by adding the memorization of the wide part. DIN noticeably improves AUC by activating User Behavior w.r.t. the target item. Notably, the results of DIN-RNN on both datasets are worse than those of DIN due to the discontinuity of users’ behavior sequences. DIEN obtains better results, although the auxiliary loss and the specially designed AUGRU cause deviation from the original expression of behaviors. DSIN gets the best results on both datasets. It extracts users’ historical behaviors into session interests and models the dynamic evolving procedure of session interests, both of which enrich the representation of the user. The local activation unit helps obtain the adaptive representation of users’ session interests w.r.t. the target item.
As shown in Table 1, DIN-RNN performs worse than DIN, while DSIN-BE performs better than DSIN-BE-NO-SIIL. The only difference within each pair is the sequence modeling. [Zhou et al., 2018c] explains that rapid jumping and sudden ending over behaviors make the sequence data of user behaviors noisy. This leads to information loss as information is transmitted through RNNs and further confuses the representation of users’ behavior sequences. In DSIN, by contrast, we partition users’ behavior sequences into multiple sessions for the following two reasons: (i) users’ behaviors are generally homogeneous in each session; (ii) users’ session interests follow a sequential pattern and are more suitable for sequence modeling. Both improve the performance of DSIN.
As shown in Table 1, we conduct comparative experiments with DSIN-BE and DSIN-BE-NO-SIIL, where DSIN-BE performs better. With the session interest interacting layer, users’ session interests are mixed with contextual information and become more expressive, which improves the performance of DSIN.
As shown in Table 1, we conduct comparative experiments with DSIN-BE and DSIN-PE, where DSIN-BE performs better. Different from the two-dimensional positional encoding, the bias of users’ sessions is also captured. Empirically, bias encoding successfully captures the order information of sessions and improves the performance of DSIN.
As shown in Figure 3, we visualize the attention weights in the local activation unit and the self-attention mechanism. To illustrate the effect of self-attention, we take the first session as an example. The user mainly browses trouser-related items and occasionally coat-related items. We can observe that the weights of trouser-related items are generally high. After self-attention, most representations of trouser-related behaviors are preserved and extracted into the user’s interest in this session. Besides, the local activation unit works by making users’ session interests related to the target item more prominent. In this case, the target item is a pair of black trousers. The user’s trouser-related session interest is assigned greater weight and has more influence on the final prediction. Although session 3 is coat-related, the user’s preference for the color black in it is also helpful for ranking the trousers.
In this paper, we provide a novel perspective on the CTR prediction task, where users’ sequential behaviors consist of multiple historical sessions. User behaviors are highly homogeneous within each session and heterogeneous across sessions. Based on these observations, Deep Session Interest Network (DSIN) is proposed. We first use the self-attention mechanism with bias encoding to extract the user’s interest in each session. Then we apply Bi-LSTM to capture the sequential relation of contextual session interests. Finally, we employ the local activation unit to aggregate the user’s different session interest representations with regard to the target item. Experiment results demonstrate the effectiveness of DSIN on both advertising and recommender datasets. In the future, we will pay attention to utilizing knowledge graphs as prior knowledge to explain users’ historical behaviors for better CTR prediction.
Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, pages 7–10. ACM, 2016.
Parallel recurrent neural network architectures for feature-rich session-based recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, pages 241–248. ACM, 2016.
Thirty-Second AAAI Conference on Artificial Intelligence, 2018.