Déjà vu: A Contextualized Temporal Attention Mechanism for Sequential Recommendation

by Jibang Wu, et al.
University of Virginia

Predicting users' preferences based on their sequential behaviors in history is challenging and crucial for modern recommender systems. Most existing sequential recommendation algorithms focus on transitional structure among the sequential actions, but largely ignore the temporal and context information, when modeling the influence of a historical event to current prediction. In this paper, we argue that the influence from the past events on a user's current action should vary over the course of time and under different context. Thus, we propose a Contextualized Temporal Attention Mechanism that learns to weigh historical actions' influence on not only what action it is, but also when and how the action took place. More specifically, to dynamically calibrate the relative input dependence from the self-attention mechanism, we deploy multiple parameterized kernel functions to learn various temporal dynamics, and then use the context information to determine which of these reweighing kernels to follow for each input. In empirical evaluations on two large public recommendation datasets, our model consistently outperformed an extensive set of state-of-the-art sequential recommendation methods.




1. Introduction

The quality of recommendation results is one of the most critical factors to the success of online service platforms, with the growth objectives including user satisfaction, click- or view-through rate in production. Designed to propose a set of relevant items to its users, a recommender system faces dynamically evolving user interests over the course of time and under various context. For instance, it is vital to distinguish when the history happened (e.g. a month ago or in the last few minutes) as well as to evaluate the context information (e.g. under a casual browsing or some serious examining setting), especially on how serious the user is about the click, and how related his/her preference is to this particular event.

Concerning such sequential dependence within user preferences, the task of sequential recommendation is set to predict the ongoing relevant items based on a sequence of the user’s historical actions. Such setting has been widely studied (hidasi2015session; quadrana2017personalizing; you2019hierarchical; cui2017hierarchical; kang2018self; Tang:2018:PTS:3159652.3159656; chen2018sequential; cai2018modeling) and practiced in popular industry recommender systems such as YouTube (tang2019towards; beutel2018latent) and Taobao (sun2019bert4rec). Take the online shopping scenario illustrated in Figure 1 for example: the system is given a series of user behavior records and needs to recommend the next set of items for the user to examine. We should note in this setting we do not have assumptions about how the historical actions are generated: solely from interaction between the user and the recommender system, or a mix of users’ querying and browsing activities. But we do assume the actions are not independent from each other. This better reflects the situation where only offline data and partial records of user behaviours are accessible by a recommender system.

Figure 1. Sequential recommendation in online shopping scenario (up) from the traditional view, (down) from our view with temporal and contextual segmentations.

One major challenge to the sequential recommendation task is that the influence patterns from different segments of history reflect user interests in different ways, as is exemplified in Figure 1:

  • By temporal segment: The distant history indicates that the user is interested in shopping sports related products. Now that he or she is looking for a watch, the system could have recommended some sports watches instead of generic ones. Essentially, the distant and prolonged user history could carry sparse yet crucial information of user preferences in general, while the more recent interactions should more closely represent the user intention in near future.

  • By contextual segment: Since the user closely examined several smartphone options (with much shorter time intervals in between than the average), these interaction events could be emphasized for estimating current user preference such that smartwatches might be preferred over traditional watches. In general, some periods of user browsing log could appear to be heterogeneous, packed with exploration insights, while at a certain point, the user would concentrate on a small subset of homogeneous items, in a repetitive or exploitative way.

Hence, designs that capture and connect these different signals from each part of the history have driven recent progress in sequential recommendation algorithms.

Traditionally, the notion of a session is introduced in modeling sequential user behaviors as a way to segment a sequence of actions into periods of active and idle engagement. It is shown in (hidasi2015session; quadrana2017personalizing) that the pattern of user preference transition usually differs after a session gap, which is commonly defined as a minimum of 30 minutes of inactivity. Prior work has demonstrated many effective hierarchical neural network structures, such as hierarchical recurrent neural networks (RNN) (quadrana2017personalizing; cui2017hierarchical; you2019hierarchical), to jointly model the transition patterns within and across sessions. However, the strength derived from the session assumption could also be the bottleneck of session-based recommendation: user preference does not necessarily shift in strict accordance with manually defined session boundaries.
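As a point of reference for the session assumption discussed above, the following is a minimal sketch of gap-based session segmentation. The function name, the use of seconds as the time unit, and the example timestamps are illustrative assumptions, not part of the cited work.

```python
from typing import List

SESSION_GAP_SECONDS = 30 * 60  # the commonly used 30-minute inactivity threshold


def split_into_sessions(timestamps: List[float],
                        gap: float = SESSION_GAP_SECONDS) -> List[List[float]]:
    """Split chronologically ordered timestamps (in seconds) into sessions.

    A new session starts whenever the idle time since the previous
    event is at least `gap`.
    """
    sessions: List[List[float]] = []
    for t in timestamps:
        if sessions and t - sessions[-1][-1] < gap:
            sessions[-1].append(t)   # still within the current session
        else:
            sessions.append([t])     # idle gap reached: open a new session
    return sessions


events = [0, 60, 120, 4000, 4100, 9000]
print(split_into_sessions(events))  # [[0, 60, 120], [4000, 4100], [9000]]
```

The hard threshold is exactly what makes this segmentation binary: a gap of 29 minutes and a gap of 31 minutes are treated as categorically different.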

Figure 2. Histogram of time intervals between successive events on two datasets, UserBehavior (left), XING (right).

In this paper, we argue that the transition patterns could widely differ due to the subtle variance within the temporal proximity between neighboring events, associated with its changing context. Specifically, the time and context of each historical event would support fine-grained interpretation and high-fidelity replay of the sequential behavior history for a more accurate portrait of current user preference. This claim is further supported by preliminary statistics from the two large datasets later used in our evaluation: the mixture-of-Gaussians shape of the time interval distribution in Figure 2 indicates that a binary interpretation of a time gap as within- or cross-session is not accurate enough. Therefore, our proposed model adaptively weighs the historical influences with regard to the user's drifting impressions from previous interactions over time and under its inferred contextual condition.

Traditional RNN-based approaches leave little room for one to dynamically adjust the historical influences at current state. One earlier work, known as Time-LSTM (zhu2017next), proposed several variants of time gate structure to model the variable time intervals as part of the recurrent state transition. But this assumes that the temporal influence only takes effect for once during transition and is fixed regardless of context or future events. Thereby, in order to model the influence evolving temporally and contextually, we appeal to the attention based sequence models, which emphasize dependencies directly on each sequential input rather than relying on the recurrent state transition in RNN models.

In sequential recommendation, a line of work (kang2018self; tang2019towards; sun2019bert4rec) has borrowed the state-of-the-art self-attention network structure from natural language modeling (devlin2018bert; vaswani2017attention). tang2019towards show that its attention component can enhance the model capacity in determining dependence over an extensively long sequence of history. Nevertheless, Kang and McAuley (kang2018self) report that modeling the action order in long sequences of user interaction history does little to boost empirical performance on several recommendation datasets, even though the position embedding technique was proposed for the self-attention mechanism in its original paper (vaswani2017attention). In other words, there is no explicit ordering of inputs or segments of history modeled by the self-attention mechanism. This presents an opportunity to model temporal and context information as a more informative and flexible order representation that complements the existing attention mechanism, bridging the insights from both sides of the work in sequential recommendation. The challenge is that incorporating this information could contribute more noise than signal unless it is properly structured by the model.

We propose the Contextualized Temporal Attention Mechanism (CTA), an attention based sequential neural architecture that draws dependencies among the historical interactions not only through event correlation but also jointly on temporal and context information for sequential behavior modeling. In this mechanism, we weigh the historical influence for each historical action at current prediction following the three design questions:

  1. What is the action? The dependency is initially based on the action correlation through the self-attention mechanism, i.e., how such an action is co-related to the current state in the sequence.

  2. When did it happen? The influence is also weighed by its temporal proximity to the predicted action, since the temporal dynamics should also play an important role in determining the strength of its connection to the present.

  3. How did it happen? The temporal weighing factor is realized as a mixture of the output each from a distinct parameterized kernel function that maps the input of time gaps onto a specific context of temporal dynamics. And the proportion of such mixture is determined by the contextual factors, inferred from the surrounding actions. In this way, the influence of a historical action would follow different temporal dynamics under different contextual conditions.

We apply the model to both the XING (https://github.com/recsyschallenge/2016/blob/master/TrainingDataset.md) and UserBehavior (https://tianchi.aliyun.com/dataset/dataDetail?dataId=649) datasets, each with millions of interaction events including user, item and timestamp records. The empirical results on both datasets show that our model improves recommendation performance, compared with a selection of state-of-the-art approaches. We also conducted extensive ablation studies as well as visualizations to analyze our model design and understand its advantages and limitations.

2. Related Work

Several lines of existing research are closely related to ours in this paper, and their insights largely inspired our model design. In this section, we briefly introduce some key work to provide the context of our work.

2.1. Sequential Recommendation

For the problem of sequential recommendation, the scope was initially confined to time-based sessions. Recurrent Neural Networks (RNN) and their variants, including Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU) (chung2014empirical), have become a common choice for session-based recommendation (Zhang:2014:SCP:2893873.2894086; hidasi2015session). Other methods based on Convolutional Neural Networks (CNN) (Tang:2018:PTS:3159652.3159656), Memory Networks (chen2018sequential) and Attention Models (li2017neural) have also been explored. The hierarchical structure generalized from RNN, Attention or CNN based models (cui2017hierarchical; quadrana2017personalizing; you2019hierarchical) is used to model transitions between and within sessions. The recent work by You et al. (you2019hierarchical) showed that using a Temporal Convolutional Network to encode and decode session-level information and a GRU for user-level transition is the most effective hierarchical structure. Nevertheless, as many studies borrow sequence models directly from natural language modeling tasks, their model performance is usually limited by the relatively small size and sparse patterns of user behavior data, compared to natural language datasets.

The attention mechanism was first introduced by bahdanau2014neural. The original structure is constructed on the hidden states generated from an RNN in order to better capture long-term dependence and align the output for the decoder. The Transformer model (vaswani2017attention) and several follow-up works (devlin2018bert; dai2019transformer; radford2018language) showed that for many NLP tasks, the sequence-to-sequence network structure based on attention alone, a.k.a. the self-attention mechanism, is able to outperform existing RNN structures in both accuracy and computational complexity on long sequences. Motivated by this unique advantage of self-attention, several studies introduced this mechanism to sequential recommendation. SASRec (kang2018self), based on the self-attention mechanism, demonstrated promising results in modeling longer user sequences without the session assumption. Another work, known as the Multi-temporal range Mixture Model (M3) (tang2019towards), combines attention and RNN models in a hybrid to capture long-range dependencies in user sequences. The most recent work, BERT4Rec (sun2019bert4rec), adopts the bidirectional training objective via the Cloze task and further improves performance over SASRec.

2.2. Temporal Recommendation

Temporal recommendation specifically studies the temporal evolution of user preferences and items, and methods using matrix factorization have shown strong performance. TimeSVD++ (koren2009collaborative) achieved strong results by splitting time into several bins of segments and modeling users and items separately in each. Bayesian Probabilistic Tensor Factorization (BPTF) (xiong2010temporal) is proposed to include time as a special constraint on the time dimension for the tensor factorization problem. Many of these solutions (xiong2010temporal; song2016multi; li2014modeling) in temporal recommendation share the insight of modeling the long-term static and short-term dynamic user preferences separately. Nevertheless, none of these models were developed specifically for sequential recommendation.

There have been various efforts to utilize temporal information in existing deep recommendation models. li2017time proposed methods to learn a time-dependent representation as input to an RNN by contextualizing event embeddings with a time mask or event-time joint embeddings. zhu2017next proposed several variants of the time gate structure to model the variable time interval as part of the recurrent state transition. But the empirical results of both models show limited improvement compared to their LSTM baseline without temporal information.

Meanwhile, time series analysis is a well established research area with broad application in real world problems (das1994time; scharf1991statistical). The Hawkes process (hawkes1971spectra; laub2015hawkes) is one of the powerful tools for modeling and predicting temporal events. It models the intensity of events occurring at moment $t$, conditioned on the observation of historical events. Some recent works (vassoy2019time; xiao2017modeling; du2016recurrent) attempt to use an RNN to model the intensity function of a point process model and predict the time of the next action. As a typical example, the Neural Hawkes Process (mei2017neural) constructs a neurally self-modulating multivariate point process in an LSTM, such that the values of LSTM cells decay exponentially until being updated when a new event occurs. Their model is designed to have better expressivity for complex temporal patterns and achieves better performance compared to the vanilla Hawkes process. The Long- and Short-Term Hawkes Process model (cai2018modeling) demonstrates that a combination of Hawkes process models for different segments of user history can improve the performance in predicting the type and time of the next action in sequential online interactive behavior modeling. However, most of these Hawkes process based algorithms model each typed event as a separate stochastic process and therefore cannot scale as the space of event types grows.

3. Method

In this section, we discuss the details of our proposed Contextualized Temporal Attention Mechanism (CTA) for sequential recommendation. We will first provide a high-level overview of the proposed model, and then zoom into each of its components for temporal and context modeling.

3.1. Problem Setup & Model Overview

We consider the sequential recommendation problem with temporal information. Denote the item space as $\mathcal{V}$ of size $N$, and the user space as $\mathcal{U}$ of size $M$. The model is given a set of user behavior sequences $\mathcal{C} = \{S^1, S^2, \ldots, S^M\}$ as input. Each $S^u = \{(t^u_1, s^u_1), (t^u_2, s^u_2), \ldots\}$ is a sequence of time-item tuples, where $t^u_i$ is the timestamp when item $s^u_i$ is accessed by user $u$, and the action sequence is chronologically ordered, i.e., $t^u_1 \le t^u_2 \le \cdots$. The interacted item is represented as a one-hot vector $s^u_i \in \{0,1\}^{N}$ and the timestamp is a real-valued scalar $t^u_i \in \mathbb{R}^+$. The recommendation task is to select a list of items $V \subseteq \mathcal{V}$ for each user $u$ at a given time $t$ with respect to $S^u$, such that $V$ best matches user $u$'s interest at the moment.

In this section, we introduce each part of our CTA model at a high level in a bottom-up manner: from the inputs, through the three-stage pipeline of content-based attention, temporal kernels and contextualized mixture, denoted as the $\alpha$, $\beta$ and $\gamma$ stages as illustrated in Figure 3, and finally to the output.

Figure 3. The architecture of our proposed Contextualized Temporal Attention Mechanism. Three stages are proposed to capture the content information at stage $\alpha$ with self-attention, temporal information at stage $\beta$ with multiple kernels, and contextual information at stage $\gamma$ with recurrent states, for sequential recommendation.

Model architecture

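The raw input consists of the user's historical events within a window of size $L$, given as item and time pairs $\{(t_i, s_i)\}_{i=1}^{L}$, as well as the timestamp at the moment of recommendation, $t_{L+1}$. The sequence of input items is mapped into the embedding space with the input item embeddings $E_{in} \in \mathbb{R}^{N \times d_{in}}$: stacking the one-hot vectors as rows, $X = [s_1; \ldots; s_L]\, E_{in} \in \mathbb{R}^{L \times d_{in}}$. We also transform the sequence of timestamps into the intervals between each action and the current prediction time: $T = [t_{L+1} - t_1, \ldots, t_{L+1} - t_L]$.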
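The interval transformation described above can be sketched in a few lines. This is a minimal illustration; the function name and the validation check are assumptions added here, not details from the paper.

```python
from typing import List


def time_intervals(timestamps: List[float], t_predict: float) -> List[float]:
    """Return the gap between each historical event and the prediction time."""
    if any(t > t_predict for t in timestamps):
        raise ValueError("historical events must not occur after the prediction time")
    return [t_predict - t for t in timestamps]


history = [100.0, 160.0, 400.0]               # event timestamps, oldest first
print(time_intervals(history, t_predict=460.0))  # [360.0, 300.0, 60.0]
```

Measuring every gap against the prediction time, rather than against the neighboring event, means the same history yields different intervals each time a recommendation is requested.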

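Motivated by our earlier analysis, we design a three-stage mechanism, namely $\mathcal{M}^\alpha$, $\mathcal{M}^\beta$ and $\mathcal{M}^\gamma$, on top of the processed inputs $X$ and $T$, to model dependencies among the historical interactions respectively on their content, temporal, and context information:

$$\alpha = \mathcal{M}^\alpha(X), \qquad \beta = \mathcal{M}^\beta(T), \qquad \gamma = \mathcal{M}^\gamma(X, \alpha, \beta)$$

In essence, $\mathcal{M}^\alpha$ weighs the influence of each input purely on content $X$ and outputs a scalar importance score for each event in the sequence, $\alpha \in \mathbb{R}^{L}$; $\mathcal{M}^\beta$ transforms the temporal data $T$ through $K$ temporal kernels for the temporal weighing of each input, $\beta \in \mathbb{R}^{L \times K}$; $\mathcal{M}^\gamma$ extracts the context information from $X$, with which it mixes the factors $\alpha$ and $\beta$ from the previous stages into the contextualized temporal importance score $\gamma \in \mathbb{R}^{L}$. We explain their individual architectures in detail in Section 3.2.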

In the end, our model computes the row sum of the input item sequence embedding $X$ weighted by $\gamma$ (through the softmax layer, the weights $\gamma$ sum up to 1). This weighted-sum design is borrowed from the attention mechanism in the sense of taking an expectation over a probability distribution, $\sum_{i=1}^{L} \gamma_i X_i$. The representation is then projected to the output embedding space with a feed-forward layer $F^{out}$:

$$\hat{x}_{L+1} = F^{out}\Big(\sum_{i=1}^{L} \gamma_i X_i\Big)$$

We consider $\hat{x}_{L+1}$ as the predicted representation of the recommended item. We define the matrix $E_{out} \in \mathbb{R}^{N \times d_{out}}$, whose $i$th row vector is item $i$'s representation in the output embedding space. Then the model can compute the similarity $r_i$ between item $i$ and the predicted representation through the inner product (or any other similarity scoring function):

$$r_i = E_{out,i} \cdot \hat{x}_{L+1}$$

For a given user, the item similarity scores are then normalized by a softmax layer, which yields a probability distribution over the item vocabulary. After training the model, the recommendation for a user at step $L+1$ is served by retrieving a list of items with the highest scores $r_i$ among all $i \in \mathcal{V}$.
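The output step described above (attention-weighted sum of the input embeddings, inner-product scoring, top-k retrieval) can be sketched as follows. This is a simplified illustration in plain Python: the feed-forward projection is omitted (treated as the identity), and all function names and the toy vectors are assumptions.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights: List[float], rows: List[List[float]]) -> List[float]:
    """Sum the embedding rows, each scaled by its attention weight."""
    dim = len(rows[0])
    return [sum(w * row[k] for w, row in zip(weights, rows)) for k in range(dim)]


def top_k_items(pred: List[float], item_matrix: List[List[float]], k: int) -> List[int]:
    """Rank items by inner product with the predicted representation."""
    scores = [sum(p * e for p, e in zip(pred, row)) for row in item_matrix]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


X = [[1.0, 0.0], [0.0, 1.0]]            # embeddings of two historical items
weights = softmax([0.0, 0.0])            # equal importance: [0.5, 0.5]
pred = weighted_sum(weights, X)          # [0.5, 0.5]
E_out = [[1.0, 0.0], [1.0, 1.0], [-1.0, 0.0]]
print(top_k_items(pred, E_out, k=2))     # [1, 0]
```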

3.2. Three Stage Weighing Pipeline

3.2.1. $\alpha$ stage, what is the action:

The goal of the $\alpha$ stage is to obtain the content-based importance score $\alpha$ for the input sequence $X$. Following the promising results of prior self-attentive models, we adopt the self-attention mechanism to efficiently and effectively capture the content correlation with long-term dependence. In addition, the self-attention mechanism allows us to directly define the importance score over each input, in contrast to the recurrent network structure.

We use the encoder mode of the self-attention mechanism to transform the input sequence embedding $X$, through a stack of $d_l$ self-attentive encoder blocks with $h$ heads and $d_a$ hidden units, into the representation $H^{d_l}$, which is the hidden state of the sequence at the last layer. Because the blocks are applied recursively, we use one example to explain the multi-head attention component in our solution. In the $i$th attention head of the $j$th self-attention block, from the input state $H^{j-1}$, we compute a single head of the self-attended sequence representation as

$$z^j_i = \mathrm{Attention}\big(H^{j-1} W^Q_{ij},\; H^{j-1} W^K_{ij},\; H^{j-1} W^V_{ij}\big)$$

where $W^Q_{ij}, W^K_{ij}, W^V_{ij} \in \mathbb{R}^{d_a \times d_a/h}$ are the learnable parameters specific to the $i$th head of the $j$th attention block, used to project the same matrix $H^{j-1}$ into the query $Q$, key $K$, and value $V$ representations as the input to the Scaled Dot-Product (vaswani2017attention):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_a/h}}\right) V$$

Here the scaling factor $\sqrt{d_a/h}$ is introduced to produce a softer attention distribution, avoiding extremely small gradients.
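For readers who want to see the Scaled Dot-Product operation concretely, here is a minimal single-head sketch in plain Python, following the standard definition from vaswani2017attention. Matrices are lists of rows; the function name and the toy inputs are assumptions, and the sketch omits the learned projections, masking, and multi-head stacking.

```python
import math
from typing import List

Matrix = List[List[float]]


def scaled_dot_product_attention(Q: Matrix, K: Matrix, V: Matrix) -> Matrix:
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = len(K[0])
    out: Matrix = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Attention-weighted average of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out


Q = [[0.0, 0.0]]                 # a query equally similar to both keys
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[2.0, 0.0], [0.0, 4.0]]
print(scaled_dot_product_attention(Q, K, V))  # [[1.0, 2.0]]
```

Dividing by the square root of the key dimension keeps the logits in a moderate range, so the softmax does not saturate as the dimension grows.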

All the computed heads in the $j$th attention block are stacked and projected as $Z^j = [z^j_1, \ldots, z^j_h]\, W^O_j$, where $W^O_j \in \mathbb{R}^{d_a \times d_a}$. We can then employ the residual connection (He2015DeepRL) to compute the output of this attention block as:

$$H^j = \mathrm{LN}\big(Z^j + F^j(Z^j)\big)$$

where $F^j$ is a feed-forward layer specific to the $j$th attention block mapping from $\mathbb{R}^{d_a}$ to $\mathbb{R}^{d_a}$ and $\mathrm{LN}$ is the Layer Normalization function (ba2016layer).

Note that for the initial attention block, we use $X$ to serve as the input $H^0$; and in the end, we obtain $H^{d_l}$ as the final output from the self-attention blocks. In prior work (kang2018self; sun2019bert4rec), this $H^{d_l}$ is directly used for prediction. Our usage of the self-attention structure is to determine a reliable content-based importance estimate of each input; hence we compute the Scaled Dot-Product once again, using the last-layer hidden states $H^{d_l}$ projected as the query and the last item input embedding $X_L$ projected as the key, via $W^Q_0$ and $W^K_0$:

$$\alpha = \mathrm{softmax}\left(\frac{(H^{d_l} W^Q_0)(X_L W^K_0)^\top}{\sqrt{d_a}}\right)$$

Note that we can also view this operation as general attention (luong2015effective), i.e., the bilinear product of the last-layer hidden states and the last input item embedding, where $W^Q_0 {W^K_0}^\top$ is the learnable attention weight and $\sqrt{d_a}$ serves as the softmax temperature (hinton2015distilling).

3.2.2. $\beta$ stage, when did it happen:

The goal of the $\beta$ stage is to determine the past events' influence based on their temporal gaps from the current moment of recommendation. The raw time intervals might not be as useful for indicating the actual temporal distance of a historical event's influence (e.g., as perceived by the user), unless we transform them with appropriate kernel functions.

Meanwhile, we incorporate the observation that each event can follow different dynamics in the variation of its temporal distance, given different contextual conditions. The item browsed casually should have its influence to user preference drop sharply for a near term, but it might still be an important indicator of user’s general interest in the long term. In contrast, if the user is seriously examining the item, it is very likely the user would be interested to visit the same or similar ones in a short period of time. Therefore, we create multiple temporal kernels to model the various temporal dynamics and leave it for the context environment to later decide contextualized temporal influence. This design allows more flexibility in weighting the influence of each event with different temporal distances.

In this paper, we handpicked a collection of kernel functions $\psi(\cdot)$ with different shapes, including:

  1. exponential decay kernel, $\psi(T) = a e^{-bT} + c$, assumes that the user's impression of an event fades exponentially but never fades out completely.

  2. logarithmic decay kernel, $\psi(T) = -a \log(1 + bT) + c$, assumes that the user's impression of an event fades more slowly as time goes on and eventually becomes infinitesimal. A later softmax function transforms negative infinity to $0$.

  3. linear decay kernel, $\psi(T) = -aT + c$, assumes that the influence drops linearly; the later softmax operation maps the influence beyond some time limit to 0.

  4. constant kernel, $\psi(T) = 1$, assumes that the influence stays static.

where $a, b, c$ are the corresponding kernel parameters. Note that the above kernels are chosen only for their stability in gradient descent and their well understood properties in analysis. We make no assumption about which kernel is more suitable to reflect the actual temporal dynamics, and an ablation study of different combinations is presented in Section 4.4.2. This mechanism should be compatible with other types of kernel function by design, and it is also possible to inject prior knowledge of the problem by setting kernels with fixed parameters.

Hence, given a collection of $K$ kernel functions, $\{\psi_k(\cdot)\}_{k=1}^{K}$, we transform $T$ into $K$ sets of temporal importance scores, $\beta = [\psi_1(T), \ldots, \psi_K(T)]$, for the next stage's use.
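The kernel collection described above can be sketched as a set of parameterized functions applied to each time interval. The functional forms and parameter names (a, b, c) below are assumptions chosen to match the verbal descriptions of the four kernels; in the model the parameters would be learned, whereas here they are fixed for illustration.

```python
import math
from typing import Callable, List

Kernel = Callable[[float], float]


def exponential_kernel(a: float, b: float, c: float) -> Kernel:
    """Assumed form a * exp(-b * d) + c: fades fast, levels off at c."""
    return lambda d: a * math.exp(-b * d) + c


def logarithmic_kernel(a: float, b: float, c: float) -> Kernel:
    """Assumed form -a * log(1 + b * d) + c: fades ever more slowly."""
    return lambda d: -a * math.log(1.0 + b * d) + c


def linear_kernel(a: float, c: float) -> Kernel:
    """Assumed form -a * d + c: influence drops at a constant rate."""
    return lambda d: -a * d + c


def constant_kernel() -> Kernel:
    """Influence does not change with time."""
    return lambda d: 1.0


def temporal_scores(intervals: List[float], kernels: List[Kernel]) -> List[List[float]]:
    """Apply every kernel to every interval, giving an L x K score table."""
    return [[k(d) for k in kernels] for d in intervals]


kernels = [exponential_kernel(1.0, 0.1, 0.0), logarithmic_kernel(1.0, 0.1, 0.0),
           linear_kernel(1.0, 0.0), constant_kernel()]
for row in temporal_scores([0.0, 10.0], kernels):
    print([round(v, 3) for v in row])
```

Each row holds the K candidate temporal weights for one event; choosing among them per event is left to the context stage.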

3.2.3. $\gamma$ stage, how did it happen:

The goal of the $\gamma$ stage is to fuse the content and temporal influence based on the extracted context information. The core design follows the multiple sets of proposed temporal dynamics in the $\beta$ stage, over which it learns a probability distribution given the context.

First, we explain our design to capture context information. In our setting, we consider the contextual information as two parts: sensitivity and seriousness. Specifically, if one event seems to be closely related to its future actions, it means the user is likely impressed by this event and his or her ongoing preference should be sensitive to the influence of this action. In contrast, if the event appears to be different from its past actions, the user is possibly not serious about this action, since his or her preference does not support it. Such factors of sensitivity and seriousness can be valuable for the model to determine the temporal dynamics that each particular event should follow. Consider the example in Figure 1 again: the repetitive interactions with smartphones reflect high seriousness, while the sparse and possibly noisy click on shoes suggests low sensitivity to its related products. This observation also motivates our design to model context as an event's relation to past and future events: we choose the Bidirectional RNN structure (schuster1997bidirectional) to capture the surrounding event context from both directions. From the input sequence embedding $X$, we can compute the recurrent hidden state of every action as its context feature vector:

$$C = \overrightarrow{\mathrm{RNN}}(X) \oplus \overleftarrow{\mathrm{RNN}}(X) \oplus C_{attr}$$

where $\oplus$ is the concatenation operation. Here, we also introduce some optional context features $C_{attr}$ that can be the attributes of each event in the specific recommendation application, representing the context when the event happened. For instance, we can infer the user's seriousness or sensitivity from the interaction types (e.g., purchase or view) or the media (e.g., mobile or desktop) associated with the action. In our experiments, we only use the hidden states of the bidirectional RNN's output as the context features, and we leave the exploration of task specific context features as our future work.

Second, the model needs to learn the mapping from the context features $C_i$ of event $i$ to a weight vector of length $K$, where the $k$th entry is the probability that this event follows $\psi_k(\cdot)$ as its temporal dynamics. We apply the feed-forward layer $F^\gamma$ to map the context features into $\mathbb{R}^K$ and then normalize them into probabilities that sum up to one for each action with a softmax layer:

$$P(\cdot \mid C) = \mathrm{softmax}\big(F^\gamma(C)\big)$$

Finally, we use the probability distribution to mix the temporal influence scores from the different kernels into the contextualized temporal influence $\beta^c$, with which we use the element-wise product to reweight the content-based importance score $\alpha$ into the contextualized temporal attention score $\gamma$:

$$\beta^c = \sum_{k=1}^{K} P(\psi_k \mid C) \odot \psi_k(T), \qquad \gamma = \mathrm{softmax}(\alpha \odot \beta^c)$$
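The mixing and fusion steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: the per-event context logits stand in for the output of the feed-forward layer over the bidirectional RNN states, the final softmax reflects the earlier statement that the attention weights sum to 1, and all names are hypothetical.

```python
import math
from typing import List


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def contextualized_attention(alpha: List[float],
                             beta: List[List[float]],
                             context_logits: List[List[float]]) -> List[float]:
    """Fuse content scores with a context-dependent mixture of kernel scores.

    alpha:          L content-based importance scores
    beta:           L x K temporal scores, one column per kernel
    context_logits: L x K unnormalized kernel preferences per event
    """
    fused = []
    for a_i, beta_i, logits_i in zip(alpha, beta, context_logits):
        p_i = softmax(logits_i)                            # kernel probabilities
        beta_c = sum(p * b for p, b in zip(p_i, beta_i))   # mixed temporal score
        fused.append(a_i * beta_c)                         # product fusion
    return softmax(fused)                                  # weights sum to 1


alpha = [0.5, 0.5]
beta = [[1.0, 0.0], [1.0, 0.0]]
# Event 0 strongly prefers kernel 0, event 1 strongly prefers kernel 1.
logits = [[10.0, -10.0], [-10.0, 10.0]]
gamma = contextualized_attention(alpha, beta, logits)
print([round(g, 3) for g in gamma])
```

Although both events have the same content score and the same kernel outputs, the context shifts which kernel each one follows, so the first event ends up with the larger attention weight.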

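This design choice, which uses a product instead of an addition to fuse the content score $\alpha$ and the contextualized temporal influence score $\beta^c$, is based on the consideration of their influence on the gradients of $\gamma$. For example, the gradient with respect to the parameters $\theta^\alpha$ of the $\alpha$ stage is

$$\frac{\partial(\alpha + \beta^c)}{\partial \theta^\alpha} = \frac{\partial \alpha}{\partial \theta^\alpha}, \qquad \frac{\partial(\alpha \odot \beta^c)}{\partial \theta^\alpha} = \beta^c \odot \frac{\partial \alpha}{\partial \theta^\alpha}$$

The error gradient in the addition form is independent of the function evaluation of $\beta^c$, while in the product form the gradients of $\alpha$ and $\beta^c$ depend on each other. Therefore, we choose the product form as a better fusion of the two scores.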

3.3. Parameter Learning

3.3.1. Loss Functions

In the previous section, we showed how the model makes recommendations by the highest similarity scores $r_i$ for all $i \in \mathcal{V}$. When training the model, we only use a subset of $\mathcal{V}$. Since the size of the item space can be very large, we apply negative sampling (mikolov2013distributed): we sample a subset of items $NS \subset \mathcal{V}$, with probability proportional to their popularity in the item corpus, that excludes the target item $i$, i.e., $i \notin NS$.
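Popularity-proportional negative sampling, as described above, can be sketched with the standard library. The function name, the use of raw interaction counts as popularity, and sampling with replacement are assumptions made for this illustration.

```python
import random
from collections import Counter
from typing import Hashable, List, Sequence


def sample_negatives(interactions: Sequence[Hashable], target: Hashable,
                     num_samples: int, rng: random.Random) -> List[Hashable]:
    """Draw negatives with probability proportional to item popularity.

    `interactions` is the list of item ids observed in the corpus.
    The target item is removed from the candidates, so it is never
    returned. Sampling is with replacement.
    """
    counts = Counter(interactions)
    counts.pop(target, None)
    if not counts:
        raise ValueError("no candidate items besides the target")
    items = list(counts)
    weights = [counts[i] for i in items]
    return rng.choices(items, weights=weights, k=num_samples)


corpus = ["a", "a", "b", "c", "c", "c"]
negatives = sample_negatives(corpus, target="c", num_samples=5, rng=random.Random(0))
print(negatives)  # only "a" and "b" can appear; "a" is twice as likely as "b"
```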

We adopt the negative log-likelihood (NLL) as the loss function for model estimation:

$$\mathcal{L}_{NLL} = -\log \frac{\exp(r_i)}{\sum_{j \in NS \cup \{i\}} \exp(r_j)}$$

which maximizes the likelihood of the target item.

We also consider two ranking-based objectives that directly optimize the quality of the recommendation list. The first is the Bayesian Personalized Ranking (BPR) loss (Rendle:2009:BBP:1795114.1795167), where $\sigma(x) = 1/(1 + e^{-x})$ is the sigmoid function:

$$\mathcal{L}_{BPR} = -\frac{1}{|NS|} \sum_{j \in NS} \log \sigma(r_i - r_j)$$

which is designed to maximize the log-likelihood of the target similarity score $r_i$ exceeding the negative samples' scores $r_j$.

The second is the TOP1 loss (hidasi2018recurrent):

$$\mathcal{L}_{TOP1} = \frac{1}{|NS|} \sum_{j \in NS} \big[\sigma(r_j - r_i) + \sigma(r_j^2)\big]$$

which heuristically combines one part that aims to push the target similarity score $r_i$ above the scores of the negative samples, and another part that lowers the scores of the negative samples towards zero, acting as a regularizer that additionally penalizes high scores on the negative examples.
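The three objectives described above can be sketched for a single training example as follows. The formulas follow the standard definitions of sampled-softmax NLL, BPR, and TOP1 from the cited works; the function names and the scalar-score interface are assumptions for illustration.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def nll_loss(r_pos: float, r_negs: List[float]) -> float:
    """Negative log-likelihood of the target under a softmax over
    the target score and the sampled negative scores."""
    scores = [r_pos] + list(r_negs)
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - r_pos


def bpr_loss(r_pos: float, r_negs: List[float]) -> float:
    """Mean of -log sigmoid(target score - negative score)."""
    return -sum(math.log(sigmoid(r_pos - r)) for r in r_negs) / len(r_negs)


def top1_loss(r_pos: float, r_negs: List[float]) -> float:
    """Ranking term plus a regularizer pushing negative scores to zero."""
    return sum(sigmoid(r - r_pos) + sigmoid(r * r) for r in r_negs) / len(r_negs)


# A well-separated target gives a lower loss than an unseparated one.
print(nll_loss(5.0, [0.0, 0.0]) < nll_loss(0.0, [0.0, 0.0]))    # True
print(bpr_loss(5.0, [0.0, 0.0]) < bpr_loss(0.0, [0.0, 0.0]))    # True
print(top1_loss(5.0, [0.0, 0.0]) < top1_loss(0.0, [0.0, 0.0]))  # True
```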

3.3.2. Regularization

We introduce regularization through the dropout mechanism (srivastava2014dropout) in the neural network. In our implementation, we place a dropout layer after each feed-forward layer and after the output layer of the context bidirectional RNN. We leave as our future work the exploration of batch normalization as well as regularization techniques for the parameters in the temporal kernels.

3.3.3. Hyperparameter Tuning

We initialize the model parameters through the Kaiming initialization proposed by he2015delving. The temporal kernel parameters are initialized within a proper range in order to prevent numerical instability during training. We use the ReLU function, $\mathrm{ReLU}(x) = \max(0, x)$, by default as the activation function in the feed-forward layer.

4. Experiments

In this section, we perform extensive experiment evaluations of our proposed sequential recommendation solution. We compared it with an extensive set of baselines, ranging from session-based models to temporal and sequential models, on two very large collections of online user behavior log data. We will start from the description of experiment setup and baselines, and then move onto the detailed experiment results and analysis.

4.1. Dataset

We use two public datasets known as XING and UserBehavior. The statistics of the datasets are summarized in Table 1. The two datasets include user behaviors from two different application scenarios.

Dataset            XING             UserBehavior
Users              64,890           68,216
Items              20,662           96,438
Actions            1,438,096        4,769,051
Actions per user   22.16 ± 21.25    69.91 ± 48.98
Actions per item   69.60 ± 112.63   49.45 ± 65.31
Time span          80 days          9 days
Table 1. Statistics of two evaluation datasets.

The XING dataset is extracted from the Recsys Challenge 2016 dataset (Pacuk:2016:RCJ:2987538.2987544), which contains a set of user actions on job postings from a professional social network site (https://www.xing.com/). Each action is associated with the user ID, item ID, action timestamp and interaction type (click, bookmark, delete, etc.). Following the prior work (quadrana2017personalizing; you2019hierarchical), we removed interactions with type “delete” and did not consider the interaction types in the data. We removed items associated with fewer than 50 actions, and removed users with fewer than 10 or more than 1000 actions. We also removed the interactions of the same item and action type with less than 10 seconds of dwell time.

The UserBehavior dataset (zhu2018learning) is provided by Alibaba and contains user interactions on commercial products from an e-commerce website (https://www.taobao.com/). Each action is associated with the user ID, item ID, action timestamp and interaction type (click, favor, purchase, etc.). In order to have a computationally tractable deep learning model, we randomly sub-sampled 100,000 users’ sequences from each dataset for our experiment. We removed items associated with fewer than 20 actions, and then removed users with fewer than 20 or more than 300 actions. We also removed the interactions whose timestamps fall outside the 9-day range that the dataset specifies.
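The item and user filtering described for both datasets can be sketched as below. This is a minimal sketch assuming a single pass that filters items first and users second, as the UserBehavior description suggests; the tuple layout and the function name `filter_actions` are hypothetical.

```python
from collections import Counter


def filter_actions(actions, min_item_actions, min_user_actions, max_user_actions):
    """Filter a list of (user_id, item_id, timestamp) tuples.

    Step 1 drops items with fewer than `min_item_actions` actions.
    Step 2 drops users whose remaining action count is outside
    [min_user_actions, max_user_actions]. A single pass is assumed.
    """
    item_counts = Counter(item for _, item, _ in actions)
    kept = [a for a in actions if item_counts[a[1]] >= min_item_actions]
    user_counts = Counter(user for user, _, _ in kept)
    return [
        a for a in kept
        if min_user_actions <= user_counts[a[0]] <= max_user_actions
    ]
```

With the thresholds quoted above, the UserBehavior setting would correspond to `filter_actions(actions, 20, 20, 300)` and the XING setting to `filter_actions(actions, 50, 10, 1000)`.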

4.2. Experiment Setup

Dataset       Metric    CTA     Pop     S-Pop   Markov  GRU4Rec  HRNN    LSHP    SASRec  M3R
XING          Recall@5  0.3217  0.0118  0.2059  0.2834  0.2690   0.2892  0.2173  0.2530  0.2781
              MRR@5     0.1849  0.0062  0.1202  0.2319  0.2008   0.2392  0.1454  0.2254  0.2469
UserBehavior  Recall@5  0.1611  0.0026  0.1093  0.0846  0.0936   0.0940  0.1201  0.1418  0.1077
              MRR@5     0.0925  0.0013  0.0639  0.0534  0.0619   0.0610  0.0792  0.0863  0.0689
Table 2. Performance comparison of different methods on sequential recommendation.

4.2.1. Baseline methods

We compare our proposed Contextualized Temporal Attention Mechanism with a variety of baseline methods. All implementations are open sourced at https://github.com/Charleo85/seqrec. To ensure a fair comparison among the deep learning models, we adjust the number of layers and hidden units such that all the models have a similar number of trainable parameters.

Heuristic methods. We include some simple heuristic methods, which show strong performance in prior sequential recommendation work (quadrana2017personalizing; li2017neural).

  • Global Popularity (Pop). Rank items by their popularity in the entire training set in descending order.

  • Sequence Popularity (S-Pop). Rank items by their popularity in the target user’s action sequence in descending order. The popularity of an item is updated sequentially as more actions of the target user are observed.

  • First Order Markov Model (Markov). This method makes the Markov assumption that each action depends only on the last action. It ranks items according to their probability given the item in the last action, which is estimated from the training set.
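The three heuristic baselines above can be sketched in a few lines. This is a minimal sketch, not the released implementation: the function names are hypothetical, sequences are assumed to be plain lists of item IDs, and ties are broken by item ID only to make the output deterministic.

```python
from collections import Counter, defaultdict


def rank_by_count(counts):
    """Return items sorted by count (descending), ties broken by item ID."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], str(kv[0])))
    return [item for item, _ in ordered]


def pop_ranking(train_sequences):
    """Pop: rank by popularity over the entire training set."""
    return rank_by_count(Counter(i for seq in train_sequences for i in seq))


def spop_ranking(history):
    """S-Pop: rank by popularity within the target user's history so far."""
    return rank_by_count(Counter(history))


def fit_markov(train_sequences):
    """Count first-order transitions (previous item -> next item)."""
    transitions = defaultdict(Counter)
    for seq in train_sequences:
        for prev, nxt in zip(seq, seq[1:]):
            transitions[prev][nxt] += 1
    return transitions


def markov_ranking(transitions, last_item):
    """Markov: rank by estimated P(next | last item).

    Ranking by raw transition count is equivalent, since all candidates
    share the same denominator for a fixed last item.
    """
    return rank_by_count(transitions.get(last_item, Counter()))
```

For S-Pop, calling `spop_ranking` again on the extended history after each observed action reproduces the sequential update mentioned above.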

Session-based Models. We include several deep learning based models with session assumptions. We set the session cut-off threshold as 30 minutes by convention.

  • Session based Recurrent Neural network (GRU4Rec). hidasi2015session used the GRU, a variant of Recurrent Neural network, to model the user preference transition in each session. The session assumption is shown to be beneficial for a consistent transition pattern.

  • Hierarchical Recurrent Neural network (HRNN). quadrana2017personalizing proposed a hierarchical structure that uses one GRU to model the user preference transition in each session and another to model the transition across the sessions.
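The 30-minute session cut-off used for the session-based baselines can be illustrated as follows. This is a minimal sketch assuming that a new session starts whenever the gap between two consecutive actions exceeds the threshold, and that timestamps are sorted and given in seconds; the function name `split_sessions` is hypothetical.

```python
def split_sessions(timestamps, threshold_seconds=30 * 60):
    """Group action indices into sessions by inactivity gap.

    `timestamps` must be sorted in ascending order (seconds). A new
    session starts when the gap to the previous action exceeds
    `threshold_seconds`.
    """
    sessions = []
    for idx, t in enumerate(timestamps):
        if idx == 0 or t - timestamps[idx - 1] > threshold_seconds:
            sessions.append([])
        sessions[-1].append(idx)
    return sessions
```

For example, `split_sessions([0, 60, 4000, 4100, 10000])` groups the five actions into three sessions.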

Temporal Models. Since our model additionally uses the temporal information to make the sequential recommendation, we include the following baselines that explicitly consider temporal factors and have been applied in sequential recommendation tasks.

  • Long- and Short-term Hawkes Process (LSHP). cai2018modeling proposed a Long- and Short-term Hawkes Process that uses a uni-dimension Hawkes process to model transition patterns across sessions and a multi-dimension Hawkes process to model transition patterns within a session.

Sequential Models. We also include several deep learning based models that, like our proposed CTA model, directly learn the transition pattern in the entire user sequence. A fixed size window is selected for better performance and a more memory-efficient implementation.

  • Self-attentive Sequential Recommendation (SASRec). kang2018self applied the self-attention based model on sequential recommendation. It uses the hidden state of the last encoder layer for the last input to predict the next item for the user. We use 4 self-attention blocks and 2 attention heads with hidden size 500 and position embedding. We set the input window size to 8.

  • Multi-temporal-range Mixture Model (M3R).  tang2019towards proposed a mixture neural model to encode the users’ actions from different temporal ranges. It uses the item co-occurrence as tiny-range encoder, RNN/CNN as short-range encoder and attention model as long-range encoder. Following the choice in its original paper, we use GRU with hidden size 500 as the short-range encoder.

4.2.2. Implementation Details

For our proposed model, we use self-attention blocks and attention heads with hidden size . We use the same representation for input and output item embeddings , and a combination of 5 exponential decay kernels (). We use a bidirection RNN with hidden size in total of both directions to extract context features. We set the learning rate as . We will present the experiments on different settings of our model in the following section.

4.2.3. Experiment settings.

We split all the data by user, and select 80% of the users to train the model, 10% as the validation set and the remaining 10% of the users to test the model. We also adopt the warm start recommendation setting, where the model is evaluated after observing at least 5 historical actions for each test user.
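The user-level split and the warm start setting can be sketched as below. This is a minimal sketch under stated assumptions: users are shuffled with a fixed seed before splitting, and warm start is read as "only positions with at least 5 preceding actions are evaluated". The function names and the seed are hypothetical.

```python
import random


def split_users(user_ids, train_frac=0.8, valid_frac=0.1, seed=0):
    """Split distinct users into train / validation / test groups."""
    users = sorted(set(user_ids))
    random.Random(seed).shuffle(users)
    n_train = int(len(users) * train_frac)
    n_valid = int(len(users) * valid_frac)
    train = users[:n_train]
    valid = users[n_train:n_train + n_valid]
    test = users[n_train + n_valid:]
    return train, valid, test


def warm_start_positions(sequence, min_history=5):
    """Indices of actions that have at least `min_history` predecessors."""
    return list(range(min_history, len(sequence)))
```

Because the split is by user, every action sequence belongs to exactly one of the three groups.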

All the deep learning based models are trained with the Adam optimizer with momentum 0.1. We also search over a set of candidate learning rates and report the one that yields the best results. We set the batch size to 100, and set the number of negative samples to 100. The model uses the TOP1 loss by default. The item embedding is trained along with the model, and we use the embedding size 500 for all deep learning models. The training is stopped when the validation error plateaus. For the self-attention based model, we follow the training convention (zhu2018learning) by warming up the model in the first few epochs with a small learning rate.

4.2.4. Evaluation metrics.

The model predicts the user action at the time of the next observed action. The result is evaluated by ranking the ground-truth action against a pool of candidate actions. For both datasets, the candidate pool is the set of all items in the dataset, though only a subset of negative items is sampled for model optimization.

We rank the candidates by their predicted probabilities and compute the following evaluation metrics:

  • Recall@K. It reports the percentage of times that the ground-truth relevant item is ranked within the top K list of retrieved items.

  • MRR@K. The mean reciprocal rank is used to evaluate the prediction quality from the predicted ranking of relevant items. It is defined as the average reciprocal rank for ground-truth relevant items among the top K list of retrieved items. If the rank is larger than K, the reciprocal rank is 0.
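The two metrics defined above can be computed directly from the rank of the ground-truth item in each test case. The sketch below assumes 1-based ranks and a rank defined as one plus the number of candidates scoring strictly higher than the target; the function names are hypothetical.

```python
def rank_of_target(scores, target_index):
    """1-based rank of the target among all candidate scores."""
    target_score = scores[target_index]
    return 1 + sum(1 for s in scores if s > target_score)


def recall_at_k(ranks, k):
    """Fraction of test cases whose ground-truth item is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mrr_at_k(ranks, k):
    """Mean reciprocal rank, counting 0 when the rank is larger than k."""
    return sum(1.0 / r for r in ranks if r <= k) / len(ranks)
```

As a usage example, for ranks `[1, 3, 6, 2]` and K = 5, three of the four targets fall within the top 5, and the reciprocal ranks are 1, 1/3, 0 and 1/2.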

4.3. Experimental results

4.3.1. Overall Performance.

We summarize the performance of the proposed model against all baseline models on both datasets in Table 2. The best solution is highlighted in bold face.

Similar to the results reported in prior work (quadrana2017personalizing; kang2018self), heuristic methods do show strong baseline performance on our sequential recommendation tasks. Based on their strong performance, we can conclude that the XING dataset features first order transition, while the UserBehavior dataset features sequence popularity. These results are not surprising, because it is common for a user to visit the same item several times back and forth in online shopping scenarios, and to visit a next job posting closely related to the recent history. This also results in different strengths of each model on the two datasets, which we will analyze in the next two sections.

Notably, on both datasets, our proposed model CTA outperforms all baselines in Recall@5 by a large margin (11.24% on XING dataset, 14.18% on UserBehavior dataset). The model’s MRR@5 performance is strong on the UserBehavior dataset, but weak on the XING dataset. This suggests that our model fails to learn a good ranking for the first order transition pattern, since it uses a weighted sum of the input sequence for prediction. Nevertheless, such a weighted sum design is effective at capturing the sequential popularity pattern. The results also show that our model outperforms the self-attentive baselines, which suggests that our contextual temporal influence reweighing improves sequential order modeling in recommendation applications, compared to the positional embedding borrowed from natural language modeling.
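The relative Recall@5 margins can be recomputed from the rounded entries of Table 2. The sketch below does so, taking the best non-CTA method per dataset as the reference, which is an assumption about how the margin was defined. From the rounded entries it yields about 11.24% on XING and about 13.61% on UserBehavior; the latter differs from the quoted 14.18%, and the source does not state which values that figure was computed from.

```python
# Recall@5 values copied from Table 2.
RECALL_AT_5 = {
    "XING": {
        "CTA": 0.3217, "Pop": 0.0118, "S-Pop": 0.2059, "Markov": 0.2834,
        "GRU4Rec": 0.2690, "HRNN": 0.2892, "LSHP": 0.2173,
        "SASRec": 0.2530, "M3R": 0.2781,
    },
    "UserBehavior": {
        "CTA": 0.1611, "Pop": 0.0026, "S-Pop": 0.1093, "Markov": 0.0846,
        "GRU4Rec": 0.0936, "HRNN": 0.0940, "LSHP": 0.1201,
        "SASRec": 0.1418, "M3R": 0.1077,
    },
}


def relative_gain(model_score, baseline_score):
    """Relative improvement of the model over a baseline, in percent."""
    return (model_score - baseline_score) / baseline_score * 100.0


def best_baseline(scores, model="CTA"):
    """Name and score of the strongest method other than `model`."""
    name = max((m for m in scores if m != model), key=lambda m: scores[m])
    return name, scores[name]


def margin(dataset):
    """Relative Recall@5 gain of CTA over the best baseline on a dataset."""
    scores = RECALL_AT_5[dataset]
    _, baseline_score = best_baseline(scores)
    return relative_gain(scores["CTA"], baseline_score)
```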

4.3.2. Results on XING dataset.

The RNN-based methods outperformed both temporal models and attention-based models. This again confirms that the recurrent model is good at capturing the first order transition pattern or the near term information. We also observe that the hierarchical RNN structure outperforms the first order baseline, while the session-based RNN does not perform as well as this strong heuristic baseline. This demonstrates the advantage of the hierarchical structure and reinforces our motivation to segment user history for modeling users’ sequential behaviors.

4.3.3. Results on UserBehavior dataset.

Contrary to the observations on the XING dataset, the temporal models and attention-based models outperformed RNN-based methods. This means the recurrent structure is weak at learning the sequential popularity pattern, while the attention-based approach is able to effectively capture such long-term dependence. Such conflicting behavior of existing baselines is exactly one of the concerns this work attempts to address. This again validates our design to evaluate and capture the long- and short-term dependence through the proposed three-stage weighing pipeline.

4.4. Performance analysis

4.4.1. Ablation Study

Architecture        Setting   XING                  UserBehavior
                              Recall@5   MRR@5      Recall@5   MRR@5
Base                          0.3216     0.1847     0.1611     0.0925
Window size (L)     4         0.3115     0.2167     0.1488     0.0899
                    16        0.3049     0.1733     0.1433     0.0914
                    32        0.3052     0.1735     0.1401     0.0950
Attention blocks    1         0.3220     0.1851     0.1631     0.0926
                    4         0.3217     0.1849     0.1631     0.0924
Attention heads     1         0.3225     0.1860     0.1622     0.0919
                    4         0.3225     0.1860     0.1646     0.0940
Sharing embedding             0.1263     0.0791     0.1042     0.0192
Embedding size      300       0.3147     0.1831     0.1622     0.0920
                    1000      0.3207     0.1857     0.1628     0.0921
Loss function       NLL       0.3130     0.1806     0.1571     0.0895
                    BPR       0.3163     0.1804     0.1598     0.0913
Flat attention                0.3215     0.1869     0.1588     0.0907
Global context                0.3207     0.1839     0.1603     0.0912
Local context                 0.3210     0.1841     0.1591     0.0912
Kernel types                  0.3191     0.1827     0.1604     0.0910
                              0.3122     0.2141     0.1591     0.0907
                              0.3207     0.1844     0.1627     0.0925
                              0.2917     0.2323     0.1562     0.0976
                              0.3025     0.2209     0.1670     0.1010
                              0.3214     0.2183     0.1673     0.0997
                              0.3111     0.2196     0.1618     0.0931
                              0.3230     0.1869     0.1635     0.0932
                              0.3241     0.1888     0.1635     0.0932
                              0.3273     0.2146     0.1673     0.0997
                              0.3254     0.1971     0.1664     0.0983
Table 3. Ablation analysis on two datasets under metrics of Recall@5 (left) and MRR@5 (right). The best performance is highlighted in bold face, and dedicated markers denote a drop or increase of performance of more than 5%. The kernel types are the exponential, logarithmic, linear and constant temporal kernels. The superscript on the kernel function denotes the number of such kernels used in the model.

We perform ablation experiments over a number of key components of our model in order to better understand their impacts. Table 3 shows the results of our model’s default setting and its variants on both datasets, and we analyze their effects respectively:

Window size. We found that the base window size appears to be the best setting among the input window sizes we tested on both datasets. The exceptions are that a smaller window size on XING and a larger window size on UserBehavior can improve MRR@5, even though Recall@5 still drops. The reason might be suggested by the previous observation that the first order transition pattern dominates the XING dataset so that it favors a smaller input window, while the sequence popularity pattern is strong in the UserBehavior dataset such that it favors a larger input window size.

Loss functions. The choice of loss function also affects our model’s performance. The ranking-based loss functions, BPR and TOP1, are consistently better than the NLL loss, which only maximizes the likelihood of target items. The TOP1 loss function, with an extra regularizer on the absolute score of negative samples, can effectively improve the model performance and reduce the over-fitting observed with the other two loss functions.

Self-Attention settings. We compare the model performance under different numbers of attention blocks and attention heads. The performance difference is minimal on XING, but relatively obvious on UserBehavior. This indicates that the content-based importance score is more important in capturing the sequential popularity pattern than the first order transition pattern.

Item embedding. We test the model with separate embedding spaces for the input and output sequence representations, and the model performance drops by a large margin. Prior work in sequential recommendation, e.g., (kang2018self), reported similar observations. Even though a separate embedding space is a popular choice in neural language models (press-wolf-2017-using), the item corpus appears to be too sparse to afford two distinct embedding spaces. The dimensionality of the embedding space slightly affects the model performance on both datasets: a change that increases Recall@5 tends to decrease MRR@5, and vice versa. A trade-off between ranking and coverage exists between larger and smaller embedding spaces.

4.4.2. Discussion on Model Architecture

To further analyze the strength and weakness of our model design, we conduct experiments specifically to answer the following questions:

Does the model capture the content influence? To understand whether our model is able to learn a meaningful content-based importance score, we replace the self-attention component with a flat attention module, such that it always outputs a flat score. We list this model’s performance in Table 3 as ‘Flat attention’.

The performance stays almost the same on XING, but drops slightly on the UserBehavior dataset. It shows that the content-based importance score is less important for the sequential recommendation tasks when the first order transition pattern dominates, but is beneficial for the sequential popularity based patterns. It also suggests that the contextualized temporal importance alone is already a strong indicator of how historical actions relate to the current user preference.

Does the model extract the context information? As the effect of temporal influence depends on our context modeling component, we design the following experiments on the context component to answer two follow-up questions.

First, whether the local context information of each event is captured. We replace the local conditional probability vector with a global probability vector, i.e., a single weight vector learnt on all contexts. This model’s performance is listed in the table as ‘Global context’. We can observe a consistent drop in performance on both datasets.

Second, whether the local context is conditioned on its nearby events. We replace the local conditional probability vector with a local probability vector conditioned only on the event itself. More specifically, instead of using the bidirectional RNN component, the model now uses a feed-forward layer to map each event to the probability space. This model’s performance is listed in the table as ‘Local context’. We again observe a consistent drop in performance, though it is slightly better than the global context setting.

In conclusion, our model is able to extract the contextual information and map it to probabilities over the different temporal influences on both datasets.

Does the model capture temporal influence? We conduct multiple experiments on the number of temporal kernels and the combined effect of different kernel types.

Firstly, we want to understand the advantages and limitations of each kernel type. We look at the model performance obtained with a single constant temporal kernel. Its performance on MRR@5 is the worst among all the kernel settings on both datasets. At the same time, we compare the settings of 10 exponential, 10 logarithmic and 10 linear kernels. The 10 linear kernels setting is overall the best on both datasets, especially in improving the ranking-based metrics. It shows that it is beneficial to model the temporal influence with the actual time intervals transformed by appropriate kernel functions.

Secondly, we compare the model performance on different number of temporal kernels. The results suggested that the model performance always improves from using a single kernel to multiple kernels. This directly supports our multi-kernel design. Specifically, among the exponential kernels , performs the best on XING, yet not as good as on UserBehavior. On linear kernels , as the kernel number increases, Recall@5 improves, but MRR@5 drops on XING. Similarly on UserBehavior, achieves the best ranking performance, but the induces a better coverage. Hence, the model with more kernels does not necessarily perform better, and as a conclusion, we need to carefully tune the number of kernels for better performance on different tasks.

Thirdly, we study the combinatorial effect of different kernel types. We can observe that all the kernel combinations we experimented with improve the performance on both datasets, compared to the base setting. This suggests that the diversity of kernel types is beneficial to capturing a better contextualized temporal influence. However, the results on both datasets also show that, while mixing the exponential kernel with either the linear or the logarithmic kernel can improve the model performance, mixing all three of them together only worsens the performance. We hypothesize that certain interference exists among the kernel types so that their performance improvements cannot simply add up. We leave the exploration of the best combination of kernels as our future work.

Overall, we believe that the temporal influence can be captured by current model design, and there are opportunities left to improve the effectiveness and consistency of the current kernel based design.
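The idea discussed in this subsection, several temporal kernels whose outputs are mixed by context-dependent probabilities, can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: the functional forms and parameter values of the four kernels, the softmax used to turn context logits into mixture probabilities, and all function names are hypothetical and are not the paper's definitions.

```python
import math


# Hypothetical kernel forms over a non-negative time interval d.
def exponential_kernel(d, a=1.0, b=0.1):
    return a * math.exp(-b * d)


def logarithmic_kernel(d, a=0.2, b=1.0, c=1.0):
    return c - a * math.log(1.0 + b * d)


def linear_kernel(d, a=0.01, b=1.0):
    return max(b - a * d, 0.0)


def constant_kernel(d):
    return 1.0


def softmax(logits):
    """Convert context logits into a probability vector over kernels."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_influence(d, kernels, context_logits):
    """Mix kernel outputs at interval d using context-derived probabilities."""
    probs = softmax(context_logits)
    return sum(p * k(d) for p, k in zip(probs, kernels))
```

In this sketch, a context that puts most of its probability on the exponential kernel makes the influence decay quickly with the time interval, while a context that favors the constant kernel ignores the interval altogether.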

Figure 4. Attention visualization. The blue (left) bar is the content-based importance score, the orange (middle) bar is the contextualized temporal influence score, and the green (right) bar is the combined importance score. The figure contains three different sequences selected from the test set of the UserBehavior dataset.

4.4.3. Attention Visualization

To examine the model’s behavior, we visualize how the importance score shifts in some actual examples in Figure 4. The x-axis is a series of actions with their associated items and the time interval from action time to current prediction time. (For privacy concerns, these datasets do not provide the actual item content, so we represent the items in the figure with symbols.) From left to right, it follows a chronological order from distant to recent history. We select the examples such that the ground-truth next item is among the historical actions for the sake of simplicity, and we use a smiley face symbol to denote that the item of a historical action is the same as the target item. Each action on the x-axis is associated with three bars. Their values on the y-axis are the three computed scores of each event in the model (content-based importance, contextualized temporal influence, and combined importance) after normalization (by z-score). The model uses the temporal kernel combination that gives its best performance.
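The z-score normalization applied to the plotted scores can be sketched as follows. This sketch assumes that each score type is normalized within one sequence using the population standard deviation; the source does not state either detail, and the function name is hypothetical.

```python
import statistics


def z_scores(values):
    """Standardize values to zero mean and unit (population) std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # All values identical: every z-score is zero by convention here.
        return [0.0] * len(values)
    return [(v - mean) / std for v in values]
```

After this normalization, a bar above zero marks an event that scores above the sequence average, which is how "ranked above average" can be read off the figure.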

Orange bars. The contextualized temporal influence score, in both sequences A and B, follows the chronological order, i.e., the score increases as the time interval shortens. In addition, such variation is not linear over time: the most recent one or two actions tend to have higher scores, while the distant actions tend to have similarly low scores. In sequence C, as all actions happened long before the current prediction, the context factor decides the height of the orange bars. The model is able to extract the context condition and assign high temporal importance to one event, which is indeed the target item. These observations all suggest that the contextualized temporal influence is captured in a non-trivial way that helps our model to better determine the relative event importance.

Blue bars. The content-based importance score shows a different distribution on each of the sequences. This is expected, as we want to model the importance based on the event correlation, independent of the sequence order. Only in the third example is the target, i.e., the most relevant historical action, ranked above average according to the content-based importance score. This again shows the important role of the temporal order in improving the ranking quality for sequential recommendation.

Green bars. The combined score largely follows the relative importance ranking of the orange bars. In other words, the contextualized temporal order is the dominating factor in determining the relative importance of each input in our selected examples. This corresponds to the previous observation that the model performance only drops slightly if the self-attention component outputs flat scores. This supports our motivation to model the contextualized temporal order in sequential recommendation tasks.

Although these are only three example interaction sequences from a much larger pool of users, they give a more intuitive understanding of the reweighing behavior of our model design, the core part that helps boost the recommendation performance over the existing baselines. However, there are also many cases where the importance scores are still hard to interpret, especially if there is no obvious correspondence between the target item and the historical actions. Better techniques to visualize and analyze the importance scores for interpretable neural recommender systems are needed as follow-up research.

5. Conclusion

This work identifies and addresses a critical problem in sequential recommendation, Déjà vu, namely that the user interest based on historical events varies over time and under different contexts. Our empirical evaluations show that the proposed model, CTA, has the following advantages:

  • Efficacy & Efficiency. Compared with the baseline work, CTA effectively improves the recommendation quality by modeling the contextualized temporal information. It also inherits the advantage of self-attention mechanism for its reduced parameters and computational efficiency, as the model can also be deployed in parallel.

  • Interpretability. Our model, featuring the three-stage weighing mechanism, shows promising traits of interpretability. From the elementary analysis demonstrated in our experiments, we can have a reasonable understanding of why an item is recommended, e.g., because of its correlation with some historical actions, and how much of that is due to temporal influence or to the context condition.

  • Customizability. The model design is flexible in many parts. In the content stage, the model can extract the content-based importance by any means, such as the sequence popularity heuristics, which makes it customizable for recommendation applications with different sequential patterns. In the temporal stage, as we mentioned earlier, we can adopt different choices of temporal kernels to encode prior knowledge of the recommendation task. The context stage is designed to incorporate extra context information from the dataset, and one can also use more sophisticated neural structures to capture the local context given the surrounding events.

Nevertheless, our understanding of the temporal kernels is still limited, including which choices are likely to be optimal for certain tasks and how we can regularize the kernels for more consistent performance. Our current solution also ignores an important factor in recommendation: the user, as we assumed that everything about the user has been recorded in the historical actions preceding the recommendation. As our future work, we plan to explicitly model the user in our solution, and incorporate the relation among users, e.g., collaborative learning, to further exploit the information available for sequential recommendation.

6. Acknowledgements

We thank all the anonymous reviewers for their helpful comments. This work was partially supported by the National Science Foundation Grant IIS-1553568.