This space archives information related to recommender systems.
Understanding temporal dynamics has proved to be highly valuable for accurate recommendation. Sequential recommenders have been successful in modeling the dynamics of users and items over time. However, while different model architectures excel at capturing various temporal ranges or dynamics, distinct application contexts require adapting to diverse behaviors. In this paper we examine how to build a model that can make use of different temporal ranges and dynamics depending on the request context. We begin with the analysis of an anonymized YouTube dataset comprising millions of user sequences. We quantify the degree of long-range dependence in these sequences and demonstrate that both short-term and long-term dependent behavioral patterns co-exist. We then propose a neural Multi-temporal-range Mixture Model (M3) as a tailored solution to deal with both short-term and long-term dependencies. Our approach employs a mixture of models, each with a different temporal range. These models are combined by a learned gating mechanism capable of exerting different model combinations given different contextual information. In empirical evaluations on a public dataset and our own anonymized YouTube dataset, M3 consistently outperforms state-of-the-art sequential recommendation methods.
Across the web and mobile applications, recommender systems are relied upon to surface the right items to users at the right time. Some of their success can be attributed to advances in modeling as well as the ingenuity of applied researchers in adopting and inventing new techniques to solve this important problem (Sarwar et al., 2001; Koren et al., 2009; Rendle et al., 2009; He et al., 2017). Fundamentally, recommenders match users in a particular context with the best personalized items that they will engage with (Linden et al., 2003; Gomez-Uribe and Hunt, 2016). In order to do this effectively, recommenders need to understand the users, typically based on their previous actions, and to understand items, most often based on the users who previously interacted with them. This presents a fundamental challenge: users’ preferences and items’ perception are continuously changing over time, and the recommender system needs to understand these dynamics.
A significant amount of research has recognized forms of this problem. Sequence information has been generally shown to improve recommender performance (He and McAuley, 2016; Rendle et al., 2010). Koren (2009) identified multiple user and item dynamics in the Netflix Prize competition, and incorporated these dynamics as biases in a collaborative filtering model. Wu et al. (2017) and Beutel et al. (2018) demonstrated that Recurrent Neural Networks (RNNs) could learn many of these patterns, and likewise Hidasi et al. (2015) demonstrated that RNNs can learn patterns in individual sessions. Despite these successes, RNNs are known to have difficulties learning long-range dependent temporal patterns (Belletti et al., 2018).
We observe and study an open challenge for such sequential recommender systems: while different applications and contexts require different temporal ranges and patterns, model architectures are typically designed to capture a particular temporal dynamic. For example, when a user comes to the Amazon home page they may be looking for something new to buy or watch, but on an item specific page they may be looking for other items that are closely related to recently browsed items. How can we design a model that works, simultaneously, across all of these contexts and temporal ranges?
Contributions: We address the issue of providing a single model adapted to the diversity of contexts and scales of temporal dependencies in sequential recommendations through data analysis and the design of a Multi-temporal-range Mixture Model, or M3 for short. We make the following contributions to this problem:
Data-driven design: We demonstrate that in real world recommendation tasks there are significant long-range temporal dependencies in user sequence data, and that previous approaches are limited in their ability to capture those dynamics. M3’s design is informed by this quantitative analysis.
Multi-range Model: We offer a single model, M3, which is a mixture model consisting of three sub-models (each with a distinct manually designed architecture) that specialize in capturing different ranges of temporal dependencies. M3 can learn how to dynamically choose to focus on different temporal dynamics and ranges depending on the application context.
Empirical Benefits and Interpretability: We show on both public academic and private data that our approach provides significantly better recommendations. Further, using its interpretable design, we analyze how M3 dynamically switches between patterns present at different temporal ranges for different contexts, thus showing the value in enabling context-specific multi-range modeling. Our private dataset consists of anonymized user sequences from YouTube. To the best of our knowledge this paper is the first to focus on sequential patterns in such a setting.
Before we describe our sequential recommendation problem and provide the quantitative insights orienting the design of a novel sequential neural model based on a mixture of models, we briefly introduce the reader to some key pre-existing related work.
Matrix factorization (Koren et al., 2009) is among the most popular techniques used in classic recommender research, in which a similarity score for each user-item pair is learned by building latent user and item representations to recover historical user-item interactions. The predicted similarity score is then used to indicate the relatedness and find the most relevant items to recommend to a user. Follow-up work introducing auxiliary sources of information beyond user-item interactions has proven successful (Covington et al., 2016), especially for cold-start problems. Pazzani and Billsus (2007) use item content (e.g., product image, video's visual/audio content, etc.) to provide a better item representation.
Neural Recommender Systems.
Deep neural networks have gained tremendous success in the fields of Computer Vision (Karpathy et al., 2014; Krizhevsky et al., 2012) and Natural Language Processing (Bahdanau et al., 2014; Mikolov et al., 2010). In recommender research, we have witnessed growing interest in using deep neural networks to model complex contextual interactions between users and items, which surpass classic factorization-based methods (Koren et al., 2009; Rendle, 2010). Auto-encoders (Sedhain et al., 2015; Wu et al., 2016; Liang et al., 2018) constitute an early example of success for a framework based on neural networks to better infer unobserved user/item affinities in a recommendation problem. He et al. (2017) also proved that traditional Collaborative Filtering methods can be effectively generalized by a deep neural network. Besides,
have become a common choice. Other methods based on Convolutional Neural Networks (CNNs) (Tang and Wang, 2018a; Yuan et al., 2019) and attention models (Zhou et al., 2018b) have also been explored. While most existing methods developed for sequential recommendations perform well (He and McAuley, 2016; Rendle et al., 2010; Tang and Wang, 2018a; Hidasi et al., 2015; Smirnova and Vasile, [n. d.]), they still have some limitations when dealing with the long user sequences found in production recommender systems. As we shall discuss in Section 3, such approaches do not scale well to very long sequences.
Mixture of Models. Despite being simpler and more elegant, monolithic models are in general less effective than mixtures of models at taking advantage of different model capacities and architectural biases. Gehring et al. (2017) used an RNN in combination with an attention model for neural machine translation, which provided a substantial performance gain. Pinheiro and Collobert (2014) proposed to combine a CNN with an RNN for scene labeling. In the field of sequential recommendation, an earlier work mixing a Latent Factor Model (LFM) and a Factorized Markov Chain (FMC) has been shown to offer better performance than either model individually (Rendle et al., 2010). A similar trend was observed in (He and McAuley, 2016; Zhou et al., 2018a). While sharing a similar spirit with these aforementioned methods, we designed our mixture of models with the goal of modeling varying ranges of dependence in the long user sequences found in real production systems. Unlike model ensembles (Dietterich, 2000; Zhou et al., 2002; Zhu et al., 2018) that learn individual models separately prior to ensembling them, a mixture of models learns individual models as well as combination logic simultaneously.
We first present some findings on our anonymized proprietary dataset which uncover properties of behavioral patterns as observed in extremely-long user-item interaction sequences. We then pinpoint some limitations of existing methods which motivate us to design a better adapted solution.
Sequential Recommendation Problem: We consider a sequential recommendation problem (Rendle et al., 2010; He and McAuley, 2016; Tang and Wang, 2018a; Hidasi et al., 2015; Smirnova and Vasile, [n. d.]) defined as follows: assume we have a set of users $u \in \mathcal{U}$, a set of items $v \in \mathcal{V}$, and for each user we have access to a sequence of user historical events $\mathcal{E}^u = (e^u_1, e^u_2, \cdots)$ ordered by time. Each $e^u_\tau$ records the item consumed at time $\tau$ as well as context information of the interaction. Given the historical interactions, our goal is to recommend to each user a subset of items in order to maximize a performance metric such as user satisfaction.
We now describe how we developed a better understanding of long user sequences in our proprietary dataset through quantitative data exploration. To quantify how past events can influence a user's current behavior in our internal dataset, i.e. measure the range of temporal dependency within a sequence of events, one can examine the covariance matrix of two events $L$ steps apart (Belletti et al., 2017; Pipiras and Taqqu, 2017), where a step denotes the relative order of events within the sequence. In particular, we look at the trace of the covariance matrix as a measurement of dependency:

$$\mathrm{Dep}_L = \mathrm{tr}\big(\mathrm{Cov}(Q_{e_N}, Q_{e_{N-L}})\big)$$

where $e_N$ is the item in the last event in a logged user/item interaction sequence and $e_{N-L}$ is the item corresponding to the interaction that occurred $L$ time steps before the last event. We focus on the trace of the covariance matrix as it equals the sum of the eigenvalues of the covariance matrix and its rate of decay is therefore informative of the rate of decay of these eigenvalues as a whole.

We utilize the embeddings $Q$ that have been learned by a pre-existing model, in our case an RNN-based sequential recommender which we describe later as one of M3's sub-models. $\mathrm{Dep}_L$ here measures the similarity between the current event and the event $L$ steps back from it. To estimate $\mathrm{Dep}_L$ for a particular value of $L$ we employ a classic empirical averaging across user sequences in our dataset. From Figure 1, we can extract multiple findings:
The dependency between two events decreases as the time separating their consumption grows. This suggests that recent events bear most of the influence of past user behavior on a user’s future behavior.
The dependency slowly approaches zero even as the temporal distance becomes very large (i.e. ). The clear hyperbolic-decay of the level of temporal dependencies indicates the presence of long-range-dependent patterns existing in user sequences (Pipiras and Taqqu, 2017). In other words, a user’s past interactions, though far from the current time step, still cumulatively influence their current behavior significantly.
These findings suggest that users do have long-term preferences and that better capturing such long-range-dependent patterns could help predict their future interests. In further work, we plan to use off-policy correction methods such as (Gilotte et al., 2018; Chen et al., 2019) to remove presentation bias when estimating correlations.
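As a minimal sketch of the empirical averaging described above (not the code used for Figure 1), the trace of the cross-covariance between the embedding of the last item and the embedding of the item a fixed number of steps earlier can be estimated as follows. The function name `dependency_trace`, the dictionary-of-lists embedding table and the toy data are illustrative assumptions.

```python
def dependency_trace(sequences, embeddings, lag):
    """Estimate tr(Cov(Q[last item], Q[item `lag` steps earlier])).

    `sequences` is a list of item-id lists (one per user) and `embeddings`
    maps an item id to its embedding vector; both are hypothetical inputs.
    """
    pairs = [(embeddings[s[-1]], embeddings[s[-1 - lag]])
             for s in sequences if len(s) > lag]
    if not pairs:
        raise ValueError("no sequence is longer than the requested lag")
    n, dim = len(pairs), len(pairs[0][0])
    mean_a = [sum(a[d] for a, _ in pairs) / n for d in range(dim)]
    mean_b = [sum(b[d] for _, b in pairs) / n for d in range(dim)]
    # The trace of the cross-covariance matrix is the sum, over embedding
    # dimensions, of the covariance between matching coordinates.
    return sum(
        sum((a[d] - mean_a[d]) * (b[d] - mean_b[d]) for a, b in pairs) / n
        for d in range(dim)
    )


toy_embeddings = {0: [1.0, 0.0], 1: [0.0, 1.0]}
repeating_users = [[0, 0], [1, 1]]    # last item equals the previous one
alternating_users = [[0, 1], [1, 0]]  # last item differs from the previous one
print(dependency_trace(repeating_users, toy_embeddings, lag=1))
print(dependency_trace(alternating_users, toy_embeddings, lag=1))
```

Evaluating this quantity for increasing lags and plotting it against the lag gives a decay curve of the kind discussed in the findings.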
The previous section has demonstrated the informational value of long-range temporal patterns in user sequences. Unfortunately, it is still generally challenging for existing sequential predictive models to fully utilize information located far into the past.
Most prior models have difficulties when learning to account for sequential patterns involving long-range dependence. Existing sequential recommenders with factorized Markov chain methods (He and McAuley, 2016) or CNNs (Tang and Wang, 2018a) arguably provide reliable sequential recommendation strategies. Unfortunately they are all limited by a short window of significant temporal dependence when leveraging sequential data to make a recommendation prediction. RNNs (Hidasi et al., 2015; Jannach and Ludewig, 2017; Beutel et al., 2018) and their variants (Quadrana et al., 2017; Devooght and Bersini, 2017) are widely used in sequential recommendation. RNN-based models, though effective for short user sequences (e.g. short event sequences within a session), are challenged by long-range dependent patterns in long user sequences. Because of the way they iterate over sequential items (Mikolov et al., 2010) and their use of saturating non-linear functions such as $\tanh$ to propagate information through time, RNNs tend to have difficulties leveraging the information contained in states located far into the past due to gradient propagation issues (Pascanu et al., 2013; Belletti et al., 2018). Even recent architectures designed to facilitate gradient propagation, such as the Gated Recurrent Unit (Cho et al., 2014) and Long Short-Term Memory (Hochreiter and Schmidhuber, 1997; Sutskever et al., 2014), have also been shown to suffer from the same problem of not being able to provably account for long-range dependent patterns in sequences (Belletti et al., 2018).
A second challenge in sequential recommendations is learning user latent factors explicitly from data, which has been observed to create many difficulties (Tang and Wang, 2018a; He and McAuley, 2016; Chen et al., 2018; Rendle et al., 2010). In the corresponding works, users' long-term preferences have been modeled through learning a set of latent factors for each user. However, learning these factors explicitly is difficult in large-scale production systems. As the number of users is usually several orders of magnitude higher than the number of items, building such a large user vocabulary and storing the latent factors in a persistent manner is challenging. Also, long-tail users (a.k.a. cold users) and visitor users (i.e. users who are not logged in) could receive much worse recommendations than engaged users (Beutel et al., 2017).
Figure 1 clearly indicates that although the influence of past user events on future interactions follows a significant decaying trend, significant predictive power can still be carried by events located arbitrarily far in the past. Very recent events (i.e. ) have large magnitude similarities with the current user behavior and this similarity depends strongly on the sequential order of related events. As the distance grows larger, the informative power of previously consumed items on future user behavior is affected by more uncertainty (e.g. variance) and is less sensitive to relative sequential position. That is, the events from 100 steps ago and from 110 steps ago may have a generally similar influence on future user decisions regardless of their relative temporal location. Therefore, for the kind of sequential signals we intend to leverage, in which different scales of temporal dependencies co-exist, it may be better to no longer consider a single model. While simple monolithic models such as a Deep Neural Network (DNN) with pooling and dropout (Wu and Yan, 2017; Covington et al., 2016) are provably robust to noise, they are unfortunately not sensitive to sequential order (without substantial modifications). On the other hand, RNNs (Hidasi et al., 2015; Beutel et al., 2018) provide cutting-edge sequential modeling capabilities but they are heavily sensitive to noise in sequential patterns. Therefore, it is natural to choose a mixture of diverse models which would then complement each other to provide better overall predictive power.
Motivated by our earlier analyses, we now introduce a novel method aimed at addressing the shortcomings of pre-existing approaches for long user/item interaction sequences: the Multi-temporal-range Mixture Model (M3) and its two variants (M3R/M3C). For simplicity, we omit the superscripts related to users (i.e. $e^u_\tau$ will now be denoted $e_\tau$) and use a single user sequence to describe the neural architecture we introduce.
Figure 2 gives a general schematic depiction of M3. We will now introduce each part of the model separately in a bottom-up manner, starting from the inputs and progressively abstracting their representation which finally determines the model's output. When predicting the behavior of a user in the next event of their logged sequence, we employ item embeddings and context features (optional) from past events as inputs:

$$x_\tau = [Q_{e_\tau} \oplus c^{in}_\tau]$$

where $\oplus$ denotes the concatenation operator. To map the raw context features and item embeddings to the same high-dimensional space for future use, a feed-forward layer $F^{in}$ is used:

$$Z^{in}_\tau = \{z^{in}_i\}_{i=1,\cdots,\tau}, \qquad z^{in}_\tau = F^{in}(x_\tau)$$

Here $z^{in}_\tau$ represents the input processed at step $\tau$ and $Z^{in}_\tau$ stands for the collection of all processed inputs up to and including step $\tau$. Either the identity function or a ReLU (Nair and Hinton, 2010) can be used to instantiate the feed-forward layer $F^{in}$.
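The input processing step can be illustrated with a small dependency-free sketch: each event's item embedding is concatenated with its optional context features and passed through either a ReLU or the identity. The names `process_inputs` and `relu` and the toy vectors are assumptions made for illustration, not part of the original model code.

```python
def relu(vec):
    """Element-wise ReLU on a plain list of floats."""
    return [max(0.0, v) for v in vec]


def process_inputs(item_embeddings, context_features=None, activation=relu):
    """Return one processed input per event of a single user sequence.

    Each processed input is activation(embedding ++ context), where ++ is
    list concatenation. Passing `activation=list` gives the identity variant.
    """
    if context_features is None:
        context_features = [[] for _ in item_embeddings]
    processed = []
    for emb, ctx in zip(item_embeddings, context_features):
        x = list(emb) + list(ctx)  # concatenate embedding and context
        processed.append(activation(x))
    return processed


# Two events, 2-d item embeddings, 1-d context feature per event.
print(process_inputs([[1.0, -2.0], [0.5, 0.5]], [[0.5], [-1.0]]))
```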
In the previous section, we assessed the limitations of using a single model on long user sequences. To circumvent the issues we highlighted, we employ in M3 three different sequence models (encoders) in conjunction, namely $M^T$, $M^S$ and $M^L$, on top of the processed input $Z^{in}_\tau$. We will later explain their individual architectures in detail. The general insight is that we want each of these sub-models to focus on different ranges of temporal dependencies in user sequences to provide a better representation (i.e., embedding) of the sequence. We want the sub-models to be architecturally diverse and address each other's shortcomings. Hence

$$SE^T_\tau = M^T(Z^{in}_\tau), \qquad SE^S_\tau = M^S(Z^{in}_\tau), \qquad SE^L_\tau = M^L(Z^{in}_\tau)$$

which yields three different representations, one produced by each of the three sequence encoders. The three different sub-model encoders are expected to produce outputs of identical dimension, denoted by $d_{enc}$. By construction, each sequential encoder produces its own abstract representation of a given user's logged sequence, providing diverse latent semantics for the same input data.
Our approach builds upon the success of Mixture-of-Experts (MOE) model (Jacobs et al., 1991). One key difference is that our ‘experts’ are constructed to work with different ranges of temporal dependencies, instead of letting the cohort of ‘experts’ specialize by learning from data. As shown in (Shazeer et al., 2017), heavy regularization is needed to learn different experts sharing the same architecture in order to induce specialization and prevent starvation when learning (only one expert performs well because it is the only one to learn which creates a self-reinforcing loop when learning with back-propagation).
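Informed by the insights underlying the architecture of MOE models, we aggregate all sequence encoders' results by weighted concatenation or weighted sum, with weights computed by a small gating network. Specifically, we can concatenate the outputs with

$$SE_\tau = (G^T_\tau \times SE^T_\tau) \oplus (G^S_\tau \times SE^S_\tau) \oplus (G^L_\tau \times SE^L_\tau)$$

where $G^T_\tau$, $G^S_\tau$ and $G^L_\tau$ correspond to the outputs of our gating network. We can also aggregate outputs with a weighted sum:

$$SE_\tau = (G^T_\tau \times SE^T_\tau) + (G^S_\tau \times SE^S_\tau) + (G^L_\tau \times SE^L_\tau)$$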
Note that there is no theoretical guarantee whether concatenation is better than summation or not. The choice of aggregation, as well as the choice of activation functions, is determined by observing a given model’s performance from a validation set extracted from different datasets. Such a procedure is usual in machine learning and will help practitioners determine which variant of the model we propose is best suited to their particular application.
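The two aggregation options can be sketched as follows: each encoder output is scaled by its scalar gate value, and the scaled vectors are then either concatenated or summed. The function name `aggregate` and the example numbers are hypothetical and chosen only to make the difference in output dimensionality visible.

```python
def aggregate(encodings, gates, mode="concat"):
    """Combine encoder outputs, each scaled by its scalar gate value.

    mode="concat" returns a vector of length len(encodings) * d_enc,
    mode="sum" returns a vector of length d_enc.
    """
    if len(encodings) != len(gates):
        raise ValueError("need exactly one gate value per encoder output")
    scaled = [[g * v for v in enc] for enc, g in zip(encodings, gates)]
    if mode == "concat":
        return [v for enc in scaled for v in enc]
    if mode == "sum":
        return [sum(vals) for vals in zip(*scaled)]
    raise ValueError("mode must be 'concat' or 'sum'")


tiny, short, long_ = [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]
gate_values = [1.0, 0.5, 0.0]  # the long-range encoder is switched off here
print(aggregate([tiny, short, long_], gate_values, mode="concat"))
print(aggregate([tiny, short, long_], gate_values, mode="sum"))
```

Concatenation keeps each encoder's contribution in its own coordinates at the cost of a wider downstream layer, while summation keeps the width fixed; as the text notes, which one works better is an empirical question.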
Because of its MOE-like structure, our model can adapt to different recommendation scenarios and provide insightful interpretability (as we shall see in Section 5). In many recommendation applications, some features annotate each event and represent the context in which the recommendation query is produced. Such features are for instance indicative of the page or device on which a user is being served a recommendation. After obtaining a sequence encoding at step $\tau$ (i.e. $SE_\tau$), we fuse it with the annotation's context features (optional) and project them to the same latent space with another hidden feed-forward layer $F^{out}$:

$$z^{out}_\tau = F^{out}([SE_\tau \oplus c^{out}_\tau])$$

where $c^{out}_\tau$ is a vector encoding contextual information to use after the sequence has been encoded. Here $z^{out}_\tau$ is what we name the user representation; it is computed based on the user's history as it has been gathered in logs. Finally, a user similarity score $r_v$ is predicted for each item $v$ via an inner product (which can be changed to another similarity scoring function):

$$r_v = \langle z^{out}_\tau, Q'_v \rangle$$

where $Q'_v$ is a vector representing the item. For a given user, item similarity scores are then normalized by a softmax layer which yields a recommendation distribution over the item vocabulary. After training M3, the recommendations for a user at step $\tau$ are served by sorting the similarity scores $r_v$ obtained for all $v \in \mathcal{V}$ and retrieving the items associated with the highest scores.
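The scoring and retrieval step can be sketched in a few lines: inner-product scores between a user representation and every item vector, a softmax over the item vocabulary, and retrieval of the highest-scoring items. The function name `recommend` and the toy vectors are illustrative assumptions; a production system would typically use approximate nearest-neighbor retrieval rather than a full sort.

```python
import math


def recommend(user_repr, item_vectors, k):
    """Return (top-k item indices, softmax distribution over all items)."""
    scores = [sum(u * q for u, q in zip(user_repr, vec)) for vec in item_vectors]
    # Numerically stable softmax over the whole item vocabulary.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Serve recommendations by sorting items by decreasing score.
    ranked = sorted(range(len(scores)), key=lambda v: scores[v], reverse=True)
    return ranked[:k], probs


user = [1.0, 0.0]
items = [[0.1, 5.0], [2.0, 0.0], [1.0, 1.0]]  # inner products: 0.1, 2.0, 1.0
top_items, distribution = recommend(user, items, k=2)
print(top_items)
```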
Item Co-occurrence as a Tiny-range Encoder The tiny-range encoder $M^T$ only focuses on the user's last event $e_\tau$, ignoring all previous events. In other words, given the processed inputs from past events $Z^{in}_\tau$, this encoder will only consider $z^{in}_\tau$. As in factorizing Markov chain (FMC) models (Rendle et al., 2010), $M^T$ makes predictions based on item range-1 co-occurrence within observed sequences. For example, if most users buy iPhone cases after purchasing an iPhone, then $M^T$ should learn this item-to-item co-occurrence pattern. As shown in Figure 2(a), we compute $M^T$'s output as:

$$M^T(Z^{in}_\tau) = \phi(z^{in}_\tau), \qquad \phi(x) = \begin{cases} x & \text{if } d_{in} = d_{enc} \\ W^{(T)} x + b^{(T)} & \text{otherwise} \end{cases}$$

That is, when the dimensionality of the processed input and the encoder output are the same, the tiny-range encoder plays the role of a residual connection for the other encoders in the mixture. If $d_{in} \neq d_{enc}$, it is possible to down-sample (if $d_{in} > d_{enc}$) or up-sample (if $d_{in} < d_{enc}$) from $z^{in}_\tau$ using the learned parameters $W^{(T)}$ and $b^{(T)}$.

In summary, the tiny-range encoder can only focus on the last event by construction, meaning it has a temporal range of 1 by design. If we only use the output of $M^T$ to make predictions, we obtain recommendation results based on item co-occurrence.
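A tiny-range encoder of this kind can be sketched as a function that looks only at the last processed input and optionally applies an affine map when the input and encoder dimensions differ. The name `tiny_range_encoder`, the plain affine projection and the toy weights are assumptions for illustration.

```python
def tiny_range_encoder(processed_inputs, weight=None, bias=None):
    """Encode a sequence using only its last processed input.

    With no `weight`, the encoder is the identity on the last input (the
    case where input and encoder dimensions match). Otherwise the last
    input is projected with an affine map, `weight` being a list of rows.
    """
    last = processed_inputs[-1]  # every earlier event is ignored by design
    if weight is None:
        return list(last)
    out = [sum(w * x for w, x in zip(row, last)) for row in weight]
    if bias is not None:
        out = [o + b for o, b in zip(out, bias)]
    return out


history = [[1.0, 2.0], [3.0, 4.0]]
print(tiny_range_encoder(history))                                   # identity case
print(tiny_range_encoder(history, weight=[[1.0, 1.0]], bias=[0.5]))  # down-sampling case
```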
RNN/CNN as Short-range Encoder As discussed in Section 3, the recent behavior of a user has substantial predictive power on current and future interactions. Therefore, to leverage the corresponding signals entailed in observations, we consider instantiating a short-range sequence encoder that puts more emphasis on recent past events. Given the processed input from past events $Z^{in}_\tau$, this encoder, represented as $M^S$, focuses by design on a recent subset of logged events. Based on our quantitative data exploration, we believe it is suitable for $M^S$ to be highly sensitive to sequence order. For instance, we expect this encoder to capture the purchasing pattern iPhone → iPhone case → iPhone charger if it appears frequently in user sequences. As a result, we believe the Recurrent Neural Network (RNN) (Mikolov et al., 2010) and the Temporal Convolutional Network (Van Den Oord et al., 2016; Bai et al., 2018; Yuan et al., 2019) are fitting potential architectural choices. Such neural architectures have shown superb performance when modeling high-order causalities. Beyond accuracy, these two encoders are also order sensitive, unlike early sequence modeling methods (e.g. Bag-of-Words (Jurafsky and Martin, 2014)). As a result we develop two interchangeable variants of M3: M3R and M3C, using an RNN and a CNN respectively.
To further describe each of these options, let us introduce our RNN encoder $M^S_{RNN}$. As shown in Figure 2(b) we obtain the output of $M^S_{RNN}$ by first computing the hidden state of the RNN at step $\tau$:

$$h_\tau = \mathrm{RNN}(z^{in}_\tau, h_{\tau-1})$$

where $\mathrm{RNN}(\cdot)$ is a recurrent cell that updates the hidden state at each step based on the previous hidden state $h_{\tau-1}$ and the current RNN input $z^{in}_\tau$. Several choices such as the Gated Recurrent Unit (GRU) (Cho et al., 2014) or Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) can be used. The output is then computed as follows:

$$M^S_{RNN}(Z^{in}_\tau) = W^{(S)} h_\tau$$

where $W^{(S)}$ maps the hidden state to the encoder output space. We design our CNN encoder $M^S_{CNN}$ as a Temporal Convolutional Network, which has provided state-of-the-art sequential modeling performance (Van Den Oord et al., 2016; Gehring et al., 2017; Bai et al., 2018). As shown in Figure 2(c), this encoder consists of several stacked layers. Each layer computes

$$h^{(n)}_\tau = \mathrm{Conv}\big(h^{(n-1)}_{1:\tau}\big), \qquad h^{(0)}_{1:\tau} = Z^{in}_\tau$$

where $n$ indicates the layer number. $\mathrm{Conv}(\cdot)$ is a 1-D convolutional operator (combined with non-linear activations, see (Bai et al., 2018) for more details), which contains convolutional filters and operates on the convolutional inputs. With $N$ layers in our CNN encoder, the final output will be:

$$M^S_{CNN}(Z^{in}_\tau) = h^{(N)}_\tau$$
As highly valuable signals exist in the short-range part of user sequence, we propose two types of encoders to capture them. Our model can be instantiated in its first variant, M3R, if we use RNN encoder or M3C if a CNN is employed. Here M3C and M3R are totally interchangeable with each other and they show comparable results in our experiments (see Section 5.2.1). We believe such flexibility will help practitioners adapt their model to the hardware they intend to use, i.e. typically using GPU for faster CNN training or CPU for which RNNs are better suited. In terms of temporal range, the CNN only considers a limited finite window of inputs when producing any output. The RNN, although it does not have a finite receptive field, is hampered by difficulties when learning to leverage events located further back into the past (to leverage an event located observations ago the RNN needs steps). Regardless of the choice of a CNN or an RNN, our short-range encoder has a temporal range greater than 1, although it is challenging for this sub-model to capture signals too far away from current step. This second encoder is specifically designed to capture sequence patterns that concern recent events.
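The finite temporal range of a convolutional short-range encoder can be made concrete with a simplified sketch. It uses a single-channel causal convolution (a real temporal convolutional network has many filters and non-linearities), and the function names, kernel size and dilation schedule are illustrative assumptions rather than the configuration used in the experiments.

```python
def causal_conv1d(seq, kernel, dilation=1):
    """Single-channel causal convolution: output[t] only sees seq[:t + 1]."""
    out = []
    for t in range(len(seq)):
        total = 0.0
        for j, w in enumerate(kernel):
            idx = t - j * dilation
            if idx >= 0:  # positions before the sequence start act as zeros
                total += w * seq[idx]
        out.append(total)
    return out


def receptive_field(kernel_size, dilations):
    """Number of most recent inputs that can influence the last output."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def two_layer_stack(seq):
    """Two stacked causal layers with kernel size 2 and dilations 1 and 2."""
    return causal_conv1d(causal_conv1d(seq, [1.0, 1.0], 1), [1.0, 1.0], 2)


events = [float(i) for i in range(10)]
print(receptive_field(2, [1, 2]))   # size of the window seen by the last output
print(two_layer_stack(events)[-1])  # depends only on the most recent events
```

Changing an event that lies outside this window leaves the last output untouched, which is the sense in which such an encoder has a limited temporal range.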
| Base model | Temporal range | Model size | Sensitive to order | Robustness |
|---|---|---|---|---|
| Item Co-occurrence | 1 | small (or 0) | very high | no |
| Recurrent Neural Nets | unknown | large | high | no |
| Temporal Convolution Nets | limited | large | high | no |
| Attention Model | unlimited | small (or 0) | no | high |
Attention Model as Long-range Encoder The choice of an attention model is also influenced by our preliminary quantitative analysis. As discussed in Section 3, as the temporal distance grows larger, the uncertainties affecting the influence of item consumption on future events get larger as well. Moreover, as opposed to the recent part of a given user's interaction sequence, relative position does not matter as much when it comes to capturing the influence of temporally distant events. Taking these properties into account, we choose to employ an attention model (Bahdanau et al., 2014; Vaswani et al., 2017) as our long-range sequence encoder. Usually, an attention model consists of three parts: attention queries, attention keys and attention values. One can simply regard an attention model as a weighted sum over attention values with weights resulting from the interaction between attention queries and attention keys. In our setting, we use (1) the last event's processed input as the attention query, (2) all past events' processed inputs as keys and values and (3) the scaled dot-product (Vaswani et al., 2017) as the similarity metric in the attention softmax. For instance, if a user last purchased a pair of shoes, the attention mechanism will focus on footwear-related previous purchases.
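So that all encoders have the same output dimensionality, we need to transform our processed input first (this is unnecessary if $d_{in}$ is the same as $d_{enc}$) as follows:

$$\tilde{z}^{in}_i = W^{(L)} z^{in}_i, \qquad i = 1, \cdots, \tau$$

where $W^{(L)}$ is a learned matrix of parameters. Then for each position $i \in [1, \tau]$, we obtain its raw attention weight, with respect to the processed input $\tilde{z}^{in}_\tau$, as follows:

$$\omega_{\tau,i} = \frac{\langle \tilde{z}^{in}_\tau, \tilde{z}^{in}_i \rangle}{\sqrt{d_{enc}}}$$

where $\omega_{\tau,i}$ is the raw weight at position $i$. Similarly, we compute the raw attention weights for all positions and normalize them with a softmax function, which gives $a_{\tau,i} = \exp(\omega_{\tau,i}) / \sum_{j=1}^{\tau} \exp(\omega_{\tau,j})$. Finally, we acquire the output of our long-range encoder as follows:

$$M^L(Z^{in}_\tau) = \sum_{i=1}^{\tau} a_{\tau,i} \, \tilde{z}^{in}_i$$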
Our long-range encoder borrows several advantages from the attention model. First, it is not limited by a certain finite temporal range. That is, it has an unlimited temporal range and can 'attend' to anywhere in the user's sequence with O(1) steps. Second, because it computes its outputs as a weighted sum of inputs, the attention-based encoder is not as sensitive to sequential order as an RNN or a CNN, as each event from the past has an equal chance of influencing the prediction. Third, the attention model is robust to noisy inputs due to its normalized attention weights and weighted-sum aggregation.
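The following dependency-free sketch illustrates scaled dot-product attention used as a long-range encoder: the last processed input is the query and all processed inputs serve as keys and values. It assumes the input and encoder dimensions already match, so no learned transformation is applied; the function name `long_range_encoder` and the toy sequence are hypothetical.

```python
import math


def long_range_encoder(processed_inputs):
    """Return (attention output, attention weights) for one user sequence."""
    query = processed_inputs[-1]
    scale = math.sqrt(len(query))
    # Raw weights: scaled dot product between the query and every key.
    raw = [sum(q * k for q, k in zip(query, key)) / scale
           for key in processed_inputs]
    # Softmax normalization of the raw weights.
    peak = max(raw)
    exps = [math.exp(r - peak) for r in raw]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Output: weighted sum of the values (here, the processed inputs).
    output = [sum(w * v[d] for w, v in zip(weights, processed_inputs))
              for d in range(len(query))]
    return output, weights


# The first and last events are similar, so they receive equal, larger weights
# even though they are far apart in the sequence.
encoded, attention = long_range_encoder([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
print(attention)
```

Because the weights depend only on content similarity, permuting the earlier events permutes the weights accordingly but leaves the output unchanged, which reflects the order insensitivity described above.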
Gating Network Borrowing the idea from the Mixture-of-Experts model (Jacobs et al., 1991; Ma et al., 2018), we build a gating network to aggregate our encoders' results. The gate is also helpful to better understand our model (see Section 5). To produce a simple gating network, we use a feed-forward layer on the gating network's inputs:

$$G_\tau = \mathrm{sigmoid}(W_g \, G^{in}_\tau + b_g)$$

where $G^{in}_\tau$ is the input we feed into our gating network. We will discuss how the model performs overall with different choices of gate inputs in Section 5.4. The resulting $G_\tau \in \mathbb{R}^3$ contains the gate value modulating each encoder. More importantly, an element-wise sigmoid function is applied to the gate values, which allows encoders to 'cooperate' with each other (Belletti et al., 2018). Note that a few previous works (Ma et al., 2018; Jordan and Jacobs, 1994; Shazeer et al., 2017) also normalize the gate values, but we found this choice led to the degeneracy of our mixture model as it would learn to only use $M^S$, which in turn hampers model performance.
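A gate of this form can be sketched as a single affine layer followed by an element-wise sigmoid, producing one value per encoder. The name `sigmoid_gate`, the weight layout (one row per encoder) and the toy inputs are assumptions for illustration.

```python
import math


def sigmoid_gate(gate_input, weight, bias):
    """One gate value per encoder: sigmoid(weight @ gate_input + bias)."""
    logits = [sum(w * x for w, x in zip(row, gate_input)) + b
              for row, b in zip(weight, bias)]
    # Element-wise sigmoid: each gate lies in (0, 1) independently of the
    # others, so several encoders can be fully active at the same time.
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


weights = [[1.0], [1.0], [1.0]]  # three encoders, 1-d gate input
biases = [0.0, 0.0, 0.0]
print(sigmoid_gate([0.0], weights, biases))
print(sigmoid_gate([10.0], weights, biases))
```

With a softmax in place of the sigmoid the three values would be forced to sum to one, so one encoder could only gain weight at the expense of the others.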
Summary M3 is able to address limitations of pre-existing models as shown in Table 1: (1) M3 has a mixture of three encoders with different temporal ranges which can capture sequential patterns located anywhere in user sequences. (2) Instead of learning a set of latent factors for each user, M3 represents the long-term user preferences by using a long-range sequence encoder that provides a representation of the entire history of a user. Furthermore, M3 is efficient in both model size and computational cost. In particular M3 does not introduce any extra parameters under certain settings (i.e. $d_{in} = d_{enc}$), and the computation of $M^T$ and $M^L$ is very efficient when using specialized hardware such as a GPU. With its simple gate design, M3 also provides good interpretability and adaptability.
Effectiveness. Given our analysis on user sequences, we expect M3 to be effective. Compared to past works, M3 is capable of capturing signals from the whole sequence, and it also satisfies the properties we found in different parts of the sequence. Moreover, our three encoders constitute a diverse set of sequential encoders and, if well-trained, can model user sequences in a multi-scale manner, which has been a key to success in past literature (Van Den Oord et al., 2016; Yu and Koltun, 2015).
Efficiency. In terms of model size, M3 is efficient. Compared to existing works which use a short-range encoder only, our M3 model, though it uses two other encoders, does not introduce any extra parameters (if $d_{in} = d_{enc}$). In terms of computational efficiency, M3 is good as well, as both $M^T$ and $M^L$ are nothing other than matrix multiplications, which are cheap when computed with optimized hardware such as a Graphics Processing Unit (GPU).
Interpretability. A model's interpretability is critical for diagnostic purposes. As we shall see later, with the gate network, we are able to inspect our network transparently by observing the gate values.
Adaptability. One issue in production recommender systems is modeling users for different recommendation scenarios, as people may behave very differently. Two typical scenarios are HomePage recommendation and product DetailPage recommendation. As we shall show in a later section, M3 is able to adapt to these scenarios if we use the scenario information as our gate input.
In this section, we study the two variants of M3 against several state-of-the-art baseline methods on both a publicly available dataset and our large-scale YouTube dataset.
Datasets We use MovieLens 20M (https://grouplens.org/datasets/movielens/20m/), which is a publicly available dataset, along with a large-scale anonymized dataset from YouTube to which we have access because we are employees of Google working on improving YouTube as a platform. The dataset is private, anonymized and accessible only internally by a few employees whose work is directly related to YouTube.
As in previous works (Tang and Wang, 2018a; He and McAuley, 2016), we process the MovieLens data by first converting numeric ratings to 1 values, turning them into implicit logged item consumption feedback. We remove the items with fewer than ratings. Such items, because of how little user feedback is available for them, represent another research challenge (cold start) which is outside the scope of the present paper.
To focus on long user sequences, we filtered out users who had a sequence length of fewer than items consumed, while we did not filter items specifically. The maximum sequence length in the dataset being , we follow the method proposed in (He and McAuley, 2016; Tang and Wang, 2018a) and employ a sliding window of length to generate similarly long sequences of user/item interactions in which we aim to capture long-range dependent patterns. Some statistics can be found in the first row of Table 3.
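The user filtering and sliding-window generation described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `sliding_windows`, the stride of one event, and the toy threshold values are hypothetical and are not taken from the experimental setup.

```python
def sliding_windows(sequence, window, min_length):
    """Generate fixed-length windows over one user's item sequence.

    Users with fewer than `min_length` events are dropped (empty result).
    A stride of 1 is an assumption of this sketch.
    """
    if len(sequence) < min_length:
        return []
    if len(sequence) <= window:
        # Sequence already fits in a single window.
        return [list(sequence)]
    return [
        list(sequence[start:start + window])
        for start in range(len(sequence) - window + 1)
    ]


# Toy item ids; the real thresholds are much larger.
windows = sliding_windows([1, 2, 3, 4, 5], window=3, min_length=2)
```

Each generated window is then treated as one training sequence, so a long user history contributes several similarly long sequences.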
We do not use contextual annotations for the MovieLens data.
Evaluation protocol We split the dataset into training and test sets by randomly choosing 80% of users for training and the remaining 20% for validation (10%) and testing (10%). As with the training data, a sliding window is used on the validation and test sets to generate sequences. We measure the mean average precision (mAP) as an indicator of model performance (Beutel et al., 2017; Tang and Wang, 2018b). We only focus on the top positions of our predictions, so we choose to use mAP@n with . There is only one target per instance here and therefore the mAP@n is expected to increase with n, which is consistent with (Belletti et al., 2018) but differs from (Tang and Wang, 2018a).
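To make the behavior of mAP@n with a single target concrete, the following minimal sketch computes it from ranked predictions. With one relevant item, the average precision reduces to the reciprocal rank of the target if it appears in the top n and to zero otherwise, which is why the metric cannot decrease as n grows. The function names and the toy rankings are hypothetical.

```python
def average_precision_at_n(ranked_items, target, n):
    """AP@n with a single relevant item: 1/rank if the target is in the top n."""
    top = ranked_items[:n]
    if target in top:
        return 1.0 / (top.index(target) + 1)
    return 0.0


def map_at_n(predictions, targets, n):
    """Mean of AP@n over all test instances."""
    scores = [
        average_precision_at_n(ranked, target, n)
        for ranked, target in zip(predictions, targets)
    ]
    return sum(scores) / len(scores)


# Toy rankings for two test instances and their single targets.
predictions = [["a", "b", "c"], ["x", "y", "z"]]
targets = ["b", "z"]

scores_by_n = [map_at_n(predictions, targets, n) for n in (1, 2, 3)]
```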
Details on model architectures We keep architectural parameters consistent across all experiments on MovieLens. In particular, we use identical representation dimensions: Such a choice decreases the number of free parameters as the sub-models and will not have learned parameters. A GRU cell is employed for the RNN while 2 stacked temporal convolution layers (Bai et al., 2018) of width 5 are used in the CNN. A ReLU activation function is employed in the feed-forward layers and . Item embeddings of dimension are learned with different weights on the input side (i.e., in Eq. 1) and output side (i.e., in Eq. 7). Although previous work (He and McAuley, 2016) has constrained such embeddings to be identical on the input and output side of the model, we found that increasing the number of degrees of freedom led to better results.
Baselines We compare our two variants, i.e., M3R and M3C, with the following baselines:
FMC: The Factorizing model for the first-order Markov chain (FMC) (Rendle et al., 2010) is a simple but strong baseline in sequential recommendation tasks (Tang and Wang, 2018a; Beutel et al., 2018; Smirnova and Vasile, [n. d.]). As discussed in Section 1, we do not want to use explicit user representations. Therefore, we do not compare against the personalized version of this model (FPMC).
DeepBoW: The Deep Bag-of-Words model represents a user by averaging item embeddings from all past events. The model then makes predictions through a feed-forward layer. In our experiments, we use a single hidden layer of size 32 with ReLU as the activation function.
GRU4Rec: Originally presented in (Hidasi et al., 2015), this method uses a GRU RNN over user sequences and is a state-of-the-art model for sequential recommendation with anonymized data.
Caser: The Convolutional Sequence Embeddings model (Tang and Wang, 2018a) applies horizontal and vertical convolutional filters over the embedding matrix and achieves state-of-the-art sequential recommendation performance. We try vertical filters and horizontal filters of size . In order to focus on the sequential encoding task, we discard the user embedding and only use the sequence embedding of this model to make predictions.
In the models above, due to the large number of items in input and output dictionaries, the learned embeddings comprise most of the free parameters. Therefore, having set the embedding dimension to in all the baselines as well as in M3R and M3C, we consider models with similar numbers of learned parameters. The other hyperparameters mentioned above are tuned by looking at the mAP@20 on the validation set. The training time of M3R/M3C is comparable with that of the other models and can be further improved with techniques like model compression (Tang and Wang, 2018b), quantization (Hubara et al., 2017), etc.
We report each model’s performance in Table 2. Each metric is averaged across all user sequences in the test set. The best performer is highlighted in bold face. The results show that both M3C and M3R outperform the baselines by a large margin. Among the baselines, GRU4Rec achieves the best performance and DeepBoW the worst, suggesting that the sequence order plays a very important predictive role. FMC performs surprisingly well, suggesting we could get reasonable results with a simple model that only takes the last event into account. The poor results of Caser may be caused by its design, which relies on vertical filters of fixed size. Caser performs better in the next subsection, which considers sequences whose lengths vary less within the training data.
The previous results have shown strong performance gains achieved by the models we introduced: M3C and M3R. We now investigate the origin of such improvements. The design of these models was inspired by an attempt to capture sequential patterns with different characteristic temporal extents. To check whether the models we introduced achieve this aim, we construct multiple variants of MovieLens with different sequence lengths.
We vary the sequence length by introducing a maximum cutoff threshold which complements the minimal sequence length threshold. A sequence longer than the maximum cutoff keeps only its most recent observations, up to that cutoff. We vary the minimum length threshold, the maximum cutoff, and the sequence generation window size. Table 3 summarizes the properties of the four variants of the MovieLens dataset we construct. It is noteworthy that such settings make Caser perform better, as the sequence length is more consistent within each dataset variant.
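The maximum cutoff can be sketched in a few lines. The helper names `truncate_to_latest` and `average_length`, as well as the toy sequences and the cutoff of 4, are hypothetical and only illustrate how truncation makes sequence lengths more uniform within a variant.

```python
def truncate_to_latest(sequence, max_length):
    """Keep only the most recent `max_length` observations of a sequence."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    return list(sequence[-max_length:])


def average_length(sequences):
    """Average number of events per sequence, as reported per dataset variant."""
    return sum(len(s) for s in sequences) / len(sequences)


# Toy item-id sequences of different lengths.
raw = [[1, 2, 3, 4, 5, 6], [7, 8, 9], [10, 11, 12, 13]]
variant = [truncate_to_latest(s, max_length=4) for s in raw]
variant_avg = average_length(variant)
```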
Table 3: Statistics of the MovieLens dataset variants (columns: Min. Length, Max. Length, Window Size, Avg. Length, #Sequences, #Items).
GRU4Rec and Caser outperform the other baselines in the present setting and therefore we only report their performance. Figure 4 shows the improvements of M3C and M3R over the best baselines on four MovieLens variants. The improvement of each model is computed from its mAP@20 relative to the best baseline. In most cases, M3C and M3R outperform the highest performing baseline. Specifically, on ML20M-S and ML20M-M, Caser performs similarly to GRU4Rec while both M3C and M3R have good performance. This is probably due to the contribution of the tiny-range encoder.
For the YouTube dataset, we filtered out users whose logged sequence length was less than 150 (the minimum length threshold) and kept each user’s last 300 events (the maximum cutoff) in their item consumption sequence. In the following experiments, we exploit contextual annotations such as user device (e.g., web browser or mobile app), time-based features (e.g., dwelling time), etc. User sequences are all anonymized and precautions have been taken to guarantee that users cannot be re-identified. In particular, only public videos with enough views have been retained.
Neural recommender systems attempt to foresee the interests of users under extreme constraints of latency and scale. We define the task as predicting the next item the user will consume given a recorded history of items already consumed. Such a problem setting is indeed common in collaborative filtering (Sarwar et al., 2001; Linden et al., 2003) recommendations. We present here results obtained on a dataset where only about 2 million items are present, corresponding to the most popular items. While the user history can span months, only watches from the last 7 days are used for labels in training and watches in the last 2 days are used for testing. The train/test split is . The test set does not overlap with the train set and corresponds to the last temporal slice of the dataset. In all, we have more than 200 million training sequences and more than 1 million test sequences, with an overall average sequence length of approximately 200.
The neural network predicts, for a sample of negatives, the probability that they are chosen and classically a negative sampling loss is employed in order to leverage observations belonging to a very large vocabulary (Jean et al., 2014). The loss being minimized is
where the SampledSoftmax (Jean et al., 2014) uses randomly sampled negatives and is the weight of each label.
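As an illustration of the kind of objective described here, the sketch below computes a label-weighted cross-entropy over the target item and a random sample of negatives. It is a simplified stand-in under stated assumptions: negatives are drawn uniformly, no sampling-probability correction is applied, and the names `sampled_softmax_loss` and `total_loss` are hypothetical. It is not the TensorFlow implementation used in the experiments.

```python
import math
import random


def sampled_softmax_loss(scores, target, num_negatives, weight=1.0, rng=None):
    """Weighted cross-entropy over the target and uniformly sampled negatives.

    `scores` holds one logit per item in the vocabulary. Uniform sampling
    without a sampling-probability correction is an assumption of this sketch.
    """
    if rng is None:
        rng = random.Random(0)
    candidates = [i for i in range(len(scores)) if i != target]
    negatives = rng.sample(candidates, num_negatives)
    logits = [scores[target]] + [scores[i] for i in negatives]
    # Numerically stable log-sum-exp over the sampled candidate set.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return weight * (log_z - scores[target])


def total_loss(examples, num_negatives, seed=0):
    """Sum of weighted losses over (scores, target, label_weight) examples."""
    rng = random.Random(seed)
    return sum(
        sampled_softmax_loss(scores, target, num_negatives, weight, rng)
        for scores, target, weight in examples
    )


# Toy vocabulary of 10 items with uninformative (all-zero) logits.
uniform_loss = sampled_softmax_loss([0.0] * 10, target=3, num_negatives=4)
```

With all logits equal, the loss for one example is the log of the number of candidates (one target plus the sampled negatives), which gives a quick sanity check on the implementation.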
Evaluation metrics To test the models’ performances, we measure the mean average precision (mAP) as in (Tang and Wang, 2018a; Beutel et al., 2018). We only focus on the top positions of our predictions, so we choose to use mAP@ with . The mAP is computed with the entire dictionary of candidate items as opposed to the training loss which samples negatives. There is only one target per instance here and therefore the mAP@n is expected to increase with n which is consistent with (Belletti et al., 2018) but differs from (Tang and Wang, 2018a).
Baselines In order to make fair comparisons with all previous baselines, we used their contextual counterparts when these have been proposed or compared in the literature.
Context-FMC: The Context-FMC conditions the last event’s embedding on the last event’s context features by concatenating them and applying a feed-forward layer over them.
DeepYouTube: Proposed by (Covington et al., 2016), the DeepYouTube model is a state-of-the-art neural model for recommendation. It concatenates: (1) the item embedding from the user’s last event, (2) item embeddings averaged over all past events and (3) context features. The model then makes predictions through a feed-forward network composed of several ReLU layers.
Context-GRU: We used the contextual version of GRU proposed in (Smirnova and Vasile, [n. d.]). Among the three conditioning paradigms on context, we used concatenation as it gives us better performance.
All models are implemented in TensorFlow (Abadi et al., [n. d.]) and trained with Adagrad (Duchi et al., 2011) over a parameter server (Li et al., [n. d.]) with many workers.
Model details In the following experiments, we keep the dimensions of processed input and encoder outputs identical for all experiments conducted on the same dataset. We also share some architectural parameters so that they are consistent across the two datasets. As before, this makes the parametrization of our models more parsimonious, because the sub-models and will be parameter-free. For the RNN cell, we use a GRU on both datasets for its effectiveness as well as efficiency. For the CNN version, we stacked layers of temporal convolution (Bai et al., 2018), with no dilation and width of . For the feed-forward layers and , we used ReLU as their activation functions, although they contain different numbers of sub-layers. For item embeddings on the input side (i.e., in Eq. 1) and on the output side (i.e., in Eq. 7), we learn them separately, which improves all results.
We report each model’s performance on the private dataset in Table 4. The best performer is highlighted in bold face. As can be seen from this table, on our anonymized YouTube dataset, Context-FMC performs worst, followed by DeepYouTube, while Context-GRU performs best among all baselines. DeepYouTube and Context-GRU perform better than Context-FMC possibly because they have a longer temporal range, which again shows that the temporal range matters significantly in long user sequences. One can therefore improve the performance of a sequential recommender if the model is able to leverage distant (long-range dependent) information in user sequences.
On both datasets, we observed that our two proposed model variants, M3R and M3C, significantly outperform all other baselines. Of these two variants, M3R performs marginally better than M3C, and it improves upon the best baselines by a large margin (more than 20% on MovieLens data and 16.0% on YouTube data).
To demonstrate how each encoder contributes to the overall performance, we now present an ablation test on our M3R model (results from M3C are similar) on our proprietary data. We use T, S, and L to denote the tiny-range, short-range, and long-range encoders, respectively. The results are described in Table 5. When we only enable a single encoder for M3R, the best performer is M3R-T on MovieLens data and M3R-S on the YouTube data. This result is consistent with the results in Section 5.2.1. With more encoders involved in M3R, the model performs better. In particular, when all encoders are incorporated, our M3R-TSL performs best on both datasets, indicating that all three encoders matter for performance.
We now begin to study our gating network in order to answer the following questions: (1) Is the gating network beneficial to the overall model performance? (2) How do different gating network inputs influence the model performance? and (3) How can the gating network make our model more adaptable and interpretable?
Fixed gates versus learned gates: First of all, we examine the impact of our gating network by comparing it with a set of fixed gate values. More precisely, we fixed the gate values to be all equal to during the model training: , here is a vector. The first row of Table 6 shows the result of this fixed-gate model. We found that the fixed models are weaker than the best performing version of M3R (i.e., mAP@20 of 0.1743) and M3C (i.e., mAP@20 of 0.1654). This reveals that the gating network consistently improves M3-based models’ performances.
Influence of different gate inputs: In this paragraph we investigate the potential choices of inputs for the gating network, and how they result in different performance scores. In the existing Mixture-of-Experts (MOE) literature, the input for the gating network can be categorized into Contextual-switch and Bottom-switch. The Contextual-switch, used in (Belletti et al., 2018), uses context information as gate input:
where and are context features from input and output side. Intuitively, this suggests how context may influence the choices of different encoders. If no context information is available, we can still use the output of a shared layer operating before the MOE layer (Ma et al., 2018; Shazeer et al., 2017) as gate input, i.e., Bottom-switch:
The shared layer contains high-level semantic knowledge from the last event, which can also enable gate switching.
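The two choices of gate input can be contrasted in a small sketch. Everything specific here is an assumption made for illustration: the gate is modeled as a single linear layer followed by an element-wise sigmoid, the Contextual-switch input is formed by concatenating input-side and output-side context features, and all weights, feature values, and names (`gate_values`, `linear`) are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear(weights, bias, x):
    """Dense layer: one output per row of `weights`."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]


def gate_values(gate_input, weights, bias):
    """One sigmoid gate per encoder, computed from the chosen gate input."""
    return [sigmoid(z) for z in linear(weights, bias, gate_input)]


# Contextual-switch: the gate sees context features (assumed concatenated).
context_in = [1.0, 0.0]    # hypothetical input-side context features
context_out = [0.0, 1.0]   # hypothetical output-side context features
w_context = [[0.5, -0.5, 0.2, 0.1],
             [0.0, 0.3, -0.4, 0.6],
             [-0.2, 0.1, 0.7, -0.3]]
contextual_gates = gate_values(context_in + context_out, w_context, [0.0, 0.0, 0.0])

# Bottom-switch: the gate sees the output of a shared layer instead.
shared_layer_output = [0.3, -0.2, 0.5]
w_bottom = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]
bottom_gates = gate_values(shared_layer_output, w_bottom, [0.0, 0.0, 0.0])
```

In both cases the gate produces one value per encoder; only the information the gate conditions on differs.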
On the MovieLens data, we used the Bottom-switched gate for all the results above because of the absence of contextual annotations. On the YouTube dataset, the last two rows of Table 6 compare the Contextual-switched gate with the Bottom-switched gate. We observe that context information is more useful to the gates than a shared layer. In other words, the decision of whether to focus more on the recent part of the user sequence (i.e., large gate values for the tiny-range and short-range encoders) or on the distant part (i.e., large gate values for the long-range encoder) is easier to make based on contextual annotations.
The model architecture we design is based on quantitative findings and has two primary goals: capturing co-existing short-range and long-range behavioral patterns as well as serving recommendations given in different contexts with a single model.
We know that for recommender systems in most applications (e.g. e-commerce like Amazon, streaming services like Netflix), recommendations commonly occur in at least two different contexts: either a HomePage or a DetailPage. The HomePage is the page shown when users open the website or the mobile app, while the DetailPage is the page shown when users click on a certain item. User behaviors differ depending on which of these two pages they are browsing. Users are more likely to be satisfied by a recommendation related to recent events, especially the last event, when they are on a DetailPage. A straightforward solution to deal with these changing dynamics is to train two different models.
We now demonstrate that with the multi-temporal-range encoders architecture and gating mechanism in M3, we can have a single adaptive end-to-end model that provides good performance in a multi-faceted recommendation problem. To that end we analyze the behavior of our gating network and show the adaptability of the model as we gain a better understanding of its behavior.
What we observe in Figure 5 is that when contextual information is available to infer the recommendation scenario, the gating network can automatically decide how to combine the results from different encoders in a dynamic manner to further improve performance. Figure 5 shows how gate values of M3R change w.r.t. across different recommendation scenarios. It is clear that M3R puts more emphasis on when users are on the HomePage, while it engages all three encoders when users are on a DetailPage. This result shows that the gating network uses different combinations of encoders for different recommendation scenarios.
As a result, we can argue that our architectural design choices do meet the expectations we set in our preliminary analysis. It is noteworthy that the gating mechanism we added on top of the three sub-models is helpful to improve predictive performance and ease model diagnosis. We have indeed been able to analyze recommendation patterns seamlessly.
M3 is an effective solution to provide better recommendations based on long user sequences. M3 is a neural model that avoids most of the limitations faced by pre-existing approaches and is well adapted to cases in which short-term and long-term temporal dependencies coexist. Beyond effectiveness, this approach also provides several advantages, such as requiring no extra parameters and offering interpretability. Our experiments on a large public dataset as well as a large-scale production dataset suggest that M3 outperforms the state-of-the-art methods by a large margin for sequential recommendation with long user sequences. One shortcoming of the architecture we propose is that all sub-models are computed at serving time. As a next step, we plan to train a sparse context-dependent gating network to address this shortcoming.
International Conference on Artificial Intelligence and Statistics. 1522–1530.
International workshop on multiple classifier systems. Springer, 1–15.
IEEE conference on Computer Vision and Pattern Recognition.
Variational Autoencoders for Collaborative Filtering. In Proceedings of the 2018 World Wide Web Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 689–698.