In large e-commerce platforms, for example Taobao and AliExpress, the recommendation algorithm is divided into two stages: the "match" stage and the "rank" stage. In the match stage, we select a candidate item set of limited size from all the items. In the rank stage, we compute a score for each item in the matched set and rank them by score. A model of any form can be used in the rank stage, but the match-stage model is restricted: it must be able to quickly pick items out of an extremely large item pool, hence it needs an index. Only models that can generate an index can be used in the match stage. The most familiar match model is the static model: it computes the conditional probability $P(\text{item} \mid \text{trigger})$ as the score and, offline, saves a table with the fields "trigger id", "item id", and "score", indexed by "trigger id" and "score". Online, we recall the items with the top N scores using "trigger id" as the index, where the trigger ids come from the items on which the user has behavior.
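The offline table plus online top-N recall described above can be sketched as follows; the table contents, trigger and item ids are made-up toy data, and a production system would use a real key-value store rather than an in-memory dict:

```python
from collections import defaultdict
import heapq

# Offline: a score table with fields ("trigger id", "item id", "score").
# The rows here are illustrative toy data.
score_table = [
    ("t1", "i1", 0.9), ("t1", "i2", 0.5),
    ("t2", "i2", 0.8), ("t2", "i3", 0.4),
]

# Build the index: trigger id -> list of (score, item id), highest score first.
index = defaultdict(list)
for trigger, item, score in score_table:
    index[trigger].append((score, item))
for trigger in index:
    index[trigger].sort(reverse=True)

def match(user_triggers, n):
    """Online: recall the top-n items over all triggers the user interacted with."""
    candidates = [entry for t in user_triggers for entry in index.get(t, [])]
    return [item for score, item in heapq.nlargest(n, candidates)]

print(match(["t1", "t2"], 2))  # ['i1', 'i2']
```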
Sequence prediction is the problem of using historical sequence information to predict the next value or values in the sequence. It has many applications, for example language models and recommender systems. Recurrent neural networks (RNNs) are widely used to solve sequence prediction problems.
For a given sequence $x_1, x_2, \dots, x_t$, we wish to predict $x_{t+1}$. In a recommender system, the $x_i$'s are the items the user clicked (or bought, added to a wish list, etc.), so the sequence represents the user. In a language model, the $x_i$'s are the words of a sentence, so the sequence represents the front part of the sentence.
The final layer is a fully connected layer with a softmax.
This network structure is equivalent to the network structure in Figure 2. In Figure 2, we give two embeddings for every item: if the item has already been passed, we call it a "trigger" and call its embedding the "trigger embedding"; if the item is to be scored, we call it an "item" and call its embedding the "item embedding". The layer "Extension by 1" denotes a fixed (non-trainable) map that extends the GRU output by one coordinate.
Because the output of the final GRU layer collects all the trigger information up to the current time of the session, we can view the output of the layer "Extension by 1" as the "session embedding". We set the dimension of the item embedding equal to the dimension of the session embedding, and define the output of the network as the inner product of the session embedding and the item embedding. It is easy to see that the two network structures are equivalent under the correspondence in which the $i$-th row of the softmax weight matrix plays the role of the embedding of the $i$-th item. Hence we call this method the "vector embedding method".
The session-based model with the vector embedding method can be used as a match model. In fact, after the model is trained, we can save the vector embeddings of the items with some index, for example a KD-tree or ball tree. When a user visits our recommendation page, we compute the vector embedding of the user's session from the user's click sequence, and use the index to find the top N items whose vector embeddings have the maximal inner product with it.
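A minimal sketch of this retrieval step, using brute-force scoring as a stand-in for the offline index (sizes and the random embeddings are illustrative assumptions; a real system would use an approximate maximum-inner-product index instead of the full matrix product):

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_items, top_n = 8, 1000, 5

item_embeddings = rng.normal(size=(num_items, d))   # saved offline after training
session_embedding = rng.normal(size=d)              # computed online from the click sequence

# Exact retrieval by brute force; a production system replaces this with an
# index (KD-tree, ball tree, etc.) built offline over the item embeddings.
scores = item_embeddings @ session_embedding
top_items = np.argsort(scores)[::-1][:top_n]        # indices of the top-N items
```

Note that tree indexes retrieve nearest neighbors by distance, so maximum-inner-product search in practice needs either normalized embeddings or an index designed for inner products.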
But the vector embedding method has an inherent defect, because a user's interests may not be single. Suppose the item embeddings of dresses and phones are as shown in Figure 4. Since the interest in dresses is generally independent of the interest in phones, we can suppose their embeddings are linearly independent. If a user clicked dresses 20 times and phones 10 times in one session, then during training the vector embedding of this session will mainly try to move close to the dress embedding, but will be dragged away by the phone embedding. As a result, the session embedding will lie between the dress and the phone, close to neither. Hence when we predict using this embedding, the top 1 recommendation will be something like a combination of a dress and a phone, instead of the dress the user is most interested in. In other words, the vector embedding scheme deprives the user's interests of their diversity within one session.
In order to model the diversity of a user's interests within one session, we use a "matrix embedding" of the session instead of a "vector embedding".
In our method, the items are still modeled as vectors of dimension $d$, but a session is modeled as a symmetric matrix in $\mathbb{R}^{d \times d}$ instead of a vector in $\mathbb{R}^{d}$. The score representing the interest of the session in an item is modeled as

$$s = v^\top A v,$$

where $v$ is the vector embedding of the item and $A$ is the matrix embedding of the session. Because $A$ is symmetric, it has the eigendecomposition (5)

$$A = Q \Lambda Q^\top,$$

where $Q$ is a real orthogonal square matrix and $\Lambda$ is a diagonal square matrix with the elements $\lambda_1, \dots, \lambda_d$ on the diagonal. In fact, $\lambda_1, \dots, \lambda_d$ are the eigenvalues of $A$, and the $i$-th column of $Q$ is the eigenvector corresponding to $\lambda_i$. In the example in Figure 4, the matrix embedding of the session can have two eigenvalues significantly greater than the others, whose eigenvectors are close to the lines along the embedding vectors of the dress and the phone respectively. Hence, the function

$$f(v) = v^\top A v, \quad v \in S^{d-1},$$

takes its maximal value close to the direction of the dress, where $S^{d-1}$ denotes the unit sphere in $\mathbb{R}^d$. When we use this model to predict, we will recommend a dress to the user as the top 1 selection.
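The eigenvector argument can be checked numerically. In the sketch below, the dress/phone directions and the click-count weighting of the session matrix are illustrative assumptions, not the paper's trained embeddings:

```python
import numpy as np

d = 4
dress = np.array([1.0, 0.0, 0.0, 0.0])
phone = np.array([0.0, 1.0, 0.0, 0.0])

# Toy session matrix: dress clicked 20 times, phone 10 times.
A = 20 * np.outer(dress, dress) + 10 * np.outer(phone, phone)

eigvals, eigvecs = np.linalg.eigh(A)   # eigenvalues in ascending order
top_dir = eigvecs[:, -1]               # eigenvector of the largest eigenvalue

# On the unit sphere, v^T A v is maximized at the top eigenvector,
# which here is (up to sign) the dress direction.
score = lambda v: v @ A @ v
assert abs(score(top_dir) - eigvals[-1]) < 1e-9
```

So unlike a vector session embedding, the matrix embedding keeps both interest directions, and the dominant one wins the top-1 prediction.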
2 Network structure
We set the dimension of the hidden layers to be $d(d+1)/2$, where $d$ is the dimension of the embedding vectors of items.
We use the layer "reshape to a symmetric matrix" instead of the layer "extension by 1". The layer "reshape to a symmetric matrix" maps a vector $h \in \mathbb{R}^{d(d+1)/2}$ to the symmetric matrix $A \in \mathbb{R}^{d \times d}$ whose upper triangular entries (diagonal included) are the entries of $h$, the lower triangle being determined by symmetry.
We use the layer

$$s(v, A) = v^\top A v$$

as the score layer instead of the inner product.
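A minimal NumPy sketch of these two layers; the entry ordering of the reshape and the toy inputs are assumptions, since the paper defines the layer only in its figures:

```python
import numpy as np

d = 3
m = d * (d + 1) // 2   # hidden size: free parameters of a symmetric d x d matrix

def reshape_to_symmetric(h):
    """Fill the upper triangle (diagonal included) with h, then mirror it."""
    A = np.zeros((d, d))
    iu = np.triu_indices(d)
    A[iu] = h
    A = A + A.T - np.diag(np.diag(A))   # mirror without doubling the diagonal
    return A

h = np.arange(1.0, m + 1)       # stand-in for the GRU output of size d(d+1)/2
A = reshape_to_symmetric(h)
v = np.array([1.0, 2.0, -1.0])  # stand-in item embedding
score = v @ A @ v               # score layer: v^T A v instead of <v, h>
```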
There is also a modification of the item embedding layer: the upper half hyperplane embedding, i.e., the embedding vectors of items are constrained to the upper half hyperplane (the final coordinate is positive). This modification improves the performance greatly. We give some illustration of the reason: the score $v^\top A v$ is invariant under the transformation $v \mapsto -v$, so if we train the model without this modification, the embeddings of items will lose their direction during training. The layer "upper half hyperplane embedding" can be realized by applying a function that forces positivity to the final coordinate of the vector produced by an ordinary embedding layer.
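A small sketch of both the sign ambiguity and the constraint; using `abs()` as the positivity-enforcing function is one possible choice (an assumption), and the matrix and vector are toy values:

```python
import numpy as np

def upper_half_embedding(raw):
    """Force the final coordinate positive so v and -v cannot both occur.
    abs() is one choice of positivity-enforcing function (an assumption)."""
    v = raw.copy()
    v[..., -1] = np.abs(v[..., -1])
    return v

A = np.array([[2.0, 1.0], [1.0, 3.0]])  # toy symmetric session matrix
v = np.array([0.5, -1.0])               # toy raw item embedding

# The score is invariant under v -> -v, so raw embeddings lose their direction:
assert np.isclose(v @ A @ v, (-v) @ A @ (-v))
# After the constraint, the sign ambiguity is gone:
assert upper_half_embedding(v)[-1] > 0
```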
3 Index method
To use the matrix embedding method for match, we give two index methods. We formulate the problem as follows: there is a large number of vectors $v \in \mathbb{R}^d$; for a symmetric matrix $A$, how can we find the top N vectors $v$ such that $v^\top A v$ is maximal?
Because $A$ is a symmetric matrix, we have

$$v^\top A v = \langle \varphi(A), \psi(v) \rangle,$$

where $\varphi(A) \in \mathbb{R}^{d(d+1)/2}$ consists of the diagonal entries $A_{ii}$ and the scaled off-diagonal entries $\sqrt{2}\,A_{ij}$ ($i < j$), and $\psi(v) \in \mathbb{R}^{d(d+1)/2}$ consists of the squares $v_i^2$ and the scaled products $\sqrt{2}\,v_i v_j$ ($i < j$). Therefore, we can map the user session matrix embedding into a linear space of dimension $d(d+1)/2$ using $\varphi$, and map the item vector embedding into the same linear space using $\psi$; the score is then equal to the inner product of $\varphi(A)$ and $\psi(v)$. Hence, we can construct an index over $\psi(v)$ for all items offline and, online, get the top N items of maximal inner product for every session, just as in the usual maximal-inner-product retrieval.
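The flattening identity can be verified directly. The $\sqrt{2}$ scaling convention below is one standard way to make the quadratic form an exact inner product (an assumption about the paper's exact maps), and the random matrix and vector are toy data:

```python
import numpy as np

d = 4
rng = np.random.default_rng(1)

def phi(A):
    """Flatten a symmetric matrix into R^{d(d+1)/2}: diagonal entries as-is,
    off-diagonal entries scaled by sqrt(2)."""
    iu = np.triu_indices(d, k=1)
    return np.concatenate([np.diag(A), np.sqrt(2) * A[iu]])

def psi(v):
    """Matching flattening for an item vector: squares plus scaled products."""
    iu = np.triu_indices(d, k=1)
    outer = np.outer(v, v)
    return np.concatenate([v ** 2, np.sqrt(2) * outer[iu]])

B = rng.normal(size=(d, d))
A = (B + B.T) / 2              # random symmetric session embedding
v = rng.normal(size=d)

# The quadratic score collapses to an ordinary inner product,
# so any standard max-inner-product index applies to phi(A) and psi(v).
assert np.isclose(v @ A @ v, phi(A) @ psi(v))
```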
The flatten method needs to build an index for vectors of dimension $d(d+1)/2$. When the dimension $d$ is big, it is difficult to store the data, build the index, and search for the items of maximal inner product. Hence we need a method to get an approximate top N items of maximal inner product faster. In fact, we have the singular value decomposition

$$A = \sum_{i=1}^{d} \sigma_i\, p_i q_i^\top,$$

where $\sigma_1 \ge \dots \ge \sigma_d \ge 0$ are the singular values and the $p_i$, $q_i$ are the left and right singular vectors. Hence we have

$$v^\top A v = \sum_{i=1}^{d} \sigma_i (v^\top p_i)(q_i^\top v).$$

As an approximate method, we take a small positive integer $k$ and, for each $i \le k$, take the top N items of maximal inner product with $p_i$; hence we have at most $kN$ items. Then we take the top N items from these $kN$ items by computing the exact score $v^\top A v$.
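The two-stage approximation can be sketched as below; the sizes, the random item set, and the choice of recalling by inner product with the left singular vectors are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
d, num_items, N, k = 16, 500, 10, 3

items = rng.normal(size=(num_items, d))    # toy item embeddings
B = rng.normal(size=(d, d))
A = (B + B.T) / 2                          # toy symmetric session matrix

U, S, Vt = np.linalg.svd(A)                # singular values in descending order

# Stage 1: for each of the k largest singular directions, recall N candidates
# with maximal inner product against that direction (<= kN candidates total).
candidates = set()
for i in range(k):
    scores_i = items @ U[:, i]
    candidates.update(np.argsort(scores_i)[::-1][:N].tolist())

# Stage 2: rerank the candidates by the exact quadratic score v^T A v.
candidates = np.array(sorted(candidates))
exact = np.einsum('nd,de,ne->n', items[candidates], A, items[candidates])
top = candidates[np.argsort(exact)[::-1][:N]]
```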
We give experiments comparing the matrix embedding method and the vector embedding method on the RSC15 dataset (RecSys Challenge 2015, http://2015.recsyschallenge.com/) and the last.fm dataset (6).
For the RSC15 dataset, after tuning the hyperparameters on the validation set, we retrained the three models above on the full six months of training days and used the last single day to evaluate them. For the last.fm playlists dataset, since the playlists have no timestamps, we followed the preprocessing procedure of (7): each playlist is randomly assigned to one of 31 buckets (days), and the last single day is used for evaluation.
We compare three models:
GRU4REC We re-implemented the GRU4REC code which Hidasi et al. released online (1) in the TensorFlow framework, including the whole GRU4REC architecture, the training procedure, and the evaluation procedure.
GRU4REC with symmetric matrix To address the problem of GRU4REC demonstrated in Section 1, we replace the output of the GRU, i.e. the embedding vector of the current session, with a symmetric matrix. More specifically…
GRU4REC with fully connected layer In addition to the above models, we also create a controlled-experiment model as shown in Figure 3, which is mainly based on the GRU4REC model but adds a fully connected layer right after the output of the GRU to expand the embedding vector of the GRU output to a higher dimension.
4.1 ACM RecSys 2015 Challenge Dataset
In order to evaluate the performance of the three models described in Section 2.1, we constrained the total number of their parameters to the same range. The details of the networks' architectures are shown in Table 1.
Table 2 shows the results of testing the three models on the last day of the ACM RecSys 2015 Challenge dataset for 10 epochs. After tuning on the validation set, we set lr = 0.002 and batch size = 256 for all experiments. Since the GRU4REC and GRU4REC-with-FC-layer models have fewer hidden units, dropout = 0.8 performs better for them, while dropout = 0.5 performs better for the symmetric matrix model. The BPR loss and the Adam optimizer are used in all cases.
We additionally include the results of (1), which uses 1000 hidden units for the GRU4REC model. It is clear that by combining the symmetric matrix embedding method with GRU4REC, we can use fewer parameters to achieve better recall@20 and mrr@20 performance.
| Layer | GRU4REC (shape / params / total) | GRU4REC with FC layer (shape / params / total) | GRU4REC with symmetric matrix (shape / params / total) |
|---|---|---|---|
| input_embedding | (37958, 32) / 1214656 / 1214656 | (37958, 32) / 1214656 / 1214656 | (37958, 32) / 1214656 / 1214656 |
| softmax_W | (37958, 64) / 2429312 / 3643968 | (37958, 55) / 2087690 / 3302346 | (37958, 32) / 1214656 / 2429312 |
| gru_cell/gates/kernel | (96, 128) / 12288 / 3694214 | (42, 20) / 840 / 3341749 | (560, 1056) / 591360 / 3020672 |
| gru_cell/candidate/kernel | (96, 64) / 6144 / 3700486 | (42, 10) / 420 / 3342189 | (560, 528) / 295680 / 3317408 |
4.2 Last.FM playlists Dataset
For the last.fm music playlists dataset, we applied the same network structure for each model as mentioned above; the specific parameters are shown in Table 3.
| Layer | GRU4REC (shape / params / total) | GRU4REC with FC layer (shape / params / total) | GRU4REC with symmetric matrix (shape / params / total) |
|---|---|---|---|
| input_embedding | (200668, 32) / 6421376 / 6421376 | (200668, 32) / 6421376 / 6421376 | (200668, 32) / 6421376 / 6421376 |
| softmax_W | (200668, 64) / 12842752 / 19264128 | (200668, 55) / 11036740 / 17458116 | (200668, 32) / 6421376 / 12842752 |
| gru_cell/gates/kernel | (96, 128) / 12288 / 19477084 | (42, 20) / 840 / 17660229 | (560, 1056) / 591360 / 13434112 |
| gru_cell/candidate/kernel | (96, 64) / 6144 / 19483356 | (42, 10) / 420 / 17660669 | (560, 528) / 295680 / 13730848 |
Since the music playlists dataset is quite different from the e-commerce click-sequence dataset, after tuning on the validation set we finally set lr = 0.0012 for all cases, while the batch size and dropout configurations remain the same.
Table 4 shows the results for the last.fm music playlists dataset. We observe the same trend as on the RecSys15 dataset.
- (1) Balazs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2016. Session-based Recommendations with Recurrent Neural Networks. International Conference on Learning Representations (2016).
- (2) Balazs Hidasi, Alexandros Karatzoglou. Recurrent Neural Networks with Top-k Gains for Session-based Recommendations. CIKM '18: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 843-852. Torino, Italy, October 22-26, 2018. ACM, New York, NY, USA. ISBN: 978-1-4503-6014-2.
- (3) Bogina, Veronika, and Tsvi Kuflik. “Incorporating dwell time in session-based recommendations with recurrent Neural networks.” CEUR Workshop Proceedings. Vol. 1922. 2017.
- (4) Quadrana, Massimo, et al. “Personalizing session-based recommendations with hierarchical recurrent neural networks.” Proceedings of the Eleventh ACM Conference on Recommender Systems. ACM, 2017.
- (5) https://en.wikipedia.org/wiki/Eigendecomposition_of_a_matrix
- (6) Thierry Bertin-Mahieux and Daniel P.W. Ellis and Brian Whitman and Paul Lamere. The Million Song Dataset. Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR 2011). http://millionsongdataset.com/lastfm/
- (7) Dietmar Jannach and Malte Ludewig. When Recurrent Neural Networks meet the Neighborhood for Session-Based Recommendation. RecSys ’17 Proceedings of the Eleventh ACM Conference on Recommender Systems Pages 306-310.