1 Introduction
In some large e-commerce platforms, for example Taobao and AliExpress, the recommendation algorithm is divided into two stages: the "match" stage and the "rank" stage. In the match stage, we select a candidate item set of a fixed size from all items. In the rank stage, we compute a score for every item in the matched set and rank them by score. A model of any form can be used in the rank stage, but the model in the match stage is restricted: it must be able to quickly pick items out of a very large item corpus, hence it needs an index. Only models that can generate an index can be used in the match stage. The most familiar match model is the static model: it computes the conditional probability of an item given a trigger as the score, and saves a table offline with the fields "trigger id", "item id" and "score", indexed by "trigger id" and "score". Online, we recall the items with the top N scores using "trigger id" as the index, where the trigger ids come from the items on which the user has behaviour.

Sequence prediction is the problem of using historical sequence information to predict the next value or values in the sequence. It has many applications, for example language models and recommender systems. Recurrent Neural Networks (RNNs) are widely used to solve sequence prediction problems.
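The static match model above can be sketched in a few lines of Python; the trigger ids, item ids and scores below are made-up illustrations, with the score standing in for the estimated conditional probability of the item given the trigger.

```python
from collections import defaultdict

# Hypothetical offline table: rows of ("trigger id", "item id", "score").
rows = [
    ("t1", "a", 0.9), ("t1", "b", 0.4), ("t1", "c", 0.7),
    ("t2", "b", 0.8), ("t2", "d", 0.6),
]

# Build the index keyed by trigger id, items sorted by descending score.
index = defaultdict(list)
for trigger, item, score in rows:
    index[trigger].append((score, item))
for trigger in index:
    index[trigger].sort(reverse=True)  # highest score first

def match(trigger_ids, n):
    """Online stage: recall the top-n items over all behaviour triggers."""
    best = {}
    for t in trigger_ids:
        for score, item in index.get(t, [])[:n]:
            best[item] = max(best.get(item, 0.0), score)
    return sorted(best, key=best.get, reverse=True)[:n]

print(match(["t1", "t2"], 2))  # -> ['a', 'b']
```

In production the table would be materialized in a key-value store rather than in memory, but the lookup pattern is the same.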
For a given sequence $x_1, x_2, \dots, x_t$, we wish to predict $x_{t+1}$. In a recommender system, the $x_i$'s are the items which the user clicked (or bought, added to a wish list, etc.), so the sequence represents the user. In a language model, the $x_i$'s are the words in a sentence, so the sequence represents the front part of the sentence.
The final layer is a fully connected layer with a softmax.
In [1], a session-based model for recommender systems is proposed; its structure is shown in Figure 1.
This network structure is equivalent to the network structure in Figure 2. In Figure 2, we give two embeddings for every item: if the item has been clicked, we call it a "trigger" and call its embedding the "trigger embedding"; if the item is to be scored, we call it an "item" and call its embedding the "item embedding". The layer "extension by 1" is the static map
$$(h_1, \dots, h_d) \mapsto (h_1, \dots, h_d, 1),$$
which appends a constant coordinate 1 to its input. Because the output of the final GRU layer collects all the trigger information up to the current time of the session, we can view the output of the layer "extension by 1" as the "session embedding". We set the dimension of the item embedding equal to the dimension of the session embedding, and define the output of the network as the inner product of the session embedding and the item embedding. It is easy to see that the network structures in Figure 1 and Figure 2 are equivalent under the correspondence
$$e_i = (A_{i,1}, \dots, A_{i,d}, b_i),$$
where $A_i$ denotes the $i$-th row of the softmax weight matrix $A$ and $b_i$ denotes the $i$-th element of the softmax bias column vector $b$. Hence we call this method the "vector embedding method".
The session-based model of [1] with the vector embedding method can be used as a match model. In fact, after the model is trained, we can save the vector embeddings of the items with some index, for example a KDTree, BallTree, etc. When a user visits our recommendation page, we compute the vector embedding of the user's session from the user's click sequence, and use the index to find the top N items whose vector embeddings have the maximal inner product with it.
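The online step can be sketched as a brute-force maximum-inner-product search over stand-in embeddings; a real system would query a prebuilt index instead of scanning every item, and the random matrices below are only placeholders for trained embeddings.

```python
import numpy as np

rng = np.random.default_rng(1)
item_emb = rng.normal(size=(1000, 32))  # stand-in for trained item embeddings

def match_top_n(session_emb, n):
    """Return indices of the n items with the largest inner product
    with the session embedding (exhaustive search, for illustration)."""
    scores = item_emb @ session_emb
    top = np.argpartition(-scores, n)[:n]   # unordered top-n candidates
    return top[np.argsort(-scores[top])]    # sort them by descending score

session = rng.normal(size=32)  # stand-in for the session embedding
print(match_top_n(session, 5))
```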
But the vector embedding method has an inherent defect, because the interests of a user may not be single. Suppose the item embeddings of dresses and phones are as shown in Figure 4. Since interest in dresses is generally independent of interest in phones, we can suppose the two embedding directions are linearly independent. If a user clicked dresses 20 times and phones 10 times in one session, then the vector embedding of this session will mainly try to stay close to the dress direction, but will be dragged away by the phone direction during training. As a result, the session's vector embedding will lie between dress and phone, close to neither. Hence when we predict using this embedding, we will recommend something like a blend of dress and phone to the user as the top 1 selection, instead of the dress the user is most interested in. In other words, the vector embedding scheme discards the diversity of the user's interests within one session.
In order to model this diversity of interests within one session, we use a "matrix embedding" of the session instead of a "vector embedding".
In our method, the items are still modeled as vectors of dimension $d$, but a session is modeled as a symmetric matrix in $\mathbb{R}^{d \times d}$ instead of a vector in $\mathbb{R}^{d}$. The score representing the interest of the session in an item is modeled as
$$\mathrm{score}(x) = x^{T} M x,$$
where $x$ is the vector embedding of the item, and $M$ is the matrix embedding of the session. Because $M$ is symmetric, it has the eigendecomposition [5]
$$M = Q \Lambda Q^{T},$$
where $Q$ is a real orthogonal square matrix, and $\Lambda$ is a diagonal square matrix with the elements $\lambda_1, \dots, \lambda_d$ on the diagonal. In fact, the $\lambda_i$ are the eigenvalues of $M$, and the $i$-th column of $Q$ is the eigenvector corresponding to $\lambda_i$. In the example in Figure 4, the matrix embedding of the session can have two eigenvalues significantly greater than the others, whose eigenvectors are close to the lines along the embedding vector of dress and the embedding vector of phone respectively. Hence the function
$$x \mapsto x^{T} M x, \quad x \in B^{d},$$
takes its maximal value close to the direction of dress, where $B^{d}$ denotes the unit ball in $\mathbb{R}^{d}$. When we use this model to predict, we will recommend a dress to the user as the top 1 selection.
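The contrast between the two scoring schemes can be checked numerically. In the sketch below the embeddings and eigenvalues are made-up illustrations: the session matrix has its larger eigenvalue along the dress direction, while the session vector is the corresponding 2:1 dress/phone blend.

```python
import numpy as np

# Illustrative unit embeddings for the two interests (assumed values, d = 4).
dress = np.array([1.0, 0.0, 0.0, 0.0])
phone = np.array([0.0, 1.0, 0.0, 0.0])
mixed = (dress + phone) / np.linalg.norm(dress + phone)
items = {"dress": dress, "phone": phone, "mixed": mixed}

# Matrix session embedding: eigenvalue 3 along dress, 2 along phone,
# mimicking a session with more dress clicks than phone clicks.
M = 3.0 * np.outer(dress, dress) + 2.0 * np.outer(phone, phone)
matrix_top = max(items, key=lambda k: items[k] @ M @ items[k])

# Vector session embedding of the same session (a 2:1 dress/phone blend).
v = 2.0 * dress + phone
v /= np.linalg.norm(v)
vector_top = max(items, key=lambda k: items[k] @ v)

print(matrix_top, vector_top)  # -> dress mixed
```

The quadratic form ranks the pure dress item first (score 3.0 vs. 2.5 for the blend), whereas the inner product with the blended session vector ranks the blended item first, which is exactly the defect described above.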
2 Network structure
The network structure of our new method is shown in Figure 4. The main differences between Figure 4 and Figure 2 are:

We set the dimension of the hidden layers to be $d(d+1)/2$, where $d$ is the dimension of the embedding vectors of the items.

We use the layer "reshape to a symmetric matrix" instead of the layer "extension by 1". The layer "reshape to a symmetric matrix" maps a vector $v \in \mathbb{R}^{d(d+1)/2}$ to the symmetric matrix $M \in \mathbb{R}^{d \times d}$ defined by
$$M_{ij} = M_{ji} = v_{k(i,j)}, \quad i \le j,$$
where $k(i,j)$ enumerates the $d(d+1)/2$ index pairs $(i, j)$ with $i \le j$.
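A minimal numpy sketch of the reshape layer; the pairing of vector entries to matrix cells (here: upper triangle, row-major) is an assumed choice, since any fixed enumeration of the index pairs works.

```python
import numpy as np

def reshape_to_symmetric(v, d):
    """Map a vector of length d*(d+1)/2 to a symmetric d x d matrix.

    Entries fill the upper triangle in row-major order and are mirrored
    to the lower triangle (this particular ordering is an assumption).
    """
    assert v.shape == (d * (d + 1) // 2,)
    M = np.zeros((d, d))
    iu = np.triu_indices(d)
    M[iu] = v                            # fill the upper triangle
    M = M + M.T - np.diag(np.diag(M))    # mirror without doubling the diagonal
    return M

M = reshape_to_symmetric(np.arange(6.0), 3)
print(M)
```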

We use the layer
$$\mathrm{score}(x) = x^{T} M x$$
as the score layer instead of the inner product.

There is a modification of the item embedding layer: the "upper half hyperplane embedding", i.e. the embedding vectors of the items are restricted to the upper half space
$$\{x \in \mathbb{R}^{d} : x_{d} \ge 0\}.$$
This modification improves the performance greatly. We give an illustration of the reason: the score value is invariant under the transformation $x \mapsto -x$, so if we train the model without this modification, the embeddings of the items lose their direction during training. The layer "upper half hyperplane embedding" can be realized by applying the absolute value to the final coordinate of the vector in the ordinary embedding layer.
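The sign ambiguity and the fix can be checked numerically. This sketch assumes the layer simply takes the absolute value of the last coordinate, as described above.

```python
import numpy as np

def upper_half_embedding(E):
    """Constrain embeddings to the half space {x : x_d >= 0} by taking
    the absolute value of the last coordinate."""
    E = np.array(E, dtype=float)
    E[..., -1] = np.abs(E[..., -1])
    return E

rng = np.random.default_rng(0)
x = rng.normal(size=5)
M = rng.normal(size=(5, 5))
M = (M + M.T) / 2.0  # a symmetric session matrix

# The quadratic-form score cannot distinguish x from -x ...
print(x @ M @ x - (-x) @ M @ (-x))  # -> 0.0 (the score is sign-invariant)
# ... so the layer pins down a canonical sign for the last coordinate.
print(upper_half_embedding(-x)[-1] >= 0.0)  # -> True
```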
3 Index method
To use the model as a match method, we give two index methods for the matrix embedding method.
We formulate the problem as follows:
Given a large set of vectors $x \in \mathbb{R}^{d}$ and a symmetric matrix $M$, how can we find the top N vectors $x$ such that $x^{T} M x$ is maximal?
3.1 Flatten
Because $M$ is a symmetric matrix, we have
$$x^{T} M x = \sum_{i} M_{ii} x_i^{2} + 2 \sum_{i < j} M_{ij} x_i x_j = \langle \varphi(M), \psi(x) \rangle,$$
where $\varphi(M) \in \mathbb{R}^{d(d+1)/2}$ collects the entries $M_{ii}$ and $2 M_{ij}$ ($i < j$), and $\psi(x) \in \mathbb{R}^{d(d+1)/2}$ collects the monomials $x_i x_j$ ($i \le j$). Therefore, we can map the session matrix embedding $M$ into a linear space of dimension $d(d+1)/2$ using $\varphi$, map the item vector embedding $x$ into the same linear space using $\psi$, and the score is equal to the inner product of $\varphi(M)$ and $\psi(x)$. Hence we can construct an index over the vectors $\psi(x)$ for all items offline, and online retrieve the top N items of maximal inner product with $\varphi(M)$ just as in the usual maximal-inner-product setting.
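The flatten maps can be sketched as follows; the upper-triangle ordering is our choice, and any consistent ordering of the index pairs would do.

```python
import numpy as np

def phi(M):
    """Flatten a symmetric matrix to a vector of length d*(d+1)/2.
    Off-diagonal entries are doubled, since M_ij and M_ji contribute
    two equal terms to x^T M x."""
    d = M.shape[0]
    i, j = np.triu_indices(d)
    return np.where(i == j, 1.0, 2.0) * M[i, j]

def psi(x):
    """Map an item vector to the matching vector of monomials x_i * x_j."""
    i, j = np.triu_indices(len(x))
    return x[i] * x[j]

rng = np.random.default_rng(0)
x = rng.normal(size=6)
M = rng.normal(size=(6, 6))
M = (M + M.T) / 2.0
print(np.isclose(x @ M @ x, phi(M) @ psi(x)))  # -> True
```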
3.2 Decomposition
The flatten method needs to build an index for vectors of dimension $d(d+1)/2$. When the dimension
$d$ is big, it is difficult to save the data, build the index and search for the items of maximal inner product. Hence we need a faster method to get an approximate top N items of maximal inner product. In fact, we have the singular value decomposition
$$M = U \Sigma V^{T},$$
where $U$ and $V$ are orthogonal matrices and $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_d)$ with $\sigma_1 \ge \dots \ge \sigma_d \ge 0$. Hence we have
$$x^{T} M x = \sum_{i=1}^{d} \sigma_i \, (x^{T} u_i)(v_i^{T} x),$$
where $u_i$ and $v_i$ are the $i$-th columns of $U$ and $V$. As an approximate method, we take a small positive integer $k$, and for each $i = 1, \dots, k$ take the top N items of maximal inner product along the $i$-th singular direction; this yields at most $kN$ items, and we then take the top N items from these $kN$ items by computing $x^{T} M x$ exactly.
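A sketch of the two-stage retrieval on assumed toy data; the per-direction shortlist rule (ranking by the direction's contribution $\sigma_i (x^{T} u_i)(v_i^{T} x)$) is our reading of the scheme above.

```python
import numpy as np

def approx_top_n(M, items, n, k=1):
    """Approximate top-n items by x^T M x.

    Stage 1: for each of the k largest singular directions, shortlist the
    n items with the largest contribution sigma_i * (x . u_i)(v_i . x).
    Stage 2: re-rank the (at most k*n) shortlisted items by the exact score.
    """
    U, s, Vt = np.linalg.svd(M)
    cand = set()
    for i in range(k):
        contrib = s[i] * (items @ U[:, i]) * (items @ Vt[i])
        cand.update(np.argsort(-contrib)[:n].tolist())
    cand = np.fromiter(cand, dtype=int)
    exact = np.einsum('nd,de,ne->n', items[cand], M, items[cand])
    return cand[np.argsort(-exact)[:n]]

# Toy check: with one dominant direction, the aligned item must come first.
rng = np.random.default_rng(0)
items = rng.normal(size=(50, 4))
items /= np.linalg.norm(items, axis=1, keepdims=True)
items[0] = np.array([1.0, 0.0, 0.0, 0.0])  # item aligned with the top direction
M = np.diag([5.0, 1.0, 0.5, 0.1])
print(approx_top_n(M, items, n=5, k=1)[0])  # -> 0
```

With $k = 1$ the shortlist degenerates to a single maximal-inner-product query, which is why this method scales better than flattening when $d$ is large.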
4 Experiments
We give experiments comparing the matrix embedding method and the vector embedding method on the RSC15 dataset (RecSys Challenge 2015, http://2015.recsyschallenge.com/) and the last.fm dataset [6].
For the RSC15 dataset, after tuning the hyperparameters on the validation set, we retrained the three models below on the whole days among six months, and used the last single day to evaluate those models. For the last.fm playlists dataset, since the playlists have no timestamps, we followed the preprocessing procedure of
[7]: each playlist was randomly assigned to one of 31 buckets (days), and the last single day was used for evaluation. We compare the three models:
GRU4REC: We reimplemented the code of GRU4REC that Hidasi et al. released online [1]
in the TensorFlow framework, including the whole GRU4REC architecture, the training procedure and the evaluation procedure.
GRU4REC with symmetric matrix: To address the problem of GRU4REC demonstrated in Section 1, we replace the output of the GRU, i.e. the embedding vector of the current session, with a symmetric matrix. More specifically…
GRU4REC with fully connected layer: In addition to the above models, we also create a controlled-experiment model as shown in Figure 3, which is mainly based on the GRU4REC model but adds a fully connected layer right after the output of the GRU to expand the embedding vector of the GRU output to a higher dimension.
4.1 ACM RecSys 2015 Challenge Dataset
In order to evaluate the performance of the three models described above, we constrained the total number of their parameters to the same range. The details of the networks' architectures are shown in Table 1.
Table 2 shows the results of testing those three models on the last day of the ACM RecSys 2015 Challenge dataset for 10 epochs. After tuning on the validation set, we set lr = 0.002 and batch size = 256 for all the experiments. Since the GRU4REC and GRU4REC-with-FC-layer models have fewer hidden units, dropout = 0.8 shows better performance for them, while dropout = 0.5 performs better for the symmetric matrix model. All cases use the BPR loss and the Adam optimizer.
Method | recall@20 | mrr@20
GRU4REC | 0.389 | 0.135
GRU4REC+FC | 0.515 | 0.515
GRU4REC+Matrix | 0.749 | 0.748
GRU4REC(1000) | 0.632 | 0.247
We additionally include the results from [1], which uses 1000 hidden units for the GRU4REC model. By combining the symmetric matrix embedding method with GRU4REC, we can use fewer parameters to achieve better recall@20 and mrr@20.
Layer | GRU4REC (shape / params / total) | GRU4REC+FC (shape / params / total) | GRU4REC+Matrix (shape / params / total)
input_embedding | (37958, 32) / 1214656 / 1214656 | (37958, 32) / 1214656 / 1214656 | (37958, 32) / 1214656 / 1214656
softmax_W | (37958, 64) / 2429312 / 3643968 | (37958, 55) / 2087690 / 3302346 | (37958, 32) / 1214656 / 2429312
softmax_b | (37958,) / 37958 / 3681926 | (37958,) / 37958 / 3340304 | —
gru_cell/dense/kernel | — | (10, 55) / 550 / 3340854 | —
gru_cell/dense/bias | — | (55,) / 55 / 3340909 | —
gru_cell/gates/kernel | (96, 128) / 12288 / 3694214 | (42, 20) / 840 / 3341749 | (560, 1056) / 591360 / 3020672
gru_cell/gates/bias | (128,) / 128 / 3694342 | (20,) / 20 / 3341769 | (1056,) / 1056 / 3021728
gru_cell/candidate/kernel | (96, 64) / 6144 / 3700486 | (42, 10) / 420 / 3342189 | (560, 528) / 295680 / 3317408
gru_cell/candidate/bias | (64,) / 64 / 3700550 | (10,) / 10 / 3342199 | (528,) / 528 / 3317936
4.2 Last.FM playlists Dataset
For the last.fm music playlists dataset, we applied the same network structure for each model as mentioned above; the specific parameters are shown in Table 3.
Layer | GRU4REC (shape / params / total) | GRU4REC+FC (shape / params / total) | GRU4REC+Matrix (shape / params / total)
input_embedding | (200668, 32) / 6421376 / 6421376 | (200668, 32) / 6421376 / 6421376 | (200668, 32) / 6421376 / 6421376
softmax_W | (200668, 64) / 12842752 / 19264128 | (200668, 55) / 11036740 / 17458116 | (200668, 32) / 6421376 / 12842752
softmax_b | (200668,) / 200668 / 19464796 | (200668,) / 200668 / 17658784 | —
gru_cell/dense/kernel | — | (10, 55) / 550 / 17659334 | —
gru_cell/dense/bias | — | (55,) / 55 / 17659389 | —
gru_cell/gates/kernel | (96, 128) / 12288 / 19477084 | (42, 20) / 840 / 17660229 | (560, 1056) / 591360 / 13434112
gru_cell/gates/bias | (128,) / 128 / 19477212 | (20,) / 20 / 17660249 | (1056,) / 1056 / 13435168
gru_cell/candidate/kernel | (96, 64) / 6144 / 19483356 | (42, 10) / 420 / 17660669 | (560, 528) / 295680 / 13730848
gru_cell/candidate/bias | (64,) / 64 / 19483420 | (10,) / 10 / 17660679 | (528,) / 528 / 13731376
Since the music playlists dataset is quite different from the e-commerce click sequence dataset, after tuning on the validation set we finally set lr = 0.0012 for all cases, while the batch size and dropout configuration remain the same.
Table 4 shows the results for the last.fm music playlists dataset. We can observe the same trend as in the results on the RecSys15 dataset.
Method | recall@20 | mrr@20
GRU4REC | 0.027 | 0.022
GRU4REC+FC | 0.054 | 0.054
GRU4REC+Matrix | 0.164 | 0.164
GRU4REC(1000) | 0.121 | 0.053
References
(1) Balazs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. Session-based Recommendations with Recurrent Neural Networks. International Conference on Learning Representations, 2016.
(2) Balazs Hidasi and Alexandros Karatzoglou. Recurrent Neural Networks with Top-k Gains for Session-based Recommendations. CIKM '18: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 843-852, Torino, Italy, October 22-26, 2018. ACM, New York, NY, USA. ISBN 978-1-4503-6014-2.
(3) Veronika Bogina and Tsvi Kuflik. Incorporating Dwell Time in Session-based Recommendations with Recurrent Neural Networks. CEUR Workshop Proceedings, Vol. 1922, 2017.
(4) Massimo Quadrana et al. Personalizing Session-based Recommendations with Hierarchical Recurrent Neural Networks. Proceedings of the Eleventh ACM Conference on Recommender Systems. ACM, 2017.
(5) https://en.wikipedia.org/wiki/Eigendecomposition_of_a_matrix
(6) Thierry Bertin-Mahieux, Daniel P. W. Ellis, Brian Whitman, and Paul Lamere. The Million Song Dataset. Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR 2011). http://millionsongdataset.com/lastfm/
(7) Dietmar Jannach and Malte Ludewig. When Recurrent Neural Networks Meet the Neighborhood for Session-Based Recommendation. RecSys '17: Proceedings of the Eleventh ACM Conference on Recommender Systems, pages 306-310.