Matrix embedding method in match for session-based recommendation

08/27/2019 ∙ by Qizhi Zhang, et al. ∙ Tmall

Session-based models are widely used in recommender systems. They take the user's click sequence as the input of a recurrent neural network (RNN), take the output of the RNN as the vector embedding of the session, and use the inner product of the session embedding and the embedding of a candidate next item as the score measuring the interest in that item. This method can be used in the "match" stage of a recommendation system with a very large number of items, by means of an index method such as a KD-Tree or Ball-Tree. But it ignores the diversity of the user's interests within a session. We generalize the model by replacing the vector embedding of the session with a symmetric matrix embedding, which is equivalent to a quadratic form on the vector space of items. The score is defined as the value of the quadratic form at the vector embedding of the next item. The eigenvectors of the symmetric matrix embedding corresponding to the positive eigenvalues are conjectured to represent the interests of the user in the session. This method can also be used in the match stage. Experiments show that it outperforms the vector embedding method.

1 Introduction

In some large e-commerce platforms, for example Taobao and AliExpress, the recommendation algorithm is divided into two stages, the "match" stage and the "rank" stage. In the match stage, we select a candidate item set of limited size from all the items. In the rank stage, we compute a score for each item in the matched set and rank the items by score. A model of any form can be used in the rank stage, but the match stage imposes a restriction: the model must be able to pick items quickly out of a very large corpus, hence it needs an index, and only models that admit an index can be used in the match stage. The most familiar model for match is the static model: it computes the conditional probability $P(\text{item} \mid \text{trigger})$ as a score, and saves offline a table with the fields "trigger id", "item id" and "score", indexed by "trigger id" and "score". Online, we recall the items with the top $N$ scores using "trigger id" as the index, where the trigger ids come from the items on which the user has behavior.
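To make the offline/online split concrete, here is a minimal sketch of the static match model; the table contents, names, and the merging rule across triggers are illustrative assumptions, not from the paper:

```python
from collections import defaultdict

# Offline: a score table with fields (trigger_id, item_id, score),
# indexed by trigger_id and sorted by descending score.
score_table = defaultdict(list)  # trigger_id -> [(score, item_id), ...]
for trigger_id, item_id, score in [(1, 7, 0.9), (1, 3, 0.5), (2, 7, 0.4)]:
    score_table[trigger_id].append((score, item_id))
for trigger_id in score_table:
    score_table[trigger_id].sort(reverse=True)

def match(user_triggers, n):
    """Online: recall the top-N items over all triggers the user acted on."""
    candidates = []
    for t in user_triggers:
        candidates.extend(score_table.get(t, [])[:n])
    candidates.sort(reverse=True)
    return [item for _, item in candidates[:n]]

print(match([1, 2], n=2))  # -> [7, 3]
```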

Sequence prediction is the problem of using historical sequence information to predict the next value or values in the sequence. It has many applications, for example language models and recommender systems. Recurrent neural networks (RNNs) are widely used to solve sequence prediction problems.

For a given sequence $x_1, x_2, \ldots, x_t$, we wish to predict $x_{t+1}$. In a recommender system, the $x_i$ are the items the user clicked (or bought, added to a wish list, etc.), so the sequence represents the user. In a language model, the $x_i$ are the words of a sentence, so the sequence represents the front part of the sentence.

The final layer is a fully connected layer with a softmax.

In SESSION_BASED, a session-based model for recommender systems is proposed. The model is shown in Figure 1.

This network structure is equivalent to the network structure in Figure 2. In Figure 2, we give two embeddings to every item: if the item has been clicked, we call it a “trigger” and call its embedding the “trigger embedding”; if the item is the one whose score is to be predicted, we call it an “item” and call its embedding the “item embedding”. The layer “Extension by 1” is the static map

$$h \mapsto (h, 1),$$

which appends the constant coordinate $1$ to the vector $h$.

Because the output of the final GRU layer collects all the trigger information of the session up to the current time, we can view the output of the layer “Extension by 1” as the “session embedding”. We set the dimension of the item embedding equal to the dimension of the session embedding, and define the output of the network as the inner product of the session embedding and the item embedding. It is easy to see that the network structures in Figure 1 and Figure 2 are equivalent under the correspondence

$$e_i = (A_{i,\cdot},\, b_i),$$

where $A_{i,\cdot}$ denotes the $i$-th row of the weight matrix $A$ of the fully connected layer and $b_i$ denotes the $i$-th element of its bias vector $b$. Hence we call this method the “vector embedding method”.
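A quick numerical check of this correspondence (a sketch with arbitrary shapes; the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, hidden = 5, 4
A = rng.normal(size=(n_items, hidden))   # weights of the final FC layer
b = rng.normal(size=n_items)             # bias of the final FC layer
h = rng.normal(size=hidden)              # output of the final GRU layer

scores_fc = A @ h + b                            # Figure 1: FC layer with bias
session = np.concatenate([h, [1.0]])             # "Extension by 1"
items = np.concatenate([A, b[:, None]], axis=1)  # item embeddings (A_i, b_i)
scores_ip = items @ session                      # Figure 2: inner product

assert np.allclose(scores_fc, scores_ip)         # the two structures agree
```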

The session-based model of SESSION_BASED with the vector embedding method can be used as a model for match. After the model is trained, we save the vector embeddings of the items with some index, for example a KD-Tree or Ball-Tree. When a user visits the recommendation page, we compute the vector embedding of the user's session from the click sequence, and use the index to find the top $N$ items whose vector embeddings have the maximal inner product with it.
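A sketch of this online step, with brute-force search standing in for the KD-Tree/Ball-Tree or other maximal-inner-product index (the function name and shapes are ours):

```python
import numpy as np

def topn_items(session_emb, item_embs, n):
    # A production system would serve this lookup from a prebuilt index;
    # brute force is shown here only to make the scoring rule explicit.
    scores = item_embs @ session_emb          # inner products with all items
    top = np.argpartition(-scores, n)[:n]     # unordered top-N
    return top[np.argsort(-scores[top])]      # ordered by descending score

items = np.random.default_rng(0).normal(size=(1000, 33))  # item embeddings
session = np.ones(33)                                     # session embedding
print(topn_items(session, items, n=5))
```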

Figure 1: The network structure in SESSION_BASED
Figure 2: An equivalent network structure to SESSION_BASED

But the vector embedding method has an inherent defect: the interests of a user may not be single. Suppose the item embeddings of dresses and phones are as shown in Figure 3. The interest in dresses is generally independent of the interest in phones, so we can suppose the two embeddings are linearly independent. If a user clicked dresses 20 times and phones 10 times in one session, then under training the vector embedding of this session mainly tries to move close to the dresses but is dragged away by the phones; as a result, the session embedding lies between the dresses and the phones, close to neither. Hence when we predict with this embedding, we recommend something like a combination of a dress and a phone as the top-1 selection, instead of the dress the user is most interested in. In other words, the vector embedding scheme suppresses the diversity of the user's interests within one session.

Figure 3: The inherent defect of vector embedding method
Figure 4: The matrix embedding method

In order to model the diversity of a user's interests within one session, we use a “matrix embedding” of the session instead of a “vector embedding”.

In our method, the items are still modeled as vectors of dimension $n$, but a session is modeled as a symmetric matrix $M \in \mathbb{R}^{n \times n}$ instead of a vector. The score representing the interest of the session in an item is modeled as

$$s(x) = x^{T} M x,$$

where $x$ is the vector embedding of the item and $M$ is the matrix embedding of the session. The symmetric matrix $M$ has the eigendecomposition (Eigendecomposition)

$$M = U \Lambda U^{T},$$

where $U$ is a real orthogonal matrix and $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_n)$ is a diagonal matrix. The $\lambda_i$ are the eigenvalues of $M$, and the $i$-th column of $U$ is the eigenvector corresponding to $\lambda_i$. In the example in Figure 4, the matrix embedding of the session can have two eigenvalues significantly greater than the others, whose eigenvectors are close to the lines along the embedding vector of the dress and the embedding vector of the phone respectively. Hence, the function

$$s(x) = x^{T} M x, \qquad x \in S^{n-1},$$

takes its maximal value close to the direction of the dress, where $S^{n-1}$ denotes the unit sphere in $\mathbb{R}^n$. When we use this model to predict, we recommend a dress to the user as the top-1 selection.
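A toy numerical illustration of this example (the numbers and the hand-built matrix are ours, not from the paper): the averaged session vector ranks a dress-phone mixture above the dress, while a quadratic form whose leading eigenvectors align with the two interests ranks the dress first.

```python
import numpy as np

dress = np.array([1.0, 0.0])   # embedding direction of dresses
phone = np.array([0.0, 1.0])   # embedding direction of phones

# Vector embedding: a session with 20 dress clicks and 10 phone clicks
# ends up between the two interests.
session_vec = (20 * dress + 10 * phone) / 30

# Matrix embedding: a symmetric matrix whose two leading eigenvectors
# align with the two interests (weights chosen by hand for illustration).
M = 2.0 * np.outer(dress, dress) + 1.0 * np.outer(phone, phone)
eigvals, eigvecs = np.linalg.eigh(M)        # M = U diag(lambda) U^T
print(eigvals)                              # [1. 2.]: two interests survive
print(eigvecs[:, np.argmax(eigvals)])       # +/- the dress direction

mix = np.array([2.0, 1.0]) / np.sqrt(5.0)   # a "comb of dress and phone"
print(dress @ session_vec, mix @ session_vec)  # 0.667 < 0.745: mix wins
print(dress @ M @ dress, mix @ M @ mix)        # 2.0 > 1.8: dress wins
```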

2 Network structure

The network structure of our new method is shown in Figure 4. The main differences between Figure 4 and Figure 2 are the following:

  1. We set the dimension of the hidden layers to $n(n+1)/2$, where $n$ is the dimension of the embedding vectors of items.

  2. We use the layer “reshape to a symmetric matrix” instead of the layer “extension by 1” (a code sketch of all three modifications follows this list). The layer “reshape to a symmetric matrix” is defined as

    $$(h_1, \ldots, h_{n(n+1)/2}) \mapsto M, \qquad M_{ij} = M_{ji} = h_{k(i,j)},$$

    where $k(i,j)$, $i \le j$, enumerates the positions of the upper triangle.

  3. We use the layer

    $$s(x) = x^{T} M x$$

    as the score layer instead of the inner product.

  4. We modify the item embedding layer: it is an upper half hyperplane embedding, i.e., the embedding vectors of items lie in the upper half hyperplane

    $$\{\, x \in \mathbb{R}^n : x_n > 0 \,\}.$$

    This modification improves the performance greatly. An illustration of the reason: the score value is invariant under the transformation $x \mapsto -x$, so if we train the model without this modification, the embeddings of items lose their orientation during training. The layer “upper half hyperplane embedding” can be realized by applying a positivity-enforcing function (for example softplus) to the final coordinate of the output of an ordinary embedding layer.
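The following is a minimal NumPy sketch of the three modified layers described above (the helper names are ours, and the paper's actual implementation is in TensorFlow; softplus is one possible choice for the positivity map, as noted in item 4):

```python
import numpy as np

def reshape_to_symmetric(h, n):
    """Map a vector of length n(n+1)/2 to a symmetric n x n matrix."""
    M = np.zeros((n, n))
    iu = np.triu_indices(n)
    M[iu] = h          # fill the upper triangle (including the diagonal)
    M.T[iu] = h        # mirror it into the lower triangle
    return M

def score(x, M):
    """Score of an item embedding x under the session's quadratic form."""
    return x @ M @ x

def upper_half_hyperplane(e):
    """Force the last coordinate positive; here via softplus."""
    out = e.copy()
    out[..., -1] = np.log1p(np.exp(e[..., -1]))
    return out

n = 4
h = np.arange(n * (n + 1) // 2, dtype=float)  # stand-in for the GRU output
M = reshape_to_symmetric(h, n)                # session matrix embedding
x = upper_half_hyperplane(np.ones(n))         # item embedding, x_n > 0
print(score(x, M))
```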

3 Index method

To use the matrix embedding method for match, we give two index methods.

We formulate the problem as follows: given a large set of vectors $x \in \mathbb{R}^n$ and a symmetric matrix $M$, find the top $N$ vectors $x$ for which $x^{T} M x$ is maximal.

3.1 Flatten

Because $M$ is a symmetric matrix, we have

$$x^{T} M x = \sum_{i} M_{ii} x_i^2 + \sum_{i<j} 2 M_{ij} x_i x_j = \langle \varphi(M), \psi(x) \rangle,$$

where $\varphi(M) = (M_{11}, \ldots, M_{nn}, \sqrt{2}\,M_{12}, \ldots, \sqrt{2}\,M_{n-1,n})$ and $\psi(x) = (x_1^2, \ldots, x_n^2, \sqrt{2}\,x_1 x_2, \ldots, \sqrt{2}\,x_{n-1} x_n)$. Therefore, we can map the matrix embedding of the user session into a linear space of dimension $n(n+1)/2$ using $\varphi$, and map the item vector embedding into the same linear space using $\psi$; the score equals the inner product of $\varphi(M)$ and $\psi(x)$. Hence, we can construct an index of the vectors $\psi(x)$ for all items offline, and online retrieve the top $N$ items of maximal inner product with $\varphi(M)$, exactly as in the usual maximal inner product retrieval.
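A sketch of the flatten maps $\varphi$ and $\psi$ in the notation above, checking that the inner product of the flattened vectors reproduces $x^{T} M x$ (the function names are ours):

```python
import numpy as np

def phi(M):
    """Flatten a symmetric matrix: diagonal, then sqrt(2)-scaled upper triangle."""
    n = M.shape[0]
    iu = np.triu_indices(n, k=1)
    return np.concatenate([np.diag(M), np.sqrt(2.0) * M[iu]])

def psi(x):
    """Flatten an item vector the same way, via the rank-1 matrix x x^T."""
    return phi(np.outer(x, x))

rng = np.random.default_rng(0)
n = 5
A = rng.normal(size=(n, n))
M = (A + A.T) / 2                  # a random symmetric session embedding
x = rng.normal(size=n)
assert np.isclose(x @ M @ x, phi(M) @ psi(x))

# Offline: index psi(x) for all items; online: top-N inner product with phi(M).
items = rng.normal(size=(1000, n))
scores = np.array([psi(v) for v in items]) @ phi(M)
print(np.argsort(-scores)[:10])
```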

3.2 Decomposition

The flatten method needs to build an index for vectors of dimension $n(n+1)/2$. When the dimension $n$ is big, it is difficult to save the data, build the index and search for the items of maximal inner product, so we need a method to get an approximate top $N$ faster. In fact, we have the singular value decomposition

$$M = U \Sigma V^{T},$$

where $U$ and $V$ are orthogonal and $\Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_n)$ with $\sigma_1 \ge \cdots \ge \sigma_n \ge 0$. Hence we have

$$x^{T} M x = \sum_{i=1}^{n} \sigma_i \langle x, u_i \rangle \langle x, v_i \rangle,$$

where $u_i$ and $v_i$ are the $i$-th columns of $U$ and $V$. As an approximate method, we take a small positive integer $k$, and for each $i \le k$ take the top $N$ items of maximal inner product with $u_i$; this gives at most $kN$ items, and we then take the top $N$ items from these $kN$ items by computing $x^{T} M x$ exactly.
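A sketch of this two-stage approximate retrieval under our reading of the section: for a symmetric matrix we use the eigendecomposition (which coincides with the SVD up to signs), rank candidate directions by absolute eigenvalue, and re-rank the pooled candidates exactly. The function name, the use of absolute inner products, and the candidate pooling are our assumptions:

```python
import numpy as np

def approx_top_n(M, items, n, k):
    """Stage 1: for each of the k leading directions, take the top-n items
    by |inner product|; stage 2: re-rank the <= k*n candidates by the
    exact quadratic form x^T M x."""
    eigvals, eigvecs = np.linalg.eigh(M)          # symmetric M: SVD ~ eigen
    order = np.argsort(-np.abs(eigvals))[:k]      # k leading directions
    candidates = set()
    for i in order:
        proj = np.abs(items @ eigvecs[:, i])
        candidates.update(np.argsort(-proj)[:n].tolist())
    cand = np.array(sorted(candidates))
    exact = np.einsum('ij,jk,ik->i', items[cand], M, items[cand])
    return cand[np.argsort(-exact)[:n]]

rng = np.random.default_rng(0)
items = rng.normal(size=(1000, 8))
A = rng.normal(size=(8, 8))
M = (A + A.T) / 2
print(approx_top_n(M, items, n=10, k=3))
```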

4 Experiments

We give experiments comparing the matrix embedding method and the vector embedding method on the RSC15 dataset (RecSys Challenge 2015, http://2015.recsyschallenge.com/) and the last.fm dataset last_fm.

For the RSC15 dataset, after tuning the hyperparameters on the validation set, we retrained the three models below on all days of the six months and used the last single day to evaluate them. For the last.fm playlists dataset, since the playlists have no timestamps, we followed the preprocessing procedure of last_fm_paper, that is, we randomly assigned each playlist to one of 31 buckets (days) and used the last single day to evaluate.

We compare three models:

GRU4REC. We re-implemented the GRU4REC code that Hidasi et al. released online SESSION_BASED in the TensorFlow framework, including the whole GRU4REC architecture, the training procedure and the evaluation procedure.

GRU4REC with symmetric matrix. To address the problem of GRU4REC demonstrated in Section 1, we replace the output of the GRU, i.e. the embedding vector of the current session, with a symmetric matrix. More specifically…

GRU4REC with fully connected layer. In addition to the above models, we also create a controlled-experiment model, which is based on the GRU4REC model but adds a fully connected layer right after the output of the GRU to expand the GRU output from $m$ dimensions to $m(m+1)/2$ dimensions (from 10 to 55 in the configuration of Table 2).

4.1 ACM RecSys 2015 Challenge Dataset

In order to evaluate the performance of the three models described above, we constrained the total number of their parameters to the same range. The details of the network architectures are shown in Table 2.

Table 1 shows the results of testing the three models on the last day of the ACM RecSys 2015 Challenge dataset for 10 epochs. After tuning on the validation set, we set lr = 0.002 and batch size = 256 for all experiments. Since the GRU4REC and GRU4REC+FC models have fewer hidden units, dropout = 0.8 shows better performance for them, while dropout = 0.5 performs better for the symmetric matrix model. All cases use the BPR loss and the Adam optimizer.

Method recall@20 mrr@20
GRU4REC 0.389 0.135
GRU4REC+FC 0.515 0.515
GRU4REC+Matrix 0.749 0.748
GRU4REC(1000) 0.632 0.247
Table 1: Results for the RSC15 dataset.

We additionally include the results from SESSION_BASED, which uses 1000 hidden units for the GRU4REC model. It is clear that by combining the symmetric matrix embedding method with GRU4REC, we can use fewer parameters to achieve better recall@20 and mrr@20 performance.

Model GRU4REC GRU4REC+FC GRU4REC+Matrix
shape params total shape params total shape params total
input_embedding (37958, 32) 1214656 1214656 (37958, 32) 1214656 1214656 (37958, 32) 1214656 1214656
softmax_W (37958, 64) 2429312 3643968 (37958, 55) 2087690 3302346 (37958, 32) 1214656 2429312
softmax_b (37958,) 37958 3681926 (37958,) 37958 3340304 - - -
gru_cell/dense/kernel - - - (10, 55) 550 3340854 - - -
gru_cell/dense/bias - - - (55,) 55 3340909 - - -
gru_cell/gates/kernel (96, 128) 12288 3694214 (42, 20) 840 3341749 (560, 1056) 591360 3020672
gru_cell/gates/bias (128,) 128 3694342 (20,) 20 3341769 (1056,) 1056 3021728
gru_cell/candidate/kernel (96, 64) 6144 3700486 (42, 10) 420 3342189 (560, 528) 295680 3317408
gru_cell/candidate/bias (64,) 64 3700550 (10,) 10 3342199 (528,) 528 3317936
Table 2: Network Parameters For RecSys15 Dataset.

4.2 Last.FM playlists Dataset

For the last.fm music playlists dataset, we applied the same network structure for each model as mentioned above; the specific parameters are shown in Table 3.

Model GRU4REC GRU4REC+FC GRU4REC+Matrix
shape params total shape params total shape params total
input_embedding (200668, 32) 6421376 6421376 (200668, 32) 6421376 6421376 (200668, 32) 6421376 6421376
softmax_W (200668, 64) 12842752 19264128 (200668, 55) 11036740 17458116 (200668, 32) 6421376 12842752
softmax_b (200668,) 200668 19464796 (200668,) 200668 17658784 - - -
gru_cell/dense/kernel - - - (10, 55) 550 17659334 - - -
gru_cell/dense/bias - - - (55,) 55 17659389 - - -
gru_cell/gates/kernel (96, 128) 12288 19477084 (42, 20) 840 17660229 (560, 1056) 591360 13434112
gru_cell/gates/bias (128,) 128 19477212 (20,) 20 17660249 (1056,) 1056 13435168
gru_cell/candidate/kernel (96, 64) 6144 19483356 (42, 10) 420 17660669 (560, 528) 295680 13730848
gru_cell/candidate/bias (64,) 64 19483420 (10,) 10 17660679 (528,) 528 13731376
Table 3: Network Parameters For Last.fm Dataset.

Since the music playlists dataset is quite different from the e-commerce click sequence dataset, after tuning on the validation set we set lr = 0.0012 for all cases, while the batch size and dropout configurations remain the same.

Table 4 shows the results for the last.fm music playlists dataset. We notice the same trend as in the results on the RecSys15 dataset.

Method recall@20 mrr@20
GRU4REC 0.027 0.022
GRU4REC+FC 0.054 0.054
GRU4REC+Matrix 0.164 0.164
GRU4REC(1000) 0.121 0.053
Table 4: Results for the last.fm dataset.

References

  • (1) Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. Session-based Recommendations with Recurrent Neural Networks. International Conference on Learning Representations (2016).
  • (2) Balázs Hidasi and Alexandros Karatzoglou. Recurrent Neural Networks with Top-k Gains for Session-based Recommendations. Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM ’18), pages 843-852. Torino, Italy, 2018.
  • (3) Veronika Bogina and Tsvi Kuflik. Incorporating dwell time in session-based recommendations with recurrent neural networks. CEUR Workshop Proceedings, Vol. 1922, 2017.
  • (4) Massimo Quadrana, et al. Personalizing session-based recommendations with hierarchical recurrent neural networks. Proceedings of the Eleventh ACM Conference on Recommender Systems. ACM, 2017.
  • (5) https://en.wikipedia.org/wiki/Eigendecomposition_of_a_matrix
  • (6) Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The Million Song Dataset. Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR 2011). http://millionsongdataset.com/lastfm/
  • (7) Dietmar Jannach and Malte Ludewig. When Recurrent Neural Networks meet the Neighborhood for Session-Based Recommendation. Proceedings of the Eleventh ACM Conference on Recommender Systems (RecSys ’17), pages 306-310, 2017.