With the exponential growth of data and information on the Internet, recommendation system plays a critical role in reducing information overload. Recommendation systems are widely deployed on many online services, including E-commerce, social networks and online news systems. How to design an effective recommendation system has been a fundamental problem in both academia and industry.
The key for recommendation system is to model the users’ preferences based on their interactions (e.g., clicks and rating) with the items. One of the most popular recommendation methods are known as collaborative filtering (CF) . Its basic idea is to match the users with similar item preferences. Among the various collaborative filtering methods, matrix factorization [10, 25]
is the mostly used one. However, these matrix factorization based methods are regarded as the linear methods, which are difficult to model the user-item interactions. Then following works use the deep neural networks to model the user-item relationships. Despite of their success, these CF-based methods only aim to model the direct links, i.e. the first-order relationship between the users and items. However, for a graph, only preserving the first-order relationships between the nodes is not enough to characterize the network structure and thus cannot achieve good performance [4, 2].
To preserve the second-order local structure in the networks, network embedding is an effective way 
. Network embedding aims to embed nodes into a low-dimensional vector space with the goal of capturing the low-order and high-order topological characteristics in graphs[16, 19, 9, 20, 24]. Although network embedding is able to incorporate local structures, they mainly target on tasks of common link prediction and node classification. Few of them deal with the task of recommendation and thus they seldom consider some specific properties of recommendation, which makes them difficult to get a good performance on recommendation. Last but not least, few of these methods can be applied to the billion-scale networks.
To extend network embedding to recommendation, we meet three challenges. (1) Diversity of user interest. User interest is always diverse and the diverse recommendation can help users explore new items of interest. Therefore, diversity has been a very important measure to evaluate the recommendation system . However, existing network embedding methods seldom consider the diversity. (2) Dynamic changes of user interest. User’s preference is dynamic and how to model such a dynamic property is another challenge. (3) Scalability of recommendation system. Existing recommendation scenario often has a huge number of users and items, which is a serious problem with a scale beyond most of existing network embedding methods.
To address these challenges, we propose RNE, a scalable Recommendation-based Network Embedding method. In our method, when discovering the local structure of a user, we will not model all the items the user clicked. Instead, we propose a sampling method, which considers the diversity and dynamics of the user interest, to sample a portion of the items the user has clicked as the user’s neighbors. In this way, the sampling method not only can incorporate the important properties of recommendation, i.e. the diversity and dynamics, to improve the recommendation accuracy, but also reduce the computational complexity of the algorithm. Furthermore, we deploy the algorithm on a recommendation system based on the Parameter Server to do distributed and parallel computing, which further facilitates the large-scale training available.
In summary, the contributions of the paper can be listed as follows:
We propose a network-embedding-based recommendation method, named RNE. When modeling the local structures between the users and items, our method is able to incorporate the dynamics and diversity of the user interest to produce more accurate and diverse recommendation results.
We implement our recommendation algorithm in a distributed way based on parameter server, which jointly makes the system available for billion-scale recommendation.
Experimentally, we deploy the whole system on a recommendation scenario of Taobao. Online A/B tests demonstrate that our method is able to achieve more accurate results compared with CF and greatly improve the diversity of the recommendation results. Experiments on offline dataset Pinterest also demonstrate the quality of our method.
2 Related Work
2.1 Collaborative Filtering
Recommendation algorithms and systems are well-investigated research fields. In our work, we are only given the user-item interaction data. Therefore, we mainly introduce the CF-based recommendation methods and omit the discussions of content-based recommendation methods and the hybrid recommendation methods.
Collaborative Filtering exploits the interaction graph between the users and items to give the recommendation lists to users. Its basic idea is to match the users which have similar item preferences. Earlier CF methods mainly use the matrix factorization on the user-item matrices to obtain the latent user factors and item factors [5, 10, 17]. The user factors and item factors together aim to reconstruct the original user-item matrices. However, the matrix factorization is just the linear-based methods, which is difficult to capture the user-item relationships. To overcome such a drawback, following works use the deep neural networks to perform collaborative filtering [21, 18]. However, most of the CF-based methods only aim to model the pairwise relationships between the user and item but omit their local structures. And many graph-based works have demonstrate that local structures like second-order relationships are very important for capturing graph structures . In this way, existing CF-based methods are sub-optimal for capturing the relationships between user and items.
2.2 Network Representation Learning
Network embedding has been demonstrated as an effective methods for modeling local and global structures of a graph. It aims to learn a low-dimensional vector-representation for each node. DeepWalk  and Node2vec  propose to use the random walk and skip-gram to learn the node representations. LINE  and SDNE  propose explicit objective functions for preserving first- and second-order proximity. Some further works [15, 2] use the matrix factorization to factorize high-order relation matrix. Aforementioned methods are designed for homogeneous networks. Then some following embedding methods for heterogeneous networks are proposed, like Metapath2vec  , HNE , BiNE  and EOE 
. Some works further focuses on knowledge graph embedding. Although these network embedding methods are able to preserve the local structures of the vertices, most of them are not specifically designed for the task of recommendation. They do not consider some specific properties of the recommendation tasks like the diversity and dynamic changes of user interest, the scalability issues of large-scale recommendation tasks. Therefore, how to propose an effective network embedding method for billion-scale recommendation is still an open problem.
In summary, we compare our method and the related works in Table 1. Our method is specifically designed for the recommendation scenario and thus consider some specific properties. Furthermore, the proposed method is very scalable and thus can apply to billion-scale recommendations.
3 The Methodology
In our scenario, we have a large number of users and items. Each user may have different ways to interact with the items. For example, the user may view the items, collect the items or buy the items. In this way, we can build the user-item interaction graph, formally formulated as . Here denotes the total of users and denotes the total of items. denotes the set of nodes in . If a user views, collects or buys an item , there is an edge between and . We use to denote the edges connected to the node . We assume that is connected. The recommendation problem is that given a user , we hope to recommend some personalized items to the user based on his previous behavior.
3.1 Network Embedding for Recommendation
Given the user-item interaction graph , we aim to map each user and item to a common low-dimensional latent space, where user can be embedded as and item can be embedded as . Then with the embeddings for each user and item, we can retrieve the similar items for the user as his recommendation results.
To achieve this, we propose our method, whose framework can be shown in Figure 1
. It consists of the embedding-lookup layer, embedding layer and softmax layer. The embedding-lookup layer helps us obtain the embeddings for the users and items. The embedding layer and softmax layer together model the interactions between the users and items to update the embedding-lookup layer. Then we introduce the designed loss functions to update the embeddings.
We first consider how to model the local structure of a user in the given user-item graph. In the original space, the empirical distributions given a user can be defined as:
where is the weight between user and item and is the degree of user .
Then we hope to estimate the local structure of a user in the embedding space. Word2vec inspires us to use the inner product between two vertices to model their interactions. Then in our work, given a user
, we define the probability of itemgenerated by user as:
where means the transpose of a matrix.
Eqn. 1 is a softmax-like loss function, which defines the conditional distributions of user over its neighborhoods, i.e. the entire item set, in the embedding space.
With the empirical distributions on the original network and reconstructed distributions on the embedding space, we can learn the embedding by making the defined probability specified by the low-dimensional representations be close to the empirical distributions . We use the KL-divergence to measure the distance between the distributions. Then the loss functions can be defined as:
where denotes the prestige of user and we set .
Minimizing Eq. 2 will make the vertices with similar neighbors similar to each other. Therefore, it can not only model the observed links on the graph, but also preserve the local structures for each node.
3.2 Recommendation-based Sub-sampling
However, aforementioned network embedding meets two challenges for large-scale recommendation: (1) Minimizing Eqn. 2 is time-consuming since for each edge it needs to run over the entire set of the items when evaluating . In this way, the whole complexity is , which is unbearable for real recommendation systems. (2) Minimizing Eqn. 2 only considers the topology of the graph. It does not consider the diversity and the time decay of the user interest, which are very important properties for recommendation systems.
To reduce the complexity, we first adopt negative sampling as many methods do . For each positive edge , we will sample some negative edges according to predefined distributions . By performing negative sampling, the objective function for each edge can be reformulated as:
where is the number of negative samples for each user-item pair, .
Although negative sampling can reduce the time complexity from to , for billion-scale recommendation, a complexity linear to the number of edges is still a great challenge.
To further reduce the complexity, we only select a portion of the items the user has clicked to obtain his behavior sequence. Then the question comes to how to select the items to effectively represent the user’s interest. Here, we mainly consider two properties specified for recommendation. (1) The diversity of user interest: User interest is always diverse. A user will always focus on the items of more than one cluster. (2) The time decay of user interest: User interest is always dynamic. More recent user behavior is more reliable to reflect the recent user interest. Therefore, we should more focus on recent user behavior. Based on these two considerations, we define the selection probability for each user-item pair as follows:
where is the hours of the item from the most recent item, is the cluster index of item and , is set to . Then for each user, we will sample samples according to the defined probability in Eqn. 4 to represent his behavior sequence. Then in this way, the complexity can be reduced from to , which is linear to the number of nodes.
In summary, on the one hand, if a user more recently shows the interest to an item, the item should have a larger probability to be sampled. On the other hand, the method is prone to sample the items of the clusters clicked less times by the user. In this way, our method may cover more clusters to ensure the diversity. Therefore, such a sampling strategy can simultaneously model the diversity and time decay of the user interest Furthermore, with the sampling strategy, we do not need to model all the edges in one iteration but instead for each user we only model a portion of its preferred items as the user’s behavior sequence. It significantly reduce the time complexity.
In this section, we will introduce the technical implementation of the proposed RNE. The whole process can be divided into two phases: offline model training and online retrieval. This section will describe them in detail.
3.3.1 Off-line Model Training
To train the proposed RNE, we utilize the Stochastic Gradient Descent (SGD) on the loss function of Eqn.3 to update the node embeddings. In detail, we use to denote all the positive edges sampled by the method we proposed before. Then for each , we can update their embeddings as follows:
where is the indicator function that if , , otherwise . is the negative neighborhoods of vertex . denotes the learning rate. Similarly, we can update embedding for an item in a similar way, which we will not discussed more.
From Eqn. 5, when given a positive edge, we can update their embeddings. Then we will go over all the pair of positive edges for several iterations to update their embeddings. The whole algorithm can be summarized in Alg. 1.
From Alg. 1, we find that the learning process from Line to Line is independent for different edges, which inspires us to use some parallelization mechanism to implement it. Then we deploy the whole algorithm on the parameter server, which implements a data-parallelization mechanism. In detail, from Eqn. 5 we find that to update a node’s embedding, we only need to know the node’s previous embeddings, the node’s neighborhoods and their embeddings. Therefore, we can resort to parameter server to implement such a process in a parallelized way. The main workflow of the system is built as follows: (1) In each iteration, the server will assign each worker a subset of the vertices of the graph . (2) Each worker will pull the assigned vertices from the server and calculate the positive and negative neighborhoods for the assigned vertices. Then with positive and negative sets, each worker can update the embeddings of the assigned vertices according to Eqn 5. (3) After updating, each worker will push his assigned vertices’ embddings to the server. Such a training process will be iterated several times.
3.3.2 Online Efficient Nearest Neighbor Search
For online recommendation, we use the nearest neighbor search on the learned embedding space to make recommendations. That is, given a query user , we can recommend items whose embeddings are the -nearest-neighbors (-nn) of the query user’s embedding . To achieve the -nn search, we use the Faiss library ， which is an efficient implementation for state-of-the-art product-quantization methods. Given that RNE is trained offline and all the user and item embeddings are computed via Parameter Server and saved in database, the efficient -nn search enables the system to recommend items online.
The goal of RNE is to produce high-quality and scalable recommendations for real-world systems. Therefore, we conduct comprehensive experiments in two ways: Online A/B tests and Offline experiments.
We use two real-world datasets, i.e. Ali-mobile taobao and Pinterest in this paper.
Ali-mobile taobao: It is a mobile recommendation scenario deployed on Taobao, the largest E-commerce platform in China. The dataset is extremely large. It has about billion users, tens of million items and a total of about one hundred billion edges. Each edge denotes whether the user has clicked the products. We deploy our algorithm on the service to do online A/B test to evaluate our method.
Pinterest: The dataset is an image recommendation dataset constructed by . We filter the users which have very few interactions with the items and only retain the users which have more than interactions. After the pre-processing, the dataset consists of thousand users, thousand items and million user-item edges. Each edge denotes whether the user has pinned the items.
4.2 Online A/B tests
The ultimate goal of the recommendation system is to lift the user’s interest in the items. Therefore, we perform random A/B experiments on Ali-mobile taobao to demonstrate this, where a random set of users obtain the recommendation results of RNE and another obtain the results of CF-based methods. Any difference in the engagement of the items between the two groups can truly reflect the recommendation quality of two methods. Note that here we only use one baseline because deploying many methods online to do A/B tests will cost a lot of resources. And the reason why we choose CF is that it is well investigated for recommendation and existing network embedding methods cannot scale to billion-scale dataset. For more comparisons with state-of-the-art methods, we do offline experiment, which we will introduce in detail later.
We use the following six metrics to measure the recommendation quality.
AVD (Averaged View Depth): The metric denotes how deep a user views the page. It measures the recommendation quality.
ACN (Averaged Click Number): The metric measures the number of clicks on the items for each user in average. It measures the recommendation quality.
P-CTR (Page Click-through Rate): for a page , . It measures the recommendation quality.
U-CTR (User Click-through Rate): for a user , . It measures the recommendation quality.
Re-C (Recommended number of clusters): The averaged number of clusters recommended to users. The clusters are obtained by using our clustering algorithm. The metric measures the recommendation diversity.
CK-C (Clicked number of clusters): The averaged number of clusters clicked by users. It measures both the recommendation quality and the diversity.
We find that RNE can achieve a significant improvement in terms of AVD and ACN over the CF. It indicates by using the results of RNE, users are more willing to go deeper to view more items and click more items, which indirectly demonstrates the ranking quality of RNE.
In terms of the two CTR metrics, a popular and well-accepted metric to evaluate the recommendation quality, our proposed method also achieves a better result than CF. It further demonstrates that RNE is able to produce personalized items for users. The reason for a better recommendation quality is twofold. (1) Our method is able to capture the local structures between the users and items. (2) Our method considers the dynamic change of the user’s interest.
We find that RNE achieves a higher Re-C compared with CF, which indicates that our recommended results are from more clusters. The reason is that our proposed method incorporates the diversity issue into the model design.
More importantly, our method achieves a higher CK-C than CF, which demonstrates that not only our method can produce more diverse recommendations, but also the users are willing to click these diverse recommended items. It indicates that our method is able to improve the recommendation quality while improving the recommendation diversity.
Under the billion-scale scenario, RNE can be deployed online and still obtain good results, which demonstrates the superiority of our method.
In this section, we give some showcase to see some intuitions regarding the embeddings we learn. After the learning process of RNE, all the items will have embeddings. Then in this experiment, given a query item’s embedding, we aim to find the most similar items whose embeddings have the smallest distance with the query. Then we display both the query image and the recommended images in Figure 2.
In Figure 2, the query item is a princess-style educational toy for girls. When we look at the returned results, these images belong to different categories with the query, like plasticine and origami. But all of them are for fun and a majority of them are also princess-style. It demonstrates that our method is able to find more categories of items but retain the primary style of the item. In Figure 2, the query item is a woman sweatshirt of the brand of Peacebird. The returned images are all coat, sweater or sweatshirt of the brand of Peacebird or Only. Similarly, the returned images and the query image are all casual style but belong to different fine-grained categories. Moreover, actually the brand of Peacebird and Only have very similar styles and our method can learn their inherit relationships.
In summary, in our method, we do not have the item features and the direct relationships between items. We are only given the user-item interaction graph. Although in this case, our method still can model the item relationships by using the user behaviors as the bridge. It demonstrates that by using the network embedding method we propose, the learned embeddings can capture the local relationships between the entities.
4.4 Offline Experiment
To compare more baselines to get comprehensive results, we conduct the offline experiments on the dataset of Pinterest. We randomly sample 90 user-item pairs as the training set and the rest as the testset. For training set, we use -fold cross-validation to tune the parameters for all the methods. Note that in this dataset, we do not have the cluster and time information for the item. So we uniformly sample the items to do training. To evaluate the performance, we use the following three metrics: Normalized Discounted Cumulative Gain (NDCG), Mean Reciprocal Rank (MRR) and Hit Rate (HR). NDCG and MRR will consider the rank of the hit and will assign higher scores to hits at top ranks. While HR will only evaluate whether the test items are hit or not. We calculate all the metrics for the test users and report the average score.
We first use the advanced CF-based methods GMF, MLP and NCF  as baseline methods. We perform the same process of parameter search as the work  did to select the optimal parameters. For network embedding methods, since we only have the graph topology, in this case LINE  and node2vec  are state-of-the-art network embedding methods, so we choose them as the baselines. For LINE, we use LINE with the default parameter settings. For node2vec, we also use the default settings except for the bias parameters , which we conduct the grid search from . The embedding dimension of them is all set as 128.
, we find that RNE achieves a better performance than all the CF-based methods. The reason is that RNE is able to capture the local structure of each user while CF-based methods only focus on the direct links the user has clicked. It demonstrates that capturing the local structures on the user-item graph is important for recommendation. LINE, node2vec and RNE achieve similar performance in different evaluation metrics and scenarios. But our method runs much faster than node2vec and LINE, which will be discussed later. Therefore, RNE is a better balance between accuracy and efficiency.
Now we discuss the training time of LINE, node2vec and RNE. For a fair comparison, we do not use the distributed strategy for RNE. From Figure 3, we find that RNE can boost the running time over LINE and node2vec. Specifically, when the training edges increase from million to million, the running time improvement of RNE compared with LINE will be larger and larger, from x to x. When the edges continuously increase to the billion-scale dataset like the Ali-mobile taobao dataset, it is difficult for LINE and node2vec to obtain the results. But RNE can still obtain a good result. The reasons why our method can scale to billion-scale dataset are twofold: (1) The proposed sampling method avoids us running over all the edges in the graph. (2) Our method can be deployed on distributed system for parallel computations.
In summary, RNE has a good scalability, which is much more efficiency than baseline methods and can scale to billion-scale recommendation scenario, meanwhile RNE do not sacrifice its recommendation accuracy.
In this paper, we propose a novel network embedding method named RNE for scalable recommendation. The proposed network embedding method is able to capture the local structures on the user-item graph to achieve a better recommendation quality. Specifically, to consider the specific properties for recommendation, i.e the diversity and time-decay of user interest, we design a sampling method for embedding process to incorporate these properties. And the sampling method also guarantees the scalability of the proposed method while almost preserving the recommendation quality. We also deploy our algorithm on parameter server to make it available for large-scale recommendation. Experimental results on online A/B tests and offline experiments all demonstrate the superiority of the proposed method.
For the future work, we may consider the user and item features, which can further address the sparsity and cold-start problem. We also want to analyze the role of features and topology structures for recommendation.
We would like to thank all the colleagues of our team and all the members of our cooperative team: the search engine team in Alibaba. They provide many helpful comments for the paper. We also would like to thank the support of the Initiative Postdocs Supporting Program and the valuable comments provided by all the reviewers.
-  Adomavicius, G., Kwon, Y.O.: Improving aggregate recommendation diversity using ranking-based techniques. TKDE 24(5), 896–911 (2012)
-  Cao, S., Lu, W., Xu, Q.: Grarep: Learning graph representations with global structural information. In: CIKM. pp. 891–900 (2015)
-  Chang, S., Han, W., Tang, J., Qi, G.J., Aggarwal, C.C., Huang, T.S.: Heterogeneous network embedding via deep architectures. In: SIGKDD. pp. 119–128. ACM (2015)
-  Cui, P., Wang, X., Pei, J., Zhu, W.: A survey on network embedding. TKDE (2018)
-  Deshpande, M., Karypis, G.: Item-based top- n recommendation algorithms. Acm Trans.inf.syst 22(1), 143–177 (2004)
-  Dong, Y., Chawla, N.V., Swami, A.: metapath2vec: Scalable representation learning for heterogeneous networks. In: SIGKDD. pp. 135–144. ACM (2017)
-  Gao, M., Chen, L., He, X., Zhou, A.: Bine: Bipartite network embedding (2018)
-  Geng, X., Zhang, H., Bian, J., Chua, T.S.: Learning image and user features for recommendation in social networks. In: ICCV. pp. 4274–4282 (2015)
-  Grover, A., Leskovec, J.: node2vec:scalable feature learning for networks. In: SIGKDD. pp. 855–864 (2016)
-  Harvey, M., Carman, M.J., Ruthven, I., Crestani, F.: Bayesian latent variable models for collaborative item rating prediction. In: CIKM. pp. 699–708 (2011)
-  He, X., Liao, L., Zhang, H., Nie, L., Hu, X., Chua, T.S.: Neural collaborative filtering pp. 173–182 (2017)
-  Herlocker, J.L., Konstan, J.A., Terveen, L.G., Riedl, J.T.: Evaluating collaborative filtering recommender systems. Acm Transactions on Information Systems 22(1), 5–53 (2004)
-  Johnson, J., Douze, M., Jégou, H.: Billion-scale similarity search with gpus. arXiv preprint arXiv:1702.08734 (2017)
-  Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
-  Ou, M., Cui, P., Pei, J., Zhang, Z., Zhu, W.: Asymmetric transitivity preserving graph embedding. In: SIGKDD. pp. 1105–1114 (2016)
-  Perozzi, B., Al-Rfou, R., Skiena, S.: Deepwalk: Online learning of social representations. In: SIGKDD. pp. 701–710. ACM (2014)
-  Sarwar, B., Karypis, G., Konstan, J., Riedl, J.: Item-based collaborative filtering recommendation algorithms. In: International Conference on World Wide Web. pp. 285–295 (2001)
-  Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., Mei, Q.: Line: Large-scale information network embedding. In: WWW. pp. 1067–1077 (2015)
-  Wang, D., Cui, P., Zhu, W.: Structural deep network embedding. In: SIGKDD. pp. 1225–1234 (2016)
-  Wu, Y., Dubois, C., Zheng, A.X., Ester, M.: Collaborative denoising auto-encoders for top-n recommender systems pp. 153–162 (2016)
-  Xiao, H., Huang, M., Zhu, X.: Transg: A generative model for knowledge graph embedding. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). vol. 1, pp. 2316–2325 (2016)
-  Xu, L., Wei, X., Cao, J., Yu, P.S.: Embedding of embedding (eoe): Joint embedding for coupled heterogeneous networks. In: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. pp. 741–749. ACM (2017)
-  Zhang, Z., Cui, P., Li, H., Wang, X., Zhu, W.: Billion-scale network embedding with iterative random projection. arXiv preprint arXiv:1805.02396 (2018)
-  Zhou, X., Chen, L., Zhang, Y., Qin, D., Cao, L., Huang, G., Wang, C.: Enhancing online video recommendation using social user interactions. VLDBJ 26(5), 637–656 (2017)