1 Introduction
Matrix factorization into low-rank components plays a vital role in many applications, such as recommender systems [1], computational biology [2], and computer vision [3]. Standard approaches to matrix factorization involve either explicitly looking for a low-rank approximation via non-convex optimization [4, 5] or resorting to convex, nuclear-norm-based methods [6, 7]. These methods can be viewed as learning dense, low-dimensional vector embeddings of the row and column entities (for example, users and movies), the inner products of which best approximate the entries of the target matrix (for example, ratings). Such approaches to MF suffer from two drawbacks. First, practical datasets are often highly imbalanced: some users rate many items, whereas others rarely provide feedback, and some items are more popular than others, making the observed entries highly non-uniform [8, 9]. Second, modern datasets also contain vast amounts of side information, incorporating which typically aids prediction [10, 11, 12, 13, 14]. For example, one might have access to user social networks, product co-purchasing graphs, gene interaction networks, and so on. While it seems natural to enforce that the learned embeddings be faithful to the side information present, there is also implicit information in the data. For instance, two users consistently providing the same ratings to the same set of movies implicitly suggests similarity in their preferences, irrespective of any explicit links that may or may not exist between them. Indeed, low-rank representations inherently look to achieve this by assuming that all users (in this case) lie in a common low-dimensional subspace. However, explicitly modeling such implicit similarity tends to yield better results. Recent advances have demonstrated that exploiting second-order co-occurrences of users and items indeed yields better recommendations [15]. This implicit information is especially useful when the dataset is highly sparse and imbalanced, as is the case in most real-world applications.
In this paper, we develop a framework that can both seamlessly incorporate side information when it exists and use higher-order information in the data to learn user and item embeddings. The method we propose tackles both the issue of sparse data and the use of side information in an elegant manner. Furthermore, we develop an algorithm that precludes the computation of higher matrix powers (to be made clear in the sequel), preserving the sparsity of the original problem and making computations highly efficient.
We model the target matrix, combined with the side information, as a graph, and then learn low-rank vector embeddings that best approximate the local neighborhood structure of the constructed graph. The key idea is that, given a transition probability matrix P for a graph G, the (i, j) entry of P^t encodes the probability of jumping from node i to node j in t steps. Thus, even when the given data is very sparse, we can use it to glean missing entries, and use those to learn embeddings for users and items.
Let us consider a movie recommendation system. We construct a graph with all the users and movies as nodes. A (weighted, undirected) edge exists between a user node and a movie node if the corresponding entry in the target matrix exists, i.e., the user has rated the movie. Side information such as a social network among users, when present, is also used to form edges between user nodes. We aim to approximate the multi-step transition probability, i.e., the probability of jumping between two nodes within a small, predefined number of hops, via a function proportional to the inner product between the node embeddings. This objective naturally encodes both the implicit and the side information: transition is more likely between user nodes preferring a similar set of items, and a dense clique in the user social network also increases the probability of co-occurrence within a small number of steps. Since we consider co-occurrence within multiple steps, our embeddings can account for higher-order implicit similarities, rather than only the usual pairwise similarity between users or movies, or just second-order co-occurrences [15]. Our experiments reinforce our claims that taking higher-order information into account indeed improves prediction accuracy on a variety of datasets. We consider datasets with binary ratings as well as star ratings, and those with and without side information. We call our method HOMF: Higher Order Matrix Factorization.
HOMF is closely related to recent advances in representation learning in natural language processing [16, 17] and on graphs [18, 19, 20]; [21] provides a comprehensive survey of graph embedding techniques. While these works also consider co-occurrences of words within a small window in a sentence, or of nodes within a few steps on a graph, their objective is equivalent to factorizing the logarithm of such local co-occurrences [22]. In contrast, HOMF directly attempts to approximate the multi-step transition probabilities. Moreover, in this paper, we explore various methods to construct the edge weights of our graph from both rating-like and binary target matrices. Also, unlike the negative sampling and stochastic gradient descent approach in [16, 18], we derive a coordinate descent based method for solving the corresponding optimization problem. Finally, a recent line of work has focused on using convolutional neural networks defined on graphs [23] to estimate the embeddings for users and items [24]. The key idea in these methods is that node embeddings can be learned via an aggregation of those of their neighbors, and the "scale" of the convolutional filter acts as the number of hops in a random walk. The method we propose is significantly simpler, and the optimization problem is more efficient. Despite this, we show that our method is competitive with, and in several cases outperforms, methods that rely on deep neural nets to learn the embeddings.

The rest of the paper is organized as follows. We first review related work. In the next section, we formally set up the problem we wish to solve and motivate our algorithm. In Section 3 we describe our algorithm, comment on its computational complexity, and remark on its generality. We provide extensive experimental results in Section 4. In Section 5, we discuss possible future directions, and we conclude the paper in Section 6.
1.1 Related Work
1.1.1 Incorporating Side Information:
While several methods for vanilla matrix factorization (see (1)) have been proposed, incorporating explicit side information is a much more recent phenomenon. Given attributes or features for the row and/or column entities, [13, 25] consider the row and column factors to be linear combinations of the features; this was extended to nonlinear combinations in [26]. If graph information is known beforehand, the methods proposed in [11, 14] can be applied to solve a slightly modified program, (2). Further, note that one can construct pairwise relationships between entities from the feature representations. Methods that can implicitly glean relationships between row and/or column entities have been less forthcoming, an exception being [15], which uses only second-order information.
1.1.2 Random Walks on Graphs:
[27, 28] and the references therein view collaborative filtering as a function of random walks on graphs. Indeed, the canonical user-item rating matrix can be seen as a bipartite graph, and several highly scalable methods have been proposed that take this viewpoint [29]. Methods that incorporate existing graph information in this context have also been studied [30, 12]. [31] consider metrics such as average commute times of random walks to automatically determine the similarity between nodes and apply them to recommender systems, but also note that such methods do not always yield the best results. Similarly, [32] consider local random walks on a user-item graph and resort to a PageRank-style algorithm. Furthermore, they require the graph to have a high average degree, something most applications we consider will not have: canonical datasets might have a few nodes with high degree, but most nodes will have very low degree.
1.1.3 Learning Node Embeddings on Graphs:
Recently, ideas from learning vector representations of words [16] have been used to obtain vector embeddings for nodes in a graph. Specifically, the authors in [18] treated random walks on a graph as equivalent to "sentences" in a text corpus, and applied the standard Word2Vec method to the resulting dataset. [33] showed that such a method is equivalent to factorizing a matrix polynomial after logarithmic transformations. In contrast, we directly aim to factorize a matrix polynomial with no transformations. Furthermore, we develop a method that makes computing the higher matrix powers in this polynomial redundant. These methods that rely on learning node embeddings of a graph can be viewed through a common lens: the embeddings are typically a weighted (nonlinear) combination of those of their neighbors in the graph. [24] use graph convolutional neural networks to learn the embeddings, in the presence of side information presented as graphs. Finally, [34] consider a neighborhood-weighted scheme and, while the proposed method is not practical, provide theoretical justification for learning such embeddings.
2 Problem Setup
We assume we are given a partially observed target matrix M ∈ R^{m×n}. Let the set of observed entries of M be denoted by Ω, where typically |Ω| ≪ mn. Given a rank parameter k, the goal in standard matrix factorization (MF) is to learn k-dimensional embedding vectors corresponding to the row and column indices of the matrix M. The standard matrix factorization algorithm aims to solve a problem of the form (bias variables are typically included, but we omit them here for ease of explanation):
min_{W,H}  ‖P_Ω(M − WH^⊤)‖_F² + λ(‖W‖_F² + ‖H‖_F²)    (1)
where P_Ω is the projection operator that retains those entries of the matrix that lie in Ω, and W ∈ R^{m×k}, H ∈ R^{n×k}.
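To make (1) concrete, one alternating least-squares pass over the column factors can be sketched as follows. This is an illustrative sketch of the standard approach, not the exact routine of any cited solver; the function and variable names are ours.

```python
import numpy as np

def als_update_H(M, mask, W, H, lam=0.1):
    """One alternating pass for (1): update each row of H with W fixed.

    M    : (m, n) target matrix (unobserved entries are ignored)
    mask : (m, n) boolean array, True where an entry is observed (Omega)
    W    : (m, k) row-entity factors, held fixed
    H    : (n, k) column-entity factors, updated in place
    lam  : ridge regularizer (lambda in (1))
    """
    k = W.shape[1]
    for j in range(M.shape[1]):
        rows = np.flatnonzero(mask[:, j])    # users that rated item j
        if rows.size == 0:
            continue
        Wj = W[rows]                         # restrict to observed rows
        # Normal equations of the per-column ridge regression
        H[j] = np.linalg.solve(Wj.T @ Wj + lam * np.eye(k),
                               Wj.T @ M[rows, j])
    return H
```

Alternating such passes over W and H (with the analogous row update) monotonically decreases the objective in (1).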
Incorporating side information: Following recent work in [11, 14], it is possible to model the relationships between row (and column) entities via graphs. For example, in the case of a recommender system there might exist a graph G_r among users, such as a social network, and a product co-purchasing graph G_c among items. Therefore, it is reasonable to encourage users belonging to the same social community, or products often co-purchased together, to have similar embeddings. Current state-of-the-art techniques for the MF problem with side information encourage the embeddings of the row and column entities to be faithful to the eigenspace of the corresponding graph Laplacians:
min_{W,H}  ‖P_Ω(M − WH^⊤)‖_F² + λ(tr(W^⊤ L_r W) + tr(H^⊤ L_c H))    (2)
where L_r and L_c are the graph Laplacians corresponding to G_r and G_c, respectively.
Our approach, HOMF: While the aforementioned methods take side information into account, they cannot handle sparsity in the data. Specifically, if the observed matrix M is highly sparse and the observed nonzero pattern is highly non-uniform, it becomes hard to obtain reasonable solutions. In this paper, we look for a unified approach that can make use of both the explicit side information provided in the form of graphs and the implicit information inherent in the data, via higher-order relationships among the variables (rows and columns).
To this end, we propose constructing a (weighted, undirected) graph G containing all the row entities and column entities as nodes. The edges in the graph are constructed as follows:

If two row entities i and j are connected in G_r, we form an edge with weight f_1(G_r(i, j)). Here f_1 is some non-negative, monotonic function of the edge weight in G_r. The same procedure is repeated for column entities that are connected in G_c.

If i is a row node (e.g., a user) and j is a column node (e.g., a movie), we form an edge with weight f_2(M_{ij}) to encode the interactions observed in the data matrix M. f_2 should also be a non-negative, monotonic function, and can potentially be the same as f_1.

We then scale the side-information edges by a weight parameter α, and the data-matrix edges by (1 − α). This allows us to trade off the side information against the data, and serves the same purpose as the graph regularization parameter in (2).
As a result of our graph construction, if a user rates a movie highly, there will be a large edge weight between that user and that movie. Similarly, if two movies are connected to each other in G_c, we again expect the edge weight to be large. Figure 1 supports these claims: we find the nearest neighbors for different movies using the vector embeddings generated by HOMF, and see that movies rated similarly by users tend to cluster together; if two users' embeddings are close to each other, they will be recommended similar movies.
Typical choices for the functions f_1 and f_2 include:

Exponential: f(x) = exp(x)

Linear: f(x) = x + a, for some a > 0

Step: f(x) = 1 if x is at least some threshold, and 0 otherwise
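In code, these candidate weight functions might look as follows; the linear offset a and the step threshold tau are illustrative constants to be tuned, not values prescribed by the text.

```python
import numpy as np

# Candidate non-negative, monotonic edge-weight functions, applied
# entrywise to a rating or side-information weight x.
def f_exponential(x):
    return np.exp(x)                              # sharply favors high ratings

def f_linear(x, a=1.0):
    return x + a                                  # a > 0 keeps weights positive

def f_step(x, tau=3.0):
    return (np.asarray(x) >= tau).astype(float)   # binarize at threshold tau
```

For a 0-5 star matrix, f_step with tau = 3 turns the data into an unweighted "liked it" graph, while f_exponential keeps all edges but strongly separates high ratings from low ones.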
Appropriately arranging the nodes, the overall adjacency matrix A of our constructed graph can be represented as:
A = [ α f_1(G_r)        (1−α) f_2(M) ]
    [ (1−α) f_2(M)^⊤    α f_1(G_c)   ]    (3)
where, with some abuse of notation, the functions act elementwise on their (matrix) arguments. We denote by P the row-normalized version of this matrix A, so that each row of P sums to 1. P is thus a transition probability matrix (TPM), with P_{ij} indicating the probability that a random walk at node i jumps to node j in one step.
Let S = (1/T) Σ_{t=1}^{T} P^t for some positive integer T, and let Ω_S be the set of nonzero entries of S.
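To fix ideas, the construction of A in (3), its row normalization into P, and the multi-step matrix S can be sketched densely as follows. HOMF never materializes S (see Section 3); this dense version, with function names of our own choosing, is only for illustration.

```python
import numpy as np

def transition_matrix(Gr, Gc, Mw, alpha):
    """Assemble the block adjacency matrix of (3) and row-normalize it.

    Gr, Gc : (m, m) and (n, n) side-information graphs, already passed
             through the weight function f1 (zero matrices if absent)
    Mw     : (m, n) data matrix, already passed through f2
    alpha  : weight on side-information edges; data edges get (1 - alpha)
    """
    A = np.block([[alpha * Gr, (1 - alpha) * Mw],
                  [(1 - alpha) * Mw.T, alpha * Gc]])
    return A / A.sum(axis=1, keepdims=True)       # each row of P sums to 1

def multi_step_matrix(P, T):
    """S = (1/T) * sum_{t=1..T} P^t (dense; for illustration only)."""
    S, Pt = np.zeros_like(P), np.eye(P.shape[0])
    for _ in range(T):
        Pt = Pt @ P                               # P^t via repeated products
        S += Pt
    return S / T
```

Since every P^t is row-stochastic, S is row-stochastic as well, which is what makes its entries interpretable as multi-step transition probabilities.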
From a probabilistic point of view, the (i, j) entry of P^t is the probability of jumping from node i to node j in t random steps on G. Thus, S_{ij} measures the probability of jumping from i to j within T steps of a random walk on G. Hence, if this value is large, it is safe to assume that nodes i and j are highly related to each other, even if there was no edge between them in G. Our aim is then to ensure that the learned embeddings for these two nodes are similar. For example, two users rating the same set of movies have a higher probability of jumping to each other in a few steps on G; in other words, if two users have rated movies in a similar fashion, their future interactions will tend to be more alike than not. Similarly, two users with common friends in the side social network also have a higher probability of transitioning to one another within a small number of jumps.
Our objective in this paper is to solve:
min_{W,H}  Σ_{(i,j)∈Ω_S} (S_{ij} − w_i^⊤ h_j)² + λ(‖W‖_F² + ‖H‖_F²)    (4)
where W, H ∈ R^{(m+n)×k}. Letting w_i be the i-th row of W (and h_j the j-th row of H), we use w_i^⊤ h_j as a proxy for the predicted value of the corresponding entry in the target matrix.
We note that the design parameter α controls the overall weight of the side information, and the number of steps T determines how "local" our search space in the graph will be. Overall, (4) encourages nodes with a higher co-occurrence probability within a small number of steps to have similar vector representations. The adverse effects of noisy or missing side information can be reduced by appropriately tuning α. When there is no side information, the algorithm forces α to be 0 and ignores the side-information blocks. Further motivation for factorizing S is provided in Section 5.
2.1 Connections to Graph Convolutions
Graph convolutions (e.g., [35]) have recently been developed as a generalization of spatial convolutions to graphs. The authors in [23] showed that graph convolutions can be computed efficiently by considering polynomials of the graph Laplacian and performing matrix multiplications in the spectral domain. The key idea is that a node representation can be considered a nonlinear combination of the representations of its neighbors, where the scale of the convolutional filter determines the number of hops from a node. Specifically, given a scale s, one can compute the coefficients θ_i of the polynomial filter g_θ(Λ) = Σ_{i=0}^{s} θ_i Λ^i, where Λ contains the singular values of the graph Laplacian. However, one first needs to compute the SVD of the Laplacian, which can be prohibitive even for moderately sized graphs. Furthermore, the iteration complexity in this case is equivalent to that of learning a convolutional neural network. We work directly with the graph adjacency matrix and hence avoid SVD computations; furthermore, in the next section we show that we need not even compute the higher powers of the adjacency matrix. We also do not learn the parameters of the linear combination of matrix powers; including such a parametrization might yield further gains, at the expense of additional computation, something we leave for future work. Despite not incorporating these complexities, we show that HOMF performs comparably to, and often outperforms, graph-CNN-based models.

3 An Efficient Algorithm for HOMF
We now describe the algorithm to solve (4). Note that once the matrix S is formed, the problem reduces to standard matrix factorization, for which highly efficient methods exist. However, obtaining S is expensive from both a computational and a memory point of view. Indeed, regardless of the sparsity of P, for even small T (say T = 5), S will not be sparse. Figure 2 displays this phenomenon: we created a random matrix M, randomly selected the observed set Ω at different sparsity levels, and constructed the block matrix A (eqn. (3)) without side-information graphs (α = 0; in this case A has exactly 2|Ω| nonzeros). Even for sparse P, the multi-step transition matrix quickly becomes dense.
Given the size of modern datasets, storing such a dense S will be prohibitive, let alone computing the powers P^t. One thus needs an efficient method to solve the optimization problem at hand. We will see in the sequel that we in fact need not compute S, but only specific rows or columns of the matrix, which can be done efficiently.
We propose a coordinate descent method to alternately update W and H. Let w_i^{(τ)} (respectively h_j^{(τ)}) be the i-th row of W (respectively the j-th row of H) at iteration τ. We describe our method for updating H with W fixed; the updates for W with H fixed are analogous. At iteration τ, the update for H keeping the other variables fixed is as follows:
H^{(τ+1)} = argmin_H  Σ_{(i,j)∈Ω_S} (S_{ij} − (w_i^{(τ)})^⊤ h_j)² + λ‖H‖_F²    (5)
This ridge regression problem can be solved column-wise without storing the matrix S. The key observation is that each column of S can be calculated efficiently via the recursion q_0 = e_j, q_t = P q_{t−1}, so that the j-th column of S equals (1/T) Σ_{t=1}^{T} q_t, where e_j is the j-th canonical basis vector. This computation requires only the storage of the sparse matrix P and can be executed on the fly via sparse matrix-vector multiplications. We outline the recursive details in Algorithm 1. An analogous procedure can be applied to obtain the row-wise updates.
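A minimal sketch of this recursion: computing the j-th column of S with T matrix-vector products against P, and no matrix powers. P is a dense array here for brevity; in practice it would be stored sparse (e.g., in scipy.sparse), for which `P @ q` is exactly the sparse mat-vec the text describes.

```python
import numpy as np

def column_of_S(P, j, T):
    """Column j of S = (1/T) * sum_{t=1..T} P^t, without forming S.

    Uses q_0 = e_j and q_t = P @ q_{t-1}, so that q_t = (P^t) e_j is the
    j-th column of P^t; the answer is the average of q_1, ..., q_T.
    """
    q = np.zeros(P.shape[0])
    q[j] = 1.0                    # canonical basis vector e_j
    s = np.zeros(P.shape[0])
    for _ in range(T):
        q = P @ q                 # one (sparse) matrix-vector product
        s += q
    return s / T
```

Each call touches only the nonzeros of P, T times, which is what keeps the overall method linear in the size of the data.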
The row-wise and column-wise optimization strategy leads us to solve the following problem:
h_j^{(τ+1)} = argmin_{h}  Σ_{i∈Ω_j} (S_{ij} − (w_i^{(τ)})^⊤ h)² + λ‖h‖²    (6)
where Ω_j is the set of nonzeros of the j-th column of S. We note that Eq. (6) can be solved using standard regularized least squares solvers, for example the conjugate gradient method, as adopted in this paper. Moreover, the updates for different columns can be parallelized; the empirical speedup from this parallelization is investigated in the experiments section. Algorithm 2 details the pseudocode for solving (5). An analogous row-wise version of Algorithm 2 can be used to update W.
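The per-column solve of (6) can be sketched as below, using SciPy's conjugate gradient routine with the Hessian applied only through Hessian-vector products. The column s_j is assumed to come from the column-sampling recursion of Algorithm 1; the names and the default regularizer are ours.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def update_column(W, s_j, lam=0.1):
    """Solve the ridge regression (6) for one column embedding h_j.

    W   : (N, k) current factors for all graph nodes
    s_j : length-N column of S (e.g., produced by Algorithm 1)
    lam : ridge regularizer (lambda in (6))
    """
    idx = np.flatnonzero(s_j)             # Omega_j: nonzeros of column j
    Wj, y = W[idx], s_j[idx]
    k = W.shape[1]
    # Hessian of (6) is Wj^T Wj + lam*I; apply it to v matrix-free
    hess = LinearOperator((k, k),
                          matvec=lambda v: Wj.T @ (Wj @ v) + lam * v)
    h_j, info = cg(hess, Wj.T @ y)        # conjugate gradient solve
    return h_j
```

Because the system is only k × k, conjugate gradients converges in at most k iterations, and different columns can be dispatched to different workers independently.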
3.1 Computational Complexity
We briefly describe the computational complexity of our method. Each matrix-vector multiplication in Algorithm 1 can be performed in O(nnz(A)) time, so for a given T the cost of sampling one column of S is O(T · nnz(A)). The main bottleneck in the conjugate gradient procedure for (6) is the Hessian-vector multiplication: the Hessian of (6) is W_{Ω_j}^⊤ W_{Ω_j} + λI, so each conjugate gradient iteration costs O(k · |Ω_j|). The per-iteration complexity of Algorithm 2 is thus essentially O(T · nnz(A)) plus the per-column solves, which can be parallelized. Typically T is a small number and k ≪ min(m, n), so the complexity of the method is essentially linear in the size of the data, including the graph side information, since nnz(A) is of the order of |Ω| plus the number of side-information edges.
3.2 Generality of HOMF
We see that HOMF is highly general, and can efficiently incorporate pre-existing side information via the parameter α. Indeed, the side-information graphs can also yield interesting higher-order interaction information, as we will see in our experiments. When side information is not present, we can obtain higher-order information from the existing data alone. Furthermore, when T = 1, HOMF effectively ignores higher-order information and is conceptually similar to the standard MF routine; T = 2, in the same vein, is similar to the CoFactor method of [15]. Finally, one could parameterize S as a weighted combination of the powers of P and obtain further gains by learning the combination weights; we leave this for future work.
4 Experiments
Dataset    Type    rows  columns  entries  Graph present?    links
ML1M       0-5     6040  3952     1000208  no                -
FilmTrust  0-5     1508  2071     35497    rows              1632
Gene       Binary  1071  1150     2908     rows and columns  7424 (rows), 437239 (columns)
In this section, we test our method against standard matrix factorization and other state-of-the-art methods that use graph side information. We use a variety of datasets, some with an additional graph containing side information and some without, to demonstrate the universality of our algorithm. We also consider a binary dataset with graphs. Details about the datasets are provided in Table 1. MovieLens 1 million (ML1M) is a standard movie recommendation dataset (https://grouplens.org/datasets/movielens/1m/) [36]. The FilmTrust dataset [37] has a task similar to that of ML1M, but also comes with a user-user network (http://www.librec.net/datasets.html). The Gene dataset is a binary matrix, where the task is to determine which genes are useful for predicting the occurrence of various diseases (http://bigdata.ices.utexas.edu/project/genedisease/). This dataset also contains both a gene-gene interaction network and a disease co-morbidity graph. Our primary motivation in choosing such diverse datasets is to emphasize the effectiveness and flexibility of our algorithm in handling different types of data and side information without significant additional preprocessing.
For ML1M and FilmTrust, we generate a random 80/20 train-test split of the data. For Gene, we use the split provided online. Since this is a one-class classification problem, we randomly sampled the same number of negative samples as in the training data to act as the negative class for classification.
4.1 Baselines, Parameters and Initialization
We fixed the rank k for all methods to be 10. For comparison against standard matrix completion and its graph-based counterpart (equations (1) and (2)), we used the GRALS method [14] (code: https://www.cs.utexas.edu/ rofuyu/expcodes/grmfexp.zip). We also compared to weighted MF (WMF) [38] (code: https://github.com/benanne/wmf). For the FilmTrust dataset, which has a side-information graph, we also compared against the TrustWalker algorithm [12], which makes recommendations using random walks on graphs. We further compare against CoFactor [15], which can be seen as a specialization of our method to second-order information only, and the graph convolutional method recently introduced in [24] (MGCNN), where the authors show improved performance over GRALS on the same tasks.
We implemented our method in Python, using the multiprocessing framework to parallelize the updates in Algorithm 2 (our implementation is publicly available at https://github.com/vatsal2020/HOMF). Wherever applicable, we varied the regularization parameter λ and cross-validated on a fixed validation set (20% of the training data). When graph side information is available, we also varied α. We considered several random-walk lengths T for all datasets; for ML1M, FilmTrust, and Gene, the remaining parameters were fixed at dataset-specific values chosen by cross-validation. We initialized the factor matrices such that each entry is an independent uniform random variable. For the method in [24], we used the default neural network parameters, set the rank to 10 to be consistent with our models, and used the output of the graph CNN as the user and movie embeddings. The default "bandwidth" of the graph CNNs, kept as provided, is referred to as MGCNN-def; we also varied this bandwidth over the same values of T as in our method, which we call MGCNN in the results below.
4.2 Evaluation Metrics
For the Gene dataset with binary observations, we compute the standard AUC score on the test set. For non-binary data (ML1M and FilmTrust), we compute Precision, Recall, Mean Average Precision (MAP), and Normalized Discounted Cumulative Gain (NDCG), all at various cutoffs K. Note that since we factorize a nonlinear transformation of the ratings matrix M, RMSE is not a useful metric for comparison. For the ML1M dataset, we count a movie rated by a user as a true positive if the corresponding rating is 5. For the FilmTrust dataset, we count a movie as a true positive if the corresponding rating is at least 3, with smaller values of K, since the dataset is small and it is hard to find many highly rated movies for most users. The K and threshold values are determined by the average number of highly rated movies per user, which is consistent with most of the available literature. For the sake of completeness and to prevent confusion, we provide explicit formulas for the metrics used. Let π(1), …, π(K) index the test items rated by user u, sorted by predicted score. Let rel(i) = 1 if user u rated item i as relevant in the ground-truth test data and rel(i) = 0 otherwise. For simplicity, let N_u be the total number of relevant items for user u. We first calculate, for each user u,

Prec@K(u) = (1/K) Σ_{i=1}^{K} rel(π(i)),   Rec@K(u) = (1/N_u) Σ_{i=1}^{K} rel(π(i)),   AP@K(u) = (1/min(N_u, K)) Σ_{i=1}^{K} Prec@i(u) · rel(π(i)),

and then average across all the users to get Precision@K, Recall@K, and MAP@K. For NDCG, we first calculate DCG@K(u) = Σ_{i=1}^{K} rel(π(i)) / log₂(i + 1), and the ideal IDCG@K(u) based on the ordered ground-truth ranking of all the items. We obtain NDCG@K(u) = DCG@K(u) / IDCG@K(u) and average it across all users.
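For concreteness, the per-user quantities can be computed as follows; rel is the 0/1 relevance vector of a user's test items, already sorted by predicted score, with at least K entries. Conventions vary slightly across papers (logarithm base in DCG, MAP denominator); this sketch follows one common choice.

```python
import numpy as np

def precision_at_k(rel, K):
    return np.sum(rel[:K]) / K

def recall_at_k(rel, K):
    n_rel = np.sum(rel)                   # total relevant items for this user
    return np.sum(rel[:K]) / n_rel if n_rel else 0.0

def ap_at_k(rel, K):
    hits = rel[:K]
    prec = np.cumsum(hits) / np.arange(1, K + 1)   # Prec@i for i = 1..K
    denom = min(np.sum(rel), K)
    return np.sum(prec * hits) / denom if denom else 0.0

def ndcg_at_k(rel, K):
    discounts = np.log2(np.arange(2, K + 2))       # log2(i + 1) for i = 1..K
    dcg = np.sum(rel[:K] / discounts)
    ideal = np.sort(rel)[::-1]                     # best possible ordering
    idcg = np.sum(ideal[:K] / discounts)
    return dcg / idcg if idcg else 0.0
```

Averaging these quantities over users gives the Precision@K, Recall@K, MAP@K, and NDCG@K values reported in the tables.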
Data  Algos  Precision  Recall  MAP  NDCG  (each metric reported at two cutoffs K)
ML1M  MF  0.316  0.311  0.598  0.614  0.462  0.474  0.692  0.707 
Cofactor  0.360  0.320  0.530  0.641  0.480  0.514  0.492  0.531  
WMF  0.357  0.329  0.532  0.649  0.475  0.535  0.653  0.744  
MGCNNdef  0.367  0.323  0.549  0.649  0.496  0.530  0.740  0.744  
MGCNN  0.367  0.329  0.578  0.665  0.496  0.533  0.740  0.744  
HOMF  0.370  0.331  0.544  0.662  0.499  0.529  0.744  0.749  
FilmTrust  MF  0.701  0.633  0.345  0.436  0.795  0.744  0.761  0.747 
TrustWalker  0.506  0.497  0.316  0.456  0.598  0.607  0.584  0.568  
GRALS  0.752  0.740  0.365  0.492  0.812  0.801  0.772  0.770  
WMF  0.750  0.746  0.366  0.501  0.803  0.761  0.775  0.771  
Cofactor  0.716  0.706  0.349  0.469  0.773  0.762  0.716  0.732  
MGCNNdef  0.751  0.741  0.370  0.495  0.805  0.793  0.772  0.767  
MGCNN  0.761  0.748  0.367  0.499  0.822  0.793  0.777  0.775  
HOMF  0.754  0.745  0.375  0.502  0.816  0.802  0.778  0.773 
Table 2: Comparison of various algorithms using top-N evaluation metrics. HOMF consistently performs better than the methods compared. Bold values indicate the best result among the methods considered. The exponential weight function was used to obtain these results, and the weight α was fixed for our graphs.

4.3 Results
Table 2 summarizes the results obtained on the test sets of ML1M and FilmTrust. We see that HOMF consistently outperforms standard matrix factorization, as well as the variant that uses graph side information (GRALS) and CoFactor. In some cases, the performance gap is significant. MGCNN is a much more complex model, and it is often better than the other methods, as the authors in [24] show. We see that HOMF is for the most part competitive with, and often outperforms, MGCNN, despite being significantly simpler to optimize and having fewer parameters to learn.
Data  Method  AUC 

Gene  MF  0.546 
GRALS  0.572  
Cofactor  0.506  
WMF  0.568  
MGCNNdef  0.569  
MGCNN  0.599  
HOMF(exp)  0.623  
HOMF(linear)  0.630 
In Table 3, we show the results obtained on the Gene-Disease dataset. In this setting, with rich graph information encoding known relationships between the entities, HOMF significantly outperforms the competing methods. This suggests that there are hidden, higher-order interactions present in the data which the given graphs do not fully capture by themselves. This opens an interesting avenue for further research, especially in domains such as computational biology, where obtaining data is hard and hence hidden information in the data is even more valuable. Furthermore, a simpler parameterization helps in this case, since the data is sparse and NN-based methods typically need a lot of data to train.
4.4 Effect of T, α, and f
Next, we note that the walk length T is a (hyper)parameter of our method, and should be tuned during cross-validation. Figure 3 (a)-(b) shows the effect of varying T on the ML1M dataset. We see that bigger is not necessarily better when it comes to T, but for higher orders we usually obtain superior performance in terms of the evaluation metrics.
We also studied the effect of varying α on the FilmTrust dataset, where we fixed T. Figure 3 (c)-(d) again shows that different values of α yield different results, though they are fairly uniform.
We show the effect of the parameter T, the length of the walks considered in HOMF (Figs. 3 and 4), as well as the weighting parameter α (Fig. 3) when the side-information graphs are present. We also consider both the exponential and linear functions for f.
Figure 4 shows the effect of varying the walk length T on the runtime of the algorithm, on the ML1M dataset. As expected, the time increases with T; importantly, it increases nearly linearly in T (with a very small slope), as indicated by the complexity argument in Section 3.1. Moreover, the time difference is modest: approximately 15 extra seconds at the largest walk length considered relative to the smallest, suggesting that the method is scalable. From a theoretical perspective, even though very long walks may not be useful (for most real-world graphs, a very large T would mean a dense matrix to factorize), it is notable that long walk lengths do not hamper computational performance.
% data used  Algorithm  Precision  Recall  MAP  NDCG

25  HOMF  0.342  0.307  0.514  0.630  0.461  0.500  0.714  0.721 
WMF  0.329  0.296  0.499  0.608  0.452  0.491  0.706  0.709  
50  HOMF  0.361  0.329  0.536  0.665  0.466  0.507  0.718  0.734 
WMF  0.325  0.296  0.498  0.613  0.443  0.487  0.699  0.705  
75  HOMF  0.372  0.329  0.554  0.672  0.502  0.538  0.740  0.746 
WMF  0.324  0.293  0.497  0.606  0.443  0.484  0.697  0.701  
100  HOMF  0.370  0.331  0.544  0.662  0.499  0.529  0.744  0.749 
WMF  0.357  0.329  0.532  0.649  0.475  0.535  0.653  0.744 
4.5 Effect of Sparse Data
The experiments were performed by randomly subsampling the training data, but the hyperparameter search and evaluation were performed on the complete validation and test sets, respectively. Table 4 shows only a marginal degradation of performance on the ML1M dataset even when we hide 50-75% of the data. MGCNN, which is similar in spirit, also performs similarly, but methods such as WMF, which do not account for higher-order interactions, show a significant drop in performance.

4.6 Speedup via Parallelization
For a given number of processors p, we define the speedup to be the ratio of the runtime on a single processor to the runtime on p processors. Figure 5 shows that we obtain near-linear speedups via the parallel procedure in Algorithm 2. We used the ML1M dataset here and varied the number of processors over which we parallelized the updates. When the number of processors is increased to about 20, the corresponding speedup is nearly 17.
5 Discussion and Future Work
Our motivation behind transforming the original rating matrix into a transition probability matrix P is to extract higher-order information from the graph; P^t encodes the t-th order information. Our aim is to assimilate as much (non-redundant) higher-order information as possible. Another key point is that our approach is highly general: by setting T = 1 we can obtain performance similar to standard matrix factorization.
At large T, S might not capture much discriminative information, as two nodes far apart in the graph are forced to have similar representations. At low values of T, one can miss out on the implicit higher-order relationships between nearby nodes of the graph. Hence, it is necessary to be prudent when choosing T.
The development of a fully distributed method remains an open challenge. Given the matrix P, we can solve the problem in parallel on a multicore machine, as we have demonstrated in this paper. However, for extremely large datasets, the matrix P might not fit in memory. When P itself is distributed, computing higher-order powers can be a challenging task, since Algorithm 1 requires access to the full matrix P.
From a theoretical perspective, we would like to address the problem posed in Equation 2 and obtain convergence guarantees for low rank factorization of functions of a matrix without explicitly computing the function. As of now, we are unaware of any literature pertaining to matrix factorization of higher powers of a matrix, let alone general functions of matrices. Another interesting avenue is to characterize the minimum number of samples required to recover the factorization as accurately as possible.
6 Conclusion
In this paper, we presented HOMF, a unified method for matrix factorization that can take advantage of both explicit and implicit side information in the data. We developed a scalable method for solving the resulting optimization problem and presented results on several datasets. Our method efficiently includes both existing side information and implicit higher order information. Experiments on datasets with and without side information show that our method outperforms several baselines, often including those that rely on deep neural networks to make predictions.
References
 [1] Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization techniques for recommender systems,” Computer, vol. 42, no. 8, pp. 30–37, 2009.
 [2] A. K. Biswas, M. Kang, D.-C. Kim, C. H. Ding, B. Zhang, X. Wu, and J. X. Gao, “Inferring disease associations of the long non-coding RNAs through non-negative matrix factorization,” Network Modeling Analysis in Health Informatics and Bioinformatics, vol. 4, no. 1, pp. 1–17, 2015.
 [3] B. Haeffele, E. Young, and R. Vidal, “Structured low-rank matrix factorization: Optimality, algorithm, and applications to image processing,” in Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 2007–2015, 2014.
 [4] B. Recht, C. Re, S. Wright, and F. Niu, “Hogwild: A lock-free approach to parallelizing stochastic gradient descent,” in Advances in Neural Information Processing Systems, pp. 693–701, 2011.
 [5] H.-F. Yu, C.-J. Hsieh, S. Si, and I. Dhillon, “Scalable coordinate descent approaches to parallel matrix factorization for recommender systems,” in 2012 IEEE 12th International Conference on Data Mining, pp. 765–774, IEEE, 2012.
 [6] E. J. Candès and B. Recht, “Exact matrix completion via convex optimization,” Foundations of Computational Mathematics, vol. 9, no. 6, pp. 717–772, 2009.
 [7] J.-F. Cai, E. J. Candès, and Z. Shen, “A singular value thresholding algorithm for matrix completion,” SIAM Journal on Optimization, vol. 20, no. 4, pp. 1956–1982, 2010.
 [8] S. Bhojanapalli, P. Jain, and S. Sanghavi, “Tighter low-rank approximation via sampling the leveraged element,” in Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 902–920, SIAM, 2014.
 [9] A. Mnih and R. R. Salakhutdinov, “Probabilistic matrix factorization,” in Advances in Neural Information Processing Systems, pp. 1257–1264, 2008.
 [10] D. Crandall, D. Cosley, D. Huttenlocher, J. Kleinberg, and S. Suri, “Feedback effects between similarity and social influence in online communities,” in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 160–168, ACM, 2008.
 [11] T. Zhou, H. Shan, A. Banerjee, and G. Sapiro, “Kernelized probabilistic matrix factorization: Exploiting graphs and side information,” in SDM, vol. 12, pp. 403–414, SIAM, 2012.
 [12] M. Jamali and M. Ester, “TrustWalker: A random walk model for combining trust-based and item-based recommendation,” in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 397–406, ACM, 2009.
 [13] M. Xu, R. Jin, and Z.-H. Zhou, “Speedup matrix completion with side information: Application to multi-label learning,” in Advances in Neural Information Processing Systems, pp. 2301–2309, 2013.
 [14] N. Rao, H.-F. Yu, P. K. Ravikumar, and I. S. Dhillon, “Collaborative filtering with graph information: Consistency and scalable methods,” in Advances in Neural Information Processing Systems, pp. 2107–2115, 2015.
 [15] D. Liang, J. Altosaar, L. Charlin, and D. M. Blei, “Factorization meets the item embedding: Regularizing matrix factorization with item co-occurrence,” in Proceedings of the 10th ACM Conference on Recommender Systems, RecSys ’16, pp. 59–66, 2016.
 [16] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in Neural Information Processing Systems, pp. 3111–3119, 2013.
 [17] O. Levy and Y. Goldberg, “Neural word embedding as implicit matrix factorization,” in Advances in Neural Information Processing Systems, pp. 2177–2185, 2014.
 [18] B. Perozzi, R. Al-Rfou, and S. Skiena, “DeepWalk: Online learning of social representations,” in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 701–710, ACM, 2014.
 [19] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, “LINE: Large-scale information network embedding,” in Proceedings of the 24th International Conference on World Wide Web, pp. 1067–1077, 2015.
 [20] C. Tu, W. Zhang, Z. Liu, and M. Sun, “Max-margin DeepWalk: Discriminative learning of network representation,” in Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI 2016), pp. 3889–3895, 2016.
 [21] P. Goyal and E. Ferrara, “Graph embedding techniques, applications, and performance: A survey,” arXiv preprint arXiv:1705.02801, 2017.
 [22] C. Yang and Z. Liu, “Comprehend DeepWalk as matrix factorization,” arXiv preprint arXiv:1501.00358, 2015.
 [23] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in Advances in Neural Information Processing Systems, pp. 3844–3852, 2016.
 [24] F. Monti, M. Bronstein, and X. Bresson, “Geometric matrix completion with recurrent multi-graph neural networks,” in Advances in Neural Information Processing Systems, 2017.
 [25] P. Jain and I. S. Dhillon, “Provable inductive matrix completion,” arXiv preprint arXiv:1306.0626, 2013.
 [26] S. Si, K.-Y. Chiang, C.-J. Hsieh, N. Rao, and I. S. Dhillon, “Goal-directed inductive matrix completion,” KDD, 2016.
 [27] M. Gori, A. Pucci, V. Roma, and I. Siena, “ItemRank: A random-walk based scoring algorithm for recommender engines,” in IJCAI, vol. 7, pp. 2766–2771, 2007.
 [28] W. Xie, D. Bindel, A. Demers, and J. Gehrke, “Edge-weighted personalized PageRank: Breaking a decade-old performance barrier,” in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1325–1334, ACM, 2015.
 [29] H. Yun, H.-F. Yu, C.-J. Hsieh, S. Vishwanathan, and I. Dhillon, “NOMAD: Non-locking, stochastic multi-machine algorithm for asynchronous and decentralized matrix completion,” Proceedings of the VLDB Endowment, vol. 7, no. 11, pp. 975–986, 2014.
 [30] J. A. Golbeck, “Computing and applying trust in web-based social networks,” ACM Digital Library, 2005.
 [31] F. Fouss, A. Pirotte, J.-M. Renders, and M. Saerens, “Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation,” IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 3, pp. 355–369, 2007.
 [32] Z. Abbassi and V. S. Mirrokni, “A recommender system based on local random walks and spectral methods,” in Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 Workshop on Web Mining and Social Network Analysis, pp. 102–108, ACM, 2007.
 [33] C. Yang, Z. Liu, D. Zhao, M. Sun, and E. Y. Chang, “Network representation learning with rich text information,” in Proceedings of the 24th International Joint Conference on Artificial Intelligence, Buenos Aires, Argentina, pp. 2111–2117, 2015.
 [34] C. Borgs, J. Chayes, C. E. Lee, and D. Shah, “Thy friend is my friend: Iterative collaborative filtering for sparse matrix estimation,” in Advances in Neural Information Processing Systems, pp. 4718–4729, 2017.
 [35] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
 [36] F. M. Harper and J. A. Konstan, “The MovieLens datasets: History and context,” ACM Transactions on Interactive Intelligent Systems (TiiS), vol. 5, no. 4, p. 19, 2016.
 [37] G. Guo, J. Zhang, and N. Yorke-Smith, “A novel Bayesian similarity measure for recommender systems,” in Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI), pp. 2619–2625, 2013.
 [38] Y. Hu, Y. Koren, and C. Volinsky, “Collaborative filtering for implicit feedback datasets,” in Data Mining, 2008. ICDM ’08. Eighth IEEE International Conference on, pp. 263–272, IEEE, 2008.
7 Appendix
A Simple Example
Let us consider a simple ratings matrix, R:
(7) 
In this part, we describe how to construct the transition probability matrix from the ratings matrix and the side information (if given). We assume an exponential weighting of the ratings (any non-negative non-decreasing mapping can be used) to construct the transition probability matrix, which then becomes,
(8)  
(9) 
Let us assume for the moment that we have no side information. The row-normalized version of the matrix is given as follows:
(10) 
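The construction described above can be sketched in a few lines of code. The 3x3 ratings matrix and the bipartite (user x item) block layout below are illustrative assumptions, not the exact matrices of Equations 7-10:

```python
import numpy as np

# Exponentially weight the observed ratings, zero out missing entries,
# and row-normalize to obtain transition probabilities.
R = np.array([[5, 0, 1],
              [0, 3, 4],
              [2, 0, 0]], dtype=float)   # made-up ratings; 0 = unobserved

W = np.where(R > 0, np.exp(R), 0.0)      # exponential weighting

# Bipartite layout: users -> items in the upper-right block,
# items -> users in the lower-left block.
m, n = R.shape
A = np.zeros((m + n, m + n))
A[:m, m:] = W
A[m:, :m] = W.T

T = A / A.sum(axis=1, keepdims=True)     # row-normalize
print(np.allclose(T.sum(axis=1), 1.0))   # True: each row is a distribution
```

Higher-order information is then obtained by taking powers of T, exactly as in the walk-length discussion of Section 5.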
In the presence of side information for movies, the matrix becomes,
(11) 
where
(12) 
Using the same exponential decomposition (or a linear one), we can rewrite A as:
(13) 
where
7.1 Some observations
In this section, we list some of the properties we observed for the transition matrix, which we believe should be useful for theoretical guarantees:
P1:
P2:
is a monotonic non-increasing function of (for ) and a non-decreasing function of for .
The motivation for using higher powers of the transition matrix is twofold: retention of the higher order information and ensuring the existence of a low rank factorization. The latter is evident from P2 and Figure 6.
In Figure 6, we demonstrate that the magnitudes of all the eigenvalues decay as we collect higher-order information. (These eigenvalues do not correspond to a particular dataset, but are representative of the trajectories the actual eigenvalues will follow.) This supports our claim that, for large walk lengths, the matrix power can be approximated with a low-rank factorization.
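This spectral argument is easy to verify numerically: if the transition matrix has eigenvalue magnitudes at most one, its k-th power has eigenvalue magnitudes raised to the k-th power, which decay geometrically. The magnitudes below are made up, as in Figure 6:

```python
import numpy as np

# Illustrative eigenvalue magnitudes (not from any dataset). Raising a
# matrix to the k-th power raises each eigenvalue to the k-th power,
# so every |lambda| < 1 shrinks geometrically with the walk length k.
lams = np.array([1.0, 0.8, 0.5, 0.3])
for k in (1, 2, 5):
    print(k, np.round(lams ** k, 3))
# The spectrum flattens toward a single dominant eigenvalue, which is
# why a low-rank factorization becomes increasingly accurate.
```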
P3:
The trace of the matrix power is positive for any order, and so the matrix is not nilpotent.
Since the random walks are allowed to return, in two or more steps, to the entity (user/movie) from which they began, the probability of returning to the same entity is nonzero, and so the trace of the matrix is always nonzero for higher powers.
P4:
Assuming the matrix does not have all-zero rows/columns, its norm is greater than 1.
Assuming no rows/columns with all-zero entries, there has to be at least one column with a sum greater than 1, since all rows sum to 1 and the matrix is not doubly stochastic. Hence its norm is greater than 1.
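A quick numerical check of P4, using a made-up 2x2 row-stochastic matrix; the induced 1-norm (the maximum absolute column sum) stands in here for whichever norm the property refers to:

```python
import numpy as np

# Row-stochastic but not doubly stochastic: rows sum to 1, so the
# column sums total the matrix dimension. Unless every column sum is
# exactly 1 (the doubly stochastic case), some column must exceed 1,
# which lower-bounds the induced 1-norm.
T = np.array([[0.9, 0.1],
              [0.6, 0.4]])
col_sums = T.sum(axis=0)
print(col_sums)                      # [1.5 0.5]
print(np.linalg.norm(T, 1) > 1)      # True: induced 1-norm = max column sum
```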