We study the prediction problem where the observation is a sequence of graph adjacency matrices $A_1, \dots, A_T \in \mathbb{R}^{n \times n}$ and the goal is to predict $A_{T+1}$. This type of problem arises in applications such as recommender systems where, given information on purchases made by some users, one would like to predict future purchases. In this context, users and products can be modeled as nodes of a bipartite graph, while purchases or clicks are modeled as edges.
In functional genomics and systems biology, regulatory networks estimated from gene expression data can likewise be modeled as graphs, and fitting predictive models is a natural way to estimate how these networks evolve.
A large variety of methods for link prediction only consider a single static snapshot of the graph - this includes heuristics (Liben-Nowell & Kleinberg, 2007; Sarkar et al., 2010), matrix factorization (Koren, 2008), and probabilistic methods (Taskar et al., 2003). More recently, some works have investigated using sequences of observations of the graph to improve the prediction, such as regression on features extracted from the graphs (Richard et al., 2010), matrix factorization (Koren, 2010), or, in some special cases, probabilistic techniques. Most techniques, however, do not explicitly take into account the inherently sparse nature of usual sequences of adjacency matrices. In this work, we extend the work of (Richard et al., 2010) to address this, and in addition we propose a more principled way of predicting using features extracted from the sequence of graph snapshots.
We make the following assumptions about the graph sequence (represented by adjacency matrices $A_1, \dots, A_T$):
Low-rank. The matrix $A_{T+1}$ has low rank. This reflects the presence of highly connected groups of nodes, such as communities in social networks.
Autoregressive linear features. We assume we are given a linear map $\omega : \mathbb{R}^{n \times n} \to \mathbb{R}^d$ defined by a set of matrices $\Omega_1, \dots, \Omega_d \in \mathbb{R}^{n \times n}$,
$$\omega(A) = \big(\langle \Omega_1, A \rangle, \dots, \langle \Omega_d, A \rangle\big)^\top,$$
such that the vector time series $(\omega(A_t))_{t \geq 1}$ has an autoregressive evolution:
$$\omega(A_{t+1}) = W_0^\top \omega(A_t) + N_{t+1},$$
where $W_0 \in \mathbb{R}^{d \times d}$ is a sparse matrix and $N_{t+1}$ is a noise term such that $(\omega(A_t))_t$ is stationary. An example of linear features is the vector of node degrees, a popularity measure in social and commerce networks.
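As a concrete illustration (a sketch with hypothetical names, not code from the paper), take the degree feature map $\omega(A) = A\mathbf{1}$ and a simple sparse (diagonal) choice of $W_0$; the autoregressive assumption then relates successive feature vectors linearly:

```python
import numpy as np

def degree_features(A):
    """Linear feature map: node degrees, omega(A) = A @ 1."""
    return A.sum(axis=1)

rng = np.random.default_rng(0)
n, d = 6, 6  # one feature per node, so d = n here

# A sparse autoregressive matrix W0 driving the feature dynamics
# (diagonal, hence sparse; the value 0.5 is illustrative).
W0 = 0.5 * np.eye(d)

A_t = (rng.random((n, n)) < 0.3).astype(float)  # random sparse graph
phi_t = degree_features(A_t)

# Under the model, the next feature vector is W0.T @ phi_t plus noise;
# here we form the noiseless one-step feature prediction.
phi_next = W0.T @ phi_t
```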
2 Formulation of an optimization problem
In order to reflect the stationarity assumption on $(\omega(A_t))_t$, we use a convex loss function $\ell : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ to penalize the dissimilarity between two feature vectors at successive time steps. Let us introduce the matrices $X_{T-1} = (\omega(A_1), \dots, \omega(A_{T-1}))^\top$ and $X_T = (\omega(A_2), \dots, \omega(A_T))^\top$ that stack the successive feature vectors. We also use $\ell$ to denote the elementwise extension of the loss to such matrices. In the case of the quadratic loss, we consider the following penalized regression objective:
$$J_1(W) = \frac{1}{T} \|X_T - X_{T-1} W\|_F^2 + \kappa \|W\|_1.$$
To predict $A_{T+1}$, we propose a regression term penalized by the sum of the $\ell_1$-norm and the trace norm, in the same fashion as in (Richard et al., 2012), in order to predict the future graph $A$ given that its features $\omega(A)$ should be well approximated by $W^\top \omega(A_T)$:
$$J_2(A, W) = \frac{1}{d} \|\omega(A) - W^\top \omega(A_T)\|_2^2 + \tau \|A\|_* + \gamma \|A\|_1.$$
The overall objective function we consider here is the sum of the two partial objectives, $J(A, W) = J_1(W) + J_2(A, W)$, which is convex since $J_1$ and $J_2$ are both convex.
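Under the quadratic-loss formulation above, the joint objective can be sketched as follows (the regularization weights and helper names are illustrative assumptions, not the paper's code):

```python
import numpy as np

def objective(A, W, X_prev, X_next, omega, phi_T,
              tau=0.1, gamma=0.1, kappa=0.1):
    """Sketch of J(A, W) = J1(W) + J2(A, W) with quadratic loss.
    X_prev, X_next stack successive feature vectors; phi_T = omega(A_T).
    tau, gamma, kappa are hypothetical regularization weights."""
    T = X_next.shape[0] + 1
    d = phi_T.shape[0]
    # J1: autoregressive fit of the feature time series, l1 penalty on W
    j1 = np.linalg.norm(X_next - X_prev @ W, 'fro') ** 2 / T \
         + kappa * np.abs(W).sum()
    # J2: the predicted graph's features should match W^T omega(A_T);
    # trace-norm + l1 penalties encode the low-rank + sparse prior on A
    j2 = np.linalg.norm(omega(A) - W.T @ phi_T) ** 2 / d \
         + tau * np.linalg.norm(A, 'nuc') + gamma * np.abs(A).sum()
    return j1 + j2

# usage with the degree feature map on a tiny example
omega = lambda A: A.sum(axis=1)
rng = np.random.default_rng(0)
n = 4
A = np.eye(n)
W = np.zeros((n, n))
X_prev = rng.random((3, n))
X_next = rng.random((3, n))
val = objective(A, W, X_prev, X_next, omega, omega(np.ones((n, n))))
```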
Let us introduce the linear map $\Phi$ obtained by stacking the two quadratic terms, so that the objective can be written as a penalized least-squares regression in the joint variable $Z = (A, W)$:
$$J(Z) = \|\Phi(Z) - b\|_2^2 + \tau \|A\|_* + \gamma \|A\|_1 + \kappa \|W\|_1,$$
for an appropriate vector $b$ built from $X_T$.
3 Oracle inequality
We first introduce the noise terms of the model, defined in terms of the quantities in (1), and the associated constants entering the bound; the regularization parameters $\tau$, $\gamma$ and $\kappa$ are chosen so as to dominate the corresponding noise terms.
The following result can be proved using the tools introduced in (Koltchinskii et al., 2011).
Proposition 1. Let $(\hat A, \hat W)$ be the minimizers of $J$ over a convex cone $\mathcal{C}$, and suppose the regularization parameters dominate the noise terms as above. Then, with high probability, the prediction error of $(\hat A, \hat W)$ is bounded by that of the best pair in $\mathcal{C}$, plus a remainder term governed by the rank and sparsity of $A_{T+1}$ and the sparsity of $W_0$.
The latter inequality shows how the quality of the solution is governed by the rank and sparsity of the future graph $A_{T+1}$, and by the interplay between these two priors through the regularization parameters. The dependence on the number of observations $T$ quantifies how the estimation improves as the sequence grows longer.
4 Algorithms
4.1 Generalized forward-backward algorithm for minimizing $J$
We use the algorithm designed in (Raguet et al., 2011) to minimize our objective function. Note that this approach outperforms the method introduced in (Richard et al., 2010), as it minimizes jointly in $(A, W)$, whereas the previous method first estimates $W$ by minimizing a functional similar to $J_1$ and then minimizes $J_2$ with $W$ fixed.
In addition, we use the novel joint penalty from (Richard et al., 2012), which is better suited to estimating graphs. The proximal operator of the trace norm is given by the shrinkage operation: if $X = U \operatorname{diag}(\sigma_1, \dots, \sigma_n) V^\top$ is the singular value decomposition of $X$,
$$\operatorname{prox}_{\tau \|\cdot\|_*}(X) = U \operatorname{diag}\big((\sigma_i - \tau)_+\big)_i V^\top.$$
Similarly, the proximal operator of the $\ell_1$-norm is the soft-thresholding operator, defined using the entrywise product of matrices, denoted by $\odot$:
$$\operatorname{prox}_{\gamma \|\cdot\|_1}(X) = \operatorname{sgn}(X) \odot (|X| - \gamma)_+.$$
The algorithm converges under very mild conditions provided the step size is smaller than $2/L$, where $L$ is the Lipschitz constant of the gradient of the smooth part of the objective.
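Both proximal operators, together with one simplified generalized forward-backward iteration built on them, can be sketched as follows (a single-variable sketch with equal weights, not the exact algorithm of Raguet et al.):

```python
import numpy as np

def prox_trace(X, tau):
    """Prox of tau * trace norm: shrink singular values by tau."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def prox_l1(X, gamma):
    """Prox of gamma * l1 norm: entrywise soft thresholding."""
    return np.sign(X) * np.maximum(np.abs(X) - gamma, 0.0)

def gfb_step(A, aux1, aux2, grad, step, tau, gamma):
    """One generalized forward-backward style step handling the sum
    tau*||.||_* + gamma*||.||_1 via two auxiliary variables, each
    weighted 1/2 (hence the doubled prox parameters). A sketch only."""
    g = grad(A)  # gradient of the smooth (least-squares) part
    aux1 = aux1 + prox_trace(2 * A - aux1 - step * g, 2 * step * tau) - A
    aux2 = aux2 + prox_l1(2 * A - aux2 - step * g, 2 * step * gamma) - A
    A = 0.5 * (aux1 + aux2)  # average of the auxiliary variables
    return A, aux1, aux2
```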
4.2 Non-convex Factorization Method
An alternative to estimating low-rank and sparse matrices by a mixed penalty of the form $\tau \|A\|_* + \gamma \|A\|_1$, as in (Richard et al., 2012), is to factorize $A = U V^\top$, where $U$ and $V$ are sparse matrices, and to penalize $\|U\|_1 + \|V\|_1$. The objective function to be minimized is then obtained by substituting $U V^\top$ for $A$ in $J$ and penalizing $\|U\|_1 + \|V\|_1$ in place of the penalties on $A$,
which is a non-convex function of the joint variable $(U, V, W)$, making the theoretical analysis more difficult. Since the objective is convex in a neighborhood of a solution, by initializing the variables adequately we can minimize it with an algorithm inspired by proximal gradient descent.
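A minimal alternating proximal-gradient sketch for the factorized variant, using a plain squared-error data term in place of the full objective (step size and penalty weight are illustrative assumptions):

```python
import numpy as np

def factored_prox_step(U, V, target, step=0.05, lam=0.01):
    """One alternating proximal-gradient step on
    0.5*||U V^T - target||_F^2 + lam*(||U||_1 + ||V||_1):
    gradient step on each factor, then entrywise soft thresholding.
    Non-convex sketch, not the paper's exact algorithm."""
    soft = lambda X, t: np.sign(X) * np.maximum(np.abs(X) - t, 0.0)
    R = U @ V.T - target            # residual of current factorization
    U = soft(U - step * (R @ V), step * lam)      # update U
    R = U @ V.T - target            # refresh residual with new U
    V = soft(V - step * (R.T @ U), step * lam)    # update V
    return U, V

rng = np.random.default_rng(0)
n, r = 8, 2
target = rng.random((n, r)) @ rng.random((r, n))  # low-rank target
U, V = rng.random((n, r)), rng.random((n, r))
err0 = np.linalg.norm(U @ V.T - target)
for _ in range(300):
    U, V = factored_prox_step(U, V, target)
err = np.linalg.norm(U @ V.T - target)
```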
5 Numerical Experiments
5.1 A generative model for graphs having linearly autoregressive features
Let $V_0 \in \mathbb{R}^{n \times r}$ be a sparse matrix and $V_0^\dagger$ its pseudo-inverse, such that $V_0^\dagger V_0 = I_r$. Fix two sparse matrices $W_0 \in \mathbb{R}^{r \times r}$ and $U_0 \in \mathbb{R}^{n \times r}$. Now define the sequence of matrices $(A_t)_{t \geq 0}$ by
$$U_{t+1} = U_t W_0 + N_{t+1}, \qquad A_t = U_t V_0^\top + M_t,$$
for i.i.d. sparse noise matrices $N_t$ and $M_t$, which means that for any pair of indices $(i, j)$, with high probability $(N_t)_{i,j} = 0$ and $(M_t)_{i,j} = 0$.
If we define the linear feature map $\omega(A) = A (V_0^\dagger)^\top$, note that
$$\omega(A_t) = U_t + M_t (V_0^\dagger)^\top.$$
The sequence $(U_t)_t$ follows the linear autoregressive relation $U_{t+1} = U_t W_0 + N_{t+1}$. For any time index $t$, the matrix $A_t$ is close to $U_t V_0^\top$, which has rank at most $r$; moreover, $U_t V_0^\top$ is sparse, and furthermore $A_t$ itself is sparse.
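The generative model can be sketched numerically as follows (the sizes, thresholds, and the `sparsify` helper are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, T = 20, 3, 10

def sparsify(X, thresh=0.8):
    """Hypothetical helper: soft-threshold to obtain a sparse matrix."""
    return np.sign(X) * np.maximum(np.abs(X) - thresh, 0.0)

V0 = sparsify(rng.normal(size=(n, r)))
V0_pinv = np.linalg.pinv(V0)               # satisfies V0_pinv @ V0 = I_r
W0 = sparsify(rng.normal(size=(r, r)), 1.2)
U = [sparsify(rng.normal(size=(n, r)))]    # U_0

A = []
for t in range(T):
    M_t = sparsify(rng.normal(size=(n, n)), 2.5)  # sparse observation noise
    A.append(U[-1] @ V0.T + M_t)                  # A_t = U_t V0^T + M_t
    N_t = sparsify(rng.normal(size=(n, r)), 2.5)  # sparse innovation noise
    U.append(U[-1] @ W0 + N_t)                    # U_{t+1} = U_t W0 + N_{t+1}

# feature map omega(A) = A @ V0_pinv.T recovers U_t up to noise
phi = A[0] @ V0_pinv.T
```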
We tested the presented methods on synthetic data generated as in Section 5.1. In our experiments, the noise matrices $M_t$ and $N_t$ were built by soft-thresholding i.i.d. Gaussian noise. After choosing the parameters by 10-fold cross-validation, we compare our methods to standard baselines in link prediction (Liben-Nowell & Kleinberg, 2007). We use the area under the ROC curve as the measure of performance and report empirical results averaged over 10 runs. Nearest Neighbors (NN) relies on the number of common friends between each pair of nodes, which is given by $(S^2)_{i,j}$ when $S$ is the cumulative graph adjacency matrix, and we denote by Shrink the low-rank approximation of $S$. Since $V_0$ is unknown, we consider the feature map $\omega(A) = A V$, where $S = U \Sigma V^\top$ is the SVD of the cumulative adjacency matrix $S$.
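The two baselines can be sketched as follows ($S$ is the cumulative adjacency matrix; the truncation rank `k` is illustrative):

```python
import numpy as np

def nn_scores(S):
    """Nearest-Neighbors baseline: common-neighbor counts (S @ S)_{ij}."""
    return S @ S

def shrink_scores(S, k=2):
    """Shrink baseline: rank-k approximation of the cumulative matrix S."""
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# toy cumulative adjacency matrix of an undirected graph
S = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
common = nn_scores(S)   # common[i, j] = number of shared neighbors of i, j
low_rank = shrink_scores(S)
```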
The experiments suggest the empirical superiority of the proposed approaches over the standard baselines. It is intriguing that the non-convex matrix factorization outperforms its convex rival. A possible explanation is that minimizing the nuclear norm via the shrinkage operator produces factorizations of the solution by two orthogonal matrices, which conflicts with the sparsity of the solution. The other benefit of the non-convex formulation is its scalability: the proximal method proposed for the convex formulation requires a full SVD at each iteration, and thus scales as $O(n^2)$ in storage and $O(n^3)$ in time. Several questions open perspectives for further investigation.
Choice of the feature map $\omega$. In the current work we used the projection onto the vector space spanned by the top-$r$ singular vectors of the cumulative adjacency matrix as the linear map $\omega$, and this choice has shown empirical superiority over other choices. The question of choosing the best measurements to summarize graph information, as in compressed sensing, seems to have both theoretical and practical potential.
Characterization of sparse and low-rank matrices. Can every sparse and low-rank matrix be written as a product $U V^\top$ where $U$ and $V$ are both sparse? In other terms, what is the relation between the solutions of problems penalized by $\|A\|_* + \|A\|_1$ - such as $J_2$ - and those penalized by, e.g., $\|U\|_1 + \|V\|_1$?
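One direction of this question is immediate to illustrate numerically: a product of sparse factors is always simultaneously sparse and low-rank (toy check; the particular factors are illustrative):

```python
import numpy as np

# Two sparse n x r factors (r = 2), mostly zero entries
n, r = 6, 2
U = np.zeros((n, r)); U[0, 0] = 1.0; U[1, 0] = 2.0; U[4, 1] = 3.0
V = np.zeros((n, r)); V[2, 0] = 1.0; V[5, 1] = -1.0

A = U @ V.T  # product of sparse factors: rank <= 2 and sparse

rank = np.linalg.matrix_rank(A)  # rank of the product
nnz = np.count_nonzero(A)        # number of nonzero entries
```

The converse direction (whether every sparse low-rank matrix admits sparse factors) is exactly the open question above.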
Appendix: Proof of Proposition 1
Appendix A Preliminary Tools
Definition 1 (Orthogonal projections associated with $A$).
Let $A$ be a rank-$r$ matrix. We can write the SVD of $A$ in two ways: $A = \sum_{j=1}^r \sigma_j u_j v_j^\top$ or $A = U \Sigma V^\top$, where $U = (u_1, \dots, u_r)$ and $V = (v_1, \dots, v_r)$ have orthonormal columns and $\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r)$. Let $U_\perp$ and $V_\perp$ be matrices of size $n \times (n - r)$ orthonormally completing the bases $U$ and $V$, and define the projections onto the vector spaces spanned by the columns of $U$ and $V$:
$$P_U = U U^\top, \qquad P_V = V V^\top,$$
and define the orthogonal projections
$$\mathcal{P}_A(B) = P_U B + B P_V - P_U B P_V, \qquad \mathcal{P}_A^\perp(B) = (I - P_U) B (I - P_V).$$
We highlight the fact that $\mathcal{P}_A(B)$ can also be written as $\mathcal{P}_A(B) = B - \mathcal{P}_A^\perp(B)$.
We know that $\mathcal{P}_A(A) = A$ and $\mathcal{P}_A^\perp(A) = 0$. The two following inequalities will also be useful:
Lemma 1 (Rank inequalities).
For any matrix $B$, $\operatorname{rank}(\mathcal{P}_A(B)) \leq 2 \operatorname{rank}(A)$ and $\|\mathcal{P}_A(B)\|_* \leq \sqrt{2 \operatorname{rank}(A)} \, \|B\|_F$.
Lemma 2 (Orthogonality of the decomposition).
For any matrix $B$, with the same notations, we have
$$B = P_U B P_V + P_U B (I - P_V) + (I - P_U) B P_V + (I - P_U) B (I - P_V),$$
and the 4 terms are pairwise orthogonal. It follows that
$$\|B\|_F^2 = \|P_U B P_V\|_F^2 + \|P_U B (I - P_V)\|_F^2 + \|(I - P_U) B P_V\|_F^2 + \|\mathcal{P}_A^\perp(B)\|_F^2.$$
We have the identity
$$\mathcal{P}_A(B) = P_U B P_V + P_U B (I - P_V) + (I - P_U) B P_V.$$
It follows that
$$\|B\|_F^2 = \|\mathcal{P}_A(B)\|_F^2 + \|\mathcal{P}_A^\perp(B)\|_F^2.$$
Appendix B Proof
We have, for any $(A, W)$, by optimality of $(\hat A, \hat W)$, $J(\hat A, \hat W) \leq J(A, W)$. Thanks to trace-norm duality and $\ell_1$-duality, $\langle M, B \rangle \leq \|M\|_{\mathrm{op}} \|B\|_*$ and $\langle M, B \rangle \leq \|M\|_\infty \|B\|_1$ for any matrices $M$ and $B$; combining these bounds with the assumptions on $\tau$, $\gamma$ and $\kappa$, and then applying the triangle inequality, yields the first bound.
For proving the other bound, we start by setting some notation. Let $A = U \Sigma V^\top$ be the SVD of $A$, and consider an element $G$ of the subgradient of the convex function $A \mapsto \tau \|A\|_* + \gamma \|A\|_1$ at $A$. By the characterization of the subdifferentials of the trace norm and the $\ell_1$-norm, there exist matrices $Z_1$ with $\|Z_1\|_{\mathrm{op}} \leq 1$ and $\mathcal{P}_A(Z_1) = 0$, and $Z_2$ with $\|Z_2\|_\infty \leq 1$ vanishing on the support of $A$, such that
$$G = \tau (U V^\top + Z_1) + \gamma (\operatorname{sgn}(A) + Z_2),$$
where $\operatorname{sgn}(A)$ is the entrywise sign matrix of $A$. We use the standard subgradient inequality $f(B) \geq f(A) + \langle G, B - A \rangle$ for these convex functions, and a similar inequality for the subdifferential of the $\ell_1$-penalty on $W$. We get
Therefore we obtain
The inequality (5) can be written as
Thanks to the Cauchy–Schwarz inequality,
so we have
We now need to bound the remaining noise term. For this, note that, by definition of the estimators, it can be decomposed, for any $(A, W)$, into terms involving $\hat A - A$ and $\hat W - W$. We get, by applying the triangle inequality, the Cauchy–Schwarz inequality, and Hölder's inequality written for the trace norm and the $\ell_1$-norm, together with the rank and support inequalities (obtained again by Cauchy–Schwarz),
Now, using the previous bounds, we can rewrite inequality (7) as follows:
the last inequality being due to the assumptions on $\tau$, $\gamma$ and $\kappa$.
So finally, using these assumptions again,
and the result follows by combining the previous inequalities and setting the constants accordingly.
- Koltchinskii et al. (2011) Koltchinskii, V., Lounici, K., and Tsybakov, A. Nuclear norm penalization and optimal rates for noisy matrix completion. Annals of Statistics, 2011.
- Koren (2008) Koren, Y. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 426–434. ACM, 2008.
- Koren (2010) Koren, Y. Collaborative filtering with temporal dynamics. Communications of the ACM, 53(4):89–97, 2010.
- Liben-Nowell & Kleinberg (2007) Liben-Nowell, D. and Kleinberg, J. The link-prediction problem for social networks. Journal of the American society for information science and technology, 58(7):1019–1031, 2007.
- Raguet et al. (2011) Raguet, H., Fadili, J., and Peyré, G. Generalized forward-backward splitting. Arxiv preprint arXiv:1108.4404, 2011.
- Richard et al. (2010) Richard, E., Baskiotis, N., Evgeniou, Th., and Vayatis, N. Link discovery using graph feature tracking. Proceedings of Neural Information Processing Systems (NIPS), 2010.
- Richard et al. (2012) Richard, E., Savalle, P.-A., and Vayatis, N. Estimation of simultaneously sparse and low-rank matrices. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.
- Sarkar et al. (2010) Sarkar, P., Chakrabarti, D., and Moore, A.W. Theoretical justification of popular link prediction heuristics. In International Conference on Learning Theory (COLT), pp. 295–307, 2010.
- Taskar et al. (2003) Taskar, B., Wong, M.F., Abbeel, P., and Koller, D. Link prediction in relational data. In Neural Information Processing Systems, volume 15, 2003.