Graph Prediction in a Low-Rank and Autoregressive Setting

05/07/2012 ∙ by Emile Richard, et al.

We study the problem of prediction for evolving graph data. We formulate the problem as the minimization of a convex objective encouraging sparsity and low rank of the solution, properties that reflect natural graph structure. The convex formulation allows us to obtain oracle inequalities and efficient solvers. We provide empirical results for our algorithm and comparisons with competing methods, and point out two open questions related to compressed sensing and to the algebra of low-rank and sparse matrices.


1 Introduction

We study the prediction problem where the observation is a sequence of graph adjacency matrices $A_1, \dots, A_T$ and the goal is to predict $A_{T+1}$. This type of problem arises in applications such as recommender systems where, given information on purchases made by some users, one would like to predict future purchases. In this context, users and products can be modeled as the nodes of a bipartite graph, while purchases or clicks are modeled as edges.

In functional genomics and systems biology, regulatory networks can be estimated from gene expression data by modeling the data as graphs, and fitting predictive models is a natural way to estimate evolving networks in these contexts.

A large variety of methods for link prediction only consider predicting from a single static snapshot of the graph; these include heuristics (Liben-Nowell & Kleinberg, 2007; Sarkar et al., 2010), matrix factorization (Koren, 2008), and probabilistic methods (Taskar et al., 2003). More recently, some works have investigated using sequences of observations of the graph to improve the prediction, for instance regression on features extracted from the graphs (Richard et al., 2010), matrix factorization (Koren, 2010), or, in some special cases, probabilistic techniques. Most techniques, however, do not explicitly take into account the inherently sparse nature of typical sequences of adjacency matrices. In this work, we extend the work of (Richard et al., 2010) to address this point and, in addition, propose a more principled way of predicting from features extracted from the sequence of graph snapshots.

We make the following assumptions about the graph sequence (represented by the adjacency matrices $(A_t)_{t \ge 1}$):

  1. Low-rank. $A_t$ has low rank. This reflects the presence of highly connected groups of nodes, such as communities in social networks.

  2. Autoregressive linear features. We assume we are given a linear map $\omega : \mathbb{R}^{n \times n} \to \mathbb{R}^d$ (where $n$ is the number of nodes and $d$ the number of features), defined by a set of matrices $\Omega_1, \dots, \Omega_d$,

    $\omega(A) = \big( \langle \Omega_1, A \rangle, \dots, \langle \Omega_d, A \rangle \big), \qquad (1)$

    such that the vector time series $(\omega(A_t))_{t \ge 1}$ has an autoregressive evolution,

    $\omega(A_{t+1}) = W_0\, \omega(A_t) + N_{t+1},$

    where $W_0$ is a sparse matrix such that the process is stationary. An example of linear features is the vector of node degrees, a popularity measure in social and commerce networks (a small simulation sketch follows this list).
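To make assumption 2 concrete, the sketch below simulates a stable, sparse VAR(1) evolution of a feature vector in numpy; the dimensions, sparsity level, and noise scale are illustrative assumptions rather than values from the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    d, T = 50, 30                             # feature dimension, number of snapshots (illustrative)

    # A sparse autoregressive matrix W0 (hypothetical values), rescaled to keep the process stable.
    W0 = rng.normal(size=(d, d)) * (rng.random((d, d)) < 0.05)
    W0 /= max(1.0, 1.1 * np.abs(np.linalg.eigvals(W0)).max())

    # omega(A_t) could, for instance, be the degree vector A_t @ 1; here we simulate the
    # feature time series directly:  omega(A_{t+1}) = W0 @ omega(A_t) + noise.
    feats = [rng.random(d)]
    for _ in range(T - 1):
        feats.append(W0 @ feats[-1] + 0.01 * rng.normal(size=d))
    X = np.array(feats)                        # row t holds omega(A_t)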

2 Formulation of an optimization problem

In order to reflect the stationarity assumption on the feature time series $(\omega(A_t))_t$, we use a convex loss function $\ell$ to penalize the dissimilarity between two feature vectors at successive time steps. Let us introduce the matrices of stacked features

$X_{T-1} = \big( \omega(A_1), \dots, \omega(A_{T-1}) \big)^\top \quad \text{and} \quad X_T = \big( \omega(A_2), \dots, \omega(A_T) \big)^\top.$

We also use $\ell$ to denote the elementwise extension of the loss to matrices. In the case of the quadratic loss, we consider the following penalized regression objective in the autoregressive parameter $W$:

$J_1(W) = \frac{1}{T-1} \big\| X_T - X_{T-1} W \big\|_F^2 + \kappa \|W\|_1.$

To predict $A_{T+1}$, we propose a regression term penalized by the sum of the $\ell_1$-norm and the trace norm, in the same fashion as in (Richard et al., 2012), so that the predicted graph is simultaneously sparse and low-rank while its features approximate well the forecast $W^\top \omega(A_T)$:

$J_2(A, W) = \big\| \omega(A) - W^\top \omega(A_T) \big\|_2^2 + \tau \|A\|_* + \gamma \|A\|_1.$

The overall objective function we consider here is the sum of the two partial objectives, $J(A, W) = J_1(W) + J_2(A, W)$, which is jointly convex since $J_1$ and $J_2$ are both convex.

Let us introduce the linear map $\Phi$ acting on the joint variable $(A, W)$ and stacking the two data-fitting terms above. The objective $J$ can then be written as a penalized least-squares regression on the joint variable $(A, W)$, with the combined penalty $\kappa \|W\|_1 + \tau \|A\|_* + \gamma \|A\|_1$.
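Under the quadratic-loss form written above, the objective is straightforward to evaluate numerically. The sketch below only illustrates that form: the normalization, the argument names, and the default weights kappa, tau, gamma are assumptions, not values from the paper.

    import numpy as np

    def nuclear_norm(A):
        """Trace norm: the sum of singular values."""
        return np.linalg.svd(A, compute_uv=False).sum()

    def objective(A, W, X_prev, X_next, feat_last, feature_map,
                  kappa=0.1, tau=0.1, gamma=0.1):
        """J(A, W) = J1(W) + J2(A, W) under the quadratic loss (illustrative constants).

        X_prev    -- (T-1, d) matrix whose rows are omega(A_1), ..., omega(A_{T-1})
        X_next    -- (T-1, d) matrix whose rows are omega(A_2), ..., omega(A_T)
        feat_last -- omega(A_T), the last observed feature vector
        """
        j1 = (np.linalg.norm(X_next - X_prev @ W, "fro") ** 2 / X_prev.shape[0]
              + kappa * np.abs(W).sum())
        j2 = (np.linalg.norm(feature_map(A) - W.T @ feat_last) ** 2
              + tau * nuclear_norm(A)
              + gamma * np.abs(A).sum())
        return j1 + j2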

3 Oracle inequality

Define , i.e.,

and

We define , where are defined in (1) and let

We define and such that they verify

The following result can be proved using the tools introduced in (Koltchinskii et al., 2011).

Proposition 1.

Let be the minimizers of over a convex cone . Suppose that

  1. for some , and for any and ,

  2. , , for any real number ;

then

(2)

The latter inequality shows how the quality of the solution is bounded by the rank and sparsity of the future graph $A_{T+1}$, and by the interplay between these two priors through the parameter balancing them. The dependence on the number of observations $T$ quantifies how the estimation improves as more snapshots are observed.

4 Algorithms

4.1 Generalized forward-backward algorithm for minimizing $J(A, W)$

We use the algorithm designed in (Raguet et al., 2011) to minimize our objective function. Note that this algorithm outperforms the method introduced in (Richard et al., 2010), as it minimizes $J$ directly and jointly in $(A, W)$, whereas the previous method first estimates $W$ by minimizing a functional similar to $J_1$ and then minimizes $J_2$ with $W$ fixed.

In addition to this, we use the novel joint penalty from (Richard et al., 2012), which is better suited to estimating graphs. The proximal operator for the trace norm is given by singular value shrinkage: if $Z = U \operatorname{diag}(\sigma_1, \dots, \sigma_n) V^\top$ is the singular value decomposition of $Z$, then

$\operatorname{prox}_{\tau \|\cdot\|_*}(Z) = U \operatorname{diag}\big( (\sigma_i - \tau)_+ \big) V^\top.$

Similarly, the proximal operator for the $\ell_1$-norm is the soft-thresholding operator, defined using the entrywise product of matrices, denoted by $\odot$:

$\operatorname{prox}_{\gamma \|\cdot\|_1}(Z) = \operatorname{sgn}(Z) \odot \big( |Z| - \gamma \big)_+.$
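Both proximal operators are standard and can be implemented in a few lines of numpy; the sketch below follows the formulas above directly.

    import numpy as np

    def prox_trace_norm(Z, tau):
        """Singular value shrinkage: the proximal operator of tau * ||.||_*."""
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        return (U * np.maximum(s - tau, 0.0)) @ Vt

    def prox_l1(Z, gamma):
        """Entrywise soft thresholding: the proximal operator of gamma * ||.||_1."""
        return np.sign(Z) * np.maximum(np.abs(Z) - gamma, 0.0)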

The algorithm converges under very mild conditions when the step size is smaller than $2/L$, where $L$ is the Lipschitz constant of the gradient of the quadratic part (proportional to the squared operator norm of $\Phi$).

  Initialize
  repeat
     Compute .
     Compute
     Compute
     Set
     Set
  until convergence
  return minimizing
Algorithm 1 Generalized Forward-Backward to Minimize $J(A, W)$
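For concreteness, the following sketch applies a generalized forward-backward iteration (in the spirit of Raguet et al., 2011) to the $A$-part of the objective with $W$ held fixed, reusing prox_trace_norm and prox_l1 from the sketch above. It is a simplified stand-in, not the authors' Algorithm 1: the auxiliary-variable bookkeeping, the default step size, and the restriction to $A$ alone are assumptions made to keep the example short.

    import numpy as np

    def gfb_predict(b, feature_map, feature_map_adjoint, n,
                    tau=0.1, gamma=0.1, step=0.5, n_iter=200):
        """Generalized forward-backward for
            min_A ||feature_map(A) - b||^2 + tau*||A||_* + gamma*||A||_1,
        where b is the forecast feature vector W^T omega(A_T).
        feature_map / feature_map_adjoint: the linear map omega and its adjoint.
        The step size must satisfy the step-size condition stated above.
        """
        A = np.zeros((n, n))
        Z = [A.copy(), A.copy()]                   # one auxiliary variable per non-smooth term
        proxes = (lambda X, s: prox_trace_norm(X, tau * s),
                  lambda X, s: prox_l1(X, gamma * s))
        for _ in range(n_iter):
            grad = 2.0 * feature_map_adjoint(feature_map(A) - b)
            for i, prox in enumerate(proxes):
                Z[i] = Z[i] + prox(2 * A - Z[i] - step * grad, 2 * step) - A
            A = 0.5 * (Z[0] + Z[1])                # average of the auxiliary variables
        return A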

4.2 Non-convex Factorization Method

An alternative to estimating simultaneously sparse and low-rank matrices by a mixed penalty of the form $\tau \|A\|_* + \gamma \|A\|_1$, as in (Richard et al., 2012), is to factorize $A = U V^\top$, where $U$ and $V$ are sparse matrices, and to penalize $\|U\|_1 + \|V\|_1$. The objective function to be minimized is

which is a non-convex function of the joint variable $(U, V, W)$, making the theoretical analysis more difficult. Since the objective is convex in a neighborhood of the solution, by initializing the variables adequately we can derive an algorithm inspired by proximal gradient descent for minimizing it.

  Initialize
  repeat
     Compute .
     Set
     Set
     Set
  until convergence
  return minimizing
Algorithm 2 Minimize the factorized (non-convex) objective
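As with Algorithm 1, the sketch below illustrates the idea on a simplified instance: it applies alternating proximal gradient steps to $\min_{U,V} \|U V^\top - B\|_F^2 + \gamma (\|U\|_1 + \|V\|_1)$ for a fixed target matrix $B$, leaving out the feature map and the variable $W$. The step size, initialization scale, and penalty weight are illustrative assumptions.

    import numpy as np

    def prox_l1(Z, t):
        """Entrywise soft thresholding."""
        return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

    def sparse_factorization(B, r, gamma=0.05, step=1e-3, n_iter=500, seed=0):
        """Alternating proximal gradient on min_{U,V} ||U V^T - B||_F^2 + gamma(||U||_1 + ||V||_1)."""
        rng = np.random.default_rng(seed)
        n, m = B.shape
        U = 0.1 * rng.normal(size=(n, r))
        V = 0.1 * rng.normal(size=(m, r))
        for _ in range(n_iter):
            R = U @ V.T - B                                      # current residual
            U = prox_l1(U - step * 2 * R @ V, step * gamma)      # gradient step in U, then prox
            R = U @ V.T - B
            V = prox_l1(V - step * 2 * R.T @ U, step * gamma)    # gradient step in V, then prox
        return U, V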

5 Numerical Experiments

5.1 A generative model for graphs having linearly autoregressive features

Let $V_0 \in \mathbb{R}^{n \times r}$ be a sparse matrix and $V_0^\dagger$ its pseudo-inverse, such that $V_0^\dagger V_0 = I_r$. Fix two sparse matrices $W_0 \in \mathbb{R}^{r \times r}$ and $U_0 \in \mathbb{R}^{n \times r}$. Now define the sequence of matrices $(A_t)_{t \ge 0}$ by

$U_{t+1} = U_t W_0 + N_{t+1} \quad \text{and} \quad A_t = U_t V_0^\top + M_t$

for i.i.d. sparse noise matrices $N_t$ and $M_t$, which means that for any pair of indices $(i, j)$, with high probability $(N_t)_{ij} = 0$ and $(M_t)_{ij} = 0$.

If we define the linear feature map $\omega$ through $V_0^\dagger$, so that $\omega(A_t)$ recovers $U_t$ up to noise, note that

  1. The sequence $(\omega(A_t))_t$ follows a linear autoregressive relation driven by the sparse matrix $W_0$, as in assumption 2.

  2. For any time index $t$, the matrix $A_t$ is close to $U_t V_0^\top$, which has rank at most $r$.

  3. $A_t$ is sparse, and furthermore $W_0$ is sparse (see the simulation sketch below).
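The construction above can be simulated directly. In the sketch below, the dimensions, sparsity densities, and noise scale are illustrative; the sparse noise is obtained by zeroing out most entries of Gaussian draws, which is one way (not necessarily the authors') to realize the model.

    import numpy as np

    def sparse_gaussian(rng, shape, density=0.1):
        """i.i.d. Gaussian entries with most of them zeroed out (sparse matrices / noise)."""
        return rng.normal(size=shape) * (rng.random(shape) < density)

    def generate_sequence(n=50, r=5, T=30, noise=0.05, seed=0):
        """Graphs with (approximately) linearly autoregressive features:
        U_{t+1} = U_t W0 + N_{t+1},  A_t = U_t V0^T + M_t."""
        rng = np.random.default_rng(seed)
        V0 = sparse_gaussian(rng, (n, r), 0.3)
        W0 = sparse_gaussian(rng, (r, r), 0.5)
        W0 /= max(1.0, 1.1 * np.abs(np.linalg.eigvals(W0)).max())   # keep the dynamics stable
        U = sparse_gaussian(rng, (n, r), 0.3)
        A_seq = []
        for _ in range(T):
            A_seq.append(U @ V0.T + noise * sparse_gaussian(rng, (n, n)))
            U = U @ W0 + noise * sparse_gaussian(rng, (n, r))
        return A_seq, V0, W0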

5.2 Results

We tested the presented methods on synthetic data generated as in Section 5.1. In our experiments, the noise matrices $M_t$ and $N_t$ were built by soft-thresholding i.i.d. Gaussian noise. After choosing the parameters by 10-fold cross-validation, we compare our methods to standard baselines in link prediction (Liben-Nowell & Kleinberg, 2007). We use the area under the ROC curve as the measure of performance and report empirical results averaged over 10 runs. Nearest Neighbors (NN) relies on the number of common friends between each pair of nodes, which is given by $S^2$ when $S$ is the cumulative graph adjacency matrix, and we denote by Shrink the low-rank approximation of $S$. Since $V_0$ is unknown, we consider the feature map built from the truncated SVD of $S$, i.e., the projection onto the space spanned by its top singular vectors.

Method                                   AUC
NN                                       0.8691 ± 0.0168
Shrink                                   0.8739 ± 0.0169
Convex estimator (Section 4.1)           0.9094 ± 0.0176
Non-convex factorization (Section 4.2)   0.9454 ± 0.0087

Table 1: Performance of the algorithms in terms of Area Under the ROC Curve (averaged over 10 runs).
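The two baselines of Table 1 can be scored as follows; the use of scikit-learn's roc_auc_score, the rank of the shrinkage, and the scoring of all node pairs at once are illustrative choices for a single synthetic run.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def baseline_auc(A_seq, A_next, k=5):
        """AUC of the NN and Shrink baselines for predicting A_{T+1} from A_1, ..., A_T.
        Assumes A_next contains both edges and non-edges."""
        S = sum(A_seq)                                   # cumulative adjacency matrix
        y_true = (np.abs(A_next) > 0).astype(int).ravel()

        nn_scores = (S @ S).ravel()                      # NN: common-neighbour counts

        U, s, Vt = np.linalg.svd(S)                      # Shrink: rank-k approximation of S
        shrink_scores = ((U[:, :k] * s[:k]) @ Vt[:k, :]).ravel()

        return {"NN": roc_auc_score(y_true, nn_scores),
                "Shrink": roc_auc_score(y_true, shrink_scores)}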

6 Discussion

The experiments suggest the empirical superiority of the proposed approaches over the standard baselines. It is intriguing that the non-convex matrix factorization outperforms its convex rival. A possible explanation is that minimizing the nuclear norm through the shrinkage operator produces factorizations of the solution into two orthogonal matrices, which conflicts with the sparsity of the solution. The other benefit of the non-convex formulation is its scalability: the proximal method proposed for the convex formulation must store dense $n \times n$ iterates and compute a full SVD at every step, which is costly in both storage and time. Several questions open perspectives for further investigation.

  1. Choice of the feature map $\omega$. In the current work we used the projection onto the space spanned by the top-$k$ singular vectors of the cumulative adjacency matrix as the linear map $\omega$, and this choice showed empirical superiority over other choices. The question of choosing the best measurements for summarizing graph information, as in compressed sensing, seems to have both theoretical and practical potential.

  2. Characterization of sparse and low-rank matrices. Can every sparse and low-rank matrix be written as a product $U V^\top$ where $U$ and $V$ are both sparse? In other terms, what is the relation between the solutions of problems penalized by a mixed norm such as $\tau \|A\|_* + \gamma \|A\|_1$ and those penalized by the factored penalty $\|U\|_1 + \|V\|_1$?

Appendix: Proof of Proposition 1

Appendix A Preliminary Tools

Definition 1 (Orthogonal projections associated with ).

Let be a rank matrix. We can write the SVD of in two ways: or , where are orthogonal and . Let and matrices of size ortho-normally completing the bases of and , and define the projections onto the vector spaces spanned by vectors and for :

and define the orthogonal projection

We highlight the fact that can also be written as

or

We know that and . The two following inequalities will also be useful:

Lemma 1 (Rank inequalities).

For any matrix ,

Lemma 2 (Orthogonality of the decomposition).

For any matrix , with the same notations, we have

and the 4 terms are pairwise orthogonal. It follows that

  1. We have the identity

  2. It follows that
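Definition 1 and Lemmas 1–2 set up the projection machinery that is standard in trace-norm analyses (e.g., Koltchinskii et al., 2011). The following numerical check assumes the usual definitions $\mathcal{P}_A^\perp(B) = (I - P_U) B (I - P_V)$ and $\mathcal{P}_A(B) = B - \mathcal{P}_A^\perp(B)$, where $P_U$ and $P_V$ project onto the column and row spaces of $A$; whether these match the authors' exact notation is an assumption.

    import numpy as np

    rng = np.random.default_rng(0)
    n, r = 20, 3
    A = rng.normal(size=(n, r)) @ rng.normal(size=(r, n))    # a rank-r matrix
    B = rng.normal(size=(n, n))                               # an arbitrary test matrix

    U, s, Vt = np.linalg.svd(A)
    Pu, Pv = U[:, :r] @ U[:, :r].T, Vt[:r].T @ Vt[:r]         # projections onto col/row spaces of A

    P_perp = (np.eye(n) - Pu) @ B @ (np.eye(n) - Pv)          # P_A^perp(B)
    P_para = B - P_perp                                        # P_A(B)

    print(np.allclose(B, P_para + P_perp))                     # the decomposition of Lemma 2 holds
    print(abs(np.sum(P_para * P_perp)) < 1e-8)                 # the two parts are orthogonal
    print(np.linalg.matrix_rank(P_para) <= 2 * r)              # rank bound in the spirit of Lemma 1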

Appendix B Proof

We have for any , by optimality of :

(3)

Thanks to trace-duality and -duality we have and for any , so for any ,

(4)

Now, using the assumptions of Proposition 1 and then the triangle inequality,

To prove the other bound, we start by setting some notation. Let , and let , , . Let be the SVD of and let , where , are sign matrices of , and such that , and is the entry-wise product. Let

denote an element of the subgradient of the convex function , so and . There exist and such that

We use the two standard inequalities of convex function subdifferentials and and a similar inequality on subdifferentials on and of , denoted by and . We get

(5)

Therefore we obtain

The inequality (5) can be written as

(6)

Thanks to Cauchy-Schwarz

similarly

and

so we have

(7)

We need to bound . For this, note that by definition,

and decompose for any

We get, by applying the triangle inequality, Cauchy–Schwarz, and Hölder's inequality written for the trace norm and for the $\ell_1$-norm,

by rank and support inequalities obtained again by Cauchy-Schwarz

(8)

Now by using

we can rewrite the inequality (7) as follows:

(9)

the last inequality being due to the assumptions , and .

Finally, using these assumptions again,

(10)

and gives

and the result follows by using

and setting :

References

  • Koltchinskii et al. (2011) Koltchinskii, V., Lounici, K., and Tsybakov, A. Nuclear norm penalization and optimal rates for noisy matrix completion. Annals of Statistics, 2011.
  • Koren (2008) Koren, Y. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 426–434. ACM, 2008.
  • Koren (2010) Koren, Y. Collaborative filtering with temporal dynamics. Communications of the ACM, 53(4):89–97, 2010.
  • Liben-Nowell & Kleinberg (2007) Liben-Nowell, D. and Kleinberg, J. The link-prediction problem for social networks. Journal of the American society for information science and technology, 58(7):1019–1031, 2007.
  • Raguet et al. (2011) Raguet, H., Fadili, J., and Peyré, G. Generalized forward-backward splitting. Arxiv preprint arXiv:1108.4404, 2011.
  • Richard et al. (2010) Richard, E., Baskiotis, N., Evgeniou, Th., and Vayatis, N. Link discovery using graph feature tracking. Proceedings of Neural Information Processing Systems (NIPS), 2010.
  • Richard et al. (2012) Richard, E., Savalle, P.-A., and Vayatis, N. Estimation of simultaneously sparse and low-rank matrices. In Proceedings of the 29th Annual International Conference on Machine Learning, 2012.
  • Sarkar et al. (2010) Sarkar, P., Chakrabarti, D., and Moore, A.W. Theoretical justification of popular link prediction heuristics. In International Conference on Learning Theory (COLT), pp. 295–307, 2010.
  • Taskar et al. (2003) Taskar, B., Wong, M.F., Abbeel, P., and Koller, D. Link prediction in relational data. In Neural Information Processing Systems, volume 15, 2003.