# Graph Prediction in a Low-Rank and Autoregressive Setting

We study the problem of prediction for evolving graph data. We formulate the problem as the minimization of a convex objective encouraging sparsity and low-rank of the solution, that reflect natural graph properties. The convex formulation allows to obtain oracle inequalities and efficient solvers. We provide empirical results for our algorithm and comparison with competing methods, and point out two open questions related to compressed sensing and algebra of low-rank and sparse matrices.


## 1 Introduction

We study the prediction problem where the observation is a sequence of graph adjacency matrices $A_0, A_1, \ldots, A_T$ and the goal is to predict $A_{T+1}$. This type of problem arises in applications such as recommender systems where, given information on purchases made by some users, one would like to predict future purchases. In this context, users and products can be modeled as the nodes of a bipartite graph, while purchases or clicks are modeled as edges.

In functional genomics and systems biology, regulatory networks in gene expression can be estimated by modeling the data as graphs, and fitting predictive models is a natural way to estimate such evolving networks.

A large variety of methods for link prediction only consider a single static snapshot of the graph; this includes heuristics (Liben-Nowell & Kleinberg, 2007; Sarkar et al., 2010), matrix factorization (Koren, 2008), and probabilistic methods (Taskar et al., 2003). More recently, some works have investigated using sequences of observations of the graph to improve the prediction, for instance by regressing on features extracted from the graphs (Richard et al., 2010), by matrix factorization (Koren, 2010), or, in some special cases, by probabilistic techniques. Most techniques, however, do not explicitly take into account the inherently sparse nature of usual sequences of adjacency matrices. In this work, we extend the work of (Richard et al., 2010) to address this, and in addition propose a more principled way of predicting using features extracted from the sequence of graph snapshots.

We make the following assumptions about the graph sequence, represented by its adjacency matrices $(A_t)_{0 \le t \le T}$:

1. **Low-rank.** $A_t$ has low rank. This reflects the presence of highly connected groups of nodes, such as communities in social networks.

2. **Autoregressive linear features.** We assume given a linear map $\omega : \mathbb{R}^{n \times n} \to \mathbb{R}^d$ defined by a set of matrices $\Omega_1, \ldots, \Omega_d$,

$$\omega(S) = \big(\langle \Omega_1, S \rangle, \cdots, \langle \Omega_d, S \rangle\big) \tag{1}$$

such that the vector time series $(\omega(A_t))_{t \ge 0}$ has an autoregressive evolution:

$$\omega(A_{t+1}) = W_0^\top\, \omega(A_t) + N_t$$

where $W_0 \in \mathbb{R}^{d \times d}$ is a sparse matrix such that the series is stationary. An example of linear features is the vector of node degrees, a popularity measure in social and commerce networks.
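As an illustration of such a feature map, degree features correspond to indicator matrices $\Omega_i$ whose $i$-th row is all ones. A minimal NumPy sketch (the function and variable names are ours, not from any reference implementation):

```python
import numpy as np

def feature_map(S, omegas):
    """omega(S) = (<Omega_1, S>, ..., <Omega_d, S>) for a list of matrices."""
    return np.array([np.sum(O * S) for O in omegas])

# Degree features: Omega_i has ones on row i, so <Omega_i, S> is the
# degree of node i (row sum of the adjacency matrix).
n = 4
omegas = []
for i in range(n):
    O = np.zeros((n, n))
    O[i, :] = 1.0
    omegas.append(O)

A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(feature_map(A, omegas))  # degrees: [2. 1. 2. 1.]
```

Any fixed collection of measurement matrices $\Omega_i$ defines a valid feature map in the same way.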

## 2 Formulation of an optimization problem

In order to reflect the stationarity assumption on $(\omega(A_t))_t$, we use a convex loss function

$$\ell : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}_+$$

to penalize the dissimilarity between two feature vectors at successive time steps. Let us introduce

$$X_{T-1} = \begin{pmatrix} \omega(A_0)^\top \\ \omega(A_1)^\top \\ \vdots \\ \omega(A_{T-1})^\top \end{pmatrix} \in \mathbb{R}^{T \times d} \quad \text{and} \quad X_T = \begin{pmatrix} \omega(A_1)^\top \\ \omega(A_2)^\top \\ \vdots \\ \omega(A_T)^\top \end{pmatrix} \in \mathbb{R}^{T \times d}.$$

We also write $\ell$ for the elementwise extension of $\ell$ to matrices. In the case of the quadratic loss, we consider the following penalized regression objective:

$$J_1(W) = \frac{1}{d}\|X_{T-1}W - X_T\|_2^2 + \kappa\,\|W\|_1.$$

To predict $A_{T+1}$, we propose a regression term penalized by the sum of the $\ell_1$ norm and the trace norm, in the same fashion as in (Richard et al., 2012): the future graph $S$ should be such that $\omega(S)$ approximates well the prediction of its features:

$$J_2(S, W) = \frac{1}{d}\|\omega(A_T)^\top W - \omega(S)^\top\|_2^2 + \tau\|S\|_* + \gamma\|S\|_1.$$

The overall objective function we consider here is the sum of the two partial objectives $J_1$ and $J_2$, which is convex as $J_1$ and $J_2$ are both convex:

$$\mathcal{L}(S, W) \doteq \frac{1}{d}\|X_{T-1}W - X_T\|_2^2 + \kappa\,\|W\|_1 + \frac{1}{d}\|\omega(A_T)^\top W - \omega(S)^\top\|_2^2 + \tau\|S\|_* + \gamma\|S\|_1.$$

Let us introduce the linear map $\Phi$ defined by

$$\Phi(S, W) = \big(X_{T-1}W,\ \omega(S)^\top - \omega(A_T)^\top W\big).$$

The objective can be written as a penalized least squares regression on the joint variable $(S, W)$:

$$\mathcal{L}(S, W) = \frac{1}{d}\,\|\Phi(S, W) - (X_T, 0)\|_2^2 + \gamma\|S\|_1 + \tau\|S\|_* + \kappa\|W\|_1.$$
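The objective is straightforward to evaluate numerically. The sketch below assumes the quadratic loss and treats the feature map as a callable; all names are illustrative and not from a reference implementation:

```python
import numpy as np

def nuclear_norm(S):
    """Trace norm: sum of singular values."""
    return np.linalg.svd(S, compute_uv=False).sum()

def objective(S, W, X_prev, X_next, omega, A_T, kappa, tau, gamma):
    """Evaluate L(S, W): autoregressive fit on the features, feature-matching
    term for the predicted graph S, plus sparse and low-rank penalties."""
    d = X_prev.shape[1]
    ar_term = np.linalg.norm(X_prev @ W - X_next) ** 2 / d
    fit_term = np.linalg.norm(W.T @ omega(A_T) - omega(S)) ** 2 / d
    return (ar_term + kappa * np.abs(W).sum()
            + fit_term + tau * nuclear_norm(S) + gamma * np.abs(S).sum())
```

Both squared-error terms are smooth in $(S, W)$, while the three penalties are nonsmooth, which is what motivates the proximal algorithms of Section 4.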

## 3 Oracle inequality

Define $\delta$ as the matrix of in-sample noise terms, i.e.,

$$\delta = X_T - X_{T-1}W_0,$$

and

$$\epsilon = W_0^\top \omega(A_T) - \omega(A_{T+1}).$$

We define $M = \sum_{j=1}^d \epsilon_j\,\Omega_j$, where the $\Omega_j$ are defined in (1), and let

$$\Xi = X_{T-1}^\top \delta - \omega(A_T)\,\epsilon^\top.$$

The pair $(M, \Xi)$ is defined so that it verifies

$$\big\langle (\delta, \epsilon), \Phi(S, W) \big\rangle = \big\langle (M, \Xi), (S, W) \big\rangle.$$
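One possible way to see where such a pair $(M, \Xi)$ comes from is to expand the left-hand side of the pairing, treating the feature vectors $\epsilon$ and $\omega(A_T)$ as columns (a sketch of a routine computation):

```latex
\begin{aligned}
\langle (\delta,\epsilon),\, \Phi(S,W) \rangle
  &= \langle \delta,\, X_{T-1}W \rangle
   + \langle \epsilon,\, \omega(S) - W^\top \omega(A_T) \rangle \\
  &= \langle X_{T-1}^\top \delta,\, W \rangle
   + \sum_{j=1}^d \epsilon_j \langle \Omega_j, S \rangle
   - \langle \omega(A_T)\,\epsilon^\top,\, W \rangle \\
  &= \Big\langle \sum_{j=1}^d \epsilon_j\,\Omega_j,\, S \Big\rangle
   + \langle X_{T-1}^\top \delta - \omega(A_T)\,\epsilon^\top,\, W \rangle,
\end{aligned}
```

so the choices $M = \sum_j \epsilon_j \Omega_j$ and $\Xi = X_{T-1}^\top\delta - \omega(A_T)\epsilon^\top$ make the identity hold for all $(S, W)$.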

The following result can be proved using the tools introduced in (Koltchinskii et al., 2011).

###### Proposition 1.

Let $(\hat{S}, \hat{W})$ be the minimizers of $\mathcal{L}$ over a convex cone $\mathcal{C}$. Suppose that

1. for some $\mu > 0$, and for any $(S_1, W_1)$ and $(S_2, W_2)$ in $\mathcal{C}$,

$$\frac{1}{d}\|\Phi(S_1 - S_2, W_1 - W_2)\|_2^2 \ \ge\ \mu^{-2}\big(\|S_1 - S_2\|_F^2 + T\|W_1 - W_2\|_F^2\big);$$

2. $\tau \ge 2\alpha\|M\|_{\mathrm{op}}$, $\gamma \ge 2(1 - \alpha)\|M\|_\infty$ and $\kappa \ge 2\|\Xi\|_\infty$, for some real number $\alpha \in (0, 1)$;

then

$$\|\hat{S} - A_{T+1}\|_F^2 + T\|\hat{W} - W_0\|_F^2 \le \mu^2 \min\left\{\frac{\mu^2}{d}\Big(\tau\sqrt{\mathrm{rank}(A_{T+1})}\,\frac{\sqrt{2}+1}{2} + \gamma\sqrt{\|A_{T+1}\|_0}\Big)^{\!2} + \frac{\mu^2\kappa^2}{dT}\|W_0\|_0,\ \ 2\tau\|A_{T+1}\|_* + 2\gamma\|A_{T+1}\|_1 + 2\kappa\|W_0\|_1\right\}. \tag{2}$$

The latter inequality shows how the quality of the solution is controlled by the rank and sparsity of the future graph $A_{T+1}$, and by the interplay between these two priors through the parameter $\alpha$. The dependence on $T$ quantifies the improvement of the estimation of $W_0$ in terms of the number of observations.

## 4 Algorithms

### 4.1 Generalized forward-backward algorithm for minimizing L

We use the algorithm designed in (Raguet et al., 2011) for minimizing our objective function. Note that this algorithm outperforms the method introduced in (Richard et al., 2010), as it minimizes $\mathcal{L}$ jointly in $(S, W)$, whereas the previous method first estimates $W$ by minimizing a functional similar to $J_1$ and then minimizes $J_2$.

In addition, we use the novel joint penalty from (Richard et al., 2012), which is more suited for estimating graphs. The proximal operator for the trace norm is given by the shrinkage operation: if $Z = U\,\mathrm{diag}(\sigma_1, \cdots, \sigma_n)\,V^\top$ is the singular value decomposition of $Z$,

$$\mathrm{prox}_{\tau\|\cdot\|_*}(Z) = U\,\mathrm{diag}\big((\sigma_i - \tau)_+\big)_i\,V^\top.$$

Similarly, the proximal operator for the $\ell_1$ norm is the soft thresholding operator, defined using the entrywise product of matrices, denoted by $\circ$:

$$\mathrm{prox}_{\gamma\|\cdot\|_1}(Z) = \mathrm{sgn}(Z) \circ (|Z| - \gamma)_+.$$

The algorithm converges under very mild conditions when the step size is smaller than $2/L$, where $L$ is the Lipschitz constant of the gradient of the smooth part of the objective.
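Both proximal operators are a few lines of NumPy. This is a sketch of the two operators as defined above, not of the full generalized forward-backward scheme:

```python
import numpy as np

def prox_trace(Z, tau):
    """Singular-value shrinkage: proximal operator of tau * nuclear norm."""
    U, sigma, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(sigma - tau, 0.0)) @ Vt

def prox_l1(Z, gamma):
    """Entrywise soft thresholding: proximal operator of gamma * l1 norm."""
    return np.sign(Z) * np.maximum(np.abs(Z) - gamma, 0.0)
```

Singular-value shrinkage reduces the rank (singular values below $\tau$ vanish), while soft thresholding sparsifies the entries, which is exactly the pair of structures the penalty encourages.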

### 4.2 Non-convex Factorization Method

An alternative to penalizing the mixed norm $\tau\|S\|_* + \gamma\|S\|_1$ for the estimation of low-rank and sparse matrices, as in (Richard et al., 2012), is to factorize $S = UV^\top$, where $U$ and $V$ are sparse matrices, and to penalize $\|U\|_1 + \|V\|_1$. The objective function to be minimized is

$$J(U, V, W) \doteq \frac{1}{d}\|X_{T-1}W - X_T\|_F^2 + \kappa\,\|W\|_1 + \frac{1}{d}\|\omega(A_T)^\top W - \omega(UV^\top)^\top\|_2^2 + \gamma\big(\|U\|_1 + \|V\|_1\big),$$

which is a non-convex function of the joint variable $(U, V, W)$, making the theoretical analysis more difficult. Given that the objective is convex in a neighborhood of the solution, by initializing the variables adequately we can write an algorithm inspired by proximal gradient descent for minimizing it.

## 5 Numerical Experiments

### 5.1 A generative model for graphs having linearly autoregressive features

Let $V_0 \in \mathbb{R}^{n \times r}$ be a sparse matrix and $V_0^\dagger$ its pseudo-inverse, such that $V_0^\dagger V_0 = I_r$. Fix two sparse matrices $W_0 \in \mathbb{R}^{r \times r}$ and $U_0 \in \mathbb{R}^{n \times r}$. Now define the sequence of matrices $(A_t)_{t \ge 1}$ by

$$U_t = U_{t-1}W_0 + N_t \quad \text{and} \quad A_t = U_t V_0^\top + M_t$$

for i.i.d. sparse noise matrices $N_t$ and $M_t$, which means that for any pair of indices $(i, j)$, with high probability $(N_t)_{i,j} = 0$ and $(M_t)_{i,j} = 0$.

If we define the linear feature map $\omega(A) = A V_0^{\dagger\top}$, note that

1. The sequence $(\omega(A_t))_t$ follows the linear autoregressive relation

$$\omega(A_t) = \omega(A_{t-1})\,W_0 + N_t + M_t V_0^{\dagger\top}$$

(up to a residual term $M_{t-1}V_0^{\dagger\top}W_0$, which is also sparse).

2. For any time index $t$, the matrix $A_t$ is close to $U_t V_0^\top$, which has rank at most $r$.

3. $A_t$ is sparse, and furthermore $W_0$ is sparse.
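The generative model above can be instantiated in a few lines of NumPy; the dimensions, densities, and noise scales below are arbitrary choices of ours, not the values used in the paper:

```python
import numpy as np

rng = np.random.default_rng(42)
n, r, T = 30, 3, 50

def sparse_gaussian(shape, density=0.2, scale=1.0):
    """Gaussian matrix with most entries zeroed out (sparse with high probability)."""
    return rng.normal(scale=scale, size=shape) * (rng.random(shape) < density)

V0 = sparse_gaussian((n, r), density=0.3)
W0 = sparse_gaussian((r, r), density=0.5, scale=0.3)  # small scale keeps the AR series stable

U = [sparse_gaussian((n, r))]                         # U_0
A = []
for t in range(T):
    N_t = sparse_gaussian((n, r), scale=0.05)                  # sparse innovation noise
    M_t = sparse_gaussian((n, n), density=0.05, scale=0.05)    # sparse observation noise
    U.append(U[-1] @ W0 + N_t)       # U_t = U_{t-1} W_0 + N_t
    A.append(U[-1] @ V0.T + M_t)     # A_t = U_t V_0^T + M_t
```

Each $A_t$ is then a sparse perturbation of the rank-$r$ matrix $U_t V_0^\top$, as required by the assumptions of Section 1.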

### 5.2 Results

We tested the presented methods on synthetic data generated as in Section 5.1. In our experiments, the noise matrices $M_t$ and $N_t$ were built by soft-thresholding i.i.d. Gaussian noise. After choosing the parameters by 10-fold cross-validation, we compare our methods to standard baselines in link prediction (Liben-Nowell & Kleinberg, 2007). We use the area under the ROC curve as the measure of performance and report empirical results averaged over 10 runs. Nearest Neighbors (NN) relies on the number of common friends between each pair of nodes, which is given by the entries of $A^2$ when $A$ is the cumulative graph adjacency matrix, and we denote by Shrink the low-rank approximation of $A$. Since $V_0$ is unknown, we consider the feature map $\omega(A) = AV$, where $A = U\Sigma V^\top$ is the SVD of the cumulative adjacency matrix.
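The two baselines are simple to implement. A sketch (function names are ours), where the NN score of a pair $(i, j)$ is the $(i, j)$ entry of the squared cumulative adjacency matrix:

```python
import numpy as np

def common_neighbors_scores(A_cum):
    """NN baseline: score pair (i, j) by their number of common neighbors,
    i.e. the (i, j) entry of the squared (binarized) adjacency matrix."""
    A = (A_cum > 0).astype(float)
    return A @ A

def shrink_scores(A_cum, rank):
    """Shrink baseline: scores from a rank-`rank` SVD approximation."""
    U, s, Vt = np.linalg.svd(A_cum, full_matrices=False)
    return U[:, :rank] @ np.diag(s[:rank]) @ Vt[:rank, :]

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(common_neighbors_scores(A)[0, 1])  # nodes 0 and 1 share neighbor 2 -> 1.0
```

Ranking candidate pairs by either score matrix and sweeping a threshold yields the ROC curves used for the AUC comparison.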

## 6 Discussion

The experiments suggest the empirical superiority of the proposed approaches over the standard baselines. It is very intriguing that the non-convex matrix factorization outperforms its convex rival. A possible explanation is that minimizing the nuclear norm via the shrinkage operator produces factorizations of the solution into two orthogonal matrices, which conflicts with the sparsity of the solution. The other benefit of the non-convex formulation is its scalability, as the proximal method proposed for the convex formulation requires a singular value decomposition at each iteration, scaling as $O(n^2)$ in storage and $O(n^3)$ in time. Several questions open perspectives for further investigation.

1. Choice of the feature map $\omega$. In the current work we used the projection onto the vector space spanned by the top-$r$ singular vectors of the cumulative adjacency matrix as the linear map $\omega$, and this choice has shown empirical superiority over other choices. The question of choosing the best measurements to summarize graph information, as in compressed sensing, seems to have both theoretical and practical potential.

2. Characterization of sparse and low-rank matrices. Can all sparse and low-rank matrices be written as $UV^\top$, where $U$ and $V$ are both sparse? In other terms, what is the relation between the solutions of problems penalized by $\|U\|_1 + \|V\|_1$, such as $J$, and those, e.g. $\mathcal{L}$, penalized by $\tau\|S\|_* + \gamma\|S\|_1$?
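One direction of this question is easy to check numerically: the product of two sparse factors is both low-rank and, typically, sparse. A small sketch (dimensions and densities are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 50, 3

def sparsify(M, keep=0.1):
    """Zero out all but a random fraction `keep` of the entries."""
    mask = rng.random(M.shape) < keep
    return M * mask

U = sparsify(rng.normal(size=(n, r)))
V = sparsify(rng.normal(size=(n, r)))
S = U @ V.T

print(np.linalg.matrix_rank(S))  # at most r = 3
print(np.mean(S != 0))           # small fraction of nonzero entries
```

The converse, whether every sparse and low-rank matrix admits such sparse factors, is the open part of the question.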


## Appendix A Preliminary Tools

###### Definition 1 (Orthogonal projections associated with S).

Let $S$ be a rank-$r$ matrix. We can write the SVD of $S$ in two ways: $S = U\Sigma V^\top = \sum_{j=1}^r \sigma_j u_j v_j^\top$, where $U$ and $V$ have orthonormal columns and $\Sigma = \mathrm{diag}(\sigma_1, \cdots, \sigma_r)$. Let $U_\perp$ and $V_\perp$ be matrices of size $n \times (n - r)$ orthonormally completing the bases $U$ and $V$, and define the projections onto the vector spaces spanned by the columns of $U, U_\perp$ and $V, V_\perp$:

$$P_U = UU^\top, \qquad P_{U^\perp} = U_\perp U_\perp^\top,$$
$$P_V = VV^\top, \qquad P_{V^\perp} = V_\perp V_\perp^\top,$$

and define the orthogonal projection

$$\mathcal{P}_S(B) = B - P_{U^\perp} B P_{V^\perp}.$$

We highlight the fact that $\mathcal{P}_S(B)$ can also be written as

$$\mathcal{P}_S(B) = P_U B P_V + P_U B P_{V^\perp} + P_{U^\perp} B P_V$$

or

$$\mathcal{P}_S(B) = P_U B + P_{U^\perp} B P_V.$$
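The equivalence of the three expressions follows from $I = P_U + P_{U^\perp} = P_V + P_{V^\perp}$, and is easy to verify numerically (a quick sanity check, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 8, 3

# Build a rank-r matrix S and the projectors of Definition 1.
S = rng.normal(size=(n, r)) @ rng.normal(size=(r, n))
U, _, Vt = np.linalg.svd(S)
PU, PV = U[:, :r] @ U[:, :r].T, Vt[:r].T @ Vt[:r]
PUp, PVp = np.eye(n) - PU, np.eye(n) - PV

B = rng.normal(size=(n, n))
P1 = B - PUp @ B @ PVp                          # definition of P_S(B)
P2 = PU @ B @ PV + PU @ B @ PVp + PUp @ B @ PV  # first alternative form
P3 = PU @ B + PUp @ B @ PV                      # second alternative form
```

All three matrices coincide, and the result has rank at most $2r$.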

We know that $\mathcal{P}_S(S) = S$ and $\mathcal{P}_S^\perp(S) = 0$, where $\mathcal{P}_S^\perp(B) = P_{U^\perp} B P_{V^\perp}$. The two following inequalities will also be useful: for any matrix $B$,

$$\mathrm{rank}\big(\mathcal{P}_S(B)\big) \le 2r \quad \text{and} \quad \|\mathcal{P}_S(B)\|_* \le \sqrt{2r}\,\|\mathcal{P}_S(B)\|_F.$$

###### Lemma 2 (Orthogonality of the decomposition).

For any matrix $B$, with the same notations, we have

$$B = P_{U^\perp} B P_{V^\perp} + P_U B P_V + P_U B P_{V^\perp} + P_{U^\perp} B P_V$$

and the four terms are pairwise orthogonal. It follows that

1. We have the identity

$$\|B\|_F^2 = \|P_{U^\perp} B P_{V^\perp}\|_F^2 + \|P_U B P_V\|_F^2 + \|P_U B P_{V^\perp}\|_F^2 + \|P_{U^\perp} B P_V\|_F^2.$$

2. It follows that $\|\mathcal{P}_S(B)\|_F \le \|B\|_F$.

## Appendix B Proof

We have, for any $(S, W) \in \mathcal{C}$, by optimality of $(\hat{S}, \hat{W})$:

$$\begin{aligned}
&\frac{1}{d}\Big(\|\Phi(\hat{S} - A_{T+1}, \hat{W} - W_0)\|_F^2 - \|\Phi(S - A_{T+1}, W - W_0)\|_F^2\Big) \\
&\quad = \frac{1}{d}\Big(\|\Phi(\hat{S}, \hat{W})\|_F^2 - \|\Phi(S, W)\|_F^2 - 2\big\langle \Phi(\hat{S} - S, \hat{W} - W),\, \Phi(A_{T+1}, W_0) \big\rangle\Big) \\
&\quad \le \frac{2}{d}\big\langle \Phi(\hat{S} - S, \hat{W} - W),\, (X_T, 0) - \Phi(A_{T+1}, W_0) \big\rangle + \tau\big(\|S\|_* - \|\hat{S}\|_*\big) + \gamma\big(\|S\|_1 - \|\hat{S}\|_1\big) + \kappa\big(\|W\|_1 - \|\hat{W}\|_1\big) \\
&\quad = \frac{2}{d}\big\langle (\hat{S} - S, \hat{W} - W),\, (M, \Xi) \big\rangle + \tau\big(\|S\|_* - \|\hat{S}\|_*\big) + \gamma\big(\|S\|_1 - \|\hat{S}\|_1\big) + \kappa\big(\|W\|_1 - \|\hat{W}\|_1\big)
\end{aligned} \tag{3}$$

Thanks to trace-norm duality and $\ell_1$-$\ell_\infty$ duality, we have $\langle M, \hat{S} - S \rangle \le \|M\|_{\mathrm{op}}\|\hat{S} - S\|_*$ and $\langle M, \hat{S} - S \rangle \le \|M\|_\infty\|\hat{S} - S\|_1$, and similarly $\langle \Xi, \hat{W} - W \rangle \le \|\Xi\|_\infty\|\hat{W} - W\|_1$, so for any $\alpha \in [0, 1]$,

$$\begin{aligned}
\frac{1}{d}\|\Phi(\hat{S} - A_{T+1}, \hat{W} - W_0)\|_F^2 \le\ & \frac{1}{d}\|\Phi(S - A_{T+1}, W - W_0)\|_F^2 \\
& + \tau\|S\|_* - \tau\|\hat{S}\|_* + 2\alpha\|\hat{S} - S\|_*\|M\|_{\mathrm{op}} \\
& + \gamma\|S\|_1 - \gamma\|\hat{S}\|_1 + 2(1 - \alpha)\|\hat{S} - S\|_1\|M\|_\infty \\
& + \kappa\|W\|_1 - \kappa\|\hat{W}\|_1 + 2\|\hat{W} - W\|_1\|\Xi\|_\infty
\end{aligned} \tag{4}$$

Now using the assumptions $\tau \ge 2\alpha\|M\|_{\mathrm{op}}$, $\gamma \ge 2(1 - \alpha)\|M\|_\infty$ and $\kappa \ge 2\|\Xi\|_\infty$, and then the triangle inequality,

$$\frac{1}{d}\|\Phi(\hat{S} - A_{T+1}, \hat{W} - W_0)\|_F^2 \le \frac{1}{d}\|\Phi(S - A_{T+1}, W - W_0)\|_F^2 + 2\tau\|S\|_* + 2\gamma\|S\|_1 + 2\kappa\|W\|_1. \qquad \square$$

For proving the other bound, we start by setting some notation. Let $r = \mathrm{rank}(S)$, $k = \|S\|_0$ and $q = \|W\|_0$. Let $U\Sigma V^\top = \sum_{j=1}^r \sigma_j u_j v_j^\top$ be the SVD of $S$, and let $\Theta_S$ and $\Theta_W$ be the entrywise sign matrices of $S$ and $W$, with $\Theta_S^\perp$ and $\Theta_W^\perp$ the indicators of the complementary supports, so that $\Theta_S \circ \Theta_S^\perp = 0$ and $\Theta_W \circ \Theta_W^\perp = 0$, where $\circ$ is the entrywise product. Let

$$Z = \tau Z_* + \gamma Z_1 = \tau\Big(\sum_{j=1}^r u_j v_j^\top + P_{U^\perp} G_* P_{V^\perp}\Big) + \gamma\big(\Theta_S + G_1 \circ \Theta_S^\perp\big)$$

denote an element of the subgradient of the convex function $S \mapsto \tau\|S\|_* + \gamma\|S\|_1$, so that $\|G_*\|_{\mathrm{op}} \le 1$ and $\|G_1\|_\infty \le 1$. There exist $G_*$ and $G_1$ such that

$$\langle Z, \hat{S} - S \rangle = \tau\Big\langle \sum_{j=1}^r u_j v_j^\top,\, \hat{S} - S \Big\rangle + \tau\|P_{U^\perp}\hat{S}P_{V^\perp}\|_* + \gamma\langle \Theta_S, \hat{S} - S \rangle + \gamma\|\Theta_S^\perp \circ \hat{S}\|_1.$$

We use the two standard monotonicity inequalities for subdifferentials of convex functions, $\langle \hat{Z} - Z, \hat{S} - S \rangle \ge 0$ for the subdifferentials at $\hat{S}$ and $S$ of $\tau\|\cdot\|_* + \gamma\|\cdot\|_1$, and the analogous inequality for the subdifferentials at $\hat{W}$ and $W$ of $\kappa\|\cdot\|_1$, whose elements we denote by $\hat{Q}$ and $Q$. Together with the optimality of $(\hat{S}, \hat{W})$, we get

$$\big\langle \partial\mathcal{L}(\hat{S}, \hat{W}),\, (\hat{S} - S, \hat{W} - W) \big\rangle - \langle \hat{Z} - Z, \hat{S} - S \rangle - \langle \hat{Q} - Q, \hat{W} - W \rangle \le 0 \tag{5}$$

Therefore we obtain

$$\Big\langle \nabla_{(S,W)}\|\Phi(\hat{S}, \hat{W}) - (X_T, 0)\|_2^2,\ (\hat{S} - S, \hat{W} - W) \Big\rangle = 2\big\langle \Phi(\hat{S} - A_{T+1}, \hat{W} - W_0),\, \Phi(\hat{S} - S, \hat{W} - W) \big\rangle - 2\big\langle (\delta, \epsilon),\, \Phi(\hat{S} - S, \hat{W} - W) \big\rangle.$$

The inequality (5) can then be written as

$$\begin{aligned}
\frac{2}{d}\big\langle \Phi(\hat{S} - A_{T+1}, \hat{W} - W_0),\, \Phi(\hat{S} - S, \hat{W} - W) \big\rangle \le\ & \frac{2}{d}\big\langle (\delta, \epsilon),\, \Phi(\hat{S} - S, \hat{W} - W) \big\rangle \\
& - \tau\Big\langle \sum_{j=1}^r u_j v_j^\top,\, \hat{S} - S \Big\rangle - \tau\|\mathcal{P}_S^\perp(\hat{S})\|_* \\
& - \gamma\langle \Theta_S, \hat{S} - S \rangle - \gamma\|\Theta_S^\perp \circ \hat{S}\|_1 \\
& - \kappa\langle \Theta_W, \hat{W} - W \rangle - \kappa\|\Theta_W^\perp \circ \hat{W}\|_1
\end{aligned} \tag{6}$$

Thanks to the Cauchy-Schwarz inequality,

$$\Big|\Big\langle \sum_{j=1}^r u_j v_j^\top,\, \hat{S} - S \Big\rangle\Big| \le \sqrt{r}\,\|P_U(\hat{S} - S)P_V\|_F,$$

and similarly

$$|\langle \Theta_S, \hat{S} - S \rangle| \le \sqrt{k}\,\|\Theta_S \circ (\hat{S} - S)\|_F \quad \text{and} \quad |\langle \Theta_W, \hat{W} - W \rangle| \le \sqrt{q}\,\|\Theta_W \circ (\hat{W} - W)\|_F,$$

so we have

$$\begin{aligned}
\frac{2}{d}\big\langle \Phi(\hat{S} - A_{T+1}, \hat{W} - W_0),\, \Phi(\hat{S} - S, \hat{W} - W) \big\rangle \le\ & \frac{2}{d}\big\langle (\delta, \epsilon),\, \Phi(\hat{S} - S, \hat{W} - W) \big\rangle \\
& + \tau\sqrt{r}\,\|P_U(\hat{S} - S)P_V\|_F - \tau\|P_{U^\perp}\hat{S}P_{V^\perp}\|_* \\
& + \gamma\sqrt{k}\,\|\Theta_S \circ (\hat{S} - S)\|_F - \gamma\|\Theta_S^\perp \circ \hat{S}\|_1 \\
& + \kappa\sqrt{q}\,\|\Theta_W \circ (\hat{W} - W)\|_F - \kappa\|\Theta_W^\perp \circ \hat{W}\|_1
\end{aligned} \tag{7}$$

We need to bound $\langle (\delta, \epsilon), \Phi(\hat{S} - S, \hat{W} - W) \rangle$. For this, note that by definition,

$$\big\langle (\delta, \epsilon),\, \Phi(\hat{S} - S, \hat{W} - W) \big\rangle = \big\langle (M, \Xi),\, (\hat{S} - S, \hat{W} - W) \big\rangle,$$

and decompose $\langle M, \hat{S} - S \rangle = \alpha\langle M, \hat{S} - S \rangle + (1 - \alpha)\langle M, \hat{S} - S \rangle$ for any $\alpha \in [0, 1]$. We get, by applying the triangle inequality, Cauchy-Schwarz, and the Hölder inequalities for the trace norm and the $\ell_1$ norm,

$$\begin{aligned}
\langle M, \hat{S} - S \rangle \le\ & \alpha\big(\|\mathcal{P}_S(M)\|_F\|\mathcal{P}_S(\hat{S} - S)\|_F + \|P_{U^\perp}MP_{V^\perp}\|_{\mathrm{op}}\|P_{U^\perp}\hat{S}P_{V^\perp}\|_*\big) \\
& + (1 - \alpha)\big(\|\Theta_S \circ M\|_F\|\Theta_S \circ (\hat{S} - S)\|_F + \|\Theta_S^\perp \circ M\|_\infty\|\Theta_S^\perp \circ \hat{S}\|_1\big),
\end{aligned}$$

and, by rank and support inequalities obtained again by Cauchy-Schwarz,

$$\begin{aligned}
\langle M, \hat{S} - S \rangle \le\ & \alpha\big(\sqrt{2r}\,\|M\|_{\mathrm{op}}\|\mathcal{P}_S(\hat{S} - S)\|_F + \|M\|_{\mathrm{op}}\|P_{U^\perp}\hat{S}P_{V^\perp}\|_*\big) \\
& + (1 - \alpha)\big(\sqrt{k}\,\|M\|_\infty\|\Theta_S \circ (\hat{S} - S)\|_F + \|M\|_\infty\|\Theta_S^\perp \circ \hat{S}\|_1\big)
\end{aligned} \tag{8}$$

Now, by using the identity

$$2\big\langle \Phi(\hat{S} - A_{T+1}, \hat{W} - W_0),\, \Phi(\hat{S} - S, \hat{W} - W) \big\rangle = \|\Phi(\hat{S} - A_{T+1}, \hat{W} - W_0)\|_2^2 + \|\Phi(\hat{S} - S, \hat{W} - W)\|_2^2 - \|\Phi(S - A_{T+1}, W - W_0)\|_2^2,$$

we can rewrite the inequality (7) as follows:

 (9)

the last inequality being due to the assumptions $\tau \ge 2\alpha\|M\|_{\mathrm{op}}$, $\gamma \ge 2(1 - \alpha)\|M\|_\infty$ and $\kappa \ge 2\|\Xi\|_\infty$.

So finally, by using these assumptions again,

$$\begin{aligned}
\frac{1}{d}\Big(\|\Phi(\hat{S} - A_{T+1}, \hat{W} - W_0)\|_2^2 + \|\Phi(\hat{S} - S, \hat{W} - W)\|_2^2\Big) \le\ & \frac{1}{d}\|\Phi(S - A_{T+1}, W - W_0)\|_2^2 \\
& + \big(\sqrt{r}\,\tau(\sqrt{2} + 1) + 2\sqrt{k}\,\gamma\big)\|\hat{S} - S\|_F + 2\sqrt{q}\,\kappa\|W - \hat{W}\|_F \\
\le\ & \frac{1}{d}\|\Phi(S - A_{T+1}, W - W_0)\|_2^2 \\
& + \frac{\mu}{\sqrt{d}}\big(\sqrt{r}\,\tau(\sqrt{2} + 1) + 2\sqrt{k}\,\gamma\big)\|\Phi(\hat{S} - S, \hat{W} - W)\|_F + \frac{2\mu\kappa\sqrt{q}}{\sqrt{dT}}\|\Phi(\hat{S} - S, \hat{W} - W)\|_F
\end{aligned} \tag{10}$$

which, using $by - y^2 \le b^2/4$ on each of the two terms, gives

$$\frac{1}{d}\|\Phi(\hat{S} - A_{T+1}, \hat{W} - W_0)\|_2^2 \le \frac{1}{d}\|\Phi(S - A_{T+1}, W - W_0)\|_2^2 + \frac{\mu^2}{4d}\big(\tau\sqrt{r}(\sqrt{2} + 1) + 2\sqrt{k}\,\gamma\big)^2 + \frac{\mu^2\kappa^2 q}{dT}$$

and the result follows by using the assumption

$$\frac{1}{d}\|\Phi(S_1 - S_2, W_1 - W_2)\|_2^2 \ge \mu^{-2}\|S_1 - S_2\|_F^2 + \mu^{-2}T\|W_1 - W_2\|_F^2$$

and setting $S = A_{T+1}$, $W = W_0$:

$$\|\hat{S} - A_{T+1}\|_F^2 + T\|\hat{W} - W_0\|_F^2 \le \frac{\mu^4}{4d}\big(\sqrt{r}\,\tau(\sqrt{2} + 1) + 2\sqrt{k}\,\gamma\big)^2 + \frac{\mu^4\kappa^2 q}{dT}. \qquad \square$$

## References

• Koltchinskii et al. (2011) Koltchinskii, V., Lounici, K., and Tsybakov, A. Nuclear norm penalization and optimal rates for noisy matrix completion. Annals of Statistics, 2011.
• Koren (2008) Koren, Y. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 426–434. ACM, 2008.
• Koren (2010) Koren, Y. Collaborative filtering with temporal dynamics. Communications of the ACM, 53(4):89–97, 2010.
• Liben-Nowell & Kleinberg (2007) Liben-Nowell, D. and Kleinberg, J. The link-prediction problem for social networks. Journal of the American society for information science and technology, 58(7):1019–1031, 2007.
• Raguet et al. (2011) Raguet, H., Fadili, J., and Peyré, G. Generalized forward-backward splitting. Arxiv preprint arXiv:1108.4404, 2011.
• Richard et al. (2010) Richard, E., Baskiotis, N., Evgeniou, Th., and Vayatis, N. Link discovery using graph feature tracking. Proceedings of Neural Information Processing Systems (NIPS), 2010.
• Richard et al. (2012) Richard, E., Savalle, P.-A., and Vayatis, N. Estimation of simultaneously sparse and low-rank matrices. In Proceedings of the 29th Annual International Conference on Machine Learning, 2012.
• Sarkar et al. (2010) Sarkar, P., Chakrabarti, D., and Moore, A.W. Theoretical justification of popular link prediction heuristics. In International Conference on Learning Theory (COLT), pp. 295–307, 2010.
• Taskar et al. (2003) Taskar, B., Wong, M.F., Abbeel, P., and Koller, D. Link prediction in relational data. In Neural Information Processing Systems, volume 15, 2003.