Forecasting systems behavior with multiple responses has been a challenging issue in many contexts of applications such as collaborative filtering, financial markets, or bioinformatics, where responses can be, respectively, movie ratings, stock prices, or activity of genes within a cell. Statistical modeling techniques have been widely investigated in the context of multivariate time series either in the multiple linear regression setup
or with autoregressive models. More recently, kernel-based regularized methods have been developed for multitask learning [7, 2]. These approaches share the use of the correlation structure among input variables to enrich the prediction on every single output. Often, the correlation structure is assumed to be given or it is estimated separately. A discrete encoding of correlations between variables can be modeled as a graph so that learning the dependence structure amounts to performing graph inference through the discovery of uncovered edges on the graph. The latter problem is interesting per se and it is known as the problem of link prediction where it is assumed that only a part of the graph is actually observed [16, 9]. This situation occurs in various applications such as recommender systems, social networks, or proteomics, and the appropriate tools can be found among matrix completion techniques [22, 5, 1]. In the realistic setup of a time-evolving graph, matrix completion was also used and adapted to take into account the dynamics of the features of the graph . In this paper, we study the prediction problem where the observation is a sequence of graphs adjacency matrices and the goal is to predict
. This type of problem arises in applications such as recommender systems where, given information on purchases made by some users, one would like to predict future purchases. In this context, users and products can be modeled as the nodes of a bipartite graph, while purchases or clicks are modeled as edges. In functional genomics and systems biology, estimating regulatory networks in gene expression can be performed by modeling the data as graphs and fitting predictive models is a natural way for estimating evolving networks in these contexts. A large variety of methods for link prediction only consider predicting from a single static snapshot of the graph - this includes heuristics[16, 21], matrix factorization , diffusion , or probabilistic methods 
. More recently, some works have investigated using sequences of observations of the graph to improve the prediction, such as using regression on features extracted from the graphs, using matrix factorization , continuous-time regression . Our main assumption is that the network effect is a cause and a symptom at the same time, and therefore, the edges and the graph features should be estimated simultaneously. We propose a regularized approach to predict the uncovered links and the evolution of the graph features simultaneously. We provide oracle bounds under the assumption that the noise sequence has subgaussian tails and we prove that our procedure achieves a trade-off in the calibration of smoothing parameters which adjust with the sparsity and the rank of the unknown adjacency matrix. The rest of this paper is organized as follows. In Section 2, we describe the general setup of our work with the main assumptions and we formulate a regularized optimization problem which aims at jointly estimating the autoregression parameters and predicting the graph. In Section 3, we provide technical results with oracle inequalities and other theoretical guarantees on the joint estimation-prediction. Section 4 is devoted to the description of the numerical simulations which illustrate our approach. We also provide an efficient algorithm for solving the optimization problem and show empirical results. The proof of the theoretical results are provided as supplementary material in a separate document.
2 Estimation of low-rank graphs with autoregressive features
Our approach is based on the asumption that features can explain most of the information contained in the graph, and that these features are evolving with time. We make the following assumptions about the sequence of adjacency matrices of the graphs sequence.
We assume that the matrices have low-rank. This reflects the presence of highly connected groups of nodes such as communities in social networks, or product categories and groups of loyal/fanatic users in a market place data, and is sometimes motivated by the small number of factors that explain nodes interactions.
Autoregressive linear features.
We assume to be given a linear map defined by
where is a set of matrices. These matrices can be either deterministic or random in our theoretical analysis, but we take them deterministic for the sake of simplicity. The vector time series has autoregressive dynamics, given by a VAR (Vector Auto-Regressive) model:
where is a unknown sparse matrix and is a sequence of noise vectors in . An example of linear features is the degree (i.e. number of edges connected to each node, or the sum of their weights if the edges are weighted), which is a measure of popularity in social and commerce networks. Introducing
which are both matrices, we can write this model in a matrix form:
This assumes that the noise is driven by time-series dynamics (a martingale increment), where each coordinates are independent (meaning that features are independently corrupted by noise), with a sub-gaussian tail and variance uniformly bounded by a constant. In particular, no independence assumption between the is required here.
The notations , , , and stand, respectively, for the Frobenius norm, entry-wise norm, entry-wise
norm, trace-norm (or nuclear norm, given by the sum of the singular values) and operator norm (the largest singular value). We denote bythe Euclidean matrix product. A vector in is always understood as a matrix. We denote by the number of non-zero elements of . The product between two matrices with matching dimensions stands for the Hadamard or entry-wise product between and . The matrix contains the absolute values of entries of . The matrix is the componentwise positive part of the matrix M, and is the sign matrix associated to with the convention
If is a matrix with rank , we write its SVD as where is a diagonal matrix containing the non-zero singular values of in decreasing order, and , are matrices with columns given by the left and right singular vectors of . The projection matrix onto the space spanned by the columns (resp. rows) of is given by (resp. ). The operator given by is the projector onto the linear space spanned by the matrices and for and . The projector onto the orthogonal space is given by . We also use the notation .
2.1 Joint prediction-estimation through penalized optimization
In order to reflect the autoregressive dynamics of the features, we use a least-squares goodness-of-fit criterion that encourages the similarity between two feature vectors at successive time steps. In order to induce sparsity in the estimator of , we penalize this criterion using the norm. This leads to the following penalized objective function:
where is a smoothing parameter.
Now, for the prediction of , we propose to minimize a least-squares criterion penalized by the combination of an norm and a trace-norm. This mixture of norms induces sparsity and a low-rank of the adjacency matrix. Such a combination of and trace-norm was already studied in  for the matrix regression model, and in  for the prediction of an adjacency matrix.
The objective function defined below exploits the fact that if is close to , then the features of the next graph should be close to . Therefore, we consider
where are smoothing parameters. The overall objective function is the sum of the two partial objectives and , which is jointly convex with respect to and :
If we choose convex cones and , our joint estimation-prediction procedure is defined by
It is natural to take and since there is no a priori on the values of the feature matrix , while the entries of the matrix must be positive.
In the next section we propose oracle inequalities which prove that this procedure can estimate and predict at the same time.
2.2 Main result
The central contribution of our work is to bound the prediction error with high probability under the following natural hypothesis on the noise process.
We assume that satisfies for any and that there is such that for any and and :
Moreover, we assume that for each , the coordinates are independent.
The main result can be summarized as follows. The prediction error and the estimation error can be simultaneously bounded by the sum of three terms that involve homogeneously (a) the sparsity, (b) the rank of the adjacency matrix , and (c) the sparsity of the VAR model matrix . The tight bounds we obtain are similar to the bounds of the Lasso and are upper bounded by:
The positive constants are proportional to the noise level . The interplay between the rank and sparsity constraints on are reflected in the observation that the values of and can be changed as long as their sum remains constant.
3 Oracle inequalities
In this section we give oracle inequalities for the mixed prediction-estimation error which is given, for any and , by
It is important to have in mind that an upper-bound on implies upper-bounds on each of its two components. It entails in particular an upper-bound on the feature estimation error that makes smaller and consequently controls the prediction error over the graph edges through .
The upper bounds on given below exhibit the dependence of the accuracy of estimation and prediction on the number of features , the number of edges and the number of observed graphs in the sequence.
Let us recall and introduce the noise processes
which are, respectively, and random matrices. The source of randomness comes from the noise sequence , see Assumption 1. If these noise processes are controlled correctly, we can prove the following oracle inequalities for procedure (4). The next result is an oracle inequality of slow type (see for instance ), that holds in full generality.
Let be given by (4) and suppose that
for some . Then, we have
is of importance. Restricted eigenvalue conditions are implied by, and in general weaker than, the so-calledincoherence or RIP (Restricted isometry property, ) assumptions, which excludes, for instance, strong correlations between covariates in a linear regression model. This condition is acknowledged to be one of the weakest to derive fast rates for the Lasso (see  for a comparison of conditions).
Matrix version of these assumptions are introduced in . Below is a version of the RE assumption that fits in our context. First, we need to introduce the two restriction cones.
The first cone is related to the term used in procedure (4). If , we denote by the signed sparsity pattern of and by the orthogonal sparsity pattern. For a fixed matrix and , we introduce the cone
This cone contains the matrices that have their largest entries in the sparsity pattern of .
The second cone is related to mixture of the terms and in procedure (4). Before defining it, we need further notations and definitions.
For a fixed and , we introduce the cone
This cone consist of the matrices with large entries close to that of and that are “almost aligned” with the row and column spaces of . The parameter quantifies the interplay between these too notions.
Definition 1 (Restricted Eigenvalue (RE)).
For and , we introduce
For and , we introduce
The RE assumption consists of assuming that the constants and are non-zero. Now we can state the following Theorem that gives a fast oracle inequality for our procedure using RE.
Note that the residual term from this oracle inequality mixes the notions of sparsity of and via the terms , and . It says that our mixed penalization procedure provides an optimal trade-off between fitting the data and complexity, measured by both sparsity and low-rank. This is the first result of this nature to be found in literature.
which are the (observable) variance terms that naturally appear in the controls of the noise processes. We introduce also
which is a small (observable) technical term that comes out of our analysis of the noise process . This term is a small price to pay for the fact that no independence assumption is required on the noise sequence , but only a martingale increment structure with sub-gaussian tails.
The proof of Theorem 3 follows directly from Theorem 2 basic noise control results. In the next Theorem, we propose more explicit upper bounds for both the indivivual estimation of and the prediction of .
Under the same assumptions as in Theorem 3, for any the following inequalities hold with a probability larger than :
4 Algorithms and Numerical Experiments
4.1 Generalized forward-backward algorithm for minimizing
We use the algorithm designed in  for minimizing our objective function. Note that this algorithm is preferable to the method introduced in  as it directly minimizes jointly in rather than alternately minimizing in and .
Moreover we use the novel joint penalty from  that is more suited for estimating graphs. The proximal operator for the trace norm is given by the shrinkage operation, if
is the singular value decomposition of,
Similarly, the proximal operator for the -norm is the soft thresholding operator defined by using the entry-wise product of matrices denoted by :
The algorithm converges under very mild conditions when the step size is smaller than , where is the operator norm of the joint quadratic loss:
4.2 A generative model for graphs having linearly autoregressive features
Let be a sparse matrix, its pseudo-inverse such, that . Fix two sparse matrices and . Now define the sequence of matrices for by
for i.i.d sparse noise matrices and , which means that for any pair of indices , with high probability and . We define the linear feature map , and point out that
The sequence follows the linear autoregressive relation
For any time index , the matrix is close to that has rank at most
The matrices and are both sparse by construction.
4.3 Empirical evaluation
We tested the presented methods on synthetic data generated as in section (4.2). In our experiments the noise matrices and where built by soft-thresholding i.i.d. noise . We took as input successive graph snapshots on nodes graphs of rank . We used linear features, and finally the noise level was set to
. We compare our methods to standard baselines in link prediction. We use the area under the ROC curve as the measure of performance and report empirical results averaged over 50 runs with the corresponding confidence intervals in figure1. The competitor methods are the nearest neighbors (NN) and static sparse and low-rank estimation, that is the link prediction algorithm suggested in . The algorithm NN scores pairs of nodes with the number of common friends between them, which is given by when is the cumulative graph adjacency matrix and the static sparse and low-rank estimation is obtained by minimizing the objective , and can be seen as the closest static version of our method. The two methods autoregressive low-rank and static low-rank are regularized using only the trace-norm, (corresponding to forcing ) and are slightly inferior to their sparse and low-rank rivals. Since the matrix defining the linear map is unknown we consider the feature map where is the SVD of . The parameters and are chosen by 10-fold cross validation for each of the methods separately.
Left: performance of algorithms in terms of Area Under the ROC Curve, average and confidence intervals over 50 runs. Right: Phase transition diagram.
Comparison with the baselines. This experiment sharply shows the benefit of using a temporal approach when one can handle the feature extraction task. The left-hand plot shows that if few snapshots are available ( in these experiments), then static approaches are to be preferred, whereas feature autoregressive approaches outperform as soon as sufficient number graph snapshots are available (see phase transition). The decreasing performance of static algorithms can be explained by the fact that they use as input a mixture of graphs observed at different time steps. Knowing that at each time step the nodes have specific latent factors, despite the slow evolution of the factors, adding the resulting graphs leads to confuse the factors.
Phase transition. The right-hand figure is a phase transition diagram showing in which part of rank and time domain the estimation is accurate and illustrates the interplay between these two domain parameters.
Choice of the feature map . In the current work we used the projection onto the vector space of the top- singular vectors of the cumulative adjacency matrix as the linear map , and this choice has shown empirical superiority to other choices. The question of choosing the best measurement to summarize graph information as in compress sensing seems to have both theoretical and application potential. Moreover, a deeper understanding of the connections of our problem with compressed sensing, for the construction and theoretical validation of the features mapping, is an important point that needs several developments. One possible approach is based on multi-kernel learning, that should be considered in a future work.
Generalization of the method. In this paper we consider only an autoregressive process of order 1. For better prediction accuracy, one could consider mode general models, such as vector ARMA models, and use model-selection techniques for the choice of the orders of the model. A general modelling based on state-space model could be developed as well. We presented a procedure for predicting graphs having linear autoregressive features. Our approach can easily be generalized to non-linear prediction through kernel-based methods.
[Appendix : Proof of propositions]
Appendix A Proofs of the main results
From now on, we use the notation and for any and .
Let us introduce the linear mapping given by
Using this mapping, the objective (3) can be written in the following reduced way:
Recalling that the error writes, for any and :
Let us introduce also the empirical risk
The proofs of Theorem 1 and 2 are based on tools developped in  and . However, the context considered here is very different from the setting considered in these papers, so our proofs require a different scheme.
a.1 Proof of Theorem 1
First, note that
The next Lemma will come in handy several times in the proofs.
For any and we have
a.2 Proof of Theorem 2
Let and be fixed, and let be the SVD of . Recalling that is the entry-wise product, we have , where is the entry-wise sign matrix of and is the orthogonal sparsity pattern of .
The definition (4) of is equivalent to the fact that one can find (an element of the subgradient of at ) that belongs to the normal cone of at . This means that for such a , and any and , we have
Any subgradient of the function writes
for some and (see for instance ). So, if , we have, by monotonicity of the sub-differential, that for any
and, by duality, we can find such that
By using the same argument with the function and by computing the gradient of the empirical risk , Equation (11) entails that
Using Pythagora’s theorem, we have