I Introduction
Network-aware signal and information processing is having a major impact in technology and the biobehavioral sciences; see e.g., [19, Ch. 1]. In this context, graph signal processing (GSP) builds on a graph-theoretic substrate to effectively model signals with complex relational structures [26, 37, 32]. However, the required connectivity information is oftentimes not explicitly available. This motivates the prerequisite step of using signals (e.g., brain activity traces, distributed sensor measurements) to unveil latent network structure, or, to construct discriminative graph representations to facilitate downstream learning tasks. As graph data grow in size and complexity, there is an increasing need to develop customized, fast and computationally-efficient graph learning algorithms.
Given nodal measurements (known as graph signals in the GSP parlance), the network topology inference problem is to search for a graph within a model class that is optimal in some application-specific sense, e.g., [19, Ch. 7]. The adopted criterion is naturally tied to the signal model relating the observations to the sought network, which can include constraints motivated by physical laws, statistical priors, or explainability goals. Workhorse probabilistic graphical models include Gaussian Markov random fields, and topology identification arises with so-termed high-dimensional graphical model selection [8, 39, 12, 21, 11, 29, 20]. Other recent approaches embrace a signal representation perspective to reveal parsimonious data signatures with respect to the underlying graph. These include stationarity induced via linear network diffusion [34, 28, 36] and smoothness (i.e., bandlimitedness) [18, 9, 16, 17, 5, 30, 33, 4, 22]. The interested reader is referred to [24, 10, 13] for comprehensive tutorial treatments of network topology inference advances.
In this short letter, we develop a fast and scalable algorithm to estimate graphs subject to a smoothness prior (Section II outlines the required background and formally states the problem). Adopting the well-appreciated graph learning framework of [18, 17], in Section III we bring to bear the fast proximal-gradient (PG) iterations in [2] to solve the resulting strongly convex, signal smoothness-regularized optimization problem in the dual domain. There are noteworthy recent scalable solvers for this problem that rely on the primal-dual (PD) method [18], PG [31], or the linearized alternating-direction method of multipliers (ADMM) [38]. Unlike these algorithms, the novel iterations come with global convergence rate guarantees and do not require additional step-size tuning. Borrowing results from [2], we show that a (possibly infeasible) primal sequence generated from the accelerated graph learning algorithm converges to a globally optimal solution at a rate of O(1/k). To the best of our knowledge, this is the first work that establishes the convergence rate of topology inference algorithms subject to smoothness priors. Computer simulations in Section IV showcase the favorable convergence properties of the proposed approach when recovering a wide variety of graphs. In the interest of reproducible research, the code used to generate all figures in this letter is publicly available. Conclusions are in Section V. Due to page constraints, proofs are deferred to the accompanying Supplementary Material.

II Graph Learning from Smooth Signals
Let be an undirected graph, where are the nodes (or vertices) with , are the edges, and is the symmetric adjacency matrix collecting the edge weights. For we have . We exclude the possibility of self-loops, so is hollow, meaning , . We acquire graph signal observations , where denotes the signal value at vertex . More general graphs capturing directionality are important [23], but beyond the scope of this letter.
II-A Graph signal smoothness
For undirected graphs one typically adopts the Laplacian as descriptor of graph structure, where collects the vertex degrees. As the central object in spectral graph theory, is instrumental in formalizing the notion of smooth (i.e., low-pass bandlimited) signals on graphs [40, 26]. Specifically, the total variation (TV) of the graph signal with respect to is given by the quadratic form
(1) 
We can interpret as a smoothness measure for graph signals, which gauges the extent to which varies across local neighborhoods in . Accordingly, we say a signal is smooth if it has a small total variation. For reference, , where is the spectral radius of . The lower bound is attained by constant signals. The ubiquity of smooth network data has been well-documented, with examples spanning sensor measurements [6], protein function annotations [19], and product ratings [15]. These empirical findings motivate adopting smoothness as the criterion to search for graphs on which measurements exhibit desirable parsimony or regularity.
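As a concrete illustration of the quadratic-form smoothness measure, the following sketch builds the Laplacian of a small unweighted path graph and evaluates the TV of a constant and an oscillatory signal (all names and the toy graph are illustrative, not from the letter):

```python
import numpy as np

def laplacian(W):
    """Combinatorial Laplacian L = D - W of a weighted adjacency matrix."""
    return np.diag(W.sum(axis=1)) - W

def total_variation(x, L):
    """Quadratic form x^T L x; equals the sum of W_ij (x_i - x_j)^2 over edges."""
    return float(x @ L @ x)

# Path graph on 4 nodes with unit edge weights.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = laplacian(W)

x_const = np.ones(4)                     # constant signal: attains the TV lower bound 0
x_rough = np.array([1., -1., 1., -1.])   # oscillatory signal: large TV
```

The constant signal attains the lower bound of zero, while the sign-alternating signal accumulates a penalty of 4 on each of the three edges, in line with the interpretation of TV as a local-variation gauge.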
II-B Problem statement
We study the following graph learning problem.
Problem 1
Given a set of graph signal observations, the goal is to learn an undirected graph such that the observations in are smooth on .
We now briefly review the method proposed in [18, 17] to tackle Problem 1, on which we build to develop a fast graph learning algorithm.
Consider the matrix , whose columns are the observations in . The rows, denoted by , collect all measurements at vertex . Define then the nodal Euclidean-distance matrix , where , . Using these notions, the signal smoothness measure over can be equivalently written as
(2) 
where denotes elementwise product [18]. Smoothness minimization as criterion in Problem 1 has the following intuitive interpretation: when pairwise nodal distances in are sampled from a smooth manifold, the learnt topology tends to be sparse, preferentially choosing edges whose corresponding are smaller [cf. the weighted norm in (2)].
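The equivalence between the Laplacian quadratic form and the distance-weighted sum in (2) can be checked numerically. A minimal sketch with randomly generated data (sizes and names are illustrative):

```python
import numpy as np

# Verify tr(X^T L X) = 0.5 * sum(W * Z), where Z collects squared Euclidean
# distances between the rows of X (the per-node measurement vectors).
rng = np.random.default_rng(0)
n, p = 6, 4                                   # n nodes, p signals (toy sizes)
W = rng.random((n, n))
W = np.triu(W, 1)
W = W + W.T                                   # random symmetric, hollow weights
L = np.diag(W.sum(axis=1)) - W                # combinatorial Laplacian
X = rng.standard_normal((n, p))

Z = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)   # pairwise sq. distances
lhs = np.trace(X.T @ L @ X)                   # aggregate smoothness over all signals
rhs = 0.5 * np.sum(W * Z)                     # elementwise-product form, cf. (2)
```

The two quantities agree to machine precision, which is the crux of recasting smoothness minimization as a sparsity-promoting weighted penalty on the adjacency matrix.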
Leveraging this neat link between signal smoothness and edge sparsity, a fairly general graph-learning framework was put forth in [18]. The idea therein is to solve the following convex inverse problem
(3)  
s. to 
where are tunable regularization parameters. Different from [9], the logarithmic barrier on the vertex degrees excludes the possibility of having (often undesirable) isolated vertices in the estimated graph. Through , the Frobenius-norm penalty offers a handle on the graphs’ edge sparsity level. Among the parameterized family of solutions to (3), the sparsest graph is obtained when .
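For concreteness, the composite objective in (3) can be sketched as below. The names alpha and beta stand in for the tunable regularization parameters, and scaling constants may differ from the letter's exact formulation:

```python
import numpy as np

# Hedged sketch of the objective in (3): a smoothness term weighted by the
# pairwise-distance matrix Z, a logarithmic barrier on the nodal degrees,
# and a Frobenius-norm penalty on the adjacency matrix W.
def graph_learning_objective(W, Z, alpha, beta):
    degrees = W.sum(axis=1)
    smoothness = 0.5 * np.sum(W * Z)            # cf. the weighted penalty in (2)
    barrier = -alpha * np.sum(np.log(degrees))  # precludes isolated vertices
    penalty = beta * np.sum(W ** 2)             # handle on edge sparsity level
    return smoothness + barrier + penalty

# Toy two-node example: a single edge of weight 1 and squared distance 4.
W = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.array([[0.0, 4.0], [4.0, 0.0]])
val = graph_learning_objective(W, Z, alpha=1.0, beta=1.0)
```

Note how a vanishing degree drives the barrier to infinity, which is exactly the mechanism that rules out isolated vertices in the estimated graph.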
Arguably, the most important upshot of identity (2) is computational. It facilitates formulating (3) as a search over adjacency matrices, and the resulting constraints (null diagonal, symmetry and nonnegativity) are separable across the variables . This does not hold for the Laplacian . Exploiting this favorable structure of (3), efficient solvers were developed based on PD iterations [18], the PG method [31], or the ADMM [38]. However, none of these graph learning methods come with convergence rate guarantees because the objective function of (3) lacks a Lipschitz continuous gradient. To close this gap, next we develop a markedly faster first-order algorithm using an accelerated dual-based PG method [2].
III Fast Dual Proximal Gradient Algorithm
Because is hollow and symmetric, the optimization variables in (3) are effectively the, say, upper-triangular elements , . Thus, it suffices to retain only those entries in the vector , where we have adopted convenient Matlab notation. To impose that edge weights are nonnegative, we penalize the cost with the indicator function if , else [18]. This way, we equivalently reformulate (3) as the unconstrained, non-differentiable problem
(4)
where and maps edge weights to nodal degrees, i.e., . The nonsmooth function is strongly convex with strong convexity parameter (details are in the Supplementary Material), while is a (strictly) convex function for all . Under the aforementioned properties of and , the composite problem (4) has a unique optimal solution ; see e.g., [2] and [38].
A fast dual-based PG algorithm was developed in [2] to solve the nonsmooth, strictly convex optimization problem of which (4) is a particular instance. In the remainder of this section we bring to bear this optimization framework to develop a novel graph learning algorithm with global convergence rate guarantees.
III-A The dual problem
The structure of (4) lends itself naturally to variable splitting via the equivalent linearly-constrained form
(5) 
Attaching Lagrange multipliers to the equality constraints and minimizing the Lagrangian function w.r.t. the primal variables , one arrives at the (minimization form) dual problem [2]
(6) 
where
(7)  
(8) 
Interestingly, the strong convexity of induces useful smoothness properties for (namely, the composition of with the Fenchel conjugate of ), that we summarize next. The result is adapted from [2, Lemma 3.1] and the additional proof arguments can be found in the Supplementary Material.
Lemma 1
Function in (7) is smooth, and the gradient is Lipschitz continuous with constant .
III-B Accelerated dual proximal gradient algorithm
The FISTA algorithm applied to the dual problem (6) yields the following iterations (initialized as and , henceforth denotes the iteration index)
(9)  
(10)  
(11) 
where the proximal operator of a proper, lower semicontinuous convex function is (see e.g., [27])
(12) 
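The accelerated dual iterations (9)-(11) follow the classic FISTA pattern: a prox-gradient step at an extrapolated point, combined with a momentum update of the extrapolation weight. A generic sketch of this pattern is given below; the function names and the toy problem are illustrative, not the letter's actual dual updates:

```python
import numpy as np

# Generic FISTA-style accelerated proximal gradient for min f(lam) + g(lam),
# given grad(f), prox(g), and a Lipschitz constant L_lip for grad(f).
def fista(grad, prox, lam0, L_lip, n_iter):
    lam = lam_prev = np.asarray(lam0, dtype=float)
    t = 1.0
    for _ in range(n_iter):
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = lam + ((t - 1.0) / t_next) * (lam - lam_prev)  # momentum extrapolation
        lam_prev = lam
        lam = prox(y - grad(y) / L_lip)                    # prox-gradient step
        t = t_next
    return lam

# Toy usage: minimize 0.5||lam - b||^2 subject to lam >= 0; the solution is
# the elementwise projection max(b, 0). Here grad(y) = y - b and L_lip = 1,
# and the prox of the nonnegativity indicator is max(., 0).
b = np.array([1.0, -2.0, 3.0])
sol = fista(lambda y: y - b, lambda v: np.maximum(v, 0.0), np.zeros(3), 1.0, 50)
```

The elementwise projection used here is also the proximal operator appearing in the letter's primal recovery: for the indicator function of the nonnegative orthant, (12) reduces to max(., 0) regardless of the step size.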
An adaptation of the result in [2, Lemma 3.2] – stated as Proposition 1 below – yields the novel graph learning iterations tabulated under Algorithm 1. Again, due to page constraints the proof details are deferred to the Supplementary Material.
Proposition 1
The updates in Proposition 1 are fully expressible in terms of parameters from the original graph learning problem, namely , , , and the data in . This is to be contrasted with (9), which necessitates the conjugate functions and .
Algorithm 1’s overall computational complexity is dominated by the update (13), which incurs a per-iteration cost of . The remaining updates are also given in closed form, through simple operations on vectors living in the dual domain of nodal degrees [cf. the primal variables ]. The overall complexity of is on par with state-of-the-art PD and linearized ADMM algorithms [38], which have been shown to scale well to large networks with in the order of thousands. The computational cost can be further reduced by constraining a priori the space of possible edges; see [17] for examples where this approach is warranted. For a given problem instance, there are no step-size parameters to tune here (on top of and ) since we can explicitly compute the Lipschitz constant in Lemma 1. On the other hand, the linearized ADMM algorithm in [38] necessitates tuning two step-sizes and the penalty parameter defining the augmented Lagrangian.
The distinctive feature of the proposed accelerated dual PG algorithm is that it comes with global convergence rate guarantees. These results are outlined in the ensuing section.
III-C Convergence rate analysis
Moving on to convergence properties, as the number of iterations grows the iterates generated by Algorithm 1 provably approach a dual optimal solution of (6); see e.g., [1]. The celebrated FISTA rate of convergence for the dual cost function is stated next.
This well-documented global convergence rate of accelerated PG algorithms implies an iteration complexity to return an optimal dual solution measured in terms of values.
We now consider a primal sequence generated from the iterates of Algorithm 1, and borrow the results from [2] to show the sequence is globally convergent to at a rate of O(1/k). To this end, suppose that for all we are given dual updates generated from the accelerated dual PG algorithm. We can construct a primal sequence as , namely [cf. (7)]
(16) 
As noted in [38], this primal sequence may be infeasible in the sense that the resulting nodal degrees are not guaranteed to lie in the domain of . The promised rate of convergence result for is stated next.
IV Numerical Results
Here we test the proposed fast dual PG (FDPG) algorithm for learning random and real-world graphs from simulated signals. The merits of the formulation (3) in terms of recovering high-quality graphs have been well documented; see e.g., [18, 17, 24, 10] and references therein. For this reason, the numerical experiments that follow focus exclusively on algorithmic performance, with no examination of the quality of the optimal solution that defines the learnt graph. In all ensuing test cases, we search for the best regularization parameters in terms of graph recovery performance, adopting the edge-detection F-measure as criterion. We compare Algorithm 1 to other state-of-the-art methods such as PD [18], PG [31], and linearized ADMM [38]. We also consider the non-accelerated dual PG (DPG) method that is obtained from Algorithm 1 when for all . For FDPG we implemented customary fixed-interval restarts of the momentum term in Algorithm 1; see also [25] for adaptive restart rules. Moreover, the ADMM parameters and PD step-size are tuned to yield the best possible convergence rate. Implementation details can be found in the publicly available code (http://www.ece.rochester.edu/~gmateosb/code/FDPG.zip), which can be used to generate all plots in Fig. 1.
IV-A Random graphs
We generate ground-truth graphs as draws from the Erdős-Rényi (ER) model (edge formation probability ) with and nodes, as well as from a Stochastic Block Model (SBM) with the same number of nodes, and connection probability for nodes in the same community and for nodes in different blocks. We simulate i.i.d. graph signals , where represents the noise level and is the Laplacian of the ground-truth random graph. For a graph-based factor analysis model justifying this approach to smooth signal generation, see e.g., [9]. We compare the convergence performance of the aforementioned methods through the evolution of the primal variable error . To obtain for the chosen and , we ran the PD method for iterations. The results of these comparisons are illustrated in Fig. 1 (a). The proposed FDPG algorithm markedly outperforms all other methods in terms of convergence rate, uniformly across graph model classes and number of nodes. Interestingly, convergence for the largest graphs requires fewer iterations than for the smaller instances.

IV-B Brain and road networks
We first focus on recovering the topology of unweighted structural brain graphs [14], all with regions of interest (ROIs) and whose edges connect ROIs with nontrivial density of axonal bundles; see also [35] for additional details. For a larger-scale experiment, we adopt the Minnesota road network, an unweighted and undirected graph with intersections [7]. In both cases, we generated synthetic smooth signals over the real topologies using the generative model in Section IV-A. The high value of renders the ADMM’s 3D parameter search a significantly time-consuming operation. Hence, for the Minnesota road network experiment, we only focus on the proposed (F)DPG methods and the PD algorithm in [18].
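For concreteness, the smooth-signal generation used throughout the experiments can be sketched as follows, assuming the factor-analysis reading of [9] in which white noise is filtered through the pseudoinverse square root of the ground-truth Laplacian and corrupted by additive Gaussian noise of level sigma (function and variable names are illustrative, not the letter's):

```python
import numpy as np

# Hedged sketch of smooth-signal generation: columns of the output are
# i.i.d. draws with covariance pinv(L) (up to the constant null eigenvector),
# plus i.i.d. Gaussian noise of standard deviation sigma.
def smooth_signals(L, n_signals, sigma, rng):
    n = L.shape[0]
    evals, evecs = np.linalg.eigh(L)
    inv_sqrt = np.zeros(n)
    nz = evals > 1e-10                       # skip the (near-)null eigenvalues
    inv_sqrt[nz] = 1.0 / np.sqrt(evals[nz])
    H = evecs @ np.diag(inv_sqrt) @ evecs.T  # pseudoinverse square root of L
    F = rng.standard_normal((n, n_signals))  # latent white factors
    return H @ F + sigma * rng.standard_normal((n, n_signals))
```

Signals produced this way concentrate energy on the low (smooth) end of the Laplacian spectrum, so their average total variation is markedly lower than that of white noise of comparable size, which is what makes them informative for the smoothness-based recovery criterion.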
Fig. 1 (b) depicts the convergence results for the structural brain networks of representative subjects. Once more, in all cases the FDPG method is faster, but for these smaller graphs the performance gap appears to narrow. The gains can also be quantified in terms of wall-clock time. For instance, for Subject the time in seconds for the algorithms to reach a suboptimality of are: s for FDPG, s for PD, s for ADMM and s for DPG. Results for the Minnesota road network are depicted in Fig. 1 (c), where the superiority of the proposed method is also apparent.
V Conclusion
We developed a fast and scalable algorithm to learn the graph structure of signals subject to a smoothness prior. Leveraging this cardinal property of network data is central to various statistical learning tasks, such as graph smoothing and semi-supervised node classification. We brought to bear a fast dual-based PG method to derive lightweight graph-learning iterations that come with global convergence rate guarantees. The merits of the proposed algorithm are showcased via experiments using several random and real-world graphs.
References
 [1] (200903) A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Img. Sci. 2 (1), pp. 183–202. Cited by: §IIIA, §IIIC, Theorem 1.
 [2] (2014) A fast dual proximal gradient algorithm for convex minimization and applications. Operations Research Letters 42 (1), pp. 1–6. Cited by: §I, §IIB, §IIIA, §IIIB, §IIIC, §III, §III, Proof of Lemma 1, Proof of Proposition 1, Theorem 2.
 [3] (2018) First-order methods in optimization. Society for Industrial and Applied Mathematics, Philadelphia, PA. Cited by: §IIIA.
 [4] (2020) Efficient graph learning from noisy and incomplete data. IEEE Trans. Signal Inf. Process. Netw. 6 (), pp. 105–119. Cited by: §I.
 [5] (201703) Learning sparse graphs under smoothness prior. In IEEE Intl. Conf. Acoust., Speech and Signal Process. (ICASSP), pp. 6508–6512. Cited by: §I.
 [6] (2017) Learning sparse graphs under smoothness prior. In IEEE Intl. Conf. Acoust., Speech and Signal Process. (ICASSP), Vol. , pp. 6508–6512. Cited by: §IIA.
 [7] (201112) The University of Florida Sparse Matrix Collection. ACM Trans. Math. Softw. 38 (1). Cited by: §IVB.
 [8] (1972) Covariance selection. Biometrics 28 (1), pp. 157–175. Cited by: §I.
 [9] (2016) Learning Laplacian matrix in smooth graph signal representations. IEEE Trans. Signal Process. 64 (23), pp. 6160–6173. Cited by: §I, §IIB, §IVA.
 [10] (2019) Learning graphs from data: a signal representation perspective. IEEE Signal Process. Mag. 36 (3), pp. 44–63. Cited by: §I, §IV.
 [11] (201709) Graph learning from data under Laplacian and structural constraints. IEEE J. Sel. Topics Signal Process. 11 (6), pp. 825–841. Cited by: §I.
 [12] (2008) Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9 (3), pp. 432–441. Cited by: §I.
 [13] (2018) Topology identification and learning over graphs: accounting for nonlinearities and dynamics. Proc. IEEE 106 (5), pp. 787–807. Cited by: §I.
 [14] (2008) Mapping the structural core of human cerebral cortex. PLoS Biol 6 (7), pp. e159. Cited by: §IVB.
 [15] (2018) Rating prediction via graph signal processing. IEEE Trans. Signal Process. 66 (19), pp. 5066–5081. Cited by: §IIA.
 [16] (2017) Learning time varying graphs. In IEEE Intl. Conf. Acoust., Speech and Signal Process. (ICASSP), Vol. , pp. 2826–2830. Cited by: §I.
 [17] (2019) Large scale graph learning from smooth signals. In Int. Conf. Learning Representations (ICLR), Cited by: §I, §I, §IIB, §IIIB, §IV.
 [18] (2016) How to learn a graph from smooth signals. In Artif. Intel. and Stat. (AISTATS), pp. 920–929. Cited by: §I, §I, §IIB, §IIB, §IIB, §IIB, §III, §IVB, §IV, Proof of Proposition 1.
 [19] (2009) Statistical analysis of network data: methods and models. SpringerVerlag, New York, NY. Cited by: §I, §I, §IIA.
 [20] (2020) A unified framework for structured graph learning via spectral constraints. J. Mach. Learn. Res. 21 (22), pp. 1–60. Cited by: §I.
 [21] (2010) Discovering structure by learning sparse graphs. In Annual Cognitive Sc. Conf., pp. 778 – 783. Cited by: §I.
 [22] (2019) Learning Laplacian matrix from bandlimited graph signals. In IEEE Intl. Conf. Acoust., Speech and Signal Process. (ICASSP), pp. 2937–2941. Cited by: §I.
 [23] (2020) Signal processing on directed graphs: the role of edge directionality when processing and learning from network data. IEEE Signal Process. Mag. 37 (6), pp. 99–116. Cited by: §II.
 [24] (2019) Connecting the dots: identifying network structure via graph signal processing. IEEE Signal Process. Mag. 36 (3), pp. 16–43. Cited by: §I, §IV.
 [25] (201504) Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics 15, pp. 715–732. Cited by: §IV.
 [26] (2018) Graph signal processing: overview, challenges, and applications. Proc. IEEE 106 (5), pp. 808–828. Cited by: §I, §IIA.
 [27] (2014) Proximal algorithms. Foundations and Trends in optimization 1 (3), pp. 127–239. Cited by: §IIIB.
 [28] (201709) Characterization and inference of graph diffusion processes from observations of stationary signals. IEEE Trans. Signal Inf. Process. Netw. PP (99), pp. . Cited by: §I.
 [29] (201805) Learning graphs with monotone topology properties and multiple connected components. IEEE Trans. Signal Process. 66 (9), pp. 2399–2413. Cited by: §I.
 [30] (201703) Inferring sparse graphs from smooth signals with theoretical guarantees. In IEEE Intl. Conf. Acoust., Speech and Signal Process. (ICASSP), pp. 6533–6537. Cited by: §I.
 [31] (2021 (accepted; see also arXiv:2103.03762 [cs.LG])) Online graph learning under smoothness priors. In European Signal Process. Conf. (EUSIPCO), Dublin, Ireland, pp. . Cited by: §I, §IIB, §IV.
 [32] (201304) Discrete signal processing on graphs. IEEE Trans. Signal Process. 61 (7), pp. 1644–1656. Cited by: §I.
 [33] (2019) Graph topology inference based on sparsifying transform learning. IEEE Trans. Signal Process. 67 (7), pp. 1712–1727. Cited by: §I.
 [34] (201708) Network topology inference from spectral templates. IEEE Trans. Signal Inf. Process. Netw. 3 (3), pp. 467–483. Cited by: §I.
 [35] (2016) Network topology identification from spectral templates. In IEEE Wrkshp. Statistical Signal Process. (SSP), Vol. , pp. 1–5. Cited by: §IVB.
 [36] (202009) Online topology inference from streaming stationary graph signals with partial connectivity information. Algorithms 13 (9), pp. 1–19. Cited by: §I.

 [37] (2013) The emerging field of signal processing on graphs: extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Process. Mag. 30 (3), pp. 83–98. Cited by: §I.
 [38] (2021) An efficient alternating direction method for graph learning from smooth signals. In IEEE Intl. Conf. Acoust., Speech and Signal Process. (ICASSP), Toronto, Canada, pp. 5380–5384. Cited by: §I, §IIB, §IIIB, §IIIC, §III, §IV, Proof of Proposition 1.
 [39] (2007) Model selection and estimation in the Gaussian graphical model. Biometrika 94 (1), pp. 19–35. Cited by: §I.
 [40] (2004) A regularization framework for learning from graph data. In Int. Conf. Mach. Learning (ICML), Cited by: §IIA.
Supplementary Material
Proof of Lemma 1
A couple of preliminary calculations are required to derive an explicit expression for the Lipschitz constant of .
Lemma 2
The function is strongly convex with constant .
Proof: The strong convexity of with parameter follows because is a convex function.
Lemma 3
Going back to (4), recall defined so that . Then, .
Proof: Because maps the upper-triangular adjacency matrix entries in to the degree sequence , then has ones in each row while all other entries are zero. Hence, the diagonal entries of are all and the off-diagonal entries are equal to . The eigenvalues of are the roots of the characteristic polynomial. To obtain the third equality we leveraged the Sherman-Morrison formula, where stands for the adjugate matrix of . From the final factorization of the polynomial, the eigenvalues are . Because , the result follows.
Proof of Proposition 1
Leveraging the result in [2, Lemma 3.2], it follows that for all , the dual variable update iteration in (9) can be equivalently rewritten as , with
(18)  
(19) 
Starting with (18), we have from the definition of that
as desired [cf. (13)]. The last equality follows from the fact that the proximal operator of is the projection onto the nonnegative orthant .
Dual suboptimality
Recall one of the test cases in Section IV-A, where the goal was to recover a Stochastic Block Model (SBM) graph with nodes from synthetically-generated smooth signals. In Fig. 2 we depict the evolution of the dual suboptimality for the FDPG (Algorithm 1) and DPG iterations. As expected, the proposed FDPG solver markedly outperforms its non-accelerated counterpart in terms of convergence rate. Similar behavior can be observed for random graphs drawn from the Erdős-Rényi (ER) model and for different values of and . The corresponding plots are not included due to lack of space.