# Distributed Cartesian Power Graph Segmentation for Graphon Estimation

We study an extension of total variation denoising from images to Cartesian power graphs, and its application to estimating non-parametric network models. The power graph fused lasso (PGFL) segments a matrix by exploiting a known graphical structure, G, over the rows and columns. Our main result shows that for any connected graph, under subGaussian noise, the PGFL achieves the same mean-square error rate as 2D total variation denoising for signals of bounded variation. We study the use of the PGFL for denoising an observed network H, where we learn the graph G as the K-nearest neighbor graph of an estimated metric over the vertices. We provide theoretical and empirical results for estimating graphons, a non-parametric exchangeable network model, and compare to state-of-the-art graphon estimation methods.

## 1 Introduction

Total variation (TV) denoising, also known as the fused lasso, is a classical method for image denoising Chambolle and Lions (1997) that groups pixels that are adjacent to one another and have similar pixel values, a process known as segmentation. For a network, $H$, the analogous task is to segment all possible vertex pairs by segmenting the adjacency matrix of the network. While it does not make sense to segment based on the ordering of the vertices, as in TV denoising, if we have some other graph structure, $G$, over the vertices of $H$, then there is some hope of segmenting vertex pairs based on proximity in $G$. This paper studies the natural generalization of TV denoising over images to networks, using a known or learned graph to provide the structure. (Throughout, we will call the response graph, $H$, a network, and the predictor graph, $G$, a graph.) To this end, we will introduce the power graph fused lasso (PGFL), and discuss learning the graph, $G$, for graphon models, a non-parametric network model Bickel and Chen (2009).

### 1.1 Methodological overview: denoising a network with KNN-PGFL

Power graph fused lasso is one approach to denoising a response matrix $A \in \mathbb{R}^{n \times n}$ with a known graph $G = (V, E)$, where the $i$th row and column of $A$ correspond to the $i$th vertex in $V$ ($|V| = n$). For example, the underlying graph structure may be based on individuals' spatial proximity while the response matrix, $A$, may be binary, indicating whether or not the individuals are friends on a social network. Our approach will partition the set of all possible dyads, $V^{\times 2}$, based on the friendship status of these pairs of individuals. Throughout, we will call elements of $V$ vertices, and elements of $V^{\times 2}$ dyads. We will approach this problem by constructing a graph over the set of all dyads, $V^{\times 2}$, called the Cartesian power graph of degree 2, or the C2-power graph for short. Thus $(i, j)$ is a label for the dyad, which is a node in the C2-power graph, and we will study the fused lasso over the C2-power graph to denoise $A$.

Specifically, define the Cartesian power graph of degree 2 (C2-power), $G^{\square 2} = (V^{\times 2}, E^{\square 2})$, where two dyads have an edge connecting them if there is an edge between vertices in one coordinate and the vertices are equal in the other. That is, the C2-power graph edge set is

$$E^{\square 2} := \{((i_0,j_0),(i_1,j_1)) \in V^{\times 2} : (i_0 = i_1 \text{ and } (j_0,j_1) \in E) \text{ or } ((i_0,i_1) \in E \text{ and } j_0 = j_1)\}.$$

This is consistent with the well-known notion of a graph Cartesian product: the C2-power graph is the Cartesian product of $G$ with itself (commonly denoted $G \square G$). Throughout, we will let $A$ be our response matrix (supervising variable), which may be binary or continuous; for graphon estimation it is an adjacency matrix. We will consider a directed supervising variable in which $(i, j)$ is an ordered pair and $A$ may be asymmetric, but many of the results and methods can be extended to the case in which $(i, j)$ denotes an unordered pair.
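As a minimal sketch of the edge-set definition, the following builds the C2-power edge set of a small undirected graph. The function name `c2_power_edges` and the representation of undirected edges as frozensets are choices made here for illustration, and the base graph is assumed to have no self-loops.

```python
from itertools import product

def c2_power_edges(vertices, edges):
    """Edge set of the C2-power graph of an undirected graph (vertices, edges).

    Two dyads (i0, j0), (i1, j1) are joined when they agree in one
    coordinate and the other coordinates form an edge of the base graph.
    Each undirected edge is returned once, as a frozenset of two dyads.
    Assumes the base graph has no self-loops.
    """
    base = {frozenset(e) for e in edges}
    power = set()
    for k, e in product(vertices, base):
        i, j = tuple(e)
        power.add(frozenset({(k, i), (k, j)}))  # first coordinate fixed
        power.add(frozenset({(i, k), (j, k)}))  # second coordinate fixed
    return power

# The 1D chain on 3 vertices yields the 3 x 3 grid graph.
chain_vertices = [0, 1, 2]
chain_edges = [(0, 1), (1, 2)]
grid_edges = c2_power_edges(chain_vertices, chain_edges)
print(len(grid_edges))  # 2 * |V| * |E| = 12
```

The edge count $2|V||E|$ is the reason the text later notes that solvers applied directly to the C2-power graph scale with the number of dyads.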

A natural approach to segmenting a column of $A$ based on the graph $G$ is to use the fused lasso, also known as TV denoising. The fused lasso seeks to minimize the data-driven loss while also maintaining a small total variation, and over a graph it can be written,

$$\min_{\beta} \|y - \beta\|_2^2 + \lambda \|\nabla \beta\|_1, \tag{1}$$

where $y, \beta \in \mathbb{R}^n$, $\lambda > 0$, and $\nabla$ is the matrix such that $(\nabla \beta)_{(i,j)} = \beta_i - \beta_j$ for each edge $(i,j) \in E$ ($\nabla$ is known as the edge incidence matrix). The effect of the TV norm is to attract nearby vertex values to one another. Due to the nature of $\ell_1$-type norms, the solutions will tend to be piecewise constant over the graph, where for $\lambda$ large enough, there will be just a few clusters within which the values of $\beta$ are identical. Because we would like to simultaneously denoise the rows and columns of $A$, we cannot use the fused lasso individually over each row. Instead, we will denoise the entire matrix by applying the fused lasso over the C2-power graph. Power graph fused lasso (PGFL) is the solution to the following program,

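$$\min_{P \in \mathbb{R}^{n \times n}} \|A - P\|_F^2 + \lambda\left(\|\nabla P\|_1 + \|\nabla P^\top\|_1\right), \tag{2}$$

where we abuse notation to allow $\|\cdot\|_1$ to be the entrywise matrix 1-norm and $\|\cdot\|_F$ is the Frobenius norm. To see that the penalty in (2) is the TV penalty over the C2-power graph, notice that

$$\|\nabla P\|_1 + \|\nabla P^\top\|_1 = \sum_{k=1}^{n} \left(\|\nabla P_k\|_1 + \|\nabla (P^\top)_k\|_1\right) = \sum_{(i,j) \in E,\, k \in V} \left(|P_{k,i} - P_{k,j}| + |P_{i,k} - P_{j,k}|\right),$$

where $P_k$ is the $k$th column of $P$. You can see a vignette of the C2-power graph in Figure 1.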
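The identity between the matrix form of the penalty and the sum over base edges can be checked numerically. The sketch below assumes NumPy and a dense incidence matrix; the helper names (`incidence_matrix`, `pgfl_penalty`, `pgfl_penalty_by_edges`) are illustrative and not taken from the paper.

```python
import numpy as np

def incidence_matrix(n, edges):
    """Edge incidence matrix: row r has +1 at i and -1 at j for edge (i, j)."""
    D = np.zeros((len(edges), n))
    for r, (i, j) in enumerate(edges):
        D[r, i], D[r, j] = 1.0, -1.0
    return D

def pgfl_penalty(P, D):
    """Entrywise 1-norm penalty ||D P||_1 + ||D P^T||_1 from (2)."""
    return np.abs(D @ P).sum() + np.abs(D @ P.T).sum()

def pgfl_penalty_by_edges(P, edges):
    """The same penalty written as a sum over base edges (i, j) and vertices k."""
    total = 0.0
    for i, j in edges:
        total += np.abs(P[:, i] - P[:, j]).sum()  # |P[k, i] - P[k, j]| over k
        total += np.abs(P[i, :] - P[j, :]).sum()  # |P[i, k] - P[j, k]| over k
    return total

rng = np.random.default_rng(0)
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
P = rng.normal(size=(4, 4))
D = incidence_matrix(4, edges)
gap = abs(pgfl_penalty(P, D) - pgfl_penalty_by_edges(P, edges))
print(gap)  # agrees up to floating-point rounding
```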

Graphons are network models that provide a non-parametric representation of exchangeable graph models (see Diaconis and Janson (2007) for a thorough introduction). Let $f : [0,1]^2 \to [0,1]$ be a graphon and let $\xi_1, \dots, \xi_n$ be iid draws from the uniform distribution on $[0,1]$. The graphon model assumes that, conditional on the latent variables $\{\xi_i\}_{i=1}^n$, the adjacency matrix $A$ is a Bernoulli ensemble, where $\mathbb{P}(A_{i,j} = 1 \mid \xi) = f(\xi_i, \xi_j)$. This forms the network, $H$, and the model is exchangeable because the probability distribution is invariant under permutation of the vertices. This highlights the fundamental challenge that is inherent in estimating graphon models: in order to estimate $f$, we must account for the nuisance parameters $\{\xi_i\}_{i=1}^n$.
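To make the sampling model concrete, here is a small sketch that draws a directed adjacency matrix from a graphon. The two-block graphon `block_graphon` is a hypothetical example chosen for illustration, and zeroing the diagonal is a convention assumed here rather than one stated in the text.

```python
import numpy as np

def sample_graphon(f, n, rng):
    """Draw latent positions xi ~ Uniform[0, 1] and a directed adjacency matrix.

    f must accept two broadcastable arrays with values in [0, 1] and return
    edge probabilities in [0, 1]. The diagonal is zeroed (no self-loops),
    which is an assumption of this sketch.
    """
    xi = rng.uniform(size=n)
    P0 = f(xi[:, None], xi[None, :])       # P0[i, j] = f(xi_i, xi_j)
    A = (rng.uniform(size=(n, n)) < P0).astype(float)
    np.fill_diagonal(A, 0.0)
    return A, P0, xi

def block_graphon(u, v):
    """Hypothetical two-block graphon, used only for illustration."""
    same = (u < 0.5) == (v < 0.5)
    return np.where(same, 0.8, 0.2)

rng = np.random.default_rng(1)
A, P0, xi = sample_graphon(block_graphon, 200, rng)
```

Permuting the entries of `xi` permutes the rows and columns of `A` without changing its distribution, which is the exchangeability property described above.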

We will approach the graphon estimation problem by constructing the graph $G$ from the network $H$ and applying the PGFL to get an estimated probability matrix, $\hat P$. Recently, Zhang et al. (2015) proposed to estimate a metric between vertices based on the adjacency matrix, $A$, and then apply a neighborhood smoother on each column, $A_i$, separately. The metric that they construct is

$$\hat d_\infty^2(i,j) := \max_{k \ne i,j} \left|(A_i - A_j)^\top A_k\right| / n, \tag{3}$$

which they show will approximately bound the desired but unknown metric under Lipschitz assumptions. We will introduce a similar metric, $\hat d_1$, and use it to form the K-nearest neighbor graph, $G$, between the vertices of $H$. The learned graph is our proxy for the unknown latent parameters $\{\xi_i\}_{i=1}^n$, and the KNN-PGFL is the application of the PGFL, (2), to $A$ using the C2-power graph, $G^{\square 2}$.
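The metric (3) can be computed from the Gram matrix of the columns of $A$. The following is a sketch under the assumption that $A_i$ denotes the $i$th column and that $n \ge 3$; the function name `d_inf_squared` is illustrative.

```python
import numpy as np

def d_inf_squared(A):
    """Squared distance (3): max over k != i, j of |(A_i - A_j)^T A_k| / n.

    A_i denotes the i-th column of A. Returns an n x n symmetric matrix
    with zeros on the diagonal. Requires n >= 3.
    """
    n = A.shape[0]
    G = A.T @ A                              # G[i, k] = A_i^T A_k
    D2 = np.zeros((n, n))
    for i in range(n):
        diff = np.abs(G[i][None, :] - G)     # diff[j, k] = |(A_i - A_j)^T A_k|
        diff[:, i] = -np.inf                 # exclude k = i
        np.fill_diagonal(diff, -np.inf)      # exclude k = j
        D2[i] = diff.max(axis=1) / n
    np.fill_diagonal(D2, 0.0)
    return D2

rng = np.random.default_rng(0)
A_demo = (rng.uniform(size=(6, 6)) < 0.5).astype(float)
D2_demo = d_inf_squared(A_demo)
```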

### 1.2 Related Work and Contributions

Graph signal processing refers to methods that denoise, localize, detect, and predict signals over graphs. For example, each vertex may correspond to a low-powered sensor whose measurements we would like to denoise, with the graph structure based on communication between the sensors or on spatial proximity. The driving assumption is that there is some underlying signal that in some way 'respects' the graph topology, and specifies the distribution of the observations. Many of the tools in signal processing and supervised learning can be extended to the graph case, such as Fourier analysis Sandryhaila and Moura (2013); Hu et al. (2015), wavelets Crovella and Kolaczyk (2003); Sharpnack et al. (2013); Irion and Saito (2014), graph kernels Smola and Kondor (2003), and convolutional networks Kipf and Welling (2016); Henaff et al. (2015). Graph structure has previously been used in matrix completion and network denoising problems (see for example, Cai et al. (2011); Gu et al. (2010); Liu and Yang (2015); Brunner et al. (2012)), but these methods require some predetermined graph structure, such as knowledge graphs, so they are not well suited to estimating graphons, and they do not perform segmentation, which is the focus of this work.

There is an extensive body of literature on solving the fused lasso, (1). Algorithms for solving the fused lasso can be divided into two categories: solvers for a fixed $\lambda$, and path algorithms that find the solution for every $\lambda$ within a range. The fused lasso for a fixed $\lambda$ has a quadratic program dual form, and some popular algorithms for this are the projected Newton algorithm of Bertsekas (1982); Barbero and Sra (2011), the first-order primal-dual algorithm of Chambolle and Pock (2011), and the split-Bregman iteration of Goldstein and Osher (2009). Some path algorithms include the generalized lasso path algorithm of Arnold and Tibshirani (2016), and a max-flow version for the fused lasso in Hoefling (2010). If applied directly to C2-power graphs, these methods would have computational and memory complexity that scale with the number of dyads, $n^2$.

Contribution 1. We provide a distributed implementation of the power graph fused lasso (PGFL), (2), based on a novel formulation using the alternating direction method of multipliers.

Recent theoretical studies have examined statistical rate guarantees for the fused lasso over graphs, (1). In these works, it is assumed that the true signal is either of bounded variation or is piecewise constant , for some constant (although all discrete signals are technically piecewise constant, we refer to signals with a bounded number of changepoints as piecewise constant signals). Sharpnack et al. (2012) provided conditions under which one could exactly localize changepoints, edges across which the underlying signal () changes, under the piecewise constant assumption. These conditions were too strict to be realistic, and it was discovered that for many graphs, the mean-square error (MSE) of the fused lasso could diminish even though the changepoints are not precisely recovered. For 1D chain graphs of length $n$ the MSE was shown to be diminishing like for functions of bounded variation Wang et al. (2016), and for piecewise constant functions under mild conditions, Lin et al. (2017); Guntuboyina et al. (2017) (throughout, plog will refer to a poly-logarithmic term). Padilla et al. (2016) demonstrated that the MSE scales like for all connected graphs, and not just the 1D chain graph, for functions of bounded variation. For the 2D grid graph of size $n \times n$, Hütter and Rigollet (2016) demonstrated that the MSE diminishes like for both signals of bounded variation and piecewise constant signals. The 2D grid graph is the C2-power graph of the 1D chain graph, and so it is reasonable to hope that the 2D grid graph actually has the slowest convergence of any C2-power graph.

Contribution 2. We prove that for any connected C2-power graph, $G^{\square 2}$, the mean square error of the PGFL diminishes at the same rate as 2D total variation denoising when the signal is of bounded variation and the noise is subGaussian.

We next turn our attention to graphon estimation using the PGFL on a learned graph. The statistical limits of graphon estimation have been well characterized for smooth graphons, and it was found that computationally intractable profile likelihood maximization is minimax optimal for Hölder graphons, Wolfe and Olhede (2013); Gao et al. (2015). One tractable approach to graphon estimation is to order the vertices according to some graph statistic, such as the degrees of $H$, and then treat the resulting matrix (reordered by the permutation associated with this sorting) as an image and apply image segmentation tools. This methodology is called sorting and smoothing (SAS), and Chan and Airoldi (2014) use TV denoising to perform the image segmentation. The implicit assumption is that the degree is a decent proxy for the latent variable, $\xi_i$, which does not hold for most graphons.

Another related approach to segmenting the dyads is to group the vertices via a community detection method. The stochastic block model (SBM) is a special instance of the graphon model which assumes that there are latent communities for the vertices and that the probability of attachment between two vertices is a function only of the communities to which the vertices belong. This can be thought of as segmenting the dyads by taking the Cartesian product of the vertex communities, but this type of segmentation is restrictive because of this specialized structure. Heuristic or greedy methods for fitting the SBM for graphon estimation have been proposed in Airoldi et al. (2013); Cai et al. (2014), but little is known about their statistical performance and whether they can achieve minimax performance. In another approach, Chatterjee et al. (2015) proposed a spectral method that thresholds singular values and provided some MSE consistency guarantees. Currently, the best rate guarantee for a computationally tractable estimator of Lipschitz graphons is achieved by the aforementioned neighborhood smoothing method of Zhang et al. (2015), and the MSE scales like , which is significantly worse than the minimax rate of .

Contribution 3. We propose the K-nearest neighbors power graph fused lasso (KNN-PGFL) for graphon estimation, compare its empirical performance to other graphon estimators, and provide theoretical guarantees under a bounded variation assumption on the graphon and additional conditions.

## 2 Method

### 2.1 Distributed power graph fused lasso

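In this section, we provide a distributed method for solving PGFL, (2), by iterating parallel row-wise and column-wise operations. Our method uses the alternating direction method of multipliers (ADMM) to separate the two terms of the TV penalty on the C2-power graph. If we make the substitution $Q = P^\top$ then we can reformulate (2) as

$$\min_{P, Q \in \mathbb{R}^{n \times n}} \frac{1}{2}\|A - P\|_F^2 + \frac{1}{2}\|A^\top - Q\|_F^2 + \lambda\|\nabla P\|_1 + \lambda\|\nabla Q\|_1 \quad \text{s.t. } P = Q^\top.$$

The augmented Lagrangian, with multiplier $U$ and penalty parameter $\eta$, for this primal problem is

$$\frac{1}{2}\|A - P\|_F^2 + \frac{1}{2}\|A^\top - Q\|_F^2 + \lambda\|\nabla P\|_1 + \lambda\|\nabla Q\|_1 + \langle U, P - Q^\top \rangle + \frac{\eta}{2}\|P - Q^\top\|_F^2,$$

where $\langle \cdot, \cdot \rangle$ is the trace inner product. When $Q$ is fixed, the minimization with respect to $P$ takes the form of the separable minimization,

$$\min_{P \in \mathbb{R}^{n \times n}} \frac{1+\eta}{2}\|P - \tilde A\|_F^2 + \lambda\|\nabla P\|_1 = \sum_{i=1}^{n} \min_{P_i \in \mathbb{R}^n} \frac{1+\eta}{2}\|P_i - \tilde A_i\|_2^2 + \lambda\|\nabla P_i\|_1,$$

for some matrix $\tilde A$ (and vice versa for $P$ fixed). The inner minimization of the RHS is the fused lasso on the graph $G$ (the prox operator for the graph total variation), which we can take to be an algorithmic primitive. Let prox be the proximal operator; then we can summarize the resulting ADMM algorithm in Algorithm 1.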

We use projected Newton iteration to compute the proximal operator (see the Appendix for the exact specification), which requires a Laplacian system solver. Projected Newton maintains an active set of edges , such that if is in the active set then (when prox is applied to the th row/column), and similarly for with a different active set. Hence, the denoised matrix, $\hat P$, will have regions of constant value that are connected by elements of the active sets, and in this way it will segment the matrix $A$. This methodology works for any graph $G$ and response matrix $A$. In the next section we outline the application of the PGFL to graphon estimation.
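Algorithm 1 is not reproduced in this text, so the following is only a sketch of one standard ADMM iteration consistent with the splitting $P = Q^\top$ and the augmented Lagrangian above. Completing the square gives $\tilde A = (A - U + \eta Q^\top)/(1+\eta)$ for the $P$ step and $(A^\top + U^\top + \eta P^\top)/(1+\eta)$ for the $Q$ step; these expressions are derived here and are not quoted from the paper. The prox is computed by projected gradient on the box-constrained dual, a simple stand-in for the projected Newton solver the authors use. All function names and default parameters are assumptions.

```python
import numpy as np

def incidence_matrix(n, edges):
    """Edge incidence matrix: row r has +1 at i and -1 at j for edge (i, j)."""
    D = np.zeros((len(edges), n))
    for r, (i, j) in enumerate(edges):
        D[r, i], D[r, j] = 1.0, -1.0
    return D

def prox_graph_tv(Y, D, t, iters=300):
    """Column-wise prox: argmin_B 0.5 * ||Y - B||_F^2 + t * ||D B||_1.

    Solved by projected gradient on the dual, whose variables Z lie in the
    box [-t, t]; all columns are processed at once. This is NOT the
    projected Newton method of the paper, only a simple substitute.
    """
    step = 1.0 / np.linalg.norm(D, 2) ** 2    # 1 / largest eigenvalue of D D^T
    Z = np.zeros((D.shape[0], Y.shape[1]))    # one dual column per signal
    for _ in range(iters):
        B = Y - D.T @ Z                       # primal iterate from the dual
        Z = np.clip(Z + step * (D @ B), -t, t)
    return Y - D.T @ Z

def pgfl_admm(A, edges, lam, eta=1.0, iters=50):
    """ADMM sketch for (2) with the splitting P = Q^T (assumed update order)."""
    n = A.shape[0]
    D = incidence_matrix(n, edges)
    P, Q, U = A.copy(), A.T.copy(), np.zeros((n, n))
    t = lam / (1.0 + eta)                     # rescaled penalty for the prox
    for _ in range(iters):
        P = prox_graph_tv((A - U + eta * Q.T) / (1.0 + eta), D, t)
        Q = prox_graph_tv((A.T + U.T + eta * P.T) / (1.0 + eta), D, t)
        U = U + eta * (P - Q.T)               # dual ascent on the constraint
    return P, Q

rng = np.random.default_rng(0)
chain = [(i, i + 1) for i in range(4)]        # 1D chain graph on 5 vertices
A_demo = rng.uniform(size=(5, 5))
P_hat, Q_hat = pgfl_admm(A_demo, chain, lam=0.1)
```

Each prox call acts on the columns independently, which is what allows the row-wise and column-wise steps to be distributed. With `lam=0` the sketch returns `A` unchanged, and with a very large `lam` on a connected graph it returns the constant matrix equal to the mean of `A`.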

### 2.2 Fused graphon estimation

Suppose that the response matrix $A$ is the adjacency matrix for an observed network $H$, and we are tasked with estimating the underlying probability matrix $P_0$. A natural approach is to begin with a metric that is extracted from $A$, and then form the K-nearest neighbor (KNN) graph for this metric. The idea is that if the underlying graphon is of sufficiently controlled variation with respect to the metric, then the variation between KNNs will be likewise controlled.

Constructing a meaningful metric over the vertices of the graphon is challenging because there are only a few statistics of the graphon that can be reliably estimated. Particularly, Zhang et al. (2015) observed that the inner product

has the unbiased estimator

when , but (the degree of the vertex divided by ) has expectation . So, estimating the norm between the graphon cross-sections (and most other common norms) is exceedingly difficult. Zhang et al. (2015) approached this problem by approximating the metric with $\hat d_\infty$, (3). We propose the use of a similar metric, which we empirically observe to be a more stable variant,

$$\hat d_1^2(i,j) := \frac{1}{n(n-2)} \sum_{k \ne i,j} \left|(A_i - A_j)^\top A_k\right|.$$

We then generate the KNN graph, $G$, which is defined to be symmetric and undirected, by connecting two vertices with an edge if either is a K-nearest neighbor of the other. By applying the PGFL to $A$ and $G$, we obtain a $\hat P$ which will be piecewise constant. Finally, we obtain a partition of the dyads, consisting of those regions of the C2-power graph, $G^{\square 2}$, over which $\hat P$ is constant.
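The metric $\hat d_1$ and the symmetric KNN graph can be sketched as follows. The function names and the stable tie-breaking rule are choices made for this illustration, and the toy block matrix (which keeps ones on the diagonal for simplicity) is a hypothetical input rather than data from the paper.

```python
import numpy as np

def d1_squared(A):
    """Squared distance: sum over k != i, j of |(A_i - A_j)^T A_k| / (n (n - 2))."""
    n = A.shape[0]
    G = A.T @ A                              # G[i, k] = A_i^T A_k
    D2 = np.zeros((n, n))
    for i in range(n):
        diff = np.abs(G[i][None, :] - G)     # diff[j, k] = |(A_i - A_j)^T A_k|
        diff[:, i] = 0.0                     # drop k = i
        np.fill_diagonal(diff, 0.0)          # drop k = j
        D2[i] = diff.sum(axis=1) / (n * (n - 2))
    return D2

def knn_graph_edges(D2, K):
    """Symmetric KNN graph: join i and j if either is among the other's K nearest."""
    n = D2.shape[0]
    edges = set()
    for i in range(n):
        order = [int(j) for j in np.argsort(D2[i], kind="stable") if j != i]
        for j in order[:K]:
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)

# Toy network with two groups of three vertices and identical columns per group.
A_blocks = np.kron(np.eye(2), np.ones((3, 3)))
D2_blocks = d1_squared(A_blocks)
knn_edges = knn_graph_edges(D2_blocks, K=2)
print(knn_edges)  # only within-group pairs are connected
```

The resulting edge list can be passed to any solver for (2) as the base graph, for example an ADMM routine that takes an edge list.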

## 3 Theory

We will begin our theoretical analysis with a mean-square error guarantee for the PGFL on any graph $G$. This will give us corollaries for graphon estimation according to Algorithm 2.

### 3.1 Guarantees for general power graphs

Recall that Hütter and Rigollet (2016) demonstrated that the MSE of total variation denoising of a 2D image scales like under a bounded variation assumption. 2D total variation denoising is the PGFL when $G$ is the 1D chain graph, and for our main result, we find that this rate guarantee holds for any connected graph, $G$. To prove this result we use a proof technique pioneered in Padilla et al. (2016), where the depth-first search algorithm is used to reorder the vertices in a way that approximately preserves total variation. We modify this technique to work for C2-power graphs, and arrive at our desired conclusion.

###### Theorem 1 (PGFL for general G).

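Suppose that $A$ has expectation $P_0$, and that $G$ has $q$ connected components. Let $\hat P$ be the solution to (2), let $R = A - P_0$, and assume each entry, $R_{i,j}$, is independent and subGaussian. Then for some choice of $\lambda$ the MSE decays,

$$\frac{1}{n^2}\|\hat P - P_0\|_F^2 = O_P\left(\frac{q^2 \log q}{n^2} + \frac{q^2 \|\hat P - P_0\|_\infty \log n}{n} + \frac{\log n}{n^2}\left(\|\nabla P_0\|_1 + \|\nabla P_0^\top\|_1\right)\right).$$

We note that if the graph is connected, then $q = 1$, and from the proof we see that the term involving $\|\hat P - P_0\|_\infty$ disappears from the upper bound.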

### 3.2 Graphon estimation

Algorithm 2 is predicated on the notion that if you consider the K-nearest neighbors of vertex $i$ in $G$, then these will have similar graphon cross-sections, namely, for neighbor $j$. When $f$ is sufficiently smooth, then will imply that the corresponding graphon cross-sections are similar. In this work, the notion of smoothness that we will assume is that the cross-sections are of bounded variation.

###### Assumption 1.

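There exists a constant $B$, such that for any $v \in [0,1]$ and any $0 \le u_1 < \dots < u_s \le 1$ we have that the graphon $f$ satisfies,

$$\sum_{l=1}^{s-1} |f(u_l, v) - f(u_{l+1}, v)| \le B, \qquad \sum_{l=1}^{s-1} |f(v, u_l) - f(v, u_{l+1})| \le B.$$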

Our proposed Algorithm 2 is a departure from the neighborhood smoother of Zhang et al. (2015) in two ways: we use the metric $\hat d_1$ instead of $\hat d_\infty$, and the PGFL provides a segmentation of the entire adjacency matrix, $A$ (as opposed to smoothing in a row-wise fashion). We find in this section that the performance of the KNN-PGFL is very dependent on the quality of our underlying metric $\hat d_1$, which we find to be more stable than $\hat d_\infty$. This is consistent with the theoretical results in Zhang et al. (2015). Roughly speaking, the statistical rate bottleneck in their analysis lies with the variability of their metric $\hat d_\infty$. One can imagine that because is based on the average of independent random variables, it will have a standard error of around . Notably this error is additive, meaning that even when , we may have that is on the order of . This measurement error means that the resolution for estimating a smooth graphon using will be at the scale of , which is significantly different from the optimal resolution of ; we can smooth at bandwidths that are on this order and obtain optimal graphon estimators. This additive error term, $\Delta_n$, is made precise in the following assumptions, which apply to any choice of metric.

###### Assumption 2.

The distance $\hat d$ is lower Lipschitz wrt $\xi$ with constant $L_1$, and additive error $\Delta_n$, if for all $i, j$,

$$L_1 |\xi_i - \xi_j| - \Delta_n \le \hat d^2(i,j).$$

###### Assumption 3.

The distance $\hat d$ is piecewise Lipschitz wrt $\xi$ with constant $L_2$, and additive error $\Delta_n$, if the following holds. There exists a constant and a partition and sets with , and for , and with , such that for ,

$$\xi_i, \xi_j \in A_l,\ l \in \{1, \dots, m-1\}, \text{ implies } \hat d^2(i,j) \le L_2 |\xi_i - \xi_j| + \Delta_n.$$

Assumption 2 is a statement that the cross-sections do not repeat themselves, in the sense that if $\xi_i$ is far from $\xi_j$ then the corresponding cross-sections are sufficiently different in the metric. Assumption 3 will hold for if we assume that the graphon is piecewise Lipschitz where (see the Appendix of Zhang et al. (2015) for a similar derivation). If we have a metric and graphon that satisfy these assumptions, then we can obtain an MSE rate bound that is dependent on $\Delta_n$.

###### Corollary 1.

Suppose that $A$ is drawn from a graphon model with graphon $f$ that satisfies Assumption 1, and let $P_0$ be the conditional edge probability. Let $\hat P$ be the output of Algorithm 2 applied to $A$ with a metric $\hat d$ that satisfies Assumptions 2, 3, and for and . Suppose that the KNN graph, $G$, has $q$ connected components, then there is a choice of $\lambda$ such that

$$\frac{1}{n^2}\|\hat P - P_0\|_F^2 = O_P\left(\frac{q^2 \log n}{n} + (K^2 + nK\Delta_n) \cdot \frac{\log n}{n}\right).$$

The MSE bound of Corollary 1 is dependent on the additive error of the metric, $\Delta_n$. In the event that we find a metric with additive error , and the KNN graph is connected, then the KNN-PGFL can achieve near minimax rates (unfortunately, all known metrics have an error ). Instead of making our assumptions about $\hat d_1$, we make these assumptions about the population level version of the metric,

$$d_1^2(i,j) := \int_0^1 \left| \int_0^1 \left(f(\xi_i, v) - f(\xi_j, v)\right) f(u, v)\, dv \right| du.$$

We now consider these assumptions placed on $d_1$ instead of $\hat d_1$.

###### Corollary 2.

Suppose that Assumptions 2, 3 hold for $d_1$ with then they also hold for $\hat d_1$ with . With these assumptions on $d_1$ and under the remaining conditions of Corollary 1, and assuming we set , the KNN-PGFL with metric $\hat d_1$ has MSE bound

$$\frac{1}{n^2}\|\hat P - P_0\|_F^2 = O_P\left(\frac{\operatorname{plog} n}{\sqrt{n}}\right).$$

This result is consistent (up to logarithmic terms) with what was found in Zhang et al. (2015), although under somewhat different conditions. It is outside of the scope of this work to comprehensively study the construction of better metrics, and we believe that a significant departure from $\hat d_1$ and $\hat d_\infty$ is needed.

## 4 Experiments

To test the empirical performance of KNN-PGFL, we simulate from five graphon models and evaluate the mean-square error of some important graphon estimators. In addition to our own, four other methods were used for comparison: neighborhood smoothing (NS) Zhang et al. (2015), sorting and smoothing (SAS) Chan and Airoldi (2014), the stochastic block model (SBM) Airoldi et al. (2013), and USVT Chatterjee et al. (2015). For each graphon function and each repetition, a graph with 1000 nodes was generated. NS, SAS, USVT and KNN-PGFL were applied to the same graph. The penalty parameter in KNN-PGFL was chosen as for all graphon functions, and the 2-nearest neighbor graph was used (this gave a well connected KNN graph with only a few connected components). The stopping criterion for KNN-PGFL was , and the resulting was used as the estimated probability matrix. For the SAS method, the bandwidth parameter was chosen as . For SBM, at least two observed graphs are needed, so to make the comparison fair, for each repetition, 4 graphs with 500 nodes were generated according to the graphon function. SBM was applied to the four observed graphs, and the tuning parameter ( in their paper) was chosen by cross-validation. The MSEs, averaged over 30 repetitions and multiplied by , are shown in Table 1.

The performance of each estimator is bound to be highly dependent on the structure of the graphon (see Figure 2 for the graphons and their estimates). Graphon A has monotonic node degrees and is of low rank; as a result SAS and USVT perform well in this case, and KNN-PGFL performs similarly. Graphon B is a graph with blocks, and also a piecewise constant function; KNN-PGFL performs best, followed by SBM, which is designed for this situation. Graphon C is a smooth graphon function with local structure, and the best result is obtained by USVT, followed by NS and KNN-PGFL, but none of the methods is especially well suited to this graphon. Graphons D and E are both piecewise constant graphon functions. Due to the lack of monotonicity here, SAS fails to recover the probability matrix. KNN-PGFL gives the best MSE results, followed by SBM. For all five graphons, KNN-PGFL performs well and does not catastrophically fail, and in all but one case, it significantly outperforms the other segmentation methods, SAS and SBM.

## 5 Conclusion

We proposed the power graph fused lasso for denoising a matrix with a known graph over the rows and columns. Our main result, Theorem 1, demonstrates that it achieves the same mean-square error guarantee as 2D total variation denoising under a subGaussian error model. We proposed its use for graphon estimation with the K-nearest neighbors graph, and studied its performance both theoretically and empirically. We find that it is empirically competitive with existing methods and significantly outperforms the other graphon segmentation methods, SAS and SBM. Theoretically and experimentally, we find that the performance of KNN-PGFL is limited by the quality of the distance metric $\hat d$, due to the additive error characterized in Assumptions 2, 3 (a problem shared by neighborhood smoothing, Zhang et al. (2015)). We hope that future work can discover better vertex metrics for graphon models that can be used in conjunction with the proposed methodology.

### Acknowledgements

JS is partially supported by NSF DMS-1712996.

## References

• Airoldi et al. (2013) Airoldi, E. M., Costa, T. B., and Chan, S. H. (2013). Stochastic blockmodel approximation of a graphon: Theory and consistent estimation. In Advances in Neural Information Processing Systems, pages 692–700.
• Arnold and Tibshirani (2016) Arnold, T. B. and Tibshirani, R. J. (2016). Efficient implementations of the generalized lasso dual path algorithm. Journal of Computational and Graphical Statistics, 25(1):1–27.
• Barbero and Sra (2011) Barbero, A. and Sra, S. (2011). Fast newton-type methods for total variation regularization. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 313–320. Citeseer.
• Bertsekas (1982) Bertsekas, D. P. (1982). Projected newton methods for optimization problems with simple constraints. SIAM Journal on control and Optimization, 20(2):221–246.
• Bickel and Chen (2009) Bickel, P. J. and Chen, A. (2009). A nonparametric view of network models and newman–girvan and other modularities. Proceedings of the National Academy of Sciences, 106(50):21068–21073.
• Brunner et al. (2012) Brunner, C., Fischer, A., Luig, K., and Thies, T. (2012). Pairwise support vector machines and their application to large scale problems. Journal of Machine Learning Research, 13(Aug):2279–2292.
• Cai et al. (2014) Cai, D., Ackerman, N., and Freer, C. (2014). An iterative step-function estimator for graphons. arXiv preprint arXiv:1412.2129.
• Cai et al. (2011) Cai, D., He, X., Han, J., and Huang, T. S. (2011). Graph regularized nonnegative matrix factorization for data representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8):1548–1560.
• Chambolle and Lions (1997) Chambolle, A. and Lions, P.-L. (1997). Image recovery via total variation minimization and related problems. Numerische Mathematik, 76(2):167–188.
• Chambolle and Pock (2011) Chambolle, A. and Pock, T. (2011). A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of mathematical imaging and vision, 40(1):120–145.
• Chan and Airoldi (2014) Chan, S. and Airoldi, E. (2014). A consistent histogram estimator for exchangeable graph models. In International Conference on Machine Learning, pages 208–216.
• Chatterjee et al. (2015) Chatterjee, S. et al. (2015). Matrix estimation by universal singular value thresholding. The Annals of Statistics, 43(1):177–214.
• Crovella and Kolaczyk (2003) Crovella, M. and Kolaczyk, E. (2003). Graph wavelets for spatial traffic analysis. In INFOCOM 2003. Twenty-Second Annual Joint Conference of the IEEE Computer and Communications. IEEE Societies, volume 3, pages 1848–1857. IEEE.
• Diaconis and Janson (2007) Diaconis, P. and Janson, S. (2007). Graph limits and exchangeable random graphs. arXiv preprint arXiv:0712.2749.
• Gao et al. (2015) Gao, C., Lu, Y., Zhou, H. H., et al. (2015). Rate-optimal graphon estimation. The Annals of Statistics, 43(6):2624–2652.
• Goldstein and Osher (2009) Goldstein, T. and Osher, S. (2009). The split bregman method for l1-regularized problems. SIAM journal on imaging sciences, 2(2):323–343.
• Gu et al. (2010) Gu, Q., Zhou, J., and Ding, C. (2010). Collaborative filtering: Weighted nonnegative matrix factorization incorporating user and item graphs. In Proceedings of the 2010 SIAM International Conference on Data Mining, pages 199–210. SIAM.
• Guntuboyina et al. (2017) Guntuboyina, A., Lieu, D., Chatterjee, S., and Sen, B. (2017). Spatial adaptation in trend filtering. arXiv preprint arXiv:1702.05113.
• Henaff et al. (2015) Henaff, M., Bruna, J., and LeCun, Y. (2015). Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163.
• Hoefling (2010) Hoefling, H. (2010). A path algorithm for the fused lasso signal approximator. Journal of Computational and Graphical Statistics, 19(4):984–1006.
• Hu et al. (2015) Hu, W., Cheung, G., Ortega, A., and Au, O. C. (2015). Multiresolution graph fourier transform for compression of piecewise smooth images. IEEE Transactions on Image Processing, 24(1):419–433.
• Hütter and Rigollet (2016) Hütter, J.-C. and Rigollet, P. (2016). Optimal rates for total variation denoising. In Conference on Learning Theory, pages 1115–1146.
• Irion and Saito (2014) Irion, J. and Saito, N. (2014). The generalized haar-walsh transform. In Statistical Signal Processing (SSP), 2014 IEEE Workshop on, pages 472–475. IEEE.
• Kipf and Welling (2016) Kipf, T. N. and Welling, M. (2016). Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
• Lin et al. (2017) Lin, K., Sharpnack, J. L., Rinaldo, A., and Tibshirani, R. J. (2017). A sharp error analysis for the fused lasso, with application to approximate changepoint screening. In Advances in Neural Information Processing Systems, pages 6887–6896.
• Liu and Yang (2015) Liu, H. and Yang, Y. (2015). Bipartite edge prediction via transductive learning over product graphs. In International Conference on Machine Learning, pages 1880–1888.
• Padilla et al. (2016) Padilla, O. H. M., Scott, J. G., Sharpnack, J., and Tibshirani, R. J. (2016). The dfs fused lasso: Linear-time denoising over general graphs. arXiv preprint arXiv:1608.03384.
• Sandryhaila and Moura (2013) Sandryhaila, A. and Moura, J. M. (2013). Discrete signal processing on graphs: Graph fourier transform. In ICASSP, pages 6167–6170.
• Sharpnack et al. (2013) Sharpnack, J., Singh, A., and Krishnamurthy, A. (2013). Detecting activations over graphs using spanning tree wavelet bases. In Artificial Intelligence and Statistics, pages 536–544.
• Sharpnack et al. (2012) Sharpnack, J., Singh, A., and Rinaldo, A. (2012). Sparsistency of the edge lasso over graphs. In Artificial Intelligence and Statistics, pages 1028–1036.
• Smola and Kondor (2003) Smola, A. J. and Kondor, R. (2003). Kernels and regularization on graphs. In Learning theory and kernel machines, pages 144–158. Springer.
• Von Luxburg et al. (2010) Von Luxburg, U., Radl, A., and Hein, M. (2010). Hitting and commute times in large graphs are often misleading. arXiv preprint arXiv:1003.1266.
• Wang et al. (2016) Wang, Y.-X., Sharpnack, J., Smola, A. J., and Tibshirani, R. J. (2016). Trend filtering on graphs. Journal of Machine Learning Research, 17(105):1–41.
• Wolfe and Olhede (2013) Wolfe, P. J. and Olhede, S. C. (2013). Nonparametric graphon estimation. arXiv preprint arXiv:1309.5936.
• Zhang et al. (2015) Zhang, Y., Levina, E., and Zhu, J. (2015). Estimating network edge probabilities by neighborhood smoothing. arXiv preprint arXiv:1509.08588.

## 6 Appendix for “Distributed Cartesian Power Graph Segmentation for Graphon Estimation”

###### Proof of Theorem 1.

We follow the standard derivation of an MSE bound for penalized estimation, with minor modifications. See, for example, Wang et al. (2016) for many of these tools. Let us also denote by , the connected components of . We also write for the incidence matrix corresponding to , and .

Throughout for a matrix . If , we denote by the matrix in such that for , .

Now, back to the proof, we recall the basic inequality,

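$$
\begin{aligned}
\|\hat P-P_0\|_F^2 \le{}& 2\sum_{l'=1}^{q}\sum_{l=1}^{q}\left|\left\langle R_{V_{l'},V_l},(\hat P-P_0)_{V_{l'},V_l}\right\rangle_F\right| \\
&+\lambda\sum_{l'=1}^{q}\sum_{i\in V_{l'}}\sum_{l=1}^{q}\left[\|\nabla_l (P_0)_{i,\cdot}\|_1+\|\nabla_l (P_0)_{\cdot,i}\|_1\right] \\
&-\lambda\sum_{l'=1}^{q}\sum_{i\in V_{l'}}\sum_{l=1}^{q}\left[\|\nabla_l \hat P_{i,\cdot}\|_1+\|\nabla_l \hat P_{\cdot,i}\|_1\right].
\end{aligned}
\tag{4}
$$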

Consider now running depth first search (DFS) on , and let be the ordering of such that is the th vertex that the DFS visits (let the DFS start from an arbitrary node). Let denote the edge incidence matrix for the 1D chain graph that connects to for . Then by Lemma 1 in Padilla et al. (2016), for any we have that

$$
\sum_{i\in V_{l'}}\|\nabla_{1D,l}P_{i,\cdot}\|_1\le 2\sum_{i\in V_{l'}}\|\nabla_{l}P_{i,\cdot}\|_1. \tag{5}
$$

And we notice that

$$
\sum_{i\in V_{l'}}\|\nabla_{1D,l}P_{i,\cdot}\|_1+\sum_{i\in V_{l}}\|\nabla_{1D,l'}P_{\cdot,i}\|_1=:\|\nabla_{2D,l',l}P_{l',l}\|_1 \tag{6}
$$

where the latter is the total variation of , along an appropriately constructed 2D grid graph of size . Here, denotes the vectorization.
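Inequality (5) can be checked numerically on a small example. The following is a minimal sketch, not the authors' code: the random connected graph, the signal `theta`, and all function names are hypothetical choices made for illustration. It compares the total variation of a signal along the DFS-induced chain with twice its total variation over the original graph.

```python
import random

def random_connected_graph(n, extra_edges, rng):
    """Random spanning tree plus a few extra edges (illustrative construction)."""
    edges = set()
    for v in range(1, n):
        edges.add((rng.randrange(v), v))  # attach v to an earlier vertex
    while len(edges) < n - 1 + extra_edges:
        u, v = rng.sample(range(n), 2)
        edges.add((min(u, v), max(u, v)))
    return sorted(edges)

def dfs_order(adj, start=0):
    """Vertices in the order in which a depth-first search first visits them."""
    seen, order, stack = set(), [], [start]
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        order.append(v)
        stack.extend(reversed(adj[v]))  # lower-numbered neighbours explored first
    return order

def graph_tv(edges, theta):
    """Sum of |theta_u - theta_v| over the edges of the graph."""
    return sum(abs(theta[u] - theta[v]) for u, v in edges)

def chain_tv(order, theta):
    """Total variation along the 1D chain joining consecutive DFS vertices."""
    return sum(abs(theta[a] - theta[b]) for a, b in zip(order, order[1:]))

rng = random.Random(0)
n = 30
edges = random_connected_graph(n, 20, rng)
adj = {v: [] for v in range(n)}
for u, v in edges:
    adj[u].append(v)
    adj[v].append(u)
for v in adj:
    adj[v].sort()

order = dfs_order(adj)
theta = [rng.gauss(0.0, 1.0) for _ in range(n)]
tv_chain = chain_tv(order, theta)
tv_graph = graph_tv(edges, theta)
print(tv_chain, 2 * tv_graph)
```

The chain never uses more than twice the graph's total variation because consecutive DFS vertices are joined by a path in the DFS tree, and the whole traversal crosses each tree edge at most twice.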

Now for , we define for the matrix as

$$
S_{l',l}(x)_{i,j}=\begin{cases}x_{i,j} & \text{if } (i,j)\in V_{l'}\times V_{l},\\ 0 & \text{otherwise.}\end{cases}
$$

Let us consider the ordering obtained by concatenating the DFS orderings associated with the graphs, see above. Let be the incidence operator associated with the 2D grid graph using such ordering . We also write for the orthogonal projection on the span of . Then by (5) and (6), Cauchy–Schwarz inequality, and Hölder inequality,

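$$
\begin{aligned}
\left|\left\langle R_{V_{l'},V_l},(\hat P-P_0)_{V_{l'},V_l}\right\rangle_F\right|
={}&\left|\left\langle S_{l',l}(R),S_{l',l}(\hat P-P_0)\right\rangle_F\right| \\
\le{}&\left|\left\langle \Pi\,\mathrm{vec}(S_{l',l}(R)),\mathrm{vec}(S_{l',l}(\hat P-P_0))\right\rangle_F\right| \\
&+\left|\left\langle (\nabla_{2D}^{+})^{T}\mathrm{vec}(S_{l',l}(R)),\nabla_{2D}\mathrm{vec}(S_{l',l}(\hat P-P_0))\right\rangle_F\right| \\
\le{}&\|\Pi\,\mathrm{vec}(S_{l',l}(R))\|_2\,\|(\hat P-P_0)_{V_{l'},V_l}\|_2 \\
&+\|(\nabla_{2D}^{+})^{T}\mathrm{vec}(S_{l',l}(R))\|_\infty\,\|\nabla_{2D}\mathrm{vec}(S_{l',l}(\hat P-P_0))\|_1 \\
\le{}&\|\Pi\,\mathrm{vec}(S_{l',l}(R))\|_2\,\|(\hat P-P_0)_{V_{l'},V_l}\|_2 \\
&+2\|(\nabla_{2D}^{+})^{T}\mathrm{vec}(S_{l',l}(R))\|_\infty\left[\sum_{i\in V_{l'}}\|\nabla_{l}(\hat P-P_0)_{i,\cdot}\|_1+\sum_{i\in V_{l}}\|\nabla_{l'}(\hat P-P_0)_{\cdot,i}\|_1+2n\|\hat P-P_0\|_\infty\right].
\end{aligned}
$$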

On the other hand, because the entries of are iid subGaussian (aside from those that were set to ), for any ,

$$
\max_{l,l'\in[q]}\|\Pi\,\mathrm{vec}(S_{l',l}(R))\|_2\le 2\sigma\sqrt{2\log(eq^2/u)},
$$

and

with probability at least .

Moreover, by Proposition 1 of Hütter and Rigollet (2016), there exists a positive constant $C$ such that

$$
\max_{j}\|(\nabla_{2D}^{+})_{\cdot,j}\|_2\le C\sqrt{\log n}.
$$

Therefore, with probability at least ,

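$$
\begin{aligned}
\frac{1}{2}\|\hat P-P_0\|_F^2\le{}&\sum_{l'=1}^{q}\sum_{l=1}^{q}\left[4\sigma\sqrt{2\log(eq^2/u)}\,\|(\hat P-P_0)_{V_{l'},V_l}\|_2-\frac{1}{2}\|(\hat P-P_0)_{V_{l'},V_l}\|_2^2\right] \\
&+4\sigma C\sqrt{2\log n\,\log\left(\frac{2eq^2n^2}{u}\right)}\sum_{l'=1}^{q}\sum_{l=1}^{q}\Bigg[\sum_{i\in V_{l'}}\|\nabla_{l}\hat P_{i,\cdot}\|_1+\sum_{i\in V_{l}}\|\nabla_{l'}\hat P_{\cdot,i}\|_1 \\
&\qquad+\sum_{i\in V_{l'}}\|\nabla_{l}(P_0)_{i,\cdot}\|_1+\sum_{i\in V_{l}}\|\nabla_{l'}(P_0)_{\cdot,i}\|_1+2n\|\hat P-P_0\|_\infty\Bigg] \\
&+\lambda\sum_{l'=1}^{q}\sum_{i\in V_{l'}}\sum_{l=1}^{q}\left[\|\nabla_l (P_0)_{i,\cdot}\|_1+\|\nabla_l (P_0)_{\cdot,i}\|_1\right] \\
&-\lambda\sum_{l'=1}^{q}\sum_{i\in V_{l'}}\sum_{l=1}^{q}\left[\|\nabla_l \hat P_{i,\cdot}\|_1+\|\nabla_l \hat P_{\cdot,i}\|_1\right]
\end{aligned}
$$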

and so by the inequality , and choosing as

$$
\lambda=4\sigma C\sqrt{2\log n\,\log\left(\frac{2eq^2n^2}{u}\right)},
$$

we obtain

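$$
\begin{aligned}
\frac{1}{2}\|\hat P-P_0\|_F^2\le{}&8q^2\sigma^2\log(eq^2/u)+\left[8\sigma C\sqrt{2\log n\,\log\left(\frac{2eq^2n^2}{u}\right)}\right]nq^2\|\hat P-P_0\|_\infty \\
&+8\sigma C\sqrt{2\log n\,\log\left(\frac{2eq^2n^2}{u}\right)}\sum_{l'=1}^{q}\sum_{i\in V_{l'}}\sum_{l=1}^{q}\left[\|\nabla_l (P_0)_{i,\cdot}\|_1+\|\nabla_l (P_0)_{\cdot,i}\|_1\right]
\end{aligned}
$$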

with probability at least .

###### Proof of Corollary 1.

First, we observe that for all . If not, then both the loss and the objective in the definition of can be improved by setting as

$$
\tilde P_{i,j}=\begin{cases}1 & \text{if } \hat P_{i,j}>1,\\ 0 & \text{if } \hat P_{i,j}<0,\\ \hat P_{i,j} & \text{otherwise.}\end{cases}
$$

Hence, .
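The clipping argument can be illustrated with a short numerical check. This is a sketch under stated assumptions rather than the estimator itself: the observations `A` are taken to be a hypothetical 0/1 matrix, `P_hat` is an arbitrary matrix with some entries outside [0, 1], and a path graph stands in for the graph over rows and columns. Clipping is 1-Lipschitz and leaves values in [0, 1] unchanged, so neither the squared loss nor the total variation penalty can increase.

```python
import random

def clip01(x):
    """Project a number onto the interval [0, 1]."""
    return min(1.0, max(0.0, x))

def sq_loss(P, Y):
    """Squared Frobenius distance between two matrices given as nested lists."""
    return sum((p - y) ** 2 for rp, ry in zip(P, Y) for p, y in zip(rp, ry))

def tv(P, edges):
    """Total variation of every row and every column along the given edges."""
    m = len(P)
    rows = sum(abs(P[i][u] - P[i][v]) for i in range(m) for u, v in edges)
    cols = sum(abs(P[u][i] - P[v][i]) for i in range(m) for u, v in edges)
    return rows + cols

rng = random.Random(1)
n = 8
edges = [(j, j + 1) for j in range(n - 1)]  # hypothetical path graph
A = [[float(rng.random() < 0.5) for _ in range(n)] for _ in range(n)]
P_hat = [[rng.uniform(-0.5, 1.5) for _ in range(n)] for _ in range(n)]
P_tilde = [[clip01(x) for x in row] for row in P_hat]

print(sq_loss(P_tilde, A), sq_loss(P_hat, A))
print(tv(P_tilde, edges), tv(P_hat, edges))
```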

Next, let and for some .

Let and and

$$
\tilde B(i,\delta)=\{j\neq i:\xi_j\in B_\delta(\xi_i)\}.
$$

Notice that the have the same distribution for all . By Proposition 27 from Von Luxburg et al. (2010),

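$$
\begin{aligned}
\mathbb{P}\left\{|\tilde B(1,\delta)|\le\frac{\delta n}{4}\right\}
&=\int_0^1\mathbb{P}\left(\left|\{j>2:\xi_j\in B_\delta(x)\}\right|\le\frac{\delta n}{4}\right)dx \\
&\le\int_0^1\mathbb{P}\left(\left|\{j>2:\xi_j\in B_\delta(x)\cap A_l(x)\}\right|\le\frac{n-1}{2}\mathrm{Vol}(B_\delta(x))\right)dx \\
&\le\int_0^1\exp\left(-\frac{n\,\mathrm{Vol}(B_\delta(x))}{24}\right)dx \\
&=\exp\left(-\frac{n\delta}{12}\right).
\end{aligned}
$$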

Let us consider the event

$$
\Omega(\delta)=\bigcap_{i=1}^{n}\left\{|\tilde B(i,\delta)|\ge\frac{\delta n}{4}\right\}.
$$
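A quick simulation gives a feel for the event Ω(δ). The sketch below rests on assumptions not fixed by the text above: the latent positions are drawn i.i.d. uniform on [0, 1], the ball of radius δ around x is the interval [x − δ, x + δ], and the values of `n` and `delta` are arbitrary. It counts the neighbours of every vertex, compares the smallest count with δn/4, and evaluates the union bound n·exp(−nδ/12) implied by the tail bound above.

```python
import bisect
import math
import random

rng = random.Random(0)
n, delta = 2000, 0.05  # illustrative values only

# Latent positions, assumed i.i.d. uniform on [0, 1]; sorted for fast counting.
xi = sorted(rng.random() for _ in range(n))

def neighbour_count(x):
    """Number of other points within distance delta of x."""
    lo = bisect.bisect_left(xi, x - delta)
    hi = bisect.bisect_right(xi, x + delta)
    return hi - lo - 1  # subtract one to exclude the point itself

counts = [neighbour_count(x) for x in xi]
min_count = min(counts)
threshold = delta * n / 4

# Union bound over the n vertices on the probability that Omega(delta) fails.
union_bound = n * math.exp(-n * delta / 12)

print(min_count, threshold, union_bound)
```

In this simulation even boundary points, whose balls are cut in half by the edge of [0, 1], keep far more than δn/4 neighbours, which suggests the threshold in Ω(δ) is conservative for these parameter values.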

Set . By the union bound if (which is satisfied under our assumptions for ) so henceforth assume . Let , with then by Assumption 3,

$$
\hat d(i,i')\le L_2|\xi_i-\xi_{i'}|+\Delta_n\le\frac{4L_2K}{n}+(L_2+1)\Delta_n,
$$

and notice that on there are at least such vertices .

On the other hand, let such that

$$
\delta':=C\left(\frac{K}{n}+\Delta_n\right)>\frac{1}{L_1}\left(\frac{4L_2K}{n}+(L_2+2)\Delta_n\right).
$$

Hence, by Assumption 2, if , then