 # How to learn a graph from smooth signals

We propose a framework that learns the graph structure underlying a set of smooth signals. Given X ∈ R^{m×n} whose rows reside on the vertices of an unknown graph, we learn the edge weights w ∈ R_+^{m(m−1)/2} under the smoothness assumption that tr(X⊤LX) is small. We show that the problem is a weighted ℓ-1 minimization that leads to naturally sparse solutions. We point out how known graph learning or construction techniques fall within our framework and propose a new model that performs better than the state of the art in many settings. We present efficient, scalable primal-dual based algorithms for both our model and the previous state of the art, and evaluate their performance on artificial and real data.


## 1 Introduction

We consider a matrix X ∈ R^{m×n}, where each row resides on one of the m nodes of a graph G. Then each of the n columns of X can be seen as a signal on the same graph. A simple assumption about data residing on graphs, but also the most widely used one, is that it changes smoothly between connected nodes. An easy way to quantify how smooth a set of vectors x_1, …, x_m ∈ R^n is on a given weighted undirected graph is through the function

 ½ ∑_{i,j} W_ij ∥x_i − x_j∥²,

where W_ij denotes the weight of the edge between nodes i and j. In words, if two vectors x_i and x_j from a smooth set reside on two well-connected nodes (so W_ij is big), they are expected to have a small distance. Using the graph Laplacian matrix L = D − W, where D is the diagonal degree matrix with D_ii = ∑_j W_ij, this function can be written in matrix form as

 tr(X⊤LX).
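As a quick numerical check of this identity, the elementwise sum and the trace form can be compared on hypothetical random data (the matrices below are illustrative, not from the paper):

```python
import numpy as np

# Hypothetical small example: m = 5 nodes, n = 3 signals.
rng = np.random.default_rng(0)
m, n = 5, 3
X = rng.standard_normal((m, n))

# A random symmetric weight matrix with zero diagonal.
W = rng.random((m, m))
W = (W + W.T) / 2
np.fill_diagonal(W, 0)

# Element-wise form: (1/2) sum_ij W_ij ||x_i - x_j||^2.
elementwise = 0.5 * sum(
    W[i, j] * np.sum((X[i] - X[j]) ** 2) for i in range(m) for j in range(m)
)

# Matrix form: tr(X^T L X) with L = D - W.
D = np.diag(W.sum(axis=1))
L = D - W
matrix_form = np.trace(X.T @ L @ X)

assert np.isclose(elementwise, matrix_form)
```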

The importance of the graph Laplacian has long been known as a tool for embedding, manifold learning, clustering and semi-supervised learning; see e.g. belkin2001laplacian, zhu2003semi, belkin2006manifold. More recently we find an abundance of methods that exploit this notion of smoothness to regularize various machine learning tasks, solving problems of the form

 minimizeX  g(X)+tr(X⊤LX). (1)

zhang2006linear use it to enhance web page categorization with graph information, zheng2011graph for graph regularized sparse coding. cai2011graph use the same term to regularize NMF, jiang2013graph for PCA and kalofolias2014matrix for matrix completion. Having good quality graphs is key to the success of the above methods.

The goal of this paper is to solve the complementary problem of learning a good graph:

 minimizeL∈L  tr(X⊤LX)+f(L), (2)

where L denotes the set of valid graph Laplacians.

Why is this problem important? Firstly, because it enables us to directly learn the hidden graph structure behind our data. Secondly, because in most problems that can be written in the form of eq. (1), we are often given a noisy graph, or no graph at all. Therefore, starting from an initial graph and alternating between solving problems (1) and (2), we can simultaneously obtain a better-quality graph and solve the task of the initial problem.

##### Related Work.

dempster1972covariance was one of the first to propose the problem of finding connectivity from measurements, under the name “covariance selection”. Years later, banerjee2008model proposed solving an ℓ-1 penalized log-likelihood problem to estimate a sparse inverse covariance with an unknown pattern of zeros. However, while a lot of work has been done on inverse covariance estimation, the latter differs substantially from a graph Laplacian. For instance, the off-diagonal elements of a Laplacian must be non-positive, and a Laplacian is not invertible, unlike the inverse covariance.

wang2008label learn a graph with normalized degrees by minimizing a smoothness objective, but they assume a fixed k-NN edge pattern. daitch2009fitting considered a similar objective, which they approximately minimized with a greedy algorithm and a relaxation.

zhang2010transductive alternate between problems (1) and a variation of (2). However, while they start from an initial graph Laplacian, they finally learn a symmetric positive semi-definite matrix that is not necessarily a valid Laplacian.

The works most relevant to ours are the ones by lake2010discovering and by dong2015learning,dong2015laplacian. In the first one, the authors consider a problem similar to the one of the inverse covariance estimation, but impose additional constraints in order to obtain a valid Laplacian. However, their final objective function contains many constraints and a computationally demanding log-determinant term that makes it difficult to solve. To the best of our knowledge, there is no scalable algorithm in the literature to solve their model. dong2015learning propose a model that outperforms the one by lake2010discovering, but still do not provide a scalable algorithm. This work is complementary to theirs, as we not only compare against their model, but also provide an analysis and a scalable algorithm to solve it.

##### Contributions.

In this paper we make the link between smoothness and sparsity. We show that the smoothness term can equivalently be seen as a weighted ℓ-1 norm of the adjacency matrix, and minimizing it leads to naturally sparse graphs (Section 2). Based on this, we formulate our objective as a weighted ℓ-1 problem that we propose as a general framework for solving problem (2). Using this framework we propose a new model for learning a graph. We prove that our model has effectively one parameter that controls how sparse the learned graph is (Section 4).

We show how our framework includes the standard Gaussian kernel weight construction, but also the model by dong2015learning. We simplify their model and prove fundamental properties (Section 4).

We provide a fast, scalable and convergent primal-dual algorithm to solve our proposed model, but also the one by dong2015learning. To the best of our knowledge, these are the first scalable solutions in the literature to learn a graph under the smoothness assumption of (2) (Section 5).

To evaluate our model, we first review different definitions of smooth signals in the literature. We show how they can be unified under the notion of graph filtering (Section 3). We compare the models under artificial and real data settings. We conclude that our model is superior in many cases and achieves better connectivity when sparse graphs are sought (Section 6).

## 2 Properties of the Laplacian

Throughout this paper we use the combinatorial graph Laplacian defined as L = D − W, where W is the weighted adjacency matrix and D is the diagonal degree matrix with D_ii = ∑_j W_ij. The space of all valid combinatorial graph Laplacians is by definition

 L = {L ∈ R^{m×m} : (∀ i≠j)  L_ij = L_ji ≤ 0,  L_ii = −∑_{j≠i} L_ij}.

In order to learn a valid graph Laplacian, we might be tempted to search in the above space, as is done e.g. by lake2010discovering, dong2015learning. We argue that it is more intuitive to search for a valid weighted adjacency matrix W in the space

 W_m = {W ∈ R_+^{m×m} :  W = W⊤,  diag(W) = 0},

leading to simplified problems. Even more, when it comes to actually solving the problem with optimization techniques, we should consider the space of all valid edge weight vectors for a graph,

 W_v = R_+^{m(m−1)/2},

so that we do not have to deal with the symmetry of W explicitly. The spaces L, W_m and W_v are equivalent, and connected by bijective linear mappings. In this paper we use W_m to analyze the problem at hand and W_v when we solve it. Table 1 exhibits some of the equivalent forms in the three spaces.
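The mapping between the vector form w ∈ W_v and the matrix form W ∈ W_m, and the degree operator S used later in Section 5, can be sketched as follows. scipy's `squareform` happens to implement exactly this bijection; the dense realization of S is our own illustrative choice:

```python
import numpy as np
from scipy.spatial.distance import squareform

# Hypothetical example: the vector form w in W_v has one entry per
# undirected edge; squareform gives the bijection with W in W_m.
m = 4
rng = np.random.default_rng(1)
w = rng.random(m * (m - 1) // 2)           # valid point of W_v (non-negative)

W = squareform(w)                          # symmetric, zero-diagonal W in W_m
assert np.allclose(W, W.T) and np.all(np.diag(W) == 0)

# The degree operator S of Section 5 maps edge weights to node degrees:
# d = S w = W 1. Here we realize it explicitly as a dense matrix.
S = np.zeros((m, m * (m - 1) // 2))
for e, (i, j) in enumerate((i, j) for i in range(m) for j in range(i + 1, m)):
    S[i, e] = S[j, e] = 1.0

d = S @ w
assert np.allclose(d, W.sum(axis=1))
```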

### 2.1 Smooth manifold means graph sparsity

Let us define the matrix Z ∈ R^{m×m} of pairwise squared distances between the rows of X:

 Z_ij = ∥x_i − x_j∥².

Using this, we can rewrite the trace term as

 tr(X⊤LX) = ½ tr(WZ) = ½ ∥W∘Z∥_{1,1}, (3)

where ∥·∥_{1,1} is the elementwise norm-1 and ∘ is the Hadamard product (see Appendix). In words, the smoothness term is a weighted ℓ-1 norm of W, encoding weighted sparsity, that penalizes edges connecting distant rows of X. The interpretation is that when the given distances come from a smooth manifold, the corresponding graph has a sparse set of edges, preferring only the ones associated with small distances in Z.
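Equation (3) can be checked numerically on hypothetical random data:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(2)
m, n = 6, 4
X = rng.standard_normal((m, n))

Z = cdist(X, X, metric="sqeuclidean")      # Z_ij = ||x_i - x_j||^2
W = rng.random((m, m))
W = (W + W.T) / 2
np.fill_diagonal(W, 0)

L = np.diag(W.sum(axis=1)) - W
lhs = np.trace(X.T @ L @ X)                # tr(X^T L X)
rhs = 0.5 * np.abs(W * Z).sum()            # (1/2) ||W o Z||_{1,1}

assert np.isclose(lhs, rhs)
```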

Explicitly adding a sparsity term to the objective function is a common tactic for inverse covariance estimation. However, it brings little to our problem, as here it can be translated as merely adding a constant to the squared distances in Z:

 ∥W∘Z∥_{1,1} + α∥W∥_{1,1} = ∥W∘(Z+α)∥_{1,1}. (4)

Note that all information of X conveyed by the trace term is contained in the pairwise distance matrix Z, so the original X could be omitted. Moreover, using the last term of eq. (3) instead of the trace enables us to use other kinds of distances than the Euclidean.

Note finally that the separate rows of X do not have to be smooth signals in any sense themselves. Two non-smooth signals can have a small distance between them, and therefore a small entry in Z.

## 3 What Is a Smooth Signal?

Given a graph, different definitions of what constitutes a smooth signal have been used in different contexts. In this section we unify these definitions using the notion of filtering on graphs. For more information about signal processing on graphs we refer to the work of shuman2013emerging. Filtering of a graph signal x by a filter h is defined as the operation (we denote by h both the scalar function and its matrix counterpart acting on the matrix's eigenvalues)

 y = h(L)x = ∑_i u_i h(λ_i) u_i⊤ x = ∑_i u_i h(λ_i) x̂_i, (5)

where (λ_i, u_i) are the eigenvalue-eigenvector pairs of L, and x̂ = U⊤x is the graph Fourier representation of x containing its graph frequencies x̂_i. Low frequencies correspond to small eigenvalues, and low-pass or smooth filters correspond to decaying functions h.

In the sequel we show how different models for smooth signals in the literature can be written as smoothing problems of an initial non-smooth signal. We give an example of three different filters applied on the same signal in Figure 4 (Appendix).

#### Smooth signals by Tikhonov regularization.

Solving problem (1) leads to smooth signals. By setting g(X) = ∥X − X0∥²_F and weighting the trace term by α, we have a Tikhonov regularization problem that, given an arbitrary X0 as input, returns its graph-smooth version X = (I + αL)^{-1} X0. Equivalently, we can see this as filtering X0 by

 h(λ) = 1/(1 + αλ), (6)

where bigger values of α result in smoother signals.
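A minimal sketch of this equivalence, assuming a random graph and signal (not the paper's data): solving the Tikhonov problem directly and filtering by h(λ) = 1/(1+αλ) give the same result.

```python
import numpy as np

rng = np.random.default_rng(3)
m, alpha = 8, 2.0
W = rng.random((m, m))
W = (W + W.T) / 2
np.fill_diagonal(W, 0)
L = np.diag(W.sum(axis=1)) - W
x0 = rng.standard_normal(m)

# Direct solution of the Tikhonov problem: x = (I + alpha L)^{-1} x0.
x_direct = np.linalg.solve(np.eye(m) + alpha * L, x0)

# Same result seen as graph filtering with h(lambda) = 1 / (1 + alpha lambda).
lam, U = np.linalg.eigh(L)
x_filter = U @ (1.0 / (1.0 + alpha * lam) * (U.T @ x0))

assert np.allclose(x_direct, x_filter)
```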

#### Smooth signals from a probabilistic generative model.

dong2015laplacian proposed that smooth signals can be generated from a colored Gaussian distribution as x = x̄ + h(L)x0, where x0 ∼ N(0, I) and L† denotes the pseudoinverse of L. Therefore x follows the distribution

 x ∼ N(x̄, L†).

To sample from the above, it suffices to draw an initial non-smooth signal x0 ∼ N(0, I) and then compute

 x = x̄ + h(L)x0, (7)

with h(L) = (L†)^{1/2}, or equivalently filter x0 by

 h(λ) = { λ^{-1/2}, λ > 0;  0, λ = 0 } (8)

and add the mean x̄. We point out here that using eq. (7) on any x0 ∼ N(0, I) and for any filter h would yield samples from

 x ∼ N(x̄, h(L)²),

therefore the probabilistic generative model can be used for any filter h. However, it does not cover cases where the initial x0 is not white Gaussian.
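The claim that the covariance of the filtered signal is h(L)² = L† can be checked numerically; the sketch below assumes a random connected graph and an arbitrary zero mean.

```python
import numpy as np

rng = np.random.default_rng(4)
m = 7
W = rng.random((m, m))
W = (W + W.T) / 2
np.fill_diagonal(W, 0)
L = np.diag(W.sum(axis=1)) - W

# Filter h(lambda) = lambda^{-1/2} for lambda > 0, and 0 at lambda = 0.
lam, U = np.linalg.eigh(L)
h = np.where(lam > 1e-10, 1.0 / np.sqrt(np.clip(lam, 1e-10, None)), 0.0)
hL = U @ np.diag(h) @ U.T

# The covariance of x = xbar + h(L) x0 with x0 ~ N(0, I) is h(L)^2 = L^dagger.
assert np.allclose(hL @ hL, np.linalg.pinv(L, rcond=1e-8), atol=1e-6)

# Drawing one smooth sample (xbar = 0 here, an arbitrary choice).
x = hL @ rng.standard_normal(m)
```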

#### Smooth signals by heat diffusion on graphs.

Another type of smooth signals in the literature results from the process of heat diffusion on graphs; see for example the work by zhang2008graph for an application to image denoising by heat diffusion smoothing on the pixel graph. Given an initial signal x0, the result of the heat diffusion on a graph after time t is x = exp(−tL)x0, therefore the corresponding filter is

 h(λ)=exp(−tλ), (9)

where bigger values of t result in smoother signals.
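A small sketch, with an assumed random graph and diffusion time, confirming that matrix-exponential diffusion coincides with filtering by h(λ) = exp(−tλ):

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(5)
m, t = 6, 1.5
W = rng.random((m, m))
W = (W + W.T) / 2
np.fill_diagonal(W, 0)
L = np.diag(W.sum(axis=1)) - W
x0 = rng.standard_normal(m)

# Heat diffusion for time t: x = expm(-t L) x0 ...
x_diffused = expm(-t * L) @ x0

# ... equals graph filtering with h(lambda) = exp(-t lambda).
lam, U = np.linalg.eigh(L)
x_filtered = U @ (np.exp(-t * lam) * (U.T @ x0))

assert np.allclose(x_diffused, x_filtered)
```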

## 4 Learning a Graph From Smooth Signals

In order to learn a graph from smooth signals, we propose, as explained in Section 2, to rewrite problem (2) using the weighted adjacency matrix W and the pairwise distance matrix Z instead of the Laplacian L:

 minimizeW∈Wm   ∥W∘Z∥1,1 + f(W). (10)

Since W is non-negative, we could replace the first term by tr(WZ), but we prefer this notation to keep in mind that our problem already has a sparsity term on W. This means that f(W) has to play two important roles: (1) prevent W from collapsing to the trivial solution W = 0, and (2) impose further structure using prior information on W. This said, depending on f the solution is expected to be sparse, which is important for large-scale applications.

In order to motivate this general graph learning framework, we show that the most standard weight construction, as well as the state of the art graph learning model are special cases thereof.

### 4.1 Classic Laplacian computations

In the literature, one of the most common practices is to construct edge weights from the pairwise distances between the rows of X using the Gaussian function

 wij=exp(−∥xi−xj∥222σ2). (11)

It turns out that this choice of weights can be seen as the result of solving problem (10) with a specific prior f(W) on the weights:

###### Proposition 1.

The solution of the problem

 minimizeW∈Wm  ∥W∘Z∥1,1 + 2σ2∑ijWij(log(Wij)−1)

is given by eq. (11).

###### Proof.

The problem is edge-separable and the objective can be written as ∑_{ij} (W_ij Z_ij + 2σ² W_ij(log(W_ij) − 1)). Deriving w.r.t. W_ij we obtain the optimality condition Z_ij + 2σ² log(W_ij) = 0, or W_ij = exp(−Z_ij / (2σ²)), which proves the proposition. ∎

Note that here the logarithm in the prior prevents the weights from going to zero, leading to full matrices, so sparsification has to be imposed explicitly afterwards.
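The closed form of Proposition 1 can be validated numerically: perturbing it should never decrease the edge-separable, convex objective. The σ and data below are arbitrary choices for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(6)
m, n, sigma = 5, 3, 1.0
X = rng.standard_normal((m, n))
Z = cdist(X, X, metric="sqeuclidean")

# Closed-form minimizer from Proposition 1 (eq. 11).
W_star = np.exp(-Z / (2 * sigma**2))
np.fill_diagonal(W_star, 0)

def objective(W):
    # Edge-separable objective of Proposition 1; the zero diagonal is
    # masked out so we never evaluate log(0).
    mask = ~np.eye(len(W), dtype=bool)
    Wm = W[mask]
    return (Wm * Z[mask]).sum() + 2 * sigma**2 * (Wm * (np.log(Wm) - 1)).sum()

# Any positive perturbation of the closed form should not do better.
best = objective(W_star)
for _ in range(20):
    P = W_star * np.exp(0.1 * rng.standard_normal((m, m)))
    P = (P + P.T) / 2
    assert objective(P) >= best - 1e-9
```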

### 4.2 Our proposed model

Based on our framework (10), our goal is to give a general-purpose model for learning graphs when no prior information is available. In order to obtain meaningful graphs, we want to make sure that each node has at least one edge to another node. It is also desirable to control how sparse the resulting graph is. To meet these expectations, we propose the following model, with parameters α and β controlling the shape of the edge weights:

 minimizeW∈Wm ∥W∘Z∥1,1 − α1⊤log(W1) + β∥W∥2F. (12)

The logarithmic barrier acts on the node degree vector d = W1, unlike the model of Proposition 1 that has a similar barrier on the edges. This means that it forces the degrees to be positive, but does not prevent individual edges from becoming zero. This improves the overall connectivity of the graph without compromising sparsity. Note, however, that the logarithmic term alone (β = 0) leads to very sparse graphs, and changing α only changes the scale of the solution, not its sparsity pattern (Proposition 2 for β = 0). For this reason, we add the third term.

We showed in eq. (4) that adding an ℓ-1 norm to control sparsity is not very useful. On the other hand, by adding a Frobenius norm we penalize the formation of big edge weights without penalizing smaller ones, which leads to denser edge patterns for bigger values of β. An interesting property of our model is that even though it has two terms shaping the weights, once we fix the scale we only need to search for one parameter:

###### Proposition 2.

Let F(Z, α, β) denote the solution of our model (12) for input distances Z and parameters α, β. Then the following property holds for any γ > 0:

 F(Z, α, β) = γ F(Z, α/γ, βγ) = α F(Z, 1, αβ). (13)
###### Proof.

See appendix.∎

This means, for example, that if we want to obtain a solution W with a fixed scale ∥W∥ = s (for any norm), we can solve the problem with α = 1, search only for the parameter β that gives the desired edge density, and then normalize the graph by the norm we have chosen.

The main advantage of our model over the one by dong2015laplacian is that it promotes connectivity by putting a log barrier directly on the node degrees. Even for β = 0 we obtain the sparsest possible solution that still assigns at least one edge to each node. In this case, distant nodes will have smaller degrees (because of the first term), but will still be connected to their closest neighbour, similarly to a 1-NN graph.

### 4.3 Fitting the state of the art in our framework

dong2015laplacian proposed the following model for learning a graph:

 minimize_{L∈L}  tr(X⊤LX) + α∥L∥²_F,  s.t. tr(L) = s.

Parameter s controls the scale (dong2015laplacian fix its value), and parameter α controls the density of the solution. This formulation has two weaknesses. First, using a Frobenius norm on the Laplacian has reduced interpretability: the elements of L are not only of different scales, but also linearly dependent. Secondly, optimizing it is difficult, as it has 4 constraints on L: 3 to constrain L to the space L, and one to keep the trace constant. We propose to solve their model using our framework: using the transformations of Table 1, we obtain the equivalent simplified model

 minimize_{W∈Wm}  ∥W∘Z∥_{1,1} + α∥W1∥² + α∥W∥²_F,  s.t. ∥W∥_{1,1} = s. (14)

Using this parametrization, solving the problem becomes much simpler, as we show in Section 5. Note that for α = 0 we have a linear program that assigns all the available weight to the edge corresponding to the smallest pairwise distance in Z, and zero everywhere else. On the other hand, setting α to big values penalizes big degrees (through the second term), and in the limit we obtain a dense graph with constant degrees across nodes. We can also prove some interesting properties of (14):

###### Proposition 3.

Let H(Z, α, s) denote the solution of model (14) for input distances Z and parameters α and s. Then for any γ > 0 the following properties hold:

 H(Z + γ, α, s) = H(Z, α, s), (15)

 H(Z, α, s) = γ H(Z, γα, s/γ) = s H(Z, sα, 1). (16)
###### Proof.

See appendix. ∎

In other words, model (14) is invariant to adding any constant to the squared distances. The second property means that, similarly to our model, the scale of the solution does not change the shape of the connectivity. If we fix the scale to s = 1, we obtain the whole range of edge shapes given by (14) only by changing parameter α.

## 5 Optimization

An advantage of using the formulation of problem (10) is that it can be solved efficiently for a wide range of choices of f. We use scalable primal-dual techniques, like the ones reviewed by komodakis2014playing, to solve the two state-of-the-art models: the one we propose and the one by dong2015learning. Using these as examples, it is easy to solve many interesting models from the general framework (10).

In order to make optimization easier, we use the vector-form representation from space Wv (see Table 1), so that symmetry does not have to be imposed as a constraint. We write the problem as a sum of three functions in order to fit it to the primal-dual algorithms reviewed by komodakis2014playing. The general form of our objective is

 minimizew∈Wv f1(w)+f2(Kw)+f3(w), (17)

where f1 and f2 are functions for which we can efficiently compute proximal operators, and f3 is differentiable with a gradient that has Lipschitz constant ζ. K is a linear operator, so f2 is defined on the dual variable Kw. In the sequel we explain how this general optimization framework applies to the two models of interest, leaving the details for the Appendix. For a better understanding of primal-dual optimization or proximal splitting methods, we refer the reader to the works of combettes2011proximal, komodakis2014playing.

In our model, the second term acts on the degrees of the nodes, which are a linear function of the edge weights. Therefore we use K = S, where S is the linear operator that satisfies Sw = W1 if w is the vector form of W. In the first term we group the positivity constraint on w and the weighted ℓ-1 norm, and the second and third terms are the priors for the degrees and the edges respectively. In order to solve our model we define

 f1(w) = 1{w≥0} + 2w⊤z,  f2(d) = −α1⊤log(d),  f3(w) = β∥w∥²,  with ζ = 2β,

where 1{·} is the indicator function that is zero when the condition in the brackets is satisfied and infinite otherwise. Note that the second function is defined on the dual variable d = Sw, which here is very conveniently the vector of node degrees.
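The two proximal operators can be written down directly; the closed form for the log barrier follows from its elementwise optimality condition, and the factor 2 in the prox of f1 reflects the 2w⊤z term (a small assumption-checked sketch, not the authors' code):

```python
import numpy as np

# prox of f1(w) = 1{w >= 0} + 2 w^T z: shift by the gradient of the
# linear part, then project onto the non-negative orthant.
def prox_f1(y, z, lam):
    return np.maximum(0.0, y - 2 * lam * z)

# prox of f2(d) = -alpha 1^T log(d), applied elementwise on degrees.
def prox_f2(y, alpha, lam):
    return (y + np.sqrt(y**2 + 4 * alpha * lam)) / 2

# Check the optimality condition of prox_f2:
# x minimizes -alpha log(x) + (1/(2 lam)) (x - y)^2
# iff -alpha / x + (x - y) / lam = 0.
rng = np.random.default_rng(7)
y = rng.standard_normal(5)
alpha, lam = 0.5, 0.3
x = prox_f2(y, alpha, lam)
assert np.all(x > 0)
assert np.allclose(-alpha / x + (x - y) / lam, 0.0, atol=1e-8)
```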

For model (14) we can define in a similar way

 f1(w) = 1{w≥0} + 2w⊤z,  f2(c) = 1{c=s},  f3(w) = α(2∥w∥² + ∥Sw∥²),  with ζ = 2α(m+1),

and use K = 2·1⊤, so that the dual variable is c = 2·1⊤w = ∥W∥_{1,1}, constrained by f2 to be equal to s.

Using these functions, the final algorithm for our model is given as Algorithm 1, and the one for the model by dong2015laplacian as Algorithm 2 in the Appendix. Vector z is the vector form of Z, and parameter γ is the step size.
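For illustration, here is a compact numpy sketch of a forward-backward-forward primal-dual iteration for model (12). This is not the authors' Matlab implementation; the step size, iteration count and initializations are our own assumptions.

```python
import numpy as np
from itertools import combinations

def learn_graph(Z, alpha=1.0, beta=0.1, iters=3000):
    """Sketch of a primal-dual (forward-backward-forward) solver for
    model (12), following the structure described in Section 5. The
    update order mirrors the schemes reviewed by komodakis2014playing;
    details such as the step size choice are assumptions."""
    m = Z.shape[0]
    edges = list(combinations(range(m), 2))
    E = len(edges)
    z = np.array([Z[i, j] for i, j in edges])   # vector form of Z

    # Degree operator S: d = S w = W 1 (each edge feeds its two endpoints).
    S = np.zeros((m, E))
    for e, (i, j) in enumerate(edges):
        S[i, e] = S[j, e] = 1.0

    gamma = 0.9 / (2 * beta + np.sqrt(2 * (m - 1)))  # step size (assumed)
    w = np.ones(E)          # primal: edge weights
    v = np.zeros(m)         # dual: acts on the degrees d = S w

    for _ in range(iters):
        y1 = w - gamma * (2 * beta * w + S.T @ v)           # forward step
        y2 = v + gamma * (S @ w)
        p1 = np.maximum(0.0, y1 - 2 * gamma * z)            # prox of f1
        p2 = (y2 - np.sqrt(y2**2 + 4 * alpha * gamma)) / 2  # prox of f2*
        q1 = p1 - gamma * (2 * beta * p1 + S.T @ p2)        # second forward
        q2 = p2 + gamma * (S @ p1)
        w, v = w - y1 + q1, v - y2 + q2
    return w, S

# Hypothetical usage on random squared distances.
rng = np.random.default_rng(8)
X = rng.standard_normal((8, 3))
Z = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
w, S = learn_graph(Z)
```

The log barrier guarantees that every node of the learned graph keeps a strictly positive degree, which can be verified on the output.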

### 5.1 Complexity and Convergence

Both algorithms that we propose have a complexity of O(m²) per iteration for graphs of m nodes (linear in the number of edge variables), and they can easily be parallelized. As the objective functions of both models are proper, convex, and lower-semicontinuous, our algorithms are guaranteed to converge to the minimum ([komodakis2014playing]).

## 6 Experiments

We compare our model against the state-of-the-art model by dong2015laplacian, solved by our Algorithm 2, on both artificial and real data. Comparing to the model by lake2010discovering was not possible even for the small graphs of our artificial experiments, as there is no scalable algorithm in the literature and the use of CVX with the log-determinant term is prohibitive. Other models based on the log-det term, for which scalable algorithms exist, are irrelevant to our problem, as a sparse inverse covariance is not a valid Laplacian, and they are known not to perform well in our setting (see dong2015laplacian for a comparison).

### 6.1 Artificial data

The difficulty of solving problem (10) depends both on the quality of the graph behind the data and on the type of smoothness of the signals. We test 4 different types of graphs using 3 different types of signals.

##### Graph Types.

We use two 2-D manifold based graphs, one uniformly and one non-uniformly sampled, and two graphs that are not manifold structured:

1. Random Geometric Graph (RGG): We sample node coordinates uniformly at random and connect nodes using eq. (11), then threshold the smallest weights.

2. Non-uniform: We sample coordinates from a non-uniform distribution and connect nodes using eq. (11). We threshold weights smaller than the best connection of the most distant node.

3. Erdős-Rényi: Random graph as proposed by gilbert1959random.

4. Barabási-Albert: Random scale-free graph with preferential attachment, as proposed by barabasi1999emergence.

##### Signal Types.

To create a smooth signal we filter Gaussian i.i.d. noise by eq. (5), using one of the three filter types of Section 3. We normalize the Laplacian so that all filters are defined over the same spectral range. See Table 3 (Appendix) for a summary.

1. Tikhonov: as in eq. (6).

2. Generative Model: as in eq. (8).

3. Heat Diffusion: as in eq. (9).

For all cases we use the same number of nodes and of smooth signals, and add noise (in the ℓ-2 sense) before computing the pairwise distances. We perform a grid search to find the best parameters for each model. We repeat the experiment 20 times for each case and report the average result of the parameter value that performs best for each of the different metrics.

##### Metrics.

Since we have the ground-truth graphs for each case, we can directly measure the relative edge error in the ℓ-1 and ℓ-2 sense. We also report the relative error of the weighted degrees d = W1. This is important because both models are based on priors on the degrees, as we show in Section 4. We also report the F-measure (the harmonic mean of edge precision and recall), which only takes into account the binary pattern of existing edges and not the weights.

##### Baselines.

The baseline for the relative errors is a classic graph construction using eq. (11) with a grid search for the best σ. Note that this exact equation was used to create the first two artificial datasets. However, using a fully connected graph with the F-measure does not make sense; for this metric the baseline is the best edge pattern found by thresholding (11) with different thresholds.

Table 2 summarizes all the results for the different combinations of graphs and signals. In most of them, our model performs better for all metrics. We can see that the signals constructed following the generative model (7) do not yield better results in terms of graph reconstruction. Using smoother “Tikhonov” signals from eq. (6) or “Heat Diffusion” signals from (9) yielded slightly worse results in both cases (not reported here). It also seems that the results are slightly better for the manifold-related graphs than for the Erdős-Rényi and Barabási-Albert models, an effect that is more prevalent when we use fewer smooth signals (c.f. Table 4 of the Appendix). This would be interesting to investigate theoretically.

### 6.2 Real data

We also evaluate the performance of our model on real data. In this case, the actual ground-truth graph is not known. We therefore measure the performance of different models on spectral clustering and label propagation, two algorithms that depend solely on the graph. Note that an explicit Laplacian normalization is not needed for the learned models (we found experimentally that it is even harmful), since this role is already played by the regularization.

Figure 1: Graph learned from 1001 USPS images. Left: clustering quality. Middle: label propagation quality. Right: number of completely disconnected components (continuous lines) and number of disconnected nodes for the model by Dong et al. (blue dashed line). Our model and k-NN have no disconnected nodes.

Figure 2: Label propagation for the problem “1” vs. “2” of MNIST with different class size proportions: 1 to 4 (left), 1 to 1 (middle) or 4 to 1 (right). Misclassification rate for different numbers of edges per node.

#### Learning the graph of USPS digits

We first learn the graph connecting 1001 different images of the USPS dataset, which are images of digits from 0 to 9 (10 classes). We follow zhu2003semi and sample the class sizes non-uniformly, resulting in classes with sizes from 3 to 260 images each. We learn graphs of different densities using both models. As a baseline we use a k-nearest neighbors (k-NN) graph for different k.

For each of the graphs, we run standard spectral clustering (as in the work of [ng2002spectral] but without normalizing the Laplacian) with k-means 100 times. For label propagation, we choose 100 times a different subset of known labels.

In Fig. 1 we plot the behavior of the different models for different density levels. The horizontal axis is the average number of non-zero edges per node. In the left plot we see the clustering quality. Even though the best results of the two models are almost the same (0.24 vs 0.25), our model is more robust with respect to the choice of graph density. A similar behavior is exhibited for label propagation, plotted in the middle: the classification quality is better for our model at the sparser density levels.

The robustness of our model for small graph densities can be explained by the connectivity quality plotted on the right. The continuous lines are the numbers of connected components in the learned graphs, a measure of connectivity: the fewer components there are, the better connected the graph is. The dashed blue line is the number of disconnected nodes for the model of [dong2015laplacian]. The latter fails to assign connections to the most distant nodes unless the density of the graph reaches a fairly high level. If we want a graph with 6 edges per node, our model returns a graph with 3 components and no disconnected nodes, while the model by dong2015laplacian returns a graph with 35 components, out of which 22 are disconnected nodes.

Note that in real applications where the best density level is not known a priori, it is important for a graph learning model to perform well for sparse levels. This is especially the case for large scale applications, where more edges mean more computations.

##### Time:

Algorithm 1, implemented in Matlab (code for both models is available as part of the open-source toolbox GSPBox by perraudin2014gspbox, using code from the UNLocBoX, perraudin2014unlocbox), learned a 10-edges-per-node graph of 1001 USPS images in 5 seconds (218 iterations), and Algorithm 2 did so in 1 minute (2043 iterations), on a standard PC for a tolerance of 10^-4.

#### Learning the graph of MNIST 1 vs 2

To demonstrate the different behaviour of the two models in non-uniform sampling cases, we use the problem of classification between digits 1 and 2 of the MNIST dataset. This problem is particular because digits “1” are close to each other (average squared distance of 45), while digits “2” differ more from each other (average squared distance of 102). In Figure 2 we report the average misclassification rate for different class size proportions, with 40 1's and 160 2's (left), 100 1's and 100 2's (middle) or 160 1's and 40 2's (right). Results are averaged over 40 random draws. The dashed lines denote the number of nodes contained in components without labeled nodes, which cannot be classified. In this case, the model of [dong2015laplacian] fails to recover edges between different digits “2” unless the returned graph is fairly dense, unlike our model, which even for very sparse graph levels treats the different classes more fairly. The effect is stronger when the set of 2's is also the smaller of the two.

## 7 Conclusion

We introduce a new way of addressing the problem of learning a graph under the assumption that tr(X⊤LX) is small. We show how the problem can be simplified into a weighted sparsity problem, which implies a general framework for learning a graph. We show how the standard Gaussian weight construction from distances is a special case of this framework. We propose a new model for learning a graph, and provide an analysis of the state-of-the-art model of dong2015laplacian, which also fits our framework. The new formulation enables us to propose a fast and scalable primal-dual algorithm for our model, but also for the one of [dong2015laplacian], which was missing from the literature. Our experiments suggest that when sparse graphs are to be learned but connectivity is crucial, our model is expected to outperform the current state of the art.

We hope not only that our solution will be used for many applications that require good quality graphs, but also that our framework will trigger defining new graph learning models targeting specific applications.

#### Acknowledgements

The author would like to especially thank Pierre Vandergheynst and Nikolaos Arvanitopoulos for their constructive comments on the organization of the paper and the experimental evaluation. He is also grateful to the authors of dong2015laplacian for sharing their code, to Nathanael Perraudin and Nauman Shahid for discussions when developing the initial idea, and to Andreas Loukas for his comments on the final version.

Figure 3: The filters of Table 3 for α = 10.

Figure 4: Different smooth signals on the Non-uniform graph used for our artificial data experiments. All signals are obtained by smoothing the same initial x0 ∼ N(0, I) with three different filters. This instance of the graph is disconnected, with 2 components.

## Appendix A Derivations and proofs

### a.1 Detailed explanation of eq. (3)

 ∥W∘Z∥_{1,1} = ∑_{i=1}^m ∑_{j=1}^m W_ij ∥x_i − x_j∥²
 = ∑_{i=1}^m ∑_{j=1}^m W_ij (x_i − x_j)⊤(x_i − x_j)
 = 2 ∑_{i=1}^m ∑_{j=1}^m W_ij x_i⊤x_i − 2 ∑_{i=1}^m ∑_{j=1}^m W_ij x_i⊤x_j
 = 2 ∑_{i=1}^m x_i⊤x_i ∑_{j=1}^m W_ij − 2 tr(X⊤WX)
 = 2 tr(X⊤DX) − 2 tr(X⊤WX)
 = 2 tr(X⊤LX),

where D is the diagonal matrix with elements D_ii = ∑_j W_ij.

### a.2 Proof of proposition 2

###### Proof.

We change variable W = γW̃ to obtain

 F(Z, α, β) = γ argmin_{W̃} ∥γW̃∘Z∥_{1,1} − α1⊤log(γW̃1) + β∥γW̃∥²_F
 = γ argmin_{W̃} γ∥W̃∘Z∥_{1,1} − α1⊤log(W̃1) + βγ²∥W̃∥²_F
 = γ argmin_{W̃} ∥W̃∘Z∥_{1,1} − (α/γ)1⊤log(W̃1) + βγ∥W̃∥²_F
 = γ F(Z, α/γ, βγ),

where we used the fact that log(γW̃1) = log(γ)1 + log(W̃1) and constants do not change the minimizer. The second equality of (13) is obtained from the first one for γ = α. ∎

### a.3 Proof of proposition 3

###### Proof.

For equation (15),

 H(Z+γ, α, s) = argmin_{W∈Wm} ∥W∘(Z+γ)∥_{1,1} + α∥W∥²_F + α∥W1∥²  s.t. ∥W∥_{1,1} = s
 = argmin_{W∈Wm} ∥W∘Z∥_{1,1} + γ∥W∥_{1,1} + α∥W∥²_F + α∥W1∥²  s.t. ∥W∥_{1,1} = s
 = argmin_{W∈Wm} ∥W∘Z∥_{1,1} + γs + α∥W∥²_F + α∥W1∥²  s.t. ∥W∥_{1,1} = s
 = H(Z, α, s),

because for positive W, ∥W∘(Z+γ)∥_{1,1} = ∥W∘Z∥_{1,1} + γ∥W∥_{1,1}.

For equation (16) we change variable W = γW̃ in the optimization to obtain

 H(Z, α, s) = γ argmin_{W̃∈Wm} ∥γW̃∘Z∥_{1,1} + α∥γW̃∥²_F + α∥γW̃1∥²  s.t. ∥γW̃∥_{1,1} = s
 = γ argmin_{W̃∈Wm} γ∥W̃∘Z∥_{1,1} + γ²α∥W̃∥²_F + γ²α∥W̃1∥²  s.t. ∥W̃∥_{1,1} = s/γ
 = γ argmin_{W̃∈Wm} ∥W̃∘Z∥_{1,1} + γα∥W̃∥²_F + γα∥W̃1∥²  s.t. ∥W̃∥_{1,1} = s/γ
 = γ H(Z, γα, s/γ).

The second equality of (16) follows for γ = s. ∎

## Appendix B Optimization details and algorithm for model of [dong2015laplacian]

To obtain Algorithm 1 (for our model), we need the following:

 K = S  (∥S∥₂ = √(2(m−1))),
 prox_{λf1}(y) = max(0, y − 2λz),
 prox_{λf2}(y)_i = (y_i + √(y_i² + 4αλ))/2,
 ∇f3(w) = 2βw,
 ζ = 2β (Lipschitz constant of the gradient of f3),

where m is the number of nodes of the graph.

To obtain Algorithm 2 (for model by [dong2015laplacian]), we need the following:

 K = 2·1⊤  (∥2·1⊤∥₂ = 2√(m(m−1)/2)),
 prox_{λf1}(y) = max(0, y − 2λz),
 prox_{λf2}(y) = s,
 ∇f3(w) = α(4w + 2S⊤Sw),
 ζ = 2α(m+1) (Lipschitz constant of the gradient of f3).

## Appendix C More real data experiments

Figure 5: Distribution of class sizes for one of the random instances of the COIL 20 experiments.

### c.1 Learning the graph of COIL 20 images

We randomly sample the classes so that the average class size increases non-linearly across classes. The distribution for one of the instances of this experiment is plotted in fig. 5. We sample from the same distribution 20 times and measure the average performance of the models for different graph densities. For each of the graphs, we run standard spectral clustering (as in the work of [ng2002spectral] but without normalizing the Laplacian) with k-means 100 times. For label propagation we choose 100 times a different subset of known labels. We set a baseline by using the same techniques with a k-nearest neighbors (k-NN) graph for different choices of k.

In Fig. 6 we plot the behavior of different models for different density levels. The horizontal axis is the average number of non-zero edges per node.

The dashed lines of the middle plot denote the number of nodes contained in components without labeled nodes, which cannot be classified.

Figure 6: Graph learned from non-uniformly sampled images from COIL 20. Average over 20 different samples from the same non-uniform distribution of images. Left: clustering quality. Middle: label propagation quality. Dashed lines are the number of nodes in components without labeled nodes. Right: number of disconnected components and number of disconnected nodes (our model and k-NN have no disconnected nodes).