1. Introduction
In many fields of science, one often encounters observations represented as highdimensional vectors sampled from unknown distributions. It is sometimes difficult to directly analyze data in the original space, and is desirable to perform data dimensionality reduction or associate data with some structured objects for further exploratory analysis. One example is the study of human cancer, which is a dynamic disease that develops over an extended time period through the accumulation of a series of genetic alterations
[41]. The delineation of this dynamic process would provide critical insights into molecular mechanisms underlying the disease process, and inform the development of diagnostics, prognostics and targeted therapeutics. The recently developed highthroughput genomics technology has made it possible to measure the expression levels of all genes in tissues from thousands of tumor samples simultaneously. However, the delineation of the cancer progression path embedded in a highdimensional genomics space remains a challenging problem [38].Principal component analysis (PCA) [20] is one of the most commonly used methods to visualize data in a lowdimensional space, but its linear assumption limits its general applications. Several nonlinear approaches based on the kernel trick have been proposed [35], but they remain suboptimal for detecting complex structures. Alternatively, if data dimensionality is high, manifold learning based on the local information of data can be effective. Examples include locally linear embedding (LLE) [34] and Laplacian eigenmaps [2]. However, these methods generally require to construct a carefully tuned neighborhood graph as their performance heavily depends on the quality of constructed graphs.
Another approach is principal curve, which was initially proposed as a nonlinear generalization of the first principal component line [18]. Informally, a principal curve is an infinitely differentiable curve with a finite length that passes through the middle of data. Several principal curve approaches have been proposed, including those that minimize certain types of risk functions such as the quantization error [18, 22, 36, 33, 15] and the negative loglikelihood function [40, 3]. To overcome the overfitting issue, regularization is generally required. Kégl et al. [22]
bounded the total length of a principal curve, and proved that the principal curve with a bounded length always exists if the data distribution has a finite second moment. Similar results were obtained by bounding the turns of a principal curve
[33]. Recently, the elastic maps approach [15] was proposed to regularize the elastic energy of a membrane. An alternative definition of a principal curve based on a mixture model was considered in [18], where the model parameters are learned through maximum likelihood estimation and the regularization is achieved using the smoothness of coordinate functions. Generative topographic mapping (GTM)
[3]was proposed to maximize the posterior probability of the data which is generated by a lowdimensional discrete grid mapped into the original space and corrupted by additive Gaussian noise. GTM provides a principled alternative to the selforganizing map (SOM)
[23] for which it is impossible to define an optimality criterion [12].Methods for learning a principal curve have been widely studied, but they are generally limited to learn a structure that does not intersect itself [18]. Only a few methods can handle complex principal objects. Kégl and Krzyzak [21] extended their polygonal line method [22] for skeletonization of handwritten digits. The principal manifold approach [14] extends the elastic maps approach [15] to learn a graph structure generated by graph grammar. A major drawback of the two methods is that they require either a set of predefined rules (specifically designed for handwritten digits [21]) or grammars with many parameters difficult to be tuned, which makes their implementations complicated and their adaptations to new datasets difficult. More importantly, their convergences are not guaranteed. Recently, a subspace constrained mean shift (SCMS) method [30] was proposed that can obtain principal points for any given secondorder differentiable density function, but it is still not trivial to transform a set of principal points to an explicit structure. In many real applications, an explicit structure uncovered from the given data is helpful for downstream analysis. This critical point will be illustrated in Section 2 and Section 6 in detail with various real world datasets.
Another line of research relevant to this work is structure learning, which has made a great success on constructing or learning explicit structures from data. The graph structures that are commonly used in graphbased clustering and semisupervised learning are
nearest neighbor graph and neighborhood graph [2]. Dramatic influences of two graphs on clustering techniques have been studied in [27]. Since the neighborhood graph could result in disconnected components or subgraphs in the dataset or even isolated singleton vertices, the bmatching method is applied to learn a better nearest neighbor graph via loopy belief propagation [19]. However, it is improper to use a fixed neighborhood size since the curvature of manifold and the density of data points may be different in different regions of the manifold [11]. In order to alleviate these issues, a method for simultaneous clustering and embedding of data lying in multiple manifolds [11] was proposed using norm over the errors that measure the linear representation of every data point by using its neighborhood information. Similarly, graph was learned for image analysis using norm over the errors for enhancing the robustness of the learned graph [7]. Instead of learning directed graphs by using the above two methods, an integrated model for an undirected graph by imposing a sparsity penalty on a symmetric similarity matrix and a positive semidefinite constraint on Laplacian matrix was proposed [26]. Although these methods have demonstrated the effectiveness on various problems, it is challenging to apply them for moderatesize data, let alone largescale data, due to the high computational complexity. Moreover, it might be suboptimal by heuristically transforming a directed graph to an undirected graph for clustering and dimensionality reduction
[7, 11].
valign=b

This paper is an extension of our preliminary work [29], where we demonstrated the effectiveness of learning a minimumcost spanning tree as a showcase for datasets of tree structures in certain applications. We move forward to take into account more complicated structures that might exist in most of real world datasets. Moreover, we propose two strategies to specifically overcome the issue of highcomplexity of the proposed method for largescale data. The main contributions of this paper are summarized as follows:

We propose a novel regularized principal graph learning framework that addresses the aforementioned limitations by learning a set of principal points and an explicit graph structure from data, simultaneously. To the best of our knowledge, this is the first work to represent principal objects using an explicit graph that is learned in a systematic way with a guaranteed convergence and is practical for largescale data.

Our novel formulation called reversed graph embedding for the representation of a principal graph facilitates the learning of graph structures, generalizes several existing methods, and possesses many advantageous properties.

In addition to spanning trees, a weighted undirected graph is proposed for modeling various types of graph structures, including curves, bifurcations, selfintersections, loops, and multiple disconnected components. This facilitates the proposed learning framework for datasets with complicated graphs.

We further propose two strategies to deal with largescale data. One is to use sideinformation as a priori to reduce the complexity of graph learning, the other is to learn a representative graph over a set of landmarks. Both strategies can be simultaneously incorporated into our framework for efficiently handling largescale data.

Extensive experiments are conducted for unveiling the underlying graph structures on a variety of datasets including various synthetic datasets and six real world applications consisting of various structures: hierarchy of facial expression images, progression path of breast cancer of microarray data, rotation circle of teapot images, smoothing skeleton of optical characters, and similarity patterns of digits on two largescale handwritten digits databases.
The rest of the paper is organized as follows. We first illustrate the learning problem by using various real world datasets in Section 2. In Section 3, we present three key building blocks for the representation of principal graphs. Based on these key building blocks, we propose a new regularized principal graph learning framework in Section 4. In Section 5, we incorporate two strategies into the proposed framework to deal with largescale datasets. Extensive experiments are conducted in Section 6. Finally, we conclude this work in Section 7.
2. Motivation of Structure Learning
In many fields of science, experimental data resides in a highdimensional space. However, the distance between two data points may not directly reflect the distance measured on the intrinsic structure of data. Hence, it is desirable to uncover the intrinsic structure of data before conducting further analysis.
It has been demonstrated that many highdimensional datasets generated from realworld problems contain special structures embedded in their intrinsic dimensional spaces. One example is a collection of teapot images viewed from different angles [43]. Each image contains RGB pixels, so the pixel space has a dimensionality of
, but the intrinsic structure has only one degree of freedom: the angle of rotation. As shown in Fig.
1(a), distances computed by following an intrinsic circle are more meaningful than distances computed in the original space. Another example is given in [37] where it is demonstrated that a collection of facial expression images ( RGB pixels) contains a twolayer hierarchical structure (Fig. 3(b) in [37]). The images from three facial expressions of one subject are grouped together to form the first layer, while all images from three subjects form the second layer. In other words, images of one subject should be distant from images of all other subjects, but images from three expressions of the same subject should be close. More complicated structures including loops, bifurcations, and multiple disconnected components as shown in Fig. 1(b) can be observed in optical character templates for identifying a smoothing skeleton of each character [21, 30]. Other examples with specific intrinsic structures are also discussed in [39, 32].We are particularly interested in studying human cancer, a dynamic disease that develops over an extended time period. Once initiated from a normal cell, the advance to malignancy can to some extent be considered a Darwinian process  a multistep evolutionary process  that responds to selective pressure [17]. The disease progresses through a series of clonal expansions that result in tumor persistence and growth, and ultimately the ability to invade surrounding tissues and metastasize to distant organs. As shown in Fig. 1(c), the evolution trajectories inherent to cancer progression are complex and branching [17]. Due to the obvious necessity for timely treatment, it is not typically feasible to collect time series data to study human cancer progression [28]. However, as massive molecular profile data from excised tumor tissues (static samples) accumulates, it becomes possible to design integrative computation analyses that can approximate disease progression and provide insights into the molecular mechanisms of cancer. We have previously shown that it is indeed possible to derive evolutionary trajectories from static molecular data, and that breast cancer progression can be represented by a highdimensional manifold with multiple branches [38].
These concrete examples convince us that many datasets in a highdimensional space can be represented by a certain intrinsic structure in a low dimensional space. Existing methods exploit the intrinsic structure of data implicitly either by learning a parametric mapping function or approximating a manifold via local geometric information of the observed data. However, these methods do not aim to directly learn an explicit form of an intrinsic structure in a lowdimensional space. In contrast, our proposed method is designed for this purpose to express the intrinsic structure of data by an explicit graph representation, which will be discussed in the following sections.
3. Principal Graph Representation
Let be the input space and be a set of observed data points. We consider learning a projection function such that latent points in dimensional space can trustfully represent , where is a set of functions defined on a graph , and function maps to . Let be a weighted undirected graph, where is a set of vertices, is a set of edges, and is an edge weight matrix with the th element denoted by as the weight of edge . Suppose that every vertex has onetoone correspondence with latent point , which resides on a manifold with an intrinsic dimension . Next, we introduce three key components that form the building blocks of the proposed framework.
3.1. Reversed Graph Embedding
The most important component of the proposed framework is to model the relationship between graph and latent points . Given , weight measures the similarity (or connection indicator) between two latent points and in an intrinsic space . Since the corresponding latent points are generally unknown in advance, we coin a new graphbased representation, namely Reversed Graph Embedding, in order to bridge the connection between graph structure and corresponding data points in the input space. Specifically, our intuition is that if any two latent variables and are neighbors on with high similarity , two corresponding points and in the input space should be close to one another. To capture this intuition, we propose to minimize the following objective function as an explicit representation of principal graphs, given by
(1) 
where and are variables to be optimized for a given . At this moment, we cannot obtain interesting solutions by directly solving problem (1) with respect to due to multiple trivial solutions leading to zero objective value. However, objective function (1) possesses many interesting properties and can be used as a graphbased regularization for unveiling the intrinsic structure of the given data, which is detailed below.
Relationship to Laplacian Eigenmap
The objective function (1) can be interpreted from a reverse perspective of the wellknown Laplacian eigenmap [2]. Denote . The Laplacian eigenmap solves the following optimization problem,
where the similarity between and is computed in the input space in order to capture the locality information of the manifold modeled by neighbors of the input data points, e.g., using the heat kernel with bandwidth parameter if is the neighbor of and is also the neighbor of , and otherwise, and is a diagonal matrix. The Euclidean distance between and , i.e., , are computed in the latent space . Consequently, two multiplication terms of the objective functions of reverse graph embedding and Laplacian eigenmap are calculated in a different space: (i) weights and are computed in and , respectively; (ii) distances and are computed in and , respectively.
Our formulation can directly model the intrinsic structure of data, while Laplacian eigenmap approximates the intrinsic structure by using neighborhood information of observed data points. In other words, our proposed reversed graph embedding representation facilitates the learning of a graph structure from data. The weight encodes the similarity or connectivity between vertices and on . For example, if , it means that edge is absent from . In most cases, a dataset is given, but graph is unknown. In these cases, it is necessary to automatically learn from data. The objective function (1) is linear with respect to weight matrix . This linearity property facilitates the learning of the graph structure. We will discuss different representations of graphs based on linearity property in Section 3.3.
Another important difference is that the number of the latent variables is not necessary equal to the number of input data points , i.e., . This is useful to obtain a representative graph over a set of landmark points by compressing a large amount of input data points, and the representative graph can still trustfully represent the whole data. We will explore this property for largescale principal graph learning in Section 5.2.
Harmonic Function of a General Graph
We have discussed the properties of the reversed graph embedding by treating variables and as a single integrated variable . Actually, the optimal function obtained by solving (1) is related to harmonic or pluriharmonic functions. This can be further illustrated by the following observations. Let be the neighbors of point . For any given , problem (1) can be rewritten as
which has an analytic solution by fixing the rest of variables given by
Extension of Principal Curves to General Graphs
Equation (1) generalizes various existing methods. It is worth noting that the quantity can be considered as the length of a principal graph in terms of the square of norm. In the case where is a linear chain structure, is same as the length of a polygonal line defined in [22]. However, general graphs or trees are more flexible than principal curves since the graph structure allows selfintersection. For principal graph learning, elastic map [15] also defines a penalty based on a given graph. However, based on the above discussion, it is difficult to solve problem (3) with respect to both the function and the graph weights within the elasticmaps framework. In contrast, the proposed reversed graph embedding leads to a simple and efficient algorithm to learn a principal tree or a weighted undirected graph with guaranteed convergence. This will be clarified in Section 4.
3.2. Data Grouping
The second important component of the proposed framework is to measure the fitness of latent variables to a given data in terms of a given graph and projection function . As the number of latent variables is not necessarily equal to the number of input data points, we assume that projected point is a centroid of the th cluster of so that input data points with high similarity form a cluster. The empirical quantization error [36] is widely used as the fitting criterion to be minimized for the optimal cluster centroids, and it is also frequently employed in principal curve learning methods [18, 22, 36, 33, 15], given by
Data Reconstruction Error: If , we can reformulate the equation (4) by reordering as
(5) 
This formulation can be interpreted as the negative loglikelihood of data
that are i.i.d drawn from a multivariate normal distribution with mean
and covariance matrix .Kmeans: If , we introduce an indicator matrix with the th element if is assigned to the th cluster with centroid , and otherwise. Consequently, we have the following equivalent optimization problem
(6) 
where and . This is the same as the optimization problem of means method that minimizes the objective function (4).
Generalized Empirical Quantization Error: If
, a right stochastic matrix
with each row summing to one is introduced, that is, . This variant is equivalent to the above representation of indicator matrix if an integer solution is obtained. When is relatively large, means that minimizes (6) might generate many empty clusters. To avoid this issue, we introduce the soft assignment strategy by adding negative entropy regularization as(7) 
where is a regularization parameter.
In this paper, we use since it is a generalized version of the other two cases and also has a close relationship with the mean shift clustering method [8]
, Gaussian mixture model, and the Kmeans method. Below, we discuss these interesting properties in detail. The proofs of Proposition
1 and Proposition 2 are given in Appendix.Proposition 1.
Proof.
Mean shift clustering [8]
is a nonparametric model for locating modes of a density function. Let
be modes ofthat we aim to find. The kernel density estimator is used to estimate the density function from
, for any , given bywhere Gaussian kernel with covariance matrix is used and . Since data points are independent and identically distributed, we have a joint density function over given by
An optimization problem is naturally formed to find the modes by maximizing the joint density function with respect to as
(8) 
On the other hand, we introduce the left stochastic matrix , where
Let and . For each , minimizing equation of generalized empirical quantization error can be reformulated as the following optimization problem
s.t. 
By Lagrange duality theorem [4], the KKT conditions for optimal solution are given by
By combining above two equalities, we have the optimal solution as
(9) 
and the objective function to be minimized is given by
By substituting (9) into (7), it becomes Problem (8). The proof is completed. ∎
Proposition 2.
Minimizing objective function (7) with respect to a right stochastic matrix with each row summing to one can be interpreted as the Gaussian mixture model with uniform weights.
Proof.
Let be a right stochastic matrix where , and where , and . Following the proof of Proposition 1, we have the optimal solution of minimizing (7) with respect to as
(10) 
where the optimal solution is achieved by solving the following optimization problem
(11) 
For a fixed , the above objective function is equivalent to the maximum likelihood function of
which are i.i.d. drawn from the mixture of Gaussian distributions consisting of
components each which is a Gaussian distributionand discrete uniform distribution for each component, i.e.
. ∎Corollary 1.
According to the aforementioned results, we can briefly summarize the merits of function (7). First, empty clusters will never be created according to Proposition 2 for any . Second, the loss function takes the density of input data into account. In other words, the centroids obtained by minimizing the loss function should reside in the high density region of the data. Third, the loss function makes the learning of graph structures computationally feasible for largescale data in the case of , which will be discussed in Section 5.
3.3. Latent Sparse Graph Learning
The third important component of the proposed framework is to learn a latent graph from data. To achieve this goal, we investigate two special graph structures that can be learned from data by formulating the learning problems as linear programming. One is a tree structure, represented as a minimumcost spanning tree, the other is a weighted undirected graph that is assumed to be sparse. Let
be a dataset. Our task is to construct a weighted undirected graph with a cost associated with edge by optimizing similarity matrix based on the assumption of the specific graph structure.Minimum Cost Spanning Tree
Let be a tree with the minimum total cost and be the edges forming a tree. In order to represent and learn a tree structure, we consider as binary indicator matrix where if , and otherwise. The integer linear programming formulation of a minimum spanning tree (MST) can be written as, , where and . The first constraint of enforces the symmetric connection of undirected graph, e.g. . The second constraint states that the spanning tree only contains edges. The third constraint imposes the acyclicity and connectivity properties of a tree. It is difficult to solve an integer programming problem directly. Thus, we resort to a relaxed problem by letting , that is,
Weighted Undirected Graph
A complete graph with imposed sparse constraints over edge weights called graph is considered for learning a sparse graph. An graph is motivated from the intuition that one data point can usually be represented by a linear combination of other data points if they are similar to this data point. In other words, two data points and are similar if edge weight is a positive value, which can be treated as a similarity measurement between two data points. Consequently, the coefficient of linear combination is coincident with the edge weight on a graph. The system of linear equations , where is the vector to be approximated, is the edge weight matrix, and is the overcomplete dictionary with bases. In practice, there may exist noises on certain elements of , and a natural way is to estimate the edge weights by tolerating these errors. We introduce error to each linear equation and the equality constraints are formulated as . In order to learn a similarity measurement for an undirected graph, we impose nonnegative and symmetric constraints on , i.e., . Generally, a sparse solution is more robust and facilitates the consequent identification of the test sample . Following the recent results on the sparse representation problem, the convex norm minimization can effectively recover the sparse solution [11, 7]. Let . We propose to learn by solving the following linear programming problem
(13)  
s.t.  
where is a parameter for weighting all errors and with . The problem (13) is different from methods [11, 7] that learn directed graphs, and different from the method [26] that learns undirected graphs using the probabilistic model with Gaussian Markov random fields. Moreover, data points can be different from the data points that are used to compute costs.
Generalized Graph Structure Learning
For the ease of representation, we use a unified formulation for learning a similarity graph, where is a given dataset, is the similarity matrix of a graph over the given dataset, and is a cost matrix with the th entry as . The feasible space of is denoted as . Specifically, we solve the following generalized graph structure learning problem such that any graph learning problem that can be formulated as linear programming with constraints represented as can be used in the following proposed framework, given by
4. Regularized Principal Graph
By combining the three building blocks discussed in Section 3, we are ready to propose a unified framework for learning a principal graph. We use the alternate convex search algorithm to solve the proposed formulations by simultaneously learning a set of principal points and an undirected graph with guaranteed convergence.
4.1. Proposed Formulation
Given an input dataset , we aim to uncover the underlying graph structure that generates . Since the input data may be drawn from a noise model, it is improper to learn directly from . To unveil the hidden structure, we assume that (i) the underlying graph can be recovered from the learning model represented in the form of (14), e.g., the graph is an undirected weighted graph with specific structures; (ii) graph satisfies the reversed graph embedding assumption where the cost is defined by , that is, if two vertices and are close on with high similarity , points and should be close to each other; (iii) data points are considered as the denoised points of so that the movement of from the denoised point can be measured by data grouping properly.
For clarity, we illustrate the relations of various variables in Fig. 2. Based on the above assumptions, we propose a novel regularized principal graph learning framework as
(15)  
s.t.  
where is a parameter. It is worth noting that (14) is the key component of (15) for learning a graph given and , and also contains reversed graph embedding objective (1) as a crucially important element for two special graph learning formulations (12) and (13). In addition, we measure the noise of each input data by the Euclidean distance between the input data point to its denoised counterpart.
Instead of manually setting parameters , we assume that the total noise over all input data points is minimal. The optimization problem (15) is then reformulated to penalize the total noise where is a tradeoff parameter between graphbased regularizer and the total noises of the input data. We rewrite the above problem as the following optimization problem
s.t. 
where the second term in the objective function is equivalent to (5). As we have discussed in Section 3.2, the loss function (7) is more flexible than (5). Consequently, we propose the new regularized principal graph learning framework by simultaneously optimizing the graph weight matrix , latent variables , projection function , and softassignment matrix as
(16)  
s.t. 
In the following of this section, we propose a simple method for solving problem (16) and present its convergence and complexity analysis.
4.2. Optimization Algorithm
We focus on learning the underlying graph structure from the data. Instead of learning and separately^{1}^{1}1 To learn , we should define properly, e.g., a linear projection function might be possible as studied in our previous work [28]., we optimize as a joint variable . Let where , and . In the case of weighted undirected graph learning, the optimization problem (16) with respect to variables are reformulated as
(17)  
s.t.  
In the case of learning a MST, the feasible space is , and the objective function is similar to (17) by removing the second term from the objective function. For simplicity, we do not explicitly show the formulation of learning a spanning tree.
Problem (17) is a biconvex optimization problem [16]: for fixed and , optimizing is a convex optimization problem; for fixed , optimizing and is also convex. We propose to solve problem (17) by using the alternate convex search, a minimization method to solve a biconvex problem where the variable set can be divided into disjoint blocks [16]. The blocks of variables defined by convex subproblems are solved cyclically by optimizing the variables of one block while fixing the variables of all other blocks. In this way, each convex subproblem can be solved efficiently by using a convex minimization method. Below, we discuss each subproblem in details and the pseudocode of the proposed method is given in Algorithm 1.
Fix and Solve
Given , Problem (17) with respect to and can be solved independently. Due to the decoupled variables of in rows, we can solve each row independently. According to Proposition 2, we have the analytic solution of given by
(18) 
For Problem (17) with respect to , we have to solve the following optimization problem
(19)  
s.t. 
where with the th element . For Problem (16) to learn a spanning tree structure, we solve the following problem
Fix and Solve
Given and , Problem (17) with respect to can be rewritten as
(21) 
where the graph Laplacian matrix and diagonal matrix . Problem (21) is an unconstrained quadratic programming. We have the analytic solution given by
(22) 
where the inverse of matrix always exists since is a positive semidefinite matrix and diagonal matrix is always positive definite according to (18).
4.3. Convergence and Complexity Analysis
Since problem (17) is nonconvex, there may exist many local optimal solutions. Following the initialization strategy of the mean shift clustering, we initialize to be the original input data as shown in Algorithm 1. The theoretical convergence analysis of Algorithm 1 is presented in the following theorem.
Theorem 1.
Let be the solution of Problem (17) in the th iteration, and be the corresponding objective function value, then we have:
(i) is monotonically decreasing;
(ii) Sequences and converge.
Proof.
Let be a solution obtained in the th iteration. By Algorithm 1, at the th iteration, since each subproblem is solved exactly, we have
So sequence is monotonically decreasing. Furthermore, function is lowerbounded, and then by Monotone Convergence Theorem, there exists , such that converges to .
Next, we prove that the sequence generated by Algorithm 1 also converges. Due to the compactness of feasible sets and , we have the sequence converges to as . Since , converges to , where and . ∎
According to Theorem 1, we define the stopping criterion of Algorithm 1 in terms of the relative increase in the function value compared to the last iteration, and fix it as in all experiments. The empirical convergence results are shown in Section 6.1.
The time complexity of Algorithm 1 for learning a tree structure is determined by three individual parts. The first part is the complexity of running Kruskal’s algorithm to construct a minimum spanning tree. It requires for computing a fully connected graph and for finding a spanning tree. The second part is dominated by computing the soft assignments of samples, which has a complexity of . The third part is dominated by the inverse of a matrix of size that takes operations and matrix multiplication that takes operations. Therefore, the total complexity for each iteration is . For learning an undirected weighted graph by solving Problem (19), the only difference is the complexity of linear programming, which can be solved efficiently by Mosek [1]. In Section 5, we will extend the proposed algorithm for dealing with largescale data.
5. Regularized Principal Graph Learning on LargeScale Data
In order to reduce the high computational complexity of Algorithm 1 for largescale data, we propose to incorporate two strategies into the proposed model for fast learning by using landmarks and side information.
5.1. Graph Learning with Side Information
Instancelevel constraints are useful to express a priori knowledge about which instances should or should not be grouped together, which have been successfully applied to many learning tasks, such as clustering [42], metric learning [45], and kernel learning [47]. In the area of clustering, the most prevalent form of advice are conjunctions of pairwise instance levelconstraints of the form mustlink (ML) and cannotlink (CL) which state that pairs of instances should be in the same or different clusters respectively [42]. Given a set of points to cluster and a set of constraints, the aim of clustering with constraints is to use the constraints to improve the clustering results.
We consider certain structurespecific side information for guidance of learning the similarity matrix and computational reduction instead of optimizing the full matrix. For this purpose, we take into account the CL constraints. Let be a set of data points which might be linked to data point . On the contrary, data points that are not in will belong to CL set. By incorporating these CL side information into the proposed framework, we can come out the following optimization problem for learning graph representation given by
(23)  
s.t.  
which can be equivalently transformed into linear programming optimization problem and can be solved accurately for largescale data by using the offtheshelf toolbox.
5.2. Landmarkbased Regularized Principal Graph
In order to handle largescale data, we extend the proposed regularized principal graph learning framework by using a landmarkbased technique. Landmarks have been widely used in clustering methods to deal with largescale data [13, 6]
. Random landmark selection method is widely used for spectral clustering
[13, 6]. It is well known that selecting landmarks can further improve clustering performance with a certain adaptive strategy, such as means [46], a variety of fixed and adaptive sampling schemes, and a family of ensemblebased sampling algorithms [25].Our regularized principal graph learning framework can inherently take the landmark data points into account through the data grouping. Taking the weighted undirected graph based framework as an example, the optimization problem by considering landmarks is given by
(24)  
s.t.  
where is a set of landmarks of size . Moreover, our proposed methods can automatically adjust centroids of landmarks during the optimization procedure. Due to the nonconvexity of above objective function, we have to properly initialize these landmarks. As shown in the literature [46, 6], means can generally obtain better clustering performance for spectral clustering methods. In this paper, we follow this idea and use centroids obtained from means to initialize . The pseudocode is shown in Algorithm 2. By following the same analysis procedure as Section 4.3, the computational complexity of Algorithm 2 is which is significantly smaller than of Algorithm 1 if . Hence, Algorithm 2 is practical for largescale data with a large .
Comments
There are no comments yet.