A Novel Regularized Principal Graph Learning Framework on Explicit Graph Representation

12/09/2015 ∙ by Qi Mao, et al. ∙ University of Illinois at Chicago

Many scientific datasets are of high dimension, and their analysis usually requires visualization that retains the most important structures of the data. The principal curve is a widely used approach for this purpose. However, many existing methods work only for data with structures that are not self-intersecting, which is quite restrictive for real applications. A few methods can overcome this problem, but they either require complicated human-made rules for a specific task, lacking convergence guarantees and the flexibility to adapt to different tasks, or cannot obtain explicit structures of data. To address these issues, we develop a new regularized principal graph learning framework that captures the local information of the underlying graph structure based on reversed graph embedding. As showcases, models that can learn a spanning tree or a weighted undirected ℓ_1 graph are proposed, and a new learning algorithm is developed that learns a set of principal points and a graph structure from data simultaneously. The new algorithm is simple, with guaranteed convergence. We then extend the proposed framework to deal with large-scale data. Experimental results on various synthetic datasets and six real-world datasets show that the proposed method compares favorably with baselines and can uncover the underlying structure correctly.


1. Introduction

In many fields of science, one often encounters observations represented as high-dimensional vectors sampled from unknown distributions. It is sometimes difficult to directly analyze data in the original space, and it is desirable to perform dimensionality reduction or to associate the data with some structured objects for further exploratory analysis. One example is the study of human cancer, a dynamic disease that develops over an extended time period through the accumulation of a series of genetic alterations [41]. The delineation of this dynamic process would provide critical insights into the molecular mechanisms underlying the disease process, and would inform the development of diagnostics, prognostics, and targeted therapeutics. Recently developed high-throughput genomics technologies have made it possible to measure the expression levels of all genes in tissues from thousands of tumor samples simultaneously. However, the delineation of the cancer progression path embedded in a high-dimensional genomics space remains a challenging problem [38].

Principal component analysis (PCA) [20] is one of the most commonly used methods to visualize data in a low-dimensional space, but its linearity assumption limits its general applicability. Several nonlinear approaches based on the kernel trick have been proposed [35], but they remain sub-optimal for detecting complex structures. Alternatively, if the data dimensionality is high, manifold learning based on the local information of the data can be effective. Examples include locally linear embedding (LLE) [34] and Laplacian eigenmaps [2]. However, these methods generally require constructing a carefully tuned neighborhood graph, as their performance heavily depends on the quality of the constructed graph.

Another approach is the principal curve, which was initially proposed as a nonlinear generalization of the first principal component line [18]. Informally, a principal curve is an infinitely differentiable curve with a finite length that passes through the middle of data. Several principal curve approaches have been proposed, including those that minimize certain types of risk functions such as the quantization error [18, 22, 36, 33, 15] and the negative log-likelihood function [40, 3]. To overcome the over-fitting issue, regularization is generally required. Kégl et al. [22] bounded the total length of a principal curve, and proved that the principal curve with a bounded length always exists if the data distribution has a finite second moment. Similar results were obtained by bounding the turns of a principal curve [33]. Recently, the elastic maps approach [15] was proposed to regularize the elastic energy of a membrane. An alternative definition of a principal curve based on a mixture model was considered in [18], where the model parameters are learned through maximum likelihood estimation and the regularization is achieved using the smoothness of coordinate functions. Generative topographic mapping (GTM) [3] was proposed to maximize the posterior probability of the data, which is generated by a low-dimensional discrete grid mapped into the original space and corrupted by additive Gaussian noise. GTM provides a principled alternative to the self-organizing map (SOM) [23], for which it is impossible to define an optimality criterion [12].

Methods for learning a principal curve have been widely studied, but they are generally limited to learning a structure that does not intersect itself [18]. Only a few methods can handle complex principal objects. Kégl and Krzyzak [21] extended their polygonal line method [22] for the skeletonization of handwritten digits. The principal manifold approach [14] extends the elastic maps approach [15] to learn a graph structure generated by a graph grammar. A major drawback of these two methods is that they require either a set of predefined rules (specifically designed for handwritten digits [21]) or grammars with many parameters that are difficult to tune, which makes their implementations complicated and their adaptation to new datasets difficult. More importantly, their convergence is not guaranteed. Recently, a subspace constrained mean shift (SCMS) method [30] was proposed that can obtain principal points for any given second-order differentiable density function, but it is still not trivial to transform a set of principal points into an explicit structure. In many real applications, an explicit structure uncovered from the given data is helpful for downstream analysis. This critical point will be illustrated in detail in Section 2 and Section 6 with various real world datasets.

Another line of research relevant to this work is structure learning, which has achieved great success in constructing or learning explicit structures from data. The graph structures that are commonly used in graph-based clustering and semi-supervised learning are the k-nearest neighbor graph and the ε-neighborhood graph [2]. The dramatic influence of these two graphs on clustering techniques has been studied in [27]. Since the ε-neighborhood graph can result in disconnected components or subgraphs in the dataset, or even isolated singleton vertices, the b-matching method has been applied to learn a better balanced neighborhood graph via loopy belief propagation [19]. However, it is improper to use a fixed neighborhood size, since the curvature of the manifold and the density of data points may differ in different regions of the manifold [11]. In order to alleviate these issues, a method for simultaneous clustering and embedding of data lying in multiple manifolds [11] was proposed, using a sparsity-inducing norm over the errors that measure the linear representation of every data point by its neighborhood information. Similarly, an ℓ_1 graph was learned for image analysis using an ℓ_1 norm over the errors to enhance the robustness of the learned graph [7]. Instead of learning directed graphs as the above two methods do, an integrated model for an undirected graph was proposed by imposing a sparsity penalty on a symmetric similarity matrix and a positive semi-definite constraint on the Laplacian matrix [26]. Although these methods have demonstrated their effectiveness on various problems, it is challenging to apply them even to moderate-size data, let alone large-scale data, due to their high computational complexity. Moreover, heuristically transforming a directed graph into an undirected graph for clustering and dimensionality reduction might be suboptimal [7, 11].

Figure 1. Real world problems exhibiting a variety of graph structures: (a) rotation of teapot images forming a circle; (b) optical character templates for different digits containing loops, bifurcations, and multiple disconnected components; (c) branching architecture of cancer evolution, where selective pressures allow some tumor clones to expand while others become extinct or remain dormant.

This paper is an extension of our preliminary work [29], where we demonstrated the effectiveness of learning a minimum-cost spanning tree as a showcase for datasets with tree structures in certain applications. Here we move forward to take into account more complicated structures that may exist in many real-world datasets. Moreover, we propose two strategies that specifically address the high computational complexity of the proposed method on large-scale data. The main contributions of this paper are summarized as follows:

  • We propose a novel regularized principal graph learning framework that addresses the aforementioned limitations by learning a set of principal points and an explicit graph structure from data, simultaneously. To the best of our knowledge, this is the first work to represent principal objects using an explicit graph that is learned in a systematic way with a guaranteed convergence and is practical for large-scale data.

  • Our novel formulation called reversed graph embedding for the representation of a principal graph facilitates the learning of graph structures, generalizes several existing methods, and possesses many advantageous properties.

  • In addition to spanning trees, a weighted undirected graph is proposed for modeling various types of graph structures, including curves, bifurcations, self-intersections, loops, and multiple disconnected components. This facilitates the proposed learning framework for datasets with complicated graphs.

  • We further propose two strategies to deal with large-scale data. One is to use side information as prior knowledge to reduce the complexity of graph learning; the other is to learn a representative graph over a set of landmarks. Both strategies can be simultaneously incorporated into our framework for efficiently handling large-scale data.

  • Extensive experiments are conducted to unveil the underlying graph structures of a variety of datasets, including various synthetic datasets and six real-world applications with diverse structures: the hierarchy of facial expression images, the progression path of breast cancer in microarray data, the rotation circle of teapot images, the smoothing skeletons of optical characters, and the similarity patterns of digits in two large-scale handwritten digit databases.

The rest of the paper is organized as follows. We first illustrate the learning problem by using various real world datasets in Section 2. In Section 3, we present three key building blocks for the representation of principal graphs. Based on these key building blocks, we propose a new regularized principal graph learning framework in Section 4. In Section 5, we incorporate two strategies into the proposed framework to deal with large-scale datasets. Extensive experiments are conducted in Section 6. Finally, we conclude this work in Section 7.

2. Motivation of Structure Learning

In many fields of science, experimental data resides in a high-dimensional space. However, the distance between two data points may not directly reflect the distance measured on the intrinsic structure of data. Hence, it is desirable to uncover the intrinsic structure of data before conducting further analysis.

It has been demonstrated that many high-dimensional datasets generated from real-world problems contain special structures embedded in their intrinsic dimensional spaces. One example is a collection of teapot images viewed from different angles [43]. Each image is represented by its RGB pixel values, so the pixel space has a very high dimensionality, but the intrinsic structure has only one degree of freedom: the angle of rotation. As shown in Fig. 1(a), distances computed by following the intrinsic circle are more meaningful than distances computed in the original space. Another example is given in [37], where it is demonstrated that a collection of facial expression images (represented as RGB pixel vectors) contains a two-layer hierarchical structure (Fig. 3(b) in [37]). The images from three facial expressions of one subject are grouped together to form the first layer, while all images from the three subjects form the second layer. In other words, images of one subject should be distant from images of all other subjects, but images from the three expressions of the same subject should be close. More complicated structures, including loops, bifurcations, and multiple disconnected components as shown in Fig. 1(b), can be observed in optical character templates for identifying a smoothing skeleton of each character [21, 30]. Other examples with specific intrinsic structures are also discussed in [39, 32].

We are particularly interested in studying human cancer, a dynamic disease that develops over an extended time period. Once initiated from a normal cell, the advance to malignancy can to some extent be considered a Darwinian process - a multistep evolutionary process - that responds to selective pressure [17]. The disease progresses through a series of clonal expansions that result in tumor persistence and growth, and ultimately the ability to invade surrounding tissues and metastasize to distant organs. As shown in Fig. 1(c), the evolution trajectories inherent to cancer progression are complex and branching [17]. Due to the obvious necessity for timely treatment, it is not typically feasible to collect time series data to study human cancer progression [28]. However, as massive molecular profile data from excised tumor tissues (static samples) accumulates, it becomes possible to design integrative computation analyses that can approximate disease progression and provide insights into the molecular mechanisms of cancer. We have previously shown that it is indeed possible to derive evolutionary trajectories from static molecular data, and that breast cancer progression can be represented by a high-dimensional manifold with multiple branches [38].

These concrete examples convince us that many datasets in a high-dimensional space can be represented by a certain intrinsic structure in a low dimensional space. Existing methods exploit the intrinsic structure of data implicitly either by learning a parametric mapping function or approximating a manifold via local geometric information of the observed data. However, these methods do not aim to directly learn an explicit form of an intrinsic structure in a low-dimensional space. In contrast, our proposed method is designed for this purpose to express the intrinsic structure of data by an explicit graph representation, which will be discussed in the following sections.

3. Principal Graph Representation

Let $\mathcal{X} \subseteq \mathbb{R}^D$ be the input space and $\{x_i\}_{i=1}^N \subset \mathcal{X}$ be a set of observed data points. We consider learning a projection function $f_{\mathcal{G}} \in \mathcal{F}_{\mathcal{G}}$ such that latent points $\{z_k\}_{k=1}^K$ in a $d$-dimensional space can faithfully represent $\{x_i\}_{i=1}^N$, where $\mathcal{F}_{\mathcal{G}}$ is a set of functions defined on a graph $\mathcal{G}$, and function $f_{\mathcal{G}}$ maps $z_k$ to $f_{\mathcal{G}}(z_k) \in \mathcal{X}$. Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ be a weighted undirected graph, where $\mathcal{V} = \{v_1, \dots, v_K\}$ is a set of vertices, $\mathcal{E}$ is a set of edges, and $W$ is an edge weight matrix with the $(k, k')$th element denoted by $w_{k,k'}$ as the weight of edge $(v_k, v_{k'})$. Suppose that every vertex $v_k$ has a one-to-one correspondence with latent point $z_k$, which resides on a manifold $\mathcal{Z}$ with an intrinsic dimension $d \ll D$. Next, we introduce three key components that form the building blocks of the proposed framework.

3.1. Reversed Graph Embedding

The most important component of the proposed framework is to model the relationship between graph $\mathcal{G}$ and latent points $\{z_k\}_{k=1}^K$. Given $\mathcal{G}$, weight $w_{k,k'}$ measures the similarity (or connection indicator) between two latent points $z_k$ and $z_{k'}$ in the intrinsic space $\mathcal{Z}$. Since the corresponding latent points are generally unknown in advance, we coin a new graph-based representation, namely Reversed Graph Embedding, in order to bridge the connection between the graph structure and the corresponding data points in the input space. Specifically, our intuition is that if any two latent variables $z_k$ and $z_{k'}$ are neighbors on $\mathcal{G}$ with high similarity $w_{k,k'}$, the two corresponding points $f_{\mathcal{G}}(z_k)$ and $f_{\mathcal{G}}(z_{k'})$ in the input space should be close to one another. To capture this intuition, we propose to minimize the following objective function as an explicit representation of principal graphs, given by

$$\min_{f_{\mathcal{G}}, \{z_k\}_{k=1}^{K}} \; \sum_{k, k'} w_{k,k'} \, \big\| f_{\mathcal{G}}(z_k) - f_{\mathcal{G}}(z_{k'}) \big\|^2, \quad (1)$$

where $f_{\mathcal{G}}$ and $\{z_k\}_{k=1}^K$ are the variables to be optimized for a given $W$. At this moment, we cannot obtain interesting solutions by directly solving problem (1) with respect to $f_{\mathcal{G}}$ and $\{z_k\}_{k=1}^K$, due to multiple trivial solutions leading to a zero objective value (e.g., mapping all latent points to a single point). However, objective function (1) possesses many interesting properties and can be used as a graph-based regularizer for unveiling the intrinsic structure of the given data, as detailed below.
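To make the regularizer concrete, the following is a minimal numeric sketch of objective (1) for a fixed weight matrix and fixed mapped points, where each row of a matrix C holds one point $f_{\mathcal{G}}(z_k)$; the function names are illustrative, and the graph Laplacian identity used as a cross-check is a standard fact rather than part of the paper.

```python
import numpy as np

def rge_cost(C, W):
    """Objective (1): sum_{k,k'} w_{k,k'} ||c_k - c_{k'}||^2, where row k of C
    stores f_G(z_k) and W is the symmetric weight matrix of the latent graph."""
    diff = C[:, None, :] - C[None, :, :]                 # (K, K, D) pairwise differences
    return float(np.sum(W * np.square(diff).sum(axis=2)))

def rge_cost_laplacian(C, W):
    """The same quantity via the standard identity: sum = 2 * tr(C^T L C)."""
    L = np.diag(W.sum(axis=1)) - W                       # graph Laplacian
    return float(2.0 * np.trace(C.T @ L @ C))

# Toy check on a 3-node chain of collinear points: both forms give 4.0.
C = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
W = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
assert np.isclose(rge_cost(C, W), rge_cost_laplacian(C, W))
```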

Relationship to Laplacian Eigenmap

The objective function (1) can be interpreted from a reverse perspective of the well-known Laplacian eigenmap [2]. Denote by $y_i \in \mathbb{R}^d$ the low-dimensional embedding of input point $x_i$ and let $Y = [y_1, \dots, y_N]$. The Laplacian eigenmap solves the following optimization problem,

$$\min_{Y} \; \sum_{i,j} b_{i,j} \, \| y_i - y_j \|^2 \quad \text{s.t. } Y D Y^\top = I,$$

where the similarity $b_{i,j}$ between $x_i$ and $x_j$ is computed in the input space in order to capture the locality information of the manifold modeled by neighbors of the input data points, e.g., using the heat kernel $b_{i,j} = \exp(-\|x_i - x_j\|^2 / t)$ with bandwidth parameter $t$ if $x_i$ is a neighbor of $x_j$ and $x_j$ is also a neighbor of $x_i$, and $b_{i,j} = 0$ otherwise, and $D$ is a diagonal matrix with $d_{i,i} = \sum_j b_{i,j}$. The Euclidean distances between $y_i$ and $y_j$, i.e., $\|y_i - y_j\|$, are computed in the latent space. Consequently, the two multiplicative factors of the objective functions of reversed graph embedding and the Laplacian eigenmap are calculated in different spaces: (i) the weights $w_{k,k'}$ and $b_{i,j}$ are computed in the latent space and the input space, respectively; (ii) the distances $\|f_{\mathcal{G}}(z_k) - f_{\mathcal{G}}(z_{k'})\|$ and $\|y_i - y_j\|$ are computed in the input space and the latent space, respectively.

Our formulation can directly model the intrinsic structure of data, while the Laplacian eigenmap approximates the intrinsic structure by using neighborhood information of the observed data points. In other words, our proposed reversed graph embedding representation facilitates the learning of a graph structure from data. The weight $w_{k,k'}$ encodes the similarity or connectivity between vertices $v_k$ and $v_{k'}$ on $\mathcal{G}$. For example, if $w_{k,k'} = 0$, edge $(v_k, v_{k'})$ is absent from $\mathcal{G}$. In most cases, a dataset is given, but graph $\mathcal{G}$ is unknown. In these cases, it is necessary to automatically learn $W$ from data. The objective function (1) is linear with respect to the weight matrix $W$. This linearity property facilitates the learning of the graph structure. We will discuss different representations of graphs based on this linearity property in Section 3.3.

Another important difference is that the number of latent variables $K$ is not necessarily equal to the number of input data points $N$, i.e., $K \leq N$ is allowed. This is useful for obtaining a representative graph over a set of landmark points by compressing a large number of input data points, while the representative graph can still faithfully represent the whole dataset. We will explore this property for large-scale principal graph learning in Section 5.2.

Harmonic Function of a General Graph

We have discussed the properties of the reversed graph embedding by treating the variables $f_{\mathcal{G}}$ and $\{z_k\}_{k=1}^K$ as a single integrated variable $\{f_{\mathcal{G}}(z_k)\}_{k=1}^K$. Actually, the optimal function obtained by solving (1) is related to harmonic or pluriharmonic functions. This can be further illustrated by the following observations. Let $\mathcal{N}_k$ be the neighbors of point $z_k$ on $\mathcal{G}$. For any given $\mathcal{G}$, problem (1) can be rewritten as

$$\min_{f_{\mathcal{G}}} \; \sum_{k=1}^{K} \sum_{k' \in \mathcal{N}_k} w_{k,k'} \, \| f_{\mathcal{G}}(z_k) - f_{\mathcal{G}}(z_{k'}) \|^2,$$

which, for each $f_{\mathcal{G}}(z_k)$, has an analytic solution obtained by fixing the rest of the variables, given by

$$f_{\mathcal{G}}(z_k) = \frac{\sum_{k' \in \mathcal{N}_k} w_{k,k'} \, f_{\mathcal{G}}(z_{k'})}{\sum_{k' \in \mathcal{N}_k} w_{k,k'}}. \quad (2)$$

If equalities (2) hold for all $k$, function $f_{\mathcal{G}}$ is a harmonic function on $\mathcal{G}$ since its value at each nonterminal vertex is the weighted mean of the values at the closest neighbors of this vertex [15]. The pluriharmonic graphs defined in [15] impose a penalty only on a subset $\mathcal{S}$ of $m$-stars, as

$$\sum_{S_j \in \mathcal{S}} \Big\| \sum_{i=1}^{m} f_{\mathcal{G}}(z_{j_i}) - m \, f_{\mathcal{G}}(z_{j_0}) \Big\|^2, \quad (3)$$

where $z_{j_0}$ denotes the center of the $m$-star $S_j$ and $z_{j_1}, \dots, z_{j_m}$ denote its leaves. In contrast, our formulation (1) allows us to flexibly incorporate any neighborhood structure existing in $\mathcal{G}$. The connection of (1) to harmonic or pluriharmonic functions enriches the class of target functions, as previously discussed in [15].
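The stationarity condition (2) can be checked numerically: repeatedly replacing each interior point by the weighted mean of its graph neighbors drives the configuration toward a harmonic function. The following toy sketch assumes the reconstructed update above; the chain graph and the fixed end points are illustrative only.

```python
import numpy as np

def harmonic_update(C, W):
    """One sweep of (2): each point becomes the weighted mean of its neighbors.
    Rows of C are the points f_G(z_k); W is a symmetric weight matrix."""
    deg = W.sum(axis=1)
    return (W @ C) / deg[:, None]

# A 4-point chain with the end points held fixed at 0 and 1: the interior
# values converge to 1/3 and 2/3, the harmonic values on this chain.
C = np.array([[0.0], [0.3], [0.9], [1.0]])
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
for _ in range(50):
    C[1:3] = harmonic_update(C, W)[1:3]
print(C.ravel())   # approximately [0, 1/3, 2/3, 1]
```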

Extension of Principal Curves to General Graphs

Equation (1) generalizes various existing methods. It is worth noting that the quantity $\sum_{k,k'} w_{k,k'} \|f_{\mathcal{G}}(z_k) - f_{\mathcal{G}}(z_{k'})\|^2$ can be considered as the length of a principal graph in terms of the squared $\ell_2$ norm. In the case where $\mathcal{G}$ is a linear chain structure, this quantity is the same as the length of a polygonal line defined in [22], measured with squared segments. However, general graphs or trees are more flexible than principal curves since the graph structure allows self-intersections. For principal graph learning, the elastic map [15] also defines a penalty based on a given graph. However, based on the above discussion, it is difficult to solve problem (3) with respect to both the function and the graph weights within the elastic-maps framework. In contrast, the proposed reversed graph embedding leads to a simple and efficient algorithm that learns a principal tree or a weighted undirected graph with guaranteed convergence. This will be clarified in Section 4.

3.2. Data Grouping

The second important component of the proposed framework is to measure the fitness of the latent variables to the given data in terms of a given graph $\mathcal{G}$ and projection function $f_{\mathcal{G}}$. As the number of latent variables is not necessarily equal to the number of input data points, we assume that projected point $c_k = f_{\mathcal{G}}(z_k)$ is the centroid of the $k$th cluster of $\{x_i\}_{i=1}^N$, so that input data points with high similarity form a cluster. The empirical quantization error [36] is widely used as the fitting criterion to be minimized for the optimal cluster centroids, and it is also frequently employed in principal curve learning methods [18, 22, 36, 33, 15]; it is given by

$$\min_{\{c_k\}_{k=1}^{K}} \; \sum_{i=1}^{N} \min_{k \in \{1, \dots, K\}} \| x_i - c_k \|^2. \quad (4)$$

Based on equation (4), we have the following loss functions.

Data Reconstruction Error: If $K = N$, we can reformulate equation (4) by reordering the centroids so that $c_i$ serves as the denoised counterpart of $x_i$, which gives

$$\min_{\{c_i\}_{i=1}^{N}} \; \sum_{i=1}^{N} \| x_i - c_i \|^2. \quad (5)$$

This formulation can be interpreted as the negative log-likelihood of data $\{x_i\}_{i=1}^N$ that are i.i.d. drawn from multivariate normal distributions with means $\{c_i\}_{i=1}^N$ and a fixed isotropic covariance matrix.

K-means: If $K < N$, we introduce an indicator matrix $P$ with the $(i,k)$th element $p_{i,k} = 1$ if $x_i$ is assigned to the $k$th cluster with centroid $c_k$, and $p_{i,k} = 0$ otherwise. Consequently, we have the following equivalent optimization problem

$$\min_{\{c_k\}_{k=1}^K, \, P} \; \sum_{i=1}^{N} \sum_{k=1}^{K} p_{i,k} \, \| x_i - c_k \|^2 \quad \text{s.t. } \sum_{k=1}^{K} p_{i,k} = 1, \; p_{i,k} \in \{0, 1\}, \; \forall i, \quad (6)$$

This is the same as the optimization problem of the K-means method that minimizes objective function (4).

Generalized Empirical Quantization Error: If $K < N$, a right stochastic matrix $P$ with each row summing to one is introduced, that is, $\sum_{k=1}^{K} p_{i,k} = 1$ and $p_{i,k} \geq 0$. This variant is equivalent to the above indicator-matrix representation if an integer solution is obtained. When $K$ is relatively large, K-means, which minimizes (6), might generate many empty clusters. To avoid this issue, we introduce a soft assignment strategy by adding a negative entropy regularization as

$$\min_{\{c_k\}_{k=1}^K, \, P} \; \sum_{i=1}^{N} \sum_{k=1}^{K} p_{i,k} \, \| x_i - c_k \|^2 + \sigma \sum_{i=1}^{N} \sum_{k=1}^{K} p_{i,k} \log p_{i,k} \quad \text{s.t. } \sum_{k=1}^{K} p_{i,k} = 1, \; p_{i,k} \geq 0, \; \forall i, \quad (7)$$

where $\sigma > 0$ is a regularization parameter.

In this paper, we use the loss function (7), since it is a generalized version of the other two cases and also has a close relationship with the mean shift clustering method [8], the Gaussian mixture model, and the K-means method. Below, we discuss these interesting properties in detail. The proofs of Proposition 1 and Proposition 2 are given in the Appendix.

Proposition 1.

The mean shift clustering [8] is equivalent to minimizing objective function (7) with respect to a left stochastic matrix with each column summing to one.

Proof.

Mean shift clustering [8] is a non-parametric model for locating the modes of a density function. Let $\{c_k\}_{k=1}^K$ be the modes of $\{x_i\}_{i=1}^N$ that we aim to find. A kernel density estimator is used to estimate the density function from $\{x_i\}_{i=1}^N$: for any $c \in \mathbb{R}^D$, it is given by

$$\hat{p}(c) = \frac{1}{N} \sum_{i=1}^{N} \kappa_\sigma(c, x_i),$$

where the Gaussian kernel $\kappa_\sigma(c, x_i) = \exp(-\|c - x_i\|^2 / \sigma)$ with bandwidth $\sigma$ is used. Since the data points are independently and identically distributed, we have a joint density function over $\{c_k\}_{k=1}^K$ given by

$$\hat{p}(c_1, \dots, c_K) = \prod_{k=1}^{K} \frac{1}{N} \sum_{i=1}^{N} \kappa_\sigma(c_k, x_i).$$

An optimization problem is naturally formed to find the modes by maximizing the logarithm of the joint density function with respect to $\{c_k\}_{k=1}^K$ as

$$\max_{\{c_k\}_{k=1}^K} \; \sum_{k=1}^{K} \log \sum_{i=1}^{N} \exp\big(-\|x_i - c_k\|^2 / \sigma\big). \quad (8)$$

On the other hand, we introduce the left stochastic matrix $P$, where $\sum_{i=1}^{N} p_{i,k} = 1$ and $p_{i,k} \geq 0$ for all $k$. For each $k$, minimizing the generalized empirical quantization error (7) with respect to the $k$th column of $P$ can be reformulated as the following optimization problem

$$\min_{p_{1,k}, \dots, p_{N,k}} \; \sum_{i=1}^{N} p_{i,k} \, \|x_i - c_k\|^2 + \sigma \sum_{i=1}^{N} p_{i,k} \log p_{i,k}$$

s.t. $\sum_{i=1}^{N} p_{i,k} = 1$, $p_{i,k} \geq 0$.

By the Lagrange duality theorem [4], the KKT conditions for the optimal solution are given by

$$\|x_i - c_k\|^2 + \sigma (\log p_{i,k} + 1) + \eta_k = 0, \; \forall i, \qquad \sum_{i=1}^{N} p_{i,k} = 1,$$

where $\eta_k$ is the Lagrange multiplier of the equality constraint. By combining the above two equalities, we have the optimal solution as

$$p_{i,k} = \frac{\exp\big(-\|x_i - c_k\|^2 / \sigma\big)}{\sum_{i'=1}^{N} \exp\big(-\|x_{i'} - c_k\|^2 / \sigma\big)}, \quad (9)$$

and the objective function to be minimized with respect to $\{c_k\}_{k=1}^K$ is given by

$$-\sigma \sum_{k=1}^{K} \log \sum_{i=1}^{N} \exp\big(-\|x_i - c_k\|^2 / \sigma\big).$$

By substituting (9) into (7), minimizing (7) with respect to $\{c_k\}_{k=1}^K$ becomes Problem (8). The proof is completed. ∎

Proposition 2.

Minimizing objective function (7) with respect to a right stochastic matrix with each row summing to one can be interpreted as the Gaussian mixture model with uniform weights.

Proof.

Let $P$ be a right stochastic matrix where $\sum_{k=1}^{K} p_{i,k} = 1$ and $p_{i,k} \geq 0$ for all $i$. Following the proof of Proposition 1, we have the optimal solution of minimizing (7) with respect to $P$ as

$$p_{i,k} = \frac{\exp\big(-\|x_i - c_k\|^2 / \sigma\big)}{\sum_{k'=1}^{K} \exp\big(-\|x_i - c_{k'}\|^2 / \sigma\big)}, \quad (10)$$

where the optimal $\{c_k\}_{k=1}^K$ is achieved by solving the following optimization problem

$$\max_{\{c_k\}_{k=1}^K} \; \sum_{i=1}^{N} \log \sum_{k=1}^{K} \exp\big(-\|x_i - c_k\|^2 / \sigma\big). \quad (11)$$

For a fixed $\sigma$, the above objective function is equivalent to the maximum log-likelihood of $\{x_i\}_{i=1}^N$, which are i.i.d. drawn from a mixture of Gaussian distributions consisting of $K$ components, each of which is a Gaussian distribution with mean $c_k$ and an isotropic covariance determined by $\sigma$, with a discrete uniform distribution over the components, i.e., each mixing weight equal to $1/K$. ∎

Corollary 1.

We have the following relationships among the three loss functions:

  1. If $\sigma \rightarrow 0$, minimizing (7) is equivalent to minimizing (6), so that the optimal $P$ becomes an indicator matrix with $p_{i,k} \in \{0, 1\}$.

  2. If $\sigma \rightarrow 0$ and $K = N$, minimizing (7) is equivalent to minimizing (5).

According to the aforementioned results, we can briefly summarize the merits of loss function (7). First, empty clusters will never be created, according to Proposition 2, for any $\sigma > 0$. Second, the loss function takes the density of the input data into account; in other words, the centroids obtained by minimizing the loss function should reside in high-density regions of the data. Third, the loss function makes the learning of graph structures computationally feasible for large-scale data in the case of $K \ll N$, which will be discussed in Section 5.
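Before moving to graph learning, the following is a minimal numeric sketch of the soft data-grouping loss (7) and its closed-form row-wise minimizer (10), assuming the formulas in the form reconstructed above; the helper names and toy data are illustrative.

```python
import numpy as np
from scipy.special import softmax

def soft_assignments(X, C, sigma):
    """Row-stochastic P minimizing (7) for fixed centroids: a softmax over
    negative scaled squared distances, as in (10)."""
    sqdist = np.square(X[:, None, :] - C[None, :, :]).sum(axis=2)   # (N, K)
    return softmax(-sqdist / sigma, axis=1), sqdist

def quantization_loss(P, sqdist, sigma):
    """Objective (7): soft quantization error plus the negative entropy term."""
    return np.sum(P * sqdist) + sigma * np.sum(P * np.log(np.maximum(P, 1e-300)))

# Toy usage: five 2-D points grouped around two candidate centroids.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]])
C = np.array([[0.3, 0.3], [5.0, 5.5]])
P, sqdist = soft_assignments(X, C, sigma=1.0)
print(P.round(3), quantization_loss(P, sqdist, sigma=1.0))
```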

3.3. Latent Sparse Graph Learning

The third important component of the proposed framework is to learn a latent graph from data. To achieve this goal, we investigate two special graph structures that can be learned from data by formulating the learning problems as linear programming. One is a tree structure, represented as a minimum-cost spanning tree; the other is a weighted undirected graph that is assumed to be sparse. Let $\{y_k\}_{k=1}^K$ be a dataset. Our task is to construct a weighted undirected graph with a cost $\phi_{k,k'}$ associated with edge $(y_k, y_{k'})$ by optimizing the similarity matrix $W$ based on the assumption of the specific graph structure.

Minimum Cost Spanning Tree

Let $\mathcal{T}$ denote a spanning tree with the minimum total cost and $\mathcal{E}_{\mathcal{T}}$ be the set of edges forming the tree. In order to represent and learn a tree structure, we consider $W$ as a binary indicator matrix, where $w_{k,k'} = 1$ if $(y_k, y_{k'}) \in \mathcal{E}_{\mathcal{T}}$, and $w_{k,k'} = 0$ otherwise. The integer linear programming formulation of a minimum spanning tree (MST) can be written as

$$\min_{W} \; \sum_{k,k'} \phi_{k,k'} \, w_{k,k'} \quad \text{s.t. } w_{k,k'} = w_{k',k}, \;\; \sum_{k,k'} w_{k,k'} = 2(K-1), \;\; \sum_{k, k' \in \mathcal{S}} w_{k,k'} \leq 2(|\mathcal{S}| - 1) \;\; \forall \, \emptyset \neq \mathcal{S} \subset \mathcal{V}, \;\; w_{k,k'} \in \{0, 1\}.$$

The first constraint enforces the symmetric connection of an undirected graph, e.g., $w_{k,k'} = w_{k',k}$. The second constraint states that the spanning tree contains exactly $K - 1$ edges. The third constraint imposes the acyclicity and connectivity properties of a tree. It is difficult to solve an integer programming problem directly. Thus, we resort to a relaxed problem by letting $0 \leq w_{k,k'} \leq 1$, that is,

$$\min_{W \in \mathcal{W}_{\mathcal{T}}} \; \sum_{k,k'} \phi_{k,k'} \, w_{k,k'}, \quad (12)$$

where the set of linear constraints is given by $\mathcal{W}_{\mathcal{T}} = \big\{ W : w_{k,k'} = w_{k',k}, \; \sum_{k,k'} w_{k,k'} = 2(K-1), \; \sum_{k, k' \in \mathcal{S}} w_{k,k'} \leq 2(|\mathcal{S}| - 1) \; \forall \mathcal{S}, \; 0 \leq w_{k,k'} \leq 1 \big\}$. Problem (12) can be readily solved by Kruskal's algorithm [24, 9].
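As a concrete illustration, here is a small self-contained sketch of Kruskal's algorithm applied to a cost matrix, returning the binary symmetric weight matrix that solves the relaxed problem (12) as reconstructed above; the union-find helper and the names are illustrative.

```python
import numpy as np

def kruskal_mst(Phi):
    """Return a binary symmetric W solving (12) for a cost matrix Phi (K, K):
    sort candidate edges by cost and greedily add those that do not close a cycle."""
    K = Phi.shape[0]
    parent = list(range(K))

    def find(a):                     # union-find with path compression
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    edges = sorted((Phi[i, j], i, j) for i in range(K) for j in range(i + 1, K))
    W = np.zeros((K, K))
    added = 0
    for cost, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                 # adding (i, j) keeps the graph acyclic
            parent[ri] = rj
            W[i, j] = W[j, i] = 1.0
            added += 1
            if added == K - 1:       # a spanning tree has exactly K - 1 edges
                break
    return W

# Toy usage: five 2-D points; costs are squared Euclidean distances.
Y = np.array([[0, 0], [1, 0], [2, 0], [0, 1], [0, 2]], dtype=float)
Phi = np.square(Y[:, None, :] - Y[None, :, :]).sum(axis=2)
print(kruskal_mst(Phi))
```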

Weighted Undirected Graph

A complete graph with sparsity constraints imposed over the edge weights, called an $\ell_1$ graph, is considered for learning a sparse graph. An $\ell_1$ graph is motivated by the intuition that one data point can usually be represented by a linear combination of other data points if they are similar to it. In other words, two data points $y_k$ and $y_{k'}$ are similar if edge weight $w_{k,k'}$ takes a positive value, which can be treated as a similarity measurement between the two data points. Consequently, the coefficients of the linear combination coincide with the edge weights on a graph. Consider the system of linear equations $y_k = Y w_k$, where $y_k$ is the vector to be approximated, $w_k$ is the $k$th column of the edge weight matrix $W$, and $Y = [y_1, \dots, y_K]$ is the overcomplete dictionary with $K$ bases. In practice, there may exist noise on certain elements of $y_k$, and a natural way is to estimate the edge weights by tolerating these errors. We introduce an error vector $e_k$ for each linear equation, and the equality constraints are formulated as $y_k = Y w_k + e_k, \forall k$. In order to learn a similarity measurement for an undirected graph, we impose nonnegativity and symmetry constraints on $W$, i.e., $w_{k,k'} = w_{k',k} \geq 0$ and $w_{k,k} = 0$. Generally, a sparse solution is more robust and facilitates the subsequent identification of similar samples. Following recent results on the sparse representation problem, convex $\ell_1$-norm minimization can effectively recover the sparse solution [11, 7]. Let $\Phi = [\phi_{k,k'}]$ be the cost matrix. We propose to learn $W$ by solving the following linear programming problem

$$\min_{W, \{e_k\}} \; \sum_{k,k'} \phi_{k,k'} \, w_{k,k'} + \lambda \sum_{k=1}^{K} \| e_k \|_1 \quad (13)$$

$$\text{s.t. } \; y_k = \sum_{k' \neq k} w_{k,k'} \, y_{k'} + e_k, \; \forall k, \quad w_{k,k'} = w_{k',k} \geq 0, \quad w_{k,k} = 0,$$

where $\lambda$ is a parameter for weighting all errors. Problem (13) differs from the methods [11, 7] that learn directed graphs, and from the method [26] that learns undirected graphs using a probabilistic model with Gaussian Markov random fields. Moreover, the data points used in the linear equations can be different from the data points that are used to compute the costs.
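For small instances, the linear program (13), in the form reconstructed above, can be prototyped with a generic convex solver. The sketch below uses cvxpy as a stand-in for an off-the-shelf LP solver such as Mosek; the formulation, variable names, and toy data are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np
import cvxpy as cp

def learn_l1_graph(Y, Phi, lam=1.0):
    """Y: (K, D) data points; Phi: (K, K) edge costs; returns a symmetric W."""
    K, D = Y.shape
    W = cp.Variable((K, K), nonneg=True)       # nonnegative edge weights
    E = cp.Variable((K, D))                    # per-point reconstruction errors
    objective = cp.Minimize(cp.sum(cp.multiply(Phi, W)) + lam * cp.sum(cp.abs(E)))
    constraints = [W == W.T,                   # undirected graph
                   cp.diag(W) == 0,            # no self-edges
                   Y == W @ Y + E]             # y_k = sum_k' w_{k,k'} y_{k'} + e_k
    cp.Problem(objective, constraints).solve()
    return W.value

# Toy usage on a few 2-D points, with squared distances as edge costs.
rng = np.random.default_rng(0)
Y = rng.normal(size=(8, 2))
Phi = np.square(Y[:, None, :] - Y[None, :, :]).sum(axis=2)
W = learn_l1_graph(Y, Phi, lam=1.0)
```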

Generalized Graph Structure Learning

For ease of presentation, we use a unified formulation for learning a similarity graph, where $\{y_k\}_{k=1}^K$ is a given dataset, $W$ is the similarity matrix of a graph over the given dataset, and $\Phi$ is a cost matrix with the $(k,k')$th entry $\phi_{k,k'}$. The feasible space of $W$ is denoted as $\mathcal{W}$. Specifically, we solve the following generalized graph structure learning problem, so that any graph learning problem that can be formulated as a linear program with constraints represented as $W \in \mathcal{W}$ can be used in the proposed framework:

$$\min_{W \in \mathcal{W}} \; \sum_{k,k'} \phi_{k,k'} \, w_{k,k'}. \quad (14)$$

It is easy to identify the correspondences of problem (14) to problems (12) and (13); for (13), the error variables and the weighted $\ell_1$ penalty can be handled by auxiliary variables and additional linear constraints, so the problem remains a linear program.

4. Regularized Principal Graph

By combining the three building blocks discussed in Section 3, we are ready to propose a unified framework for learning a principal graph. We use the alternate convex search algorithm to solve the proposed formulations by simultaneously learning a set of principal points and an undirected graph with guaranteed convergence.

4.1. Proposed Formulation

Figure 2. A cartoon illustrating the proposed formulation on the teapot images. Each circle marker represents one teapot image. Our assumption about the data generation process is that a graph exists in the latent space (e.g., a rotation circle in the left subplot), each latent point $z$ is then mapped to a point $f_{\mathcal{G}}(z)$ in the input space by maintaining the graph structure through the reversed graph embedding, and finally each image $x$ is observed conditioned on its mapped point according to a certain noise model.

Given an input dataset $\mathcal{X} = \{x_i\}_{i=1}^N$, we aim to uncover the underlying graph structure that generates $\mathcal{X}$. Since the input data may be drawn from a noise model, it is improper to learn the graph directly from $\mathcal{X}$. To unveil the hidden structure, we assume that (i) the underlying graph can be recovered from a learning model represented in the form of (14), e.g., the graph is an undirected weighted graph with a specific structure; (ii) graph $\mathcal{G}$ satisfies the reversed graph embedding assumption, where the cost is defined by $\phi_{k,k'} = \| f_{\mathcal{G}}(z_k) - f_{\mathcal{G}}(z_{k'}) \|^2$, that is, if two vertices $v_k$ and $v_{k'}$ are close on $\mathcal{G}$ with high similarity $w_{k,k'}$, the points $f_{\mathcal{G}}(z_k)$ and $f_{\mathcal{G}}(z_{k'})$ should be close to each other; (iii) the data points $\{x_i\}_{i=1}^N$ are considered noisy observations of the denoised points $\{f_{\mathcal{G}}(z_i)\}_{i=1}^N$, so that the movement of $x_i$ from its denoised point can be properly measured by the data grouping term.

For clarity, we illustrate the relations among the various variables in Fig. 2. Based on the above assumptions, we propose a novel regularized principal graph learning framework as

$$\min_{W \in \mathcal{W}, \, f_{\mathcal{G}}, \, \{z_i\}} \; \sum_{i,j} w_{i,j} \, \| f_{\mathcal{G}}(z_i) - f_{\mathcal{G}}(z_j) \|^2 \quad (15)$$

$$\text{s.t. } \; \| x_i - f_{\mathcal{G}}(z_i) \| \leq \varepsilon_i, \quad \forall \, i = 1, \dots, N,$$

where $\varepsilon_i \geq 0$ is a parameter bounding the noise of the $i$th data point. It is worth noting that (14) is the key component of (15) for learning a graph given $f_{\mathcal{G}}$ and $\{z_i\}$, and (15) also contains the reversed graph embedding objective (1) as a crucially important element for the two special graph learning formulations (12) and (13). In addition, we measure the noise of each input data point by the Euclidean distance between the input data point and its denoised counterpart.

Instead of manually setting the parameters $\{\varepsilon_i\}$, we assume that the total noise over all input data points is minimal. The optimization problem (15) is then reformulated to penalize the total noise, where $\gamma > 0$ is a tradeoff parameter between the graph-based regularizer and the total noise of the input data. We rewrite the above problem as the following optimization problem

$$\min_{W \in \mathcal{W}, \, f_{\mathcal{G}}, \, \{z_i\}} \; \gamma \sum_{i,j} w_{i,j} \, \| f_{\mathcal{G}}(z_i) - f_{\mathcal{G}}(z_j) \|^2 + \sum_{i=1}^{N} \| x_i - f_{\mathcal{G}}(z_i) \|^2 \quad \text{s.t. } W \in \mathcal{W},$$

where the second term in the objective function is equivalent to (5). As we have discussed in Section 3.2, the loss function (7) is more flexible than (5). Consequently, we propose the new regularized principal graph learning framework that simultaneously optimizes the graph weight matrix $W$, the latent variables $\{z_k\}_{k=1}^K$, the projection function $f_{\mathcal{G}}$, and the soft-assignment matrix $P$:

$$\min_{W \in \mathcal{W}, \, f_{\mathcal{G}}, \, \{z_k\}, \, P} \; \gamma \sum_{k,k'} w_{k,k'} \, \| f_{\mathcal{G}}(z_k) - f_{\mathcal{G}}(z_{k'}) \|^2 + \sum_{i=1}^{N} \sum_{k=1}^{K} p_{i,k} \, \| x_i - f_{\mathcal{G}}(z_k) \|^2 + \sigma \sum_{i=1}^{N} \sum_{k=1}^{K} p_{i,k} \log p_{i,k} \quad (16)$$

$$\text{s.t. } \; \sum_{k=1}^{K} p_{i,k} = 1, \;\; p_{i,k} \geq 0, \;\; \forall \, i.$$

In the rest of this section, we propose a simple method for solving problem (16) and present its convergence and complexity analysis.

4.2. Optimization Algorithm

We focus on learning the underlying graph structure from the data. Instead of learning $f_{\mathcal{G}}$ and $\{z_k\}_{k=1}^K$ separately (to learn $f_{\mathcal{G}}$, we would have to define $\mathcal{F}_{\mathcal{G}}$ properly; e.g., a linear projection function is possible, as studied in our previous work [28]), we optimize them as a joint variable $c_k = f_{\mathcal{G}}(z_k)$. Let $C = [c_1, \dots, c_K] \in \mathbb{R}^{D \times K}$, $X = [x_1, \dots, x_N] \in \mathbb{R}^{D \times N}$, and $P = [p_{i,k}] \in \mathbb{R}^{N \times K}$. In the case of weighted undirected graph learning, the optimization problem (16) with respect to variables $\{W, C, P\}$ is reformulated as

$$\min_{W, \{e_k\}, C, P} \; \gamma \Big( \sum_{k,k'} w_{k,k'} \, \| c_k - c_{k'} \|^2 + \lambda \sum_{k=1}^{K} \| e_k \|_1 \Big) + \sum_{i=1}^{N} \sum_{k=1}^{K} p_{i,k} \, \| x_i - c_k \|^2 + \sigma \sum_{i=1}^{N} \sum_{k=1}^{K} p_{i,k} \log p_{i,k} \quad (17)$$

$$\text{s.t. } \; W \text{ and } \{e_k\} \text{ satisfy the } \ell_1 \text{ graph constraints of (13)}, \quad \sum_{k=1}^{K} p_{i,k} = 1, \;\; p_{i,k} \geq 0, \;\; \forall \, i.$$

In the case of learning an MST, the feasible space is $\mathcal{W}_{\mathcal{T}}$, and the objective function is similar to (17), obtained by removing the second term ($\lambda \sum_k \|e_k\|_1$) from the objective function. For simplicity, we do not explicitly show the formulation for learning a spanning tree.

Problem (17) is a bi-convex optimization problem [16]: for fixed $W$ and $P$, optimizing $C$ is a convex optimization problem; for fixed $C$, optimizing $W$ and $P$ is also convex. We propose to solve problem (17) by using alternate convex search, a minimization method for solving a biconvex problem in which the variable set can be divided into disjoint blocks [16]. The blocks of variables defined by the convex subproblems are solved cyclically by optimizing the variables of one block while fixing the variables of all other blocks. In this way, each convex subproblem can be solved efficiently by a convex minimization method. Below, we discuss each subproblem in detail; the pseudo-code of the proposed method is given in Algorithm 1.

Fix $C$ and Solve $P$ and $W$

Given $C$, Problem (17) with respect to $P$ and $W$ can be solved independently. Due to the decoupling of the rows of $P$, we can solve for each row independently. According to Proposition 2, we have the analytic solution of $P$ given by

$$p_{i,k} = \frac{\exp\big(-\|x_i - c_k\|^2 / \sigma\big)}{\sum_{k'=1}^{K} \exp\big(-\|x_i - c_{k'}\|^2 / \sigma\big)}, \quad \forall \, i, k. \quad (18)$$

For Problem (17) with respect to $W$ and $\{e_k\}$, we have to solve the following optimization problem

$$\min_{W, \{e_k\}} \; \sum_{k,k'} \phi_{k,k'} \, w_{k,k'} + \lambda \sum_{k=1}^{K} \| e_k \|_1 \quad \text{s.t. the constraints of (13)}, \quad (19)$$

where $\Phi$ is the cost matrix with the $(k,k')$th element $\phi_{k,k'} = \| c_k - c_{k'} \|^2$. For Problem (16) with a spanning tree structure, we solve the following problem

$$\min_{W \in \mathcal{W}_{\mathcal{T}}} \; \sum_{k,k'} \phi_{k,k'} \, w_{k,k'}. \quad (20)$$

As discussed in Section 3.3, Problem (20) can be solved by Kruskal's algorithm. Problem (19) is a linear programming problem, which can be solved efficiently by an off-the-shelf linear programming solver such as Mosek [1] for small- or moderate-sized datasets.

Fix $P$ and $W$, and Solve $C$

Given $P$ and $W$, Problem (17) with respect to $C$ can be rewritten (up to an additive constant) as

$$\min_{C} \; 2\gamma \, \mathrm{tr}\big(C L C^\top\big) + \mathrm{tr}\big(C \Lambda C^\top\big) - 2 \, \mathrm{tr}\big(C^\top X P\big), \quad (21)$$

where the graph Laplacian matrix $L = \mathrm{diag}(W \mathbf{1}) - W$ and the diagonal matrix $\Lambda = \mathrm{diag}\big(\sum_{i} p_{i,1}, \dots, \sum_{i} p_{i,K}\big)$. Problem (21) is an unconstrained quadratic program. We have the analytic solution given by

$$C = X P \big( 2\gamma L + \Lambda \big)^{-1}, \quad (22)$$

where the inverse of matrix $2\gamma L + \Lambda$ always exists, since $L$ is a positive semi-definite matrix and the diagonal matrix $\Lambda$ is always positive definite according to (18).

1:  Input: Data $X$, parameters $\gamma$, $\sigma$, and $\lambda$ (for the weighted undirected graph)
2:  Initialize $C = X$
3:  repeat
4:     Compute the soft assignments $P$ by (18)
5:     Compute the cost matrix $\Phi$ with $\phi_{k,k'} = \|c_k - c_{k'}\|^2$
6:     Solve the following linear programming problem: (19) for a weighted undirected graph, or (20) for a spanning tree
7:     Compute $C$ by (22)
8:  until Convergence
Algorithm 1 Principal Graph Learning
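To make the alternation concrete, the following is a compact sketch of Algorithm 1 for the spanning-tree variant, assuming the updates (18), (20), and (22) in the form reconstructed above; scipy's minimum_spanning_tree stands in for Kruskal's algorithm, and the names, default parameters, and stopping rule are illustrative rather than the authors' reference implementation.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.special import softmax

def principal_tree(X, gamma=0.5, sigma=0.01, C0=None, max_iter=100, tol=1e-5):
    """Algorithm 1 (spanning-tree variant). X: (N, D), one sample per row.
    C0: optional (K, D) initial centroids (defaults to C = X, i.e. K = N)."""
    C = X.copy() if C0 is None else C0.copy()      # step 2: initialize centroids
    prev_obj = np.inf
    for _ in range(max_iter):
        # step 4: soft assignments P (N, K) via (18)
        sqdist = np.square(X[:, None, :] - C[None, :, :]).sum(axis=2)
        P = softmax(-sqdist / sigma, axis=1)
        # step 5: edge costs phi_{k,k'} = ||c_k - c_{k'}||^2
        Phi = np.square(C[:, None, :] - C[None, :, :]).sum(axis=2)
        # step 6: minimum spanning tree over the costs, i.e. problem (20);
        # note scipy treats exact zeros as missing edges (fine for distinct centroids)
        T = minimum_spanning_tree(Phi).toarray()
        W = ((T + T.T) > 0).astype(float)          # binary symmetric adjacency of the tree
        # step 7: closed-form centroid update, the row-wise transpose of (22)
        L = np.diag(W.sum(axis=1)) - W             # graph Laplacian
        Lam = np.diag(P.sum(axis=0))               # diagonal matrix of column sums of P
        C = np.linalg.solve(2 * gamma * L + Lam, P.T @ X)
        # stopping rule: relative change of the objective (17) without the l1 term
        obj = (gamma * np.sum(W * Phi) + np.sum(P * sqdist)
               + sigma * np.sum(P * np.log(np.maximum(P, 1e-300))))
        if abs(prev_obj - obj) <= tol * abs(prev_obj):
            break
        prev_obj = obj
    return C, W, P
```

For example, `C, W, P = principal_tree(X)` with `X` of shape `(N, D)` returns the principal points, the learned tree, and the soft assignments.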

4.3. Convergence and Complexity Analysis

Since problem (17) is non-convex, there may exist many locally optimal solutions. Following the initialization strategy of mean shift clustering, we initialize $C$ to be the original input data $X$, as shown in Algorithm 1. The theoretical convergence analysis of Algorithm 1 is presented in the following theorem.

Theorem 1.

Let $\{W^{(t)}, C^{(t)}, P^{(t)}\}$ be the solution of Problem (17) obtained in the $t$th iteration of Algorithm 1, and let $F^{(t)} = F(W^{(t)}, C^{(t)}, P^{(t)})$ be the corresponding objective function value. Then we have:

(i) $\{F^{(t)}\}$ is monotonically decreasing;

(ii) sequences $\{F^{(t)}\}$ and $\{W^{(t)}, C^{(t)}, P^{(t)}\}$ converge.

Proof.

Let $\{W^{(t)}, C^{(t)}, P^{(t)}\}$ be the solution obtained in the $t$th iteration. By Algorithm 1, at the $(t+1)$th iteration, since each subproblem is solved exactly, we have

$$F(W^{(t+1)}, C^{(t+1)}, P^{(t+1)}) \leq F(W^{(t+1)}, C^{(t)}, P^{(t+1)}) \leq F(W^{(t)}, C^{(t)}, P^{(t)}).$$

So the sequence $\{F^{(t)}\}$ is monotonically decreasing. Furthermore, function $F$ is lower-bounded, and then by the Monotone Convergence Theorem there exists $F^{\star}$ such that $\{F^{(t)}\}$ converges to $F^{\star}$.

Next, we prove that the sequence of solutions generated by Algorithm 1 also converges. Due to the compactness of the feasible sets of $W$ and $P$, and the boundedness of the iterates $C^{(t)}$, the sequence $\{W^{(t)}, C^{(t)}, P^{(t)}\}$ converges to a limit point $\{W^{\star}, C^{\star}, P^{\star}\}$ as $t \rightarrow \infty$. Since $F$ is continuous, $F(W^{(t)}, C^{(t)}, P^{(t)})$ converges to $F(W^{\star}, C^{\star}, P^{\star}) = F^{\star}$. ∎

According to Theorem 1, we define the stopping criterion of Algorithm 1 in terms of the relative change in the objective value compared to the last iteration, and we fix the same tolerance in all experiments. The empirical convergence results are shown in Section 6.1.

The time complexity of Algorithm 1 for learning a tree structure is determined by three individual parts. The first part is the complexity of running Kruskal's algorithm to construct a minimum spanning tree: it requires $O(N^2 D)$ operations for computing a fully connected cost graph and $O(N^2 \log N)$ operations for finding a spanning tree. The second part is dominated by computing the soft assignments of samples, which has a complexity of $O(N^2 D)$. The third part is dominated by the inverse of a matrix of size $N \times N$, which takes $O(N^3)$ operations, and the matrix multiplications, which take $O(N^2 D)$ operations. Therefore, the total complexity for each iteration is $O(N^2(\log N + D) + N^3)$. For learning an undirected weighted graph by solving Problem (19), the only difference is the complexity of the linear programming, which can be solved efficiently by Mosek [1]. In Section 5, we will extend the proposed algorithm for dealing with large-scale data.

5. Regularized Principal Graph Learning on Large-Scale Data

In order to reduce the high computational complexity of Algorithm 1 for large-scale data, we propose to incorporate two strategies into the proposed model for fast learning by using landmarks and side information.

5.1. Graph Learning with Side Information

Instance-level constraints are useful for expressing a priori knowledge about which instances should or should not be grouped together, and they have been successfully applied to many learning tasks, such as clustering [42], metric learning [45], and kernel learning [47]. In the area of clustering, the most prevalent form of advice is conjunctions of pairwise instance-level constraints of the form must-link (ML) and cannot-link (CL), which state that pairs of instances should be in the same or different clusters, respectively [42]. Given a set of points to cluster and a set of constraints, the aim of clustering with constraints is to use the constraints to improve the clustering results.

We consider certain structure-specific side information to guide the learning of the similarity matrix and to reduce computation, instead of optimizing the full matrix $W$. For this purpose, we take the CL constraints into account. Let $\mathcal{M}_k$ be a set of data points that might be linked to data point $y_k$. On the contrary, data points that are not in $\mathcal{M}_k$ belong to the CL set of $y_k$, and the corresponding weights are fixed to zero. By incorporating this CL side information into the proposed framework, we arrive at the following optimization problem for learning the graph representation:

$$\min_{W, \{e_k\}} \; \sum_{k=1}^{K} \sum_{k': \, y_{k'} \in \mathcal{M}_k} \phi_{k,k'} \, w_{k,k'} + \lambda \sum_{k=1}^{K} \| e_k \|_1 \quad (23)$$

$$\text{s.t. } \; y_k = \sum_{k': \, y_{k'} \in \mathcal{M}_k} w_{k,k'} \, y_{k'} + e_k, \; \forall k, \quad w_{k,k'} = w_{k',k} \geq 0, \quad w_{k,k'} = 0 \; \text{ if } y_{k'} \notin \mathcal{M}_k,$$

which can be equivalently transformed into a linear programming problem and solved accurately for large-scale data by using an off-the-shelf toolbox.
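To illustrate how the cannot-link side information shrinks the problem, the sketch below builds candidate sets from nearest neighbors and masks out all other weights before the LP is solved; the nearest-neighbor construction of the candidate sets is an assumption made only for illustration, not the paper's specific rule.

```python
import numpy as np

def candidate_mask(Y, m=10):
    """Boolean (K, K) mask: True where w_{k,k'} is allowed to be nonzero.
    Here M_k is taken to be the m nearest neighbors of y_k; everything else is
    treated as a cannot-link pair and fixed to zero before solving the LP."""
    K = Y.shape[0]
    sqdist = np.square(Y[:, None, :] - Y[None, :, :]).sum(axis=2)
    np.fill_diagonal(sqdist, np.inf)               # no self edges
    nn = np.argsort(sqdist, axis=1)[:, :m]         # indices of the m closest points
    mask = np.zeros((K, K), dtype=bool)
    rows = np.repeat(np.arange(K), m)
    mask[rows, nn.ravel()] = True
    mask |= mask.T                                 # keep the graph undirected
    return mask

Y = np.random.default_rng(0).normal(size=(1000, 16))
mask = candidate_mask(Y, m=10)
print(mask.sum(), "free weights instead of", Y.shape[0] ** 2)
```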

5.2. Landmark-based Regularized Principal Graph

In order to handle large-scale data, we extend the proposed regularized principal graph learning framework by using a landmark-based technique. Landmarks have been widely used in clustering methods to deal with large-scale data [13, 6]. The random landmark selection method is widely used for spectral clustering [13, 6]. It is well known that selecting landmarks with a certain adaptive strategy can further improve clustering performance; examples include k-means [46], a variety of fixed and adaptive sampling schemes, and a family of ensemble-based sampling algorithms [25].

Our regularized principal graph learning framework can inherently take landmark data points into account through the data grouping term. Taking the weighted undirected graph based framework as an example, the optimization problem with landmarks is given by

$$\min_{W, \{e_k\}, C, P} \; \gamma \Big( \sum_{k,k'} w_{k,k'} \, \| c_k - c_{k'} \|^2 + \lambda \sum_{k=1}^{K} \| e_k \|_1 \Big) + \sum_{i=1}^{N} \sum_{k=1}^{K} p_{i,k} \, \| x_i - c_k \|^2 + \sigma \sum_{i=1}^{N} \sum_{k=1}^{K} p_{i,k} \log p_{i,k} \quad (24)$$

$$\text{s.t. } \; \text{the constraints of (17)},$$

where $C = [c_1, \dots, c_K]$ is a set of landmarks of size $K \ll N$. Moreover, our proposed method can automatically adjust the centroids of the landmarks during the optimization procedure. Due to the non-convexity of the above objective function, we have to properly initialize these landmarks. As shown in the literature [46, 6], k-means can generally obtain better clustering performance for spectral clustering methods. In this paper, we follow this idea and use the centroids obtained from k-means to initialize $C$. The pseudo-code is shown in Algorithm 2. By following the same analysis procedure as in Section 4.3, the computational complexity of Algorithm 2 is $O(K^2(\log K + D) + NKD + K^3)$ per iteration, which is significantly smaller than the $O(N^2(\log N + D) + N^3)$ of Algorithm 1 if $K \ll N$. Hence, Algorithm 2 is practical for large-scale data with a large $N$.
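A minimal sketch of the landmark-based variant is given below: it initializes K landmark centroids with k-means, as the text suggests, and then reuses the principal_tree routine sketched after Algorithm 1 by passing the landmarks as initial centroids. The KMeans settings and names are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def principal_tree_landmarks(X, K=100, gamma=0.5, sigma=0.01, **kwargs):
    """X: (N, D) data; learns a principal tree over K << N landmark centroids.
    Assumes principal_tree() from the earlier sketch is defined in scope."""
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)   # k-means initialization
    C0 = km.cluster_centers_                                      # (K, D) initial landmarks
    # Reuse the Algorithm-1 updates, starting from the landmark centroids instead of X;
    # the soft assignments P are then N x K and the graph is learned over the K landmarks.
    return principal_tree(X, gamma=gamma, sigma=sigma, C0=C0, **kwargs)
```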

1:  Input: Data