Principal Boundary on Riemannian Manifolds

10/21/2017 · Zhigang Yao, et al. · National University of Singapore, Zhejiang University

We revisit the classification problem and focus on nonlinear methods for classification on manifolds. For multivariate datasets lying on an embedded nonlinear Riemannian manifold within the higher-dimensional space, our aim is to acquire a classification boundary between the classes with labels. Motivated by the principal flow [Panaretos, Pham and Yao, 2014], a curve that moves along a path of the maximum variation of the data, we introduce the principal boundary. From the classification perspective, the principal boundary is defined as an optimal curve that moves in between the principal flows traced out from two classes of the data and, at any point on the boundary, maximizes the margin between the two classes. We estimate the boundary, with its direction at each point supervised by the two principal flows. We show that the principal boundary yields the usual decision boundary found by the support vector machine, in the sense that locally, the two boundaries coincide. By means of examples, we illustrate how to find, use and interpret the principal boundary.


1 Introduction

Most of the classification methodology in high-dimensional data analysis is deeply rooted in methods relying on linearity. Modern data sets often consist of a large number of observations, each of which is made up of many features. Manifold data arise when the sample space of the data is fundamentally nonlinear. Instead of viewing the features of an observation as a point in a high-dimensional space, it is more convenient to assume that the data points lie on a lower-dimensional nonlinear manifold embedded within the higher-dimensional space. The lower-dimensional manifold structure can usually be understood in at least two scenarios: 1) the physical data space is an actual manifold; 2) the underlying data structure can be closely approximated by a manifold. In the former scenario, the data space is usually known, and the data can further be seen as data in shape space [10], e.g., earthquake coordinates, leaf growth patterns, and data with nonlinear constraints that are thus forced to lie on a manifold. In the latter scenario, the manifold is uncovered from the data set by a nonlinear dimensionality reduction technique, referred to as a manifold learning method [17, 1, 20], and is thus considered unknown.

In this work, we consider the classification problem and design nonlinear methods that act as a boundary for classifying data sets lying on manifolds. Throughout, we mainly focus on the known “manifold” case; that is, the embedding is known. This problem has become increasingly relevant, as many real applications, such as medical imaging [5, 18] and computer vision [15, 16], produce data in such forms. This encourages researchers to do the analysis directly on the manifold rather than in the Euclidean space, or in an enlarged Euclidean space obtained via basis expansions. The rationale is that, if the data reside on a manifold, the metric on the manifold is much more appropriate than that of the Euclidean space. However, methodology defined on the manifold for classification is still lacking. To perform reliable classification for data points on manifolds, a strategy of developing statistical tools, such as a nonlinear classification boundary, in parallel with their Euclidean counterparts, is highly relevant.

Though we have seen tremendous effort in developing statistical procedures for classification problems, the major part has been focused on constructing a separating hyperplane between two classes in the Euclidean space. The optimality is essentially restricted to finding linear (affine) hyperplanes that separate the data points as well as possible. Among them, linear discriminant analysis (LDA), or the slightly different logistic regression method, seeks the hyperplane by minimising a so-called discriminant function, and thus traces out a linear boundary separating the different classes. Going beyond the linear boundary, the support vector machine (SVM) finds a seemingly different separating hyperplane; that is, the hyperplane is actually found (up to some loss function) not in the original feature space, but in an enlarged space obtained by transforming the feature space via basis functions. In this sense, SVM produces a so-called nonlinear boundary only with respect to the original space. That being said, a direct extension of SVM that retains its use as a classifier on a manifold is not straightforward.

There have been a number of pieces of research on statistical methods on manifolds over the past decades, centring around finding the main modes of variation [4, 6] in the data, or finding a manifold version of the principal components of the data [9, 3, 7, 12, 2, 11, 8], in terms of dimension reduction. However, none of them seems adaptable to deriving a boundary on a manifold, due to their “non curve-fitting” nature. Recently, the principal flow [14] has been proposed as a one-dimensional curve, defined on the manifold, that attempts to follow the main direction of the data locally, while still accommodating the “curve fitting” characteristic on the manifold. The variational principal flow [13] incorporates the level set method to obtain a fully implicit formulation of the problem. The principal sub-manifold [19] extends the principal flow to a higher-dimensional sub-manifold. It is natural to ask the following question: can anything be found that separates the data directly on the manifold, without attempting to formulate the same problem in another, unknown space?

We explore the limitations inherent in the problem when trying to find such a boundary. Inspired by the principal flow, we consider generalisations of the classification boundary on Riemannian manifolds. We do not intend to merely search for a nonlinear boundary, directly on the manifold, that satisfies a certain optimisation condition with respect to the data points. Rather, our idea is to trace a boundary out of the two principal flows from the two classes, and at the same time retain some canonical interpretation for the boundary. Our intuition is that, as the two principal flows represent the mean trend of the two classes, in order to classify the points, it is enough to separate the two flows in some optimal way. This means that one no longer needs to consider the data points beyond the two flows on each side, as they are irrelevant to the classification, provided that we can separate the flows well. Naturally, because of the two principal flows, the process of constructing the boundary can be supervised, in the sense that the boundary grows itself by borrowing strength from the two principal flows. To achieve this, the key insight is the margin, a measure of distance between the target boundary and the two principal flows, subject to the presence of noise originating from each class. In principle, an optimal boundary can be framed by maximising the margin between the target boundary and the two corresponding principal flows. The optimisation involved therein can be relaxed by fine-tuning the subspace of the vector field from the two principal flows, up to their parallel transport on the manifold. From this perspective, the boundary retains the characteristic of being principal, in the sense that at each point of the boundary, it points in a direction calculated from the two directions of the vector fields of the two principal flows. This finally results in a classification boundary which we name the principal boundary, drawing a relation to the principal flow.

We demonstrate how the problem of obtaining the principal boundary can be transformed into a well-defined integration problem in Section 3.2, after a motivation and an introduction of the margin in Section 3.1, which plays a key role in defining the boundary. The formal definition of a principal boundary is given in Section 3.2. We show that on the manifold, the principal boundary reduces to the SVM boundary, at least locally; that is, the segment of the principal boundary coincides with that of SVM. Section 5 establishes this property of the boundary with a detailed analysis of the relation between the boundary and SVM. Generally speaking, our formulation of the boundary is feasible for any Riemannian manifold, provided that the geodesic is unique, although an unknown geodesic might increase the computational complexity.

The remaining part of the paper is organised as follows. We start with a brief introduction of the principal flow (Sections 2.2 and 2.3), together with a modified vector field and a modified principal flow. Section 3 presents the methodology. Section 4 investigates an implementable algorithm for determining the principal boundary. Section 5 gives the technical details of the equivalence between the boundary and SVM. In Section 6, we illustrate the principal boundary by means of simulated examples. We end the paper with a discussion.

2 The Problem

We are interested in the following problem: given a Riemannian manifold embedded in a higher-dimensional space, let one set of data points on the manifold carry label 1 and another set carry label -1. We are looking for a classification boundary on the manifold that separates the data points of the two classes with as wide a margin as possible.

The heuristic behind the principal boundary is as follows: first, we construct the two mean flows of the data points of the two classes, respectively. Each mean flow represents the principal direction of the data variation for its class on the manifold. Second, the classification problem can now be rephrased as finding a flow, lying between the two mean flows, which separates the two as well as possible.

The two mean flows are in fact called principal flows. Before we continue, let us digress slightly and review the principal flow.

2.1 Preliminaries

Let $x_1, \dots, x_n$ be data points on a complete Riemannian manifold $M$ of dimension $m$, where $m < d$, embedded in the linear space $\mathbb{R}^d$.

We assume that a differentiable function $F : \mathbb{R}^d \to \mathbb{R}^{d-m}$ always exists, such that $M = \{x \in \mathbb{R}^d : F(x) = 0\}$.

For each $x \in M$, the tangent space at $x$ will be denoted by $T_xM$; it is characterised by the equation $T_xM = \{v \in \mathbb{R}^d : J_F(x)\, v = 0\}$, with $J_F(x)$ the Jacobian of $F$ at $x$.

Thus, $T_xM$ is in fact a vector space, the set of all tangent vectors to $M$ at $x$, which essentially provides a local vector space approximation of the manifold $M$.

By equipping the manifold with the tangent space, we define two mappings back and forth between $M$ and $T_xM$: 1) the exponential map, well defined in terms of geodesics, is the map

(1)   $\exp_x : T_xM \to M$, $\quad \exp_x(v) = \gamma_v(1)$,

where $\gamma_v$ is the geodesic starting from $x$ with initial velocity $v$, i.e. $\gamma_v(0) = x$ and $\dot{\gamma}_v(0) = v$; and 2) the logarithm map (the inverse of the exponential map), locally defined at least in a neighborhood of $x$,

(2)   $\log_x : M \to T_xM$, $\quad \log_x = \exp_x^{-1}$.

Here, exp and log are defined on a local neighborhood of $x$ such that they are all well defined, away from the cut locus of $x$ on $M$.

Let $x, y \in M$. Denote by $\Gamma(x, y)$ the set of all (piecewise) smooth curves $\gamma : [0, 1] \to M$ with endpoints such that $\gamma(0) = x$ and $\gamma(1) = y$. The geodesic distance from $x$ to $y$ is defined as

(3)   $d(x, y) = \inf_{\gamma \in \Gamma(x, y)} \int_0^1 \|\dot{\gamma}(t)\|\, \mathrm{d}t$,

where $\|\cdot\|$ denotes the norm induced by the Riemannian metric. Minimising (3) yields the shortest distance between the two points $x$ and $y$ in $M$.
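For intuition, the following minimal sketch implements the exponential map, the logarithm map and the geodesic distance in the special case of the unit sphere, where all three have closed forms; the Python helper names (sphere_exp, sphere_log) and the choice of manifold are ours, used purely for illustration.

```python
import numpy as np

def sphere_exp(x, v):
    """Exponential map on the unit sphere: follow the great circle from x
    with initial (tangent) velocity v."""
    nv = np.linalg.norm(v)
    if nv < 1e-12:
        return x
    return np.cos(nv) * x + np.sin(nv) * (v / nv)

def sphere_log(x, y):
    """Logarithm map on the unit sphere: the tangent vector at x pointing
    towards y, whose length equals the geodesic distance d(x, y)."""
    p = y - np.dot(x, y) * x                               # project y onto the tangent plane at x
    theta = np.arccos(np.clip(np.dot(x, y), -1.0, 1.0))    # geodesic distance
    if np.linalg.norm(p) < 1e-12:
        return np.zeros_like(x)
    return theta * p / np.linalg.norm(p)

x = np.array([0.0, 0.0, 1.0])
y = np.array([1.0, 0.0, 0.0])
v = sphere_log(x, y)
print(np.linalg.norm(v))        # geodesic distance: pi/2
print(sphere_exp(x, v))         # recovers y up to rounding
```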

2.2 Definition of principal flow

Definition 2.1

The principal flow of () is defined as the union of the two curves, and , satisfying the following two variational problems, respectively

(4)
(5)

where the first argument is the starting point and the second is a unit vector at that point. The starting point can be chosen as the Fréchet mean of the data points, i.e. the point on $M$ minimising the sum of squared geodesic distances $\sum_{i=1}^{n} d^2(x, x_i)$, or any other point of interest. Note that the class of admissible curves is the set of all non-intersecting differentiable curves on $M$.

Technically, the principal flow incorporates two ingredients: a local covariance matrix and a vector field.

For a data point $x$, choose a neighborhood of $x$ with radius $h$, defined as $B(x, h) = \{y \in M : d(x, y) \le h\}$.

Accordingly, the local covariance matrix is defined as

$\Sigma_h(x) = \dfrac{\sum_{i=1}^{n} \kappa_h(x, x_i)\, \log_x(x_i)\, \log_x(x_i)^{\top}}{\sum_{i=1}^{n} \kappa_h(x, x_i)},$

where $\kappa_h(x, x_i) = K\!\left(d(x, x_i)/h\right)$, with $K$ a smooth non-increasing univariate kernel on $[0, 1]$.

Let $U$ be a connected open set covering the data points and such that $\Sigma_h(x)$ is well defined for all $x \in U$. Assume that $\Sigma_h(x)$ has distinct first and second eigenvalues for all $x \in U$. The vector field is defined in the way that the first eigenvector (corresponding to the largest eigenvalue) of $\Sigma_h(x)$ is extended to a vector field $W$ on $U$; that is, for any $x \in U$, we have

(6)

Meanwhile, it has been proved that $W$ is a differentiable mapping, with $W(x)$ being independent of the local coordinates of the tangent space $T_xM$.

It can be seen that the curve starts at and follows the direction of the vector field and the curve starts at and goes in the opposite direction of the vector field. Thus, the integral for is negative, which explains why the infimum appears in its definition.
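To make the local covariance and the induced vector field concrete, here is a minimal sketch that computes a kernel-weighted tangent covariance at a point and extracts its leading eigenvector; the Gaussian-type truncated kernel and the helper names (local_covariance, leading_direction, log_map) are our own choices rather than the paper's.

```python
import numpy as np

def local_covariance(x, data, log_map, h):
    """Kernel-weighted covariance of the data in the tangent space at x.

    x       : base point on the manifold (ambient coordinates)
    data    : array of shape (n, d) with sample points on the manifold
    log_map : function (x, y) -> tangent vector of y at x (ambient coordinates)
    h       : neighbourhood radius / kernel bandwidth
    """
    logs = np.array([log_map(x, y) for y in data])        # tangent representations
    dists = np.linalg.norm(logs, axis=1)                   # geodesic distances to x
    w = np.exp(-0.5 * (dists / h) ** 2) * (dists <= h)     # truncated kernel weights
    if w.sum() == 0:
        raise ValueError("no samples in the neighbourhood of x")
    w = w / w.sum()
    return (logs * w[:, None]).T @ logs                    # weighted covariance matrix

def leading_direction(sigma):
    """First eigenvector of the local covariance, i.e. the local principal
    direction, together with the first and second eigenvalues."""
    vals, vecs = np.linalg.eigh(sigma)                     # eigenvalues in ascending order
    return vecs[:, -1], vals[-1], vals[-2]
```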

Under the principal flow definition, we can define the principal flow of each class of data as the union of its two constituent curves. For convenience, we will only consider the flow in part (1) of Definition 2.1 and re-name it accordingly. By symmetry, the solution to the flow in part (2) of Definition 2.1 can be carried out in the same way. Similarly, we will restrict the discussion to the flow of one class, with the flow of the other class treated in the same way.

2.3 Modified principal flow

The principal flow relies heavily on the vector field. However, the original definition of the vector field (see (6)) constructs the direction at every point of the field solely as the leading eigenvector of the local covariance matrix computed at that point. This definition can be problematic when we need a more delicate field for the flow. To be exact, each point belonging to a neighbourhood where the field is calculated can also belong to other neighbourhoods. It turns out that we will need to modify the vector field for the principal boundary, since the vector field plays quite a crucial role in the problem.

We will equip a vector field over the training samples: for the samples in each neighbourhood, determine a locally dominant (principal) vector by local tangent PCA.

A sample can be a neighbour of multiple points. Let the index set collect all the neighbourhoods that hold a given sample as a neighbour. The modification of the vector field amounts to the overall effect of a point holding multiple neighbourhood memberships. To achieve this, it is very natural to equip the sample with a vector given by a weighted sum of the locally principal vectors of those neighbourhoods. Let the local mean denote the mean of the neighbourhood used when determining each locally principal vector. Then, we assign to the sample the projection of this weighted sum onto the manifold at that sample.
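A minimal sketch of this weighted construction, assuming the neighbourhood memberships, per-neighbourhood weights and locally principal vectors have already been computed (for example with the local PCA helpers above); the specific weighting scheme and the tangent-space projection used here are our own reading, not a verbatim transcription of the paper's formula.

```python
def modified_vector(i, members, principal_vecs, weights, tangent_proj):
    """Vector assigned to sample i in the modified vector field.

    members        : dict  j -> list of sample indices in neighbourhood j
    principal_vecs : dict  j -> locally principal vector of neighbourhood j
    weights        : dict  (j, i) -> weight of sample i within neighbourhood j
    tangent_proj   : function (i, v) -> projection of v onto the tangent space at sample i
    """
    # neighbourhoods holding sample i as a neighbour
    J = [j for j, idx in members.items() if i in idx]
    if not J:
        raise ValueError("sample %d belongs to no neighbourhood" % i)
    # weighted sum of the locally principal vectors, projected back to the tangent space
    v = sum(weights[(j, i)] * principal_vecs[j] for j in J)
    return tangent_proj(i, v)
```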

With the vector field constructed above, a principal flow of the given data set is defined by

(7)

That is, at each point of the flow, the tangent should match the vector field of the samples in its neighbourhood. To differentiate it from the principal flow in Definition 2.1, we call it the modified principal flow.

3 Methodology

Now we get back to the original question; that is, given two principal flows and determined from two data sets and , can we find a curve, say , that can be used as a classification boundary between the two classes?

Strictly speaking, many curves on the manifold could exist that separate the two classes of data. Therefore, by using the term the classification boundary, we refer to the best one. We present the general idea of constructing such a boundary here, leaving the formal introduction to Section 3.2: let the boundary start from an initial point and move infinitesimally in a chosen initial direction. At this moment, we assume that both the starting point and the initial direction are carefully chosen so that the flow moves in the correct direction. Once the first move has been made, it may no longer make sense to continue moving in the same direction. One may ask, then, in what direction should we move? Obviously, we should not move towards either of the two principal flows, since this would cause the boundary to move close to one of them. To update the direction, a natural strategy that plays an important role in building the boundary is to let the boundary move in a direction supervised by the two principal flows; that is, we follow the vector fields inherited from the two flows, and at each move choose a proportional amount of the vector field from each. Indeed, the right amount of vector field to choose for the next move is essentially an optimisation problem, the derivation of which will be discussed in the next section. That being said, the intuitive version of such a boundary is not unique, in the sense that a parallel curve satisfying the same condition always exists, as can be seen by varying the initial point. To achieve the classification, let us view the problem slightly differently: note that it is only the points lying in between the two principal flows that could influence the boundary, so a very straightforward approach is to choose the tangent vector for the next move along the direction that creates the biggest margin between the data points of the two classes. Under this rationale, iterating the process approximately traces out an integral curve that is not only proportionally compatible with the vector fields of the two flows at each point but, more importantly, splits the margin between the classes, and can therefore be considered a classification boundary.

3.1 Margin

At each point of the boundary, its tangent vector should be the locally principal vector of the samples in the neighbourhood of that point. Suppose that distinct first and second largest eigenvalues of the local covariance matrix of the centered samples exist. The principal vector is the dominant eigenvector, corresponding to the largest eigenvalue; the local PCA also determines the second largest eigenvalue. The associated ratio approximately indicates the largest distance by which samples in the neighbourhood deviate from the mean along the corresponding eigenvector.

The distance of a point of the manifold to a principal flow is defined as the infimum of the geodesic distances between the point and the points of the flow. We assume that the distance is achievable, i.e., there is a point on the flow attaining it. Furthermore, we assume that the minimiser is unique. We call this minimiser the projection of the point onto the flow, and regard it as a function of the point.

Hence, if the distance from a point on the candidate boundary to a flow exceeds the local spread of that class around the flow, the resulting gap is a ‘soft’ margin for classifying that class, in the sense that the corresponding neighbour set is located on one side of the boundary, at least locally. More generally, given a curve on the same manifold, if this gap is positive at every point of the flow, then the class is located on one side of that curve.
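To make the computation concrete, the following sketch projects a point onto a discretised flow and forms the soft margin as the gap between the projection distance and a local spread; representing the flow by sample points and measuring the spread by the square root of the second local eigenvalue are simplifying assumptions on our part.

```python
import numpy as np

def project_onto_flow(x, flow_pts, dist):
    """Projection of x onto a discretised flow: the flow point at minimum
    geodesic distance, together with that distance."""
    dists = [dist(x, y) for y in flow_pts]
    k = int(np.argmin(dists))
    return flow_pts[k], dists[k]

def soft_margin(x, flow_pts, spreads, dist):
    """Gap between the distance from x to the flow and the local spread of
    the class at the projection point (e.g. sqrt of the second eigenvalue);
    a positive value means the local neighbourhood of the class stays on
    one side of x."""
    dists = [dist(x, y) for y in flow_pts]
    k = int(np.argmin(dists))
    return dists[k] - spreads[k]
```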

3.2 Principal boundary

Suppose that the two principal flows are determined from the two data sets, respectively. We say the two flows are separated if there is a curve such that they lie on different sides of it, subject to the margins to the two flows; that is, both soft margins are positive at every point of the curve. Clearly, such a curve can correctly classify the two data sets.

If a point is located between the two flows, we call the minimum of its two soft margins the margin of the point with respect to the two flows.

Let the admissible class be the set of classification curves with unit speed. A good classification curve should have as large a margin to the two principal flows as possible at each of its points. We call the ideal classification curve, the one with the largest such margin, a principal boundary of the two data sets.

Definition 3.1

A unit-speed curve is called the principal boundary if it maximises the integral of the margin over . That is,

Definition 3.1 formalises the idea of an ideal principal boundary that separates two principal flows by maximising the margin over a class of flows. Theoretically, this class contains all the curves that lie in between the two principal flows. This is a much broader class than is necessary to sort out a boundary from a classification point of view. Our greater interest lies in achieving such a boundary over a smaller set, which is more readily accessible. Ordinarily, to find such a set one may consider the alignment between the target principal boundary and the principal flows. Hence, we will restrict our attention to a class of curves for which the correspondence between the boundary and each principal flow can be explained by a corresponding geodesic.

Assume that the projections onto the two flows are one-to-one along the boundary; that is, a different point on the boundary yields a different projection onto each flow. Hence, the geodesic curve between the two projections of a boundary point must cross the principal boundary at that original point. The maximisation in Definition 3.1 means that the point should also be the middle point of this geodesic curve, i.e., equidistant from its two projections, if both principal flows have been pre-determined. Hence, we can equivalently define the principal boundary in a more direct way, as follows (a numerical check of the resulting conditions is sketched after the definition).

Definition 3.2

A curve on is called the principal boundary of two principal flows and , if any satisfies

  • the geodesic curve between its two projections and onto and also contains the point , and

  • .
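As the numerical check mentioned above, the sketch below tests whether a candidate point is (approximately) the midpoint of the geodesic between its two projections, which is the equidistance reading suggested by the discussion preceding Definition 3.2; the discretised-flow projections, the geodesic_midpoint helper and the tolerance are our own simplifications.

```python
def is_boundary_point(p, flow1_pts, flow2_pts, dist, geodesic_midpoint, tol=1e-3):
    """Check the boundary conditions for a candidate point p: p should sit at
    (approximately) the midpoint of the geodesic joining its projections onto
    the two flows, hence lie on that geodesic and be equidistant from both."""
    q1 = min(flow1_pts, key=lambda y: dist(p, y))    # projection onto flow 1
    q2 = min(flow2_pts, key=lambda y: dist(p, y))    # projection onto flow 2
    mid = geodesic_midpoint(q1, q2)                  # midpoint of the geodesic q1 -> q2
    near_midpoint = dist(p, mid) < tol
    balanced = abs(dist(p, q1) - dist(p, q2)) < tol
    return near_midpoint and balanced
```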

The condition (2) is equivalent to for any . Now let us discuss how to obtain such a as a parameterized flow

starting at an initial point that satisfies the condition (2) in Definition 3.2. The tangent vector will be carefully chosen, as follows.

Since the projections can be parameterized as

the principal flows can also be parameterized as

Hence, we are equipped with two tangent vectors and at and , respectively. Numerically, the tangent vector or can be estimated by the vector field of at or the vector field of at .

The two tangent vectors and (or their estimates) may not necessarily lie on the same tangent plane at of . A natural solution is to move the tangent vectors towards the tangent plane at under a parallel transport along the geodesic curves. Let and be the transported tangent vectors of and respectively. In Appendix 2, we give the details of Schild's ladder, a machinery for an approximate implementation of the parallel transport.

As soon as the parallel transport is done at the current , the choice of is two-fold: 1) if lies in the plane spanned by and , then there is a satisfying the equation 2) otherwise, with

where is the norm.

Although the above discussion does not immediately yield an implementable estimation of the true vector ( is not available), it gives an updating rule for estimating it, as follows. Prior to , we choose with a small and estimate as

(8)

Then, we check whether the estimate is acceptable by testing whether , the projection of onto the manifold, satisfies the conditions in Definition 3.2 under a given accuracy. If this is not the case, we slightly tune to , and check again until convergence. The update can be written as

(9)

Initially, when tangent vector is not available for determining the initial , we can simply choose . In the next section, we will present a detailed algorithm for computing the principal boundary.
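The following is one plausible reading of this update loop in code form, since the displayed equations (8) and (9) are not reproduced here: the tangent proposal mixes the two parallel-transported flow directions, and the mixing weight is tuned until the projected next point satisfies the conditions of Definition 3.2; a grid search over the weight replaces the paper's tuning rule, and all helper names are hypothetical.

```python
import numpy as np

def propose_step(p, v1, v2, alpha, step, exp_map):
    """Candidate update from the current boundary point p.

    v1, v2  : the two flows' directions parallel-transported to p
    alpha   : mixing weight in [0, 1] between the two transported directions
    step    : small step size
    exp_map : exponential map of the manifold, exp_map(p, v) -> point
    """
    v = alpha * v1 + (1.0 - alpha) * v2
    v = v / np.linalg.norm(v)                 # unit-speed tangent proposal
    return exp_map(p, step * v)               # project the move back onto the manifold

def update_boundary(p, v1, v2, step, exp_map, satisfies_def, n_grid=21):
    """Tune the mixing weight until the proposed next point satisfies the
    boundary conditions (to a given accuracy), then return that point."""
    for alpha in np.linspace(0.0, 1.0, n_grid):
        q = propose_step(p, v1, v2, alpha, step, exp_map)
        if satisfies_def(q):                   # e.g. is_boundary_point from Section 3.2
            return q
    raise RuntimeError("no acceptable direction found; refine the grid or the step size")
```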

4 Algorithm

Finding the principal boundary in practice can be more challenging than finding a principal flow, in the sense that the former problem is more sensitive to the choice of the points on the boundary. Recalling the margin defined for a point in Section 3.1, the minimum therein is achieved if and only if the corresponding distance is as small as possible. However, one cannot simply identify a sequence of such points between the two flows to form an approximate boundary. The main reason is that we require the boundary to be a smooth curve. In this respect, we need a much more sophisticated mechanism to guarantee an equal margin on both sides of the boundary, including the choice of the initial point, whereas in the case of the principal flow, picking the initial point can be very flexible. This is particularly true when the margins differ significantly between the two classes. In such cases, picking the mean or any symmetry of the data points is no longer meaningful. To facilitate the process of generating the boundary, we will require an initial point and then a process of fine-tuning the vector field along the way, which has a direct impact on the principal boundary between the two flows.

We will now present a high-level description of the algorithm for computing the boundary (see Figure 1), the core of which is elaborated as follows.

Step 1 (Initialising the boundary): The initialisation involves finding a matching pair on and , and calculating an initial point . Arbitrarily choose a point , and let be the projection of onto . Consider the geodesic curve between and . Obviously, for any point , we always have . Let be the projection of onto . Hence, identify such that

Here, we call a warm start. Pick a matching pair as follows

Then we can identify a point such that . Obviously, satisfies the conditions in Definition 3.2.
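A rough sketch of this initialisation under the discretised-flow simplification used above: alternate projections between the two flows from a warm start until the matching pair stabilises, then take the geodesic midpoint as the initial boundary point; the stopping rule and helper names are our own choices.

```python
def initialise_boundary(z, flow1_pts, flow2_pts, dist, geodesic_midpoint, n_iter=20):
    """Step 1: find a matching pair (q1, q2) on the two flows by alternating
    projections from a warm start z, then take the geodesic midpoint of the
    pair as the initial boundary point."""
    proj = lambda x, pts: min(pts, key=lambda y: dist(x, y))
    q1 = proj(z, flow1_pts)
    q2 = proj(q1, flow2_pts)
    for _ in range(n_iter):                    # alternate until the pair stabilises
        q1_new = proj(q2, flow1_pts)
        q2_new = proj(q1_new, flow2_pts)
        if dist(q1_new, q1) == 0 and dist(q2_new, q2) == 0:
            q1, q2 = q1_new, q2_new
            break
        q1, q2 = q1_new, q2_new
    p0 = geodesic_midpoint(q1, q2)             # initial point, equidistant from the pair
    return q1, q2, p0
```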

Step 2 (Updating the boundary): Calculate the next boundary point from the previous point using a small step size. Initially, let , , and set as in (8). Then, estimate by

(10)

and find the projection as

(11)

If satisfies the conditions in Definition 3.2, let ; otherwise, update by (9), and re-calculate in (10) and in (11).

The algorithm is executed for a number of steps and produces a sequence of boundary points. The constructed sequence is indeed the principal boundary, since every accepted point satisfies the conditions of Definition 3.2.

Suppose we discretise the boundary as a sequence of points. The length of the principal boundary can then be numerically approximated by summing the geodesic distances between consecutive points.
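For completeness, a one-function numerical approximation of that length from the discretised points (a standard chord sum over geodesic distances, not a formula quoted from the paper):

```python
def boundary_length(points, dist):
    """Approximate the length of the discretised boundary by summing the
    geodesic distances between consecutive points."""
    return sum(dist(p, q) for p, q in zip(points[:-1], points[1:]))
```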

Figure 1: Algorithm. (a) Step 1 (Initializing the boundary): is the warm start for finding the matching pair and ; is the initial point chosen from the projection satisfying the conditions of Definition 3.2, via alternating . (b) Step 2 (Updating the boundary): is used to find ; is chosen by from the projection satisfying the conditions of Definition 3.2, via alternating .

5 Property of the principal boundary

This section shows that the local segment of the principal boundary reduces to the boundary given by the support vector machine. We remark that the SVM boundary used here is a manifold extension of the usual SVM, which is essentially a geodesic curve. The same results hold in the context of Euclidean spaces, where the manifold is a linear subspace of the ambient space. By making the notion of “local equivalence” precise, we provide a measure of distance between the principal boundary, obeying Definition 3.2, and the SVM boundary.

To study the relation, we start with a quantitative description of the segment of on by the following proposition.

Proposition 5.1

(Local configuration) Consider a small segment of on . Suppose that

  • The segment is discretised as , where .

  • Following the notation in Section 3.1, let be the set of samples in the -neighbourhood of . Clearly, the local neighbourhoods give a configuration of local data points from class as

    Similarly, we can define for the other class .

Figure 2: The principal boundary in the local configuration. The covering ellipse ball (in green) with the second radius , centred at , contains the local points (in “”) for class ; the shortest distance between and the corresponding local configuration of class is approximated by (same for class ).

If the segment is small enough, it is approximately a segment of a straight line, and the discretised points are, also approximately, located on that line. Let the projections of these points onto the two principal flows be

respectively. The SVM on the two classes determines a geodesic curve that separates the two sets, such that the margin of

is maximised. For , we define the margin as

To quantify the relation between the two boundaries, we will basically need three approximations. We remark that although a careful approximation of one to the other can be bounded under some error assumptions, together with an estimation of the relevant quantities, doing so would result in a complicated analysis. To simplify our discussion, we will first sketch an overall review of the local equivalence while highlighting the idea. A refined approximation then follows in Theorem 1.

Let us consider the samples in each neighbour set . Assume that these neighbours are covered by an ellipse ball with the second radius , centred at . Since , the quantity

(12)

should hold approximately. From , we get that

Similarly, we also have that . Therefore, we have

On the other hand, let be the intersection point of the geodesic curve and the straight line . We also have the approximations

(13)
(14)

It follows that the SVM margin

Let be the geodesic curve nearest to . Then , and

by definition, which suggests that approximately coincides with .

It can be seen that the local equivalence holds if Approximations (12) and (13) (or (14)) are satisfactory. We introduce the following condition to guarantee the approximations up to some quantitative degree of uncertainty. This is done by linking the density of sample points in a local neighbourhood with a probability measure.

Condition 5.1

(Covering ellipse ball) For each , consider samples in each neighbour set . We assume that,

  • has neighbours that are covered by an ellipse ball of the second radius , centred at ,

  • when , with probability of at least

  • similar conditions apply to class .

Theorem 1

Let and be the separating boundaries between the local samples of the two classes, derived from the principal boundary and SVM, respectively. Let where and . Given Proposition 5.1 and Condition 5.1, and are equivalent in the sense that , with probability at least , for .

Theorem 1 gives an equivalence of and on the curved manifold. The proof of Theorem 1 is given in Appendix 1. Although we have linked the two boundaries with an interest in interpreting them locally, this does not necessarily mean that Theorem 1 is only valid when the locality is infinitesimal. Instead, it is governed by the spacings of the segment . In fact, when the locality of