Learning Data Manifolds with a Cutting Plane Method

05/28/2017 · SueYeon Chung, et al. · Hebrew University of Jerusalem, University of Pennsylvania, Harvard University

We consider the problem of classifying data manifolds where each manifold represents invariances that are parameterized by continuous degrees of freedom. Conventional data augmentation methods rely upon sampling large numbers of training examples from these manifolds; instead, we propose an iterative algorithm called M_CP, based upon a cutting-plane approach that efficiently solves a quadratic semi-infinite programming problem to find the maximum margin solution. We provide a proof of convergence as well as a polynomial bound on the number of iterations required for a desired tolerance in the objective function. The efficiency and performance of M_CP are demonstrated in high-dimensional simulations and on image manifolds generated from the ImageNet dataset. Our results indicate that M_CP is able to rapidly learn good classifiers and shows superior generalization performance compared with conventional maximum margin methods that rely on data augmentation.


1 Introduction

Handling object variability is a major challenge for machine learning systems. For example, in visual recognition tasks, changes in pose, lighting, identity or background can result in large variability in the appearance of objects hinton1997modeling . Techniques to deal with this variability have been the focus of much recent work, especially with convolutional neural networks consisting of many layers. The manifold hypothesis states that natural data variability can be modeled as lower-dimensional manifolds embedded in higher dimensional feature representations bengio2013representation . A deep neural network can then be understood as disentangling or flattening the data manifolds so that they can be more easily read out in the final layer brahma2016deep . Manifold representations of stimuli have also been utilized in neuroscience, where different brain areas are believed to untangle and reformat their representations riesenhuber1999hierarchical ; serre2005object ; hung2005fast ; dicarlo2007untangling ; pagan2013signals .

This paper addresses the problem of classifying data manifolds that contain invariances with a number of continuous degrees of freedom. These invariances may be modeled using prior knowledge, manifold learning algorithms tenenbaum1998mapping ; roweis2000nonlinear ; tenenbaum2000global ; belkin2003laplacian ; belkin2006manifold ; canas2012learning or as generative neural networks via adversarial training goodfellow2014generative . Based upon knowledge of these structures, other work has considered building group-theoretic invariant representations anselmi2013unsupervised or constructing invariant metrics simard1994memory . On the other hand, most approaches today rely upon data augmentation by explicitly generating "virtual" examples from these manifolds niyogi1998incorporating ; scholkopf1996incorporating . Unfortunately, the number of samples needed to successfully learn the underlying manifolds may increase the original training set by more than a thousand-fold krizhevsky2012imagenet .

Figure 1:

The maximum margin binary classification problem for a set of manifolds. The optimal linear hyperplane is parameterized by the weight vector $w$, which separates positively labeled manifolds from negatively labeled manifolds. Conventional data augmentation techniques resort to sampling a large number of points from each manifold to train a classifier.

We propose a new method, called the Manifold Cutting Plane algorithm or M_CP, that uses knowledge of the manifolds to efficiently learn a maximum margin classifier. Figure 1 illustrates the problem in its simplest form: binary classification of manifolds with a linear hyperplane, with extensions to this basic model discussed later. Given a number of manifolds embedded in a feature space, the M_CP algorithm learns a weight vector $w$ that separates positively labeled manifolds from negatively labeled manifolds with the maximum margin. Although the manifolds consist of uncountable sets of points, the algorithm is able to find a good solution in a provably finite number of iterations and training examples.

Support vector machines (SVMs) can learn a maximum margin classifier given a finite set of training examples vapnik1998statistical ; however, with conventional data augmentation methods, the number of training examples increases exponentially, rendering the standard SVM algorithm intractable. Methods such as shrinkage and chunking to reduce the complexity of SVMs have been studied before in the context of dealing with large-scale datasets smola1998learning , but the resultant kernel matrix may still be very large. Other methods that subsample the kernel matrix lee2001rsvm or reduce the number of training samples wang2005training ; smola1998learning may result in suboptimal solutions that do not generalize well.

Our algorithm directly handles the uncountable set of points in the manifolds by solving a quadratic semi-infinite programming problem (QSIP). M_CP is based upon a cutting-plane method which iteratively refines a finite set of training examples to solve the underlying QSIP fang2001solving ; kortanek1993central ; liu2004new . The cutting-plane method was also previously shown to efficiently handle learning problems with a finite number of examples but an exponentially large number of constraints joachims2006training . We provide a novel analysis of the convergence of M_CP with both hard and soft margins. When the problem is realizable, the convergence bound explicitly depends upon the margin value, whereas with a soft margin and slack variables, the bound depends linearly on the number of manifolds.

The paper is organized as follows. We first consider the hard margin problem and analyze the simplest form of the M_CP algorithm. Next, we introduce slack variables in M_CP, one for each manifold, and analyze its convergence with additional auxiliary variables. We then demonstrate the application of M_CP to both high-dimensional synthetic data manifolds and to feature representations of images undergoing a variety of warpings. We compare its performance, both in efficiency and generalization error, with conventional SVMs using data augmentation techniques. Finally, we discuss some natural extensions and potential future work on M_CP and its applications.

2 Manifold Cutting Plane Algorithm with Hard Margin

In this section, we first consider the problem of classifying a set of manifolds when they are linearly separable. This allows us to introduce the simplest version of the M_CP algorithm along with the appropriate definitions and the QSIP formulation. We analyze the convergence of this simple algorithm and prove an upper bound on the number of errors it can make in this setting.

2.1 Hard Margin QSIP

Formally, we are given a set of $P$ manifolds $M^p \subset \mathbb{R}^N$, $p = 1, \ldots, P$, with binary labels $y_p = \pm 1$ (all points in the same manifold share the same label). Each manifold is defined by $M^p = \{ x^p(s) : s \in S \}$, where $S$, a compact, convex subset of $\mathbb{R}^D$, represents the parameterization of the invariances and $x^p(s)$ is a continuous function of $s \in S$, so that the manifolds are bounded: $\|x^p(s)\| \le L$ for some $L$. We would like to solve the following semi-infinite quadratic programming problem for the weight vector $w \in \mathbb{R}^N$:

$\min_{w} \; \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_p \langle w, x \rangle \ge 1 \;\; \forall x \in M^p, \; p = 1, \ldots, P.$   (1)

This is the primal formulation of the problem, where maximizing the margin is equivalent to minimizing the squared norm $\|w\|^2$. We denote the maximum margin attainable by $\kappa^* = 1/\|w^*\|$, and the optimal solution as $w^*$. For simplicity, we do not explicitly include the bias term here. A non-zero bias can be modeled by adding an additional feature of constant value as a component to all the $x$. Note that the dual formulation of this QSIP is more complicated, involving optimization of non-negative measures over the manifolds. In order to solve the hard margin QSIP, we propose the following simple algorithm.

2.2 Algorithm

M_CP is an iterative algorithm to find the optimal $w$ in (1), based upon a cutting plane method for solving the QSIP. The general idea behind M_CP is to start with a finite number of training examples, find the maximum margin solution for that training set, augment the training set by finding a point on the manifolds that violates the constraints, and iterate this process until a tolerance criterion is reached.

At each stage of the algorithm there is a finite set of training points and associated labels. The training set at the $k$-th iteration is denoted by the set $T^k = \{(x_i, y_i)\}$ with $|T^k|$ examples. For the $i$-th pattern in $T^k$, $p_i$ is the index of its manifold, and $y_i = y_{p_i}$ is its associated label.

On this set of examples, we solve the following finite quadratic programming problem, denoted $\mathrm{SVM}_k$:

$\min_{w} \; \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i \langle w, x_i \rangle \ge 1 \;\; \forall (x_i, y_i) \in T^k$   (2)

to obtain the optimal weights $w_k$ on the training set $T^k$. We then find a constraint-violating point $x \in M^p$ from one of the manifolds such that

$y_p \langle w_k, x \rangle < 1 - \delta$   (3)

with a required tolerance $\delta > 0$. If there is no such point, the algorithm terminates. If such a point exists, it is added to the training set, defining the new set $T^{k+1} = T^k \cup \{(x, y_p)\}$. The algorithm then proceeds at the next iteration to solve $\mathrm{SVM}_{k+1}$ to obtain $w_{k+1}$. For $k = 0$, the set $T^0$ is initialized with at least one point from each manifold. The pseudocode for M_CP is shown in Alg. 1.

Input: $\delta$ (tolerance), manifolds $M^p$ and labels $y_p$, $p = 1, \ldots, P$.

1. Initialize $k = 0$ and the set $T^0$ with at least one sample from each manifold $M^p$.

2. Solve for $w_k$ in $\mathrm{SVM}_k$: $\min_w \frac{1}{2}\|w\|^2$ s.t. $y_i \langle w, x_i \rangle \ge 1$ for all $(x_i, y_i) \in T^k$.

3. Find a point $x \in M^p$ among the manifolds with a margin s.t. $y_p \langle w_k, x \rangle < 1 - \delta$.

4. If there is no such point, then stop. Else, augment the point set: $T^{k+1} = T^k \cup \{(x, y_p)\}$.

5. $k \leftarrow k + 1$ and go to 2.

Algorithm 1 Pseudocode for the M_CP algorithm.

In step 3 of the algorithm, a point among the manifolds that violates the margin constraint needs to be found. The use of a "separation oracle" is common in other cutting plane algorithms such as those used for structural SVMs joachims2006training or linear mixed-integer programming marchand2002cutting . In our case, this requires determining the feasibility of the margin constraint over the $D$-dimensional convex parameter set $S$. When the manifold mapping $x^p(s)$ is convex, feasibility can sometimes be determined analytically, or more generally by a variety of modern convex optimization techniques. For non-convex mappings, a feasible separating point can be found with search techniques in the $D$-dimensional parameter set using gradient information or finite differences, or approximately via convex relaxation techniques. In our experiments below, we provide some specific examples of how separating points can be efficiently found.
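For concreteness, the following is a minimal Python sketch of Algorithm 1, written with the cvxpy QP solver. The manifold objects' sample() method and the oracle(w, manifold) function are hypothetical interfaces standing in for the initialization and the separation oracle discussed above; they are illustrative assumptions, not part of the original formulation.

```python
import numpy as np
import cvxpy as cp

def mcp_hard_margin(manifolds, labels, oracle, tol=1e-3, max_iter=200):
    """Sketch of Algorithm 1 (hard-margin M_CP).

    manifolds : list of manifold objects; manifolds[p].sample() (hypothetical)
                returns one point of M^p used to initialize T^0
    labels    : array of +/-1 manifold labels y_p
    oracle    : oracle(w, manifold) (hypothetical separation oracle) returns the
                point x on the manifold minimizing label * <w, x>
    """
    # Step 1: initialize T^0 with one sample from each manifold.
    X = [m.sample() for m in manifolds]
    y = [float(lab) for lab in labels]
    w_k = None

    for _ in range(max_iter):
        # Step 2: solve the finite hard-margin QP (SVM_k) on the current set T^k.
        A, yv = np.vstack(X), np.array(y)
        w = cp.Variable(A.shape[1])
        cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                   [cp.multiply(yv, A @ w) >= 1]).solve()
        w_k = w.value

        # Step 3: query the separation oracle for the worst-margin point over all manifolds.
        worst_margin, worst_x, worst_y = np.inf, None, None
        for p, m in enumerate(manifolds):
            x = oracle(w_k, m)
            margin = labels[p] * float(w_k @ x)
            if margin < worst_margin:
                worst_margin, worst_x, worst_y = margin, x, float(labels[p])

        # Step 4: terminate if no manifold point violates the margin by more than tol.
        if worst_margin >= 1 - tol:
            break
        # Step 5: otherwise add the violating point to the training set and iterate.
        X.append(worst_x)
        y.append(worst_y)
    return w_k
```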

2.3 Convergence of M_CP

The M_CP algorithm will converge asymptotically to an optimal solution when it exists. Here we show that M_CP asymptotically converges to an optimal $w$. Denote the change in the weight vector at the $k$-th iteration as $\Delta w_k = w_{k+1} - w_k$. We present a set of lemmas and theorems leading up to the bounds on the number of iterations for convergence, and the estimation of the objective function. More detailed proofs can be found in Supplementary Materials (SM).

Lemma 1.

The change in the weights satisfies $\langle w_k, \Delta w_k \rangle \ge 0$.

Proof.

Define $w(\beta) = w_k + \beta \Delta w_k$. Then for all $0 \le \beta \le 1$, $w(\beta)$ satisfies the constraints on the point set $T^k$: $y_i \langle w(\beta), x_i \rangle \ge 1$ for all $(x_i, y_i) \in T^k$. However, if $\langle w_k, \Delta w_k \rangle < 0$, there exists a $\beta > 0$ such that $\|w(\beta)\| < \|w_k\|$, contradicting the fact that $w_k$ is a solution to $\mathrm{SVM}_k$.

Next, we show that the norm of $w_k$ must monotonically increase by a finite amount at each iteration.

Theorem 2.

In the $k$-th iteration of the M_CP algorithm, the increase in the norm of $w$ is lower bounded via $\|w_{k+1}\|^2 \ge \|w_k\|^2 + \delta^2 / L^2$, where $\delta$ is the tolerance and $L$ bounds the norm of the points on the manifolds.

Proof.

First, note that $\Delta w_k \ne 0$, otherwise the algorithm stops. We have $\|w_{k+1}\|^2 = \|w_k\|^2 + 2\langle w_k, \Delta w_k \rangle + \|\Delta w_k\|^2 \ge \|w_k\|^2 + \|\Delta w_k\|^2$ (Lemma 1). Consider the point $x$ added to the set $T^{k+1}$. At this point, $y_p \langle w_k, x \rangle < 1 - \delta$ and $y_p \langle w_{k+1}, x \rangle \ge 1$, hence $y_p \langle \Delta w_k, x \rangle > \delta$. Then, from the Cauchy-Schwarz inequality,

$\|\Delta w_k\| \ge \frac{y_p \langle \Delta w_k, x \rangle}{\|x\|} > \frac{\delta}{L}.$   (4)

Since the optimal solution $w^*$ satisfies the constraints of every $\mathrm{SVM}_k$, we have $\|w_k\| \le \|w^*\|$. Thus, the sequence of iterates has monotonically increasing norms that are upper bounded by $\|w^*\|$. Due to convexity, there is a single global optimum and the algorithm is guaranteed to converge.

As a corollary, we see that this procedure is guaranteed to find a separating solution, if one exists, in a finite number of steps.

Corollary 3.

The M_CP algorithm converges to a zero-error classifier in fewer than $L^2 / \kappa^{*2}$ iterations, where $\kappa^*$ is the optimal margin and $L$ bounds the norm of the points on the manifolds.

Proof.

When there is an error, we have $y_p \langle w_k, x \rangle < 0$ and $y_p \langle w_{k+1}, x \rangle \ge 1$, so that $\|\Delta w_k\| > 1/L$ (see (4)). This implies the total number of possible errors is upper bounded by $\|w^*\|^2 L^2 = L^2 / \kappa^{*2}$.

With a finite tolerance $\delta$, we obtain a bound on the number of iterations required for convergence:

Corollary 4.

The M_CP algorithm for a given tolerance $\delta$ terminates in fewer than $L^2 / (\delta^2 \kappa^{*2})$ iterations, where $\kappa^*$ is the optimal margin and $L$ bounds the norm of the points on the manifolds.

Proof.

Again, $\|w_k\|^2 \le \|w^*\|^2 = 1/\kappa^{*2}$, and each iteration increases the squared norm by at least $\delta^2 / L^2$.

We can also bracket the error in the objective function after M_CP terminates:

Corollary 5.

With tolerance $\delta$, after M_CP terminates with solution $w$, the optimal value of (1) is bracketed by: $\frac{1}{2}\|w\|^2 \le \frac{1}{2}\|w^*\|^2 \le \frac{1}{2(1-\delta)^2}\|w\|^2$.

Proof.

The lower bound on $\frac{1}{2}\|w^*\|^2$ holds as before, since $w^*$ satisfies the constraints of the final $\mathrm{SVM}_k$. Since M_CP has terminated, every manifold point satisfies $y_p \langle w, x \rangle \ge 1 - \delta$, so setting $w' = w/(1-\delta)$ would make $w'$ feasible for (1), resulting in the upper bound on $\frac{1}{2}\|w^*\|^2$.

3 M_CP with Slack Variables

In many classification problems, the manifolds may not be linearly separable due to their dimensionality, size, or correlation structure. In these situations, M_CP will not be able to find a feasible solution. To handle these problems, the classic approach is to introduce a slack variable for each point. Unfortunately, this approach requires integrations over entire manifolds with an appropriate measure defined over the infinite set of slack variables. Thus, we formulate an alternative version of the QSIP with slack variables below.

3.1 QSIP with Manifold Slacks

In this work, we propose using only one slack variable per manifold for classification problems with non-separable manifolds. This formulation demands that all the points on each manifold obey an inequality constraint with one manifold slack variable, $\xi_p \ge 0$. As we see below, solving for this constraint is tractable and the algorithm has good convergence guarantees.

However, a single slack requirement for each manifold by itself may not be sufficient for good generalization performance. Our empirical studies show that generalization performance can be improved if we additionally demand that some representative points on each manifold also obey the margin constraint $y_p \langle w, \bar{x}^p \rangle \ge 1$. In this work, we implement this intuition by specifying an appropriate center point $\bar{x}^p$ for each manifold $M^p$. This center point could be the center of mass of the manifold, a representative point, or an exemplar used to generate the manifold krizhevsky2012imagenet . Additional slack variables for these constraints could potentially be introduced; in the present work, we simply demand that these points strictly obey the margin inequalities corresponding to their manifold label. Formally, the QSIP optimization problem is summarized below, where the objective function is minimized over the weight vector $w$ and slack variables $\xi_p$:

$\min_{w, \{\xi_p\}} \; \frac{1}{2}\|w\|^2 + C \sum_{p=1}^{P} \xi_p \quad \text{s.t.} \quad y_p \langle w, x \rangle \ge 1 - \xi_p \;\; \forall x \in M^p, \quad y_p \langle w, \bar{x}^p \rangle \ge 1, \quad \xi_p \ge 0, \quad p = 1, \ldots, P,$

where $C > 0$ sets the trade-off between the margin and the slack penalty.

3.2 Algorithm

With these definitions, we introduce the M_CP algorithm with slack variables below.

Input: $\delta$ (tolerance), $C$ (slack penalty), manifolds $M^p$ with labels $y_p$ and centers $\bar{x}^p$, $p = 1, \ldots, P$.

1. Initialize $k = 0$ and the set $T^0$ with at least one sample from each manifold $M^p$.

2. Solve for $w_k$, $\xi^k$ in $\mathrm{SVM}_k$: $\min_{w, \xi} \frac{1}{2}\|w\|^2 + C \sum_p \xi_p$ s.t. $y_i \langle w, x_i \rangle \ge 1 - \xi_{p_i}$ for all $(x_i, y_i) \in T^k$, and $y_p \langle w, \bar{x}^p \rangle \ge 1$, $\xi_p \ge 0$ for all $p$.

3. Find a point $x \in M^p$ among the manifolds with slack violation larger than the tolerance $\delta$: $y_p \langle w_k, x \rangle < 1 - \xi_p^k - \delta$.

4. If there is no such point, then stop. Else, augment the point set: $T^{k+1} = T^k \cup \{(x, y_p)\}$.

5. $k \leftarrow k + 1$ and go to 2.

Algorithm 2 Pseudocode for the M_CP algorithm with manifold slack variables.

The proposed algorithm modifies the cutting plane approach to solve this semi-infinite quadratic program. Each iteration involves a finite set $T^k = \{(x_i, y_i)\}$ with $|T^k|$ examples that is used to define the following soft margin SVM, $\mathrm{SVM}_k$:

$\min_{w, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{p} \xi_p \quad \text{s.t.} \quad y_i \langle w, x_i \rangle \ge 1 - \xi_{p_i} \;\; \forall (x_i, y_i) \in T^k, \quad y_p \langle w, \bar{x}^p \rangle \ge 1, \quad \xi_p \ge 0 \;\; \forall p,$

to obtain the weights $w_k$ and slacks $\xi^k$ at each iteration; we denote the corresponding objective value by $\Theta_k = \frac{1}{2}\|w_k\|^2 + C \sum_p \xi_p^k$. We then find a point $x$ from one of the manifolds $M^p$ so that:

$y_p \langle w_k, x \rangle < 1 - \xi_p^k - \delta$   (5)

where $\delta > 0$ is the required tolerance. If there is no such point, the algorithm terminates. Otherwise, the point is added as a training example to the set $T^{k+1}$, and the algorithm proceeds to solve $\mathrm{SVM}_{k+1}$ to obtain $w_{k+1}$ and $\xi^{k+1}$.
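To make the per-iteration problem concrete, here is a minimal Python sketch of $\mathrm{SVM}_k$ using the cvxpy solver, with one slack variable per manifold and strict margin constraints on the center points, following the formulation above. The function name and argument layout are illustrative assumptions.

```python
import numpy as np
import cvxpy as cp

def solve_svm_k(X, y, point_manifold, centers, center_labels, C=1.0):
    """One iteration's finite QP (SVM_k) for M_CP with manifold slacks.

    X, y            : current training points (list of arrays) and their +/-1 labels
    point_manifold  : point_manifold[i] = index p of the manifold that X[i] came from
    centers         : one center/representative point per manifold
    center_labels   : +/-1 label of each manifold
    Returns (w, xi) : weight vector and one slack value per manifold.
    """
    N, P = centers[0].shape[0], len(centers)
    w = cp.Variable(N)
    xi = cp.Variable(P, nonneg=True)              # one slack per manifold, xi_p >= 0

    constraints = []
    # Manifold points share the slack variable of their manifold.
    for x_i, y_i, p_i in zip(X, y, point_manifold):
        constraints.append(y_i * (x_i @ w) >= 1 - xi[p_i])
    # Center points are required to strictly obey the margin (no slack).
    for xbar_p, y_p in zip(centers, center_labels):
        constraints.append(y_p * (xbar_p @ w) >= 1)

    objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
    cp.Problem(objective, constraints).solve()
    return w.value, xi.value
```

The outer cutting-plane loop is the same as in the hard-margin sketch, except that the oracle's violation test uses $1 - \xi_p^k - \delta$ as the threshold for manifold $p$.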

3.3 Convergence of M_CP with Slack Variables

Here we show that the objective function is guaranteed to increase by a finite amount with each iteration. This result is similar to tsochantaridis2005large , but here we present statements in the primal domain over an infinite number of examples. More detailed proofs can be found in SM.

Lemma 6.

The change in the weights and slacks satisfies:

$\langle w_k, \Delta w_k \rangle + C \sum_p \Delta\xi_p \ge 0,$   (6)

where $\Delta w_k = w_{k+1} - w_k$ and $\Delta\xi_p = \xi_p^{k+1} - \xi_p^k$.

Proof.

Define $w(\beta) = w_k + \beta \Delta w_k$ and $\xi_p(\beta) = \xi_p^k + \beta \Delta\xi_p$. Then for all $0 \le \beta \le 1$, $w(\beta)$ and $\xi(\beta)$ satisfy the constraints for $\mathrm{SVM}_k$. The resulting change in the objective function is given by:

$\Theta(w(\beta), \xi(\beta)) - \Theta(w_k, \xi^k) = \beta\left(\langle w_k, \Delta w_k \rangle + C \sum_p \Delta\xi_p\right) + \frac{1}{2}\beta^2 \|\Delta w_k\|^2.$   (7)

If (6) is not satisfied, then there is some $0 < \beta \le 1$ such that $\Theta(w(\beta), \xi(\beta)) < \Theta(w_k, \xi^k)$, which contradicts the fact that $w_k$ and $\xi^k$ are a solution to $\mathrm{SVM}_k$.

We next show that the added point at each iteration must be a support vector for the new weights and slacks:

Lemma 7.

In each iteration of the M_CP algorithm, the added point $x \in M^{p^*}$ must be a support vector for the new weights and slacks, s.t.

$y_{p^*} \langle w_{k+1}, x \rangle = 1 - \xi_{p^*}^{k+1},$   (8)
$y_{p^*} \langle \Delta w_k, x \rangle + \Delta\xi_{p^*} > \delta.$   (9)
Proof.

Suppose $y_{p^*} \langle w_{k+1}, x \rangle > 1 - \xi_{p^*}^{k+1}$ for the added point $x$. Then we can choose $\beta < 1$ so that $w(\beta)$ and $\xi(\beta)$ still satisfy the constraints for $\mathrm{SVM}_{k+1}$. But, from Lemma 6, we would have $\Theta(w(\beta), \xi(\beta)) < \Theta(w_{k+1}, \xi^{k+1})$, which contradicts the fact that $w_{k+1}$ and $\xi^{k+1}$ are a solution to $\mathrm{SVM}_{k+1}$. Thus, $y_{p^*} \langle w_{k+1}, x \rangle = 1 - \xi_{p^*}^{k+1}$ and the point must be a support vector for $\mathrm{SVM}_{k+1}$. (9) results from subtracting (5) from (8).

We also derive a bound on the following quadratic function over nonnegative values:

Lemma 8.

Given $A, B > 0$, then

$\max_{0 \le \beta \le 1} \left( A\beta - \tfrac{1}{2} B \beta^2 \right) \ge \tfrac{1}{2} \min\left( A, \tfrac{A^2}{B} \right).$   (10)
Proof.

The unconstrained maximum occurs when $\beta = A/B$. When $A/B \ge 1$, then $\beta = 1$ and the maximum is $A - \tfrac{B}{2} \ge \tfrac{A}{2}$. When $A/B < 1$, the maximum occurs at $\beta = A/B$ and equals $\tfrac{A^2}{2B}$. Thus, the lower bound is the smaller of these two values.

Using the lemmas above, the lower bound on the change in the objective function can be found:

Theorem 9.

In each iteration of the M_CP algorithm, the increase in the objective function, defined as $\Delta\Theta_k = \Theta_{k+1} - \Theta_k$, is lower bounded by

(11)
Proof.

We first note that the objective function is strictly increasing:

$\Delta\Theta_k > 0.$   (12)

This can be seen immediately from Lemma 6 when $\Delta w_k \ne 0$, since $\Delta\Theta_k \ge \frac{1}{2}\|\Delta w_k\|^2$. On the other hand, if $\Delta w_k = 0$, we know that $\Delta\xi_{p^*} > \delta$ from Lemma 7, and $\Delta\xi_p \ge 0$ for all $p$ since $\xi^k$ is the minimal slack solution for $\mathrm{SVM}_k$. So for $\Delta w_k = 0$, $\Delta\Theta_k > C\delta$. To compute a general lower bound on the increase in the objective function, we proceed as follows.

The added point $x$ comes from a particular manifold $M^{p^*}$. If $\Delta\xi_{p^*} \le 0$, from Lemma 7 we have $y_{p^*} \langle \Delta w_k, x \rangle > \delta$. Then, by the Cauchy-Schwarz inequality, $\|\Delta w_k\| > \delta / L$, which yields $\Delta\Theta_k \ge \frac{1}{2}\|\Delta w_k\|^2 > \frac{\delta^2}{2L^2}$.

We next analyze the case $\Delta\xi_{p^*} > 0$ and consider the finite set of points in $T^k$ that come from the manifold $M^{p^*}$. There must be at least one such point in $T^k$ by the initialization of the algorithm. Each of these points obeys the constraints:

(13)
(14)
(15)

We consider the minimum value of the corresponding thresholds, denoted $F$. We have two possibilities: $F$ is positive, so that none of these points are support vectors for $\mathrm{SVM}_k$, or $F = 0$, so that at least one support vector from $M^{p^*}$ lies in $T^k$.

Case $F > 0$:

In this case, we define a linear set of slack variables:

(16)

and weights $w(\beta) = w_k + \beta \Delta w_k$. Then, for $\beta$ in a finite range, $w(\beta)$ and $\xi(\beta)$ will satisfy the constraints for $\mathrm{SVM}_k$. Following reasoning similar to Lemma 6, this implies

(17)

Then in this case, we have

(18)
(19)
(20)

by applying (17) in (18), Lemma 7 and the Cauchy-Schwarz inequality in (19), and Lemma 8 in (20).

Case $F = 0$:

In this case, we consider the minimal increase among the support vectors from $M^{p^*}$. We then define

(21)

and weights $w(\beta) = w_k + \beta \Delta w_k$. There will then be a finite range of $\beta$ for which $w(\beta)$ and $\xi(\beta)$ satisfy the constraints for $\mathrm{SVM}_k$, so that

(22)
(23)

We also have a support vector so that

(24)
(25)

Using Lemma 7 and Cauchy-Schwarz, we get:

(26)
(27)

Thus, we have:

(28)
(29)
(30)
(31)

by applying (23) and (27) to obtain the final bound.

Since the algorithm is guaranteed to increase the objective by a finite amount at each iteration, it will terminate in a finite number of iterations for any positive tolerance $\delta$.

Corollary 10.

For a given tolerance $\delta$, the M_CP algorithm will terminate after a finite number of iterations that grows at most linearly with $P$, the number of manifolds, where $L$ bounds the norm of the points on the manifolds.

Proof.

A weight vector that satisfies the center constraints, together with sufficiently large slacks, is a feasible solution for the slack QSIP. Therefore, the optimal objective function is upper bounded by a quantity that grows linearly with $P$. The upper bound on the number of iterations is then provided by Theorem 9.

We can also bound the error in the objective function after M_CP terminates:

Corollary 11.

With tolerance $\delta$, after M_CP terminates with solution $w$, slacks $\xi_p$, and objective value $\Theta$, the optimal value $\Theta^*$ of the slack QSIP is bracketed by:

$\Theta \le \Theta^* \le \Theta + C P \delta.$   (32)
Proof.

The lower bound on $\Theta^*$ is apparent since the slack QSIP includes constraints for all $x \in M^p$, which contain those of the final $\mathrm{SVM}_k$. Setting the slacks to $\xi_p + \delta$ will make the terminal solution feasible for the slack QSIP, resulting in the upper bound.

4 Experiments

4.1 Ellipsoidal Manifolds

As an illustration of our method, we have generated $D$-dimensional $\ell_2$-norm ellipsoids embedded in $\mathbb{R}^N$ with random radii, centers, and directions. The points on each manifold are parameterized as $x(s) = x_0 + \sum_{i=1}^{D} s_i u_i$ with $\sum_{i=1}^{D} s_i^2 / R_i^2 \le 1$, where the center $x_0$ and basis vectors $u_i$ are random Gaussian vectors and the ellipsoidal radii $R_i$ are sampled from a distribution with mean $\bar{R}$.
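For concreteness, the following Python sketch generates one such random ellipsoid and implements a closed-form worst-margin point for it under the parameterization above. The function names, the $1/\sqrt{N}$ scaling of the directions, and the exponential radii distribution are illustrative assumptions rather than the exact settings used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_ellipsoid(N, D, mean_radius=1.0):
    """Random D-dimensional ellipsoid in R^N:
    x(s) = x0 + sum_i s_i * u_i,  with  sum_i s_i^2 / R_i^2 <= 1."""
    x0 = rng.normal(size=N)                      # random center
    U = rng.normal(size=(D, N)) / np.sqrt(N)     # random Gaussian directions (rows u_i)
    R = rng.exponential(mean_radius, size=D)     # radii with the given mean (assumed distribution)
    return x0, U, R

def worst_margin_point(w, y, ellipsoid):
    """Point x on the ellipsoid minimizing the margin y * <w, x>.
    Writing a_i = R_i * y * <w, u_i>, the minimizer is s_i = -R_i * a_i / ||a||,
    so the worst-margin point is available in closed form."""
    x0, U, R = ellipsoid
    a = R * (y * (U @ w))
    s = -R * a / (np.linalg.norm(a) + 1e-12)
    return x0 + s @ U
```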

We compare the performance of M_CP to a conventional point SVM trained on samples drawn from the surface of the ellipsoids, with separately drawn samples used as test examples. Performance is measured by generalization error on the task of classifying points from positively labeled manifolds versus negatively labeled ones, as a function of the number of training samples used during learning.
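A sketch of this data-augmentation baseline (using scikit-learn's LinearSVC as the point SVM; the sampling helper, the sample counts, and the value of C are illustrative assumptions) might look like:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)

def sample_surface_point(ellipsoid):
    """Random point on the surface of an ellipsoid in the (x0, U, R) format above
    (not uniform in surface measure, which suffices for this illustration)."""
    x0, U, R = ellipsoid
    t = rng.normal(size=R.shape[0])
    s = R * t / np.linalg.norm(t)
    return x0 + s @ U

def point_svm_error(ellipsoids, labels, n_train, n_test):
    """Data-augmentation baseline: fit a linear SVM on sampled points and
    report the generalization error on freshly sampled points."""
    X_tr, y_tr, X_te, y_te = [], [], [], []
    for ell, lab in zip(ellipsoids, labels):
        X_tr += [sample_surface_point(ell) for _ in range(n_train)]
        y_tr += [lab] * n_train
        X_te += [sample_surface_point(ell) for _ in range(n_test)]
        y_te += [lab] * n_test
    clf = LinearSVC(C=1e3).fit(np.array(X_tr), y_tr)   # large C approximates a hard margin
    return 1.0 - clf.score(np.array(X_te), y_te)
```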

For these manifolds, the separation oracle of M_CP returns points that minimize the margin $y_p \langle w, x \rangle$ over each ellipsoid, which can be computed in closed form.