# Margins, Kernels and Non-linear Smoothed Perceptrons

We focus on the problem of finding a non-linear classification function that lies in a Reproducing Kernel Hilbert Space (RKHS) both from the primal point of view (finding a perfect separator when one exists) and the dual point of view (giving a certificate of non-existence), with special focus on generalizations of two classical schemes - the Perceptron (primal) and Von-Neumann (dual) algorithms. We cast our problem as one of maximizing the regularized normalized hard-margin (ρ) in an RKHS and in terms of a Mahalanobis dot-product/semi-norm associated with the kernel's (normalized and signed) Gram matrix. We derive an accelerated smoothed algorithm with a convergence rate of √(log n)/ρ given n separable points, which is strikingly similar to the classical kernelized Perceptron algorithm whose rate is 1/ρ². When no such classifier exists, we prove a version of Gordan's separation theorem for RKHSs, and give a reinterpretation of negative margins. This allows us to give guarantees for a primal-dual algorithm that halts in max{√(log n)/|ρ|, √(log n)/ε} iterations with a perfect separator in the RKHS if the primal is feasible, or a dual ε-certificate of near-infeasibility.


## 1 Introduction

We are interested in the problem of finding a non-linear separator for a given set of points x_i ∈ ℝ^d with labels y_i ∈ {−1, +1}. Finding a linear separator can be stated as the problem of finding a unit vector w ∈ ℝ^d (if one exists) such that for all i,

 y_i(w⊤x_i) ≥ 0,  i.e.  sign(w⊤x_i) = y_i. (1)

This is called the primal problem. In the more interesting non-linear setting, we will be searching for functions f in a Reproducing Kernel Hilbert Space (RKHS) F_K associated with a kernel K (to be defined later) such that for all i,

 y_i f(x_i) ≥ 0. (2)

We say that problems (1), (2) have an unnormalized margin ρ > 0 if there exists a unit vector w (respectively, a unit-norm function f) such that for all i,

 y_i(w⊤x_i) ≥ ρ  or  y_i f(x_i) ≥ ρ.

True to the paper’s title, margins of non-linear separators in an RKHS will be a central concept, and we will derive interesting smoothed accelerated variants of the Perceptron algorithm that have convergence rates (for the aforementioned primal and a dual problem introduced later) that are inversely proportional to the RKHS-margin as opposed to inverse squared margin for the Perceptron.

The linear setting is well known by the name of linear feasibility problems - we are asking if there exists any vector w which makes an acute angle with all the vectors y_i x_i, i.e.

 (XY)⊤w > 0_n, (3)

where X = [x_1, ..., x_n] and Y = diag(y). This can be seen as finding a vector inside the dual cone of cone{y_i x_i}.

When normalized, as we will see in the next section, the margin is a well-studied notion of conditioning for these problems. It can be thought of as the width of the feasibility cone as in Freund and Vera (1999), a radius of well-posedness as in Cheung and Cucker (2001), and its inverse can be seen as a special case of a condition number defined by Renegar (1995) for these systems.

### 1.1 Related Work

In this paper we focus on the famous Perceptron algorithm from Rosenblatt (1958) and the less-famous Von-Neumann algorithm from Dantzig (1992) that we introduce in later sections. As mentioned by Epelman and Freund (2000), in a technical report by the same name, Nesterov pointed out in a note to the authors that the latter is a special case of the now-popular Frank-Wolfe algorithm.

Our work builds on Soheili and Peña (2012, 2013b) from the field of optimization - we generalize the setting to learning functions in RKHSs, extend the algorithms, simplify proofs, and simultaneously bring new perspectives to it. There is extensive literature around the Perceptron algorithm in the learning community; we restrict ourselves to discussing only a few directly related papers, in order to point out the several differences from existing work.

We provide a general unified proof in the Appendix which borrows ideas from accelerated smoothing methods developed by Nesterov (2005) - while this algorithm and others by Nemirovski (2004), Saha et al. (2011) can achieve similar rates for the same problem, those algorithms do not possess the simplicity of the Perceptron or Von-Neumann algorithms and our variants, and also don’t look at the infeasible setting or primal-dual algorithms.

Accelerated smoothing techniques have also been seen in the learning literature, as in Tseng (2008) and many others. However, most of these deal with convex-concave problems where both sets involved are probability simplices (as in game theory, boosting, etc.), while we deal with hard margins where one of the sets is a unit ball. Hence, their algorithms/results do not extend trivially to ours. This work is also connected to the idea of ε-coresets by Clarkson (2010), though we will not explore that angle.

A related algorithm is the Winnow by Littlestone (1991) - it works on the ℓ1-margin and is a saddle point problem over two simplices. One can ask whether such accelerated smoothed versions exist for the Winnow. The answer is in the affirmative - however, such algorithms look completely different from the Winnow, while in our setting the new algorithms retain the simplicity of the Perceptron.

### 1.2 Paper Outline

Sec.2 will introduce the Perceptron and Normalized Perceptron algorithm and their convergence guarantees for linear separability, with specific emphasis on the unnormalized and normalized margins. Sec.3 will then introduce RKHSs and the Normalized Kernel Perceptron algorithm, which we interpret as a subgradient algorithm for a regularized normalized hard-margin loss function.

Sec.4 describes the Smoothed Normalized Kernel Perceptron algorithm that works with a smooth approximation to the original loss function, and outlines the argument for its faster convergence rate. Sec.5 discusses the non-separable case and the Von-Neumann algorithm, and we prove a version of Gordan’s theorem in RKHSs.

We finally give an algorithm in Sec.6 which terminates with a separator if one exists, and with a dual certificate of near-infeasibility otherwise, in time inversely proportional to the margin. Sec.7 has a discussion and some open problems.

## 2 Linear Feasibility Problems

### 2.1 Perceptron

The classical perceptron algorithm can be stated in many ways; one form is the following: start with w = 0 and, whenever some example x_i is misclassified, update w ← w + y_i x_i.

It comes with the following classic guarantee, as proved by Block (1962) and Novikoff (1962): if there exists a unit vector u such that y_i(u⊤x_i) ≥ ρ for all i, then a perfect separator will be found in at most (max_i ‖x_i‖_2 / ρ)² iterations/mistakes.
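The update above can be sketched in a few lines; this is a minimal illustration (the function name, stopping cap, and toy data are our own, not from the paper):

```python
import numpy as np

def perceptron(X, y, max_iter=1000):
    """Classical Perceptron: update w on any misclassified point.

    X: (n, d) data matrix, y: (n,) labels in {-1, +1}.
    Returns w; if a separator exists, sign(X @ w) == y on exit.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_iter):
        margins = y * (X @ w)
        bad = np.where(margins <= 0)[0]
        if len(bad) == 0:
            return w          # perfect separator found
        i = bad[0]            # any misclassified point works
        w = w + y[i] * X[i]
    return w

# Toy separable data in R^2
X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [-0.5, -2.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
assert np.all(np.sign(X @ w) == y)
```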

The algorithm works when updated with any arbitrary point that is misclassified; it has the same guarantees when w is updated with the point that is misclassified by the largest amount, argmin_i y_i(w⊤x_i). Alternately, one can define the probability distribution over examples

 p(w) = argmin_{p∈Δn} ⟨YX⊤w, p⟩, (4)

where Δn is the n-dimensional probability simplex.

Intuitively, p(w) picks out the examples that have the lowest margin when classified by w. One can also normalize the updates so that we maintain a probability distribution over the examples used for updates from the start, as seen below:

#### Remark.

Normalized Perceptron has the same guarantees as perceptron - the Perceptron can perform its update online on any misclassified point, while the Normalized Perceptron performs updates on the most misclassified point(s), and yet there does not seem to be any change in performance. However, we will soon see that the ability to see all the examples at once gives us much more power.
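The Normalized Perceptron just described can be sketched as follows, maintaining a convex combination of the (most misclassified) examples; the step-size schedule and names below are our choices, not prescribed by the paper:

```python
import numpy as np

def normalized_perceptron(X, y, T=200):
    """Normalized Perceptron sketch: the iterate is a convex combination
    of the y_i x_i, updated toward the currently worst-classified point."""
    n, d = X.shape
    Z = y[:, None] * X            # rows are y_i x_i
    alpha = np.full(n, 1.0 / n)   # distribution over examples
    for k in range(1, T + 1):
        w = Z.T @ alpha
        i = np.argmin(y * (X @ w))       # most misclassified point
        theta = 1.0 / (k + 1)            # step size (our choice)
        e = np.zeros(n); e[i] = 1.0
        alpha = (1 - theta) * alpha + theta * e
    return Z.T @ alpha, alpha

X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [-0.5, -2.0]])
y = np.array([1, 1, -1, -1])
w, alpha = normalized_perceptron(X, y)
assert abs(alpha.sum() - 1.0) < 1e-9    # alpha stays a distribution
assert np.all(y * (X @ w) > 0)          # separates the toy data
```

Note that, unlike the classical version, the weights over examples are explicitly a probability distribution at every iteration.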

### 2.2 Normalized Margins

If we normalize the data points by their ℓ2 norms, the resulting mistake bound of the perceptron algorithm is slightly different. Let X̃2 represent the matrix with columns x_i/‖x_i‖_2. Define the unnormalized and normalized margins as

 ρ := sup_{‖w‖2=1} inf_{p∈Δn} ⟨YX⊤w, p⟩,  ρ2 := sup_{‖w‖2=1} inf_{p∈Δn} ⟨YX̃⊤2 w, p⟩.

#### Remark.

Note that we take an infimum over p ∈ Δn in the definition; this is equivalent to taking min_i y_i(w⊤x_i), and the margin is positive iff the problem is feasible.

Normalized Perceptron has the following guarantee on normalized data: if ρ2 > 0, then it finds a perfect separator in 1/ρ2² iterations.

#### Remark.

Consider the max-margin separator u* for the unnormalized problem (which is also a valid perfect separator for the normalized problem). Then

 ρ/max_i‖x_i‖2 = min_i (y_i x_i⊤u* / max_j‖x_j‖2) ≤ min_i (y_i x_i⊤u* / ‖x_i‖2) ≤ sup_{‖u‖2=1} min_i (y_i x_i⊤u / ‖x_i‖2) = ρ2.

Hence, it is always better to normalize the data, as pointed out in Graepel et al. (2001). This idea extends to RKHSs, motivating the normalized Gram matrix considered later.

Example. Consider a simple example in ℝ². Assume that the +1 points are located along the line , and the −1 points along , for , where . The max-margin linear separator will be . If all the data were normalized to have unit Euclidean norm, then all the +1 points would be at and all the −1 points at , giving us a normalized margin of . Unnormalized, the margin is and . Hence, in terms of bounds, we get a discrepancy of , which can be arbitrarily large.

Winnow. The question arises as to which norm we should normalize by. There is a now-classic algorithm in machine learning, called Winnow by Littlestone (1991), or Multiplicative Weights. It works on a slight transformation of the problem where we only need to search for w ∈ Δd. It comes with some very well-known guarantees: if there exists a u ∈ Δd such that y_i(u⊤x_i) ≥ ρ for all i, then feasibility is guaranteed in O(log d / ρ²) iterations. The appropriate notion of normalized margin here is

 ρ1 := max_{w∈Δd} min_{p∈Δn} ⟨YX̃⊤∞ w, p⟩,

where X̃∞ is the matrix with columns x_i/‖x_i‖∞. Then, the appropriate iteration bound is O(log d / ρ1²). We will return to this ℓ1-margin in the discussion section. In the next section, we will normalize using the kernel appropriately.
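The multiplicative update behind Winnow can be sketched as follows; the learning rate, stopping rule, and data are our own illustrative choices:

```python
import numpy as np

def winnow(X, y, eta=0.5, T=500):
    """Winnow-style multiplicative-weights update: w stays in the simplex."""
    n, d = X.shape
    w = np.full(d, 1.0 / d)
    for _ in range(T):
        margins = y * (X @ w)
        i = int(np.argmin(margins))
        if margins[i] > 0:
            return w                       # feasible point found
        w = w * np.exp(eta * y[i] * X[i])  # multiplicative update
        w = w / w.sum()                    # renormalize onto the simplex
    return w

# Separable with a nonnegative-weight separator: feature 0 predicts the label
X = np.array([[1.0, -1.5], [0.8, -1.2], [-1.0, 1.4], [-0.9, 1.1]])
y = np.array([1, 1, -1, -1])
w = winnow(X, y)
assert np.all(y * (X @ w) > 0)
```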

## 3 Kernels and RKHSs

The theory of Reproducing Kernel Hilbert Spaces (RKHSs) has a rich history; for a detailed introduction, refer to Schölkopf and Smola (2002). Let K : X × X → ℝ be a symmetric positive definite kernel, giving rise to a Reproducing Kernel Hilbert Space F_K with an associated feature mapping at each point x called φ_x, where φ_x = K(x, ·), i.e. φ_x(z) = K(x, z). F_K has an associated inner product ⟨·,·⟩_K satisfying ⟨φ_x, φ_z⟩_K = K(x, z). For any f ∈ F_K, we have f(x) = ⟨f, φ_x⟩_K.

Define the normalized feature map

 φ̃_x = φ_x / √K(x,x) ∈ F_K  and  φ̃_X := [φ̃_{x_i}]_{i=1}^n.

For any function f ∈ F_K, we use the following notation:

 Yf̃(X) := ⟨f, Yφ̃_X⟩_K = [y_i ⟨f, φ̃_{x_i}⟩_K]_{i=1}^n = [y_i f(x_i)/√K(x_i,x_i)]_{i=1}^n.

We analogously define the normalized margin here to be

 ρK := sup_{‖f‖K=1} inf_{p∈Δn} ⟨Yf̃(X), p⟩. (5)

Consider the following regularized empirical loss function

 L(f) = sup_{p∈Δn} ⟨−Yf̃(X), p⟩ + ½‖f‖²K. (6)

Denoting t = ‖f‖K and writing f = t f̄ with ‖f̄‖K = 1, let us calculate the minimum value of this function:

 inf_{f∈FK} L(f) = inf_{t≥0} inf_{‖f̄‖K=1} sup_{p∈Δn} ⟨−⟨t f̄, Yφ̃_X⟩K, p⟩ + ½t² (7)
 = inf_{t≥0} {−tρK + ½t²}
 = −½ρ²K  when t = ρK > 0.

Since the first term is an empirical loss function on the data and the second is an increasing function of ‖f‖K, the Representer Theorem (Schölkopf et al., 2001) implies that the minimizer of the above function lies in the span of the φ_{x_i}s (equivalently, the span of the φ̃_{x_i}s). Explicitly, for some α ∈ ℝⁿ,

 argmin_{f∈FK} L(f) = Σ_{i=1}^n α_i y_i φ̃_{x_i} = ⟨Yφ̃_X, α⟩. (8)

Substituting this back into Eq.(6), we can define

 L(α) := sup_{p∈Δn} ⟨−α, p⟩_G + ½‖α‖²_G, (9)

where G is the normalized signed Gram matrix of the data,

 G_ji = G_ij := y_i y_j K(x_i,x_j) / √(K(x_i,x_i) K(x_j,x_j)) = ⟨y_i φ̃_{x_i}, y_j φ̃_{x_j}⟩_K,

with ⟨α, p⟩_G := α⊤Gp and ‖α‖²_G := α⊤Gα. One can verify that G is a PSD matrix and the G-norm is a semi-norm, whose properties are of great importance to us.
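Forming G from a kernel is mechanical; here is a minimal sketch (the RBF kernel choice and the random data are our own, used only to check symmetry, positive semi-definiteness, and the unit diagonal numerically):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """K(x, x') = exp(-gamma * ||x - x'||^2)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def normalized_signed_gram(X, y, kernel=rbf_kernel):
    """G_ij = y_i y_j K(x_i, x_j) / sqrt(K(x_i, x_i) K(x_j, x_j))."""
    K = kernel(X, X)
    d = np.sqrt(np.diag(K))
    return (y[:, None] * y[None, :]) * K / np.outer(d, d)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
y = np.array([1, 1, 1, -1, -1, -1])
G = normalized_signed_gram(X, y)

assert np.allclose(G, G.T)                      # symmetric
assert np.min(np.linalg.eigvalsh(G)) > -1e-10   # positive semi-definite
assert np.allclose(np.diag(G), 1.0)             # normalization: G_ii = 1
```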

### 3.1 Some Interesting and Useful Lemmas

The first lemma justifies our algorithms’ exit condition.

###### Lemma 1.

Gα > 0 implies that f_α := ⟨Yφ̃_X, α⟩ is a perfect classifier, and there exists a perfect classifier iff ρK > 0.

###### Proof.

(⇒) f_α is perfect since

 y_j f_α(x_j)/√K(x_j,x_j) = Σ_{i=1}^n α_i y_i y_j K(x_i,x_j)/√(K(x_i,x_i)K(x_j,x_j)) = G_j α > 0.

(⇐) If a perfect classifier exists, then ρK > 0 by definition and

 L(f*) = L(α*) = −½ρ²K < 0  ⇒  Gα* > 0,

where f*, α* are the optimizers of L. ∎

The second lemma bounds the G-norm of vectors.

###### Lemma 2.

For any α ∈ ℝⁿ, ‖α‖_G ≤ ‖α‖_1.

###### Proof.

Using the triangle inequality of norms, we get

 √(α⊤Gα) = √(⟨⟨α, Yφ̃_X⟩, ⟨α, Yφ̃_X⟩⟩_K) = ‖Σ_i α_i y_i φ̃_{x_i}‖_K ≤ Σ_i ‖α_i y_i φ̃_{x_i}‖_K ≤ Σ_i |α_i| ‖φ_{x_i}/√K(x_i,x_i)‖_K = Σ_i |α_i|,

where we used ‖φ_{x_i}‖_K = √K(x_i,x_i), so each normalized feature vector has unit norm. ∎
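A quick numerical sanity check of this bound (made-up data; a linear kernel for simplicity — our choices, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
y = np.array([1, -1, 1, -1, 1])

# Linear kernel: K(x, x') = x . x'
K = X @ X.T
d = np.sqrt(np.diag(K))
G = (np.outer(y, y) * K) / np.outer(d, d)

for _ in range(100):
    a = rng.normal(size=5)
    # clamp tiny negative round-off before the square root (G is PSD)
    g_norm = np.sqrt(max(a @ G @ a, 0.0))
    assert g_norm <= np.abs(a).sum() + 1e-9   # Lemma 2: ||a||_G <= ||a||_1
```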

The third lemma gives a new perspective on the margin.

###### Lemma 3.

When ρK > 0, a unit-norm f̄ maximizes the margin iff ρK f̄ optimizes L. Hence, the margin is equivalently

 ρK = sup_{‖α‖_G=1} inf_{p∈Δn} ⟨α, p⟩_G ≤ ‖p‖_G  for all p ∈ Δn.
###### Proof.

Let f̄* be any function with ‖f̄*‖K = 1 that achieves the max-margin ρK. Then it is easy to plug ρK f̄* into Eq. (6) and verify that L(ρK f̄*) = −½ρ²K, and hence ρK f̄* minimizes L.

Similarly, let f* be any function that minimizes L, i.e. achieves the value −½ρ²K. Defining t = ‖f*‖K and examining Eq. (7), we see that f* cannot achieve the value −½ρ²K unless t = ρK and f*/t achieves the max-margin.

Hence considering only unit-norm functions, i.e. ‖α‖_G = 1, is acceptable for both. Plugging this into Eq. (5) gives the equality, and for any fixed p ∈ Δn,

 ρK = inf_{p′∈Δn} sup_{‖α‖_G=1} ⟨α, p′⟩_G ≤ sup_{‖α‖_G=1} ⟨α, p⟩_G ≤ ‖p‖_G  by applying Cauchy-Schwarz

(this can also be seen by going back to function space). ∎

## 4 Smoothed Normalized Kernel Perceptron

Define the distribution over the worst-classified points

 p(f) := argmin_{p∈Δn} ⟨Yf̃(X), p⟩  or  p(α) := argmin_{p∈Δn} ⟨α, p⟩_G. (10)

The Normalized Kernel Perceptron update can then be written implicitly as

 f_{k+1} = (1−θ_k) f_k + θ_k ⟨Yφ̃_X, p(f_k)⟩ = f_k − θ_k (f_k − ⟨Yφ̃_X, p(f_k)⟩) = f_k − θ_k ∂L(f_k),

and hence the Normalized Kernel Perceptron (NKP) is a subgradient algorithm to minimize L from Eq. (6).

Remark. Lemma 3 yields deep insights. Since NKP can get arbitrarily close to the minimizer of the strongly convex L, it also gets arbitrarily close to a margin maximizer. It is known that it finds a perfect classifier in 1/ρ²K iterations - we now additionally infer that it will continue to improve beyond that point, converging to an approximate max-margin classifier. While both the classical and normalized Perceptrons find perfect classifiers in the same time, only the latter is guaranteed to keep improving.

Remark. p(α) is always a probability distribution. Curiously, a guarantee that the solution will lie in Δn is not made by the Representer Theorem in Eq. (8) - any α ∈ ℝⁿ could satisfy Lemma 1. However, since NKP is a subgradient method for minimizing Eq. (6), we know that we will approach the optimum while only choosing α ∈ Δn.

Define the smooth minimizer analogous to Eq. (10) as

 p_μ(α) := argmin_{p∈Δn} {⟨α, p⟩_G + μ d(p)} = e^{−Gα/μ} / ‖e^{−Gα/μ}‖_1,  where  d(p) := Σ_i p_i log p_i + log n (12)

is 1-strongly convex with respect to the ℓ1-norm (Nesterov, 2005).

Define a smoothed loss function analogous to Eq. (9):

 L_μ(α) = sup_{p∈Δn} {−⟨α, p⟩_G − μ d(p)} + ½‖α‖²_G.

Note that the maximizer above is precisely p_μ(α).
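Since the entropy-smoothed minimizer has the closed form above, it is just a softmax of −Gα/μ; a numerically stable sketch (names and test values are our own):

```python
import numpy as np

def p_mu(G, alpha, mu):
    """Entropy-smoothed argmin over the simplex:
    p_mu(alpha) = exp(-G @ alpha / mu) / ||exp(-G @ alpha / mu)||_1."""
    z = -G @ alpha / mu
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

G = np.array([[1.0, 0.2], [0.2, 1.0]])
alpha = np.array([0.7, 0.3])
p_small = p_mu(G, alpha, mu=1e-3)   # small mu: concentrates on argmin of G @ alpha
p_large = p_mu(G, alpha, mu=1e3)    # large mu: nearly uniform

assert abs(p_small.sum() - 1.0) < 1e-9
assert np.allclose(p_large, 0.5, atol=1e-3)
assert p_small[1] > 0.99            # index 1 has the smaller (G @ alpha) entry
```

As μ → 0 this recovers the non-smooth argmin p(α); as μ → ∞ it flattens toward the uniform distribution.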

###### Lemma 4 (Lower Bound).

At any step k, we have

 L_{μ_k}(α_k) ≥ L(α_k) − μ_k log n.
###### Proof.

First note that 0 ≤ d(p) ≤ log n on Δn. Also,

 sup_{p∈Δn} {−⟨α, p⟩_G − μ d(p)} ≥ sup_{p∈Δn} {−⟨α, p⟩_G} − sup_{p∈Δn} {μ d(p)}.

Combining these two facts gives us the result. ∎

###### Lemma 5 (Upper Bound).

In any round k, SNKP satisfies

 L_{μ_k}(α_k) ≤ −½‖p_k‖²_G.
###### Proof.

We provide a concise, self-contained and unified proof by induction in the Appendix for Lemma 5 and Lemma 8, borrowing ideas from Nesterov’s excessive gap technique (Nesterov, 2005) for smooth minimization of structured non-smooth functions. ∎

Finally, we combine the above lemmas to get the following theorem about the performance of SNKP.

###### Theorem 1.

The SNKP algorithm finds a perfect classifier, when one exists, in O(√(log n)/ρK) iterations.

###### Proof.

Lemma 4 gives us, for any round k,

 L_{μ_k}(α_k) ≥ L(α_k) − μ_k log n.

From Lemmas 3, 5 we get

 L_{μ_k}(α_k) ≤ −½ p_k⊤ G p_k ≤ −½ρ²K.

Combining the two inequalities, we get

 L(α_k) ≤ μ_k log n − ½ρ²K.

Noting that μ_k decreases at the rate O(1/k²), we see that L(α_k) < 0 (and hence we solve the problem, by Lemma 1) as soon as μ_k log n < ½ρ²K, i.e. after at most O(√(log n)/ρK) steps. ∎
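Putting the pieces together, one plausible rendering of the SNKP loop follows. Caveat: the precise θ_k, μ_k schedule lives in the paper's Appendix; the schedule below (θ_k = 2/(k+3), μ_{k+1} = (1−θ_k)μ_k) is our assumption in the spirit of Nesterov's excessive gap technique, and the data are made up:

```python
import numpy as np

def softmax_p(G, alpha, mu):
    """Smoothed distribution p_mu(alpha) = softmax(-G @ alpha / mu)."""
    z = -G @ alpha / mu
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

def snkp(G, T=500, mu0=1.0):
    """Smoothed Normalized Kernel Perceptron sketch: alpha stays in the
    simplex; exit when G @ alpha > 0 (the condition of Lemma 1)."""
    n = G.shape[0]
    alpha = np.full(n, 1.0 / n)
    mu = mu0
    for k in range(T):
        if np.all(G @ alpha > 0):
            return alpha                    # perfect classifier certified
        theta = 2.0 / (k + 3)               # assumed schedule
        p = softmax_p(G, alpha, mu)
        alpha = (1 - theta) * alpha + theta * p
        mu = (1 - theta) * mu               # mu decays roughly like 1/k^2
    return alpha

# Separable toy problem: linear kernel on separable 2-D data
X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [-0.5, -2.0]])
y = np.array([1, 1, -1, -1])
K = X @ X.T
d = np.sqrt(np.diag(K))
G = (np.outer(y, y) * K) / np.outer(d, d)

alpha = snkp(G)
assert np.all(G @ alpha > 0)   # G alpha > 0 certifies a perfect separator
```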

## 5 Infeasible Problems

What happens when the points are not separable by any function f ∈ F_K? We would like an algorithm that terminates with a solution when there is one, and terminates with a certificate of non-separability if there isn't one. The idea is based on theorems of the alternative like Farkas' Lemma, specifically a version of Gordan's theorem (Chvatal, 1983):

###### Lemma 6 (Gordan’s Thm).

Exactly one of the following two statements can be true:

1. Either there exists a w ∈ ℝ^d such that for all i,

 y_i(w⊤x_i) > 0;

2. Or, there exists a p ∈ Δn such that

 ‖XYp‖2 = 0, (13)

or equivalently, Σ_i p_i y_i x_i = 0.

As mentioned in the introduction, the primal problem can be interpreted as finding a vector in the interior of the dual cone of cone{y_i x_i}, which is infeasible iff the dual cone is flat, i.e. if cone{y_i x_i} is not pointed, which happens when the origin is in the convex hull of the y_i x_i s.

We will generalize the following algorithm for linear feasibility problems, that can be dated back to Von-Neumann, who mentioned it in a private communication with Dantzig, who later studied it himself (Dantzig, 1992).

This algorithm comes with a guarantee: if the problem (3) is infeasible, then the above algorithm will terminate with an ϵ-approximate solution to (13) in O(1/ϵ²) iterations.

Epelman and Freund (2000) proved an incomparable bound - Normalized Von-Neumann (NVN) can compute an ϵ-approximate solution to (13) in O((1/ρ²) log(1/ϵ)) iterations, and can also find a solution to the primal in O(1/ρ²) iterations when it is feasible.
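The Von-Neumann update can be sketched as follows: maintain a convex combination z of the normalized y_i x_i and step toward the point making the most negative dot product with z. The function names, stopping tolerance, and data are our own illustrative choices:

```python
import numpy as np

def von_neumann(X, y, eps=1e-2, T=10000):
    """Von-Neumann sketch: if (3) is infeasible, drives ||sum_i p_i y_i x_i||_2
    below eps, returning an eps-certificate p; p stays in the simplex."""
    n, d = X.shape
    Z = y[:, None] * X
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)  # normalized rows y_i x_i
    p = np.full(n, 1.0 / n)
    for k in range(1, T + 1):
        z = Zn.T @ p
        i = int(np.argmin(Zn @ z))        # most violated point
        if Zn[i] @ z > 0:
            return p, z                   # z separates: all dot products > 0
        lam = 1.0 / (k + 1)               # classical step size choice
        e = np.zeros(n); e[i] = 1.0
        p = (1 - lam) * p + lam * e
        if np.linalg.norm(Zn.T @ p) < eps:
            return p, None                # eps-certificate of infeasibility
    return p, None

# Infeasible instance: the y_i x_i point in exactly opposite directions
X = np.array([[1.0, 0.0], [1.0, 0.0]])
y = np.array([1, -1])
p, z = von_neumann(X, y)
assert np.linalg.norm((y[:, None] * X).T @ p) < 1e-2
```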

We derive a smoothed variant of NVN in the next section, after we prove some crucial lemmas in RKHSs.

### 5.1 A Separation Theorem for RKHSs

While finite dimensional Euclidean spaces come with strong separation guarantees that come under various names like the separating hyperplane theorem, Gordan’s theorem, Farkas’ lemma, etc, the story isn’t always the same for infinite dimensional function spaces which can often be tricky to deal with. We will prove an appropriate version of such a theorem that will be useful in our setting.

What follows is an interesting version of the Hahn-Banach separation theorem, which looks a lot like Gordan's theorem in finite dimensional spaces. The conditions to note here are the strict inequality Gα > 0 in the first alternative and the exact equality ‖p‖_G = 0 in the second.

###### Theorem 2.

Exactly one of the following has a solution:

1. Either there exists f ∈ F_K such that for all i,

 y_i f(x_i)/√K(x_i,x_i) = ⟨f, y_i φ̃_{x_i}⟩_K > 0,  i.e.  Gα > 0;

2. Or there exists p ∈ Δn such that

 Σ_i p_i y_i φ̃_{x_i} = 0 ∈ F_K,  i.e.  ‖p‖_G = 0. (14)
###### Proof.

Consider the following set:

 Q = {(f, t) = (Σ_i p_i y_i φ̃_{x_i}, Σ_i p_i) : p ∈ Δn} = conv[(y_1 φ̃_{x_1}, 1), ..., (y_n φ̃_{x_n}, 1)] ⊆ F_K × ℝ.

If (2) does not hold, then (0, 1) ∉ Q. Since Q is closed and convex, we can find a hyperplane separating (0, 1) from Q; in other words, there exists (f, t) ∈ F_K × ℝ such that

 ⟨(f, t), (g, s)⟩ ≥ 0  ∀(g, s) ∈ Q  and  ⟨(f, t), (0, 1)⟩ < 0.

The second condition immediately yields t < 0. The first condition, when applied to (y_i φ̃_{x_i}, 1) ∈ Q, yields

 ⟨f, y_i φ̃_{x_i}⟩_K + t ≥ 0  ⇔  y_i f(x_i)/√K(x_i,x_i) ≥ −t > 0

since t < 0, which shows that (1) holds.

It is also immediate that if (2) holds, then (1) cannot. ∎

Note that G is positive semi-definite - infeasibility requires both that G is not positive definite, and also that the witness p to p⊤Gp = 0 must be a probability vector. Similarly, while it would suffice that Gα > 0 for some α ∈ ℝⁿ, coincidentally in our case α will also lie in the probability simplex.

### 5.2 The infeasible margin ρK

Note that constraining ‖f‖K = 1 (or ‖α‖_G = 1) in Eq. (5) and Lemma 3 allows ρK to be negative in the infeasible case. If the constraint were instead ‖f‖K ≤ 1, then ρK would always be non-negative, because f = 0 (i.e. α = 0) is always allowed.

So what is ρK when the problem is infeasible? Let

 conv(Yφ̃_X) := {Σ_i p_i y_i φ̃_{x_i} | p ∈ Δn} ⊂ F_K

be the convex hull of the y_i φ̃_{x_i}s.

###### Theorem 3.

When the primal is infeasible, the margin is

 |ρK| = δmax := sup{δ | ‖f‖K ≤ δ ⇒ f ∈ conv(Yφ̃_X)}.

(We thank a reviewer for pointing out that, by this definition, δmax might always be 0 for infinite dimensional RKHSs, because there are always directions perpendicular to the finite-dimensional hull; we conjecture the definition can be altered to restrict attention to the relative interior of the hull, making it non-zero.)
###### Proof.

(1) For the inequality |ρK| ≥ δmax: choose any δ such that ‖f‖K ≤ δ ⇒ f ∈ conv(Yφ̃_X). Given an arbitrary f′ with ‖f′‖K = 1, put f̃ = −δf′.

By our assumption on δ, we have f̃ ∈ conv(Yφ̃_X), implying there exists a p̃ ∈ Δn such that f̃ = ⟨Yφ̃_X, p̃⟩. Also

 ⟨f′, ⟨Yφ̃_X, p̃⟩⟩K = ⟨f′, f̃⟩K = −δ‖f′‖²K = −δ.

Since this holds for a particular p̃, we can infer

 inf_{p∈Δn} ⟨f′, ⟨Yφ̃_X, p⟩⟩K ≤ −δ.

Since this holds for any f′ with ‖f′‖K = 1, we have

 sup_{‖f‖K=1} inf_{p∈Δn} ⟨f, ⟨Yφ̃_X, p⟩⟩K ≤ −δ,  i.e.  |ρK| ≥ δ.

(2) For the inequality |ρK| ≤ δmax: it suffices to show ‖f‖K ≤ |ρK| ⇒ f ∈ conv(Yφ̃_X). We will prove the contrapositive, f ∉ conv(Yφ̃_X) ⇒ ‖f‖K > |ρK|.

Since Δn is compact and convex, conv(Yφ̃_X) is closed and convex. Therefore if f ∉ conv(Yφ̃_X), then there exists g ∈ F_K with ‖g‖K = 1 that separates f and conv(Yφ̃_X), i.e. for all p ∈ Δn,

 ⟨g, f⟩K < 0  and  ⟨g, ⟨Yφ̃_X, p⟩⟩K ≥ 0,  i.e.
 ⟨g, f⟩K < inf_{p∈Δn} ⟨g, ⟨Yφ̃_X, p⟩⟩K ≤ sup_{‖h‖K=1} inf_{p∈Δn} ⟨h, ⟨Yφ̃_X, p⟩⟩K = ρK.

 Since ρK < 0,  |ρK| < |⟨f, g⟩K| ≤ ‖f‖K ‖g‖K = ‖f‖K. ∎

## 6 Kernelized Primal-Dual Algorithms

The preceding theorems allow us to write a variant of the Normalized Von-Neumann algorithm from the previous section that is smoothed and works for RKHSs. Define

 W := {p ∈ Δn : ‖p‖_G = 0}

as the set of witnesses to the infeasibility of the primal. The following lemma bounds the distance of any point in the simplex from the witness set by its G-norm.

###### Lemma 7.

For all q ∈ Δn, the distance to the witness set satisfies

 dist(q, W) := min_{w∈W} ‖q − w‖2 ≤ min{√2, √2‖q‖_G/|ρK|}.

As a consequence, q ∈ W iff ‖q‖_G = 0.

###### Proof.

This is trivial for q ∈ W. For arbitrary q = p ∈ Δn with ‖p‖_G > 0, let f̃ = −(|ρK|/‖p‖_G)⟨Yφ̃_X, p⟩, so that ‖f̃‖K = |ρK|.

Hence by Theorem 3, there exists α ∈ Δn such that

 ⟨Yφ̃_X, α⟩ = f̃ = −(|ρK|/‖p‖_G)⟨Yφ̃_X, p⟩.

Let β = λα + (1−λ)p, where λ = ‖p‖_G/(‖p‖_G + |ρK|). Then

 ⟨Yφ̃_X, β⟩ = (1/(‖p‖_G + |ρK|)) ⟨Yφ̃_X, ‖p‖_G α + |ρK| p⟩ = 0,

so β ∈ W (by definition of what it means to be in W), and

 ‖p − β‖2 = λ‖p − α‖2 ≤ λ√2 ≤ min{√2, √2‖p‖_G/|ρK|}.

(Strictly speaking, we should take f̃ with norm slightly smaller than |ρK|, because the supremum defining δmax might not be attained.) ∎

Hence, for the primal or dual problem, points with small G-norm are revealing - either the problem is feasible and Lemma 3 shows that the margin is small, or it is infeasible and the above lemma shows that the point is close to the witness set.

We need a small alteration to the entropy prox-function used for smoothing earlier. We will now use

 d_q(p) = ½‖p − q‖²2

for some given q ∈ Δn, which is strongly convex with respect to the ℓ2 norm. This allows us to define

 p^q_μ(α) = argmin_{p∈Δn} ⟨Gα, p⟩ + (μ/2)‖p − q‖²2,
 L^q_μ(α) = sup_{p∈Δn} {−⟨α, p⟩_G − μ d_q(p)} + ½‖α‖²_G.