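We are interested in the problem of finding a non-linear separator for a given set of $n$ points $x_1, \dots, x_n \in \mathbb{R}^d$ with labels $y_1, \dots, y_n \in \{\pm 1\}$. Finding a linear separator can be stated as the problem of finding a unit vector $w \in \mathbb{R}^d$ (if one exists) such that for all $i$
$$y_i (w^\top x_i) > 0, \quad \text{i.e.} \quad \mathrm{sign}(w^\top x_i) = y_i.$$
This is called the primal problem. In the more interesting non-linear setting, we will be searching for functions $f$ in a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{F}_K$ associated with kernel $K$ (to be defined later) such that for all $i$
$$y_i f(x_i) > 0.$$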
True to the paper’s title, margins of non-linear separators in an RKHS will be a central concept. We will derive smoothed, accelerated variants of the Perceptron algorithm whose convergence rates (for the aforementioned primal and a dual problem introduced later) are inversely proportional to the RKHS-margin, as opposed to the inverse squared margin for the Perceptron.
The linear setting is well known by the name of linear feasibility problems - we are asking if there exists any vector $w$ which makes an acute angle with all the vectors $y_i x_i$, i.e.
$$(XY)^\top w > 0_n,$$
where $Y := \mathrm{diag}(y)$ and $X := [x_1, \dots, x_n]$. This can be seen as finding a vector $w$ inside the dual cone of $\mathrm{cone}\{y_i x_i\}$.
When normalized, as we will see in the next section, the margin is a well-studied notion of conditioning for these problems. It can be thought of as the width of the feasibility cone as in Freund and Vera (1999), a radius of well-posedness as in Cheung and Cucker (2001), and its inverse can be seen as a special case of a condition number defined by Renegar (1995) for these systems.
1.1 Related Work
In this paper we focus on the famous Perceptron algorithm from Rosenblatt (1958) and the less-famous Von-Neumann algorithm from Dantzig (1992) that we introduce in later sections. As mentioned by Epelman and Freund (2000), in a technical report by the same name, Nesterov pointed out in a note to the authors that the latter is a special case of the now-popular Frank-Wolfe algorithm.
Our work builds on Soheili and Peña (2012, 2013b) from the field of optimization - we generalize the setting to learning functions in RKHSs, extend the algorithms, simplify proofs, and simultaneously bring new perspectives to it. There is extensive literature around the Perceptron algorithm in the learning community; we restrict ourselves to discussing only a few directly related papers, in order to point out the several differences from existing work.
We provide a general unified proof in the Appendix which borrows ideas from accelerated smoothing methods developed by Nesterov (2005) - while this algorithm and others by Nemirovski (2004), Saha et al. (2011) can achieve similar rates for the same problem, those algorithms do not possess the simplicity of the Perceptron or Von-Neumann algorithms and our variants, and also don’t look at the infeasible setting or primal-dual algorithms.
Accelerated smoothing techniques have also been seen in the learning literature, as in Tseng (2008) and many others. However, most of these deal with convex-concave problems where both sets involved are the probability simplex (as in game theory, boosting, etc.), while we deal with hard margins where one of the sets is a unit ball. Hence, their algorithms and results do not extend trivially to our setting. This work is also connected to the idea of $\epsilon$-coresets by Clarkson (2010), though we will not explore that angle.
A related algorithm is the Winnow of Littlestone (1991) - it works on the $\ell_1$ margin and is a saddle point problem over two simplices. One can ask whether such accelerated smoothed versions exist for the Winnow. The answer is in the affirmative - however, such algorithms look completely different from the Winnow, while in our setting the new algorithms retain the simplicity of the Perceptron.
1.2 Paper Outline
Sec.2 will introduce the Perceptron and Normalized Perceptron algorithms and their convergence guarantees for linear separability, with specific emphasis on the unnormalized and normalized margins. Sec.3 will then introduce RKHSs and the Normalized Kernel Perceptron algorithm, which we interpret as a subgradient algorithm for a regularized normalized hard-margin loss function.
Sec.4 describes the Smoothed Normalized Kernel Perceptron algorithm that works with a smooth approximation to the original loss function, and outlines the argument for its faster convergence rate. Sec.5 discusses the non-separable case and the Von-Neumann algorithm, and we prove a version of Gordan’s theorem in RKHSs.
We finally give an algorithm in Sec.6 which terminates with a separator if one exists, and with a dual certificate of near-infeasibility otherwise, in time inversely proportional to the margin. Sec.7 has a discussion and some open problems.
2 Linear Feasibility Problems
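The classical Perceptron algorithm can be stated in many ways; one is the following form: initialize $w_0 := 0$ and, at each step $k = 0, 1, 2, \dots$, if $\mathrm{sign}(w_k^\top x_i) \neq y_i$ for some $i$, update $w_{k+1} := w_k + y_i x_i$; otherwise halt and return $w_k$.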
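As a concrete illustration, the following is a minimal Python sketch of the classical Perceptron, assuming the standard additive update $w \leftarrow w + y_i x_i$ on any misclassified point. The function name `perceptron`, the iteration cap `max_iters`, and the convention that a zero margin counts as a mistake are assumptions made for this sketch.

```python
def perceptron(X, y, max_iters=10_000):
    """Classical Perceptron sketch.

    X: list of points (tuples of floats); y: list of labels in {-1, +1}.
    Returns a separating vector w, or None if none is found within max_iters.
    """
    d = len(X[0])
    w = [0.0] * d
    for _ in range(max_iters):
        # Pick any misclassified point (assumption: zero margin is a mistake).
        mistake = next(
            (i for i, (x, yi) in enumerate(zip(X, y))
             if yi * sum(wj * xj for wj, xj in zip(w, x)) <= 0),
            None,
        )
        if mistake is None:
            return w  # every point has strictly positive margin
        # Additive update with the misclassified point.
        w = [wj + y[mistake] * xj for wj, xj in zip(w, X[mistake])]
    return None
```

On separable data the loop halts with a vector that classifies every point with strictly positive margin; on non-separable data this sketch simply gives up after `max_iters` updates.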
The algorithm works when updated with any arbitrary point $(x_i, y_i)$ that is misclassified; it has the same guarantees when $w$ is updated with the point that is misclassified by the largest amount, $\arg\min_i y_i w^\top x_i$. Alternately, one can define the probability distribution over examples
$$p(w) := \arg\min_{p \in \Delta_n} \langle XYp, w \rangle,$$
where $\Delta_n$ is the $n$-dimensional probability simplex.
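Intuitively, $p(w)$ picks the examples that have the lowest margin when classified by $w$. One can also normalize the updates so that we maintain a probability distribution over the examples used for updates from the start, as in the Normalized Perceptron: initialize $w_0 := 0$; at each step $k$, exit with $w_k$ if $Y X^\top w_k > 0_n$, and otherwise set $\theta_k := 1/(k+1)$ and $w_{k+1} := (1 - \theta_k) w_k + \theta_k XY p(w_k)$.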
Normalized Perceptron has the same guarantees as perceptron - the Perceptron can perform its update online on any misclassified point, while the Normalized Perceptron performs updates on the most misclassified point(s), and yet there does not seem to be any change in performance. However, we will soon see that the ability to see all the examples at once gives us much more power.
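The Normalized Perceptron can be sketched as follows, under the assumption that the update is the convex combination $w_{k+1} = (1-\theta_k) w_k + \theta_k XY p(w_k)$ with $\theta_k = 1/(k+1)$, and that $p(w_k)$ is taken to be a vertex of the simplex (a single worst-classified point). The name `normalized_perceptron` and the cap `max_iters` are hypothetical.

```python
def normalized_perceptron(X, y, max_iters=10_000):
    """Normalized Perceptron sketch: w stays a convex combination of the y_i x_i.

    Returns (w, k) where k is the number of updates performed,
    or (None, max_iters) if no separator was found.
    """
    d = len(X[0])
    w = [0.0] * d
    for k in range(max_iters):
        margins = [yi * sum(wj * xj for wj, xj in zip(w, x))
                   for x, yi in zip(X, y)]
        if min(margins) > 0:
            return w, k
        # p(w) puts all its mass on a worst-classified point j (ties: first index).
        j = margins.index(min(margins))
        theta = 1.0 / (k + 1)
        w = [(1.0 - theta) * wj + theta * y[j] * xj
             for wj, xj in zip(w, X[j])]
    return None, max_iters
```

Unlike the classical version, each step needs the margins of all points, which is the "seeing all the examples at once" that the smoothed variants exploit.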
2.2 Normalized Margins
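If we normalize the data points by the $\ell_2$ norm, the resulting mistake bound of the perceptron algorithm is slightly different. Let $X_2$ represent the matrix with columns $x_i / \|x_i\|_2$. Define the unnormalized and normalized margins as
$$\rho := \sup_{\|w\|_2 = 1} \inf_{p \in \Delta_n} \langle XYp, w \rangle, \qquad \rho_2 := \sup_{\|w\|_2 = 1} \inf_{p \in \Delta_n} \langle X_2 Yp, w \rangle.$$
Note that we have $\sup_{\|w\|_2 = 1}$ in the definition; this is equivalent to $\sup_{\|w\|_2 \leq 1}$ iff $\rho_2 > 0$.
Normalized Perceptron has the following guarantee on $X = X_2$: if $\rho_2 > 0$, then it finds a perfect separator in $1/\rho_2^2$ iterations.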
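Consider the max-margin separator $u^*$ for $X$ (which is also a valid perfect separator for $X_2$). Then
$$\rho_2 \geq \min_i \frac{y_i x_i^\top u^*}{\|x_i\|_2} \geq \frac{\min_i y_i x_i^\top u^*}{\max_i \|x_i\|_2} = \frac{\rho}{\max_i \|x_i\|_2}, \qquad \text{so} \qquad \frac{1}{\rho_2^2} \leq \frac{\max_i \|x_i\|_2^2}{\rho^2}.$$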
Hence, it is always better to normalize the data as pointed out in Graepel et al. (2001). This idea extends to RKHSs, motivating the normalized Gram matrix considered later.
Example Consider a simple example in . Assume that points are located along the line , and the points along , for , where . The max-margin linear separator will be . If all the data were normalized to have unit Euclidean norm, then all the points would all be at and all the points at , giving us a normalized margin of . Unnormalized, the margin is and . Hence, in terms of bounds, we get a discrepancy of , which can be arbitrarily large.
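The effect of normalization on the two iteration bounds can be checked numerically. The sketch below is an illustration under stated assumptions: the two directions $(0.6, 0.8)$ and $(0.8, 0.6)$, the norms $1/r$, $1$ and $r$, and the separator $x_1 = x_2$ are hypothetical choices made here, and the margin is evaluated for that fixed separator (assumed max-margin by symmetry) rather than by solving the max-margin problem.

```python
import math


def margin_of(points, labels, w):
    """Smallest normalized score y_i <w, x_i> / ||w||_2 over the data."""
    norm_w = math.hypot(w[0], w[1])
    return min(y * (w[0] * x[0] + w[1] * x[1]) / norm_w
               for x, y in zip(points, labels))


def bound_ratio(r):
    """Ratio of the unnormalized bound R^2 / rho^2 to the normalized bound 1 / rho_2^2."""
    # Hypothetical data: positives along (0.6, 0.8), negatives along (0.8, 0.6),
    # each at norms 1/r, 1 and r.
    directions = {+1: (0.6, 0.8), -1: (0.8, 0.6)}
    points, labels = [], []
    for label, (a, b) in directions.items():
        for scale in (1.0 / r, 1.0, r):
            points.append((scale * a, scale * b))
            labels.append(label)
    w = (-1.0, 1.0)  # separator x_1 = x_2 (assumed max-margin by symmetry)
    rho = margin_of(points, labels, w)
    unit_points = [(x[0] / math.hypot(*x), x[1] / math.hypot(*x)) for x in points]
    rho_2 = margin_of(unit_points, labels, w)
    radius = max(math.hypot(*x) for x in points)
    return (radius / rho) ** 2 / (1.0 / rho_2) ** 2
```

Under these assumptions `bound_ratio(r)` grows like `r ** 4`, so spreading the norms of the points inflates the unnormalized bound while leaving the normalized one unchanged.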
Winnow The question arises as to which norm we should normalize by. There is a now-classic algorithm in machine learning, called Winnow by Littlestone (1991) or Multiplicative Weights. It works on a slight transformation of the problem where we only need to search for . It comes with some very well-known guarantees - if there exists a such that , then feasibility is guaranteed in iterations. The appropriate notion of normalized margin here is
where is a matrix with columns . Then, the appropriate iteration bound is . We will return to this $\ell_1$-margin in the discussion section. In the next section, we will normalize by using the kernel appropriately.
3 Kernels and RKHSs
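The theory of Reproducing Kernel Hilbert Spaces (RKHSs) has a rich history; for a detailed introduction, refer to Schölkopf and Smola (2002). Let $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ be a symmetric positive definite kernel, giving rise to a Reproducing Kernel Hilbert Space $\mathcal{F}_K$ with an associated feature mapping at each point $x \in \mathbb{R}^d$ called $\phi_x \in \mathcal{F}_K$, where $\phi_x(\cdot) = K(x, \cdot)$, i.e. $\phi_x(z) = K(x, z)$. $\mathcal{F}_K$ has an associated inner product $\langle \phi_u, \phi_v \rangle_K = K(u, v)$. For any $f \in \mathcal{F}_K$, we have $f(x) = \langle f, \phi_x \rangle_K$.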
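Define the normalized feature map
$$\tilde\phi_x := \frac{\phi_x}{\sqrt{K(x,x)}} \in \mathcal{F}_K, \qquad \tilde\phi_X := [\tilde\phi_{x_i}]_{i=1}^n.$$
For any function $f \in \mathcal{F}_K$, we use the following notation
$$Y\tilde f(X) := \langle f, \tilde\phi_X Y \rangle_K = \big[\, y_i \langle f, \tilde\phi_{x_i} \rangle_K \big]_{i=1}^n = \left[ \frac{y_i f(x_i)}{\sqrt{K(x_i,x_i)}} \right]_{i=1}^n.$$
We analogously define the normalized margin here to be
$$\rho_K := \sup_{\|f\|_K = 1} \inf_{p \in \Delta_n} \langle Y\tilde f(X), p \rangle. \tag{5}$$
Consider the following regularized empirical loss function
$$L(f) := \sup_{p \in \Delta_n} \{ -\langle Y\tilde f(X), p \rangle \} + \tfrac{1}{2} \|f\|_K^2. \tag{6}$$
Denoting $t := \|f\|_K > 0$ and writing $f = t \bar f$ with $\|\bar f\|_K = 1$, let us calculate the minimum value of this function
$$\inf_{f \in \mathcal{F}_K} L(f) = \inf_{\|\bar f\|_K = 1} \inf_{t > 0} \Big\{ \sup_{p \in \Delta_n} \{ -t \langle \bar f, \tilde\phi_X Y p \rangle_K \} + \tfrac{1}{2} t^2 \Big\} = \inf_{t > 0} \big\{ -t \rho_K + \tfrac{1}{2} t^2 \big\} = -\tfrac{1}{2} \rho_K^2 \quad \text{when } t = \rho_K > 0. \tag{7}$$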
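Since $\sup_{p \in \Delta_n} \{ -\langle Y\tilde f(X), p \rangle \}$ is some empirical loss function on the data and $\tfrac{1}{2}\|f\|_K^2$ is an increasing function of $\|f\|_K$, the Representer Theorem (Schölkopf et al., 2001) implies that the minimizer of the above function lies in the span of the $\phi_{x_i}$s (also the span of the $y_i \tilde\phi_{x_i}$s). Explicitly, for some $\alpha \in \mathbb{R}^n$,
$$\arg\min_{f \in \mathcal{F}_K} L(f) = \sum_{i=1}^n \alpha_i y_i \tilde\phi_{x_i} = \langle \tilde\phi_X Y, \alpha \rangle. \tag{8}$$
Substituting this back into Eq.(6), we can define
$$L(\alpha) := \sup_{p \in \Delta_n} \{ -\langle \alpha, p \rangle_G \} + \tfrac{1}{2} \|\alpha\|_G^2, \tag{9}$$
where $G$ is a normalized signed Gram matrix with $G_{ii} = 1$,
$$G_{ji} = G_{ij} := \frac{y_i y_j K(x_i, x_j)}{\sqrt{K(x_i,x_i) K(x_j,x_j)}} = \langle y_i \tilde\phi_{x_i}, y_j \tilde\phi_{x_j} \rangle_K,$$
and $\langle p, \alpha \rangle_G := p^\top G \alpha$, $\|\alpha\|_G := \sqrt{\alpha^\top G \alpha}$. One can verify that $G$ is a PSD matrix and the G-norm is a semi-norm, whose properties are of great importance to us.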
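For concreteness, here is a small sketch of how a normalized signed Gram matrix and the associated G-norm might be computed. It assumes the entries are $G_{ij} = y_i y_j K(x_i,x_j) / \sqrt{K(x_i,x_i) K(x_j,x_j)}$ and uses a Gaussian kernel purely as an example; the names `gaussian_kernel`, `signed_normalized_gram`, `g_norm` and the `bandwidth` parameter are hypothetical.

```python
import math


def gaussian_kernel(u, v, bandwidth=1.0):
    """Example kernel (an assumption; any symmetric PSD kernel would do)."""
    squared_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-squared_dist / (2.0 * bandwidth ** 2))


def signed_normalized_gram(X, y, kernel=gaussian_kernel):
    """G[i][j] = y_i y_j K(x_i, x_j) / sqrt(K(x_i, x_i) K(x_j, x_j))."""
    n = len(X)
    diag = [kernel(x, x) for x in X]
    return [[y[i] * y[j] * kernel(X[i], X[j]) / math.sqrt(diag[i] * diag[j])
             for j in range(n)] for i in range(n)]


def g_norm(G, alpha):
    """Semi-norm sqrt(alpha^T G alpha)."""
    n = len(G)
    quad = sum(alpha[i] * G[i][j] * alpha[j]
               for i in range(n) for j in range(n))
    return math.sqrt(max(quad, 0.0))  # clip tiny negative round-off
```

Because every entry satisfies $|G_{ij}| \leq 1$ with unit diagonal, only $G$ (never the feature map itself) is needed by the algorithms that follow.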
3.1 Some Interesting and Useful Lemmas
The first lemma justifies our algorithms’ exit condition.
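Lemma 1. $L(\alpha) < 0$ implies $G\alpha > 0$, and there exists a perfect classifier iff $G\alpha > 0$ for some $\alpha \in \mathbb{R}^n$.
Proof. $L(\alpha) < 0 \Rightarrow \sup_{p \in \Delta_n} \{ -\langle \alpha, p \rangle_G \} < 0 \Leftrightarrow G\alpha > 0$. The function $f_\alpha := \langle \alpha, \tilde\phi_X Y \rangle$ is perfect since
$$\frac{y_j f_\alpha(x_j)}{\sqrt{K(x_j,x_j)}} = \sum_{i=1}^n \alpha_i \frac{y_i y_j K(x_i,x_j)}{\sqrt{K(x_i,x_i) K(x_j,x_j)}} = G_j^\top \alpha > 0.$$
If a perfect classifier exists, then $\rho_K > 0$ by definition and
$$L(f^*) = L(\alpha^*) = -\tfrac{1}{2}\rho_K^2 < 0 \ \Rightarrow\ G\alpha^* > 0,$$
where $f^*, \alpha^*$ are the optimizers of $L(f), L(\alpha)$. ∎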
The second lemma bounds the G-norm of vectors.
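Lemma 2. For any $\alpha \in \mathbb{R}^n$, $\|\alpha\|_G \leq \|\alpha\|_1 \leq \sqrt{n}\,\|\alpha\|_2$.
Proof. Using the triangle inequality of norms, we get
$$\|\alpha\|_G = \Big\| \sum_{i} \alpha_i y_i \tilde\phi_{x_i} \Big\|_K \leq \sum_{i} |\alpha_i| \, \|\tilde\phi_{x_i}\|_K = \|\alpha\|_1 \leq \sqrt{n}\,\|\alpha\|_2,$$
where we used $\langle \phi_{x_i}, \phi_{x_i} \rangle_K = K(x_i, x_i)$, so that $\|\tilde\phi_{x_i}\|_K = 1$. ∎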
The third lemma gives a new perspective on the margin.
When , maximizes the margin iff optimizes . Hence, the margin is equivalently
Let be any function with that achieves the max-margin . Then, it is easy to plug into Eq. (6) and verify that and hence minimizes .
Similarly, let be any function that minimizes , i.e. achieves the value . Defining , and examining Eq. (7), we see that cannot achieve the value unless and which means that must achieve the max-margin.
Hence considering only is acceptable for both. Plugging this into Eq. (5) gives the equality and
(can also be seen by going back to function space). ∎
4 Smoothed Normalized Kernel Perceptron
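Define the distribution over the worst-classified points
$$p(f) := \arg\min_{p \in \Delta_n} \langle Y\tilde f(X), p \rangle \quad \text{or} \quad p(\alpha) := \arg\min_{p \in \Delta_n} \langle \alpha, p \rangle_G. \tag{10}$$
The update $\alpha_{k+1} := (1 - \theta_k)\alpha_k + \theta_k\, p(\alpha_k)$ with $\theta_k := 1/(k+1)$ corresponds implicitly to
$$f_{k+1} = (1 - \theta_k) f_k + \theta_k \langle \tilde\phi_X Y, p(f_k) \rangle = f_k - \theta_k \big( f_k - \langle \tilde\phi_X Y, p(f_k) \rangle \big) = f_k - \theta_k\, \partial L(f_k),$$
and hence the Normalized Kernel Perceptron (NKP) is a subgradient algorithm to minimize $L(f)$ from Eq. (6).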
Remark. Lemma 3 yields deep insights. Since NKP can get arbitrarily close to the minimizer of the strongly convex $L(f)$, it also gets arbitrarily close to a margin maximizer. It is known that it finds a perfect classifier in $1/\rho_K^2$ iterations - we now additionally infer that it will continue to improve to find an approximate max-margin classifier. While both classical and normalized Perceptrons find perfect classifiers in the same time, the latter is guaranteed to improve.
Remark. $\alpha^*$ is always a probability distribution. Curiously, a guarantee that the solution will lie in $\Delta_n$ is not made by the Representer Theorem in Eq. (8) - any $\alpha \in \mathbb{R}^n$ could satisfy Lemma 1. However, since NKP is a subgradient method for minimizing Eq. (6), we know that we will approach the optimum while only choosing $\alpha \in \Delta_n$.
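A minimal sketch of the Normalized Kernel Perceptron operating directly on a signed normalized Gram matrix is given below. It assumes the exit condition $G\alpha > 0$, the step size $\theta_k = 1/(k+1)$, and that $p(\alpha)$ is taken as a vertex of the simplex; the name `nkp` and the cap `max_iters` are hypothetical.

```python
def nkp(G, max_iters=100_000):
    """Normalized Kernel Perceptron sketch on a signed normalized Gram matrix G.

    Returns (alpha, k) with G @ alpha > 0 componentwise,
    or (None, max_iters) if the exit condition was never met.
    """
    n = len(G)
    alpha = [0.0] * n
    for k in range(max_iters):
        g_alpha = [sum(G[i][j] * alpha[j] for j in range(n)) for i in range(n)]
        if min(g_alpha) > 0:
            return alpha, k
        # p(alpha): all mass on a worst-classified point (ties: first index).
        j = g_alpha.index(min(g_alpha))
        theta = 1.0 / (k + 1)
        alpha = [(1.0 - theta) * a for a in alpha]
        alpha[j] += theta
    return None, max_iters
```

After the first update the iterate is a convex combination of simplex vertices, so in this sketch `alpha` stays in the probability simplex, in line with the remark above.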
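Define the smooth minimizer analogous to Eq. (10) as
$$p_\mu(\alpha) := \arg\min_{p \in \Delta_n} \{ \langle \alpha, p \rangle_G + \mu\, d(p) \} = \frac{e^{-G\alpha/\mu}}{\| e^{-G\alpha/\mu} \|_1}, \qquad \text{where} \quad d(p) := \sum_i p_i \log p_i + \log n$$
is $1$-strongly convex with respect to the $\ell_1$-norm (Nesterov, 2005).
Define a smoothened loss function as in Eq. (9)
$$L_\mu(\alpha) := \sup_{p \in \Delta_n} \{ -\langle \alpha, p \rangle_G - \mu\, d(p) \} + \tfrac{1}{2} \|\alpha\|_G^2.$$
Note that the maximizer above is precisely $p_\mu(\alpha)$.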
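With an entropy prox-function, the smooth minimizer has a softmax form, $p_\mu(\alpha) \propto \exp(-G\alpha/\mu)$. The sketch below computes it under that assumption, using the usual max-shift for numerical stability; the name `smooth_minimizer` is hypothetical.

```python
import math


def smooth_minimizer(G, alpha, mu):
    """Softmax of -G @ alpha / mu: the entropy-smoothed version of p(alpha)."""
    n = len(G)
    g_alpha = [sum(G[i][j] * alpha[j] for j in range(n)) for i in range(n)]
    scores = [-v / mu for v in g_alpha]
    shift = max(scores)  # subtract the max so exp() cannot overflow
    weights = [math.exp(s - shift) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]
```

As $\mu \to 0$ the output concentrates on the worst-classified points, recovering the non-smooth $p(\alpha)$; as $\mu$ grows it approaches the uniform distribution.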
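Lemma 4 (Lower Bound).
At any step $k$, we have
$$L_{\mu_k}(\alpha_k) \geq L(\alpha_k) - \mu_k \log n.$$
Proof. First note that $\sup_{p \in \Delta_n} d(p) = \log n$. Also,
$$L_\mu(\alpha) \geq \sup_{p \in \Delta_n} \{ -\langle \alpha, p \rangle_G \} - \sup_{p \in \Delta_n} \mu\, d(p) + \tfrac{1}{2} \|\alpha\|_G^2.$$
Combining these two facts gives us the result. ∎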
Lemma 5 (Upper Bound).
In any round , SNKP satisfies
Finally, we combine the above lemmas to get the following theorem about the performance of SNKP.
The SNKP algorithm finds a perfect classifier $f \in \mathcal{F}_K$ when one exists in $O(\sqrt{\log n}/\rho_K)$ iterations.
5 Infeasible Problems
What happens when the points are not separable by any function $f \in \mathcal{F}_K$? We would like an algorithm that terminates with a solution when there is one, and terminates with a certificate of non-separability if there isn’t one. The idea is based on theorems of the alternative like Farkas’ Lemma, specifically a version of Gordan’s theorem (Chvatal, 1983):
Lemma 6 (Gordan’s Thm).
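Exactly one of the following two statements can be true:
Either there exists a $w \in \mathbb{R}^d$ such that for all $i$, $y_i (w^\top x_i) > 0$,
Or, there exists a $p \in \Delta_n$ such that $\|XYp\|_2 = 0$, or equivalently $\sum_i p_i y_i x_i = 0$.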
As mentioned in the introduction, the primal problem can be interpreted as finding a vector in the interior of the dual cone of $\mathrm{cone}\{y_i x_i\}$, which is infeasible when the dual cone is flat, i.e. if $\mathrm{cone}\{y_i x_i\}$ is not pointed, which happens when the origin is a convex combination of the $y_i x_i$s.
We will generalize the following algorithm for linear feasibility problems, that can be dated back to Von-Neumann, who mentioned it in a private communication with Dantzig, who later studied it himself (Dantzig, 1992).
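A hedged sketch of a Von-Neumann-style iteration for the linear case is shown below. It assumes the usual form of the method: maintain $p \in \Delta_n$ and $w = XYp$, pick the point $j$ minimizing $y_j x_j^\top w$, and move towards $y_j x_j$ with an exact line search on $\|w\|_2$. The early exit that reports a separator when every score is positive, the tolerance `eps`, the name `von_neumann`, and the status strings are assumptions made for this illustration.

```python
import math


def von_neumann(X, y, eps=1e-6, max_iters=100_000):
    """Von-Neumann-style sketch.

    Returns ("witness", p) if ||XYp||_2 <= eps, ("separator", w) if w has
    positive score on every y_i x_i, or ("undecided", p) after max_iters.
    """
    n, d = len(X), len(X[0])
    A = [[y[i] * v for v in X[i]] for i in range(n)]  # rows are y_i x_i
    p = [1.0 / n] * n
    w = [sum(p[i] * A[i][c] for i in range(n)) for c in range(d)]
    for _ in range(max_iters):
        if math.sqrt(sum(v * v for v in w)) <= eps:
            return "witness", p
        scores = [sum(a * v for a, v in zip(A[i], w)) for i in range(n)]
        j = scores.index(min(scores))
        if scores[j] > 0:
            return "separator", w  # assumption: stop once w separates the data
        # Exact line search: minimize ||(1 - lam) w + lam a_j||_2 over lam in [0, 1].
        diff = [wv - av for wv, av in zip(w, A[j])]
        denom = sum(v * v for v in diff)  # nonzero here since scores[j] <= 0 < ||w||^2
        lam = sum(wv * dv for wv, dv in zip(w, diff)) / denom
        lam = min(1.0, max(0.0, lam))
        p = [(1.0 - lam) * pi for pi in p]
        p[j] += lam
        w = [(1.0 - lam) * wv + lam * av for wv, av in zip(w, A[j])]
    return "undecided", p
```

The returned `p` in the `"witness"` case is an approximate certificate in the sense of Gordan's theorem: a distribution whose weighted combination of the $y_i x_i$ is (nearly) the origin.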
We derive a smoothed variant of this Normalized Von-Neumann (NVN) algorithm in the next section, after we prove some crucial lemmas in RKHSs.
5.1 A Separation Theorem for RKHSs
While finite dimensional Euclidean spaces come with strong separation guarantees that come under various names like the separating hyperplane theorem, Gordan’s theorem, Farkas’ lemma, etc, the story isn’t always the same for infinite dimensional function spaces which can often be tricky to deal with. We will prove an appropriate version of such a theorem that will be useful in our setting.
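What follows is an interesting version of the Hahn-Banach separation theorem, which looks a lot like Gordan’s theorem in finite dimensional spaces. The conditions to note here are that either $G\alpha > 0$ or $\|p\|_G = 0$.
Exactly one of the following has a solution:
(1) Either $\exists f \in \mathcal{F}_K$ such that for all $i$, $\frac{y_i f(x_i)}{\sqrt{K(x_i,x_i)}} = \langle f, y_i \tilde\phi_{x_i} \rangle_K > 0$, i.e. $G\alpha > 0$,
(2) Or $\exists p \in \Delta_n$ such that $\sum_i p_i y_i \tilde\phi_{x_i} = 0 \in \mathcal{F}_K$, i.e. $\|p\|_G = 0$.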
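Consider the following set
$$Q := \Big\{ (f, t) = \Big( \sum_i p_i y_i \tilde\phi_{x_i},\ \sum_i p_i \Big) : p \in \Delta_n \Big\} = \mathrm{conv}\big\{ (y_1 \tilde\phi_{x_1}, 1), \dots, (y_n \tilde\phi_{x_n}, 1) \big\} \subseteq \mathcal{F}_K \times \mathbb{R}.$$
If (2) does not hold, then it implies that $(0, 1) \notin Q$. Since $Q$ is closed and convex, we can find a separating hyperplane between $Q$ and $(0, 1)$, or in other words there exists $(f, t) \in \mathcal{F}_K \times \mathbb{R}$ such that
$$\langle f, g \rangle_K + t s \geq 0 \ \ \forall (g, s) \in Q \qquad \text{and} \qquad \langle f, 0 \rangle_K + t \cdot 1 < 0.$$
The second condition immediately yields $t < 0$. The first condition, when applied to $(g, s) = (y_i \tilde\phi_{x_i}, 1) \in Q$, yields
$$\langle f, y_i \tilde\phi_{x_i} \rangle_K + t \geq 0 \ \Rightarrow\ \frac{y_i f(x_i)}{\sqrt{K(x_i,x_i)}} \geq -t > 0$$
since $t < 0$, which shows that (1) holds.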
It is also immediate that if (2) holds, then (1) cannot. ∎
Note that $G$ is positive semi-definite - infeasibility requires both that it is not positive definite, and also that the witness to $p^\top G p = 0$ must be a probability vector. Similarly, while it suffices that $G\alpha > 0$ for some $\alpha \in \mathbb{R}^n$, coincidentally in our case $\alpha$ will also lie in the probability simplex.
5.2 The infeasible margin
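So what is $\rho_K$ when the problem is infeasible? Let
$$\mathrm{conv}(\tilde\phi_X Y) := \Big\{ \sum_i p_i y_i \tilde\phi_{x_i} : p \in \Delta_n \Big\} \subset \mathcal{F}_K$$
be the convex hull of the $y_i \tilde\phi_{x_i}$s.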
When the primal is infeasible, the margin (see Footnote 1) is
$$|\rho_K| = \delta_{\max} := \sup\big\{ \delta : \|f\|_K \leq \delta \Rightarrow f \in \mathrm{conv}(\tilde\phi_X Y) \big\}.$$
Footnote 1: We thank a reviewer for pointing out that by this definition, $\rho_K$ might always be $0$ for infinite dimensional RKHSs because there are always directions perpendicular to the finite-dimensional hull - we conjecture the definition can be altered to restrict attention to the relative interior of the hull, making it non-zero.
(1) For inequality . Choose any such that for any . Given an arbitrary with , put .
By our assumption on , we have implying there exists a such that . Also
Since this holds for a particular , we can infer
Since this holds for any with , we have
(2) For inequality . It suffices to show . We will prove the contrapositive .
Since is compact and convex, is closed and convex. Therefore if , then there exists with that separates and , i.e. for all ,
6 Kernelized Primal-Dual Algorithms
The preceding theorems allow us to write a variant of the Normalized Von-Neumann algorithm from the previous section that is smoothed and works for RKHSs. Define
$$W := \{ p \in \Delta_n : \|p\|_G = 0 \}$$
as the set of witnesses to the infeasibility of the primal. The following lemma bounds the distance of any point in the simplex from the witness set by its $G$-norm.
For all , the distance to the witness set
As a consequence, iff .
This is trivial for . For arbitrary , let so that .
Hence by Theorem 3, there exists such that
Let where . Then
so (by definition of what it means to be in ) and
We take with because might be . ∎
Hence for the primal or dual problem, points with small G-norm are revealing - either Lemma 3 shows that the margin will be small, or if it is infeasible then the above lemma shows that it is close to the witness set.
We need a small alteration to the smoothing entropy prox-function that we used earlier. We will now use
$$d_q(p) := \tfrac{1}{2} \|p - q\|_2^2$$
for some given $q \in \Delta_n$, which is strongly convex with respect to the $\ell_2$ norm. This allows us to define